Transcription API vs Self-Hosted Whisper: When to Choose Which

Honest cost and engineering comparison: running Whisper on your own GPU vs using a transcription API at €0.18/hour. Total cost of ownership, latency, accuracy, when self-hosting actually pays off.

DeepScript Team · 9 June 2026 · 8 min read

OpenAI released Whisper as open source in September 2022. Since then, "just run Whisper yourself" has become the default suggestion any time someone asks about transcription on a developer forum. It's free, the weights are public, and the model is genuinely good.

So why does the transcription API market still exist? Why would anyone pay €0.18 per audio hour when they could spin up a GPU and run Whisper for what looks like nothing?

The honest answer is that "free" is rarely free, and the actual decision is more nuanced than the forum threads suggest. This article walks through the real cost structure of both options, where each one wins, and how to make the choice for your own workload.

We build a transcription API ourselves, so we have skin in the game. We've also operated GPU clusters for transcription at scale, so we know what self-hosting actually costs. The comparison below is the one we wished existed when we started — not a vendor pitch.

The naive comparison and why it's misleading

Most "should I self-host?" debates start with a calculation like this:

Whisper large-v3 runs at roughly 0.1x real-time on an A100, so one hour of audio takes about six minutes of GPU time. An A100 costs about €1.50/hour on RunPod. So 10 hours of audio costs €1.50, which is €0.15 per audio hour, cheaper than any API.

The math is correct. The conclusion isn't.

This calculation ignores everything that turns a model into a service: queueing, retries, error handling, monitoring, model loading time, GPU utilization rate, on-call rotation, scaling, format conversion, language detection, and the engineering hours someone has to spend on all of it. Once you add those in, the picture changes considerably.

Total cost of ownership: a realistic breakdown

Let's price out an actual self-hosted Whisper deployment that can handle, say, 100 hours of audio per day with reliable turnaround.

GPU rental

An A100 80GB on a reputable cloud provider runs €1.50–€2.50 per hour. To process 100 hours of audio per day at 0.1x real-time, you need 10 GPU-hours per day. Sounds cheap.

But GPU utilization isn't 100%. Audio arrives in bursts. Models need to load (15–30 seconds for large-v3). Files need to be downloaded, decoded, resampled. Realistic utilization for an unbatched single-worker deployment is 30–50%. Dividing the 10 GPU-hours of pure compute by that utilization, you now need roughly 20–33 GPU-hours per day to cover your 100 audio-hours. At €2/hour, that's about €40–€67 per day, or €1,200–€2,000 per month.
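The utilization adjustment is easy to get wrong in your head, so here is a minimal sketch of the arithmetic in Python. The function names are ours, not part of any library, and the 30-day month is an assumption; the real-time factor, price, and utilization range are the figures used in this section.

```python
def gpu_hours_per_day(audio_hours: float, rtf: float, utilization: float) -> float:
    """GPU-hours you must rent per day to process `audio_hours` of audio.

    rtf: real-time factor (0.1 means one audio-hour needs 0.1 GPU-hours of compute).
    utilization: fraction of rented GPU time actually spent transcribing.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return audio_hours * rtf / utilization


def monthly_gpu_cost(audio_hours_per_day: float, rtf: float, utilization: float,
                     price_per_gpu_hour: float, days: int = 30) -> float:
    """Monthly rental bill in EUR; `days=30` is an assumption."""
    daily_hours = gpu_hours_per_day(audio_hours_per_day, rtf, utilization)
    return daily_hours * price_per_gpu_hour * days


if __name__ == "__main__":
    # 100 audio-hours/day, 0.1x real-time, EUR 2 per GPU-hour
    for utilization in (1.0, 0.5, 0.3):
        hours = gpu_hours_per_day(100, 0.1, utilization)
        cost = monthly_gpu_cost(100, 0.1, utilization, 2.0)
        print(f"utilization {utilization:.0%}: {hours:.1f} GPU-h/day, EUR {cost:,.0f}/month")
```

The naive forum calculation corresponds to the first row (100% utilization); the other two rows bracket the realistic range.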

Reserved instances or your own hardware change the math but introduce capex and opportunity cost.

Engineering time

This is the line item self-hosters consistently underestimate. A production-grade transcription service requires:

A queueing layer (Redis + BullMQ, RabbitMQ, SQS — pick one and learn it)
A worker that loads Whisper, monitors GPU memory, handles OOM
Format conversion (ffmpeg pipeline for the 20 audio/video formats users actually send)
Voice activity detection so you're not transcribing silence
Language detection (Whisper does this, but you need to decide when to trust it)
Speaker diarization (Whisper doesn't do this — you'll be integrating pyannote)
Word-level timestamps (Whisper outputs them, but they drift; you need WhisperX or similar)
Retry logic for failed jobs
Observability: metrics, logs, alerting on stuck jobs
A status API for clients to poll
File storage with lifecycle policies

Conservatively, getting all of that to production quality is 2–3 months of one senior engineer's time. At loaded cost, that's €30,000–€50,000 just to launch. Ongoing maintenance is roughly 20% of an FTE forever.
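To make two items from the list above concrete (the queue and the retry logic), here is a deliberately small sketch using only the Python standard library. It is not how a production system would look: the in-memory `queue.Queue` stands in for Redis, RabbitMQ, or SQS, and `transcribe` is a hypothetical callable that you would replace with your actual Whisper invocation.

```python
import queue
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Job:
    job_id: str
    audio_path: str
    attempts: int = 0


def run_worker(jobs: queue.Queue, transcribe: Callable[[str], str],
               max_attempts: int = 3, base_delay: float = 0.0):
    """Drain `jobs`, retrying failed jobs with exponential backoff.

    Returns (results, dead_letters): transcripts keyed by job id, plus the
    jobs that still failed after `max_attempts` tries.
    """
    results: dict[str, str] = {}
    dead_letters: list[Job] = []
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            break  # nothing left to process
        job.attempts += 1
        try:
            results[job.job_id] = transcribe(job.audio_path)
        except Exception:
            if job.attempts >= max_attempts:
                dead_letters.append(job)  # give up; surface it for alerting
            else:
                # back off 1x, 2x, 4x ... base_delay, then re-queue the job
                time.sleep(base_delay * 2 ** (job.attempts - 1))
                jobs.put(job)
    return results, dead_letters


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_transcribe(path: str) -> str:
        # Hypothetical stand-in for a model call: fails once, then succeeds.
        calls["count"] += 1
        if calls["count"] == 1:
            raise RuntimeError("simulated out-of-memory error")
        return f"transcript of {path}"

    pending = queue.Queue()
    pending.put(Job("job-1", "meeting.mp3"))
    print(run_worker(pending, flaky_transcribe))
```

Even this toy version leaves out persistence, concurrency, timeouts, and metrics, which is where most of the engineering time in the estimate above goes.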

Accuracy gap

Whisper large-v3 is good, but it's not the state of the art anymore. AssemblyAI Universal-2, Deepgram Nova-3, and proprietary models from established transcription vendors outperform vanilla Whisper on most benchmarks — particularly on:

Speaker diarization (Whisper has none)
Domain-specific vocabulary (medical, legal, financial terminology)
Noisy audio (call-center recordings, field interviews)
Heavy accents and dialects
Word error rate on numbers and proper nouns

Closing this gap on your own means fine-tuning, building custom vocabulary injection, and integrating a separate diarization model. That's another multi-month project.

When the math works for self-hosting

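We've now spent roughly €1,500/month on GPUs, up to €50,000 in one-time engineering, and ongoing maintenance, all to produce worse transcription than a €0.18/hour API. For the same 100 audio-hours per day (about 3,000 per month), the API bill would be about €540.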
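A rough way to put the two options side by side is to amortize the one-time engineering cost and add it to the recurring costs. The sketch below does that in Python. Two inputs are our assumptions rather than figures from this article: the 24-month amortization period, and the €3,000/month maintenance figure (a guess at what 20% of a senior engineer costs). Swap in your own numbers.

```python
from dataclasses import dataclass


@dataclass
class SelfHostedCosts:
    gpu_per_month: float          # EUR, from the GPU rental estimate
    engineering_one_time: float   # EUR, launch engineering effort
    maintenance_per_month: float  # EUR, ASSUMPTION: ~20% of one engineer
    amortization_months: int      # ASSUMPTION: how long the build is written off

    def per_month(self) -> float:
        amortized = self.engineering_one_time / self.amortization_months
        return self.gpu_per_month + amortized + self.maintenance_per_month


def api_cost_per_month(audio_hours: float, price_per_hour: float = 0.18) -> float:
    """API bill in EUR for a month's audio at a flat per-hour price."""
    return audio_hours * price_per_hour


if __name__ == "__main__":
    audio_hours = 100 * 30  # 100 audio-hours per day, 30-day month
    self_hosted = SelfHostedCosts(
        gpu_per_month=1_500,
        engineering_one_time=50_000,
        maintenance_per_month=3_000,  # assumption, see above
        amortization_months=24,       # assumption, see above
    )
    print(f"self-hosted: EUR {self_hosted.per_month():,.0f}/month")
    print(f"API:         EUR {api_cost_per_month(audio_hours):,.0f}/month")
```

The exact gap depends heavily on the two assumed inputs, but at this volume the GPU line alone is already larger than the API bill.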

The math only flips in specific cases:

Very high volume: above roughly 10,000 audio-hours per month, raw compute starts to dominate and APIs become expensive.
On-premise requirement: regulatory or contractual mandates that data physically cannot leave your hardware. Common in defense, intelligence, and some government work.
Air-gapped environments: no internet connectivity at all. Self-hosting is the only option.
Latency-critical real-time: sub-200ms streaming latency for low-volume use cases where API round-trips are the bottleneck.
Research and experimentation: you actually need to modify the model, not just use it.

For everything else, an API is cheaper *and* better. That sounds like a vendor pitch but it's also just true.

Latency: a closer look

Self-hosting wins on latency, but only in narrow conditions.

For batch transcription (file in, transcript out), API latency is dominated by upload time. A 1-hour MP3 is ~50 MB; upload over a typical office connection is 5–15 seconds. Compared to the 5–10 minutes of actual transcription work, the network round-trip is noise. Self-hosting saves you nothing.
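The same point as a back-of-the-envelope calculation in Python. The 40 Mbit/s uplink is an assumed value chosen to land inside the 5–15 second range above; the file size and the 0.1x real-time factor come from this article.

```python
def upload_seconds(size_mb: float, uplink_mbps: float) -> float:
    """Transfer time for a file of `size_mb` megabytes over an uplink in Mbit/s."""
    return size_mb * 8 / uplink_mbps


def transcription_seconds(audio_seconds: float, rtf: float) -> float:
    """Compute time at a given real-time factor (0.1 = ten times faster than real time)."""
    return audio_seconds * rtf


if __name__ == "__main__":
    upload = upload_seconds(size_mb=50, uplink_mbps=40)          # assumed uplink
    work = transcription_seconds(audio_seconds=3600, rtf=0.1)
    share = upload / (upload + work)
    print(f"upload {upload:.0f} s, transcription {work:.0f} s, "
          f"network share {share:.1%}")
```

With these inputs the upload is a few percent of the end-to-end time, so removing it by self-hosting barely changes turnaround.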

For streaming transcription (live audio, partial results), things differ. A round-trip to an API server adds 50–150 ms per chunk. If you need sub-100ms partial results — say, for live captioning during a broadcast — a local model wins.

But streaming Whisper isn't a solved problem. Vanilla Whisper is a batch model; you need significant engineering (WhisperLive, faster-whisper, custom chunking) to get streaming behavior. And modern transcription APIs offer WebSocket streaming with sub-300ms partial latency, which is good enough for almost every interactive use case.

Accuracy: the dirty secret

Whisper is trained on web-scraped audio of variable quality. It's excellent on English read speech, very good on major European languages, and progressively weaker on under-represented languages and accents.

The areas where it consistently underperforms:

DACH dialects: Swiss German, Austrian German, Bavarian — Whisper hallucinates or falls back to Standard German.
Numbers and dates: "vierundzwanzigster Februar zweitausendsechsundzwanzig" (German for "the twenty-fourth of February 2026") gets transcribed correctly maybe 70% of the time.
Company names and product names: "AssemblyAI" becomes "Assembly AI" or "a simply AI" depending on the accent.
Overlapping speech: Whisper drops words when speakers overlap. No diarization, no recovery.
Long silences: Whisper sometimes hallucinates content during silence.

These problems are solvable, but not by running the base model. They require post-processing, custom vocabularies, and either fine-tuning or hybrid pipelines with a diarization model. By the time you've built all of that, you've reproduced what a specialized provider offers as an API call.
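As one small example of the post-processing involved, here is a sketch of a custom-vocabulary correction pass in Python. It only rewrites known mis-transcriptions of a term after the fact, which is far cruder than what vocabulary injection at decoding time can do. The function name and the variant list are illustrative assumptions; in practice you would collect variants from your own error logs.

```python
import re
from typing import Callable


def build_vocabulary_corrector(vocabulary: dict[str, list[str]]) -> Callable[[str], str]:
    """Return a function that rewrites known mis-transcriptions.

    `vocabulary` maps each canonical term to the wrong variants seen in output.
    """
    compiled = []
    for canonical, variants in vocabulary.items():
        # Try longer variants first so multi-word variants are not split up.
        ordered = sorted(variants, key=len, reverse=True)
        alternation = "|".join(re.escape(v) for v in ordered)
        pattern = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
        compiled.append((pattern, canonical))

    def correct(text: str) -> str:
        for pattern, canonical in compiled:
            # A function replacement avoids backslash handling in `canonical`.
            text = pattern.sub(lambda _match, term=canonical: term, text)
        return text

    return correct


if __name__ == "__main__":
    correct = build_vocabulary_corrector(
        {"AssemblyAI": ["Assembly AI", "a simply AI"]}  # illustrative variants
    )
    print(correct("We compared assembly ai with a simply AI last week."))
```

A pass like this handles the easy cases. It does nothing for dialects, overlapping speech, or hallucinated text during silence, which need changes earlier in the pipeline.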

When self-hosting actually wins

We don't want to be one-sided. There are real scenarios where self-hosting is the right answer:

Regulatory air-gap: Your contracts or regulations literally prohibit external transcription. This is rare but absolute when it applies.
Massive volume with predictable load: At 50,000+ audio-hours per month with steady throughput, dedicated GPU infrastructure is cheaper than API pricing — even after engineering costs.
Research workloads: You're modifying the model, training custom variants, or building something Whisper-adjacent. Self-hosting is just part of the job.
Strict latency requirements that APIs can't meet: sub-50ms streaming for specific industrial or broadcast applications.
Established ML team with bandwidth: You already operate GPU infrastructure for other ML work. Adding transcription is marginal cost.

If none of those describe your situation, the API is almost certainly the right call.

What an API should give you that DIY can't easily match

A good transcription API isn't just "Whisper as a service." It bundles a lot of work:

Format flexibility: MP3, WAV, FLAC, OGG, M4A, AAC, MP4, MKV, WebM, MOV all work without you running ffmpeg.
Auto-scaling: bursts of 100 files at once don't queue up for an hour.
Speaker diarization out of the box.
Word-level timestamps that actually line up with the audio.
Custom vocabulary injection without retraining.
Multiple export formats (TXT, SRT, VTT, JSON) without you writing the converters.
An SLA and someone on call when it breaks at 3 AM.

For DeepScript specifically, you also get EU-only data residency, ISO 27001-certified infrastructure, and a GDPR Data Processing Agreement — which solves the compliance problem that self-hosting Whisper on AWS would create anyway.
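To give a sense of the converter work mentioned in the list above, here is a minimal Python sketch that turns timed segments into SRT subtitles. The `Segment` type is our own illustration, not the output type of Whisper or of any particular API; the output follows the usual SRT layout of a counter, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, the text, and a blank line.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3661.5 -> 01:01:01,500."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT blocks separated by blank lines."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        time_line = f"{srt_timestamp(segment.start)} --> {srt_timestamp(segment.end)}"
        blocks.append(f"{index}\n{time_line}\n{segment.text.strip()}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    print(segments_to_srt([
        Segment(0.0, 1.5, "Hello and welcome."),
        Segment(1.5, 3.0, "Let's get started."),
    ]))
```

The converter itself is short. The harder part in practice is deciding where to split long segments and keeping line lengths readable, which this sketch does not attempt.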

The honest decision framework

Ask yourself three questions:

Do I have a regulatory reason that audio cannot leave my infrastructure? If yes, self-host. The decision is made.
Am I processing more than ~10,000 hours per month with predictable load and an existing ML team? If yes, run the numbers — self-hosting might be cheaper.
Anything else? Use an API. Spend the engineering time on your actual product instead of on rebuilding what's already commoditized.
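If it helps to see the three questions as explicit logic, they can be written as a small Python function. The field names are ours, and the 10,000-hour threshold is the rough figure from the second question, not a hard limit.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    must_stay_on_own_infrastructure: bool  # regulatory or contractual requirement
    audio_hours_per_month: float
    predictable_load: bool
    has_ml_team: bool


def recommend(workload: Workload, volume_threshold: float = 10_000) -> str:
    """Apply the three questions in order and return a recommendation."""
    # Question 1: a regulatory reason decides the matter outright.
    if workload.must_stay_on_own_infrastructure:
        return "self-host"
    # Question 2: high, predictable volume plus an existing ML team.
    if (workload.audio_hours_per_month > volume_threshold
            and workload.predictable_load
            and workload.has_ml_team):
        return "run the numbers"
    # Question 3: everything else.
    return "use an API"


if __name__ == "__main__":
    print(recommend(Workload(False, 3_000, True, False)))
```

The order matters: the regulatory check comes first because it overrides any cost argument.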

Whisper is a remarkable open-source release. It made the whole transcription industry better, including ours — we benchmark against it constantly. But the model is only one component of a transcription service. Everything around it is where the real engineering lives, and that's what you're actually paying for when you use an API.

If you want to see how DeepScript stacks up specifically — pricing, accuracy on DACH dialects, latency, the data-residency story — we keep a detailed comparison page with real numbers. No marketing fluff, just the engineering details.
