Questions & answers
Clear answers to common transcription questions
Real questions people ask about automatic transcription — from GDPR and confidentiality to API patterns and Swiss German. Substantive answers, no marketing fluff.
Basics
What is automatic transcription?
Automatic transcription is the AI-driven conversion of spoken audio into written text — in seconds, without a human typist in the loop.
What's the difference between transcription, captions, and subtitles?
A transcript is the plain text of what was said; captions show that text in sync with the video, in the original language; subtitles are synced text that is usually translated into another language.
How long does it take to transcribe one hour of audio?
AI typically transcribes one hour of audio in 1-3 minutes; a skilled human transcriptionist needs 4-6 hours plus review time.
Accuracy & quality
How accurate is AI transcription?
On clean studio audio, modern AI achieves 95-98% accuracy (2-5% word error rate); on noisy, multi-speaker, or accented audio it can drop to 70-85%.
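To make the link between accuracy and word error rate concrete, here is a minimal sketch. It assumes the common definition of WER as (substitutions + deletions + insertions) divided by the number of reference words, and treats accuracy as 1 - WER. The function name `word_error_rate` and the sample sentences are illustrative, not taken from any particular tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and the first j hyp words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from the output
                curr[j - 1] + 1,     # insertion: extra word in the output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "please send the signed contract to our office by friday"
hypothesis = "please send the signed contact to our office by friday"
wer = word_error_rate(reference, hypothesis)
```

One wrong word out of ten gives a WER of 0.1, or 90% accuracy. Because insertions count as errors, WER can exceed 100% on very poor output, so "accuracy = 1 - WER" is a convenient shorthand rather than a strict identity.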
How can I improve transcription accuracy?
Better microphones, less reverb, a custom vocabulary for jargon, a premium model, and a disciplined speaker setup typically lift accuracy by 5-15 percentage points.
Compliance & law
Is AI transcription GDPR-compliant?
Only if the provider offers a Data Processing Agreement under GDPR Art. 28, EU data residency, an explicit no-training clause, and clear deletion timelines. Without these, no.
Is it legal to transcribe a meeting?
In Germany only with the consent of every participant — § 201 StGB criminalizes secretly recording private spoken conversations.
Where are my audio files stored during transcription?
It depends on the provider, and it is the question that matters most. Reputable EU providers store data in German or other EU data centers; US cloud APIs typically store it in the US.
Will my audio be used to train AI models?
With some US providers, yes, unless you actively opt out. Reputable providers contractually exclude it — ask, and read the small print.
Can I have medical or patient conversations transcribed?
Yes, but under strict requirements: GDPR Art. 9 (health data), professional-secrecy laws, a DPA with confidentiality undertakings — and EU data residency.
How-to
How do I properly transcribe an interview?
A clean recording, an AI first pass, and 30-60 minutes of editing per interview hour produce publication-ready transcripts in a fraction of the time manual transcription takes.
How do I add timestamps to a transcript?
Modern AI transcription emits word-level timestamps automatically; for readability, markers every 30-60 seconds or at speaker changes are usually enough.
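Turning word-level timestamps into readable markers can be sketched as follows. The input shape (a list of dicts with `word` and `start` in seconds) is an assumption; real APIs name and nest these fields differently. The helper names `format_marker` and `add_markers` are hypothetical, and speaker changes are left out to keep the sketch short.

```python
def format_marker(seconds: float) -> str:
    """Render a position in the audio as [MM:SS]."""
    total = int(seconds)
    return f"[{total // 60:02d}:{total % 60:02d}]"


def add_markers(words, interval: float = 30.0) -> str:
    """Start a new line with a [MM:SS] marker roughly every `interval` seconds."""
    lines = []
    current = []
    next_mark = 0.0
    for w in words:
        if w["start"] >= next_mark:
            if current:
                lines.append(" ".join(current))
            current = [format_marker(w["start"])]
            # Schedule the next marker relative to the word that triggered this one
            next_mark = w["start"] + interval
        current.append(w["word"])
    if current:
        lines.append(" ".join(current))
    return "\n".join(lines)


words = [
    {"word": "Hello", "start": 0.4},
    {"word": "world", "start": 1.0},
    {"word": "again", "start": 31.2},
]
transcript = add_markers(words, interval=30.0)
```

With the sample input, `transcript` holds two lines: `[00:00] Hello world` and `[00:31] again`. Markers land on word boundaries, so they never split a word.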
Which export format should I use for my transcript?
TXT for reading, SRT for YouTube and LinkedIn, VTT for HTML5 web video, JSON for code and downstream processing — match the format to the use case.
What is custom vocabulary and when do I need it?
A list of jargon and proper nouns that you give the model before transcription. It can lift recognition of those terms from roughly 30% to 95%, so use it whenever your audio contains terms a general model is unlikely to know.
Developers
Which transcription API is best for developers?
It depends on the use case: AssemblyAI for US workflows, Deepgram for low latency, OpenAI Whisper for multilingual audio, DeepScript for GDPR and EU data residency.
Should I use webhooks or polling for a transcription API?
Webhooks for primary delivery, polling as a backup — the most robust production setup. Polling alone wastes requests; webhooks alone risk lost events.
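A webhook-first setup with a polling fallback can be sketched without tying it to any provider. Everything here is hypothetical: the class name `TranscriptTracker`, the `fetch_status` callable (standing in for a provider's "get job" call, returning the transcript or `None` if it is not ready), and the five-minute grace period.

```python
import time
from typing import Callable, Dict, List, Optional


class TranscriptTracker:
    """Webhook-first delivery with a polling fallback for lost events."""

    def __init__(self, fetch_status: Callable[[str], Optional[str]],
                 grace_seconds: float = 300.0) -> None:
        self._fetch_status = fetch_status          # assumed provider lookup
        self._grace = grace_seconds
        self._pending: Dict[str, float] = {}       # job_id -> submit time
        self.results: Dict[str, str] = {}

    def submitted(self, job_id: str, now: Optional[float] = None) -> None:
        self._pending[job_id] = time.monotonic() if now is None else now

    def on_webhook(self, job_id: str, transcript: str) -> None:
        # Idempotent: a retried or duplicate event must not overwrite the result.
        self.results.setdefault(job_id, transcript)
        self._pending.pop(job_id, None)

    def poll_overdue(self, now: Optional[float] = None) -> List[str]:
        """Poll only jobs whose webhook has not arrived within the grace period."""
        now = time.monotonic() if now is None else now
        recovered = []
        for job_id, submitted_at in list(self._pending.items()):
            if now - submitted_at < self._grace:
                continue
            transcript = self._fetch_status(job_id)
            if transcript is not None:
                self.on_webhook(job_id, transcript)
                recovered.append(job_id)
        return recovered
```

Run `poll_overdue` on a schedule (for example once a minute). Jobs whose webhook arrived cost no extra requests, and a lost event is recovered on the first sweep after the grace period.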
Should I self-host Whisper or use an API?
Under ~500 audio hours/month, a managed API almost always wins; above that, self-hosting can be cheaper, but only with GPU experience and a DevOps budget.
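Where the break-even point falls for your own setup is simple arithmetic. The sketch below assumes a fixed monthly cost for self-hosting (GPU rental plus operations time) and a per-hour cost on each side. All figures in the example are made-up placeholders chosen to land on 500 hours, not real prices.

```python
def break_even_hours(fixed_monthly_cost: float,
                     api_price_per_hour: float,
                     selfhost_cost_per_hour: float) -> float:
    """Audio hours per month at which self-hosting and a managed API cost the same."""
    saving_per_hour = api_price_per_hour - selfhost_cost_per_hour
    if saving_per_hour <= 0:
        # Self-hosting is not cheaper per hour, so it never pays off.
        return float("inf")
    return fixed_monthly_cost / saving_per_hour


# Placeholder figures: 500 per month fixed, 1.25 per hour via API, 0.25 per hour self-hosted.
hours = break_even_hours(fixed_monthly_cost=500.0,
                         api_price_per_hour=1.25,
                         selfhost_cost_per_hour=0.25)
```

With these placeholders, `hours` is 500.0. Plug in your own quotes; engineering time is the cost most often underestimated in the fixed part.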
How does live transcription work technically?
Audio is streamed in small chunks over a WebSocket connection; the model returns interim results within 300-800 ms and finalizes them after each speech pause.
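The two halves of that loop, chunking audio on the way out and merging interim and final results on the way back, can be sketched without a network connection. The audio format (16 kHz, 16-bit mono PCM), the 100 ms chunk size, and the `(text, is_final)` message shape are assumptions; real streaming APIs define their own.

```python
from typing import Iterator, List


def pcm_chunks(audio: bytes, sample_rate: int = 16_000,
               bytes_per_sample: int = 2, chunk_ms: int = 100) -> Iterator[bytes]:
    """Split raw mono PCM into fixed-duration chunks ready to send over a socket."""
    size = sample_rate * bytes_per_sample * chunk_ms // 1000
    for offset in range(0, len(audio), size):
        yield audio[offset:offset + size]


class LiveTranscript:
    """Keeps finalized text and the current interim guess separate."""

    def __init__(self) -> None:
        self.final: List[str] = []
        self.interim = ""

    def on_message(self, text: str, is_final: bool) -> None:
        if is_final:
            self.final.append(text)
            self.interim = ""
        else:
            # Interim results replace each other; they are never appended.
            self.interim = text

    def display(self) -> str:
        parts = self.final + ([self.interim] if self.interim else [])
        return " ".join(parts)


one_second = bytes(32_000)              # 1 s of silence at 16 kHz, 16-bit mono
chunks = list(pcm_chunks(one_second))   # ten chunks of 3200 bytes each

live = LiveTranscript()
live.on_message("hel", is_final=False)
live.on_message("hello there", is_final=False)
live.on_message("Hello there.", is_final=True)
live.on_message("how", is_final=False)
```

After these messages, `live.display()` returns `Hello there. how`: the finalized sentence stays fixed while the interim tail keeps changing until the next pause.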
Languages & dialects
Can AI transcribe Swiss German?
Partly: dialect-tuned models reach 75-85% accuracy, general models often below 50%. Output is usually normalized to Standard German, not written in Swiss dialect.
How many languages does AI transcription support?
The best models (Whisper, AssemblyAI, DeepScript) cover 99 languages — but quality ranges from excellent for the top 10 to barely usable for rare languages.