Speaker diarization in every transcription — not just the top tier
Automatic answer to "who said what?" — for meetings, interviews, podcasts and focus groups.
3 free transcriptions · no credit card · data stays in Germany
Speaker timeline
Glad the appointment worked out.
My pleasure. Shall we get started right away?
Yes. I'm recording the conversation, is that OK?
Speaker diarization automatically identifies which speaker is talking at any given moment. At DeepScript it is included in both tiers, Standard and Premium. Many competitors gate diarization behind an enterprise plan or charge for it separately. We think that's wrong: a transcript without speaker attribution is almost worthless for meetings and interviews. The tiers differ in granularity, not availability: Standard reliably handles 2–6 speakers, while Premium scales to 10+ and is more accurate when voices sound alike or speakers overlap.
Proof
Why we can claim this
Included in Standard and Premium
No upgrade trap, no "enterprise tier only". Diarization runs in every transcription.
Word-level granularity
Every single word carries a speaker label — not just whole sentences. Mid-sentence speaker changes are caught.
Typically 2 to 10+ speakers
Even large groups (board meetings, panels, focus groups) are reliably separated.
Renameable in the editor
"Speaker 1" → "Dr Meier" in a single click. The rename is applied across every occurrence.
In practice
What this looks like in practice
- Word timestamps including speaker label in the JSON export — directly usable in NVivo, MAXQDA and other qualitative-analysis tools.
- SRT/VTT subtitles with speaker prefix: every subtitle starts with the speaker name, e.g. "Dr Meier: …"
- Synchronised with the audio player in the editor — click any word to jump to the audio position and hear the original voice.
- Anonymous speakers stay anonymous: you don't have to assign any names at all; "Speaker 1/2/3" is a valid final state.
- The Premium model is better at overlapping speech (cross-talk) and similar-sounding voices (e.g. two young women).
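Because every word in the JSON export carries timestamps and a speaker label, simple per-speaker statistics fall out of a few lines of code. The following is a minimal sketch, not an official client: it assumes the export is a flat list of word objects, and the `word` key and the sample values are made up for illustration. Check a real export before relying on the layout.

```python
import json
from collections import defaultdict

# Hypothetical export. The flat list layout and the `word` key are
# assumptions; the sample values are invented for illustration.
EXPORT = json.loads("""
[
  {"word": "Good",      "start": 0.00, "end": 0.30, "confidence": 0.98, "speaker": "Speaker 1"},
  {"word": "morning.",  "start": 0.30, "end": 0.80, "confidence": 0.97, "speaker": "Speaker 1"},
  {"word": "Hi",        "start": 1.00, "end": 1.20, "confidence": 0.95, "speaker": "Speaker 2"},
  {"word": "everyone.", "start": 1.20, "end": 1.90, "confidence": 0.96, "speaker": "Speaker 2"}
]
""")


def talk_time_by_speaker(words):
    """Sum the duration (in seconds) of every word, grouped by speaker label."""
    totals = defaultdict(float)
    for w in words:
        totals[w["speaker"]] += w["end"] - w["start"]
    # Round to hundredths to hide floating-point noise.
    return {speaker: round(seconds, 2) for speaker, seconds in totals.items()}


print(talk_time_by_speaker(EXPORT))
```

Summing word durations ignores pauses between words, so the totals are a lower bound on how long each person held the floor.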
How to use it
Up and running in a few steps
1. Upload multi-speaker audio
Meeting recording, interview, podcast: any format with multiple speakers works. Mono or stereo, it makes no difference; the model detects speaker changes on its own.
2. Get the transcript back with speaker labels
The result in the editor shows every utterance with a speaker prefix: "Speaker 1: Good morning. Speaker 2: Hi everyone." The total speaker count is in the header.
3. Rename speakers
Click "Speaker 1" and enter the real name. The change propagates automatically to every occurrence in the transcript. No separate voice models required.
4. Export with speaker labels
SRT/VTT for subtitles, JSON for downstream pipelines, TXT for a clean reading version. Speaker information is preserved in every format.
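If you prefer to rename speakers in an exported file rather than in the editor, the propagation in step 3 can be pictured as a simple label mapping. This is an illustrative sketch only, not how the editor is implemented; `rename_speakers` and the word-object layout are hypothetical.

```python
def rename_speakers(words, names):
    """Return a copy of `words` with speaker labels replaced according to `names`.

    Labels that are missing from `names` are kept as they are, so anonymous
    speakers stay anonymous.
    """
    return [{**w, "speaker": names.get(w["speaker"], w["speaker"])} for w in words]


# Hypothetical word objects; only the `speaker` key matters here.
words = [
    {"word": "Good", "speaker": "Speaker 1"},
    {"word": "morning.", "speaker": "Speaker 1"},
    {"word": "Hi", "speaker": "Speaker 2"},
]

renamed = rename_speakers(words, {"Speaker 1": "Dr Meier"})
print([w["speaker"] for w in renamed])
```

The function builds new dictionaries instead of editing in place, so the original labels remain available if a rename has to be undone.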
FAQ
Frequently asked questions
How many speakers can the model distinguish?
Typically 2 to 10+. When you go beyond 10 very similar-sounding voices (e.g. a school class), confusion can occur. For board meetings, panels and focus groups the limit is a non-issue in practice.
What happens with overlapping speech?
On cross-talk the model attributes the dominant voice and marks the section with a low confidence score. Premium is noticeably better here than Standard. In the editor, affected passages are visible via confidence colouring.
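Outside the editor, the same passages can be located from the per-word `confidence` value in the JSON export. The sketch below is a rough illustration under assumptions: the cut-off of 0.6 is arbitrary (the page does not state a threshold), and the flat list of word objects is a guessed layout.

```python
def low_confidence_spans(words, threshold=0.6):
    """Return (start, end) ranges covering runs of consecutive words whose
    confidence is below `threshold`. The default threshold is an assumption."""
    spans = []
    current = None  # [start, end] of the run being built, or None
    for w in words:
        if w["confidence"] < threshold:
            if current is None:
                current = [w["start"], w["end"]]
            else:
                current[1] = w["end"]  # extend the open run
        elif current is not None:
            spans.append(tuple(current))
            current = None
    if current is not None:  # close a run that reaches the end of the file
        spans.append(tuple(current))
    return spans


# Invented sample values for illustration.
sample = [
    {"start": 0.0, "end": 0.5, "confidence": 0.95},
    {"start": 0.5, "end": 1.0, "confidence": 0.40},
    {"start": 1.0, "end": 1.4, "confidence": 0.50},
    {"start": 1.4, "end": 2.0, "confidence": 0.90},
    {"start": 2.0, "end": 2.5, "confidence": 0.30},
]
print(low_confidence_spans(sample))
```

The resulting time ranges make a convenient review list: jump to each one in the audio and check the attribution by ear.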
Do I have to train speakers in advance?
No. Diarization works without voice enrolment — the model separates speakers purely from audio features, not from pre-registered voice profiles. Privacy benefit: no biometric voice models are stored.
Do speaker labels also appear in subtitle files?
Yes. SRT and VTT exports prefix every subtitle with the speaker name: "Dr Meier: Let's get started." If you've renamed speakers, the real names appear; otherwise "Speaker 1/2/3".
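For readers who build subtitles themselves, here is a minimal sketch of what a speaker-prefixed SRT cue looks like. The SRT layout (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) is standard; the cue dictionaries and function names are hypothetical and do not describe DeepScript's exporter.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues):
    """Render cues as SRT text, prefixing each subtitle with the speaker name."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(cue['start'])} --> {srt_timestamp(cue['end'])}\n"
            f"{cue['speaker']}: {cue['text']}\n"
        )
    return "\n".join(blocks)  # a blank line separates consecutive cues


# Invented sample cues for illustration.
cues = [
    {"start": 0.0, "end": 1.5, "speaker": "Dr Meier", "text": "Let's get started."},
    {"start": 1.8, "end": 3.25, "speaker": "Speaker 2", "text": "Sounds good."},
]
print(to_srt(cues))
```

A VTT variant would differ mainly in the `WEBVTT` header line and in using a dot instead of a comma before the milliseconds.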
Is this suitable for qualitative research with NVivo or MAXQDA?
Yes. The JSON export carries `start`, `end`, `confidence` and `speaker` per word. Import into NVivo/MAXQDA via their JSON or plain-text-with-speaker-markers workflow. If you need a specific export shape, let us know.
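As a rough illustration of the plain-text route, the sketch below merges consecutive words from the same speaker into turns and writes one `Speaker: text` line per turn. Several things are assumptions: the `word` key, the flat list layout, and the marker format itself. NVivo and MAXQDA each document the speaker-marker conventions they accept, so adjust the output line to match.

```python
def words_to_turns(words):
    """Merge consecutive words with the same speaker label into turns."""
    turns = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            turns[-1]["words"].append(w["word"])
            turns[-1]["end"] = w["end"]
        else:
            # The speaker changed (possibly mid-sentence): open a new turn.
            turns.append({
                "speaker": w["speaker"],
                "start": w["start"],
                "end": w["end"],
                "words": [w["word"]],
            })
    return turns


def turns_to_text(turns):
    """Render one 'Speaker: text' line per turn (marker format is an assumption)."""
    return "\n".join(f"{t['speaker']}: {' '.join(t['words'])}" for t in turns)


# Invented sample words; `word` is a hypothetical key name.
words = [
    {"word": "Good", "start": 0.0, "end": 0.3, "speaker": "Speaker 1"},
    {"word": "morning.", "start": 0.3, "end": 0.8, "speaker": "Speaker 1"},
    {"word": "Hi", "start": 1.0, "end": 1.2, "speaker": "Speaker 2"},
    {"word": "everyone.", "start": 1.2, "end": 1.9, "speaker": "Speaker 2"},
]
print(turns_to_text(words_to_turns(words)))
```

Each turn keeps its `start` and `end`, so timestamps can be added to the output lines if the target tool supports them.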
See it for yourself
Upload a file and see the result in minutes. Three transcriptions free, no credit card.