Giving AI Agents Access to Your Audio: Transcription via MCP

How to make a year of meeting recordings, interviews, and calls queryable by AI agents like Claude and ChatGPT – using transcription, MCP, and the right retention policy.

DeepScript Team · May 5, 2026 · 9 min read

Most teams accumulate audio. Meeting recordings, sales calls, interview tapes, podcast episodes, internal trainings – files that nominally contain the company's collective memory but practically live in folders nobody opens twice. Search the audio? Sure, by filename, if you remember it.

In 2026, this is solvable. Audio gets transcribed cheaply enough (under €0.30/hour) that it makes sense to transcribe everything by default. AI agents read text fluently. The Model Context Protocol (MCP) gives those agents structured access to external systems. Stitch the three together and you get a setup where Claude or ChatGPT can answer questions like "what did we tell the Hamburg client about the rollout schedule" with citations from the actual recordings.

This article walks through the architecture: what to record, how to transcribe, how to expose it to an agent, and what privacy decisions matter.

The shape of the problem

Audio is hostile to retrieval. You cannot grep an MP3, you cannot link to a phrase in a meeting, you cannot ask a chatbot about a sales call from three months ago. The information is there, encoded in a waveform, but every consumer of it (humans, tools, agents) has to listen linearly.

Transcription flips this. A transcript is text. Text indexes, searches, summarizes, embeds, links. A 60-minute meeting becomes a queryable artifact instead of an unopenable file.

But "transcription" alone is not the goal. The goal is: agents that can act on what was said. That requires three layers:

  1. Transcribe – convert audio to time-stamped, speaker-labeled text.
  2. Store – keep transcripts queryable indefinitely (or as long as compliance allows).
  3. Expose – give agents a structured way to search and read those transcripts.

The first two are well-trodden. The third is what MCP enables.
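
To make the first layer concrete, here is a minimal sketch of what "time-stamped, speaker-labeled text" might look like once it is out of the waveform. The Segment fields and the grep helper are illustrative assumptions, not DeepScript's actual schema; the point is only that text, unlike an MP3, can be searched in a few lines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One speaker turn in a transcript (hypothetical field names)."""
    transcription_id: str
    speaker: str
    start_s: float  # offset from the start of the recording, in seconds
    text: str


def grep(segments, phrase):
    """Return the segments whose text contains phrase, ignoring case."""
    needle = phrase.casefold()
    return [s for s in segments if needle in s.text.casefold()]


# Made-up sample data for illustration only.
segments = [
    Segment("tr_001", "Sandra", 12.0, "The Hamburg rollout starts in March."),
    Segment("tr_001", "Jonas", 19.5, "Budget approval is still pending."),
    Segment("tr_002", "Sandra", 4.0, "Hamburg asked for a later start date."),
]
hits = grep(segments, "hamburg")
```

A real archive would use a proper full-text index rather than a linear scan, but the shape of the result (segment, speaker, offset) stays the same.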

A primer on MCP

The Model Context Protocol, introduced by Anthropic in late 2024 and now broadly adopted, is a standard way for LLMs to talk to external systems. An MCP server exposes a set of tools (read this resource, search this index, run this query) and an agent's runtime calls those tools mid-conversation.

The win over earlier "function calling" approaches is interoperability. An MCP server you build once works with Claude, ChatGPT (through its connector support), Cursor, Zed, and any other agent that speaks MCP. You stop building bespoke tool integrations per agent.

For transcription, MCP is a natural fit. The agent says "search transcripts for X," gets back matching segments with speakers and timestamps, and weaves that into a response. The user does not need to know which meeting the answer came from until they ask for citations.
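
For a sense of what travels over the wire, the sketch below builds the kind of message an agent runtime would send. MCP uses JSON-RPC 2.0, and tool invocations go through the tools/call method. The tool name matches the search tool described later in this article; the argument name "query" is an assumption, since the real input schema is whatever the server advertises.

```python
import json


def build_tool_call(request_id, tool, arguments):
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# "query" is a hypothetical argument name, used here for illustration.
request = build_tool_call(1, "search_transcriptions", {"query": "Hamburg rollout"})
```

In practice you rarely write this by hand: the agent's runtime generates the request, and the MCP server only has to answer it.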

Architecture: agent-readable audio archive

A working setup looks like this:

[Recording sources]                  [DeepScript]                    [Agent]
                                                                           
Zoom/Meet recordings  ─►  ┌────────────────────┐    MCP query    ┌─────────────┐
Phone calls           ─►  │  Transcribe        │  ◄────────────  │  Claude     │
Interviews            ─►  │  Speaker label     │                 │  (or other) │
Field recordings      ─►  │  Store + index     │  ─ JSON ──────► │             │
                          └────────────────────┘                 └─────────────┘
                                   │                                    │
                                   └─ permanent retention via Pro plan ─┘

Three components:

Source-to-DeepScript. Whatever produces audio (Zoom recording bot, call recorder, hand-held device) drops files into DeepScript via the REST API. A small uploader script can watch a folder and POST new files.

DeepScript transcription + storage. Audio is processed, transcripts are stored. The default retention is 30 days for audio; the Pro plan extends this to permanent retention for transcripts and audio. For an "agent-readable archive" use case, you almost always want Pro.

MCP exposure. DeepScript's MCP server runs as part of the API and gives agents these capabilities:

  • list_directories – top-level grouping (e.g., "Sales calls Q1", "Customer interviews")
  • list_transcriptions – transcripts within a directory, filterable by date and speaker
  • read_transcription – full text + word-level timestamps
  • search_transcriptions – full-text search across the archive, returns matching segments with context

The agent only sees what its credentials allow. Each MCP token is scoped to a directory or set of directories.
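
One way to picture directory scoping is as a filter applied before any tool runs. The sketch below is purely illustrative: McpToken, visible_directories, and authorize are hypothetical names, and DeepScript's server-side enforcement may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class McpToken:
    """A token and the directories it may read (hypothetical model)."""
    name: str
    directory_ids: frozenset


def visible_directories(token, all_directory_ids):
    """Return only the directories this token is scoped to."""
    return [d for d in all_directory_ids if d in token.directory_ids]


def authorize(token, directory_id):
    """Raise PermissionError if the token does not cover the directory."""
    if directory_id not in token.directory_ids:
        raise PermissionError(f"{token.name} has no access to {directory_id}")


token = McpToken("claude-research-archive", frozenset({"dir_xyz"}))
listing = visible_directories(token, ["dir_xyz", "dir_hr"])
```

The design choice worth copying is that the check happens per request on the server, so an agent cannot widen its own access by asking for a directory ID it was never given.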

Setting it up: the 15-minute version

Walking through the actual setup with DeepScript:

1. Subscribe to Pro on the directory you want agent-accessible

Permanent retention of transcripts requires the Pro plan (€22/month per directory). You can keep other directories on the default 30-day retention; pricing is per-directory, not blanket.

curl -X POST https://api.deepscript.com/v1/account/pro/subscribe \
  -H "Authorization: Bearer $DS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"directoryId":"dir_xyz"}'

This redirects to Stripe Checkout for the first activation. Subsequent directories are added without re-checkout.

2. Mint an MCP token scoped to that directory

curl -X POST https://api.deepscript.com/v1/account/mcp-tokens \
  -H "Authorization: Bearer $DS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"directoryId":"dir_xyz","name":"claude-research-archive"}'

The response contains a token starting with mcp_live_…. Treat it like a password – it grants the holder full read access to that directory.

3. Add it to Claude Desktop

In Claude Desktop's claude_desktop_config.json:

{
  "mcpServers": {
    "deepscript": {
      "command": "npx",
      "args": ["-y", "@deepscript/mcp"],
      "env": {
        "DEEPSCRIPT_MCP_TOKEN": "mcp_live_xxxxxxxxxx"
      }
    }
  }
}
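
If the config file already lists other MCP servers, pasting the snippet over it would drop them. A small merge helper avoids that. This is a sketch: add_mcp_server is a hypothetical helper name, it works on the file's text rather than touching disk, and the entry it writes mirrors the snippet above.

```python
import json


def add_mcp_server(config_text, name, token):
    """Return config JSON with a DeepScript entry added, keeping other servers."""
    config = json.loads(config_text) if config_text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    servers[name] = {
        "command": "npx",
        "args": ["-y", "@deepscript/mcp"],
        "env": {"DEEPSCRIPT_MCP_TOKEN": token},
    }
    return json.dumps(config, indent=2)


# Example: an existing config with one unrelated server ("notes" is made up).
existing = '{"mcpServers": {"notes": {"command": "notes-mcp"}}}'
merged = json.loads(add_mcp_server(existing, "deepscript", "mcp_live_xxxxxxxxxx"))
```

Read the current file, pass its contents through the helper, and write the result back; keep the token out of version control either way.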

Restart Claude Desktop and the new tool surface appears. You can now ask:

  • "List the most recent meetings in the archive"
  • "Search for everything we discussed about the Hamburg rollout"
  • "What did Sandra say about the Q3 budget?"
  • "Pull the part of last Tuesday's all-hands where we talked about hiring"

The agent calls search_transcriptions, gets back matching segments with citations (transcription ID + timestamp), and synthesizes an answer. If you ask for citations, it surfaces them.
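
A citation here is just a transcription ID plus a timestamp, so turning a search hit into something a human can follow is straightforward. The sketch below assumes a hit is a dict with transcriptionId, speaker, startSeconds, and text keys; those names are guesses for illustration, not the documented response format.

```python
def format_offset(seconds):
    """Render an offset in seconds as HH:MM:SS."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def cite(hit):
    """Format one search hit as a quote followed by its citation."""
    where = f'{hit["transcriptionId"]} @ {format_offset(hit["startSeconds"])}'
    return f'{hit["speaker"]}: "{hit["text"]}" ({where})'


# Hypothetical hit, shaped the way this sketch assumes.
hit = {
    "transcriptionId": "tr_001",
    "speaker": "Sandra",
    "startSeconds": 3725.4,
    "text": "We can commit to March.",
}
line = cite(hit)
```

Whatever the real field names turn out to be, keeping the ID and offset attached to every quote is what lets a reader jump back to the recording and check the agent's answer.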

4. Automate the upload

The most common gap is "I have audio but it never lands in DeepScript." The simplest fix is a folder-watcher script:

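import os
import time
from pathlib import Path

import requests

WATCH = Path("~/Recordings").expanduser()
API_KEY = os.environ["DS_API_KEY"]
DIR_ID = "dir_xyz"
# pathlib's glob() does not expand {a,b} braces, so filter by suffix instead.
AUDIO_SUFFIXES = {".mp3", ".wav", ".m4a", ".mp4"}

# Kept in memory only: after a restart, files in the folder are uploaded again.
uploaded = set()
while True:
    for f in sorted(WATCH.iterdir()):
        if not f.is_file() or f.suffix.lower() not in AUDIO_SUFFIXES:
            continue
        if f.name in uploaded:
            continue
        with open(f, "rb") as fh:
            r = requests.post(
                "https://api.deepscript.com/v1/transcriptions",
                headers={"Authorization": f"Bearer {API_KEY}"},
                files={"file": fh},
                data={"model": "premium", "directoryId": DIR_ID},
                timeout=300,  # do not hang forever on a stalled upload
            )
        if r.ok:
            uploaded.add(f.name)
    time.sleep(60)  # poll the folder once a minute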

For Zoom-style recordings, the cloud-recording webhooks let you skip the local file: subscribe to recording.completed, fetch the file with the recording token, POST to DeepScript. About 50 lines of Python or Node.
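
The first half of that webhook handler, picking the downloadable files out of the event, might look like the sketch below. The payload field names (payload.object.recording_files, download_url, file_type, and a top-level download_token) reflect our understanding of Zoom's recording.completed event and should be checked against Zoom's current webhook reference; extract_downloads is a hypothetical helper name.

```python
# File types worth transcribing; chat logs, timelines, etc. are skipped.
MEDIA_TYPES = {"MP4", "M4A"}


def extract_downloads(event):
    """Return (download_url, download_token) pairs for media files in the event."""
    if event.get("event") != "recording.completed":
        return []
    token = event.get("download_token", "")
    files = event.get("payload", {}).get("object", {}).get("recording_files", [])
    return [
        (f["download_url"], token)
        for f in files
        if f.get("file_type") in MEDIA_TYPES and "download_url" in f
    ]


# Made-up event in the assumed shape, for illustration only.
sample_event = {
    "event": "recording.completed",
    "download_token": "tok_123",
    "payload": {"object": {"recording_files": [
        {"file_type": "MP4", "download_url": "https://example.com/rec.mp4"},
        {"file_type": "TIMELINE", "download_url": "https://example.com/rec.json"},
    ]}},
}
downloads = extract_downloads(sample_event)
```

The remaining steps are the ones the folder watcher already shows: fetch each URL with the token, then POST the bytes to the transcriptions endpoint.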

What this enables

Practical use cases we see teams build:

Sales-call retrieval. "Find the call where we promised the customer X by Y" – instead of listening to 40 calls. Citations land you on the exact minute.

Research synthesis. A qualitative researcher with 30 anonymized interview transcripts can ask "what themes emerged around topic X" and get summaries grounded in actual quotes.

Compliance review. Legal asks "did anyone discuss [sensitive topic] in customer-facing meetings last quarter?" The agent runs a structured search and produces a report with timestamps.

Onboarding. "What did we decide about [project] in the last six months?" – new hires get a chronologically ordered summary instead of asking five colleagues.

Personal memory. Founders who record every external meeting get a queryable interface to everything they have said and heard.

The pattern is the same in each case: questions that previously required either listening to recordings or interrogating colleagues are now answerable from a chat interface.

What you should not do

A few things this architecture is not good at, and where teams get burned:

Real-time decision support. MCP-mediated retrieval has latency (a few seconds per query). It is not the right primitive for in-meeting assistants – for that, you want streaming transcription with on-the-fly summarization.

Highly sensitive content with broad agent access. If your transcripts contain attorney-client privileged content, medical records, or HR confidential material, scope tokens narrowly. Do not give a general-purpose agent access to a directory containing material it should not see – agents follow instructions but they also follow accidental ones.

Replacing memory. Agents querying a transcript archive are good at retrieval and summarization. They are not good at deciding what is important. The transcript is the source of truth; the agent's summary is a useful but lossy view.

Skipping the privacy review. Permanent retention of audio means permanent custody. Decide what the deletion path looks like before you start hoarding. DeepScript's DELETE /v1/transcriptions/{id} works on Pro-archived material the same way as on default-retention material, but you should have a documented retention/deletion policy before you accumulate a year of recordings.
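
A documented retention policy is easier to keep when it is also executable. The sketch below selects transcripts older than a cutoff and builds the DELETE URLs for them. The endpoint path comes from the paragraph above; the id and createdAt field names, and the idea of feeding it a list of transcript records, are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

API_BASE = "https://api.deepscript.com/v1"


def expired_ids(transcripts, max_age_days, now=None):
    """Return IDs of transcripts created more than max_age_days ago."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        t["id"]
        for t in transcripts
        # createdAt is assumed to be ISO 8601 with an explicit UTC offset.
        if datetime.fromisoformat(t["createdAt"]) < cutoff
    ]


def deletion_urls(ids):
    """Build the DELETE URL for each transcript ID."""
    return [f"{API_BASE}/transcriptions/{i}" for i in ids]


# Made-up records in the assumed shape.
records = [
    {"id": "tr_old", "createdAt": "2025-01-01T00:00:00+00:00"},
    {"id": "tr_new", "createdAt": "2026-04-01T00:00:00+00:00"},
]
today = datetime(2026, 5, 5, tzinfo=timezone.utc)
to_delete = deletion_urls(expired_ids(records, 365, now=today))
```

Sending a DELETE request with the usual Authorization header to each URL completes the sweep; running it in dry-run mode first and logging the IDs is a sensible habit before anything is removed.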

Privacy implications you actually need to think about

Putting an agent in front of an audio archive concentrates a lot of organizational knowledge in one place. That is the point – but it raises the stakes for access control:

Who has the MCP token, and what scope does it have? A token to "all sales calls" is much more powerful than a token to "calls I personally was on." Scope tokens to the smallest directory the use case requires.

Where does the agent itself run? A self-hosted agent running on your infrastructure handles the transcripts on your hardware. A cloud-hosted agent (Claude Desktop talking to Anthropic's API) sends queries and retrieved content to a third party. That may or may not be acceptable depending on the content. For European customers with GDPR-bound data: read the agent vendor's data processing terms carefully, sign their DPA, and document the data flow in your record of processing activities.

Audit trail. DeepScript logs every MCP query (which token, which transcript IDs accessed, when). Periodic review of the audit log catches abuse early.
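
A periodic review can start as something very simple: count how many distinct transcripts each token touched and flag the outliers. The sketch below assumes audit entries are dicts with token and transcriptionId keys; the export format and field names are hypothetical, as is the threshold.

```python
from collections import defaultdict


def flag_heavy_tokens(entries, max_distinct):
    """Return {token: count} for tokens that read more than max_distinct transcripts."""
    seen = defaultdict(set)
    for entry in entries:
        seen[entry["token"]].add(entry["transcriptionId"])
    return {tok: len(ids) for tok, ids in seen.items() if len(ids) > max_distinct}


# Made-up audit entries for illustration.
log = [
    {"token": "claude-research-archive", "transcriptionId": "tr_001"},
    {"token": "claude-research-archive", "transcriptionId": "tr_001"},
    {"token": "sales-bot", "transcriptionId": "tr_010"},
    {"token": "sales-bot", "transcriptionId": "tr_011"},
    {"token": "sales-bot", "transcriptionId": "tr_012"},
]
flagged = flag_heavy_tokens(log, max_distinct=2)
```

Repeated reads of the same transcript are not counted twice, so the number reflects breadth of access rather than query volume, which is usually the more interesting signal for abuse.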

Data subject rights. If transcripts contain personal data of identifiable individuals (interview subjects, customers, employees), the GDPR right to erasure applies to the transcripts as well as the audio. DeepScript supports per-transcript deletion that propagates to all replicas.

Closing thought

The combination of cheap transcription, structured storage, and standardized agent protocols is genuinely new. Five years ago, "make our archive AI-queryable" was a six-figure consulting engagement. Today it is a weekend project – three components, each commodity, glued together with about 100 lines of code.

The bottleneck is not technology anymore. It is deciding what to archive, who gets access, and how to retire material when it becomes a liability. Those are organizational questions, not engineering ones.

If you want to try this with DeepScript, the API documentation is at api.deepscript.com/docs. Three free transcriptions on signup, no credit card required to evaluate.

Tags: MCP, AI agents, transcription, claude, automation
