Speak in your own voice — in a language you don't speak

Here is the part of real-time translation that almost everyone gets wrong, and that almost no one talks about: the voice you hear.

You can have excellent speech recognition and excellent translation, and still end up with a meeting that feels like a machine reading a list. That's because the last step, turning the translated text back into sound, is where most tools quietly replace you with a single generic synthetic narrator. Eight people in the room, one robot voice for all of them. You lose who is speaking, the emphasis, the personality. Intelligible, but not a conversation.

InterMIND does the last step differently. When you speak, the other participants hear the translation in a voice that's recognizably yours — carrying your timbre and your way of speaking — now saying the words in their language. It isn't a flawless impression yet; the point is that it's you rather than a stock narrator, and it's getting better. This works for every participant, in both directions, at the same time.

This post is the missing chapter of "Inside the four translation pipelines that run InterMIND". That piece explained how audio becomes translated audio; this one is about whose voice comes out the other end.

The default everyone ships, and why it's flat

If you've used live translation in any of the big meeting platforms, you know the sound. A neutral, evenly paced voice reads the translation. It's the same voice whether the speaker is your CEO opening a town hall or a colleague cracking a joke. The technology underneath is text-to-speech with one fixed voice model, and the design assumption is that intelligibility is enough.

In a real meeting it isn't. Half of what a meeting communicates is who is saying it and how. Strip the voice and you've turned a discussion into a transcript that happens to be spoken aloud. People stop reacting to each other and start waiting their turn.

What InterMIND does instead

The translation runs as a cascaded pipeline — three specialized stages in sequence rather than one model trying to do everything. The first two stages are covered in the pipelines post; the voice step is the one this post is about:

ASR — speech recognition. Your words are transcribed in your own language, in your browser, as you speak. (Running it locally saves a round-trip and gives the lowest possible delay before translation can even start.)
MT — translation. The transcript is grouped into stable sentence fragments — clauses — so translation can begin before you've finished the sentence, and each fragment is translated progressively into the listener's language.
Zero-shot TTS — voice synthesis. Each translated fragment is spoken back out using a sample of your own voice, and streamed to the listener.

It's that third stage, the zero-shot TTS at the end of the ASR → MT → TTS chain, that produces the effect. "Zero-shot" means the system doesn't need a pre-recorded enrollment or a training session for your voice. It models your voice from the audio of the meeting you're already in.
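To make the shape of that cascade concrete, here is a minimal sketch in Python. It is not InterMIND's code: `split_into_clauses`, `run_pipeline`, and `SpokenFragment` are hypothetical names, the punctuation-based splitter is only a stand-in for real clause detection, and the translator is passed in as a plain function. The one thing it is meant to show is the ordering: each clause is translated and handed to the voice stage as soon as it is stable, without waiting for the full sentence.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class SpokenFragment:
    """One translated clause plus the voice reference a TTS stage would use."""
    text: str
    voice_ref: bytes


def split_into_clauses(transcript: str) -> List[str]:
    """Split at clause punctuation (an assumed stand-in for real clause detection)."""
    parts = re.split(r"(?<=[,.;:!?])\s+", transcript.strip())
    return [p for p in parts if p]


def run_pipeline(
    transcript_chunks: Iterable[str],
    translate: Callable[[str], str],
    voice_ref: bytes,
) -> Iterator[SpokenFragment]:
    """Yield one fragment per clause, in order, as transcript text arrives."""
    for chunk in transcript_chunks:              # stage 1 output: ASR text
        for clause in split_into_clauses(chunk):
            translated = translate(clause)       # stage 2: MT, clause by clause
            # Stage 3 input: a real system would synthesize audio here.
            yield SpokenFragment(translated, voice_ref)
```

Because `run_pipeline` is a generator, a caller can start playing the first fragment while later clauses are still being transcribed, which is the property the clause-by-clause design is after.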

The warm-up: how it starts sounding like you so fast

There's a chicken-and-egg problem hiding in "use a sample of your own voice." At the very start of a call, the system hasn't heard enough of you yet to model your voice well.

InterMIND handles this with a progressive warm-up:

For roughly the first 5–10 seconds, while it's still gathering enough of your speech, each translated fragment is synthesized using the audio fragment that matches what you just said in your source language. The voicing is anchored to your real, immediate speech.
Once there's a long enough sample — that 5–10 second mark — the system locks onto it and uses it to voice everything afterward.

In practice you don't hear a switch flip. The translation sounds more like you as the conversation gets going — not a perfect double of your voice, but clearly yours rather than a machine's, and improving as the model hears more. The combination of progressive translation (clause by clause, not sentence by sentence) and progressive voicing is what keeps the whole thing under the latency budget while still sounding human.
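The warm-up logic can be sketched in a few lines. This is an illustration under stated assumptions, not the production implementation: `WarmUpVoice` and `reference_for` are hypothetical names, audio is represented as raw bytes, and the 5-second threshold is simply the low end of the "5 to 10 seconds" range above. Before the threshold, each fragment is voiced from the audio that was just spoken; once enough speech has accumulated, one sample is locked and reused.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WarmUpVoice:
    """Chooses the reference audio used to voice each translated fragment."""
    lock_after: float = 5.0  # seconds of speech needed before locking (assumed)
    _collected: List[bytes] = field(default_factory=list, init=False)
    _seconds: float = field(default=0.0, init=False)
    _locked: Optional[bytes] = field(default=None, init=False)

    def reference_for(self, fragment_audio: bytes, seconds: float) -> bytes:
        """Return the reference to use for the fragment that was just spoken."""
        if self._locked is not None:
            return self._locked              # after warm-up: one stable sample
        self._collected.append(fragment_audio)
        self._seconds += seconds
        if self._seconds >= self.lock_after:
            # Enough speech gathered: lock the accumulated sample for later fragments.
            self._locked = b"".join(self._collected)
            self._collected.clear()
        return fragment_audio                # during warm-up: the matching source fragment
```

Note that the fragment that crosses the threshold is still voiced from its own audio; the locked sample takes over from the next fragment on, which is one way to avoid an audible switch mid-clause.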

The voice sample is never stored

This is the part a security or legal team asks about immediately, so here it is plainly.

The voice sample used for synthesis is ephemeral. It exists only for the live conference session, in service of voicing the translation, and it is stored nowhere. The Mind API and SDK that power the real-time session retain no data — everything temporary dies when the conference session ends.

It's worth being precise about what this sample is not: it is not one of InterMIND's recording features. Recording a meeting's video and audio is a separate, deliberate action you take on purpose, with its own controls. The own-voice sample is not a recording — it's a transient input to the speech synthesizer that never outlives the call.
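For readers who think in code, "ephemeral" here means something like the following sketch. It is an assumption about the general pattern, not a description of the Mind API or SDK internals: `conference_session` is a hypothetical helper, and the samples live only in an in-memory dictionary that is wiped when the session block exits, whether the call ends normally or with an error.

```python
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def conference_session() -> Iterator[Dict[str, bytearray]]:
    """Hold per-speaker voice samples in memory for the life of one call."""
    samples: Dict[str, bytearray] = {}
    try:
        yield samples
    finally:
        for buf in samples.values():
            buf[:] = bytes(len(buf))   # overwrite the audio bytes with zeros
        samples.clear()                # then drop every reference


with conference_session() as samples:
    samples["speaker-1"] = bytearray(b"\x01\x02\x03")  # placeholder audio
    # ... voice translated fragments using samples["speaker-1"] ...
# Here the session is over and nothing was written to disk or a database.
```

The design choice worth noticing is that there is no save path at all: the only place a sample can live is the dictionary, and the dictionary does not outlive the `with` block.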

This matters beyond privacy hygiene. "Speak in your own voice" is exactly the kind of feature that sounds like it should involve storing a voiceprint somewhere. It doesn't. The honest version is the better story: your voice is modeled in the moment and gone when you hang up.

Why no one else ships this

It's not that voice cloning is a secret. It's that doing it live, per-participant, in both directions, under a one-second budget, across 21 languages, without storing anything is a different problem than cloning a voice offline for a podcast.

The big platforms optimize their translation for caption coverage and a single safe narrator voice, because that's the cheap, robust default at scale. Keeping each speaker's own voice means the synthesis stage has to track every participant independently and stay inside the same latency budget the rest of the pipeline lives under. We built the voice engine ourselves, on our own infrastructure, which is what makes that trade-off ours to make. (More on why the engine is our own code: "What one InterMIND meeting is built from.")
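A quick way to see why "per-participant, in both directions" is more work than one narrator is to count the voiced streams a room needs. The sketch below assumes, purely for illustration, that each speaker needs one synthesized stream per other language present in the meeting; `synthesis_streams` is a hypothetical helper, and the real routing may differ.

```python
from typing import Dict, List, Tuple


def synthesis_streams(participants: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (speaker, target_language) pairs, one per voiced stream needed.

    `participants` maps a participant name to the language they speak and listen in.
    """
    languages = set(participants.values())
    return sorted(
        (speaker, target)
        for speaker, own in participants.items()
        for target in languages
        if target != own      # no translation needed into the speaker's own language
    )


room = {"ana": "es", "ben": "en", "chloe": "fr"}
streams = synthesis_streams(room)   # six pairs: each speaker into the two other languages
```

With a single narrator the voice stage has one model to keep loaded regardless of who is talking. Under the assumption above, three speakers in three languages already mean six own-voice streams, each of which has to fit inside the same latency budget.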

Where this is going: lip-sync

Keeping your voice is one half of a bigger goal. The other half is your face.

Right now you hear the other person in their own voice, but if they're on camera, their lips still move to the words they actually said, in a language you don't speak. The next step is lip-sync: re-timing the speaker's mouth to the translated audio, so that on your screen they appear to be speaking your language.

Put the two together and the whole point of this work comes into focus. Two people who share no common language sit across a video call and see and hear each other as if each were a native speaker of the other's language — same voice, same face, no interpreter in the middle, no robot reading a script.

To be clear about status: voice is live today; lip-sync is on the roadmap, not shipped. We're calling out the destination because it's why the voice work matters — own-voice translation isn't the feature, it's the first half of "talk to anyone, in any language, as yourself."

Where to hear it

Own-voice translation is live today, across all 21 voice languages — the same languages listed in the docs. There's nothing to turn on separately: when translation is enabled in a meeting, participants automatically hear each other in their own voices. We'll be honest about where it stands: today the voice is already recognizably you, and the resemblance is something we're actively pushing closer. Go listen and judge for yourself.

Try the demo — runs the live voice pipeline against your audio in any of the 21 languages.
See the quality numbers — the same production pipeline, scored monthly against FLORES-200, with the full distribution published per language pair.
How it works, in the docs — the short version of this post.

A translated meeting should feel like the people who are actually in it talking to each other. Keeping your voice is how it gets there.