Your Own Voice
When InterMIND translates your speech for another participant, they don't hear a robotic text-to-speech narrator. They hear a voice that's recognizably yours — carrying your timbre and your way of speaking — now saying the words in their language.
This works in both directions and for every participant independently. In a meeting where five people speak five languages, each listener hears the other four in the listener's own language, and each of those four still sounds like themselves.
What It Sounds Like
Most live-translation tools replace the speaker with a single generic synthetic voice. The result is intelligible but flat — you lose who is talking, the emphasis, the personality. InterMIND keeps the speaker's voice, so a translated meeting feels like a conversation between the people who are actually in it, not a queue of announcements read by a machine.
How It Works
InterMIND uses a cascaded pipeline, and the voice step is the last stage:
- Speech recognition — your words are transcribed in your own language, as you speak.
- Segmentation — the transcript is grouped into stable sentence fragments (clauses) so translation can begin before you finish the sentence.
- Translation — each fragment is translated progressively into the listener's language.
- Voice synthesis — each translated fragment is spoken back using a sample of your own voice, and sent to the listener.
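The four stages above can be sketched as a chain of generators. This is an illustration only, not the Mind API or SDK: every name here (`Fragment`, `segment`, `translate`, `synthesize`, `run_pipeline`) is hypothetical, recognition is assumed to have already produced a stream of words, and translation and synthesis are stand-in stubs. What it shows is the ordering, and that a clause is passed on as soon as it closes rather than waiting for the whole sentence.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List

# Hypothetical clause boundaries; a real segmenter would be far more careful.
CLAUSE_BREAKS = (",", ";", ".", "?", "!")


@dataclass
class Fragment:
    text: str      # one stable clause
    language: str  # language code of the text


def segment(words: Iterable[str], language: str) -> Iterator[Fragment]:
    """Group recognised words into clauses, emitting each one as soon as it closes."""
    buffer: List[str] = []
    for word in words:
        buffer.append(word)
        if word.endswith(CLAUSE_BREAKS):
            yield Fragment(" ".join(buffer), language)
            buffer = []
    if buffer:  # flush the unfinished clause when the speaker stops
        yield Fragment(" ".join(buffer), language)


def translate(fragment: Fragment, target: str) -> Fragment:
    """Stub translator: tags the text with the target language instead of translating."""
    return Fragment(f"[{target}] {fragment.text}", target)


def synthesize(fragment: Fragment, voice_sample: bytes) -> Dict[str, object]:
    """Stub synthesis: describes the audio that would be produced."""
    return {
        "text": fragment.text,
        "language": fragment.language,
        "voice_sample_bytes": len(voice_sample),
    }


def run_pipeline(words: Iterable[str], source: str, target: str,
                 voice_sample: bytes) -> Iterator[Dict[str, object]]:
    """Segmentation -> translation -> voice synthesis, one fragment at a time."""
    for fragment in segment(words, source):
        yield synthesize(translate(fragment, target), voice_sample)


if __name__ == "__main__":
    spoken = "Thanks for joining, let's start with the budget.".split()
    for chunk in run_pipeline(spoken, "en", "es", voice_sample=b"\x00" * 16):
        print(chunk)
```

Because each stage is a generator, the first clause ("Thanks for joining,") reaches the synthesis step before the rest of the sentence has been segmented, which is the point of working in fragments.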
While the meeting is still gathering enough of your speech to model your voice (roughly the first 5–10 seconds), synthesis of each translated fragment uses the matching piece of your original audio, that is, the recording of what you just said in your source language, as its voice reference. Once a long enough sample has been collected, synthesis switches to that sample for everything that follows. In practice you don't notice the switch: the translation simply sounds more like you as the call goes on. It won't be a flawless impression of your voice, but it is recognizably you rather than a generic narrator, and it keeps improving as the model hears more of you.
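The warm-up behaviour described above can be pictured with a small sketch. It is a guess at the shape of the logic, not InterMIND's implementation: the class `VoiceReference`, its method `reference_for`, and the 5-second threshold are all assumptions (the text only says "roughly the first 5–10 seconds"), and for simplicity the collected sample is frozen once it is long enough, whereas the real system may keep refining it.

```python
from typing import List, Optional

# Assumed threshold; the documentation gives only a rough 5-10 second range.
MIN_SAMPLE_SECONDS = 5.0


class VoiceReference:
    """Chooses which audio the synthesizer should imitate for each fragment."""

    def __init__(self, min_seconds: float = MIN_SAMPLE_SECONDS) -> None:
        self.min_seconds = min_seconds
        self._chunks: List[bytes] = []        # source audio collected so far
        self._seconds = 0.0                   # total duration collected
        self._stable: Optional[bytes] = None  # set once enough audio exists

    def reference_for(self, fragment_audio: bytes, fragment_seconds: float) -> bytes:
        """Return the voice reference to use for one translated fragment."""
        if self._stable is not None:
            # Enough speech has been heard: reuse the longer sample.
            return self._stable

        # Warm-up: keep collecting, and fall back to this fragment's own audio.
        self._chunks.append(fragment_audio)
        self._seconds += fragment_seconds
        if self._seconds >= self.min_seconds:
            self._stable = b"".join(self._chunks)
        return fragment_audio


if __name__ == "__main__":
    ref = VoiceReference()
    for audio in (b"a", b"b", b"c", b"d"):
        # Each fragment is treated as 2 seconds long for the demonstration.
        print(ref.reference_for(audio, 2.0))
```

In this sketch the first three fragments are each synthesized against their own source audio; the third one pushes the collected total past the threshold, so the fourth and every later fragment use the combined sample.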
Languages
Your-own-voice translation is available for all 21 voice languages — the same set listed in Choosing Languages. There's nothing to enable separately: when translation is on, participants automatically hear you in your own voice.
Privacy
The voice sample used for synthesis is ephemeral. It exists only for the duration of the live meeting and is not stored anywhere — the Mind API and SDK that power the real-time session keep no data once the conference session ends. This voice sample is unrelated to InterMIND's video-and-voice recording features, which are separate, explicit recordings you start on purpose.
On the Roadmap: Lip-Sync
Hearing the translation in your own voice is the first half of a larger goal. The next step we're working toward is lip-sync: re-timing the speaker's mouth movements on camera to match the translated audio, so the speaker appears to be saying the words in the listener's language. Combined with own-voice translation, the aim is a call where people who share no common language see and hear each other as if each spoke the other's language natively.
This is a roadmap item, not a shipped feature yet. The own-voice translation described above is live today.