[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"blog-post-en-/inside-the-translation-pipelines":3},{"page":4,"surround":626},{"id":5,"title":6,"authors":7,"badge":10,"body":11,"date":616,"description":617,"extension":618,"image":619,"meta":620,"navigation":621,"path":622,"seo":623,"stem":624,"__hash__":625},"blog/blog/inside-the-translation-pipelines.md","Inside the four translation pipelines that run InterMIND",[8],{"name":9},"The Mind.com Team","Architecture",{"type":12,"value":13,"toc":591},"minimark",[14,18,27,35,38,64,67,72,75,107,110,113,115,119,125,131,136,143,146,175,179,182,185,208,210,214,219,228,232,235,269,273,276,279,305,312,319,321,325,330,335,357,361,368,373,376,381,388,391,393,397,402,410,414,425,440,454,457,461,468,471,475,482,489,491,495,498,518,521,523,527,547,549,553,581,588],[15,16,6],"h1",{"id":17},"inside-the-four-translation-pipelines-that-run-intermind",[19,20,21,22,26],"p",{},"The old ",[23,24,25],"code",{},"/product/overview/how-it-works"," page on mind.com is several major releases out of date. It describes a single \"translation engine\" the way most vendor pages do — one big arrow from \"you speak\" to \"they hear.\" That picture was already a simplification two years ago. Today it is wrong.",[19,28,29,30,34],{},"The truth is that InterMIND runs ",[31,32,33],"strong",{},"four separate translation pipelines",", each solving a different problem with a different engine, a different latency budget, and a different quality envelope. They share a language picker. They do not share an engine.",[19,36,37],{},"This is the updated answer to \"how does it work.\"",[39,40,41],"blockquote",{},[19,42,43,46,47,55,56,59,60,63],{},[31,44,45],{},"A companion piece:"," ",[48,49,51],"a",{"href":50},"/blog/how-many-languages-do-you-support",[52,53,54],"em",{},"\"How many languages do you support?\""," covers what each pipeline ",[52,57,58],{},"covers"," (24 / 21 / 30 / 6). 
This post covers what each pipeline ",[52,61,62],{},"does"," — and why it is its own thing.",[65,66],"hr",{},[68,69,71],"h2",{"id":70},"why-one-engine-for-everything-is-a-lie","Why \"one engine for everything\" is a lie",[19,73,74],{},"A live meeting platform has at least four jobs to do at once, and they pull in incompatible directions:",[76,77,78,85,91,97],"ol",{},[79,80,81,84],"li",{},[31,82,83],{},"Real-time voice"," — audio in, translated audio out, under one second, every viewer in their own language. The hard constraint is latency.",[79,86,87,90],{},[31,88,89],{},"Real-time chat text"," — short messages, fast, with edits and quotes and HTML structure preserved.",[79,92,93,96],{},[31,94,95],{},"Real-time shared notes"," — character-by-character collaborative typing, with structural hierarchy (lists, headings, checkboxes) that has to survive translation.",[79,98,99,102,103,106],{},[31,100,101],{},"Asynchronous document files"," — a 40-page PDF dropped into chat. No latency budget. The hard constraint is ",[52,104,105],{},"fidelity"," — formatting, tables, page numbers, font.",[19,108,109],{},"You can build one giant LLM call that tries to do all four. We tried. It is bad at all four. The latency budget for voice means the model can't think; the fidelity budget for documents means the model has to. A chat edit needs a diff in the viewer's language; a 40-page PDF needs format preservation that no token-streaming model gives you.",[19,111,112],{},"So we run four. Here is each one.",[65,114],{},[68,116,118],{"id":117},"pipeline-1-real-time-voice-translation","Pipeline 1: Real-time voice translation",[19,120,121,124],{},[31,122,123],{},"The problem:"," A participant speaks French. Another participant joined in German, a third in Brazilian Portuguese, a fourth in Japanese. 
Each one needs to hear the speaker in their own language, in their own ear, with a delay short enough to keep eye contact possible.",[19,126,127,130],{},[31,128,129],{},"The budget:"," Sub-second end-to-end. Anything past ~1.2 seconds and the conversation breaks — people start talking over the translation, and the meeting drifts toward \"let's just switch to English.\"",[132,133,135],"h3",{"id":134},"how-the-audio-actually-moves","How the audio actually moves",[19,137,138],{},[139,140],"img",{"alt":141,"src":142},"Voice translation pipeline: the speaker's browser does ASR locally via Mind SDK, ws-server fans the transcript out to the translation engine over one WebSocket per target language present in the room, and each viewer receives their own translated audio track.","/blog/inside-the-translation-pipelines-voice.svg",[19,144,145],{},"A few things worth naming explicitly:",[147,148,149,155,165],"ul",{},[79,150,151,154],{},[31,152,153],{},"ASR runs in the speaker's browser",", not on a central server. We use the Mind SDK locally; this saves a round-trip and gives us the source-language transcript with the lowest possible delay before translation can even start.",[79,156,157,160,161,164],{},[31,158,159],{},"Translation is not one fan-out."," We hold a pool of WebSocket connections to our translation engine, ",[31,162,163],{},"one per target language present in the room",". If three participants picked German, they share one connection. If nobody picked Arabic, no Arabic connection is opened. The pool drops idle connections after five minutes. This is why the cost of a meeting tracks the languages participants actually picked, not the number of languages we support: we never translate to languages no participant is listening in.",[79,166,167,170,171,174],{},[31,168,169],{},"Synthesized speech is per-viewer."," Each participant receives their own translated audio track, mixed against the original speaker's video. 
They are not watching a master \"translated meeting\" — they are watching the ",[52,172,173],{},"same meeting",", with their personal audio channel translated to their picked language. This is why two people in the same physical room can each plug in headphones and hear different languages.",[132,176,178],{"id":177},"why-this-matters-when-a-meeting-goes-sideways","Why this matters when a meeting goes sideways",[19,180,181],{},"In a 40-minute call with eight languages, things break in interesting ways: WebSockets drop, ASR temporarily mis-transcribes a proper noun, one participant's network gets jittery. The architecture above is what lets us isolate failures: one viewer's audio glitching does not affect the other seven, because the translation engine never produced \"the translation\" in the first place — it produced eight, in parallel, and only the affected one has to recover.",[19,183,184],{},"The engine itself is ours, hosted on our own infrastructure. We do not route real-time voice through third-party general-purpose LLMs. The latency budget rules them out; the data-residency story rules them out for the regulated customers who actually care.",[39,186,187],{},[19,188,189,46,192,195,196,202,203,207],{},[31,190,191],{},"What we publish about voice quality:",[48,193,194],{"href":194},"/benchmark"," runs the production voice pipeline against ",[48,197,201],{"href":198,"rel":199},"https://github.com/facebookresearch/flores",[200],"nofollow","FLORES-200"," sentences for every published language pair, monthly. The judge is named (Gemini 2.5 Flash primary, Claude Sonnet 4 fallback). The full distribution — median, p10, p90, min, max, sample size — is on the page. 
See ",[48,204,206],{"href":205},"/benchmark/methodology","the methodology"," for what those numbers do and don't measure.",[65,209],{},[68,211,213],{"id":212},"pipeline-2-real-time-chat-translation","Pipeline 2: Real-time chat translation",[19,215,216,218],{},[31,217,123],{}," Every chat message in the meeting, translated for every participant in their own language, as it is sent. Plus edits — and edits need to look like edits, not like re-translations.",[19,220,221,223,224,227],{},[31,222,129],{}," Fast, but not sub-second. A chat message can take half a second to appear in another language without anyone caring. What people care about is whether the translation is ",[52,225,226],{},"right"," and whether edits make sense.",[132,229,231],{"id":230},"what-the-chat-pipeline-actually-does","What the chat pipeline actually does",[19,233,234],{},"Each message goes through the same translation engine the voice pipeline uses — but with different pre- and post-processing:",[147,236,237,243,253,263],{},[79,238,239,242],{},[31,240,241],{},"HTML structure is preserved."," Chat supports rich text (paragraphs, lists, quotes, bold, italic). We convert to plain text for the model, translate, then re-wrap the result in the original tags. The model never sees the HTML — it sees clean prose.",[79,244,245,248,249,252],{},[31,246,247],{},"Quotes are translated independently."," If you reply to a message and quote it, the ",[23,250,251],{},"[QUOTE]…[/QUOTE]"," block and the new content are translated as separate units, so the model can't confuse the two.",[79,254,255,258,259,262],{},[31,256,257],{},"Long messages get chunked."," We split on paragraph boundaries at 1,000 characters per chunk. Each chunk is its own translation call. 
We do ",[52,260,261],{},"not"," feed 4,000-character novels to the model in one shot — the failure modes (truncation, lost paragraphs, mid-sentence cut-offs) are too ugly.",[79,264,265,268],{},[31,266,267],{},"Translation is lazy."," We use an IntersectionObserver: a message is only translated when it scrolls into the viewer's viewport. Switching languages in a long-running channel used to replay every translation API call from the history. Now it doesn't.",[132,270,272],{"id":271},"the-interesting-part-edits-as-diffs","The interesting part: edits as diffs",[19,274,275],{},"In v1.2 we changed how chat edits behave for viewers in another language. The old behavior was: someone edits a message, we re-translate the whole thing, you see a fresh paragraph and have to spot what moved.",[19,277,278],{},"The new behavior:",[76,280,281,284,291,302],{},[79,282,283],{},"The original message was already translated to your language.",[79,285,286,287,290],{},"When the sender edits, we re-translate the ",[52,288,289],{},"new"," version.",[79,292,293,294,297,298,301],{},"We compute the diff between ",[31,295,296],{},"your previous translation"," and ",[31,299,300],{},"your new translation",", in your language.",[79,303,304],{},"We show that diff inline — same way Git shows you what changed.",[19,306,307,308,311],{},"So when \"review by Tuesday\" becomes \"review by Thursday\" in English, your Spanish-reading colleague sees ",[31,309,310],{},"martes → jueves"," highlighted, not a re-translated paragraph they have to re-read.",[19,313,314,315,318],{},"This required treating the chat pipeline as a ",[52,316,317],{},"stateful"," per-viewer cache, not a stateless translate-on-request endpoint. Documents and voice don't need this. Chat does.",[65,320],{},[68,322,324],{"id":323},"pipeline-3-real-time-shared-notes-translation","Pipeline 3: Real-time shared-notes translation",[19,326,327,329],{},[31,328,123],{}," The host opens a shared-notes pane and starts typing. 
Every participant sees the notes in their language, character-by-character, with the structure of the document — headings, nested lists, checklists, code blocks — intact.",[19,331,332,334],{},[31,333,129],{}," Same as chat (~half a second), but with two extra constraints:",[147,336,337,347],{},[79,338,339,342,343,346],{},[31,340,341],{},"The thing being translated changes mid-translation."," The host is still typing. A naive system that translates \"the whole document\" each keystroke produces flicker and burns the API budget. We translate at the granularity of the ",[52,344,345],{},"changed unit",", not the whole document.",[79,348,349,352,353,356],{},[31,350,351],{},"Structure must survive."," If you ask a translation model to translate a markdown blob with three nested lists, you get back something that ",[52,354,355],{},"looks"," like the original but with subtly flattened hierarchy, renumbered items, or moved indentation. We do not let the model see the whole blob.",[132,358,360],{"id":359},"how-the-notes-pipeline-differs-from-chat","How the notes pipeline differs from chat",[19,362,363,364,367],{},"The structural preservation is the main thing. We translate ",[31,365,366],{},"each list item independently"," rather than as one document. The model sees:",[39,369,370],{},[19,371,372],{},"\"Compliance review — Q2 deliverables\"",[19,374,375],{},"— not:",[39,377,378],{},[19,379,380],{},"\"# Project plan\\n## Quarter\\n- Compliance review — Q2 deliverables\\n- Vendor scoring\\n  - Tier 1 vendors...\"",[19,382,383,384,387],{},"The wrapping document — the ",[23,385,386],{},"\u003Cul>",", the headings, the indentation — is rebuilt on the client side using the same structure the original document had, with each leaf node swapped for its translation. 
The model never gets to \"improve\" the hierarchy.",[19,389,390],{},"Notes also use the same per-viewer diff model as chat edits: if the host changes a line, viewers in other languages see the changed words highlighted, not a fresh paragraph.",[65,392],{},[68,394,396],{"id":395},"pipeline-4-asynchronous-document-translation","Pipeline 4: Asynchronous document translation",[19,398,399,401],{},[31,400,123],{}," Someone drops a 40-page PDF, a Word doc, a PowerPoint deck, or an Excel sheet into chat. Each participant can request a copy in their own language. The translated file must look like the original — same fonts, same tables, same page numbers, same headers, same charts in place.",[19,403,404,406,407,409],{},[31,405,129],{}," No real-time constraint. A minute is fine. Two minutes is fine. The constraint is ",[31,408,105],{}," — if the translated PDF doesn't look like the original, the recipient won't trust it.",[132,411,413],{"id":412},"why-this-pipeline-does-not-share-an-engine-with-voice","Why this pipeline does not share an engine with voice",[19,415,416,417,420,421,424],{},"A general LLM, even a very good one, will hand you back a translated ",[52,418,419],{},"text"," of a document. It will not hand you back a translated ",[52,422,423],{},"PDF"," with the same layout. The model has no concept of \"page break that has to line up with the source\" or \"table cell that has to keep its column width.\"",[19,426,427,428,431,432,435,436,439],{},"For this surface we use the ",[31,429,430],{},"DeepL Document API"," directly. It is purpose-built for translating ",[52,433,434],{},"files as files",", not ",[52,437,438],{},"prose extracted from files",". DeepL handles:",[147,441,442,445,448,451],{},[79,443,444],{},"PDF (with layout preservation)",[79,446,447],{},"DOCX, DOC",[79,449,450],{},"PPTX",[79,452,453],{},"XLSX",[19,455,456],{},"The document is uploaded to DeepL's pipeline, translated server-side with formatting intact, and returned as the same format. 
We then upload the result to our object storage and surface it back in chat as a downloadable attachment.",[132,458,460],{"id":459},"what-this-costs-and-why-we-dont-hide-it","What this costs and why we don't hide it",[19,462,463,464,467],{},"DeepL bills a minimum of 50,000 characters per document — roughly one US dollar per file on the Pro tier, regardless of whether the document is one page or thirty. We absorb that cost rather than charging per file; it shows up in the meeting's translation usage as ",[31,465,466],{},"billed characters",", converted to word-units that match the way the rest of the product reports translation activity.",[19,469,470],{},"We picked DeepL for this surface because it is the best-in-class engine for document translation specifically. We do not pretend to have built a better one. The same is not true the other way around — DeepL does not run a live-voice pipeline of the kind we built for meetings. Different problems; different tools. The honest version of \"what powers InterMIND translation\" is \"the right engine per pipeline\" — not \"our engine, everywhere.\"",[132,472,474],{"id":473},"languages-this-pipeline-covers-that-voice-does-not","Languages this pipeline covers that voice does not",[19,476,477,478,481],{},"The document pipeline reaches ",[31,479,480],{},"30 languages",", vs. 21 for voice. The additions include Bulgarian, Greek, Estonian, Indonesian, Lithuanian, Latvian, Norwegian Bokmål, Slovak, and Slovenian. Arabic and Turkish are also available here: we hide them from the real-time picker because the voice quality doesn't clear our bar, but DeepL handles them well as documents.",[19,483,484,485,488],{},"That asymmetry is real. It means a French participant in a meeting can request the contract PDF in Estonian even though they cannot listen to the meeting in Estonian. We flag it in the picker rather than smooth it over with a single number. 
The reasoning is in the ",[48,486,487],{"href":50},"language-count post",".",[65,490],{},[68,492,494],{"id":493},"where-the-pipelines-meet","Where the pipelines meet",[19,496,497],{},"The four pipelines do not run in isolation. A meeting room is where they touch each other, and the seams matter:",[147,499,500,506,512],{},[79,501,502,505],{},[31,503,504],{},"A chat message with a document attachment"," triggers the chat pipeline for the text and the document pipeline for the file. The participant in another language sees the message translated immediately and the attachment translation arriving asynchronously as a downloadable.",[79,507,508,511],{},[31,509,510],{},"A shared note that quotes a transcript line"," crosses notes ↔ voice. The transcript is what the voice pipeline produced for the sender's language; the note translation produces a per-viewer copy of that quote in everyone else's language, with its source attribution preserved.",[79,513,514,517],{},[31,515,516],{},"A transcript exported after the meeting"," runs the chat-style text pipeline over the full conversation, producing a per-language file that participants can download. This is the same code path as chat translation, just batched.",[19,519,520],{},"The language picker is one piece of UI. The infrastructure underneath is four pipelines, talking to each other.",[65,522],{},[68,524,526],{"id":525},"what-we-deliberately-do-not-try","What we deliberately do not try",[147,528,529,535,541],{},[79,530,531,534],{},[31,532,533],{},"No \"unified translation model.\""," We are not building one model that does voice, chat, notes, and documents. The latency vs. fidelity trade-off doesn't have a winner. We use the right engine per surface.",[79,536,537,540],{},[31,538,539],{},"No silent re-routing."," If voice can't translate to Hindi today, we don't quietly fall back to the document engine and pretend it worked. 
Hindi is hidden from the picker on both surfaces because the result on either surface today is not shippable.",[79,542,543,546],{},[31,544,545],{},"No \"we translate to 200 languages.\""," Our engine emits 24. Our product ships 21 on the live surfaces and 30 on documents. The marketing-friendly bigger number is just the engine ceiling. The product number is what actually meets the bar in front of an auditor.",[65,548],{},[68,550,552],{"id":551},"try-it-yourself","Try it yourself",[147,554,555,567,574],{},[79,556,557,562,563,488],{},[48,558,560],{"href":559},"/demo",[23,561,559],{}," — runs the live voice pipeline against your audio, in any of the 21 product languages. The same pipeline that scores ",[48,564,565],{"href":194},[23,566,194],{},[79,568,569,573],{},[48,570,571],{"href":194},[23,572,194],{}," — per-pair, per-month quality on real traffic. Includes the pairs we deliberately hide from the picker, deep-linkable.",[79,575,576,580],{},[48,577,578],{"href":205},[23,579,205],{}," — what the numbers are, what they aren't, who the judge is.",[19,582,583,584,587],{},"Four pipelines, four engines, one meeting room. 
That is the honest replacement for the old ",[23,585,586],{},"how-it-works"," page.",[19,589,590],{},"— The Mind.com Team",{"title":592,"searchDepth":593,"depth":594,"links":595},"",2,3,[596,597,601,605,608,613,614,615],{"id":70,"depth":593,"text":71},{"id":117,"depth":593,"text":118,"children":598},[599,600],{"id":134,"depth":594,"text":135},{"id":177,"depth":594,"text":178},{"id":212,"depth":593,"text":213,"children":602},[603,604],{"id":230,"depth":594,"text":231},{"id":271,"depth":594,"text":272},{"id":323,"depth":593,"text":324,"children":606},[607],{"id":359,"depth":594,"text":360},{"id":395,"depth":593,"text":396,"children":609},[610,611,612],{"id":412,"depth":594,"text":413},{"id":459,"depth":594,"text":460},{"id":473,"depth":594,"text":474},{"id":493,"depth":593,"text":494},{"id":525,"depth":593,"text":526},{"id":551,"depth":593,"text":552},"2026-05-24","There is no \"the translation\" in InterMIND. There are four pipelines — voice, chat, notes, documents — each with its own engine, latency budget, and quality envelope. This is what actually happens between the moment you speak and the moment a participant in another language understands you.","md","/blog/inside-the-translation-pipelines.svg",{},true,"/blog/inside-the-translation-pipelines",{"title":6,"description":617},"blog/inside-the-translation-pipelines","H8Dnd0-NY9WxNUzTnAjcdtvHuFTVZzJkFzkkmcBHD7I",[627,631],{"title":628,"path":50,"stem":629,"description":630,"children":-1},"\"How many languages do you support?\" — and why our honest answer is six numbers, not one","blog/how-many-languages-do-you-support","Every vendor quotes one language count. We can't, because translation isn't one product. 
Here is the per-surface breakdown for InterMIND — what is filtered, why, and what we publish on the website.",{"title":632,"path":633,"stem":634,"description":635,"children":-1},"Why translation-quality marketing is broken — and what we publish instead","/blog/why-translation-quality-marketing-is-broken","blog/why-translation-quality-marketing-is-broken","Every translation vendor publishes language counts. None publishes verifiable per-pair quality on real traffic. Why that gap matters in your next procurement evaluation — and what we publish instead."]