Why translation-quality marketing is broken — and what we publish instead

Open any live-translation vendor's site. You will see the same kinds of numbers:

"200+ languages"
"6,000+ language pairs"
"World's first" / "Highest accuracy"
"99% accurate"

Now try to find — on any of those vendor pages — what those numbers mean for a meeting you are about to run. Per-language quality. Reproducible methodology. Sample size. Score over time. Honest disclosure of where the model is weak.

You will not find it. Not in the marketing copy, and rarely in the docs.

This is the equilibrium of the category. It exists because of three things:

Most vendors do not own their translation engine. They route through OpenAI, Google, DeepL, Microsoft, or some combination. Publishing per-pair quality data would be benchmarking someone else's model — there is no marketing value in that.
Honest quality data is hard to put on a billboard. A single score is noisy. A distribution is more useful but harder to compress. A last-six-months trend is more useful still, and even harder.
Procurement has not pushed back yet. Buyers accept the marketing numbers at face value, and so the equilibrium holds.

The equilibrium will not hold. The next class of buyer — pharma, legal, financial, audit, public sector — is going to ask harder questions than "how many languages." We built /benchmark because we think they should not have to take a vendor's word for it.

What the marketing numbers don't tell you

"200+ languages" means a vendor has a model that emits text in 200 languages. Quality across those languages ranges from production-grade for major pairs (EN↔DE, EN↔ES, EN↔FR) to barely usable for low-resource pairs. Without a per-pair breakdown, you cannot tell which side of that line your meeting will land on.

"6,000+ language pairs" is plain combinatorics: roughly 80 languages, each paired with every other in both directions, gives 80 × 79 = 6,320 pairs. Saying you support 6,000 pairs is the easy part. Saying that any specific pair is good enough for a CAPA review, a contract negotiation, or an earnings call is the part that is not in the brochure.

"99% accurate," with no statement of what was measured, against what reference, on what sample, and by what judge, is content-free. Translation quality has no universal scalar. It has a distribution that depends on language pair, content domain, audio quality (for voice), latency budget, and what "good enough" means for the specific use case.

What a buyer actually needs to know

The questions that show up in real DPA reviews and procurement evaluations:

Per-pair quality — how does this perform on DE↔EN, EN↔AR, JA↔KO, specifically?
Sample size — how many runs is your reported number based on? Ten? Ten thousand?
Methodology — who is judging the translations, against what reference, with what rubric?
Distribution, not average — what does the worst-case 10% look like? The best 10%? The median?
Drift over time — has a given pair gotten better or worse since you last published a number?
What you don't measure — what does your benchmark explicitly not capture?

None of these are unanswerable. They are just not on anyone's marketing page.

What we publish

/benchmark is our answer. The methodology is at /benchmark/methodology — written before we knew you'd be reading this.

Three things separate it from category norms.

1. Real traffic, not a curated suite

Every score in the public benchmark comes from a real /demo test run. We do not pre-select pairs that perform well. The same pipeline that serves a buyer's demo is the one being measured.

2. The judge is named

Primary: google/gemini-2.5-flash. Fallback: anthropic/claude-sonnet-4-20250514. Both via Vercel AI Gateway. The judge is part of the methodology — disclosed by name. If we change the judge in the future, historical rows will carry the original judge identifier; old scores never get silently re-scored.
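
To make the "judge identifier travels with the row" idea concrete, here is a minimal sketch. It is not our production code. The `ScoredRun` record, the `score_run` function, and the `call_judge` callable are hypothetical names for illustration, and falling back whenever the primary judge raises an error is an assumption; only the two model identifiers come from the text above.

```python
from dataclasses import dataclass
from typing import Callable

# Judge identifiers as named in the methodology.
PRIMARY_JUDGE = "google/gemini-2.5-flash"
FALLBACK_JUDGE = "anthropic/claude-sonnet-4-20250514"


@dataclass(frozen=True)  # frozen: a stored row cannot be mutated afterwards
class ScoredRun:
    pair: str
    score: float
    judge_id: str  # recorded at scoring time and kept with the row


def score_run(
    pair: str,
    source: str,
    translation: str,
    call_judge: Callable[[str, str, str], float],
) -> ScoredRun:
    """Try the primary judge, then the fallback; record which one answered.

    `call_judge(judge_id, source, translation)` is a hypothetical stand-in
    for whatever client actually talks to the model gateway.
    """
    for judge_id in (PRIMARY_JUDGE, FALLBACK_JUDGE):
        try:
            score = call_judge(judge_id, source, translation)
        except Exception:
            continue  # assumed policy: any judge error triggers the next judge
        return ScoredRun(pair=pair, score=score, judge_id=judge_id)
    raise RuntimeError("no judge returned a score")
```

Because the identifier is stored on each row, a later change of judge adds new rows under a new identifier instead of altering old ones.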

3. The distribution is the data, not the average

Every published row shows median, p10, p90, min, max, and sample size — not a single number. A single number for a translation pair is noise. The shape of the distribution is the signal.
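
As an illustration of what one published row contains, the sketch below computes the same six fields from a list of per-run scores. The `PairSummary` and `summarize` names are hypothetical, and the percentile interpolation (Python's "inclusive" method) is an assumption; the methodology page is the authority on how the real numbers are computed.

```python
from dataclasses import dataclass
from statistics import median, quantiles


@dataclass(frozen=True)
class PairSummary:
    n: int
    minimum: float
    p10: float
    median: float
    p90: float
    maximum: float


def summarize(scores: list[float]) -> PairSummary:
    """Summarize one language pair's scores as a distribution, not a mean."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate percentiles")
    # quantiles(n=10) returns the nine cut points between deciles:
    # index 0 is the 10th percentile, index 8 is the 90th.
    deciles = quantiles(scores, n=10, method="inclusive")
    return PairSummary(
        n=len(scores),
        minimum=min(scores),
        p10=deciles[0],
        median=median(scores),
        p90=deciles[8],
        maximum=max(scores),
    )
```

Two pairs with the same median can have very different p10 values, and the p10 is the number that describes the meeting that goes badly.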

Practices the category hasn't adopted

Low-score pairs are not hidden. The public index is gated on ≥ 10 distinct IPs, ≥ 10 runs, median ≥ 60 — but anyone can deep-link to any pair directly and see the real numbers, including the pairs that are doing badly this month.
Known issues are documented. The chat-test harness was broken for a few weeks earlier in 2026; that period is suppressed from the index, and the suppression is noted in writing on the methodology page. History does not get silently rewritten.
"What we deliberately do NOT claim" is a full section on the methodology page. We say where the LLM judge itself is imperfect. We say what we do not measure (latency, cost, user satisfaction, ASR-side errors before translation even runs). We disclose that our own automated smoke tests are part of the traffic.
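
The index gate described in this list can be sketched as a single predicate. The thresholds are the ones stated above; the `PairStats` record and the `listed_in_index` function are hypothetical names, and this is an illustration rather than the production implementation.

```python
from dataclasses import dataclass

# Thresholds as stated for the public index.
MIN_DISTINCT_IPS = 10
MIN_RUNS = 10
MIN_MEDIAN = 60


@dataclass(frozen=True)
class PairStats:
    pair: str
    distinct_ips: int
    runs: int
    median: float


def listed_in_index(stats: PairStats) -> bool:
    """True if the pair appears in the public index.

    The gate only controls listing. A pair that fails it is still
    reachable by deep link, with its real numbers.
    """
    return (
        stats.distinct_ips >= MIN_DISTINCT_IPS
        and stats.runs >= MIN_RUNS
        and stats.median >= MIN_MEDIAN
    )
```

The design point is that the gate is applied to the index listing and not to the per-pair page, so a low score is never hidden from someone who looks for it.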

A filter for the next vendor evaluation

If you are evaluating any multilingual meeting platform — ours or another — the methodology is the page worth reading. The numbers themselves are the easy part.

A practical filter for any vendor in this category:

Ask for per-language-pair, per-month quality data on real traffic. Not a curated benchmark. Not an aggregate.
Ask what their judge is, what they explicitly do not measure, and what has changed in the last six months.
Ask what happens when a pair's score drops — do they tell anyone, or do they fix it silently?

If the vendor has all three answers in writing, evaluate them seriously. If they don't, you are buying marketing — not translation quality.

Try it yourself

/demo: runs the production translation pipeline on your audio, scores the result with the same judge that scores the public benchmark, and shows you the output.
/benchmark: every published language pair, every month, with the full distribution.
/benchmark/methodology: how the numbers are computed, what they include, and what they do not.

You will not need to take our word for any of it. That is the point.