Methodology

The InterMIND translation quality benchmark is computed automatically from real traffic on our public /demo endpoint. Nothing here is curated or cherry-picked.

How a single test works

A test run plays a reference audio file (voice test) or sends a reference text (chat test) from a source language through our live translation pipeline to a target language. The resulting translation is scored 0–100 by an LLM judge against a canonical reference translation.

LLM judge: google/gemini-2.5-flash primary, anthropic/claude-sonnet-4-20250514 fallback (via Vercel AI Gateway). The same prompt, rubric, and reference texts are used across all pairs.
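
For concreteness, here is a minimal sketch of the scoring call, assuming the Vercel AI SDK's generateText with gateway model-id strings. judgeScore, the prompt wording, and the score parsing are illustrative stand-ins, not our exact judge prompt or rubric:

    import { generateText } from "ai";

    // Primary judge first, fallback second (both routed via the AI Gateway).
    const JUDGE_MODELS = [
      "google/gemini-2.5-flash",
      "anthropic/claude-sonnet-4-20250514",
    ];

    async function judgeScore(candidate: string, reference: string): Promise<number> {
      for (const model of JUDGE_MODELS) {
        try {
          const { text } = await generateText({
            model,
            prompt:
              "Score this translation from 0 to 100 against the reference. " +
              `Reference: ${reference} Candidate: ${candidate} ` +
              "Reply with the number only.",
          });
          const score = Number.parseInt(text.trim(), 10);
          if (Number.isFinite(score)) return Math.min(100, Math.max(0, score));
        } catch {
          // Gateway or model error: fall through to the fallback model.
        }
      }
      throw new Error("all judge models failed");
    }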

How a month is aggregated

  1. Every completed demo run is recorded in demo_test_runs with per-target-language scores.
  2. For each (source language, target language, test type, month), we dedup by client IP: only the latest score from a given IP counts, so one enthusiastic tester cannot swing the median.
  3. From the deduplicated set we compute the median, min, max, p10, p90, average, and sample size (see the sketch after this list).
  4. We snapshot individual scores so that once a month closes, the numbers remain stable even after raw runs are deleted by TTL.
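
For concreteness, a minimal sketch of steps 2–3 in TypeScript. The row shape, field names, and the nearest-rank percentile are illustrative assumptions, not our exact production code:

    // Hypothetical shape of a demo_test_runs row for one target language.
    type DemoTestRun = {
      clientIp: string;
      score: number;      // 0–100 from the LLM judge
      completedAt: Date;
    };

    function aggregateMonth(runs: DemoTestRun[]) {
      // Step 2: keep only the latest run per client IP.
      const latestByIp = new Map<string, DemoTestRun>();
      for (const run of runs) {
        const prev = latestByIp.get(run.clientIp);
        if (!prev || run.completedAt > prev.completedAt) {
          latestByIp.set(run.clientIp, run);
        }
      }

      // Step 3: order statistics over the deduplicated scores.
      const scores = [...latestByIp.values()]
        .map((r) => r.score)
        .sort((a, b) => a - b);
      const n = scores.length;
      if (n === 0) return null;
      // Nearest-rank percentile; an interpolating variant would also work.
      const pct = (p: number) => scores[Math.min(n - 1, Math.floor((p / 100) * n))];
      return {
        sampleSize: n,
        min: scores[0],
        max: scores[n - 1],
        median: pct(50),
        p10: pct(10),
        p90: pct(90),
        average: scores.reduce((sum, s) => sum + s, 0) / n,
      };
    }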

When a pair appears in the public index

A (pair × test type × month) row is eligible for the public index and sitemap when all of the following hold (a sketch of the check follows the list):

  • At least 10 distinct client IPs contributed runs that month;
  • At least 10 total runs after dedup;
  • Median score is at least 60.
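
The same check as a sketch; field names are hypothetical, the thresholds are the ones listed above:

    type MonthlyAggregate = {
      distinctIps: number;   // distinct client IPs that month
      dedupedRuns: number;   // runs remaining after IP dedup
      median: number;        // median score, 0–100
    };

    // A row enters the public index and sitemap only if all three hold.
    function isPubliclyIndexed(agg: MonthlyAggregate): boolean {
      return agg.distinctIps >= 10 && agg.dedupedRuns >= 10 && agg.median >= 60;
    }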

Pairs that fall short remain accessible by direct link — the thresholds gate only what search engines see. We do not hide low scores from anyone who asks for them directly.

What we deliberately do NOT claim

  • A single score is not trustworthy. A well-behaved pair can swing 30 points between runs because ASR and LLM judgment are both noisy. That is why every page shows a distribution, not a single number.
  • The LLM judge itself is imperfect. We may switch or dual-judge in the future; historical rows will carry the judge identifier when that happens.
  • The benchmark does not measure latency, cost, availability, or user satisfaction. Those live elsewhere.
  • Traffic from our own automated smoke tests is included. Its contribution is constant from month to month, and we disclose it here explicitly.

Honest calls

When quality drops for a specific pair in a given month, the drop stays visible. When a bug is fixed and the next month jumps, the improvement is visible on the same page. We never rewrite historical aggregates. Admins can only hide individual months from the public index; they cannot alter numbers that were already published. The sketch below illustrates that separation.
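
In schema terms (names hypothetical):

    // Hypothetical snapshot row: the aggregates are written once when the
    // month closes; the only admin-mutable field is the visibility flag.
    type MonthlySnapshot = {
      readonly month: string;     // e.g. "2026-04"
      readonly median: number;
      readonly p10: number;
      readonly p90: number;
      hiddenFromIndex: boolean;   // admins can flip this, nothing else
    };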

Known issues affecting historical data

  • Chat test harness, before 2026-04-23: the automated chat-test pipeline had per-language failures. Chat aggregates from earlier months may show lower scores than the actual translation quality of those periods. Affected months are kept in the database but suppressed from the public index; the trend chart will show a step change at the fix.

Questions

Disagree with something here? Open an issue or write to us. We will update this page and note the change.