Static benchmark reports for comparing model translation quality across locales.
The GitHub Pages site serves the latest report from:
docs/index.html
It visualizes the reasoning-tier benchmark across:
- GPT-5.5 effort tiers
- Claude Sonnet 4.6 effort tiers
- Claude Opus 4.7 effort tiers
- Grok 4.3 effort tiers

When reading the report:
- Higher aggregate, binary, and YiSi scores are better.
- Lower token and latency values are better for cost and speed.
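
The score directions above can be sketched as a small lookup that picks the best tier per metric. This is only an illustration: the metric keys (`aggregate`, `binary`, `yisi`, `tokens`, `latency_ms`), the `best_by_metric` helper, the tier names, and all numbers are hypothetical placeholders, not fields or results from the actual report.

```python
# Direction of each metric. Key names are assumptions for illustration;
# the report's real field names may differ.
HIGHER_IS_BETTER = {
    "aggregate": True,
    "binary": True,
    "yisi": True,
    "tokens": False,      # fewer tokens means lower cost
    "latency_ms": False,  # lower latency means faster responses
}


def best_by_metric(rows, metric):
    """Return the row with the best value for `metric`, honoring its direction."""
    if metric not in HIGHER_IS_BETTER:
        raise KeyError(f"unknown metric: {metric}")
    pick = max if HIGHER_IS_BETTER[metric] else min
    return pick(rows, key=lambda row: row[metric])


# Placeholder rows, NOT real benchmark results.
rows = [
    {"tier": "tier_a", "aggregate": 0.80, "yisi": 0.70, "tokens": 1200, "latency_ms": 900},
    {"tier": "tier_b", "aggregate": 0.85, "yisi": 0.72, "tokens": 3000, "latency_ms": 2500},
]

print(best_by_metric(rows, "aggregate")["tier"])
print(best_by_metric(rows, "tokens")["tier"])
```

With these placeholder rows, the quality metrics favor one tier while the cost metrics favor the other, which is the trade-off the report is meant to surface.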