March 27, 2026

Which translation engine is best for your language pair?

In 2023, when this article was first written, the answer to this question was relatively simple: use DeepL for European languages, Google for Asian and broad coverage, Yandex for Russian, Microsoft for Portuguese. Those recommendations reflected the state of neural machine translation at the time.

The field has moved substantially since then. Intento's State of Translation Automation 2025 (the most rigorous independent evaluation of MT engines and LLMs, covering 46 systems across 11 language pairs) found that LLMs now represent 89% of top performers across language pairs, up from around 55% the year before. The traditional NMT engines that dominated 2023 recommendations are being outperformed on most pairs. The gap between human translation quality and the best automated solutions has "virtually disappeared" for most high-resource language pairs.

This guide updates the language pair recommendations with current benchmark evidence, explains what drives performance differences across language groups, and addresses the structural problem that benchmark data reveals: no single engine consistently leads every language pair — and for professional use, that variability has a practical solution.

In this article

  1. What the 2025 benchmark data actually shows
  2. Best engine per language pair: updated recommendations
  3. Why no single engine wins every pair
  4. How to stop guessing which engine is right
  5. FAQs

What the 2025 benchmark data actually shows

Intento's 2025 evaluation (46 MT engines and LLMs, 11 language pairs, human LQA scoring) produced a set of findings that should change how anyone thinks about engine selection:

LLMs dominate across pairs. GPT-4.1 and Gemini 2.5 Pro consistently appear at the top across most language pairs evaluated. Among single-agent solutions, GPT-4.1 leads with 7 "best" performances across 11 language pairs, followed by Gemini 2.5 Pro (5) and Lara (4). Claude Opus 4 and Claude Sonnet 3.7 also appear among the top 14 solutions showing the best results. Source: Intento State of Translation Automation 2025.

DeepL next-gen leads among real-time/NMT solutions. DeepL's newer LLM-based engine (DeepL next-gen, distinct from the legacy DeepL NMT) shows the best results among dedicated real-time translation solutions, particularly for Dutch and Spanish. This is a different product from classic DeepL, and the performance gap between them matters for recommendations.

Asian languages remain the hardest tier. Japanese, Korean, and Chinese show significantly wider quality spreads across engines than European languages. More solutions perform poorly on these pairs, and the difference between the best and worst performers is larger. Choosing the right engine matters more here than it does for English–Spanish or English–French.

Arabic shows high variation. Arabic has a larger variation in quality than most other languages, largely due to syntactic complexity. Most solutions show only minor mistranslation issues according to human LQA reviewers, but the engines that struggle do so significantly.

Ukrainian is the hardest European pair. Ukrainian shows the widest quality spread of any language in the evaluation. Traditional NMT engines fail to approach human quality; GPT-4.1 and multi-agent solutions come closest.

The multi-agent solution achieves top-tier results. Intento's multi-agent solution (which resembles a consensus-based architecture) achieves near-human performance, particularly in Brazilian Portuguese, French, German, and Japanese. Multi-agent and ensemble approaches consistently match or exceed single-solution performance.

Customization changes the picture. For teams that can apply glossaries, tone of voice, and tag handling requirements, customized solutions consistently outperform stock models — sometimes dramatically. For standard, uncustomized use, the rankings above apply. Source: Intento State of Translation Automation 2025.

Best engine per language pair: updated recommendations

All recommendations below reflect Intento's 2025 human LQA evaluation data combined with MachineTranslation.com's internal benchmarks. Where recommendations differ from the 2023 version of this article, the reason is stated.

English to French

Current best performers: GPT-4.1, Gemini 2.5 Pro, Claude (Opus 4 / Sonnet 3.7), DeepL next-gen

French is a high-resource language where most leading engines perform well and quality variation between top solutions is relatively low. For general business and professional content, the LLM tier (GPT-4.1, Gemini 2.5 Pro, Claude) now consistently outperforms legacy NMT engines. For real-time or API-based workflows where latency matters, DeepL next-gen is the strongest dedicated translation-optimized option.

What changed since 2023: ModernMT was the 2023 recommendation, reflecting that era's NMT dominance. Intento's 2025 human LQA results place it in the mid-tier; the LLM generation has moved past customized NMT for French.

See MachineTranslation.com's English to French translation page for live comparison across models.

English to German

Current best performers: GPT-4.1, Gemini 2.5 Pro, Claude, DeepL next-gen

German sits in the high-resource European tier alongside French. LLMs lead on most text types. DeepL's legacy NMT remains strong for German and is a reasonable choice for teams already integrated with it, but DeepL next-gen (their LLM) outperforms it. For complex technical documentation or legal content in German, MachineTranslation.com's internal data shows consensus approaches maintaining 93% accuracy versus 84–87% for top single LLMs alone, due to the elimination of formatting errors and terminology drift. Source: Tomedes and Lokalise AI Translation Quality Research 2025 / MachineTranslation.com internal benchmarks.

What changed since 2023: DeepL legacy NMT was the 2023 recommendation. Now the recommendation is DeepL next-gen or LLMs outright, depending on workflow.

See MachineTranslation.com's English to German translation page.

English to Spanish

Current best performers: GPT-4.1, TREBE (Iberian specialist), Gemini 2.5 Pro, DeepL next-gen

Spanish (Latin American) is one of the best-supported language pairs across all engines. For Iberian Spanish specifically, TREBE (a specialist model evaluated by Intento) shows top performance. For general Spanish across Latin American variants, GPT-4.1 and Gemini 2.5 Pro lead. DeepL next-gen remains competitive.

Single LLMs plateau at roughly 84–87% accuracy for Spanish due to formatting errors and terminology drift on complex content. MachineTranslation.com's internal benchmarks show consensus across 22 models maintains 93–95% accuracy for Spanish. Source: Tomedes and Lokalise AI Translation Quality Research 2025 / MachineTranslation.com internal benchmarks.

What changed since 2023: DeepL legacy NMT was the recommendation. LLMs now lead for most Spanish use cases.

See MachineTranslation.com's English to Spanish translation page.

English to Italian

Current best performers: GPT-4.1, Gemini 2.5 Pro, Lara (Translated), DeepL next-gen

Italian is a well-supported high-resource European language. Lara by Translated (an LLM fine-tuned specifically for translation) performs among the best for Italian and Ukrainian per Intento 2025. The Google Translation LLM also shows strong results for Italian. For Italian content specifically, fine-tuned translation LLMs compete closely with general-purpose frontier models.

See MachineTranslation.com's English to Italian translation page.

English to Portuguese (Brazilian)

Current best performers: Multi-agent solution, GPT-4.1, Gemini 2.5 Pro, DeepL next-gen

Brazilian Portuguese is one of the pairs where Intento's multi-agent solution (ensemble architecture) shines particularly clearly, achieving top-tier results. Among single-engine solutions, GPT-4.1 and Gemini 2.5 Pro lead.

What changed since 2023: Microsoft was the 2023 recommendation. Current benchmark data does not place Microsoft at the top for Brazilian Portuguese. LLMs and multi-agent approaches now lead.

See MachineTranslation.com's English to Portuguese translation page.

English to Polish

Current best performers: GPT-4.1, Gemini 2.5 Pro, Claude

Polish is one of the clearest examples of the LLM advantage over traditional NMT. Polish morphological complexity (a large inflectional paradigm, complex case system) causes traditional NMT engines to produce errors that LLMs handle better. MachineTranslation.com's internal data shows single LLMs falling to around 76% accuracy for Polish, while consensus across 22 models boosts this to 88%. Source: Tomedes and Lokalise AI Translation Quality Research 2025 / MachineTranslation.com internal benchmarks.

What changed since 2023: No strong 2023 recommendation existed for Polish. The LLM shift makes this pair particularly important to update.

See MachineTranslation.com's English to Polish translation page.

English to Japanese

Current best performers: GPT-4.1, Gemini 2.5 Pro, Claude — with significant variance across engines

Japanese is in the harder tier. Asian languages display much wider spreads in both quality and score deviation than European pairs — there is clearer separation between good and poor performers, and more solutions score poorly. The difference between using a top LLM versus a mid-tier NMT engine for Japanese is more consequential than it would be for English to French. GPT-4.1 and Gemini 2.5 Pro consistently appear in the "best" category for Japanese per Intento 2025. DeepSeek-V3 also appears as a top performer across multiple pairs including Japanese.

For Japanese professional content (legal, technical, or formal business), the recommendation is to use a top LLM and cross-check rather than accepting a single-engine output at face value.

English to Chinese (Simplified)

Current best performers: GPT-4.1, Gemini 2.5 Pro, HiThink RoyalFlush (en-zh specialist), DeepSeek-V3

Chinese is high-volume but technically demanding. HiThink RoyalFlush (a model specializing in English–Chinese translation) performs in the "best" category per Intento 2025 for this specific pair. DeepSeek-V3 also consistently appears as a top performer. For general-purpose Chinese translation, GPT-4.1 leads among general-purpose models.

What changed since 2023: Google was the 2023 recommendation. LLMs and specialist models now lead for Chinese.

English to Arabic

Current best performers: Tarjama (Arabic specialist), GPT-4.1, Lara

Arabic shows more quality variation across solutions than most other languages, due to syntactic complexity in full-text translation. Tarjama, a specialist Arabic translation model, performs in the top category per Intento 2025 for this pair. Among the better solutions, human reviewers score Arabic outputs favorably, noting only minor mistranslation and terminology issues. The engines that struggle with Arabic do so significantly; the gap between top and bottom performers is wider than for European pairs.

English to Russian

Current best performers: GPT-4.1, Gemini 2.5 Pro

Data sovereignty note: Yandex (the 2023 recommendation) presents data sovereignty concerns that make it inappropriate for most professional and enterprise use cases. Russian-language content processed through Yandex is subject to Russian jurisdiction. For organizations translating sensitive business, legal, or client content, Yandex should not be the recommended engine regardless of translation quality. GPT-4.1 and Gemini 2.5 Pro provide strong Russian translation quality without these concerns.

English to Ukrainian

Current best performers: GPT-4.1, Multi-agent solution — with the widest quality gap of any European language

Ukrainian shows the widest quality spread across all solutions evaluated by Intento 2025. Traditional NMT engines largely fail to achieve near-human quality for Ukrainian. Only GPT-4.1 and multi-agent approaches come close to the quality frontier. This makes Ukrainian one of the language pairs where engine choice matters most, and where relying on a single mid-tier engine carries the highest quality risk.

Why no single engine wins every pair

The data above shows a consistent pattern: GPT-4.1 and Gemini 2.5 Pro lead across most pairs, but specific pairs have specialist models that outperform general-purpose LLMs — HiThink RoyalFlush for English–Chinese, TREBE for Iberian Spanish, Tarjama for Arabic, Lara for Italian. For teams translating across multiple language pairs, this creates a practical decision problem: the optimal engine for your English–French content is not the optimal engine for your English–Japanese content.
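To make the decision problem concrete, here is a minimal Python sketch of the per-pair lookup a team would otherwise have to maintain by hand. The table contents come from the specialist models named above; the function name, the language codes (including `es-ES` for Iberian Spanish), and the engine labels are illustrative assumptions, not real API identifiers.

```python
# Hypothetical routing table built from the per-pair notes in this article.
# Engine labels are plain strings for illustration, not real API identifiers.
DEFAULT_ENGINES = ["GPT-4.1", "Gemini 2.5 Pro"]

# Pairs where the article names a specialist model that leads the pair.
SPECIALISTS = {
    ("en", "zh"): "HiThink RoyalFlush",
    ("en", "es-ES"): "TREBE",
    ("en", "ar"): "Tarjama",
    ("en", "it"): "Lara",
}


def candidate_engines(source: str, target: str) -> list[str]:
    """Return engines to try for a pair, specialist first when one is listed."""
    specialist = SPECIALISTS.get((source, target))
    if specialist is None:
        return list(DEFAULT_ENGINES)
    return [specialist, *DEFAULT_ENGINES]


print(candidate_engines("en", "zh"))
print(candidate_engines("en", "fr"))
```

Every new benchmark release means revisiting this table by hand, which is the maintenance burden described above.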

This is precisely the problem MachineTranslation.com's SMART system was designed to resolve. Rather than requiring users to select the right engine for each language pair, SMART runs 22 models simultaneously (including GPT-4.1, Gemini, Claude, DeepL, Google, DeepSeek-V3, and 16 others) and selects the output the majority agrees on. The consensus mechanism functions as an automatic quality filter: errors specific to any single model get outvoted before they reach the output.
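The idea of outvoting single-model errors can be sketched in a few lines of Python. This is a simplified illustration, not SMART's actual implementation: it assumes agreement means an exact match after whitespace and case normalization, whereas a production system would likely need a softer similarity measure. The function names and model labels are hypothetical.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivially different outputs match."""
    return " ".join(text.split()).casefold()


def consensus(outputs: dict[str, str]) -> tuple[str, float]:
    """Pick the most common output and the share of models that produced it."""
    if not outputs:
        raise ValueError("need at least one model output")
    counts = Counter(normalize(t) for t in outputs.values())
    winner_key, votes = counts.most_common(1)[0]
    # Return the first original string whose normalized form won the vote.
    winner = next(t for t in outputs.values() if normalize(t) == winner_key)
    return winner, votes / len(outputs)


# Hypothetical outputs from three models for the same source sentence.
outputs = {
    "model_a": "Bonjour le monde",
    "model_b": "bonjour  le monde",
    "model_c": "Salut le monde",
}
text, share = consensus(outputs)
print(text, round(share, 2))
```

Here two of the three outputs agree after normalization, so the outlier is discarded and the agreement share is two thirds. A share like this is one simple way an agreement signal could be surfaced alongside the chosen translation.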

In MachineTranslation.com's internal benchmarks, individual top-tier models score 93–94 out of 100 on translation quality. SMART's consensus output reaches 98.5. The practical implication: instead of tracking which engine is best for each of your language pairs and content types, you get the cross-model agreed output in one step — along with a Translation Quality Score showing how strongly the models agreed. Source: MachineTranslation.com internal benchmarks and WMT24 General Machine Translation Findings.

Users who switched to SMART from single-engine use spent 27% less time verifying and correcting outputs. Source: MachineTranslation.com internal data.

How to stop guessing which engine is right

For most professional translation workflows, the practical answer is not to pick one engine and commit — it is to run multiple engines and take the consensus. MachineTranslation.com's free plan gives you SMART across all 22 models for every translation, with no sign-up required. 

For high-stakes content (contracts, regulatory filings, clinical documentation), the two-layer workflow is SMART consensus first, then Human Verification from a certified professional reviewer in-platform, with a 100% accuracy guarantee. No external agency required.

Start free at MachineTranslation.com: translate with 22 AI models and see which output they agree on.

FAQs

1. Which translation engine is best overall?

No single engine leads across all language pairs. Per Intento's 2025 evaluation of 46 engines across 11 language pairs, GPT-4.1 leads among single-agent solutions with top performance on 7 of 11 pairs, followed by Gemini 2.5 Pro. However, specialist models outperform general LLMs on specific pairs: HiThink RoyalFlush for English–Chinese, TREBE for Iberian Spanish, Tarjama for Arabic. For multi-pair workflows, a consensus approach across 22 models eliminates the per-pair selection problem.

2. Is DeepL still the best for European languages?

DeepL's legacy NMT remains competitive for European pairs, but DeepL next-gen (their newer LLM-based engine) outperforms it. For most European language pairs, GPT-4.1 and Gemini 2.5 Pro now lead per Intento's 2025 human LQA evaluation. DeepL next-gen leads among dedicated real-time translation solutions for Dutch and Spanish specifically.

3. Which engine is best for Japanese translation?

Japanese is among the harder language pairs, with wider quality variation between engines than European languages. GPT-4.1 and Gemini 2.5 Pro consistently appear in the top tier for Japanese per Intento 2025. Engine choice matters more here: the difference between a top LLM and a mid-tier engine is more consequential for Japanese than for a high-resource European pair.

4. Is Google Translate still good for Chinese translation?

Google NMT handles Chinese adequately for general use, but specialist models and frontier LLMs now lead for English–Chinese. HiThink RoyalFlush (a specialist en-zh model) and DeepSeek-V3 appear in the top category per Intento 2025. GPT-4.1 leads among general-purpose LLMs for Chinese.

5. Should I still use Yandex for Russian translation?

Not recommended for professional use. Yandex presents data sovereignty concerns (content processed through Yandex is subject to Russian jurisdiction) that make it inappropriate for sensitive business, legal, or client content. GPT-4.1 and Gemini 2.5 Pro provide strong Russian translation quality without these concerns.

6. Why does engine performance vary so much by language pair?

Language pair performance is driven by training data depth and architectural fit. High-resource languages like English–Spanish and English–French have deep training data across all major engines, so quality variation between top tools is relatively small. Lower-resource languages (Ukrainian), morphologically complex languages (Polish, Arabic), and languages with structurally different writing systems (Japanese, Chinese) show wider quality spreads. The best engine for a high-resource European pair may be a poor choice for an Asian or Slavic pair.

7. What is the most reliable way to translate across multiple language pairs without tracking which engine leads each one?

Run a consensus system. MachineTranslation.com's SMART compares 22 models (including the top performers named in each language pair section above) and selects the output the majority agrees on. Instead of selecting per-pair engines, you get the cross-validated output in one step. Individual top models score 93–94/100; SMART consensus reaches 98.5.
