27/04/2023

Evaluating Machine Translation for Public Health Information: Insights from the COVID-19 MT Evaluator

Exposing Shortfalls

The outbreak of the COVID-19 pandemic in early 2020 exposed the critical need for accurate and timely public health information, particularly for non-English speaking populations. As the pandemic spread globally, machine translation became increasingly vital for disseminating health information across languages. However, while machine translation offers a promising solution, its limitations in conveying complex medical concepts and linguistic nuance cannot be ignored.

In this article, we examine the effectiveness of machine translation for public health information during COVID-19, exploring the challenges and opportunities of using this technology to promote health equity and improve health communication. We introduce the COVID-19 MT Evaluator, a novel tool designed to assess the quality of machine translation for health information. By offering insights from a real-life case study and expert analysis, we aim to shed light on the potential of machine translation for promoting health equity and reducing health disparities during the ongoing COVID-19 crisis.

The Case for Human Evaluation

As machine translation becomes an increasingly common tool for disseminating health information during the COVID-19 pandemic, it is essential to recognize its limitations and the need for human evaluation. One of the major concerns with machine translation is the potential for errors that can significantly impact the accuracy and quality of the translated information.

For instance, a study published in the Journal of Medical Internet Research highlighted the significant errors in machine-translated COVID-19 information on government websites in various countries, including Japan, South Korea, and Spain. The study found that machine translation was unable to accurately convey complex medical concepts and technical terms, leading to potentially harmful misunderstandings among non-English speakers.

In another example, a report by the World Health Organization (WHO) found that machine translation of COVID-19 health information could result in confusion and mistrust, particularly for communities that are already marginalized or vulnerable. The report emphasized the importance of human evaluation in ensuring the accuracy and cultural appropriateness of translated information.

Machine translation post-editing (MTPE) involves a human editor reviewing and correcting machine-translated content to improve accuracy, fluency, and style. It offers several advantages over relying solely on machine translation, including the ability to understand the cultural context and nuances of language, identify errors and inconsistencies, and ensure that the translated information is relevant and accessible to diverse communities.

Unlike machine translation alone, MTPE allows for a more nuanced understanding of language and cultural context. Machine translation often misses subtle nuances and cultural references that can dramatically change the meaning of a translated text. When a human editor reviews and edits the machine-translated content, these nuances and cultural references are far more likely to be understood and translated accurately.

In light of these challenges and concerns, it is crucial to prioritize MTPE and invest in the development of tools and resources that can support the creation of accurate and culturally appropriate health information for all populations.

COVID-19 MT Evaluator: An Evaluation Tool for Machine Translation

Sai Cheong Siu of the Hang Seng University of Hong Kong wrote the paper “COVID-19 MT Evaluator: A Platform for the Evaluation of Machine Translation of Public Health Information Related to COVID-19,” which discusses the need for machine translation in public health during the COVID-19 pandemic in 2020, when the accurate and timely dissemination of information in multiple languages was essential. The paper emphasizes the limitations of MT and the importance of evaluating its performance to ensure that the translated information is of high quality and accessible to all.

Siu describes the COVID-19 MT Evaluator, a platform developed at the Hang Seng University of Hong Kong that evaluates the quality of MT output for COVID-19-related content, especially public health information. It focuses on evaluating the fluency and adequacy of the translation and aims to provide feedback to developers, researchers, and policymakers to improve the quality of MT output.

The COVID-19 MT Evaluator is essentially an automatic machine translation evaluation system with three main features. Firstly, it has built-in assessment tools tailored to specialized translation related to COVID-19, such as bilingual test datasets and a terminology module for public health terms. This sets it apart from general-purpose MT evaluation tools, which report translation scores without being tailored to specialized documents.

Next, the system integrates a range of MT evaluation metrics, including string comparisons and computation of sentence embeddings using pre-trained language models.

Finally, the platform provides interactive visualization tools to help users analyze translation scores, compare results with reference translations, and identify issues for deeper evaluation. The COVID-19 MT Evaluator has the potential to greatly improve the quality of translated health information for non-English speaking populations during the COVID-19 pandemic and beyond.

Evaluation Metrics of the MT Evaluation Tool

The system has two types of evaluation metrics. One type relies on string comparisons and the other is based on sentence embeddings. The former compares the machine-generated translation with a reference translation by counting the word or character sequences (𝑛-grams) that the two texts share. The latter computes embeddings of the MT output and the reference translation using a pre-trained language model and measures their similarity. For a more in-depth treatment, see our previous blog post, “Machine Translation Evaluation: The Ultimate Guide.”

The evaluator uses five metrics: BLEU, CHRF, BERTScore, BLEURT, and COMET. BLEU and CHRF are string-based, while the other three are embedding-based.

Let’s quickly run through what each of these metrics is:

  1. BLEU

The Bilingual Evaluation Understudy (BLEU) metric is one of the most widely used evaluation metrics in the machine translation research community. It is a precision-based metric that measures the degree of similarity between a machine translation output and one or more reference translations. BLEU calculates the weighted geometric mean of modified 𝑛-gram precision scores, where the count of each 𝑛-gram in the output is clipped to the maximum number of times it appears in a reference translation. The final score is also multiplied by a brevity penalty, so that very short translations are not overrated. BLEU scores range from 0 to 1, with higher scores indicating better translation quality. While BLEU has been criticized for being too simplistic and not always correlating well with human judgments, it remains a popular benchmark for machine translation systems.
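
As a rough illustration of the calculation described above, the following Python sketch computes an unsmoothed, single-reference, sentence-level BLEU on whitespace tokens. The function names and example sentences are hypothetical. Production tools such as SacreBLEU add standardized tokenization, smoothing, and corpus-level aggregation, so their scores will differ from this simplified version.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU on whitespace tokens, in [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precision_sum += math.log(clipped / total) / max_n  # uniform weights
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision_sum)


# Hypothetical example: every n-gram of the candidate appears in the
# reference, so only the brevity penalty lowers the score.
score = sentence_bleu("wash your hands often with soap",
                      "wash your hands often with soap and water")
```

In this example all four 𝑛-gram precisions equal 1, so the score is determined entirely by the brevity penalty for a 6-token output against an 8-token reference.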

  2. CHRF

The Character n-gram F-score (CHRF) compares the character-level n-gram overlap between the machine translation output and the reference translation. Unlike BLEU, which is precision-based, CHRF combines precision and recall into an F-score, and because it works on characters rather than words it is less sensitive to tokenization and to small differences in word forms. CHRF scores also range from 0 to 1, with higher scores indicating better translation quality. However, CHRF can be affected by the choice of n-gram size and by the quality of the reference translations.
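
The idea can be sketched in a few lines of Python. This is a simplified version, not the SacreBLEU implementation: it strips whitespace, averages character n-gram precision and recall over orders 1 to 6, and combines them into an F-score that weights recall more heavily. The function names, the maximum order of 6, and the weight `beta=2.0` are assumptions chosen for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(candidate, reference, max_n=6, beta=2.0):
    """Simplified chrF in [0, 1]: F-beta over averaged n-gram precision/recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
        if not cand or not ref:
            continue  # one of the texts is too short for this n-gram order
        overlap = sum((cand & ref).values())  # multiset intersection
        precisions.append(overlap / sum(cand.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because the comparison is made on characters, an output such as "vaccines" still scores highly against the reference "vaccine", whereas a word-level exact match would count it as a miss.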

  3. BERTScore

BERTScore is an embedding-based metric that measures the degree of semantic similarity between the machine translation output and the reference translation. It is based on the BERT language model, a pre-trained model that generates contextual embeddings for the tokens of a sentence. BERTScore matches each token in the machine translation output with the most similar token in the reference translation (and vice versa) by the cosine similarity of their embeddings, and then aggregates the similarities into precision, recall, and F1 scores. BERTScore has been shown to correlate well with human judgments of translation quality, especially for longer sentences and more complex language.
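
The matching step can be illustrated with a small Python sketch. It does not call BERT: the three-dimensional vectors below are hand-made, hypothetical stand-ins for contextual embeddings, and the function names are assumptions. It also leaves out refinements such as token weighting that a full implementation may apply.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_f1(cand_vecs, ref_vecs):
    """BERTScore-style precision, recall, and F1 from token embeddings."""
    # Precision: match each candidate token to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs)
                    for c in cand_vecs) / len(cand_vecs)
    # Recall: match each reference token to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs)
                 for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy "embeddings"; a real system would obtain contextual
# vectors from a pre-trained model such as BERT.
TOY_EMBEDDINGS = {
    "wear": [1.0, 0.0, 0.0],
    "put_on": [0.9, 0.1, 0.0],
    "a": [0.0, 1.0, 0.0],
    "mask": [0.0, 0.0, 1.0],
    "face_covering": [0.0, 0.1, 0.9],
}

candidate = [TOY_EMBEDDINGS[t] for t in ["put_on", "a", "face_covering"]]
reference = [TOY_EMBEDDINGS[t] for t in ["wear", "a", "mask"]]
precision, recall, f1 = greedy_match_f1(candidate, reference)
```

Although the candidate and reference share only one surface token, the paraphrased tokens have nearby vectors, so the score stays high. This is the behavior that string-based metrics cannot reproduce.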


  4. BLEURT

BLEURT is a regression-based metric that fine-tunes a pre-trained multilingual BERT model on a training set of reference translations, machine translation outputs, and human scores. The fine-tuned model can then be used to predict the quality score of new machine translation outputs, based on their embeddings and the embeddings of the corresponding reference translations. BLEURT can handle multiple reference translations and can also be trained on specific domains or language pairs. BLEURT has been shown to achieve state-of-the-art performance on several machine translation benchmarks.

  5. COMET

The Crosslingual Optimized Metric for Evaluation of Translation (COMET) is a recent metric built on a pre-trained multilingual masked language model, XLM-RoBERTa, which generates sentence embeddings for the source text, the machine translation output, and the reference translation. COMET then uses a regression model, trained on human judgments, to predict the quality score of the machine translation output from features derived from these embeddings. COMET has been shown to outperform other metrics on several machine translation benchmarks and can also be fine-tuned for specific domains or language pairs.

In the evaluator, SacreBLEU is used to compute BLEU and CHRF scores, while BERTScore uses the BERT model to generate contextual embeddings. BLEURT is a regression model, and COMET uses a pre-trained multilingual masked language model.

Overall, the COVID-19 MT Evaluator is a system that evaluates the quality of machine translation (MT) for public health information related to COVID-19. It features multiple evaluation metrics, including deep learning-driven ones, as well as domain-specific tools. Additionally, it provides visualization tools for the detection and analysis of translation issues.

The COVID-19 MT Evaluator is a useful tool for medical translators, healthcare professionals, researchers, and developers who need a quick assessment of MT performance. Future research can be done to evaluate the quality of common MT engines available on the market and compare their translation results.

To improve the evaluation platform, the support for other languages can be enhanced, and more evaluation metrics and analysis tools can be added. With the COVID-19 pandemic's long-lasting and irreversible impact, the need for automatic translation of public health information is expected to remain high. The COVID-19 MT Evaluator provides a starting point for streamlining the evaluation of MT of COVID-19 information and other medical publications.

Expert Analysis on Machine Translation for Public Health Information

Tom Kocmi, co-author of “To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation” and Machine Translation Researcher at Microsoft, sheds light on MT evaluation metrics. He says, “The evolution of machine translation has led to the development of diverse evaluation metrics that can be categorized into two primary types, string-based and pretrained models. The first type encompasses string-based metrics, which analyze usually n-grams against human reference translations (e.g., BLEU and ChrF). For years, the machine translation community relied heavily on BLEU, which resulted in discarding promising ideas due to a lack of improvement according to BLEU.

Recently, there has been a surge in a second type of pretrained automatic metrics built on pretrained language models and often fine-tuned with human judgments, such as COMET and BLEURT. Although these metrics have proven to be very effective, they can also be perceived as black boxes. Therefore, I would recommend to use ChrF alongside a pretrained metric to detect significant discrepancies. 

Additionally, pretrained metrics opened a world of possibilities for non-reference based evaluation (quality estimation), exemplified by COMET-QE, which eliminates the need for obtaining expensive human reference. This enables evaluation on monolingual data and facilitating the exploration of various domains and genres. Pretrained metrics are becoming the default choice for machine translation evaluation.

Lastly, in recent months, there is an emergence of a third type of metric based on prompting large language models (LLMs), our work GEMBA sets a new state of the art in pairwise system level evaluation. However, I would advise caution utilizing such metrics in production or in research papers until further research has been conducted on their strengths and weaknesses.”

Ricardo Rei of Unbabel provides some interesting insights into the continuous developments of machine translation and how evaluation metrics adapt to keep up with the changes in the technology. He says, “As machine translation technology continues to evolve, it is essential that evaluation metrics keep pace with these changes. With the advent of large language models, we are currently in a breakthrough phase that will shape the future of research in this field, similar to what occurred in 2017 with the release of BERT. It is becoming increasingly important to have high-quality metrics to measure MT performance as the quality of MT improves. Fortunately, MT evaluation research is thriving, with many research groups dedicated to advancing the field and improving the quality of MT metrics. This is reflected in the 2022 Metrics shared task where the best metric for en-de and en-ru was MetricX XXL, a very large internal metric from Google. In addition, Microsoft Research recently published a paper showing that ChatGPT achieves state-of-the-art accuracy in system-level accuracy, which involves determining which of two MT systems is better. Our team has also been experimenting with larger encoders for COMET, and we are seeing continued improvement in results.”
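
Both experts refer to system-level pairwise evaluation, that is, deciding which of two MT systems is better. One simple way to score a metric on this task is to count how often it orders a pair of systems the same way human judgments do. The Python sketch below illustrates that idea. The function name, the handling of ties, and all scores are hypothetical and are not taken from any published result.

```python
from itertools import combinations


def pairwise_accuracy(metric_scores, human_scores):
    """Share of system pairs that a metric ranks in the same order as humans."""
    agree = total = 0
    for a, b in combinations(sorted(metric_scores), 2):
        metric_delta = metric_scores[a] - metric_scores[b]
        human_delta = human_scores[a] - human_scores[b]
        if human_delta == 0:
            continue  # skip pairs humans scored as tied (a simplifying choice)
        total += 1
        # Agreement means both differences have the same sign.
        if metric_delta * human_delta > 0:
            agree += 1
    return agree / total if total else 0.0


# Hypothetical system-level scores for three MT systems (not real results).
human = {"sys_a": 0.80, "sys_b": 0.70, "sys_c": 0.60}
metric_x = {"sys_a": 0.55, "sys_b": 0.50, "sys_c": 0.45}  # same order as humans
metric_y = {"sys_a": 0.30, "sys_b": 0.35, "sys_c": 0.20}  # swaps sys_a and sys_b

accuracy_x = pairwise_accuracy(metric_x, human)
accuracy_y = pairwise_accuracy(metric_y, human)
```

Here `metric_x` agrees with the human ranking on all three pairs, while `metric_y` gets two of the three pairs right because it prefers `sys_b` over `sys_a`.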


As we conclude our journey through the world of machine translation and its potential impact on public health, it is very clear that we have only scratched the surface of what this technology can offer. The COVID-19 MT Evaluator has opened a world of possibilities for healthcare professionals, medical translators, and researchers to assess the quality of machine-translated public health information.

But the potential benefits go far beyond evaluation. Machine translation can bridge the communication gap between patients and healthcare providers, especially for underserved communities with limited access to language services. With accurate machine translation, non-English speakers can receive the same quality of healthcare as their English-speaking counterparts, leading to improved health outcomes and greater health equity.

Moreover, machine translation can aid in the dissemination of critical public health information during times of crisis, such as the ongoing COVID-19 pandemic. Accurate and timely information can mean the difference between life and death, and machine translation can help deliver that information to those who need it most, regardless of language barriers.

As we look to the future, the potential for machine translation to promote health equity and improve health communication is tremendous. However, we must continue to refine and improve the technology to ensure that it is accurate, reliable, and accessible to all. With the right tools and resources, we can unlock the full potential of machine translation and use it as a powerful tool to promote health equity and improve health outcomes for all.