03/05/2023

Why Evaluation Metrics are Not As Objective As You Think

Are you tired of poorly translated documents that leave you scratching your head? Machine translation has come a long way in recent years, but it's not perfect yet. One of the challenges that machine translation developers face is ensuring that translations are not biased or inaccurate. This is where evaluation metrics come in. These metrics are used to measure the effectiveness of machine translation systems, but are they as objective as we think? For a deeper look at this topic, consider reading our previous blog post, “Bias in machine translation: A wake-up call for the translation industry?”

In an interview, Google's Macduff Hughes discusses how the machine learning algorithms behind Google Translate can be biased: they learn from the language people use on the web, which can perpetuate stereotypes and inaccuracies. In this article, we'll explore the hidden biases that can affect evaluation metrics in machine translation, and discuss alternative approaches that can help create more accurate and inclusive translations. Get ready to discover the fascinating world of machine translation evaluation, and why it matters for bridging linguistic and cultural barriers in our globalized world.

Hidden Bias in Google Translate's Evaluation Metrics

Google Translate has become a ubiquitous tool for breaking down language barriers and connecting people from all corners of the globe. However, like all machine learning algorithms, Google Translate's evaluation metrics can be vulnerable to hidden bias. This refers to the existence of implicit biases or stereotypes in the data used to train the algorithm, which can lead to inaccuracies in translation.

Hidden bias in Google Translate's evaluation metrics can manifest in a variety of ways. For instance, if the algorithm is trained on a limited dataset, it may struggle to translate texts that fall outside of its established norms. Another example is the use of a "majority rules" approach in evaluating translations, which can overlook valid alternative translations. Linguistic and cultural biases in the training data can also creep into the evaluation metrics and become perpetuated in the algorithm.

The consequences of hidden bias in Google Translate's evaluation metrics can be significant. Inaccurate translations can lead to misunderstandings and breakdowns in communication between individuals from different cultural and linguistic backgrounds. One study has shown that Google Translate's output is biased towards masculine pronouns, producing translations that reinforce gender stereotypes.

As machine translation continues to play such an important role in connecting people worldwide, it is crucial to address hidden bias in evaluation metrics. Efforts must be made to expose the algorithm to diverse texts and introduce human moderation to correct errors. By mitigating hidden bias in Google Translate's evaluation metrics, we can work towards more objective and unbiased translations that bring people together, rather than driving them apart.

Gender Bias: A Common Example of Bias in MT Evaluation Metrics

When we translate phrases from gender-neutral languages, such as Hungarian, into English, we can observe the issue of gender bias in machine translation. A glimpse into this problem can be seen in Google Translate's output: occupations associated with male-dominated fields, like scholar, engineer, and CEO, are automatically assumed to be male, while occupations like nurse, baker, and wedding planner are assumed to be female (Prates, da Costa Avelar, and Lamb '23).

This highlights the challenge of developing machine translation systems that can accurately handle grammatical gender, particularly when translating from languages with gender-neutral pronouns. It also underscores the importance of addressing biases in machine learning algorithms to ensure that they do not perpetuate harmful stereotypes or misinterpretations.
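One way to make this kind of bias concrete is to probe translations of gender-neutral source sentences and tally which pronoun the system chooses. The sketch below uses hand-written example translations rather than real Google Translate output, and the `pronoun_gender` helper is a hypothetical illustration, not part of any MT toolkit:

```python
# Minimal sketch of a gender-bias probe. The Hungarian pronoun "ő"
# is gender-neutral; English forces a choice. The "translations"
# below are illustrative stand-ins, NOT real Google Translate output.

def pronoun_gender(english_sentence: str) -> str:
    """Classify a translation by the subject pronoun it begins with."""
    first_word = english_sentence.split()[0].lower()
    return {"he": "male", "she": "female", "they": "neutral"}.get(
        first_word, "unknown"
    )

samples = {
    "ő egy mérnök": "he is an engineer",  # hypothetical MT output
    "ő egy ápoló": "she is a nurse",      # hypothetical MT output
}

for source, translation in samples.items():
    print(f"{source} -> {translation}: {pronoun_gender(translation)}")
```

A real probe would call the translation API for a balanced list of occupations and report the pronoun distribution per occupation, as Prates et al. did at a much larger scale.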

Factors Contributing to Hidden Bias

Machine translation has come a long way in breaking down language barriers and connecting people from all over the world. However, hidden biases in machine translation evaluation metrics can entrench linguistic and cultural biases, leading to inaccurate translations that reinforce stereotypes and cause misunderstandings (Prates et al. '23).

One contributing factor to hidden bias is a lack of diversity in training data. If the dataset used to train the algorithm represents only a specific culture or language, the algorithm may struggle to accurately translate texts outside of that dataset. In other words, context is everything. Dr. Sheila Castilho, a machine translation researcher and professor of translation studies at Dublin City University, said it best: “Misevaluation, in my definition, is when a metric (human or automatic) deems an MT sentence as accurate and fluent because there is no context to tell why the sentence is not accurate or fluent. This makes NMT output seem like they are correct in isolation, but in reality, when put back in the context, the translation is incorrect.”

Furthermore, linguistic and cultural biases may be present in the language used in the training data, which can perpetuate biases in the algorithm's translations. Human evaluation bias is another factor that may contribute to hidden bias in machine translation evaluation metrics. Evaluators' preconceptions and implicit biases can influence the selection and prioritization of certain translations over others, even when those translations are not the most accurate.

The impact of hidden bias in machine translation evaluation metrics can be significant, with inaccurate translations leading to breakdowns in communication between people from different cultural and linguistic backgrounds. To mitigate this impact, it is essential to expose the algorithm to diverse training data and introduce human moderation to correct errors (Prates et al. '23).

By acknowledging and addressing hidden bias in machine translation evaluation metrics, we can work towards more objective and unbiased translations that promote mutual understanding and communication across cultures.

Importance of Objective Metrics

Language is a powerful tool that connects us all, but achieving accurate and unbiased translations across cultures and languages can be challenging. That's where objective metrics come in. By relying on measurable and quantifiable data, objective metrics ensure that machine translation systems are evaluated without bias and produce accurate translations.

Alternative approaches to evaluation, such as automatic metrics, have emerged to provide a more objective assessment of translation quality. These metrics score a system's output against one or more reference translations using quantifiable measures, such as word or character n-gram overlap, reducing the risk of human bias.
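To make the idea concrete, here is a simplified, single-reference BLEU score in pure Python. This is a sketch for illustration only (no smoothing, whitespace tokenization, one reference), not a replacement for a standard implementation such as sacreBLEU:

```python
import math
from collections import Counter

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions (n = 1..max_n), times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # no smoothing: any empty n-gram order zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```

Note how heavily the score depends on surface overlap with the reference: a perfectly valid alternative wording scores poorly, which is exactly the kind of blind spot that lets a “majority rules” reference hide bias.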

In addition, current efforts to improve objectivity in machine translation evaluation include developing machine learning models that account for linguistic and cultural biases, and integrating more diverse and representative training data. 

Dr. Maja Popovic, a research fellow at the ADAPT Centre at Dublin City University, spoke with us about how new technology is playing a major role in such endeavors. “As quality of machine translation evolves, lexical-based automatic metrics based on (dis)similarity between words or characters in the MT output and reference human translation (such as BLEU, TER, chrF, etc.) are being more and more suppressed with the new type of metrics based on neural networks (such as COMET, BLEURT, BERTScore, etc.). These new metrics correlate better with human evaluation scores, however, they require training data consisting of source text, MT output, reference translations and human scores. By incorporating natural language processing techniques and human feedback into the evaluation process, researchers are working to improve the reliability of automatic metrics.”
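The lexical metrics Dr. Popovic mentions are simple enough to sketch. Below is a simplified chrF-style character n-gram F-score in pure Python (uniform averaging over n-gram orders, spaces stripped, recall weighted by beta=2 as in chrF); real evaluations should use the reference implementation in sacreBLEU:

```python
from collections import Counter

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average F-beta over character n-gram orders 1..max_n."""
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    f_scores = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp[i:i+n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i+n] for i in range(len(ref) - n + 1))
        if not hyp_ngrams or not ref_ngrams:
            continue  # strings too short for this n-gram order
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        prec = overlap / sum(hyp_ngrams.values())
        rec = overlap / sum(ref_ngrams.values())
        if prec + rec == 0:
            f_scores.append(0.0)
        else:
            # F-beta with beta=2 weights recall twice as much as precision.
            f_scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return sum(f_scores) / len(f_scores) if f_scores else 0.0
```

Because it works on characters rather than words, chrF is more forgiving of morphological variation than BLEU, but like all lexical metrics it still measures only similarity to the reference, not whether the translation is correct in context.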

In the end, ensuring that machine translation systems are evaluated using objective metrics is crucial for bridging language barriers and promoting understanding across cultures. By creating accurate and unbiased translations, we can help build a more inclusive and equitable world where communication knows no boundaries.

Conclusion

As we've seen, evaluation metrics play a crucial role in determining the effectiveness of machine translation systems. However, it's clear that these metrics are not always as objective as we might hope. Hidden biases can influence translation accuracy, leading to harmful stereotypes and misunderstandings.

But there's reason for hope. New approaches to evaluation, such as automatic metrics and natural language processing techniques, are emerging that offer a more objective assessment of translation quality. At the same time, efforts to diversify training data, eliminate linguistic and cultural biases, and integrate human feedback are helping to improve the accuracy and reliability of machine translation.

By recognizing the limitations of traditional evaluation metrics and embracing these new approaches, we can pave the way for more accurate, equitable, and inclusive translations that foster communication and understanding across languages and cultures. So let's continue to push the boundaries of machine translation evaluation, and work together to build a more connected and compassionate world.