Once again, tech giant Meta has made waves in the machine translation community. The company recently released a major research update on its No Language Left Behind project, which now accommodates 200 languages.
Meta calls the NLLB project “a first-of-its-kind, AI breakthrough project that open-sources models capable of delivering evaluated, high-quality translations between 200 languages—including low-resource languages like Asturian, Luganda, Urdu, and more.” What does that mean? Let’s break it down:
First of its kind. Multilingual machine translation models exist, but none on the scale of what Meta has done. NLLB-200 far surpasses Meta’s own previous M2M-100 model, which could translate among 100 languages without using English as an intermediary.
Open-sourced models. This means that the code for NLLB-200 is freely available for anyone, particularly researchers, to examine and build on.
Evaluated, high-quality translations. Comparing multilingual MT models requires a common benchmark for assessing translation quality. Meta has created one, FLORES-200, that covers NLLB-200’s massive linguistic scope.
Low-resource languages. These are languages that don’t have much language data available on the web, and thus receive less attention in the development of MT. NLLB-200 may be the single most concerted effort to include them to date.
The most obvious gain from this development is the attention it brings to languages underrepresented on the internet, or what the MT community refers to as “low-resource languages”.
MT research and development has tended to focus on a small subset of languages for which data is readily available, and for which there is more economic incentive.
As a result, the gains from advances in MT are distributed unevenly among languages: high-resource languages pull further ahead and receive higher-quality translations.
With No Language Left Behind, Meta is making a massive effort to bring more languages into the mix than ever before.
From the outset, the components of NLLB have been open-source. Aside from the MT model itself, Meta has made freely available its improvements to the LASER (Language-Agnostic Sentence Representations) encoder, the FLORES (Facebook Low-Resource) benchmark used for evaluating the quality of translations, and professionally translated datasets used in training the AI.
This matters because it gives researchers and developers full access to the work.
The FLORES-101 dataset, precursor of the current FLORES-200, was released open-source in June 2021 to create a benchmark for evaluating MT of low-resource languages. It was quickly put to use, including at the 2021 Conference on Machine Translation. FLORES-200 extends its coverage from 101 languages to 200, and will continue to serve as an evaluation benchmark.
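To make the idea of a benchmark more concrete: a benchmark like FLORES supplies reference translations written by people, and an automatic metric then scores how closely a system's output matches them. The sketch below is only an illustration of that principle. It uses a simplified character n-gram F-score (loosely in the spirit of chrF-style metrics), which is an assumption on our part and not necessarily what Meta uses; the function names and example sentences are made up for this post.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after collapsing runs of whitespace."""
    chars = " ".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def ngram_fscore(hypothesis: str, reference: str,
                 max_n: int = 3, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score between 0.0 and 1.0.

    Precision and recall are averaged over n-gram orders 1..max_n,
    then combined with an F-beta score (beta > 1 favors recall).
    This is an illustrative sketch, not an official metric.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # n-grams shared by both sides
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Hypothetical example: one human reference, two system outputs.
reference = "the cat sat on the mat"
close_output = "the cat sat on a mat"
poor_output = "dogs run fast"

print("close:", ngram_fscore(close_output, reference))
print("poor: ", ngram_fscore(poor_output, reference))
```

The output that stays closer to the reference gets the higher score. A real benchmark applies the same idea across many sentences and many language pairs, which is what lets different MT models be compared on equal footing.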
In making the No Language Left Behind project open-source, Meta recognizes the development of MT and AI technology as a collective responsibility. Researchers are able to build on its gains instead of risking redundancy of efforts, allowing them to participate in developing the tech in a more meaningful capacity.
According to Meta’s research paper, “NLLB could motivate more low-resource language writers or content creators to share localized knowledge or various aspects of their culture with both cultural insiders and outsiders through social media platforms or websites like Wikipedia.”
This matters because low-resource languages face not only the danger of extinction but also the erosion of the cultures tied to them. Language and culture are inextricably linked, and one benefit of the NLLB project is that it helps preserve the cultural heritage of the communities that speak these languages.
Another dimension of the ethical approach that Meta is taking with the NLLB project is extensive consultation with the communities of these low-resource languages to understand how the work might impact their day-to-day lives, with an eye toward making sure that it doesn’t exacerbate inequalities in the digital sphere.
As such, the No Language Left Behind project also serves as a blueprint for an ethical approach to MT development with respect to the language communities involved.
The No Language Left Behind project is ambitious in scope, especially with the recent release of NLLB-200. But there’s more to it than the number of languages. The project makes a major contribution to the different sectors that have a stake in the development of MT. It is a pioneering effort not only in its technical advances but also in building a community around machine translation’s ultimate goal: breaking down barriers between languages all over the world.
© Copyright 2023 Tomedes All Rights Reserved.