Neural Machine Translation is the Next Big Thing!
It’s been almost nine years since Koehn et al. published Moses: Open Source Toolkit for Statistical Machine Translation in 2007, which fundamentally changed the way machine translation (MT) was done. But this was not the first fundamental shift in MT, and it looks like it won’t be the last. To ensure our clients receive world-class innovation in language technology, we are working with what we are pretty sure will be the next big thing in MT. More about that to follow, but first a little context about how MT has evolved.
Brief History of MT
The field of MT began in earnest in the 1950s, first with bilingual dictionaries that permitted only word-by-word translation. Translations produced this way are seldom fluent. They are easily tripped up by polysemous words (words with more than one meaning, like “bank” or “Java”) and are often very difficult to understand for someone who doesn’t already know the intended meaning.
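To make that concrete, here is a toy sketch of word-by-word lookup. The little Spanish dictionary and the example sentence are invented for illustration, not taken from any real system.

```python
# Minimal sketch of word-by-word dictionary translation (illustrative only;
# the dictionary entries and the sentence are invented for this example).
BILINGUAL_DICT = {
    "the": "el",
    "bank": "banco",   # financial sense -- but "bank" can also mean a river bank ("orilla")
    "is": "está",
    "closed": "cerrado",
}

def word_by_word_translate(sentence: str) -> str:
    """Translate by looking up each word independently, ignoring context."""
    return " ".join(BILINGUAL_DICT.get(word, word) for word in sentence.lower().split())

print(word_by_word_translate("The bank is closed"))
# -> "el banco está cerrado", which is right only if "bank" meant the financial
# kind; the lookup has no way to choose "orilla" when a river bank was intended.
```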
From this beginning, the Next Big Thing was the introduction of rule-based machine translation (RBMT). First came direct RBMT, which applied basic rules on top of the bilingual dictionaries. Those rules helped with word order, but still didn’t address problems like polysemy. Next came transfer RBMT, which added rules for morphology and syntax to tackle those problems. Transfer systems can perform quite well, but because language is so rich, they are often incomplete in vocabulary coverage, syntactic coverage, or both. RBMT is also expensive, because it requires human linguists to write all the rules and maintain the dictionaries the systems use. Still, due in part to the high cost of computing resources, RBMT dominated the field between 2000 and 2010, and there are still companies offering good RBMT solutions today, often hybrids that combine RBMT with SMT.
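As a flavor of what “writing the rules” means in practice, here is a toy transfer-style step for an imagined English-to-Spanish system: a tiny hand-made dictionary plus one reordering rule. Everything in it is invented for illustration.

```python
# Toy transfer-RBMT sketch: lexical transfer via a dictionary, then one
# hand-written structural rule (English ADJ NOUN -> Spanish NOUN ADJ).
# The dictionary, tags, and rule are invented for this example.
DICT = {"the": ("el", "DET"), "red": ("rojo", "ADJ"), "car": ("coche", "NOUN")}

def translate_with_rules(words):
    tagged = [DICT[w.lower()] for w in words]   # lexical transfer
    out, i = [], 0
    while i < len(tagged):
        if i + 1 < len(tagged) and tagged[i][1] == "ADJ" and tagged[i + 1][1] == "NOUN":
            out.extend([tagged[i + 1][0], tagged[i][0]])   # reorder ADJ NOUN
            i += 2
        else:
            out.append(tagged[i][0])
            i += 1
    return " ".join(out)

print(translate_with_rules("the red car".split()))  # -> "el coche rojo"
# A real system needs thousands more rules (agreement, morphology, idioms),
# which is exactly why RBMT is expensive to build and maintain.
```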
Statistical Machine Translation (SMT)
Thanks to increased computing power at a lower cost and some pioneering research from IBM around 1990, work on statistical machine translation (SMT) began to take off in the late 1990s and early 2000s. In 2007, Moses was earmarked as the next big thing in MT; however, it wasn’t until 2010–2012 that it became the foundation upon which nearly every commercial SMT system was based. SMT shifted the focus from linguists writing rules to acquiring the aligned corpora required to train the systems. SMT has limitations as well: language pairs with very different word order are particularly tricky, and unless you have vast amounts of computing resources, modeling long-range dependencies between words or phrases is nearly impossible.
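The core idea, stripped down to a toy, looks something like the sketch below: phrase pairs scored with probabilities estimated from aligned text. The phrase table here is made up, and a real decoder such as Moses adds a language model and a reordering model on top.

```python
# Minimal SMT sketch: pick translations from a phrase table of probabilities
# that a real system would estimate from an aligned corpus. The entries and
# scores below are invented; real tables hold millions of learned phrase pairs.
PHRASE_TABLE = {
    "machine translation": [("traduction automatique", 0.8), ("traduction mécanique", 0.2)],
    "is useful": [("est utile", 0.9), ("est pratique", 0.1)],
}

def greedy_decode(source_phrases):
    """Choose the highest-probability target phrase for each source phrase."""
    return " ".join(max(PHRASE_TABLE[p], key=lambda t: t[1])[0] for p in source_phrases)

print(greedy_decode(["machine translation", "is useful"]))
# -> "traduction automatique est utile"
# Because each phrase is chosen locally, dependencies that span the whole
# sentence (and big word-order differences) are hard to get right.
```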
There have been incremental improvements to SMT over the past several years, including hierarchical models and the introduction of linguistic metadata for grammar-informed models. But nothing has come along with as big an impact as the jump from word-by-word translation to RBMT, or from RBMT to SMT, until now.
Neural Machine Translation (NMT)
Over the past two years, researchers have been working on using sequence-to-sequence mapping with artificial neural networks to develop what’s being called neural machine translation (NMT). Essentially, they use recurrent neural networks to build a system that learns to map a whole sentence from source to target all at once, instead of word-by-word, phrase-by-phrase, or n-gram-by-n-gram. This largely removes the long-range dependency and word-order problems, because the system learns from whole sentences at a time. Indeed, some researchers are looking at extending beyond the sentence to whole paragraphs or even documents. Document-level translation would, in theory, eliminate our need for aligned files and allow us to train on transcreated material, which is unthinkable in any system available today.
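For readers who like to see the shape of these models, here is a minimal encoder-decoder sketch in PyTorch. The vocabulary sizes, dimensions, and toy batch are made up, and this is only a bare illustration of the sequence-to-sequence idea, not our production setup.

```python
# Bare-bones sequence-to-sequence NMT sketch with recurrent networks.
# All sizes and the random "sentences" below are invented for illustration.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # The encoder reads the entire source sentence into a hidden state...
        _, hidden = self.encoder(self.src_emb(src_ids))
        # ...and the decoder generates the target conditioned on that state,
        # so translation is learned sentence-to-sentence, not word-to-word.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), hidden)
        return self.out(dec_out)              # scores over the target vocabulary

# Toy forward pass: a batch of 2 "sentences", 5 source and 6 target token IDs each.
model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 5))
tgt = torch.randint(0, 1000, (2, 6))
logits = model(src, tgt)                      # shape: (2, 6, 1000)
```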
NMT has shortcomings as well. Neural networks require a lot of training data, on the order of one million sentence pairs, and there’s currently no good solution for translating rare or out-of-vocabulary (OOV) words. There have been a few proposals on how to address this problem, but nothing firm yet. At Welocalize, we’re actively pursuing ideas of our own on how to fix the OOV problem for client data, and we’re also working on reducing the amount of client data needed to train a good NMT system.
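To show why OOV words matter, here is a small sketch of what happens before a sentence ever reaches the network: everything outside the training vocabulary collapses to a single unknown token. The vocabulary is invented, and subword segmentation and copy mechanisms are among the proposed, still unsettled, fixes mentioned above.

```python
# Why OOV hurts: tokens missing from the training vocabulary all become <unk>,
# so the network literally cannot see which word was there. Toy vocabulary only.
VOCAB = {"<unk>": 0, "the": 1, "update": 2, "fixes": 3, "bugs": 4}

def to_ids(sentence):
    """Map each token to its vocabulary ID, falling back to <unk>."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in sentence.lower().split()]

print(to_ids("The update fixes bugs"))        # [1, 2, 3, 4]
print(to_ids("The hotfix fixes heisenbugs"))  # [1, 0, 3, 0] -- two words lost to <unk>
```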
The other major shift is in hardware: training large neural networks efficiently requires a different kind of machine. SMT requires a lot of memory to store phrase tables, and training can be parallelized to run well on CPUs with multiple cores. NMT, on the other hand, requires high-end GPUs (yes, video cards) for training. We’ve invested in the infrastructure necessary to do the work, and we’re working hard to get this exciting new technology ready for our clients to use. Our early results with a variety of domain-specific data sets are very promising.
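In practice, the hardware shift shows up as a one-line decision in most neural toolkits: use the GPU if the machine has one, otherwise fall back to the CPU. The snippet below uses PyTorch as one common choice, with a stand-in linear layer rather than a full NMT model.

```python
# Device selection as most neural toolkits do it: train on a GPU when available.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Model parameters and training batches must live on the same device.
model = nn.Linear(512, 512).to(device)        # stand-in for a full NMT model
batch = torch.randn(64, 512, device=device)
output = model(batch)
print(f"Training on: {device}")
```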
We’re not alone in our excitement. Many talks and posters at MT conferences are dedicated to progress in NMT. Google and Microsoft are both working on ways to use NMT in their translation products, with a special interest in how NMT can significantly improve fluency when translating between Asian and European languages. Watch this space in the weeks and months to come for updates on our progress with this exciting technology.