Saturday, June 6, 2015

Automated Language Translation

Breaking the Language Barrier to Cross-lingual Information Access

Accessing Information From Foreign Content

More than half the content on the Internet is in a language other than English, and three out of four Internet users are not native speakers of English. Although much useful information, some of it critical to economic success and physical security, can be gleaned from these foreign language sources, the value of this information often decreases over time.

This presents a challenge because translation by humans is currently too slow and too expensive to provide the quick, reliable access to foreign language information that governments and businesses need. Foreign language translation is changing, however. Advances in automated language translation technology have opened the possibility of breaking the language barrier for both information access and in-person communication. Figure 1 illustrates the central role that automated language translation plays in enabling communication and information access across different languages. The goal of an automated language translation system is to ingest a sentence in a source language and produce a correct, fluent, semantically equivalent sentence in the target language.

Automated Language Translation Challenges

The automatic translation process is easy to depict, but difficult to achieve.

A general property of all human languages is the prevalence of ambiguity in the meaning of individual words as well as in the relationship between parts of a sentence. Humans are usually very efficient at resolving these ambiguities when interpreting linguistic input, often without being aware of their existence. They rely on past experience and the context surrounding the speech or text to perform the task. This knowledge is very hard to model in a computer system. Without access to such knowledge, an automated language translation system still has to meet the challenges of selecting the correct translation of a word, rearranging the translated words according to the grammar of the target language, and producing a correct and natural-sounding translation.

History of Automated Language Translation

Despite the difficulty of the automated translation problem, attempts to tackle it date back to the 1950s. Figure 2 shows a timeline of the major milestones in the evolution of automated translation.

The prevailing approach in the early decades was to analyze the structure of the input sentence and determine the possible senses of its ambiguous words, and then apply translation rules crafted by expert linguists to generate the translation. A drawback of this rule-based approach is a lack of flexibility for adding new translation rules and ensuring consistency with the existing rules. Moreover, using this approach to build a translation system for a new language pair requires a linguist who is an expert in both languages. Also, the process of writing the requisite set of translation rules is a slow, difficult and labor-intensive undertaking. These are major disadvantages.

A data-driven approach to automated translation began in the early 1990s. Instead of specifying translation rules manually, this methodology uses a parallel corpus that consists of sentences from the source language along with their translations in the target language. Example sentence translations from the corpus are used to derive automatically a large set of translation rules between smaller units (e.g., words or phrases), together with an associated likelihood for each rule.

The rules and likelihoods are then applied to translate a new input sentence from the source language. This approach, called statistical machine translation (SMT), revolutionized automated language translation by enabling translation systems for new languages and domains to be developed quickly and cheaply. With the advent of SMT, linguists who are experts in two or more languages are no longer needed to design translation rules manually. Translation rules are now automatically derived from sentences translated by bilingual speakers who are not necessarily linguistic experts. This shortens development time and reduces translation cost. Early SMT models focused on learning the translations of individual words.
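To make the idea of deriving word-level translation likelihoods from a parallel corpus concrete, here is a minimal sketch in Python. It assumes a tiny, made-up German-English corpus and uses a simplified expectation-maximization procedure in the style of IBM Model 1 (without a NULL word). The names `CORPUS`, `train_word_model`, `best_translation` and `translate` are hypothetical, chosen for illustration only; this is not the method used by any system described in this article.

```python
from collections import defaultdict

# Toy parallel corpus (hypothetical data): pairs of (source words, target words).
CORPUS = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]


def train_word_model(corpus, iterations=10):
    """Estimate t[(target_word, source_word)] ~ P(target_word | source_word) with EM."""
    target_vocab = {tw for _, tgt in corpus for tw in tgt}
    uniform = 1.0 / len(target_vocab)
    # Start with every target word equally likely for every source word.
    t = defaultdict(lambda: uniform)
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of each (target, source) pair
        total = defaultdict(float)  # expected count of each source word
        for src, tgt in corpus:
            for tw in tgt:
                # E-step: share one unit of credit for tw among the source words.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize expected counts into probabilities per source word.
        t = defaultdict(
            float, {pair: c / total[pair[1]] for pair, c in count.items()}
        )
    return t


def best_translation(t, source_word):
    """Return the most likely target word, or the input word if it was never seen."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get) if candidates else source_word


def translate(t, sentence):
    """Word-by-word translation: no reordering and no use of context."""
    return " ".join(best_translation(t, w) for w in sentence.split())


model = train_word_model(CORPUS)
print(translate(model, "das buch"))
```

The sketch translates one word at a time and never reorders the output, which is exactly the limitation that later phrase-based and syntax-based models were designed to overcome.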

Thanks to the increased availability of data and cheap computational power, more complex models that can learn the translation of phrases or syntactic structures are being developed. The current approach to SMT, shown in Figure 3, uses statistical models dependent on linguistic information and context to produce translations that preserve sentence structure. SMT is still a fledgling technology despite the significant advances made over the last two decades.

Current research continually incorporates advances in machine learning theory and linguistic modeling to improve the state of the art in SMT technology.

SMT Research at Raytheon BBN Technologies

Research in machine translation started at Raytheon BBN Technologies in 2003. In two years, Raytheon established itself as a leading player in the field of language translation by leveraging its experience in statistical modeling for speech recognition.

Most of the research was done under programs sponsored by the Defense Advanced Research Projects Agency (DARPA), which had a great impact in advancing SMT technology. The alignment of these programs to the major milestones in SMT evolution can be seen in Figure 2. Between 2005 and 2011, Raytheon BBN Technologies participated in the DARPA Global Autonomous Language Exploitation (GALE) program, whose goal was to develop technologies to absorb, analyze and interpret huge volumes of speech and text in multiple languages. Raytheon was consistently ranked the top performer in the program’s official evaluations organized by the National Institute of Standards and Technology (NIST). Raytheon was also the top performer in DARPA’s Spoken Language Communication and Translation System for Tactical Use (TRANSTAC) program, which aimed to develop technology for real-time speech-to-speech translation from English to a foreign language and vice versa.

As part of these and other research programs, Raytheon made several significant innovations to improve the state of the art of SMT. For example, Raytheon BBN Technologies developed a translation model that produces translations with improved semantic coherence by using information about the grammatical relationship between words that occur far apart in sentences. Raytheon also developed a procedure for combining the outputs of multiple automated translation systems to produce a better translation than any of the individual outputs. Raytheon researchers have likewise developed techniques for detecting names and handling them properly in translations, and for using confidence scores on the alignment between phrases to improve the translation quality.

Raytheon BBN Technologies is currently part of the DARPA Broad Operational Language Translation (BOLT) program. DARPA launched BOLT in 2011 to address the U.S. Department of Defense’s need for quick, reliable access to the large volume of foreign language information generated by users online. One of the program’s goals is to create SMT technologies that can correctly translate informal text generated by online users, which often contains spelling and grammatical anomalies. Another goal is to support in-person communication with non-English-speaking local populations in foreign countries. Raytheon researchers have made significant progress in the short time since the program started, including developing abilities to robustly process errors in input text, better model syntax and semantics, and improve the statistical models using neural networks.

Advances in speech-to-speech translation include modeling of a conversation’s context and detecting speech recognition errors during translation to limit any harmful effects on the translation output. The BBN team ranked first in the formal evaluations of all BOLT machine translation tasks.

Raytheon BBN Technologies Automated Translation Solutions

In addition to conducting leading-edge research in SMT, Raytheon BBN Technologies has created several turnkey solutions for both the government and commercial markets based on its translation technology. TransTalk™, a two-way speech-to-speech translation solution, runs completely on a smartphone without the need to call a remote server. It currently supports translation between seven languages (including Arabic, Pashto and Dari) and English, and it has been deployed for testing in Afghanistan. The Multimedia Monitoring System (described in Technology Today, 2012, Issue 2, pp. 52–55) uses BBN’s SMT technology for some of the foreign languages it supports. The Multilingual Document Analysis and Translation System (MDATS) uses BBN’s optical character recognition and SMT technologies to translate Arabic document images into English. All these systems have been deployed in a number of government locations for 24/7 use.

See more at:

http://www.raytheon.com/news/technology_today/2014_i1/autolang.html#sthash.wjqwIzbI.dpuf

http://mymemory.translated.net/
