A few hours ago, Microsoft announced the first phase of the Skype Translator preview program, which will start with English and Spanish as its first two languages. Skype Translator is the result of decades of research in speech recognition, automated translation, and general machine learning technologies. Among the main technologies behind Skype Translator are recent improvements in speech recognition, made possible by the introduction of deep neural networks. Combined with Microsoft’s proven statistical machine translation technology, these improvements allow for better translation outcomes, making meaningful one-on-one conversation possible.
How does it work?
Machine learning is the capability of software to learn from training data examples, and Skype Translator is built on a robust machine learning platform. By learning from the training data gathered during this preview stage, with all of its nuances, the software can learn to better recognize and translate the diversity of topics, accents and language variation of actual Skype Translator users.
Skype Translator’s machine learning protocols train and optimize the speech recognition (SR) and automatic machine translation (MT) tasks, and also act as the glue that holds these elements together. This “glue” transforms the recognized text to facilitate translation. The process includes removing disfluencies (such as ‘ahs’ and ‘umms’, as well as re-phrasings), dividing the text into sentences, and adding punctuation and capitalization.
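To make the "glue" step more concrete, here is a minimal Python sketch of the three clean-up operations described above: dropping fillers, splitting into sentences, and adding capitalization and punctuation. It is only an illustration and not Microsoft's implementation. The filler list, the `<pause>` boundary marker, and all function names are assumptions made up for this example; a production system would learn these decisions from data rather than use fixed rules.

```python
# Hypothetical filler words; a real system would learn these from training data.
FILLERS = {"um", "umm", "uh", "ah", "er", "hmm"}

# Hypothetical marker assumed to be emitted by the recognizer at long pauses.
PAUSE = "<pause>"


def split_sentences(tokens):
    """Group tokens into sentences, using the pause marker as a boundary."""
    sentences, current = [], []
    for tok in tokens:
        if tok == PAUSE:
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        sentences.append(current)
    return sentences


def remove_disfluencies(tokens):
    """Drop fillers and immediate word repetitions (a crude stand-in
    for the removal of re-phrasings)."""
    cleaned = []
    for tok in tokens:
        if tok in FILLERS:
            continue
        if cleaned and cleaned[-1] == tok:
            continue
        cleaned.append(tok)
    return cleaned


def format_sentence(tokens):
    """Capitalize the first letter and add a final period."""
    text = " ".join(tokens)
    return text[0].upper() + text[1:] + "."


def normalize(raw_transcript):
    """Turn raw recognizer output into a list of cleaned-up sentences."""
    tokens = raw_transcript.lower().split()
    result = []
    for sentence in split_sentences(tokens):
        cleaned = remove_disfluencies(sentence)
        if cleaned:  # skip sentences that consisted only of fillers
            result.append(format_sentence(cleaned))
    return result


if __name__ == "__main__":
    raw = "um i i want to uh book a flight <pause> ah is that possible"
    for line in normalize(raw):
        print(line)
```

Even this toy version shows why the step matters: the translation engine receives short, well-formed sentences instead of a raw stream of recognized words.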
The training data for speech recognition and machine translation comes from a variety of sources, including translated web pages, videos with captions, and previously translated and transcribed one-on-one conversations. Skype Translator records conversations in order to analyze the scripts and train the system to better learn each language. Many people have also donated data from previous conversations, which Microsoft analyzes and uses to create training material for the statistical models that teach the speech recognition and machine translation engines how to map the incoming audio stream to text, and then the text to another language. Skype Translator participants are all clearly notified as the call begins that their conversation will be recorded and used to improve the quality of Microsoft’s translation and voice recognition services.
After the data is prepared and entered into the machine learning system, the machine learning software builds a statistical model of the words in these conversations, and their context. When you say something, the software can find something similar in its statistical model, and apply the previously learned transformation from audio to text and from text into the foreign language.
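As a rough illustration of what "a statistical model of the words and their context" can mean, the Python sketch below counts which word tends to follow which in a handful of training sentences. This is a deliberately tiny bigram model, chosen here only for illustration; the models behind Skype Translator are far richer, and the function names and sample sentences are invented for this example.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for every word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def most_likely_next(model, word):
    """Return the word seen most often after `word`, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up training sentences standing in for transcribed conversations.
    corpus = ["how are you", "how are things", "how are you today"]
    model = train_bigram_model(corpus)
    print(most_likely_next(model, "are"))
```

The more conversations such a model sees, the better its counts reflect how people actually speak, which is the reason the preview program collects real usage data.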
While speech recognition has been an important research topic for decades, widespread adoption of the technology had been stymied by high error rates and sensitivity to speaker variation, noise conditions, and similar factors. The advent of Deep Neural Networks (DNNs) for speech recognition, pioneered by Microsoft Research (MSR), dramatically reduced error rates and improved robustness, finally enabling the use of this technology in broad contexts such as Skype Translator. At the same time, the dream of global human-to-human communication was a major motivating factor and driving force for the MSR researchers working on this technology.