One of the main features announced today by Azure Media Services is the video indexing service. The technology behind the new feature is the Microsoft Audio Video Indexing Service (MAVIS), which uses speech recognition technology developed at Microsoft Research to make the speech in audio and video files searchable. MAVIS also automatically generates closed captions and keywords, which can increase the accessibility and discoverability of audio and video files with speech content. MAVIS is available as a cloud service running on the Windows Azure platform.
Search audio for spoken words – MAVIS generates a binary file that can be searched in Microsoft SQL Server using full-text search. The user experience is much like searching for text in documents and on the web, as demonstrated on the MAVIS trial site. Users type in search terms and get back a set of links; clicking a link starts playing the audio or video from the point where those terms were spoken.
Highly accurate audio search results – MAVIS uses state-of-the-art Deep Neural Net (DNN) based speech recognition technology developed at Microsoft Research to convert digital audio signals into words. MAVIS further reduces errors in speech recognition by automatically expanding its vocabulary and by storing word alternatives using a technique referred to as Probabilistic Word-Lattice Indexing, explained in the technical background. These techniques help provide highly accurate search results.
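To make the idea of storing word alternatives concrete, here is a minimal sketch in Python. It is not the MAVIS implementation: the lattice is flattened into time slots, and the sample data, the probabilities, and the names `LATTICE`, `build_index` and `search` are all assumptions made for illustration. The point it shows is that a search term can still be found when it was only the recognizer's second-best guess.

```python
from collections import defaultdict

# Hypothetical toy data (not MAVIS output): each time slot holds the
# alternative words a recognizer considered, with a probability for each.
LATTICE = [
    (0.0, {"azure": 0.7, "assure": 0.3}),
    (0.6, {"media": 0.9, "medium": 0.1}),
    (1.1, {"services": 0.8, "service": 0.2}),
]


def build_index(lattice):
    """Map every candidate word to a list of (start_seconds, probability)."""
    index = defaultdict(list)
    for start, alternatives in lattice:
        for word, prob in alternatives.items():
            index[word].append((start, prob))
    return index


def search(index, term, min_prob=0.0):
    """Return (start_seconds, probability) hits for term, most likely first."""
    hits = [(s, p) for s, p in index.get(term.lower(), []) if p >= min_prob]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


index = build_index(LATTICE)
print(search(index, "assure"))                # found, although it was not the top guess
print(search(index, "assure", min_prob=0.5))  # filtered out by the threshold
```

Indexing only the single best word per slot would lose "assure" entirely; keeping the alternatives trades a larger index for better recall, and the threshold lets a caller choose how much uncertainty to tolerate.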
Closed captions – Closed captions can make audio and video content accessible to the hearing impaired, and can be translated so that the content reaches a broader audience in different languages. MAVIS generates closed captions in the SAMI and TTML formats. The accuracy of the closed captions generated by MAVIS depends mainly on the clarity of speech in the media content. A number of subtitle editing tools available on the web can be used to edit the generated captions for improved accuracy. MAVIS provides an estimated level of accuracy to help determine whether post-editing is required.
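As a rough illustration of working with a TTML caption file, the sketch below reads caption cues with Python's standard `xml.etree.ElementTree` module. The sample document is hand-written for this example and is not actual MAVIS output. The helper names `clock_to_seconds` and `read_captions` are assumptions, and only `HH:MM:SS.fraction` clock times are handled, although TTML permits other time expressions.

```python
import xml.etree.ElementTree as ET

# Namespace used by TTML documents.
TTML_NS = {"tt": "http://www.w3.org/ns/ttml"}

# Hand-written sample for illustration only (not generated by MAVIS).
SAMPLE_TTML = """<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
  <body>
    <div>
      <p begin="00:00:01.000" end="00:00:03.500">Welcome to the session.</p>
      <p begin="00:00:03.500" end="00:00:06.000">Today we cover indexing.</p>
    </div>
  </body>
</tt>
"""


def clock_to_seconds(value):
    """Convert an 'HH:MM:SS.fraction' clock time to seconds."""
    hours, minutes, seconds = value.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def read_captions(ttml_text):
    """Return a list of (begin_seconds, end_seconds, text) tuples."""
    root = ET.fromstring(ttml_text)
    captions = []
    for p in root.iterfind(".//tt:p", TTML_NS):
        text = "".join(p.itertext()).strip()
        begin = clock_to_seconds(p.get("begin"))
        end = clock_to_seconds(p.get("end"))
        captions.append((begin, end, text))
    return captions


captions = read_captions(SAMPLE_TTML)
for begin, end, text in captions:
    print(f"{begin:7.2f} to {end:7.2f}  {text}")
```

A list of timed cues like this is a convenient starting point for post-editing or for handing the text to a translation step.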
Keyword generation – MAVIS generates keywords from the speech content. The keywords are stored in an XML file with frequency and offset information. The keywords generated by MAVIS can be used to perform speech analytics, or exposed to search engines such as Bing, Google or Microsoft SharePoint to make the media files more discoverable, or used to deliver more relevant ads.
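The sketch below shows one way such a keyword file could feed a simple analytics step. The XML layout is hypothetical: the schema of the MAVIS keyword file is not given here, so the element and attribute names (`keyword`, `text`, `frequency`, `offset`, `seconds`) are assumptions, as are the helpers `load_keywords` and `top_keywords`.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout for illustration only; the real MAVIS schema may differ.
SAMPLE_KEYWORDS = """<keywords>
  <keyword text="azure" frequency="3">
    <offset seconds="12.4"/>
    <offset seconds="48.0"/>
    <offset seconds="95.2"/>
  </keyword>
  <keyword text="indexing" frequency="2">
    <offset seconds="20.1"/>
    <offset seconds="61.7"/>
  </keyword>
  <keyword text="captions" frequency="1">
    <offset seconds="75.3"/>
  </keyword>
</keywords>
"""


def load_keywords(xml_text):
    """Return {keyword: {"frequency": int, "offsets": [seconds, ...]}}."""
    root = ET.fromstring(xml_text)
    result = {}
    for kw in root.iter("keyword"):
        offsets = [float(o.get("seconds")) for o in kw.iter("offset")]
        result[kw.get("text")] = {
            "frequency": int(kw.get("frequency")),
            "offsets": offsets,
        }
    return result


def top_keywords(keywords, n=2):
    """Return the n most frequent keywords, most frequent first."""
    ranked = sorted(keywords, key=lambda k: keywords[k]["frequency"], reverse=True)
    return ranked[:n]


keywords = load_keywords(SAMPLE_KEYWORDS)
print(top_keywords(keywords))
```

The most frequent keywords could then be published as page metadata for search engines, while the offsets allow linking each keyword back to the moments where it was spoken.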