Microsoft announces several new features in Azure AI including text-to-speech avatar


Microsoft Azure AI Voice

Azure AI services allow developers to create AI applications with out-of-the-box, prebuilt and customizable APIs and models. Azure AI services include the Vision service, Speech service, Translator service and more. At Ignite 2023, Microsoft announced several new features in Azure AI, including text-to-speech avatar, personal voice, a new and improved machine translation model and more. Find the details below.

  • A new task-optimized summarization capability in Azure AI Language, powered by large language models (GPT-3.5-Turbo, GPT-4, Z-Code++ and more).
  • A new machine translation model capable of translating from one language to another without using English as an intermediary. In addition, it can be customized with customer data to better align translations to an industry’s context (a sample translation call is sketched below this list).
  • Named entity recognition, document translation and summarization in containers will allow government agencies and industries with strict data residency requirements, such as financial services and healthcare, to run these AI services on their own infrastructure.
  • Personal voice, a new custom neural voice feature that will enable businesses to create custom neural voices with 60 seconds of audio samples for their users. Personal voice is a limited access feature. 
  • Text-to-speech avatar, a new text-to-speech capability that will generate a realistic facsimile of a person speaking, based on input text and video footage of a real person. Both prebuilt and custom avatars are now in preview; custom avatar, however, is a limited access feature (a basic speech synthesis call is sketched right after this list).
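
To get a feel for the Speech service these features build on, here is a minimal sketch using the Azure Speech SDK for Python with a prebuilt neural voice. The subscription key, region and voice name are placeholders, and the personal voice and avatar previews use their own configuration that is not shown here.

```python
# Minimal sketch: text-to-speech with the Azure Speech SDK and a prebuilt neural voice.
# The personal voice and text-to-speech avatar previews announced at Ignite have their
# own setup; this only shows the baseline synthesis call they build on.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="<your-speech-key>",   # placeholder
    region="<your-region>",             # placeholder, e.g. "eastus"
)
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"  # prebuilt neural voice

# With no audio config supplied, output goes to the default speaker.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Welcome to the Azure AI updates from Ignite 2023.").get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized successfully.")
else:
    print("Synthesis did not complete:", result.reason)
```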

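The new machine translation model is exposed through the Translator service; as a rough illustration, the sketch below calls the public Translator Text REST API (v3.0) to translate Japanese directly into French. The key, region and optional custom-category values are placeholders, and any extra parameters the new direct-translation model may require are not covered in the announcement.

```python
# Rough sketch: direct translation (Japanese -> French) via the Translator Text REST API v3.0.
# Key, region and category values are placeholders for your own resource details.
import uuid
import requests

TRANSLATOR_KEY = "<your-translator-key>"      # placeholder
TRANSLATOR_REGION = "<your-resource-region>"  # placeholder, e.g. "westeurope"
ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"

params = {
    "api-version": "3.0",
    "from": "ja",   # source language
    "to": "fr",     # target language, no English pivot requested
    # "category": "<custom-model-id>",  # optional: a Custom Translator model trained on your data
}
headers = {
    "Ocp-Apim-Subscription-Key": TRANSLATOR_KEY,
    "Ocp-Apim-Subscription-Region": TRANSLATOR_REGION,
    "Content-Type": "application/json",
    "X-ClientTraceId": str(uuid.uuid4()),
}
body = [{"Text": "こんにちは、世界"}]

response = requests.post(ENDPOINT, params=params, headers=headers, json=body)
response.raise_for_status()
print(response.json()[0]["translations"][0]["text"])
```
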
Azure AI Vision service is getting the following updates:

  • Liveness functionality and Vision SDK: Liveness functionality will help prevent face recognition spoofing attacks and conforms to ISO 30107-3 PAD Level 2. Vision SDK for Face will enable developers to easily add face recognition and liveness to mobile applications. Both features are in preview. 
  • Image Analysis 4.0: This API introduces cutting-edge Image Analysis models, encompassing image captioning, OCR, object detection and more, all accessible through a single, synchronous API endpoint. Notably, the enhanced OCR model offers improved accuracy for both typed and handwritten text in images. Image Analysis 4.0 is generally available (a minimal call sketch follows this list). 
  • Florence foundation model: Trained with billions of text-image pairs and integrated as cost-effective, production-ready computer vision services in Azure AI Vision, this improved feature enables developers to create cutting-edge, market-ready, responsible computer vision applications across various industries. Florence foundation model is generally available.
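
For the Image Analysis 4.0 item above, here is a minimal call sketch assuming the azure-ai-vision-imageanalysis Python package; exact class and property names can vary slightly between SDK versions, and the endpoint, key and image URL are placeholders.

```python
# Minimal sketch: captioning and OCR with Image Analysis 4.0 through one synchronous call.
# Endpoint, key and image URL are placeholders; requires the azure-ai-vision-imageanalysis package.
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

client = ImageAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",  # placeholder
    credential=AzureKeyCredential("<your-vision-key>"),               # placeholder
)

result = client.analyze_from_url(
    image_url="https://example.com/sample.jpg",                     # placeholder image
    visual_features=[VisualFeatures.CAPTION, VisualFeatures.READ],  # captioning + OCR
)

if result.caption is not None:
    print(f"Caption: {result.caption.text} (confidence {result.caption.confidence:.2f})")
if result.read is not None:
    for block in result.read.blocks:
        for line in block.lines:
            print("OCR line:", line.text)
```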

Finally, the new updates in Azure AI Services will make the process of extracting insights from videos easier than ever. You can now use Azure AI to get a text summary of video content. You can also search for specific topics, moments or details within extensive videos using natural language. Find the details below.

  • Video-to-text summary: Users will be able to extract the essence of video content and generate concise and informative text summaries. The advanced algorithm segments videos into coherent chapters, leveraging visual, audio and text cues to create sections that are easily accommodated in large language model (LLM) prompt windows. Each section contains essential content, including transcripts, audio events and visual elements. This is ideal for creating video recaps, training materials or knowledge-sharing.
  • Efficient Video Content Search: Users will be able to transform video content into a searchable format using LLMs and Video Indexer’s insights. By converting video insights into LLM-friendly prompts, the main highlights become accessible for effective searching. Scene segmentation, audio events and visual details further enhance content division, allowing users to swiftly locate specific topics, moments or details within extensive videos (a rough search sketch follows this list).
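
As a rough illustration of the search scenario above, the sketch below queries the existing Azure AI Video Indexer Videos/Search REST endpoint with a free-text query. The location, account id, access token and query are placeholders, and the new LLM-based summarization and prompt features announced here may be exposed through different routes.

```python
# Rough sketch: searching indexed videos in Azure AI Video Indexer by a text query.
# Location, account id and access token are placeholders; the access token comes from
# the Video Indexer authorization API and is not shown here.
import requests

LOCATION = "<your-account-location>"     # placeholder, e.g. "trial" or an Azure region
ACCOUNT_ID = "<your-account-id>"         # placeholder
ACCESS_TOKEN = "<account-access-token>"  # placeholder

response = requests.get(
    f"https://api.videoindexer.ai/{LOCATION}/Accounts/{ACCOUNT_ID}/Videos/Search",
    params={
        "query": "the moment the avatar feature is demoed",  # free-text query
        "accessToken": ACCESS_TOKEN,
    },
)
response.raise_for_status()
for video in response.json().get("results", []):
    print(video.get("name"), "-", video.get("id"))
```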
