Microsoft Research Is Building A Smart Virtual Assistant With Its Situated Interaction Project

Microsoft Monica Virtual Assistant

While we are all waiting to try the Cortana digital assistant on Windows Phone 8.1, Microsoft Research is working on virtual assistants that can handle real-world tasks for you, including dealing with other people. The long-term goal of Microsoft Research’s Situated Interaction project is to enable a new generation of interactive systems that embed interaction and computation deeply into the natural flow of everyday tasks, activities and collaborations. Example scenarios include human-robot interaction, e-home, interactive billboards, and systems that monitor, assist and coordinate teams of experts through complex tasks and procedures.

Such an assistant could coordinate with the assistants of other people, helping to schedule social engagements, work commitments, and travel. It could anticipate your needs based on past activities—such as where you have enjoyed dining—and coordinate with businesses that offer special deals. It could help you select a movie based on which ones your friends liked.

“Intelligent, supportive assistants that assist and complement people are a key aspiration in computer science,” Microsoft Research’s Eric Horvitz says, “and basic research in this space includes gathering data and observing people conversing, collaborating, and assisting one another so we can learn how to best develop systems that can serve in this role.”

Elevating human-computer interaction to a new level of sophistication

Microsoft’s current Monica virtual assistant demonstrates the following interaction features:

Basic interaction: illustrates a basic single-participant interaction with the system. Notice the various layers of scene analysis (the system tracks the user’s face and pose and infers information about clothing, affiliation, task goals, etc.) and the natural engagement model (the system engages as the user approaches).
Scene inferences and grounding: the system infers user goals from scene analysis (the user is dressed formally, hence is most likely an external visitor, hence probably wants registration), but grounds this information through dialog. Notice also the grounding of the building number.
Attention modeling and engagement: the system monitors the user’s attention (using information from the face detector and pose tracker) and engages the user accordingly.
Handling people waiting in line: the system monitors multiple users in the scene and acknowledges the presence of a waiting user with a quick glance (a red dot shows the system’s gaze) and by engaging them temporarily towards the end of the conversation.
Re-engagement: same as above, except that when the system turns back, the initial user is no longer paying attention. Knowing that a person is waiting in line, the system draws the user’s attention and re-engages by saying “Excuse me!”
Multi-participant dialog: the system infers from the scene (and confirms through dialog) that the two participants are in a group together. The system then carries on a multi-participant conversation. Notice the gaze model (red dot), which is informed by which participant is speaking and also by certain elements in the discourse structure.
Multi-participant dialog with side conversation: similar to the previous interaction; at the end the users engage in a side conversation. The system understands that the utterances are not addressed to it and, after a while, interrupts the two users to convey the shuttle information. Notice also the touch-screen interaction that is used as a fallback for cases when speech recognition fails.
Multi-participant dialog with a third person waiting: illustrates how the system handles a waiting participant while interacting with a group of two users.
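
To make the engagement behaviors above more concrete, here is a minimal Python sketch of how such decisions might be expressed as rules. It is purely illustrative and is not Microsoft's implementation: the Person fields, the Action names, the next_action function and the 1.5 m engagement radius are all hypothetical, and a real system would rely on probabilistic inference over face and pose tracking rather than hand-written rules.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Action(Enum):
    """Hypothetical actions the assistant can take on each update."""
    IDLE = "idle"
    ENGAGE = "engage approaching user"
    CONTINUE = "continue conversation"
    GLANCE_AT_WAITING = "glance at waiting user"
    REENGAGE = "say 'Excuse me!'"


@dataclass
class Person:
    name: str
    distance_m: float     # assumed: distance from the kiosk, in meters
    facing_system: bool   # assumed: derived from a face/pose tracker


def next_action(active: Optional[Person],
                others: List[Person],
                engage_radius_m: float = 1.5) -> Action:
    """Pick the next action from who is present and who is paying attention."""
    if active is None:
        # Engage only when someone is close enough and looking at the system.
        for person in others:
            if person.distance_m <= engage_radius_m and person.facing_system:
                return Action.ENGAGE
        return Action.IDLE

    someone_waiting = len(others) > 0
    if active.facing_system:
        # Acknowledge anyone in line without dropping the current user.
        return Action.GLANCE_AT_WAITING if someone_waiting else Action.CONTINUE

    # The current user looked away: interrupt only if someone else is waiting.
    return Action.REENGAGE if someone_waiting else Action.CONTINUE
```

For example, with these assumed rules, a distracted current user plus one person waiting in line yields the “Excuse me!” re-engagement action, while a distracted user with nobody waiting is simply left alone.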

Read more about this project from Microsoft Research.