Scary Microsoft patent would have Windows watch everything you do and send it to Bing for better search results
Browsing through Microsoft's patent library, we often come across ideas we wish the company had implemented but which never made it into a product.
Today we came across the opposite – an idea whose time we hope never comes.
The patent, “QUERY FORMULATION VIA TASK CONTINUUM”, published yesterday (22/9/2016), notes that searching is more efficient when more information about the user's intent is available. It gives the example of someone writing a school report on dancing: even though the user has already done some work in another application, when they turn to the browser the search engine knows nothing about the task beyond what has been typed into the search bar.
The inventors note:
People use multiple desktop applications in order to complete a single task. For example, if a user is researching the topic of “dancing” for school, the user will use a first application to write things down as well as a second application such as a browser, to search different styles of dancing. However, in existing systems, the two applications are completely disconnected from each other. The first application does not provide the browser implicit hints as to what the user might be seeking when there is a switch from the first application to the second application. The user perceives tasks in the totality. However, since applications are typically disconnected, and not mediated in any way by the operating system (OS), the computing system has no idea as to the overall goal of the user.
Microsoft’s solution to this conundrum is an agent or “mediator” that watches what the user is doing in “active 3rd party applications” such as a word processor or PDF reader – recognizing text in the images and photos they are looking at, identifying music or other audio, and noting their location and other contextual data – then strips out personally identifiable information and folds what remains into the search query to produce better-ranked, more focused results.
The patent notes:
The disclosed architecture comprises a mediation component (e.g., an API (application program interface) as part of the operating system (OS)) that identifies engaged applications—applications the user is interacting with for task completion (in contrast to dormant applications—applications the user is not interacting with for task completion), and gathers and actively monitors information from the engaged applications (e.g., text displayed directly to the user, text embedded in photos, fingerprint of songs, etc.) to infer the working context of a user. The inferred context can then be handed over to one of the applications, such as a browser (the inferred context in a form which does not cross the privacy barrier) to provide improved ranking for the suggested queries through the preferred search provider. Since the context is inferred into concepts, no PII (personally-identifiable information) is communicated without user consent—only very high-level contextual concepts are provided to the search engines.
The architecture enables the capture of signals (e.g., plain text displayed to the user, text recognized from images, audio from a currently playing song, and so on), and clusters these signals into contextual concepts. These signals are high-level data (e.g., words) that help identify what the user is doing. This act of capturing signals is temporal, in that it can be constantly changing (e.g., similar to running average of contextual concepts). The signals can be continuously changing based on what the user is doing at time T (and what the user did from T-10 up to time T).
When using the browser application as the application that uses the captured signals, the browser broadcasts and receives (e.g., continuously, periodically, on-demand, etc.) with the mediation component through a mediation API of the mediation component to fetch the latest contextual concepts.
When the user eventually interacts with, or is anticipated to interact with, the browser (as may be computed as occurring frequently and/or based on a history of sequential user actions that results in the user interacting with the browser next), the contextual concepts are sent to the search provider together with the query prefix. The search engine (e.g., Bing™ and Cortana™ (an intelligent personal digital speech recognition assistant) by Microsoft Corporation) uses contextual rankers to adjust the default ranking of the default suggested queries to produce more relevant suggested queries for the point in time. The operating system, comprising the function of mediation component, tracks all textual data displayed to the user by any application, and then performs clustering to determine the user intent (contextually).
The inferred user intent sent as a signal to search providers to improve ranking of query suggestions, enables a corresponding improvement in user experience as the query suggestions are more relevant to what the user is actually trying to achieve. The architecture is not restricted to text, but can utilize recognized text in displayed photos as well as the geo-location information (e.g., global positioning system (GPS)) provided as part of the photo metadata. Similarly, another signal can be the audio fingerprint of a currently playing song.
As indicated, query disambiguation is resolved due to the contextual and shared cache which can be utilized by various applications to improve search relevance, privacy is maintained since only a minimally sufficient amount of information is sent from one application to another application, and the inferred user context can be shared across applications, components, and devices.
The mediation component can be part of the OS, and/or a separate module or component in communication with the OS, for example. As part of the OS, the mediation component identifies engaged non-OS applications on the device and gathers and actively monitors information from the engaged applications to infer the working context of the user. The inferred context can then be passed to one of the applications, such as the browser, in a secure way to provide improved ranking for the suggested queries through the preferred search provider.
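Stripped of the patent-speak, the mediation half of this is easy enough to sketch. Below is a minimal, hypothetical Python illustration based on our reading of the filing: engaged apps are polled for the text they display, obvious PII is scrubbed at the source, and the surviving words are scored with a time decay so the inferred concepts track what the user is doing right now. Everything here – the class names, the regex filters, the decay factor – is our own invention; the patent describes an architecture, not an implementation.

```python
# A minimal, hypothetical sketch of the patent's "mediation component".
# Nothing here comes from the filing itself: the class names, the regex
# PII scrub, and the exponential decay are assumptions for illustration.
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class App:
    name: str
    engaged: bool       # is the user actively interacting with this app?
    visible_text: str   # plain text the app currently displays


# Crude stand-ins for the patent's "privacy barrier": strip obvious PII
# before any signal leaves the application boundary.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),           # email addresses
    re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),  # US-style phone numbers
]


class Mediator:
    """Tracks signals from engaged apps and infers a running context."""

    def __init__(self, decay: float = 0.8):
        self.decay = decay                 # newer signals outweigh older ones
        self.scores: Counter = Counter()

    def observe(self, apps: list[App]) -> None:
        """Ingest one time step of signals; dormant apps are ignored."""
        for word in list(self.scores):     # age out earlier signals
            self.scores[word] *= self.decay
        for app in apps:
            if not app.engaged:
                continue
            text = app.visible_text
            for pattern in PII_PATTERNS:
                text = pattern.sub("", text)   # scrub PII at the source
            self.scores.update(w.lower().strip(".,") for w in text.split())

    def concepts(self, n: int = 3) -> list[str]:
        """The current high-level contextual concepts, best first."""
        return [word for word, _ in self.scores.most_common(n)]


# The school-report example from the filing: a document open in a word
# processor while a music player sits idle in the background.
mediator = Mediator()
mediator.observe([
    App("word processor", engaged=True,
        visible_text="Styles of dancing for my school report jane@example.com"),
    App("music player", engaged=False, visible_text="Now playing: Thriller"),
])
print(mediator.concepts())  # e.g. ['styles', 'of', 'dancing'] – the email is gone
```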
In short, Clippy on steroids.
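The search-provider half is simpler still: the browser hands the inferred concepts to the engine along with the query prefix, and a contextual ranker re-scores the default suggestions. Here is a toy continuation of the sketch above, with invented suggestion scores and an invented boost weight; the patent says only that contextual rankers "adjust the default ranking".

```python
# A toy "contextual ranker", continuing the hypothetical sketch above.
# The base scores and boost weight are made up for illustration.
def rerank(default_scores: dict[str, float],
           concepts: list[str],
           boost: float = 1.5) -> list[str]:
    """Re-order default query suggestions using the inferred concepts."""
    def score(suggestion: str) -> float:
        overlap = sum(1 for c in concepts if c in suggestion)
        return default_scores[suggestion] + boost * overlap
    return sorted(default_scores, key=score, reverse=True)


# Default suggestions for the prefix "styles of", best first by base score.
defaults = {"styles of music": 0.9,
            "styles of dancing": 0.6,
            "styles of writing": 0.4}

# With the concepts inferred while the user wrote their dancing report,
# the dancing suggestion jumps to the top.
print(rerank(defaults, concepts=["dancing", "school"]))
# -> ['styles of dancing', 'styles of music', 'styles of writing']
```

In the patent's version, only these high-level concepts, never the raw text, would cross the privacy barrier on their way to Bing.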
The main concern with such a system is, of course, personal data leaking despite Microsoft's promised privacy safeguards – or the agent misreading the user's context, leading to more frustration (another Clippy problem).
On the other hand, a truly intelligent agent would certainly serve me better if it knew everything about me, and there are many who say privacy is dead already.
The patent is in some ways similar to Google's Now on Tap or Screen Search, which scrapes an application's screen for text and other information and then launches a contextual Google search. It does, however, sound a bit more far-reaching and a lot more autonomous.
What do our readers think of this patent? Let us know below.