Gottfried Wilhelm Leibniz was a mathematician, contemporaneous with Newton. It turns out Leibniz was also a philosopher, and there is something called the Leibniz Principle that equates to data conflation. In colloquial terms, the Leibniz Principle says, ‘If it looks like a duck, walks like a duck, quacks like a duck, chances are it’s a duck.’ That principle is the basis of Microsoft’s new Leibniz Platform.
Currently, search engines are thought of as tools to find text, but Ashok Chandra, Microsoft distinguished scientist and general manager of the Interaction and Intent Group at Microsoft Research Silicon Valley, believes people soon will think of search engines as “task engines.”
“Search technology began with words,” says Chandra. “We built a whole search infrastructure around words. But in this new era of search, we are working with entities, because people think in terms of them, such as a hotel, a movie, an event, a hiking trail, or a person. The Leibniz platform is designed from the ground up to deal in entities, with the goal of making it easier for people to accomplish the tasks they set out to do.”
The Leibniz entity-resolution system is now the underlying platform used in the task of booking hotel rooms, a feature of the Windows 8 Travel app. The hotel-booking feature of the Travel app is the result of collaboration between the Microsoft Bing Applications Experience (AppEx) group in Bellevue, Wash., and a team from the Interaction and Intent Group, consisting of Chandra, researcher Bo Zhao, senior program manager Dhyanesh Narayanan, and contract developer George Puchalski. The Travel app project, which began in January, has been one of the most challenging deployments of Leibniz to date.
Research for Leibniz began with a large-scale case study in resolving movie entities, described in the technical paper Improving Entity Resolution with Global Constraints, co-authored by Chandra and Silicon Valley lab colleagues Jim Gemmell and Benjamin I. P. Rubinstein. The system automatically resolved entities across movie-database websites such as IMDb, Netflix, iTunes, and AllMovie by conflating the data, that is, matching and combining information from disparate sources to create a data set that is more useful than the original data.
The movie-search functionality went into Bing in late 2010. Leibniz aggregated movie information and gave the search engine richer capabilities for supporting entity actions such as “rent,” “watch,” and “buy.”
“If you run into a few errors while searching for movies, the consequences are not serious,” Narayanan says. “For hotels, however, if a booking goes wrong, the user’s trip can be spoiled. Our biggest challenge was accurate conflation of hotel information. When we pulled together information about a hotel from various sources, we had to be sure it really was all about the same hotel.”
The challenge was to achieve data accuracy of 99.9 percent, in an industry in which accuracy typically runs between 90 and 95 percent. Another challenge: Because the Travel app is bundled as part of Windows 8, the project had to meet hard deadlines for supporting global markets, either at initial rollout or no more than a few weeks after. At the time, the AppEx group dealt with a single hotel provider, KAYAK.com, which supported markets only in the United States, Europe, and India. To achieve broader global coverage, the AppEx group added another hotel provider, Booking.com. This meant Leibniz had to conflate data from both partners.
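To put the accuracy target in perspective, a short Python sketch can convert each accuracy level into the number of wrong matches it permits. The batch of 10,000 conflated hotel records is a hypothetical figure chosen for easy arithmetic, not a number from the project, and accuracy is treated here as the share of matches that are correct.

```python
def max_false_matches(total_matches: int, accuracy_per_mille: int) -> int:
    """Largest number of wrong matches allowed at a given accuracy.

    The accuracy is given in tenths of a percent (999 means 99.9 percent)
    so that the arithmetic stays in whole numbers.
    """
    return total_matches * (1000 - accuracy_per_mille) // 1000


# Hypothetical batch size, chosen only for illustration.
TOTAL = 10_000

for label, per_mille in [("90%", 900), ("95%", 950), ("99.9%", 999)]:
    print(label, max_false_matches(TOTAL, per_mille))
# 90% 1000
# 95% 500
# 99.9% 10
```

Under these assumptions, moving from 95 percent to 99.9 percent cuts the permitted errors from 500 to 10 per 10,000 matches.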
The researchers began by running a simple model for matching, which then enabled Leibniz’s machine-learning algorithms to discover rules about how different providers described hotels and their attributes, such as the synonyms and semantics of various terms. As the rules were applied and put back into the model, the system could continue to learn more about how hotels were represented. The system became “smarter” with each iteration until it achieved the required level of accuracy. The researchers found that using hotel names and addresses to match information was no guarantee of success, a problem complicated by data errors and inconsistencies in both databases. Narayanan cites a typical example of inconsistency.
“For example, in Las Vegas, one provider listed the Bellagio hotel as ‘The Bellagio,’” he says. “The other provider called it ‘The Bellagio Casino Hotel.’ So initially, the app was not able to make the match. But then Leibniz helped us notice that Las Vegas hotels and casinos are pretty much the same thing, and the system augmented the model.”
“In some cases, the same entity looks quite different to the system,” Zhao explains. “An inn could be listed as a B&B by one provider and as a bed and breakfast by the other. We often had to resolve between different address formats or sometimes deal with addresses that were just plain wrong.”
In one memorable case, similar-sounding entities turned out to be quite different: two Marriott properties were at the same address and bore similar names, but operated as separate hotels.
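The kind of rule Narayanan and Zhao describe can be illustrated with a minimal Python sketch. It is not Leibniz code: the synonym table, the ignored words, the token-overlap (Jaccard) score, and the 0.8 threshold are all assumptions chosen for illustration, and the inn name is hypothetical. Leibniz learns its rules from data rather than reading them from a hand-written table.

```python
import re

# Hypothetical synonym rules of the kind the article describes.
# Each variant phrase is rewritten to one canonical phrase.
SYNONYMS = {
    "b&b": "bed and breakfast",
    "casino hotel": "hotel",
}

# Hypothetical filler words that carry little identifying information.
IGNORED = {"the", "hotel"}


def normalize(name: str) -> frozenset:
    """Lower-case a name, apply synonym rules, and return its distinct tokens."""
    text = name.lower()
    for variant, canonical in SYNONYMS.items():
        text = text.replace(variant, canonical)
    tokens = re.findall(r"[a-z0-9]+", text)
    return frozenset(t for t in tokens if t not in IGNORED)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Share of tokens the two sets have in common (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def same_hotel(name_a: str, name_b: str, threshold: float = 0.8) -> bool:
    """Treat two names as one hotel when their token overlap meets the threshold."""
    return jaccard(normalize(name_a), normalize(name_b)) >= threshold


print(same_hotel("The Bellagio", "The Bellagio Casino Hotel"))   # True
print(same_hotel("Rose Inn B&B", "Rose Inn Bed and Breakfast"))  # True
```

As the Marriott case shows, a high name score alone cannot confirm a match, so a check like this would be only one signal among several.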
“We had to arrive at 99.9 percent precision for each new market as quickly as possible,” Zhao says. “Whenever you add a new country, there are different conventions for hotel names, addresses, and attribute descriptions, so the system has to learn and apply some new rules. There is no existing list of equivalents or synonyms. Fortunately, Leibniz is always building on what it already understands so it just has to learn a few more.”
“With the first market we brought onboard, we had to run about 20 to 30 iterations to achieve 99.9 percent precision,” Narayanan recalls. “But a lot of those rules carried over to the next market, and the next, and so forth. The number of iterations started falling off pretty quickly. Some of the markets we onboarded near the end didn’t need any iterations at all. China and Japan were quite different, especially Japan, but even those were not too bad.”
In a previous version of the Travel app, people would see only the cheapest room deals for a hotel, even if it was not the type of room the person wanted. With the new Travel app, the booking page has been redesigned to show information from KAYAK.com as before, but it also brings in other options via Booking.com to provide rooms and rate-plan details, empowering people with richer information for making decisions.
While Leibniz is the underlying engine for conflation, it is a platform that comes with a comprehensive set of tools, all of which proved extremely useful for the project’s administrators, program managers, labelers, developers, and testers. The tools simplified the work of labeling and training the models while also streamlining the deployment and management of the system during production.
“The key goal of the AppEx team is to build high-value experiences that help attract users to our device platforms,” Batterberry says. “With the Leibniz technology from Microsoft Research, we’re able to conflate content from multiple sources with an unprecedented three nines of precision to help the user more effectively complete their task with the best possible experience.”
What’s next for Leibniz?
“The Leibniz platform is domain-agnostic,” Zhao says. “The models for each domain are different, but the same code base can be used to conflate data, whether movies or TV shows or hotels. Therefore, Leibniz is ideal in any situation where applications need to gather data from multiple sources and require high-quality conflation.”
Source: Microsoft Research