Sunday, October 25, 2009

A contextual search experience for Wikipedia

Wikipedia users can now configure a Custom Search skin to customize their Wikipedia search experience. Once configured, the skin lets you search Wikipedia as a whole, as well as contextually relevant articles, from any Wikipedia page. This can make it easier to find relevant information, especially on Wikipedia pages with many links, or when the topics you are researching are ambiguous. You can find instructions for configuring the Custom Search skin at Wikipedia. It works with Wikipedia's Monobook and Beta Vector skins, and should work on Wikipedia domains globally. Remember that you need a user account and must be logged in to Wikipedia to use it.

With the skin configured, if you are reading the Wikipedia page on NASA and search for the query [mars], you are presented with inline results organized into three tabbed groups: All Wikipedia pages, Linked Wikipedia pages, and Linked non-Wikipedia pages. The first tab shows all Wikipedia articles that match, including those about the candy (Mars Bars) and the television series (Veronica Mars). The next two tabs provide contextually relevant results that are linked from the NASA page, such as information about various Mars rovers, orbiters, and space labs, as shown in the screenshot.



Here's what's going on under the covers:

Linked Custom Search enables the creation of dynamic search experiences, where the content being searched can be defined on the fly, and can change over time as new information becomes relevant. The Custom Search skin creates a Linked Custom Search engine on demand for every Wikipedia page that you navigate to.

The results from the current Wikipedia domain, as well as the results from the per-page dynamic search engine, are presented inline in tabbed categories via the AJAX search API. You can refine results by the category of your choice, and quickly review the results without having to open a new browser window or tab. The JavaScript code in the skin handles this, and the skin's CSS defines the look and feel of the results.

As for the page-specific Linked Custom Search engine, it computes the contextual results for the Linked Wikipedia pages (on-domain) and Linked non-Wikipedia pages (off-domain) categories. These two tabs are technically very similar, so we'll just describe how one of them works.

Suppose you're visiting the NASA article and search for [mars]. The Linked Wikipedia tab sends the search query to Google Custom Search, along with a parameter that indicates that the search engine specification is at (view source in browser):


Google picks up this Linked CSE request and uses the above specification and the supplied query. You can simulate this process by visiting:


A different specification is generated for every Wikipedia page (based on its URL) by a tiny AppEngine application at http://googlecustomsearch.appspot.com. The specification defines a search engine with two facets, labeled "internal" (Linked Wikipedia pages) and "external" (Linked non-Wikipedia pages). The list of "internal" and "external" webpages to search over is provided by this line in the specification:

<Include href="http://googlecustomsearch.appspot.com/wikipedia/annotations.do?url=en.wikipedia.org%2Fwiki%2FNASA" type="Annotations"/>
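To make the shape of that URL concrete, here is a minimal Python sketch of how such an href could be assembled for an arbitrary page. The endpoint and the `url` parameter come from the line above; the function names are hypothetical, and we are assuming the page address is simply percent-encoded with slashes escaped, as the NASA example suggests.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Endpoint copied from the Include line in the specification above.
ANNOTATIONS_ENDPOINT = "http://googlecustomsearch.appspot.com/wikipedia/annotations.do"


def annotations_href(page: str) -> str:
    """Build the annotations URL for a scheme-less Wikipedia page address.

    Hypothetical helper: the real webapp may construct this differently.
    """
    # safe="" forces "/" to be encoded as %2F, matching the NASA example.
    return f"{ANNOTATIONS_ENDPOINT}?url={quote(page, safe='')}"


def page_from_href(href: str) -> str:
    """Recover the page address from an annotations URL (inverse of the above)."""
    return parse_qs(urlsplit(href).query)["url"][0]


href = annotations_href("en.wikipedia.org/wiki/NASA")
print(href)
print(page_from_href(href))
```

The round trip through `page_from_href` is only there to show that the encoding is lossless; nothing in the post says the webapp decodes the parameter this way.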

This causes Google to visit the webapp at a new URL (annotations.do). Our webapp now collects links from the NASA article, classifies them as "internal" or "external", and returns the annotations in an XML format. You can see the result at (view source in browser)


Now Google can finish building the Custom Search engine for the NASA article, and compute the results for [mars]. The results are returned to your web browser and displayed in the appropriate tab.

But wait! Our little AppEngine webapp doesn't have the CPU horsepower or bandwidth to scan Wikipedia pages on demand, or in near real time, for thousands of Wikipedia users. Instead, the webapp asks Google to scan the page, via a Custom Search tool called makeannotations. The request looks something like this:


After makeannotations returns the list of links in the NASA article in XML, the webapp simply rewrites the XML according to the domain of each link.
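The rewriting step can be sketched roughly as follows. This is an illustration only, not the webapp's actual code: the function names and sample links are hypothetical, and the output uses an `Annotations`/`Annotation`/`Label` layout as an assumption about the annotations XML, with the "internal" and "external" label names taken from the facets described earlier.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def classify_link(link: str, wiki_host: str = "en.wikipedia.org") -> str:
    """Return "internal" if the link points at wiki_host, else "external"."""
    host = (urlsplit(link).hostname or "").lower()
    return "internal" if host == wiki_host else "external"


def build_annotations(links, wiki_host: str = "en.wikipedia.org") -> str:
    """Emit one labeled annotation per link (assumed XML layout)."""
    root = ET.Element("Annotations")
    for link in links:
        annotation = ET.SubElement(root, "Annotation", {"about": link})
        ET.SubElement(annotation, "Label", {"name": classify_link(link, wiki_host)})
    return ET.tostring(root, encoding="unicode")


# Illustrative links only; the real list would come from makeannotations.
sample_links = [
    "http://en.wikipedia.org/wiki/Mars_rover",
    "http://www.nasa.gov/",
]
annotations_xml = build_annotations(sample_links)
print(annotations_xml)
```

Comparing hostnames exactly keeps the sketch simple; a real implementation would also have to decide how to treat other Wikipedia language domains and relative links.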

Since we are creating the per-page search engines on demand, there can sometimes be a short delay in the creation of the search engine, e.g., for new or obscure pages. However, for popular Wikipedia pages, these definitions should be cached, and you should see no delays. In fact, we use a ping method to load the Custom Search engine before you search. Note that if the Wikipedia page you are searching from has few links, you may sometimes find no matches for linked pages.
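The idea of caching plus an advance ping can be illustrated with a small Python sketch. Everything here is hypothetical: the post does not describe where or how the caching is implemented, so `SpecCache`, its methods, and the stand-in build function are assumptions made purely to show why a ping on page load removes the delay from the first search.

```python
from typing import Callable, Dict


class SpecCache:
    """Toy cache of per-page search engine definitions (illustrative only)."""

    def __init__(self, build: Callable[[str], str]) -> None:
        self._build = build            # slow step: create the definition
        self._specs: Dict[str, str] = {}
        self.builds = 0                # how many times the slow step ran

    def get(self, page: str) -> str:
        """Return the cached definition, building it on first use."""
        if page not in self._specs:
            self._specs[page] = self._build(page)
            self.builds += 1
        return self._specs[page]

    def ping(self, page: str) -> None:
        """Warm the cache ahead of a search; the result is discarded."""
        self.get(page)


# Stand-in for the real, slower engine creation.
cache = SpecCache(lambda page: f"<spec for {page}>")
cache.ping("en.wikipedia.org/wiki/NASA")        # triggered when the page loads
spec = cache.get("en.wikipedia.org/wiki/NASA")  # later search hits the cache
print(cache.builds)
```

Because the ping already paid the build cost, the later `get` call returns immediately from the cache.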

We've open-sourced the code for this application. Feel free to work with it, and to extend the skin beyond Monobook and Vector. We built this skin with the help of Wikipedia, and hope that you will provide feedback on your experience. You can also provide your feedback directly to Wikipedia.