October 11, 2006

We are all natural language searchers

My Powerset co-founder, Lorenzo Thione, has written a nice article on his
blog, in which he argues that we are all natural language searchers
(http://blog.lorenzothione.com/2006/10/we_are_all_natural_language_se.html).
He surveyed the underlying themes in much of the criticism in the current
blogstorm about Powerset and natural language search. Lorenzo groups the
arguments in support of keyword search into three clusters:

Lorenzo’s article addresses each of these points in turn, and it is good
reading so I won’t summarize all the key points here. I particularly like
his response to the “most queries today are short” critique. He introduces
the idea of the long tail of failed queries, in which users initially
try more natural queries stating what they want, but eventually learn that
it doesn’t help with the search, so they shorten the queries, which leads to
the observation that most queries today are short. It’s a bit like looking
at the fact that all Model-T cars were black, after Henry Ford decided
that’s all he would give them, and concluding that there was no market for
colorful cars. As Lorenzo says:

The data so far about short queries and past failures of natural language
attempts is no indication about what users will really do or not do, as
users have never yet been presented with the possibilities of true natural
language search.

Combined with my previous post on my vision of natural language search
(http://www.barneypell.com/archives/2006/10/powerset_and_na.html), this
gives a good view of our perspectives on what we think is obvious: that
users will ultimately want to interact with search engines in natural
language, not just keywordese.

Posted by barney on October 11, 2006 at 2:42 pm | No Comments

October 11, 2006

The Powerset Blogstorm: 1 week later

I wrote a week ago about how Powerset had become the subject of a blog
storm, and shared my vision of natural language search. Little did I realize
that the storm had barely started. One week later, there are now about 400
blog articles about Powerset, according to Technorati (over 100 with some
authority). We got covered by many of the leading writers on search and
internet technology. Below are a few comments on some of the articles by
high-authority bloggers.


Posted by barney on October 11, 2006 at 1:09 am | 2 Comments

October 4, 2006

Powerset and Natural Language Search

Ever since I stated that Powerset was in “semi-stealth” mode about a year ago, I have been pretty quiet about the company on my blog. A few months ago we realized, after going through a fundraising process with a great set of angel investors, that much of Silicon Valley already knew that Powerset was building a natural language search engine. So we finally put some content up on the Powerset website and agreed to let some of our friends write about us. Some of the first articles about Powerset are those by:

But I have been so busy with the company that I just didn’t take the time to write up the vision on my own blog.

Powerset has now unexpectedly become the subject of a recent blogstorm, initiated by an article posted yesterday by Matt Marshall on VentureBeat. Since Matt wrote his initial version of the article before he was able to contact us, he expressed skepticism about what he inferred we are trying to do. (Update: Matt Marshall has just written a new article about Powerset, after meeting with me and Steve yesterday.) This article started a debate in the blogosphere, with people coming down on both sides of the “search is great, nobody can compete with Google” vs. “search is broken, go for it” divide (for the former, see Steve Bryant’s article, and for the latter, see this article by Richard Koman).

Given all the attention, I want to take time out to share my vision of natural language as the future of search. To start with, I will characterize the conventional thinking as expressed by various critics.

Search today

Search has become much better than it used to be, and users have become pretty familiar with using the keyword-oriented search input language. While query length is going up slowly but steadily, the average query is still 2-3 words. Even search engines like Ask Jeeves that encouraged users to enter questions still wound up with mostly short keyword queries.

From these facts it is easy to draw the following conclusions:

  • users don’t like typing and will not enter more than 2-3 words.
  • natural language search has been tried and found lacking.

If that’s true, then we will have to settle for short keyword queries for the indefinite future. Unfortunately, the limited query length puts fundamental limits on how much information is communicated to the search engine. This in turn limits what a search engine, however intelligent, could possibly do to improve the results (using more information about the searcher and the search context can help, and is the subject of much active research). Looking at this situation, it is easy to see why it seems like the search industry has matured and hit a plateau. Future innovation will come from extending search in various ways, but not from any fundamental changes in the core.

So there we are, right? Well, not exactly. First, note that this logic is very similar to the conventional wisdom before Google came on to the scene. Search was good enough, not a differentiator anymore, and the big players had turned their focus to innovate in other dimensions away from search (e.g. to becoming integrated media and technology companies).

Second, does the data so far really prove that users are generally satisfied with search, that they like to express themselves to search engines the way they do today, and that they wouldn’t try searching in a new way even if it promised better results?

Who is satisfied communicating in an impoverished language?

An analogy with natural languages is helpful here. Suppose you live in France and don’t speak any French. Life is very difficult when you don’t speak the language. Even basics like getting food and finding a bathroom can be a challenge. Then you study French for a year. Life is much better: you can accomplish all your basic daily tasks and even have extended conversations. The difference is like night and day. But does that mean you would be fully satisfied with your first-grade-level French? Of course not. Even after studying for a year, what you can say in French pales in comparison to what you can think in your own native language. After living for a while in a country where you must speak a foreign language, it is easy to stop trying to express complex thoughts, but that doesn’t mean the thoughts go away. Rather, each new word and construction you gain opens up new possibilities for communication. As you improve, your conversation partners (native speakers) gradually begin to see more of the true intelligence you have had all the time, even though you couldn’t express it until you learned more of their language.

I believe we are in much the same situation with respect to search today. But to see this, it is helpful to look more closely at the history and mechanisms of search.

A brief history of search query languages

What is search? At its most abstract, a user enters a query to a search engine, and the search engine displays a readable set of results to the user. Most search engines do not actually go out and find live documents in response to a user’s query. Instead, they find a large set of documents in advance, process those documents, and then build an index. They consult this index in response to a user’s query to find a set of matching documents. Then they rank the potential matches and present the top ranking results. The ranking of the matches, and possibly the short presentation of each result, are tailored to the query and potentially any other information available to the search engine.
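The index-then-rank flow described above can be sketched in a few lines of Python. This is only a toy illustration under simple assumptions: the sample documents, the function names (tokenize, build_index, search), and the occurrence-count ranking are all made up for the example and do not describe how any real search engine works.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a real engine would do far more."""
    return text.lower().split()

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, documents, query, top_k=3):
    """Return ids of documents containing every query word, best first."""
    words = tokenize(query)
    if not words:
        return []
    # Consult the prebuilt index instead of scanning the documents.
    matches = set.intersection(*(index.get(w, set()) for w in words))

    # Toy ranking (an assumption): count occurrences of the query words.
    def score(doc_id):
        tokens = tokenize(documents[doc_id])
        return sum(tokens.count(w) for w in words)

    return sorted(matches, key=lambda d: (-score(d), d))[:top_k]

# Hypothetical sample documents, gathered and indexed "in advance".
documents = {
    1: "search engines build an index of documents",
    2: "an index makes search fast and search results relevant",
    3: "natural language is expressive",
}
index = build_index(documents)
print(search(index, documents, "search index"))  # [2, 1]
```

Document 2 ranks first only because it mentions "search" twice; real ranking uses far richer signals.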

As seen in this abstract description, the user experience in search starts with the query. The query language has a big impact on every aspect of the experience. The earliest search engines, called Information Retrieval (IR) systems, required users to enter queries in a boolean language (a bunch of keywords modified with AND, OR, and NOT). This was powerful, and pretty effective for people like librarians who were trained on these systems, but distinctly unnatural.
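To make the boolean query language concrete, here is a minimal sketch that treats AND, OR, and NOT as set operations over a word index. The three sample documents and the helper names (term, negate) are hypothetical, chosen only for illustration.

```python
# Hypothetical sample collection.
documents = {
    1: "a book for children",
    2: "a movie for children",
    3: "a book about a movie",
}

# Build a word -> set-of-document-ids index.
index = {}
for doc_id, text in documents.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(doc_id)

all_ids = set(documents)

def term(word):
    """Documents that contain the word."""
    return index.get(word, set())

def negate(ids):
    """NOT: every document outside the given set."""
    return all_ids - ids

book_and_children = term("book") & term("children")      # book AND children
book_or_movie = term("book") | term("movie")             # book OR movie
book_not_movie = term("book") & negate(term("movie"))    # book AND NOT movie

print(sorted(book_and_children))  # [1]
print(sorted(book_or_movie))      # [1, 2, 3]
print(sorted(book_not_movie))     # [1]
```

The operators are precise, but the user has to think in sets of documents rather than in sentences, which is why this style suited trained librarians more than casual searchers.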

Later systems let users enter queries in less formal language: free-text queries. A free-text query is just a list of words with no operators. The boolean-centric IR community called this “natural language query”, and our modern search engines are direct descendants of this approach. A free-text search engine translates a free-text query into a boolean query (implicitly assuming the words are coordinated by AND or OR operators) and then sends it to the same kind of boolean search engine as before. But rather than using all the words in the query, the translator splits the query words into two sets: keywords and stopwords. The keywords are the meaningful content-bearing words, the ones that a trained boolean search user would have put into the query, as they only want documents that have these words (or not). The stopwords are the words that novice users enter because they are natural when entering text queries, but which are so frequent in most documents that they add no actual information usable by a boolean search engine (in fact, if they were included they would make the search results worse).
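The translation step described above might look roughly like the following sketch. The stopword list and the function names (split_query, to_boolean) are assumptions for illustration; real engines use their own, much longer lists and more careful tokenization.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "for", "by", "about", "in", "to"}

def split_query(query):
    """Split a free-text query into (keywords, stopwords)."""
    words = query.lower().split()
    keywords = [w for w in words if w not in STOPWORDS]
    stopwords = [w for w in words if w in STOPWORDS]
    return keywords, stopwords

def to_boolean(query, operator="AND"):
    """Join the keywords with an implicit operator; stopwords are dropped."""
    keywords, _ = split_query(query)
    return f" {operator} ".join(keywords)

print(split_query("the history of search engines"))
# (['history', 'search', 'engines'], ['the', 'of'])
print(to_boolean("the history of search engines"))
# history AND search AND engines
```

Nothing in the boolean output records that "the" and "of" were ever typed, so the engine downstream cannot make use of them.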

The result of this split is that from the perspective of the search engine, there is no difference at all between a user query that contains keywords and stopwords and a user query that just contains the keywords. The stopwords are completely ignored. As a consequence, as users gain experience with these search engines, they learn that the stopwords don’t have any value, and they just save themselves the trouble of typing them in their queries. The result is that the effective query language used by users trained in free-text queries is not natural language, but rather a keyword sequence language, which I call “keywordese”. Most novice searchers, and even skilled searchers who are frustrated in a search session, still put stopwords in their queries, but skilled searchers consider this to be just silly.

The expressive power of natural language search

But is it really silly to want to use stopwords, just something a person does until he learns how to search effectively? On the contrary, I think this perspective reveals something fundamental about the state of search today. To begin with, let’s look at the words that are “stopwords”, the words that don’t get any respect from the search engines. They are the little function words that put together the meaning of a phrase in a natural language like English. They are little because they are so frequent and useful in language. Words like “by”, “for”, “about”, “of”, and “in” are all stopwords. But consider how valuable they are to communicating intent among humans. To a keywordese search engine, “book for children”, “book by children”, and “book about children” are all equivalent to “book children”. Using only keywords, it is not even clear how one could possibly express these different queries.
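The collapse of those three queries is easy to demonstrate. The sketch below compares a keyword-only view with a toy reading that keeps the function word as a relation. The three-word pattern and the function names (keywordese, with_relation) are hypothetical simplifications, not a description of how a natural language engine actually parses queries.

```python
# The function words named in the text above.
FUNCTION_WORDS = {"by", "for", "about", "of", "in"}

def keywordese(query):
    """What a keyword engine sees: the content words only."""
    return tuple(w for w in query.lower().split() if w not in FUNCTION_WORDS)

def with_relation(query):
    """Toy reading of 'HEAD FUNCTION_WORD MODIFIER' queries.

    Returns (head, relation, modifier), or None if the query does not
    fit this deliberately narrow three-word pattern.
    """
    words = query.lower().split()
    if len(words) == 3 and words[1] in FUNCTION_WORDS:
        return (words[0], words[1], words[2])
    return None

queries = ["book for children", "book by children", "book about children"]

print({keywordese(q) for q in queries})
# {('book', 'children')}
print(len({with_relation(q) for q in queries}))
# 3
```

Three different intents become one keyword query, while even this crude relation-preserving reading keeps them apart.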

This brings us back to the human language issue and the French analogy. “Keywordese” is a really impoverished language. It is much less expressive than even first year French. Normal people have learned human languages all their lives, and that language learning ability is based on aeons of biological and cultural evolution. We are all masters of communicating our intent to other people. But when it comes to search engines, we have to revert to an impoverished foreign language in which it is impossible to express anything but the most basic thoughts. It is akin to using a pidgin language, the kind invented by two groups of people who speak different languages so that they can communicate through a combination of individual words and gestures without any real syntactic structure.

This motivates the idea of true natural language search. Instead of keywordese or even advanced keywordese (which few people can remember how to use), true natural language queries have linguistic structure. This includes queries where the function words matter, where word order means something, and where relationships that should be easy to state explicitly can in fact be stated. Instead of ignoring the function words, a natural language search engine respects their meaning and uses it to give better results. Adding stopwords to a query is no longer a waste of the user’s time: each little word added has a profound effect on the search quality.

Going beyond stopwords and even content words in documents, language can also be used to specify information about the type, nature or organization of the information that one is seeking. Understanding queries will allow search engines to separate the content being searched for from information about the type of content or its organization (meta-content). For example, “synopsis of books about the civil war” or “trailers of action movies by Steven Spielberg”.
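As a rough sketch of what separating content from meta-content could mean for the two example queries, the toy function below peels off a leading content-type word followed by "of". The list of content types and the name split_meta are invented for this example; genuine query understanding would need real linguistic analysis rather than a fixed pattern.

```python
# Hypothetical list of content-type words, for illustration only.
CONTENT_TYPES = {"synopsis", "trailers", "reviews", "summary"}

def split_meta(query):
    """Separate a leading '<type> of' from the content being sought."""
    words = query.split()
    if len(words) >= 3 and words[0].lower() in CONTENT_TYPES and words[1].lower() == "of":
        return {"type": words[0].lower(), "content": " ".join(words[2:])}
    # No recognizable meta-content: treat the whole query as content.
    return {"type": None, "content": " ".join(words)}

print(split_meta("synopsis of books about the civil war"))
# {'type': 'synopsis', 'content': 'books about the civil war'}
print(split_meta("trailers of action movies by Steven Spielberg"))
# {'type': 'trailers', 'content': 'action movies by Steven Spielberg'}
```

A keyword engine would instead treat "synopsis" and "trailers" as just more words to match in the documents.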

By tapping into the expressive power that people already know and use every day in natural language, all users can let their true intelligence come through in their interactions with a search engine. This benefits everyone. Natural language search has the potential to turn novice searchers into power searchers, and to enable all searchers to do things that are simply impossible with search today.

“Remember when we used to use keyword search?”

Seen in this light, there is enormous room for fundamental innovation in search, as the game has only just begun. I believe we are going to look back 5 to 10 years from now and say: “Remember when we used to search using keywords?” It will take hard work to get there, but that’s what we’re working on at Powerset!

Posted by barney on October 4, 2006 at 10:26 pm | 10 Comments