<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Barney Pell&#039;s Weblog &#187; Information retrieval</title>
	<atom:link href="http://www.barneypell.com/archives/information-retrieval/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.barneypell.com</link>
	<description></description>
	<lastBuildDate>Thu, 17 Dec 2009 09:20:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Wolfram Alpha: A New Kind of Question-Answering System</title>
		<link>http://www.barneypell.com/2009/03/wolfram-alpha-a-new-kind-of-question-answering-system/</link>
		<comments>http://www.barneypell.com/2009/03/wolfram-alpha-a-new-kind-of-question-answering-system/#comments</comments>
		<pubDate>Mon, 23 Mar 2009 22:03:15 +0000</pubDate>
		<dc:creator>Barney</dc:creator>
				<category><![CDATA[Collective Intelligence]]></category>
		<category><![CDATA[Human Language Technology]]></category>
		<category><![CDATA[Information retrieval]]></category>
		<category><![CDATA[Powerset]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[Web/Tech]]></category>

		<guid isPermaLink="false">http://174.120.172.92/~barneype/?p=124</guid>
		<description><![CDATA[There has been much excitement recently over the upcoming launch of Wolfram Alpha. This is a new question-answering system developed by Stephen Wolfram, inventor of Mathematica, and it is scheduled for a beta launch in May. Wolfram has been providing demos to industry insiders. I haven’t had a demo yet, but I have learned what [...]]]></description>
			<content:encoded><![CDATA[<p>There has been much excitement recently over the upcoming launch of Wolfram Alpha. This is a new question-answering system developed by Stephen Wolfram, inventor of Mathematica, and it is scheduled for a beta launch in May. Wolfram has been providing demos to industry insiders. I haven’t had a demo yet, but I have learned what I could from reading articles by Nova Spivack (<a href="http://www.techcrunch.com/2009/03/08/wolfram-alpha-computes-answers-to-factual-questions-this-is-going-to-be-big/">“Wolfram Alpha computes answers to factual questions. This is going to be big”</a>) and Doug Lenat (<a href="http://www.semanticuniverse.com/blogs-i-was-positively-impressed-wolfram-alpha.html">“I was positively impressed with Wolfram Alpha”</a>). This weekend I also spoke with William Tunstall-Pedoe, CEO of <a href="http://www.trueknowledge.com/">True Knowledge</a>, who got a demo as well.  Many of my examples and conclusions come from my conversation with William (thanks!).  Since life is short and so is the attention of web readers, I&#8217;ll give the rest of my thoughts in bullet form.</p>
<p><strong>What it is: A new kind of question-answering system. </strong></p>
<p><strong>Examples</strong></p>
<ul>
<li> Math: &#8220;2+2&#8221; and then a few simple math questions: &#8220;integrate x sin^4 x dx&#8221;, &#8220;what is the square root of 18&#8221;, etc.</li>
<li> Business: “gdp france” showed amount and graph of how it changed over time. “gdp france/germany” showed graph with both amounts and the ratio</li>
<li> “internet users in Europe”: Showed total, and a chart of usage by country in Europe, at the current time, specifically highlighting the biggest and smallest</li>
<li> “ISS”: generates a graphic rendition of the international space station orbiting earth and updating in real-time</li>
<li> “tides in san Francisco”: showed a graph of tides over time, where the times were listed in the local time regime current in the late 19th century for those data points. “tide NYC 11/12/1922” gave a single answer.</li>
<li> “weather”: showed graph of average temperature in Cambridge, MA (where Stephen was when doing the demo). Based on reverse IP lookup.</li>
<li> Computational fluid dynamics: typing in the name of a specific aerofoil produced a picture of that aerofoil along with its differential equations.</li>
<li> stock prices:  “MSFT CSCO” showed comparison chart</li>
<li> chemicals: for substances at a given temperature or pressure, it calculated physical properties. “H2SO4” showed a diagram and chemical properties. &#8220;5 molar h2so4&#8221; did something cool, I don’t know what.</li>
<li> genome sequences: “AGTAG” shows sequences from the human genome that match that pattern</li>
<li> data about people: “How old is Barack Obama” gives his age now. “When was Alan Turing born” gives the answer. “How old is Alan Turing” (a trick question) gives an error message with no human-readable explanation (True Knowledge, by contrast, tells you exactly why this is a trick question).</li>
</ul>
<p><strong>Coverage of data: It answers questions over the following types of structured data:</strong></p>
<ul>
<li> static tables and databases (e.g. a database of internet usage by country by year)</li>
<li> dynamic data feeds (e.g. historical stock market data, position of space shuttle, weather)</li>
<li> numerical inference (e.g. math questions)</li>
<li> numerical computations and simulations (e.g. tides, astronomy, chemistry)</li>
</ul>
<p><span id="more-124"></span></p>
<div id="a000132more">
<div id="more">
<p><strong>Form of queries</strong></p>
<ul>
<li> The queries are expressed in template-based natural language or corresponding abbreviated forms</li>
<li> NL syntax: “what is the gdp of france”</li>
<li> Template compressed: {attribute} of {object} {time}  (“gdp france 2008”)</li>
<li> Mathematical expressions, or NL versions of these (as one might do in an entry-level LISP class)</li>
<li> I can imagine the query language supports (or could support) restrictions on presentation (plot, chart) and other constraints one might express in SQL (order by, etc.), though I haven’t seen any examples showing this exists at present.</li>
</ul>
<p><strong>Presentation and Answers</strong></p>
<ul>
<li> Answers can be a single fact, a table, or a graphical display of a live simulation.  Usually it’s a combination of these.</li>
<li> For ambiguous queries, it always picks one interpretation, and you can switch to another if that’s wrong (via a drop-down menu of alternatives).</li>
</ul>
<p><strong>Domains and Generality</strong></p>
<ul>
<li> Wolfram Alpha is described as an open domain question answering system on structured data. But how exactly is this open domain? I distinguish three levels of domain generality:
<ul>
<li> Closed domain: A single specified domain</li>
<li> Multi domain: Multiple domains are covered and more may be added over time, but each one is still treated as closed. Note: this can be accomplished through a unified or disjoint treatment.</li>
<li> Open domain: Any domain is within scope</li>
</ul>
</li>
<li>For Wolfram Alpha they have taken a domain-by-domain approach. For each domain, they determined what type of questions to support, and which data, feeds, or simulations to incorporate, and did hand curation to enable these.</li>
<li> The domains are typically fact and data oriented, especially where simulations are available</li>
</ul>
<p><strong>Architecture</strong></p>
<ul>
<li> The system is coded in Mathematica, about 4.5M lines of code, developed by a large team (100 people at present).</li>
<li> From this <a href="http://www.wolfram.com/products/mathematica/quickoverview/">presentation on Mathematica</a> it is quite easy to extrapolate what Wolfram Alpha is like &#8211; essentially Mathematica + a vast library of mathematical models and data attached + some error-tolerant processing of the user&#8217;s input (thanks Peter Clark for pointing this out).</li>
<li> Piecing together the Mathematica approach and generalizing from the examples and my own knowledge, I believe they have a basic level of representational tools that gets shared across multiple domains. Here&#8217;s how I would think about this:
<ul>
<li> Define the objects in the domain</li>
<li> Make a table of function names and attributes in the domain, and for each function or attribute list the restrictions on the type of objects that this can apply to.</li>
<li> Standardize representations of time and place and charting elements associated with these.</li>
<li> Import and normalize data</li>
<li> Associate data fields to objects and attributes in the domain</li>
</ul>
</li>
</ul>
<p><strong>Infrastructure</strong></p>
<ul>
<li> The system runs on thousands of expensive servers (running Mathematica in real time).</li>
<li> Apparently 10 machines per query give 1 query per second (qps), so they can do 100 qps on 1,000 machines.</li>
</ul>
<p><strong>What is innovative about this</strong></p>
<ul>
<li> Rich mathematical computational infrastructure (Mathematica) to support mathematical aspects of natural language queries</li>
<li> Integration of mathematical inference and simulations along with structured data in a single question-answering system</li>
<li> Unprecedented level of structured data aggregation and curation</li>
<li> Rich presentation including static and dynamic elements and multiple modalities</li>
<li> (Potentially) Deployment of NL-to-SQL query translation in a multi-domain system. The technology to do this has existed for several years, but I don’t know if anyone has deployed it yet. I’m not sure if Wolfram has deployed this and haven’t seen enough examples to indicate if they have.</li>
</ul>
<p><strong>What it doesn’t do</strong></p>
<ul>
<li> Queries or presentation against unstructured data (neither keyword nor NL queries against unstructured data, which is a strength of <a href="http://www.powerset.com/">Powerset</a>)</li>
<li> Queries requiring ontological or commonsense inference (whether structured or unstructured, which is a strength of True Knowledge and <a href="http://www.cyc.com/">Cyc</a>)</li>
<li> Answers in support of transactions (e.g. price feeds from many merchants or airlines), which is shown in various stages in many major search engines</li>
<li> Queries that cross multiple domains (e.g. “what was the weather in San Francisco when Yahoo was founded”, which is a strength of True Knowledge)</li>
</ul>
<p><strong>Implications for the field</strong></p>
<ul>
<li> Question answering has been an important part of search results all along, but it has often been a second-class citizen and hardly promoted</li>
<li> By increasing the comprehensiveness of structured question answering (in terms of data and domains), Wolfram Alpha can increase awareness and usage of question answering systems</li>
<li> This should move question answering to be more of a competitive feature across search engines</li>
<li> Users will want to ask questions over both structured and unstructured data, not just structured data, which will increase perceived differentiation for technology like Powerset</li>
<li> If the use of structured data and simulations proves valuable to a large number of users and search engines, then this will increase the need to transform and route queries to vertical experts, potentially developed by ecosystem partners</li>
<li> This will increase the need and value for ecosystem players to add semantic markup to their structured data and simulations, hence making it easier to offer more semantic question answering and integration with other services, and expanding the value of the services by search engines in a virtuous cycle</li>
</ul>
<p><strong>Conclusion</strong></p>
<p>In conclusion, Wolfram Alpha is not going to be a new search engine or a universal answer engine. It is not going to put the existing major players or semantic search startups out of business. But there appears to be real innovation here, leading to at least a <span style="text-decoration: underline;">new kind of system</span> that we have not seen before.  I am eagerly looking forward to my turn to try it out.</p>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.barneypell.com/2009/03/wolfram-alpha-a-new-kind-of-question-answering-system/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Microsoft to acquire Powerset</title>
		<link>http://www.barneypell.com/2008/07/microsoft-to-acquire-powerset/</link>
		<comments>http://www.barneypell.com/2008/07/microsoft-to-acquire-powerset/#comments</comments>
		<pubDate>Thu, 03 Jul 2008 15:50:32 +0000</pubDate>
		<dc:creator>Barney</dc:creator>
				<category><![CDATA[Human Language Technology]]></category>
		<category><![CDATA[Information retrieval]]></category>
		<category><![CDATA[Powerset]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://174.120.172.92/~barneype/?p=118</guid>
		<description><![CDATA[On Monday, Microsoft and Powerset announced that Powerset is being acquired by Microsoft. In terms of timing, the companies announced that the deal was signed. There is still the customary period before the deal is officially closed (at which point, I expect we&#8217;re going to have a great party). I&#8217;m including, below, the text of [...]]]></description>
			<content:encoded><![CDATA[<p>On Monday, Microsoft and Powerset announced that Powerset is being acquired by Microsoft.</p>
<p>In terms of timing, the companies announced that the deal was signed. There is still the customary period before the deal is officially closed (at which point, I expect we&#8217;re going to have a great party).</p>
<p>I&#8217;m including, below, the text of the announcements from the blogs of Powerset and Microsoft.<br />
I think these sum up pretty well the logic behind the acquisition on both sides.</p>
<p>It took a lot of work by many people to make this happen. Most significant, of course, was the entire team at Powerset, who executed so well to build and launch a wonderful product that showed the world what is now possible.</p>
<p>Immediately following the announcement, we had a day of calls with members of the press, which resulted in a lot of coverage. I&#8217;ll try to post a collection of links next week.</p>
<p>One press meeting that I really enjoyed was a <a href="http://www.techcrunch.com/2008/07/02/interview-with-barney-pell-and-ramez-naam-about-microsoft%e2%80%99s-powerset-acquisition-integration-to-begin-this-year/">podcast with me, Ramez Naam (Group Program Manager for Microsoft Live Search), and Mike Arrington for TechCrunch</a>.  That link provides an article, transcript, and the full audio of the interview.</p>
<p>There is a lot more to say about Powerset, Microsoft, the acquisition, and what it means for the future of search, linguistic technology, semantic web, etc. I am excited to be staying on with Microsoft in a strategy and evangelist role and I am looking forward to the chance to talk and write a lot more about this, and from a whole new perspective, soon.</p>
<p>Here is the text of <a href="http://www.powerset.com/blog/articles/2008/07/01/microsoft-to-acquire-powerset">Powerset&#8217;s blog announcement</a>:</p>
<blockquote><p>We’re excited to announce officially that Microsoft has signed an agreement to acquire Powerset.</p>
<p>Powerset has always been a small company with big dreams, with the ultimate goal of changing the way humans interact with computers through language. We set out to improve search by indexing Web pages based on the meaning expressed in them rather than just the literal words. Powerset licensed breakthrough technology from PARC, hired world-renowned computational linguists and search engineers, and recently released a search and discovery experience for Wikipedia articles. Our technology helps to improve search results and also makes new features possible, such as Factz, which aggregates information from many articles to summarize a topic.</p>
<p>With any startup, the challenge is to take the seeds of an idea and grow it into a viable company. At Powerset, we transformed our idea into a world-class semantic search platform, demonstrating the future of search with our Wikipedia search experience. But building a large-scale semantic search engine is expensive, requiring an engineering effort and computing resources beyond what most start-ups could ever imagine. Because our goals around improving search align so well, Powerset has decided to team up with Microsoft. We believe that this is the fastest way to bring our technology to market at a large scale.</p>
<p>Microsoft shares our goal to improve search through deeper analysis of queries and documents, and understands that our technology and expertise will play a key role in the evolution of search. With an existing search infrastructure, incredible capital resources, unlimited data, a leading search team, and clear mission to revolutionize the search landscape, Microsoft can rapidly accelerate our progress in building semantic search technology and bringing it to full Web scale. When we launched our first product, we heard: this is great, but when and how will we get Powerset to go beyond Wikipedia? Microsoft accelerates our ability to move Powerset to the entire Web faster than anyone could have imagined.</p>
<p>Powerset will continue to operate much as we currently do, working in the same building, with the same organizational structure, and with the same uniquely talented and growing team (apply on our jobs page). We’ll continue to tackle the hardest problems in parsing, semantics, ranking, indexing, scalable computing, user experience and all of our other specialties. But now we’ll do it with the support of Microsoft and the vast resources of the entire Live Search team.</p>
<p>Over the past couple of years Powerset has made amazing progress. Starting with just a big idea, we licensed the best linguistic technology, recruited a top-notch team, built out our datacenter, engineered a world-class semantic search platform, tackled deep natural language issues, improved relevance, innovated an interface and launched a great product. So few start-ups ever tackle such deep, scientific problems successfully and create the kind of value we’ve delivered in such short order.</p>
<p>For now, Powerset.com will continue to host our Wikipedia Search &amp; Discovery and we’ll be continuing to experiment with our product, based on user feedback. But, expect many announcements from us in the coming months about how we’re integrating our technology and features into Live Search.</p></blockquote>
<p>And here&#8217;s the text of <a href="http://blogs.msdn.com/livesearch/archive/2008/07/01/powerset-joins-live-search.aspx">Microsoft&#8217;s blog announcement</a>:</p>
<blockquote><p>Powerset joins Live Search</p>
<p>We&#8217;re excited to announce that we&#8217;ve reached an agreement to acquire Powerset, a San Francisco-based search and natural language company.</p>
<p>Powerset will join our core Search Relevance team, remaining intact in San Francisco. Powerset brings with it natural language technology that nicely complements other natural language processing technologies we have in Microsoft Research.</p>
<p>More importantly, Powerset brings to Live Search a set of talented engineers and computational linguists in downtown San Francisco. This is a great team with a wide range of experience from other search engines and research organizations like PARC (formerly Xerox PARC).</p>
<p>We&#8217;re buying Powerset first and foremost because we&#8217;re impressed with the people there. Powerset CTO and cofounder Barney Pell is a visionary and incredible evangelist. When he introduced our senior engineers to some of the most senior people at Powerset — Search engineers and computational linguists like Tim Converse, Chad Walters, Scott Prevost, Lorenzo Thione, and Ron Kaplan — we came away impressed by their smarts, their experience, their passion for search, and a shared vision.</p>
<p>That shared vision is to take Search to the next level by adding understanding of the intent and meaning behind the words in searches and webpages.</p>
<p>We know today that roughly a third of searches don&#8217;t get answered on the first search and first click. Usually searchers find the information they want eventually, but that often requires multiple searches or clicks on multiple search results. Two specific problems are the most common reasons for this:</p>
<p>* Differences in phrasing or context between a user&#8217;s search and the way the same information is expressed on webpages. Search engines don&#8217;t understand today that &#8220;shrub&#8221; and &#8220;tree&#8221; are similar concepts. We don&#8217;t understand that &#8220;cancer&#8221; sometimes refers to a disease and sometimes refers to a horoscope and when a query or a webpage refers to which.<br />
* Lack of clarity in the descriptions for each webpage in the search results. Sometimes a result looks relevant from its short description on the results page but turns out to be not so relevant when you visit the actual page. As a result, searchers frequently click results and then rapidly click back when they realize they aren&#8217;t what they&#8217;re looking for.</p>
<p>These problems exist because search engines today primarily match words in a search to words on a webpage. We can solve these problems by working to understand the intent behind each search and the concepts and meaning embedded in a webpage. Doing so, we can innovate in the quality of the search results, in the flexibility with which searchers can phrase their queries, and in the search user experience. We will use knowledge extracted from webpages to improve the result descriptions and provide new tools to help customers search better.</p>
<p>Working with our existing Search team and other Microsoft teams that focus on natural language, Powerset will help us address all of those problems and opportunities.</p>
<p>We&#8217;re looking to add even more talented engineers to the San Francisco team to accelerate our shared progress. If you&#8217;re interested in joining the team, drop us a line.</p>
<p>We&#8217;ll have more to say about the things we&#8217;re doing in understanding searches and webpages through natural language technology in the coming months. In the meantime, please join me in welcoming Powerset to Microsoft!</p>
<p>Satya Nadella, Senior Vice President, Search, Portal, and Advertising</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.barneypell.com/2008/07/microsoft-to-acquire-powerset/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Powerset in Forbes article on the Language of Search</title>
		<link>http://www.barneypell.com/2008/02/powerset-in-forbes-article-on-the-language-of-search/</link>
		<comments>http://www.barneypell.com/2008/02/powerset-in-forbes-article-on-the-language-of-search/#comments</comments>
		<pubDate>Mon, 25 Feb 2008 00:16:54 +0000</pubDate>
		<dc:creator>Barney</dc:creator>
				<category><![CDATA[Human Language Technology]]></category>
		<category><![CDATA[Information retrieval]]></category>
		<category><![CDATA[Powerset]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://174.120.172.92/~barneype/?p=109</guid>
		<description><![CDATA[Forbes.com has a special issue on language, including interesting articles and interviews by some of my favorite writers on Language. I&#8217;m happy that natural language and semantic search was included in the special issue. Andy Greenberg from Forbes.com published his piece on language and search engines devoting a good portion of the article to Powerset [...]]]></description>
			<content:encoded><![CDATA[<p>Forbes.com has a special issue on language, including interesting articles and interviews by some of my favorite writers on language.</p>
<p>I&#8217;m happy that natural language and semantic search were included in the special issue. Andy Greenberg from Forbes.com published his piece on language and search engines, devoting a good portion of the article to <a href="http://www.powerset.com/">Powerset</a> and <a href="http://www.hakia.com/">Hakia</a> and featuring interviews with me and with Hakia&#8217;s founder Riza Berkan. The article, entitled <a href="http://www.forbes.com/business/2008/02/21/search-engine-semantic-tech-cx_ag_language_sp08_0221hakia.html">&#8220;Language Web-lish&#8221;</a>, starts off with Andy using Powerset&#8217;s metaphor comparing people&#8217;s current use of search engines to communicating like cavemen:</p>
<blockquote><p>A question in English, like &#8220;What year was Hillary Clinton born?&#8221; becomes what he calls a primitive &#8220;keywordese&#8221;: &#8220;Hillary Clinton born year.&#8221;</p>
<p>&#8220;We have this great gift of human intelligence based around language,&#8221; says Pell, &#8220;and now we have to translate it into a grunting pidgin language to interact with machines.&#8221;</p></blockquote>
<p>Andy described an example I showed him from Powerset:</p>
<blockquote><p>When a user enters the question, &#8220;In what year was Hillary Clinton born?,&#8221; Powerset&#8217;s algorithm doesn&#8217;t simply scour the Web for this collection of words in close proximity. Instead, it looks at pages with an eye for their meaning. Reading the sentence &#8220;Born to Dorothy and Hugh Rodham in 1947, Hillary Clinton is a New York senator,&#8221; Powerset will disassemble the sentence&#8217;s grammar and extract the fact of Hillary Clinton&#8217;s birth date. That fact is then connected with the user&#8217;s question, even if the word order of the result and the query didn&#8217;t originally match.</p></blockquote>
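<p>To make the idea concrete, here is a toy sketch in Python. It is purely illustrative and is not Powerset&#8217;s pipeline: a single hand-written regular expression stands in for real grammatical analysis, and the names <code>BORN_PATTERN</code>, <code>extract_birth_fact</code> and <code>answer_birth_year</code> are hypothetical. It only shows the shape of the approach: pull a fact out of the sentence, then match the question against the fact rather than against the word order.</p>

```python
import re

# Toy stand-in for grammatical analysis: one pattern for sentences shaped like
# "Born ... in YEAR, NAME is ...". A real system would parse the sentence.
BORN_PATTERN = re.compile(
    r"Born\b.*?\bin (\d{4}), ([A-Z][a-z]+(?: [A-Z][a-z]+)*) is"
)

def extract_birth_fact(sentence):
    """Return a (person, relation, year) fact as a dict, or None if no match."""
    match = BORN_PATTERN.search(sentence)
    if match is None:
        return None
    return {"person": match.group(2), "relation": "born", "year": int(match.group(1))}

def answer_birth_year(question, facts):
    """Answer from stored facts; the word order of the question is irrelevant."""
    text = question.lower()
    if "born" not in text:
        return None
    for fact in facts:
        if fact["person"].lower() in text:
            return fact["year"]
    return None

sentence = "Born to Dorothy and Hugh Rodham in 1947, Hillary Clinton is a New York senator."
fact = extract_birth_fact(sentence)
print(answer_birth_year("In what year was Hillary Clinton born?", [fact]))  # 1947
```

<p>Because the lookup goes through the extracted fact, the keyword-style query &#8220;Hillary Clinton born year&#8221; gets the same answer in this sketch.</p>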
<p>Andy also went through an example from Hakia:</p>
<blockquote><p>Taking the question &#8220;What drug is best for treating a urinary tract infection?&#8221; Riza Berkan points to the word &#8220;drug.&#8221; Hakia&#8217;s algorithm, he says, understands that the word contains a massive subset of concepts including synonyms and specific names of medicines. When it spots a term that falls into that subset, like &#8220;Amoxicillin,&#8221; Hakia can substitute the medicine&#8217;s name for the word &#8220;drug&#8221; in the result.</p>
<p>&#8220;You don&#8217;t want the word &#8216;drug,&#8217; you want the name of the drug,&#8221; says Berkan. &#8220;That&#8217;s a hidden failure in search engines, and people don&#8217;t even know what they&#8217;re missing.&#8221;</p></blockquote>
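<p>A minimal sketch of that kind of concept expansion, again in Python and again only an illustration: the <code>CONCEPTS</code> table, its entries, and the function <code>matching_names</code> are assumptions made up for this example, not Hakia&#8217;s data or algorithm.</p>

```python
# Toy concept table: a general query term mapped to specific names it covers.
# The entries are illustrative assumptions only.
CONCEPTS = {
    "drug": {"amoxicillin", "aspirin", "ibuprofen"},
}

def matching_names(query_term, sentence):
    """Return the specific names under query_term that occur in the sentence."""
    words = {w.strip(".,;:!?\"'").lower() for w in sentence.split()}
    specifics = CONCEPTS.get(query_term, set())
    return sorted(words.intersection(specifics))

# The sentence never contains the word "drug", yet it still matches the concept.
sentence = "The doctor prescribed Amoxicillin for the infection."
print(matching_names("drug", sentence))  # ['amoxicillin']
```

<p>A plain keyword match on &#8220;drug&#8221; would miss this sentence entirely, which is the hidden failure Berkan describes.</p>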
<p>Other natural language and semantic search companies mentioned included <a href="http://www.cognitionsearch.com/">Cognition Search</a> and <a href="http://www.lexxe.com/">Lexxe</a>.</p>
<p>As is typical, my friend Peter Norvig at Google gets the last word in the article:</p>
<blockquote><p>Google&#8217;s Peter Norvig, the search giant&#8217;s director of research, knows just how complex semantic algorithms can be: His Berkeley Ph.D. thesis tried to develop one in 1978. Every sentence of text, he says, took weeks to analyze. &#8220;The result was kind of like a dancing bear,&#8221; he says. &#8220;It was amazing that it could dance at all, but we didn&#8217;t expect it to star in the Moscow Ballet.&#8221;</p>
<p>But that doesn&#8217;t mean Google&#8217;s engineers are idly watching semantic search from a distance, says Norvig. The company&#8217;s thousands of engineers are looking at how to incorporate semantic analysis into a search algorithm. But semantic analysis is just one of many directions that Google&#8217;s teams are exploring&#8230; &#8220;Basically, we just do whatever works,&#8221; says Norvig. &#8220;Instead of trying to understand everything, we&#8217;re trying to understand something about billions of pages a week.&#8221;</p>
<p>But does that pragmatic approach leave Google vulnerable to an innovative start-up willing to risk its fate on building meaning-based search from scratch?</p>
<p>&#8220;It&#8217;s unlikely,&#8221; says Norvig. &#8220;But even car companies have to worry about anti-gravity machines.&#8221;</p></blockquote>
<p>I think that analogy is quite a stretch. It&#8217;s more like big car companies having to worry about smaller companies focused on electric cars. They don&#8217;t have to worry about this immediately but, at some point, this is going to be the future of their industry.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.barneypell.com/2008/02/powerset-in-forbes-article-on-the-language-of-search/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Powerset and Natural Language Search</title>
		<link>http://www.barneypell.com/2006/10/powerset-and-natural-language-search/</link>
		<comments>http://www.barneypell.com/2006/10/powerset-and-natural-language-search/#comments</comments>
		<pubDate>Wed, 04 Oct 2006 22:26:19 +0000</pubDate>
		<dc:creator>Barney</dc:creator>
				<category><![CDATA[Human Language Technology]]></category>
		<category><![CDATA[Information retrieval]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://174.120.172.92/~barneype/?p=76</guid>
		<description><![CDATA[Ever since I stated that Powerset was in &#8220;semi-stealth&#8221; mode about a year ago, I have been pretty quiet about the company on my blog. A few months ago we realized, after going through a fundraising process with a great set of angel investors, that much of Silicon Valley already knew that Powerset was building [...]]]></description>
			<content:encoded><![CDATA[<p>Ever since I stated that Powerset was in &#8220;semi-stealth&#8221; mode about a year ago, I have been pretty quiet about the company on my blog. A few months ago we realized, after going through a fundraising process with a great set of angel investors, that much of Silicon Valley already knew that Powerset was building a natural language search engine.  So we finally put some content up on the <a href="http://www.powerset.com/">Powerset</a> website and agreed to let some of our friends write about us. Some of the first articles about Powerset are those by:</p>
<ul>
<li> <a href="http://news.com.com/Spying+an+intelligent+search+engine/2100-1032_3-6107048.html">Stefanie Olsen</a>,</li>
<li> <a href="http://blogs.zdnet.com/Dyson/?p=12">Esther Dyson</a> (also an investor), and</li>
<li> <a href="http://datamining.typepad.com/data_mining/2006/09/powerset_update.html">Matt Hurst</a>.</li>
</ul>
<p>But I have been so busy with the company that I just didn&#8217;t take the time to write up the vision on my own blog.</p>
<p>Powerset has now unexpectedly become the subject of a recent blogstorm, initiated by an  <a href="http://venturebeat.com/2006/10/02/bold-start-up-powerset-about-to-raise-10m-to-take-on-google/">article posted yesterday</a> by Matt Marshall on <a href="http://www.venturebeat.com/">VentureBeat</a>. Since Matt wrote his initial version of the article before he was able to contact us, he expressed skepticism about what he inferred we are trying to do. (Update: Matt Marshall has just written a <a href="http://venturebeat.com/2006/10/04/powerset-that-secretive-little-search-engine-company/#more-2087">new article about Powerset</a>, after meeting with me and Steve yesterday.) This article started a debate in the blogosphere, with people coming down on both sides of the &#8220;search is great, nobody can compete with Google&#8221; vs. &#8220;search is broken, go for it&#8221; divide (for the former, see <a href="http://googlewatch.eweek.com/blogs/google_watch/archive/2006/10/03/13557.aspx">Steve Bryant&#8217;s article</a>, and for the latter, see <a href="http://www.siliconvalleywatcher.com/mt/archives/2006/10/search_startup.php">this article by Richard Koman</a>).</p>
<p>Given all the attention, I want to take time out to share my vision of natural language as the future of search. To start with, I will characterize the conventional thinking as expressed by various critics.</p>
<div id="a000084more">
<div id="more">
<h2>Search today</h2>
<p>Search has become much better than it used to be, and users have become pretty familiar with using the keyword-oriented search input language. While query length is going up slowly but steadily, the average query is still 2-3 words.  Even search engines like Ask Jeeves that encouraged users to enter questions still wound up with mostly short keyword queries.</p>
<p>From these facts it is easy to draw the following conclusions:</p>
<ul>
<li>Users don&#8217;t like typing and will not enter more than 2-3 words.</li>
<li>Natural language search has been tried and found lacking.</li>
</ul>
<p>If that&#8217;s true, then we will have to settle for short keyword queries for the indefinite future. Unfortunately, the limited query length puts a fundamental limit on how much information is communicated to the search engine. This in turn limits what a search engine, however intelligent, could possibly do to improve the results (using more information about the searcher and the search context can help, and is the subject of much active research). Looking at this situation, it is easy to see why the search industry seems to have matured and hit a plateau. Future innovation will come from extending search in various ways, but not from any fundamental changes in the core.</p>
<p>So there we are, right? Well, not exactly. First, note that this logic is very similar to the conventional wisdom before Google came onto the scene. Search was good enough, not a differentiator anymore, and the big players had turned their focus to innovating in other dimensions away from search (e.g. to becoming integrated media and technology companies).</p>
<p>Second, does the data so far really prove that users are generally satisfied with search, that they like to express themselves to search engines the way they do today, and that they wouldn&#8217;t try searching in a new way even if it promised better results?</p>
<h2>Who is satisfied communicating in an impoverished language?</h2>
<p>An analogy with natural languages is helpful here. Suppose you live in France and don&#8217;t speak any French. Life is very difficult when you don&#8217;t speak the language. Even basics like getting food and finding a bathroom can be a challenge. Then you study French for a year. Life is much better: you can accomplish all your basic daily tasks and even have extended conversations. The difference is like night and day. But does that mean you would be fully satisfied with your first-grade-level French? Of course not. Even after studying for a year, what you can say in French pales in comparison to what you can think in your own native language. After living for a while in a country where you must speak a foreign language, it is easy to stop trying to express complex thoughts, but that doesn&#8217;t mean the thoughts go away. Rather, each new word and construction you gain opens up new possibilities for communication. As you improve, your conversation partners (native speakers) gradually begin to see more of the true intelligence you have had all the time, even though you couldn&#8217;t express it until you learned more of their language.</p>
<p>I believe we are in much the same situation with respect to search today. But to see this, it is helpful to look more closely at the history and mechanisms of search.</p>
<h2>A brief history of search query languages</h2>
<p>What is search?  At its most abstract, a user enters a query to a search engine, and the search engine displays a readable set of results to the user. Most search engines do not actually go out and find live documents in response to a user&#8217;s query. Instead, they find a large set of documents in advance, process those documents, and then build an index. They consult this index in response to a user&#8217;s query to find a set of matching documents. Then they rank the potential matches and present the top ranking results.  The ranking of the matches, and possibly the short presentation of each result, are tailored to the query and potentially any other information available to the search engine.</p>
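<p>The index-then-rank flow described above can be sketched in a few lines of Python. This is only a toy illustration, not how any particular search engine works: the function names, the sample documents, and the ranking rule (counting occurrences of the query words) are all assumptions made for the example.</p>

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, documents, query, top_k=3):
    """Return up to top_k document ids matching every query word, best first."""
    words = query.lower().split()
    if not words:
        return []
    # Consult the index: keep only documents containing all query words.
    matches = set.intersection(*(index.get(w, set()) for w in words))

    # Toy ranking (an assumption): count how often the query words occur.
    def score(doc_id):
        tokens = documents[doc_id].lower().split()
        return sum(tokens.count(w) for w in words)

    # Highest score first; ties broken by document id for a stable order.
    return sorted(matches, key=lambda d: (-score(d), d))[:top_k]

# Hypothetical documents, gathered and indexed before any query arrives.
documents = {
    "d1": "a history book about the civil war",
    "d2": "civil war photographs and civil war letters",
    "d3": "a cookbook for children",
}
index = build_index(documents)
results = search(index, documents, "civil war")
print(results)
```

<p>The point of the sketch is the division of labor: the documents are processed once, ahead of time, and each query only consults the index and then orders the matches.</p>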
<p>As seen in this abstract description, the user experience in search starts with the query. The query language has a big impact on every aspect of the experience. The earliest search engines, called Information Retrieval (IR) systems, required users to enter queries in a boolean language (a bunch of keywords modified with AND, OR, and NOT). This was powerful, and pretty effective for people like librarians who were trained on these systems, but distinctly unnatural.</p>
<p>Later systems let users enter queries in less formal language: free-text queries. A free-text query is just a list of words with no operators. The boolean-centric IR community called this &#8220;natural language query&#8221;, and our modern search engines are direct descendants of this approach. A free-text search engine translates a free-text query into a boolean query (implicitly assuming the words are coordinated by AND or OR operators) and then sends it to the same kind of boolean search engine as before. But rather than using all the words in the query, the translator splits the query words into two sets: keywords and stopwords. The keywords are the meaningful content-bearing words, the ones that a trained boolean search user would have put into the query, since such a user wants only documents that contain these words (or, with NOT, documents that lack them). The stopwords are the words that novice users enter because they are natural when entering text queries, but which are so frequent in most documents that they add no actual information usable by a boolean search engine (in fact, if they were included they would make the search results worse).</p>
<p>The result of this split is that from the perspective of the search engine, there is no difference at all between a user query that contains keywords and stopwords and a user query that just contains the keywords. The stopwords are completely ignored.  As a consequence, as users gain experience with these search engines, they learn that the stopwords don&#8217;t have any value, and they just save themselves the trouble of typing them in their queries. The result is that the effective query language used by users trained in free-text queries is not natural language, but rather a keyword sequence language, which I call &#8220;keywordese&#8221;.  Most novice searchers, and even skilled searchers who are frustrated in a search session, still put stopwords in their queries, but skilled searchers consider this to be just silly.</p>
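<p>To make the keyword/stopword split concrete, here is a minimal Python sketch of a free-text-to-boolean translator. The stopword list and the function names are hypothetical, chosen only for illustration; real engines use their own lists and query syntax.</p>

```python
# Hypothetical stopword list for illustration; real engines use longer ones.
STOPWORDS = {"a", "an", "the", "of", "in", "for", "by", "about", "to", "on"}

def split_query(query):
    """Split a free-text query into (keywords, stopwords), preserving order."""
    keywords, stopwords = [], []
    for word in query.lower().split():
        (stopwords if word in STOPWORDS else keywords).append(word)
    return keywords, stopwords

def to_boolean(query, operator="AND"):
    """Translate a free-text query into a boolean query over its keywords only."""
    keywords, _ = split_query(query)
    return f" {operator} ".join(keywords)

verbose = to_boolean("books about the civil war")
terse = to_boolean("books civil war")
print(verbose)  # books AND civil AND war
print(terse)    # books AND civil AND war
```

<p>Both queries produce the same boolean expression, which is why typing the stopwords earns the user nothing with this kind of engine.</p>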
<h2>The expressive power of natural language search</h2>
<p>But is it really silly to want to use stopwords, just something a person does until he learns how to search effectively? On the contrary, I think this perspective reveals something fundamental about the state of search today. To begin with, let&#8217;s look at the words that are &#8220;stopwords&#8221;, the words that don&#8217;t get any respect from the search engines. They are the little function words that put together the meaning of a phrase in a natural language like English. They are little because they are so frequent and useful in language. Words like &#8220;by&#8221;, &#8220;for&#8221;, &#8220;about&#8221;, &#8220;of&#8221;, and &#8220;in&#8221; are all stopwords. But consider how valuable they are to communicating intent among humans. To a keywordese search engine, &#8220;book for children&#8221;, &#8220;book by children&#8221;, and &#8220;book about children&#8221; are all equivalent to &#8220;book children&#8221;. Using only keywords, it is not even clear how one could possibly express these different queries.</p>
<p>This brings us back to the human language issue and the French analogy. &#8220;Keywordese&#8221; is a really impoverished language. It is much less expressive than even first year French. Normal people have learned human languages all their lives, and that language learning ability is based on aeons of biological and cultural evolution. We are all masters of communicating our intent to other people. But when it comes to search engines, we have to revert to an impoverished foreign language in which it is impossible to express anything but the most basic thoughts.  It is akin to using a pidgin language, the kind invented by two groups of people who speak different languages so that they can communicate through a combination of individual words and gestures without any real syntactic structure.</p>
<p>This motivates the idea of true natural language search. Instead of keywordese or even advanced keywordese (which few people can remember how to use), true natural language queries have linguistic structure. This includes queries where the function words matter, where word order means something, and where relationships that should be easy to state explicitly can be stated. Instead of ignoring the function words, a natural language search engine respects their meaning and uses it to give better results. Adding stopwords to a query is then no longer a waste of the user&#8217;s time: each little word added has a profound effect on the search quality.</p>
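<p>As a rough illustration of what keeping the function words buys, the following Python sketch compares a keywordese view of a query with a view that retains the first function word as a relation. It is a deliberately naive toy, nowhere near real linguistic analysis and not a description of Powerset&#8217;s technology; the relation list and both function names are assumptions made for the example.</p>

```python
# Hypothetical set of function words treated as relations in this toy example.
RELATIONS = {"for", "by", "about", "of", "in"}

def keywordese(query):
    """What a keyword engine sees: the content words, relations discarded."""
    return tuple(w for w in query.lower().split() if w not in RELATIONS)

def structured(query):
    """Keep the first function word as a relation linking the words around it.

    Returns (head, relation, object); relation is None if none is found.
    """
    words = query.lower().split()
    for i, word in enumerate(words):
        if word in RELATIONS:
            return (" ".join(words[:i]), word, " ".join(words[i + 1:]))
    return (" ".join(words), None, "")

queries = ["book for children", "book by children", "book about children"]
print({keywordese(q) for q in queries})  # a single entry: all three collapse
print({structured(q) for q in queries})  # three distinct entries
```

<p>Under the keywordese view the three queries are indistinguishable, while even this crude structured view keeps them apart, so an engine would at least have something to act on.</p>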
<p>Going beyond stopwords and even content words in documents, language can also be used to specify information about the type, nature or organization of the information that one is seeking. Understanding queries will allow search engines to separate the content being searched for from information about the type of content or its organization (meta-content). Examples include &#8220;synopsis of books about the civil war&#8221; and &#8220;trailers of action movies by Steven Spielberg&#8221;: in each, the first word names the type of content wanted, and the rest of the query describes the content itself.</p>
<p>By tapping into the expressive power that people already know and use every day in natural language, all users can let their true intelligence come through in their interactions with a search engine. This benefits everyone. Natural language search has the potential to turn novice searchers into power searchers, and to enable all searchers to do things that are simply impossible with search today.</p>
<h2>&#8220;Remember when we used to use keyword search?&#8221;</h2>
<p>Seen in this light, there is enormous room for fundamental innovation in search, as the game has only just begun. I believe we are going to look back 5 to 10 years from now and say: &#8220;remember when we used to search using keywords?&#8221;. It will take hard work to get there, but that&#8217;s what we&#8217;re working on at Powerset!</p></div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.barneypell.com/2006/10/powerset-and-natural-language-search/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>AAAI Spring Symposium on  Computational Approaches to Analysing Weblogs</title>
		<link>http://www.barneypell.com/2006/04/aaai-spring-symposium-on-computational-approaches-to-analysing-weblogs/</link>
		<comments>http://www.barneypell.com/2006/04/aaai-spring-symposium-on-computational-approaches-to-analysing-weblogs/#comments</comments>
		<pubDate>Sat, 01 Apr 2006 15:22:56 +0000</pubDate>
		<dc:creator>Barney</dc:creator>
				<category><![CDATA[Information retrieval]]></category>
		<category><![CDATA[Weblogs]]></category>

		<guid isPermaLink="false">http://174.120.172.92/~barneype/?p=63</guid>
		<description><![CDATA[Group photo Originally uploaded by Barney Pell. This week I attended the AAAI Spring Symposium on Computational Approaches to Analysing Weblogs. This photo is from a group dinner during the symposium. Present were Natalie Glance and Matt Hurst (from Whizbang, Intelliseek, Blogpulse, and now Nielsen Buzzmetrics), Niall Kennedy (most recently at Technorati), Nicolas Nicolov and [...]]]></description>
			<content:encoded><![CDATA[<div style="float: right; margin-left: 10px; margin-bottom: 10px;">
<a href="http://www.flickr.com/photos/barneypell/121516624/" title="photo sharing"><img src="http://static.flickr.com/42/121516624_7a7e7071f7_m.jpg" alt="" style="border: solid 2px #000000;" /></a><br />
<span style="font-size: 0.9em; margin-top: 0px;">
<a href="http://www.flickr.com/photos/barneypell/121516624/">Group photo</a><br />
Originally uploaded by <a href="http://www.flickr.com/people/barneypell/">Barney Pell</a>.
</span>
</div>
<p>This week I attended the AAAI Spring Symposium on  <a href="http://caaw2006.blogspot.com/">Computational Approaches to Analysing Weblogs</a>.<br />
This photo is from a group dinner during the symposium. Present were Natalie Glance and Matt Hurst (from Whizbang, Intelliseek, Blogpulse, and now Nielsen Buzzmetrics), Niall Kennedy (most recently at Technorati), Nicolas Nicolov and Franco Salvetti (Umbria), Rada Mihalcea (U. North Texas), Kevin Burton (TailRank) and Barney Pell.<br />
<br clear="all" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.barneypell.com/2006/04/aaai-spring-symposium-on-computational-approaches-to-analysing-weblogs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Barney Pell&#8217;s Research on Information Retrieval</title>
		<link>http://www.barneypell.com/1995/01/barney-pells-research-on-information-retrieval/</link>
		<comments>http://www.barneypell.com/1995/01/barney-pells-research-on-information-retrieval/#comments</comments>
		<pubDate>Sun, 01 Jan 1995 21:36:54 +0000</pubDate>
		<dc:creator>Barney</dc:creator>
				<category><![CDATA[Information retrieval]]></category>
		<category><![CDATA[Papers]]></category>

		<guid isPermaLink="false">http://174.120.172.92/~barneype/?p=4</guid>
		<description><![CDATA[Barney Pell&#8217;s Research on Information Retrieval This page contains a set of papers based on research with Catherine Baudin, Smadar Kedar, and Barney Pell, on Learning and Information Retrieval. Using Induction to Refine Information Retrieval Strategies. Catherine Baudin, Barney Pell, and Smadar Kedar. Appears in the Proceedings of AAAI-94, Seattle, 1994. Abstract is here. Incremental [...]]]></description>
			<content:encoded><![CDATA[<h3 id="a000016">Barney Pell&#8217;s Research on Information Retrieval</h3>
<p>This page contains a set of papers based on research by Catherine Baudin, Smadar Kedar, and Barney Pell on learning and information retrieval.</p>
<ul>
<li> <a href="http://www.barneypell.com/papers/aaai94-dedal.pdf">Using Induction to Refine Information Retrieval Strategies.</a> Catherine Baudin, Barney Pell, and Smadar Kedar. Appears in the Proceedings of AAAI-94, Seattle, 1994.  Abstract is <a href="http://www.barneypell.com/papers/aaai94-dedal-abstract.html">here</a>.</li>
<li><a href="http://www.barneypell.com/papers/workshop.pdf">Incremental Acquisition of Conceptual Indices for  Multimedia Design Documentation.</a> Appears in Proceedings of the AAAI-94 Workshop on Indexing and Reuse in Multimedia Systems, Seattle, 1994. Abstract is <a href="http://www.barneypell.com/papers/workshop-abstract.html">here</a>.</li>
<li><a href="http://www.barneypell.com/papers/kaml94.pdf">Increasing Levels of Assistance in Refinement of Knowledge-Based Retrieval Systems.</a> Catherine Baudin, Smadar Kedar, and Barney Pell. In the Knowledge Acquisition Journal, 1994.  Abstract is <a href="http://www.barneypell.com/papers/kaml94-abstract.html">here</a>.</li>
<li><a href="http://www.barneypell.com/papers/book.pdf">Increasing Levels of Assistance in Refinement of Knowledge-Based Retrieval Systems.</a> Catherine Baudin, Smadar Kedar, and Barney Pell. In G. Tecuci and Y. Kodratoff (eds), Machine Learning and Knowledge Acquisition: Integrated Approaches, Academic Press, 1995. Abstract is <a href="http://www.barneypell.com/papers/book-abstract.html">here</a>. This is a slightly longer, more tutorial version of the Knowledge Acquisition Journal article of the same title.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.barneypell.com/1995/01/barney-pells-research-on-information-retrieval/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
