February 25, 2008
Powerset in Forbes article on the Language of Search
Forbes.com has a special issue on language, including interesting articles and interviews by some of my favorite writers on language.
I’m happy that natural language and semantic search were included in the special issue. Andy Greenberg from Forbes.com published his piece on language and search engines, devoting a good portion of the article to Powerset and Hakia and featuring interviews with me and with Hakia’s founder Riza Berkan. The article, entitled “Language Web-lish,” starts off with Andy using Powerset’s metaphor comparing people’s current use of search engines to communicating like cavemen:
A question in English, like “What year was Hillary Clinton born?” becomes what he calls a primitive “keywordese”: “Hillary Clinton born year.” “We have this great gift of human intelligence based around language,” says Pell, “and now we have to translate it into a grunting pidgin language to interact with machines.”
Andy described an example I showed him from Powerset:
When a user enters the question, “In what year was Hillary Clinton born?,” Powerset’s algorithm doesn’t simply scour the Web for this collection of words in close proximity. Instead, it looks at pages with an eye for their meaning. Reading the sentence “Born to Dorothy and Hugh Rodham in 1947, Hillary Clinton is a New York senator,” Powerset will disassemble the sentence’s grammar and extract the fact of Hillary Clinton’s birth date. That fact is then connected with the user’s question, even if the word order of the result and the query didn’t originally match.
Andy also went through an example from Hakia:
Taking the question “What drug is best for treating a urinary tract infection?” Riza Berkan points to the word “drug.” Hakia’s algorithm, he says, understands that the word contains a massive subset of concepts including synonyms and specific names of medicines. When it spots a term that falls into that subset, like “Amoxicillin,” Hakia can substitute the medicine’s name for the word “drug” in the result. “You don’t want the word ‘drug,’ you want the name of the drug,” says Berkan. “That’s a hidden failure in search engines, and people don’t even know what they’re missing.”
Other natural language and semantic search companies mentioned included Cognition Search and Lexxe.
As is typical, my friend Peter Norvig at Google gets the last word in the article:
Google’s Peter Norvig, the search giant’s director of research, knows just how complex semantic algorithms can be: His Berkeley Ph.D. thesis tried to develop one in 1978. Every sentence of text, he says, took weeks to analyze. “The result was kind of like a dancing bear,” he says. “It was amazing that it could dance at all, but we didn’t expect it to star in the Moscow Ballet.” But that doesn’t mean Google’s engineers are idly watching semantic search from a distance, says Norvig. The company’s thousands of engineers are looking at how to incorporate semantic analysis into a search algorithm. But semantic analysis is just one of many directions that Google’s teams are exploring… “Basically, we just do whatever works,” says Norvig. “Instead of trying to understand everything, we’re trying to understand something about billions of pages a week.”
But does that pragmatic approach leave Google vulnerable to an innovative start-up willing to risk its fate on building meaning-based search from scratch?
“It’s unlikely,” says Norvig. “But even car companies have to worry about anti-gravity machines.”
I think that analogy is quite a stretch. It’s more like big car companies having to worry about smaller companies focused on electric cars. They don’t have to worry about this immediately but, at some point, this is going to be the future of their industry.
Posted by barney on February 25, 2008 at 12:16 am | No Comments
November 19, 2007
Natural Language and the Semantic Web: ISWC Keynote talk
I gave an invited keynote talk last week at The 6th International Semantic Web Conference and the 2nd Asian Semantic Web Conference, 2007. The abstract for the talk is below. The image below links to the original video and presentation slides.
The live presentation (and video) contains technical demos that aren’t in the slides. Some of the demos are already available inside Powerlabs (e.g. Powermouse, which lets you browse and query our semantic database of facts extracted from Wikipedia), while others are still internal (e.g. an open search box, and output of our natural language system on full sentences). I also gave a detailed walk-through showing how Powerset takes advantage of external semantic resources like WordNet and Freebase.
For me, the most fun part of the talk was toward the end, where I got to speculate on how ecosystem effects can make natural language search and the semantic web become deeper and more powerful more quickly than people might expect. For example, advertisers, publishers, and vertical search sites will be able to contribute ontologies that enable them to get more users, better internal search, and more revenue, while having as a side effect that the broad search engines get more knowledgeable about different domains.
The questions afterward were also challenging and interesting.
POWERSET – Natural Language and the Semantic Web
Posted by barney on November 19, 2007 at 8:29 pm | No Comments
September 12, 2007
Tim Converse on Proximity is a Hack
Powerset’s Tim Converse wrote a great article entitled: Proximity is a Hack.
In the article, Tim says that the two biggest improvements in web search were the use of links (including anchor text) and term proximity. The article explores the benefits of term proximity and argues that it works to the extent that it approximates linguistic relationships in the text.
He concludes that natural language processing of the documents should have the ability to more accurately capture linguistic relationships even if the query itself is in keywordese (as opposed to a natural language query with internal linguistic structure).
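To make the idea of term proximity concrete, here is a minimal Python sketch of one possible proximity feature: the shortest window of document tokens that contains every query term. The function names and the scoring formula are my own illustrative assumptions, not anything taken from Tim’s article or from a real search engine.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def min_window(tokens, terms):
    """Length of the shortest token window containing every term.

    Returns None if the term list is empty or some term never occurs.
    """
    needed = set(terms)
    if not needed:
        return None
    counts = {}  # occurrences of needed terms inside the current window
    best = None
    left = 0
    for right, tok in enumerate(tokens):
        if tok in needed:
            counts[tok] = counts.get(tok, 0) + 1
        # Shrink from the left while the window still covers all terms.
        while len(counts) == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            head = tokens[left]
            if head in counts:
                counts[head] -= 1
                if counts[head] == 0:
                    del counts[head]
            left += 1
    return best


def proximity_score(document, query):
    """Hypothetical score: distinct query terms divided by window length."""
    terms = tokenize(query)
    window = min_window(tokenize(document), terms)
    if window is None:
        return 0.0
    return len(set(terms)) / window


relevant = ("Born to Dorothy and Hugh Rodham in 1947, "
            "Hillary Clinton is a New York senator")
misleading = "Clinton said Hillary Duff was born in Texas"
query = "hillary clinton born"

print(proximity_score(relevant, query))
print(proximity_score(misleading, query))
```

In this toy case the second sentence scores higher even though it says nothing about when Hillary Clinton was born. The words sit close together but are not linguistically related, which is the sense in which proximity only approximates the relationships a parser could capture.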
To recap: proximity is both a wonderfully powerful relevance feature, and a total hack. It helps enormously, but it’s not what you really want, it’s just sorta somewhat correlated with what you really want. What you need for what you really want is the underlying structure of all that web content: the real syntactic structure of the sentences, how the sentences connect to each other, how the facts relate, and (maybe) how the discourse flows and the topics connect. We’ve squeezed all the juice we can out of webpages considered as word-vectors; now it’s time to parse this stuff and get at the real structure.
Can that be done? A couple of years ago I would have said no, but I hadn’t seen the PARC natural language technology then, and didn’t know that an effort this concerted and well-funded was on the way. Now, do I think that Powerset will do it? I still don’t know, frankly – there’s so much more to do to make it real and debugged and scaled the way it needs to be. But it’s clear to me that the next big thing in web search is either this or something a whole lot like this, and I think we have the best shot of anyone. And that’s why I’m at Powerset.
The article is definitely good reading for people interested in search and the potential benefits of NLP.
Posted by barney on September 12, 2007 at 9:03 pm | No Comments