July 1, 2005

News / Blog search at Vertical Leap

Moderators: Om Malik (Business 2.0) and Steve Gillmor (Gillmor Gang)

Panelists:

Tantek: We index > 12M weblogs in real-time, and use them as a collaborative filter to show you top tags, movies, books, searches, etc. Also form simple standards like microformats so community can grow as a whole.

Scott: Feedster blog/RSS search, used side by side with technorati. We're a search engine and advertising network. As world moves from pure search there's a new way for things to be done.

Jim: CEO of Moreover. Current awareness content: News, blogs, etc. We have a strong footprint, and do distribution on the internet as well. MSN, Ask Jeeves, etc. Not a direct consumer play but behind the scenes, and value added services for publishers.

Chris: Topix.net is a news aggregator, with consumer footprint and also syndicate our feeds. We categorize news, with differentiated technology. We categorize down to each zip code in US and Canada.

Steve: Scott, connect some thread between previous sessions and this one?

Scott: The question that keeps occurring to me that affects the group of us up here is how people are thinking of vertical search. One way is "vertical market", like jobs. Things like shopping and local search are functional in a certain way. A new standard is arising quickly. When feedster was started by two guys, they just noticed a standard called RSS going through the roof, and that it really screwed up pagerank. They saw inefficiency between the gold standard of Google and what some users want. So take this piece of the web, this kind of data google does poorly on, and get enough of it together with a great UI to push that to the stratosphere. There are any number of new formats coming out in the web2.0 world, and as they hit critical mass there will be new problems.

Jim: News is the most used application besides email on a daily basis. Thousands of sources, millions of blogs, all updated throughout the day. The end user requirement is to find out about it as soon as it happens. This is a very different cycle from search. And it's been around a long time, a successful profitable vertical search industry. With emerging standards to take it in new places.

Chris: A big factor for relevancy is "how new is it". The general search guys don't have the freshest content. Vertical search is a set of things that don't work as well for general reference search. In our case it's the things that you want to be fresh.

Om: How to make your search more relevant? I see we have the same news story and info appear countless times in your search results. How to add context to that?

Chris: At topix we take every news story. Cluster similar stories together. We turn each story into a mathematical vector, can say: "Is this the same AP story, a little different editing, or a completely different account of the same event". The underlying structure is an event. The journalistic output is a presentation and interpretation of the underlying event. We then have newsrank, what we believe the user wants to see. We have a completely automated story picker on the front page. More interesting, we can categorize stories to postal or zip code, so you can get the top stories of the day, categorized and ranked by relevancy. What the editor does for a magazine.
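As a rough illustration of the vector comparison Chris describes, here is a minimal Python sketch. It is not Topix's actual method: the bag-of-words vectors, the cosine measure, and both thresholds (`same_copy`, `same_event`) are assumptions chosen only to show the idea of sorting story pairs into "same wire copy", "different account of the same event", or "unrelated".

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a story into a term-count vector (bag of words)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_pair(story_a, story_b, same_copy=0.9, same_event=0.4):
    """Label a pair of stories; both thresholds are illustrative guesses."""
    similarity = cosine(vectorize(story_a), vectorize(story_b))
    if similarity >= same_copy:
        return "same wire copy"
    if similarity >= same_event:
        return "different account of the same event"
    return "unrelated"
```

Stories that land in the first two buckets could then be grouped into one cluster per underlying event, with a separate ranking step deciding which member of the cluster to show.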

Jim: Relevant to whom, when, for what app? Moreover views it more as an issue of metadata, to support drill down for end users. We add 30 metadata fields to each article. Which San Jose? We have location in hierarchical system. But also have human editorial like "sourcerank", this is one of the top 100 publications in the world. Then user can get a stream of UK pubs with journalistic integrity, that are within this date range, unique, and mention these keywords. You find this more in product search than in general web search. We add this kind of layer for our blog product, etc, so user can decide what's relevant. We have algorithmic, human editors, and ultimately expose the control to the user to provide the most relevant experience.
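The drill-down Jim describes (UK publications, high source rank, a date range, unique, matching keywords) can be sketched as a plain metadata filter. This Python example is hypothetical throughout: the `Article` fields, the `source_rank` convention (1 is the top publication), and the title-based uniqueness check are assumptions, not Moreover's schema or API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    title: str
    country: str
    source_rank: int  # hypothetical editorial rank, 1 = top publication
    published: date
    body: str


def drill_down(articles, country, max_source_rank, start, end, keywords):
    """Return unique articles that match every metadata constraint."""
    seen_titles = set()
    results = []
    for article in articles:
        text = article.body.lower()
        if article.country != country:
            continue
        if article.source_rank > max_source_rank:
            continue
        if not (start <= article.published <= end):
            continue
        if not all(keyword.lower() in text for keyword in keywords):
            continue
        if article.title in seen_titles:  # crude stand-in for deduplication
            continue
        seen_titles.add(article.title)
        results.append(article)
    return results
```

The point of the sketch is that relevance is left to the caller: the same article pool yields different streams depending on which constraints the user supplies.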

Scott: We're in the funny position where deduplication is important, but many of these blog entries are 2 or 3 lines different, but those lines are the product of someone who cares so much about this story it drives them mad. And they are our searchers as well. Read-write web. We have to respect not pulling out the info provided, even if it's just 5% of the article, not disrespecting what's written by passionate users of ours. When the new pope was chosen, the amount of argument characterized in both our engines was immense on both sides. Many bloggers wrote just a few words, but it was heartfelt. You can't say it was mostly like another article so I'll get rid of it. In most cases, you need to decide which feeds are the most important in terms of their consistency of great content. We think that hyperlink meshes are broken in the read/write web. Page rank takes time to build. So we have to find some way to say that person has been blogging twice weekly about this topic for 1.5 years, and audit them as a micropublication and make the decision that way. The trick is to make the fewest people unhappy as possible. Deduping for us can lose publishers and searchers.

Tantek: Agree that relevance is difficult to measure for different people. We look at many different factors. At technorati we see people looking at the news 24 hours a day. You can look at the most emailed news articles on Yahoo. But someone who links to an article, and provides context, will be much more relevant as far as attention. So we look at this entire hyperlink mesh of bloggers linking on one axis to newspapers on another axis, and looking at the number of links in the past 24 hours to a news article. In addition, we've seen some news events that pop first in the blogosphere. Eg. the tsunami was heavily covered by bloggers long before mainstream media hit it.
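Tantek's "number of links in the past 24 hours" signal can be sketched as a windowed count. This Python example is only an illustration under stated assumptions: the `(blog_url, target_url, linked_at)` tuple layout and the rule that each blog counts once per article are invented here, not taken from Technorati.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_linked(links, now, window=timedelta(hours=24), limit=10):
    """Count distinct blogs linking to each article within the window.

    links: iterable of (blog_url, target_url, linked_at) tuples.
    """
    cutoff = now - window
    seen = set()
    counts = Counter()
    for blog_url, target_url, linked_at in links:
        if not (cutoff <= linked_at <= now):
            continue
        if (blog_url, target_url) in seen:  # one vote per blog per article
            continue
        seen.add((blog_url, target_url))
        counts[target_url] += 1
    return counts.most_common(limit)
```

Counting each blog once keeps a single prolific linker from dominating the list, which is one possible reading of treating a link as a vote of attention.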

Om: How to keep everything in context? The person who writes the most relevant story or post ends up being the last in the search results because the other people have reblogged the same thing and it keeps dropping. So even the timeliness and value of content are in a slight disconnect on all parameters.

Chris: People want to read the news, and maybe some blogs. I disagree most blogs are interesting for some people. To find every instance of your name, ok. But to keep track of some subject, you need the journalistic perspective. There are people who provide editorial touch. That's a measure of relevance. If lots of people are talking about a tsunami in journalistic space, that's pretty relevant. We can choose to keep the first one up there, the guy who got the scoop. But I don't think the user cares about the scoop anymore. CNET complains that they put the story first, other people cover it. There's also a brand preference: You might want to see CNN's coverage on the tsunami. This brand preference might weigh more than who got it first, or even best.

Scott: Except Jason, who complains his scoop is stolen by CNET...

Tantek: Do we want opinion silos, like red.topix.com and blue.topix.com?

Chris: I think it's the underlying event. Expose people to the most content possible. I don't think if the NYT and a blogger cover something, that it's important for people to read one or the other.

Jim: Some people do care what the source is, and want the color commentary. Some technologies that would help: Provide flexible ranking algs that optimize journalistic integrity, buzz factor, and put into an interface that is usable. Then you need sentiment, whether people mention the content in a positive or negative way. There's no good technology to do this automatically today for all articles. We don't presume to know what will be relevant to whom, but those are technologies that could certainly help us. No one algorithm for all use cases.

Scott: Whobrokethestory.com as a vertical search engine has a very small audience (journalists). Figuring out who broke the story can be very hard. As for sentiment, which we call editorial bias, that's important for CPC advertisers as well. Nokia wants to advertise on 3000 mobile phone blogs. They want to make sure that not only are the blogs they advertise on relatively on topic, with decent grammar and spelling, but also that it's not someone who is incredibly passionate about Ericsson phones. They want to exclude this particular bias from their advertising. In the presidential campaign, there was a problem with democratic ads inadvertently running on republican sites. It's important to figure this out on both sides of the editorial / advertiser chinese wall.

Chris: Rottentomatoes.com has list of all movie reviews with their positive or negative sentiments.

Tantek: We've been working with other companies on open standard for publishing reviews on the web. If you search for "hreview" you can look at it. This problem has so many dimensions that finding something really simple is quite a challenge. But back to Om's question: Who broke the story, and who is the person most relevant about this? That comes to two issues: Time and authority. You can look at the last result on a topic as someone who broke the story. For authority, we determine it by our political biases, whatever network news channel you prefer. At technorati, we look at the link relationships between 12M publishers. Each link is a vote of attention, I trust this person somewhat or at least want to reference it. We can tell you by authority which bloggers are talking about any particular news event.

Om: You folks are saying you are a search company, but you have to have some sort of editorial bent on everything you do. What are the ideal user interfaces you folks have figured out, and how to get the readers involved in this, stuff like tags working in your favor to fine-tune the user experience.

Scott: This gets us into religious issues. We're of the opinion that RSS and other lightweight flavors of XML become all the valuable content on the internet. But there is no ultimate interface. We are a web service first and foremost. 90% of our traffic on the outbound site leaves us as XML. I don't know all the user interfaces in which we're being used. It's a backend web service that has many applications, combined with many other sources. We have to keep reworking the terms of service as it gets tricky. But no ultimate UI, rather a long tail of user interfaces. Most of the value is in the ones that are too small for me to ever see.

Tantek: What's the most valuable resource we are so short of every day: Time. So when I go to my aggregator, and I have 5 minutes to read something that's relevant to me, whether topically relevant or something else, the UIs most effective will deliver to me within 5 minutes prioritized by my reading habits, social network, and geography what is relevant to me. We haven't seen that, but people are moving in that direction. Steve Gillmor's attention.xml.

Chris: If you're a website, optimize for people to return to your site, and also for them to do what you want them to do. There isn't a single wonder UI that does everything, but a packaging problem. Just as with syndication today, you'll see the article in multiple newspapers. thatsracing.com has the same content put into a nascar site interface.

Jim: For us, the interface is the API, and that's the beauty of the service layer. Enable dynamic computation of relevance. We all see the same vision, news and current awareness on our sleeves so we can read it when we have that 5 minutes. How to get there, is that the interface? There will be multiple interfaces. Ultimately the end scenario is user expresses interest and the results come back from the cloud. You need the content, metadata, and multiple relevance algorithms. At the end of the day that will get embedded in multiple UIs. Attention.xml is relevant as you can say: "I've just made this available". No uber interface, but all incrementally making significant contributions to building the infrastructure out to make those UIs possible.

Om: Not just the uber UI, but for your own company. You all just aren't thinking about how the user wants to consume this.

Chris: We all syndicate to people, you can go where you like the UI. But maybe you're not the consumer, Om... for Topix site, we bias to local news. We dialed up the mayhem, show people things that are interesting. Not people who made their numbers. We have point of view... If you're building for people whining about an RSS standard you'll go to one site. If you're looking for people who care about other issues, you'll go to another site. If there's a murder on your block, how many of you don't want to know. (Audience: I live in Oakland, I don't want to know).

Jim: We don't have many consumer facing interfaces. We do have CI news desk. Any info request to filter in any shape or form, you can. This is a corporate product that you haven't seen.

Scott: Redsocks.feedster.com is something we did for the boston globe. They were trying to figure out how to fold the community in, how to include these passionate people in what was going on. They built this web page through a combination of editorial and algorithmic. Most people think Feedster is a subbrand of the boston globe. For Red Sox fans, this is the interface they want, and the feedster homepage is not what they want. While for most readers it's just fine.

Om: You are focused entirely on building the technology, not engaged with how we consume news. There's a reason people consume the NYT. Not the best journalism on the planet, but fed to you in a certain manner. Take Topix as an example: I use your telecom pages, but when I find the first 5 stories are about local stores starting to sell sprint phones etc, that's not the news I want to know. I want to see MCI is sold, and everything else that's important to me. I've been consuming info in a certain manner, on web, print or tv.

Chris: It's possible our telecom pages suck, we're trying to improve. But we can't duplicate the NYT experience as we're an aggregator. What do you mean by wanting to consume news in a certain way?

Tantek: Describe your ideal news experience, Om...

Om: When you open a newspaper, you have the most important story up top, then less relevance going down. Whether I use technorati or feedster (let moreover slide as I don't use it that often...) how do you come up with a UI? I find stories where I just want to throw my head against the wall and break it in two because I can't find the stuff I am looking for.

Chris: There just isn't that much news out there for some things. Most of it is syndicated content. Stories about San Carlos, there are only a small number of them in the news space. If you're looking for a specific subject, it may be there is no telecom news.

Tantek: There are only so many MCI executives to convict...

Scott: Dave Sifry shows traffic tied to events, inbound links, like the Kryptonite locks scandal. These things peak and trough on hourly if not daily basis. That's why people use RSS aggregators. I have many search feeds in my aggregator that go bright blue only every few weeks. Journalism is an important but tiny market, we can't guarantee there's something breaking every day.

Tantek: Om likes to see the important news today at the top of the page. That's an editorial filter. Then we have the massive collaborative filtering effect, that's another type of filter. Maybe Om is looking for a specific set of searches to be notified about for a specific topic.

Om: You're missing the point here. When you open a newspaper there is a story which is deemed important. There are 50 newspapers printing a story about a serial killer in kansas, that's the story of the moment as more people either write or publish about it. Not saying to hire editors, but if so many more newspapers are printing it, it must be important.

Chris: We do an automated top page, so do Google and technorati. But at a small subject level, there just are 4 stories, not much there. That's the challenge for all of us.

q: Technorati and feedster are experimenting with tagging. How's it going?

Tantek: With technorati, we allow you to tag your own blog posts. We aggregate this, and display the contextual feed from flickr, buzznet, furl, etc on that topic. We've seen some spam, but the nice thing about spamming here is that they stick out like a sore thumb. It's provided a lot of value to us. In our new redesign, if there's been a post for a particular term you're searching for that is tagged with that within the last 3 days, we put that higher.
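The tag boost Tantek mentions (a post tagged with the search term in the last 3 days ranks higher) can be sketched in a few lines of Python. This is a hedged illustration, not Technorati's ranking: the post dictionaries with `title`, `tags`, and `posted_at` keys are hypothetical, and falling back to newest-first ordering for everything else is an assumption.

```python
from datetime import datetime, timedelta


def rank_posts(posts, term, now, boost_window=timedelta(days=3)):
    """Put recent posts tagged with the search term first, then newest first.

    posts: list of dicts with 'title', 'tags', and 'posted_at' keys.
    """
    wanted = term.lower()

    def is_boosted(post):
        tagged = wanted in {tag.lower() for tag in post["tags"]}
        age = now - post["posted_at"]
        return tagged and timedelta(0) <= age <= boost_window

    # True sorts after False, so reverse=True puts boosted posts on top
    # and orders each group from newest to oldest.
    return sorted(posts, key=lambda p: (is_boosted(p), p["posted_at"]), reverse=True)
```

A tagged post older than the window gets no boost and simply competes on recency with everything else.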

Scott: There's talk about how different tag systems deal with spam. I believe tagging your own stuff is a problem. I'm worried about it as things scale, but currently it's great. With standalone tagging, we're making it easier for less technically sophisticated people to tag. Right now you need your own blog, or an account with del.icio.us, furl, etc. This is beyond most people, even those who have blogs. We're trying to put a tag form in every post, feed, and web page in a way that my mother could tag. We're still in this process.

Tantek: Everyone is learning from each other's experiments. Just here in the bay area, we've seen so much interest that we started with flickr folks "Tag tuesday" tagtuesday.com to talk about experiments and see what works best. User, publisher, distributor tagging. Many approaches. Be a part of that discussion moving forward.

q: For monetization, how important will be micropayments in your industry, particularly given google's announcements?

Scott: We think google is already a micropayments system, adwords/adsense. You get paid, google takes a skim. But micropayments to read an individual article is going to be very limited.

Chris: Everyone tried it, it's never worked for anybody at any scale. If google's doing it, maybe it will work.

Tantek: I agree with that. We've seen a tremendous explosion in independent free content. The question I hear isn't get me to higher quality content, but help me filter it.

Scott: I was wrong about iTunes on exactly this topic...

Om: On adsense, nobody is going to get rich and make a living off it. I can say that for a fact...

Chris: Specific sites making money on adsense, we're doing ok thank you.

Scott: He's talking about as an individual writer, like most of his questions. We're the "vertical search for writers" conference...

Posted by barney at July 1, 2005 2:30 PM

This entry was posted in the following categories: Search
