Puzzling OpenCalais

I’ve been trying to follow the development of OpenCalais for a while. It’s a very interesting project: it lets people automatically annotate entities in free text, so that related things can be linked together without much manual work. Every time I try it out, I’m surprised by both positive and negative things. Here is an example:

I submitted a MarketWatch article to it: These 13 ‘tipping points’ have us on the edge of a Depression

In general, it does a pretty good job, especially at identifying people and some places. It even tries to tie facts together and do some anaphora resolution (finding who “he” in a phrase refers to). It’s not too smart about it, though. For example, in these paragraphs:

So can you trust them to have a magical formula for predicting the next “tipping point,” the next “Black Swan” in our future? No. The correct answer is (c), Warren Buffett’s answer. Here’s why.
Background: First, Wall Street’s narrow equations always leave out key macroeconomic data. Always. They cannot handle “big picture” issues. Their formulas are what mathematicians call “indeterminate equations,” with an infinite set of solutions. Guesses. So Wall Street invariably ignores big-picture issues that lead to meltdowns. Meanwhile, they get rich playing with your money.
Second: The Buddha would call Wall Street’s mathematical problem, a Zen koan, an impossible question. And he’d warn you to: “Believe nothing, no matter where you read it or who has said it, not even if I have said it, unless it agrees with your own reason and your own common sense.”

It decided that the “he” in “And he’d warn you” referred to Warren Buffett, three paragraphs earlier, and not to the Buddha, who is mentioned in the same paragraph but was not identified as a person.
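To make that failure mode concrete, here is a toy sketch of a "nearest preceding person" heuristic for pronoun resolution. I have no idea how OpenCalais actually resolves pronouns, so this is only an illustration under my own assumptions: the `Mention` class, the `resolve_pronoun` function, and the paragraph and offset numbers are all made up. It shows how a candidate that was never typed as a person gets skipped, no matter how close it is to the pronoun.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Mention:
    """A hypothetical entity mention; not an OpenCalais data structure."""
    text: str
    entity_type: str  # e.g. "Person", "City", or "Unknown"
    paragraph: int    # paragraph index in the document
    offset: int       # character offset inside that paragraph


def resolve_pronoun(
    pronoun_paragraph: int,
    pronoun_offset: int,
    mentions: Sequence[Mention],
    allowed_types: Sequence[str] = ("Person",),
) -> Optional[Mention]:
    """Return the closest mention before the pronoun whose type is allowed."""
    pronoun_position = (pronoun_paragraph, pronoun_offset)
    candidates = [
        m for m in mentions
        if m.entity_type in allowed_types
        and (m.paragraph, m.offset) < pronoun_position
    ]
    if not candidates:
        return None
    # The latest position before the pronoun is the nearest antecedent.
    return max(candidates, key=lambda m: (m.paragraph, m.offset))


# Made-up positions: Buffett three paragraphs before the pronoun,
# the Buddha in the same paragraph but not typed as a person.
mentions = [
    Mention("Warren Buffett", "Person", paragraph=0, offset=150),
    Mention("The Buddha", "Unknown", paragraph=3, offset=12),
]
as_extracted = resolve_pronoun(3, 95, mentions)

# Same text, but with the Buddha recognized as a person.
fixed_mentions = [
    mentions[0],
    Mention("The Buddha", "Person", paragraph=3, offset=12),
]
with_typing_fixed = resolve_pronoun(3, 95, fixed_mentions)
```

With these inputs, `as_extracted` is the Buffett mention and `with_typing_fixed` is the Buddha mention. In this sketch the resolution step is not really the culprit: the entity typing that feeds it is.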

Other oddities:

  • It thought that Chavez was a city and not a person.
  • It tagged the metaphor “Grand Obstructionist Party” as an organization. That is an advanced interpretation!
  • It tagged “Depression” as a medical condition.
  • It missed the first name of author Nassim Nicholas Taleb, and so missed the “author” connection, which it does catch for Malcolm Gladwell.

It does a pretty good job at what it was initially built to do: identifying phrases like “Henry Kaufman, former vice chairman and chief economist at Salomon”. It also identifies the Fed as the Federal Reserve and the Dow as Dow Jones.

I also liked the addition of the identified “Industry Terms”: bank bailout (which shows that they are up to date on current trends), printing money, and shadow banking system. But why “telecommunications”? Is this term that technical?

Anyway, as a human it’s easy to see that things are wrong, and as an NLP enthusiast it’s easy to see how things could be improved (especially given my metadata background: I know it would be very easy to catalog all authors of popular books). But I can’t ignore the fact that they’ve done what nobody has really tried before: put entity and even relationship extraction in production for anybody to use and criticize. That is right on the theme of my latest resolution: whatever you do is worthless unless you put it in production for everybody to see.


4 Responses to “Puzzling OpenCalais”


  1. Thomas Tague, February 25, 2009 at 8:53 am

    Tom Tague from Calais here.

    Of course my initial reaction is to say “wait, wait. If you’d just changed the sentence structure to …. we would have done better” – but in fact things like this are one of the points of the entire Calais initiative. NLP techniques have been living in a vacuum too long – let’s make sure they can perform in the dirty world of obscure writing styles, blog responses, whatever.

    There are some simple things we can fix (we’re not great with one word or three word names) – and some that will take years of learning (100% perfect anaphora resolution, person disambiguation, etc).

    Please drop by sws.clearforest.com/linkeddatatestertool to play with it yourself – and start experimenting with some of the new Linked Data aspects of our most recent release. You can, for example, move from detecting a company to a disambiguated company name to a Linked Data page about the company to Dbpedia to …. you get the idea.

    Thanks for the article. We appreciate it when people give things a try and point out what’s great – as well as what needs work.

    Regards,

  2. Bostjan Spetic, February 26, 2009 at 2:08 pm

    heya,

    nice post, it seems you are a very tough critic ;) am wondering if you know about zemanta API – it is offering some similar features, with more USG content focus…

    best, bostjan

  3. michelgoldstein, February 26, 2009 at 2:58 pm

    I’ve heard about the Zemanta API, but haven’t played with it much. I’ll add it to my list of things to try soon, starting with a head-to-head comparison on the same document. I just tried the SemantalyzR with it and the results were interesting too. I’ll confirm it’s using it all correctly before posting more details.

    As for being a tough critic, well, I guess it comes with the territory. Working in fact extraction for Amazon, I’m forced to work on very high-precision (and low-latency, high-throughput) algorithms (we try to prevent customers from buying a camera thinking that it has 12x optical zoom when it actually has no optical zoom and 12x digital zoom). I’ll post one day about what it’s like.

  4. Thomas Tague, February 26, 2009 at 4:42 pm

    Tom Tague again.

    Yes, we’re very familiar with Zemanta. They’ve done some great things to enhance the blogging experience in a very short time. In the end – I think we’re heading in different directions but I encourage users to try both tools and choose what works for them.

    We tend to stay focused a bit more on the “plumbing” rather than the UI. We want smart people to take our tools and build amazing things. We do have a couple of front end tools though: Gnosis is a browser plugin that tags as you browse and Tagaroo is a WordPress extension that provides similar functionality to Zemanta.

    Thanks again,

