Archive for the 'semantic web' Category

Puzzling OpenCalais

I’ve been trying to follow the developments of OpenCalais for a while. It’s a very interesting project that lets people annotate entities in free text, so that things can be related to each other without much manual work. The interesting thing is that every time I try it out, I’m surprised by both positive and negative things. Here is an example:

I ran a MarketWatch article through it: These 13 ‘tipping points’ have us on the edge of a Depression
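For those who haven’t tried it, OpenCalais is just a REST service: you POST raw text and get the extracted entities back. Here’s a minimal sketch of how I feed it an article, assuming the /tag/rs/enrich endpoint and x-calais-licenseID header from their documentation as I remember it (the key is a placeholder, and the details may have changed since):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class CalaisClient {
    // Placeholder key; endpoint and header names are from the OpenCalais
    // REST documentation of the time and may have changed since.
    private static final String LICENSE_KEY = "YOUR-OPENCALAIS-KEY";

    public static String enrich(String articleText) throws IOException {
        URL url = new URL("http://api.opencalais.com/tag/rs/enrich");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("x-calais-licenseID", LICENSE_KEY);
        conn.setRequestProperty("Content-Type", "text/raw; charset=UTF-8");
        conn.setRequestProperty("Accept", "application/json");

        // The request body is just the raw article text.
        OutputStream out = conn.getOutputStream();
        try {
            out.write(articleText.getBytes("UTF-8"));
        } finally {
            out.close();
        }

        // The response is JSON describing the entities, facts, and
        // events OpenCalais extracted from the text.
        StringBuilder response = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line).append('\n');
            }
        } finally {
            in.close();
        }
        return response.toString();
    }
}
```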

In general it does a pretty good job, especially at identifying people and some places. It even tries to tie facts together and attempts some anaphora resolution (figuring out who a “he” in a phrase refers to). It’s not too smart about it, though. For example, in these paragraphs:

So can you trust them to have a magical formula for predicting the next “tipping point,” the next “Black Swan” in our future? No. The correct answer is (c), Warren Buffett’s answer. Here’s why.
Background: First, Wall Street’s narrow equations always leave out key macroeconomic data. Always. They cannot handle “big picture” issues. Their formulas are what mathematicians call “indeterminate equations,” with an infinite set of solutions. Guesses. So Wall Street invariably ignores big-picture issues that lead to meltdowns. Meanwhile, they get rich playing with your money.
Second: The Buddha would call Wall Street’s mathematical problem, a Zen koan, an impossible question. And he‘d warn you to: “Believe nothing, no matter where you read it or who has said it, not even if I have said it, unless it agrees with your own reason and your own common sense.”

It decided that the “he” (added in bold for people to find it more easily) referred to Warren Buffett, three paragraphs before, and not to the Buddha, who was mentioned in the same paragraph but not identified as a person.

Other oddities:

  • It thought that Chavez was a city and not a person
  • It labeled the metaphor “Grand Obstructionist Party” as an organization. That is an advanced interpretation!
  • It tagged Depression as a medical condition
  • It missed the first name of author Nassim Nicholas Taleb, thus missing the “author” relationship, which it doesn’t miss for Malcolm Gladwell

It does a pretty good job at what it was initially built to do: identifying phrases like “Henry Kaufman, former vice chairman and chief economist at Salomon”, and resolving the Fed to the Federal Reserve and Dow to Dow Jones.

I also liked the addition of the identified “Industry Terms”: bank bailout (shows that they are up to date on current trends), printing money, and shadow banking system. But why “telecommunications”? Is that term really that technical?

Anyway, as a human it’s easy to see that things are wrong, and as an NLP enthusiast it’s easy to see how they could be improved (especially with my metadata background: it would be very easy to catalog the authors of all popular books). But I can’t ignore the fact that they’ve done what nobody has really tried before: putting entity and even relationship extraction into production for anybody to use and criticize. Right on the theme of my latest resolution: whatever you do is worthless unless you put it in production for everybody to see.


OpenVocab

Just a final short link for today’s Semantic Web set of posts. (I was preparing a long post on web frameworks, but I just can’t seem to finish it. That’s probably related to the nightmares I get from doing web development, but that’s certainly a subject for a completely different post.)

OpenVocab

An interesting concept: letting people easily create richly cross-referenced vocabularies. In my previous post I talked about getting more information about entities; this project lets you tag anything as an “entity”, not just people and places.

Right now there isn’t much to see there, unless you want to learn about:

blah or blah22 (also known as blah2) or blah3

where you will find the most authoritative resource on those concepts. They are very important!

I don’t think this is going to get anywhere unless somebody gets really serious about it, starts dumping data and information into it, and then other people start cleaning it up, grouping synonyms, and connecting things that were too far apart for the original authors to connect. I’ll keep an eye on this project and see where it goes. In general, maybe the best approach is to have entities reference something like Freebase and non-entities reference something like WordNet or other dictionary/thesaurus-like resources. All of this would make me a little happier than throwing time and energy at a service that will just absorb it all and disappear.
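Just to illustrate that split, here is a tiny sketch of how a vocabulary term could anchor itself in those external resources. The property choices (owl:sameAs for entities, rdfs:seeAlso for everything else) and the URIs are my own assumptions for the example, not anything OpenVocab actually does:

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.OWL;
import com.hp.hpl.jena.vocabulary.RDFS;

public class VocabLinks {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();

        // An entity-like term anchored in Freebase (all URIs illustrative).
        Resource buffett = model.createResource(
                "http://open.vocab.org/terms/WarrenBuffett");
        buffett.addProperty(OWL.sameAs, model.createResource(
                "http://www.freebase.com/view/en/warren_buffett"));

        // A non-entity term pointed at a WordNet-style dictionary resource.
        Resource tipping = model.createResource(
                "http://open.vocab.org/terms/tippingPoint");
        tipping.addProperty(RDFS.seeAlso, model.createResource(
                "http://wordnet.princeton.edu/"));

        model.write(System.out, "TURTLE");
    }
}
```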

The state-of-the-art of Semantic Web thinking?

Sometimes you go around the blogosphere and you see a post like this:

RDF/Linked Data Standards Not Good Enough for Intelligent Agents? Or Is It the Opposite?

And I have to say that posts like that make me worried. Why? Because they attract people with experience building semantic web-like applications, and they all say (note: this is my interpretation, not what they actually said): well, we’re not sure RDF is enough. And we don’t really want to discuss why it isn’t enough, or what would be needed to make it enough, because we know that if we claim something is impossible, somebody will come around and show how it can be done in RDF, and we will feel silly for claiming we couldn’t use it.

My experience? Well, I have some good experience on the modeling side, but not-so-good experience on the performance side. Like them, though, I suspect my not-so-good experience owed more to not having enough time to fully understand what was going on than to an actual problem with the framework. Let me get into the details of that not-so-good experience:

I was building summary data, and then a reporting mechanism that provided cross-cut views of that summary data (effectively summarizing the summary). The tricky part is that some of it was hierarchically related, i.e., summaries could be categorized in a hierarchical fashion (all consumer electronics products, or just TVs, or just HDTVs), and I wanted the report configuration to be aware of that hierarchy. Looking around, I decided to use RDFS to represent the data and the hierarchy, and SPARQL to represent the filter that would select the things I wanted.
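To make that concrete, here is a minimal sketch of the shape of the data and the kind of filter I mean. All the names (ex:category, ex:total, the class hierarchy) are made up for illustration, and I’m using a SPARQL 1.1 property path for brevity; the same traversal can be done with RDFS inference or by expanding the hierarchy explicitly:

```java
import java.io.StringReader;

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class HierarchyReport {
    public static void main(String[] args) {
        // Illustrative data only: a tiny category hierarchy plus two
        // summary rows, standing in for the real (much larger) dataset.
        String data =
            "@prefix ex:   <http://example.org/> .\n" +
            "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n" +
            "ex:HDTV rdfs:subClassOf ex:TV .\n" +
            "ex:TV   rdfs:subClassOf ex:ConsumerElectronics .\n" +
            "ex:summary1 ex:category ex:HDTV ; ex:total 120 .\n" +
            "ex:summary2 ex:category ex:TV   ; ex:total 300 .\n";

        Model model = ModelFactory.createDefaultModel();
        model.read(new StringReader(data), null, "TURTLE");

        // Select every summary that falls anywhere under
        // ConsumerElectronics by walking rdfs:subClassOf.
        String sparql =
            "PREFIX ex:   <http://example.org/>\n" +
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
            "SELECT ?summary ?total WHERE {\n" +
            "  ?summary ex:category ?cat ; ex:total ?total .\n" +
            "  ?cat rdfs:subClassOf* ex:ConsumerElectronics .\n" +
            "}";

        QueryExecution qe = QueryExecutionFactory.create(
                QueryFactory.create(sparql), model);
        try {
            ResultSetFormatter.out(qe.execSelect());
        } finally {
            qe.close();
        }
    }
}
```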

In general it worked great: very few lines of code to get it working, and pretty much no complicated business logic anywhere. However, when the real data came in, things weren’t as “pretty” as I had hoped. Some numbers: the data had about 50K triples, and the category hierarchy added another 10K triples (a pretty big hierarchy, but quite a small dataset overall), all computed on a stateless server with 512 MB for the JVM, using Jena for RDF serialization, deserialization, representation, and querying. The result: it worked well for simple reports (without much hierarchical aggregation), generating them in about 10 seconds, but more complicated reports started taking 30-60 seconds. Worse, if multiple reports were generated at the same time (it’s all a web interface, so this is as easy as opening multiple tabs, one per report), the server would lock up during SPARQL querying and never return.
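One suspicion I haven’t had time to verify: a plain in-memory Jena model is not safe for unsynchronized concurrent access, and the Jena documentation recommends wrapping access in critical sections. A hedged sketch of what that would look like around the report queries (whether this is actually the lockup I saw, I don’t know):

```java
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.shared.Lock;

public class SafeReportRunner {
    /** Hypothetical callback standing in for whatever builds the report. */
    public interface RowHandler {
        void handle(QuerySolution row);
    }

    private final Model model;

    public SafeReportRunner(Model model) {
        this.model = model;
    }

    // Run a SELECT under a read lock so that concurrent report
    // requests (multiple tabs) don't interleave unsafely with each
    // other or with updates to the model.
    public void runReport(String sparql, RowHandler handler) {
        model.enterCriticalSection(Lock.READ);
        try {
            QueryExecution qe = QueryExecutionFactory.create(
                    QueryFactory.create(sparql), model);
            try {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    handler.handle(results.nextSolution());
                }
            } finally {
                qe.close();
            }
        } finally {
            model.leaveCriticalSection();
        }
    }
}
```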

So what is the current state right now? Unfortunately, as I mentioned, I haven’t had time to dig much further into it, so I’m not sure what is going on. I’m sure there are things I can do to improve it, but it’s really sad that such an apparently simple technology produces such bad results out of the box. It’s not as if Jena were a new project. SPARQL support (actually ARQ, the query engine used by Jena) may not be as old as the full framework, but it’s a mature project.

Anyway, one day I’ll get back to that system, figure out what was going on, and post about it. Until then, I have to make sure that clients aren’t opening multiple tabs when looking at their reports.

Linked data – Linked Movie Data Base

So this morning I was playing around with what is apparently a brand-new website with RDF data: LinkedMDB. It’s new, so it mostly has structure but little data, and thus I can’t really review it very well. But even with this lack of data I’m already concerned about its quality. Here is the only example I’ve checked out so far (a terrible sample, I know):

I went to “production_company” – “Twentieth Century-Fox Film Corporation (Production Company)”

Two things struck me as red flags:

  1. The URL shows that it’s production_company number 1. Fixing URLs to the type like this is very limiting: it makes it hard for elements to ever cross type boundaries
  2. It has a link to Freebase, but to the wrong entity. They link to “Twentieth Century-Fox Film Corporation”, which is incidentally the same name as their entity, but the right element in Freebase is “20th Century Fox”. I’ve marked the two for a merge at Freebase, since I can’t edit any of the data in LinkedMDB. Let’s see what happens. (A sketch for inspecting these links programmatically follows this list.)
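Since the whole point of linked data is that these cross-links are machine-readable, mislinks like that are easy to spot programmatically. A sketch of pulling a LinkedMDB resource with Jena and listing its outgoing links; the resource URI follows the pattern I saw on the site but is otherwise an assumption:

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.rdf.model.Statement;
import com.hp.hpl.jena.rdf.model.StmtIterator;

public class LinkChecker {
    public static void main(String[] args) {
        // URI follows the pattern I saw on the site; treat it as illustrative.
        String uri = "http://data.linkedmdb.org/resource/production_company/1";

        // Jena can dereference a linked-data URI directly and parse
        // whatever RDF the server returns.
        Model model = ModelFactory.createDefaultModel();
        model.read(uri);

        // Dump every statement about the resource so that suspect
        // cross-links (like the wrong Freebase entity) stand out.
        StmtIterator it = model.listStatements(
                model.createResource(uri), null, (RDFNode) null);
        while (it.hasNext()) {
            Statement s = it.nextStatement();
            System.out.println(s.getPredicate() + " -> " + s.getObject());
        }
    }
}
```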

In other words, people have to remember that open data is only good if the data inside is dependable. Especially when there is no built-in method for fixing the data.

