

I decided to add my FriendFeed RSS to the right side of this blog.

I’ve always been a fan of the idea behind FriendFeed: an aggregator and distributor of information about what I’m doing that doesn’t try to own all the data. It differs from Twitter in that it doesn’t expect you to integrate with it; it integrates with the sites you already use. Yes, business-wise that’s certainly worse, because now it has to handle multiple formats from the sites generating events, but it’s still a better architecture in my opinion.
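To make the "multiple formats" cost concrete, here is a minimal sketch of what an aggregator has to do for just two formats, RSS 2.0 and Atom. It is only an illustration, not how FriendFeed actually works: the function name `parse_events`, the sample feeds and the example.com URLs are all made up for this post.

```python
import xml.etree.ElementTree as ET

# Namespace prefix that ElementTree puts in front of every Atom tag.
ATOM = "{http://www.w3.org/2005/Atom}"


def parse_events(xml_text):
    """Turn an RSS 2.0 or Atom document into a list of {'title', 'link'} dicts.

    Hypothetical helper: a real aggregator would need many more formats
    and fields than these two.
    """
    root = ET.fromstring(xml_text)
    events = []
    if root.tag == "rss":
        # RSS 2.0: items live under <channel>, the link is element text.
        for item in root.iter("item"):
            events.append({
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
            })
    elif root.tag == ATOM + "feed":
        # Atom: entries are namespaced, the link is an href attribute.
        for entry in root.iter(ATOM + "entry"):
            link = entry.find(ATOM + "link")
            events.append({
                "title": entry.findtext(ATOM + "title", default=""),
                "link": link.get("href", "") if link is not None else "",
            })
    else:
        raise ValueError(f"unsupported feed format: {root.tag}")
    return events


# Made-up sample feeds, one per format.
RSS_SAMPLE = """<rss version="2.0"><channel><title>Blog</title>
<item><title>New post</title><link>http://example.com/post</link></item>
</channel></rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"><title>Photos</title>
<entry><title>New photo</title><link href="http://example.com/photo"/></entry>
</feed>"""

for sample in (RSS_SAMPLE, ATOM_SAMPLE):
    print(parse_events(sample))
```

Every new source format means another branch like these, which is exactly the maintenance cost I mean.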

By now people should have realized that you can’t do everything right. So maybe the real goal is not to do everything right yourself, but to point to where other people are doing things right. The only thing left is for people to be able to create their own feeds and add them to FriendFeed, something like using Yahoo Pipes to generate the specific feed you want. But you’ll need to figure out how to write the feedback piece too: Yahoo Pipes is good for filtering and summarizing, but not for acting on the filtered data.

But back to the reason why I think people should be able to create their own feeds: FriendFeed can’t afford to keep running after new formats and building new icons all the time. I would hate to see FriendFeed go away while Twitter is still there, running and making no money.



Just to add a final short link for today’s Semantic Web set of posts. (I was preparing a long post on web frameworks, but I just can’t seem to finish it. It’s probably related to me having nightmares about doing web development, but that’s certainly a subject for a completely different post.)


An interesting concept: allowing people to easily create richly referenced vocabularies. In my previous post I was talking about getting more information about entities; this project allows you to tag anything as an “entity”, not just people and places.

Right now there isn’t much to see there, unless you want to learn about:

blah or blah22 (also known as blah2) or blah3

there you will find the most authoritative resource about those concepts. They are very important!

I don’t think this is going to get anywhere unless somebody gets really serious about it and starts dumping data and information there, and then other people start cleaning it up, grouping synonyms and connecting things that were too far apart for the authors to make the connections themselves. I’ll keep an eye on this project and see where it goes. In general, maybe the best approach is to just have entities reference something like FreeBase and non-entities reference something like WordNet or other dictionary/thesaurus-like resources. That would make me a little happier than throwing time and energy into a service that will just absorb it all and disappear.

More interesting Semantic Web stuff

When I first heard about Headup it seemed like a nice idea. Now that I’ve seen more details, I think it really might be a good product. I haven’t tried it yet, because it seems to only work on Windows right now and uses MS Silverlight. But if they get past this silly requirement, it could be big.

If you don’t know what it is, check out their website and look at their videos. In general, it identifies entities in any page you are reading (well, at least on the ones that they show in their demos: Facebook, FriendFeed, YouTube) and allows you to dig through those entities, getting specific information that is relevant to each one. For example, if you click on a band, you will be able to see upcoming concerts for the band. If you click on a song, you can play it. If you click on a person, it will show that person’s profiles and activity across multiple websites.

I’m not sure how good it is at actually tracking everybody, but it’s certainly a neat concept. Let’s wait and see where it will take us.

The state-of-the-art of Semantic Web thinking?

Sometimes you go around the blogosphere and you see posts like these:

RDF/Linked Data Standards Not Good Enough for Intelligent Agents? Or Is It the Opposite?

And I have to say that they make me worried. Why? Because they attract people with experience building semantic web-like applications, and they all say (note, this is my interpretation, not really what they said): well, I’m not sure we know if RDF is enough. Also, we don’t really want to talk about why it’s not enough and what is needed for it to be enough, because we know that if we start saying something, somebody will come around and show how it can be done in RDF, and we will feel silly for claiming we couldn’t use it.

My experience? Well, I have some good experience modeling-wise, but not so good experience performance-wise. But, like them, I’m afraid that my not-so-good experience was more because I didn’t have enough time to fully understand what was going on than actually a problem with the framework. Let me get to some details of my not-so-good experience:

I was building summary data and then a reporting mechanism that provided cross-cut views on this summary data (effectively summarizing the summary). The tricky part is that some of it was hierarchically related, i.e., there were summaries that could be categorized in a hierarchical fashion (all things for consumer electronics products, or TVs, or HDTVs), and I wanted the report configuration to be aware of that hierarchy. After looking around, I decided to use RDFS to represent the data and the hierarchy, and SPARQL to represent the filter that would select the things I wanted.
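To give an idea of what "aware of the hierarchy" means, here is a minimal sketch in plain Python. It is not the Jena code and not SPARQL; it only mimics the kind of answer an RDFS reasoner (or, in SPARQL 1.1, a property path like `rdfs:subClassOf*`) would give. The `ex:` names, the `ex:sales` property and the numbers are all invented for illustration.

```python
from collections import defaultdict

SUBCLASS_OF = "rdfs:subClassOf"
TYPE = "rdf:type"
SALES = "ex:sales"  # hypothetical property holding a summary value

# Toy data: a three-level category hierarchy plus three summaries.
TRIPLES = [
    ("ex:HDTV", SUBCLASS_OF, "ex:TV"),
    ("ex:TV", SUBCLASS_OF, "ex:ConsumerElectronics"),
    ("ex:Radio", SUBCLASS_OF, "ex:ConsumerElectronics"),
    ("ex:summary1", TYPE, "ex:HDTV"),
    ("ex:summary1", SALES, 10),
    ("ex:summary2", TYPE, "ex:TV"),
    ("ex:summary2", SALES, 5),
    ("ex:summary3", TYPE, "ex:Radio"),
    ("ex:summary3", SALES, 2),
]


def descendants(triples, category):
    """Return the category itself plus every class below it via subClassOf."""
    children = defaultdict(set)
    for s, p, o in triples:
        if p == SUBCLASS_OF:
            children[o].add(s)
    seen = {category}
    stack = [category]
    while stack:
        current = stack.pop()
        for child in children[current]:
            if child not in seen:  # guards against cycles in the hierarchy
                seen.add(child)
                stack.append(child)
    return seen


def total_sales(triples, category):
    """Sum the sales of every summary typed with the category or a subclass."""
    classes = descendants(triples, category)
    members = {s for s, p, o in triples if p == TYPE and o in classes}
    return sum(o for s, p, o in triples if p == SALES and s in members)


print(total_sales(TRIPLES, "ex:ConsumerElectronics"))  # all three summaries
print(total_sales(TRIPLES, "ex:TV"))                   # TV and HDTV summaries
```

The report configuration only names a category; the hierarchy decides which summaries fall under it.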

In general it worked great! Very few lines of code were needed to get it working, and pretty much no complicated business logic was added anywhere. However, when the data came, things weren’t as “pretty” as I had hoped. Now for some numbers: the data had about 50K triples, and the category hierarchy added another 10K triples (it’s a pretty big hierarchy, but quite a small dataset in general). Everything was calculated on a stateless server with 512 MB in the JVM. Using Jena for the RDF serialization, deserialization, representation and querying, it worked well for simple reports (without much hierarchical aggregation), generating them in about 10s. But more complicated reports started taking 30-60s. On top of that, if multiple reports were generated at the same time (it’s all a web interface, so it’s like opening multiple tabs, one per report), it would lock up the server during the SPARQL querying and never return.

So what is the current state? Unfortunately, as I mentioned, I haven’t had time to dig much further into it, so I’m not sure what is going on. I’m sure there are some things I can do to improve it, but it’s really sad that such an apparently simple technology generates such bad results out of the box. It’s not as if Jena is a new project. SPARQL support (actually ARQ, the query engine that ships with Jena) might not be as old as the full framework, but it’s a mature project.

Anyway, one day I’ll get back to that system, figure out what was going on and post about it. Until then, I have to make sure that clients don’t open multiple tabs when looking at their reports.

Linked data – Linked Movie Data Base

So this morning I was playing around with what is apparently a brand new website with RDF data: LinkedMDB. It’s new, so it mostly has structure and no data, and thus I can’t really review it very well. But even with this lack of data I’m already concerned about its quality. I’ll give the only example I’ve checked out so far (terrible sample):

I went to “production_company” – “Twentieth Century-Fox Film Corporation (Production Company)”.

Two things struck me as red flags:

  1. The URL shows that it’s production_company number 1. Fixing the URL to the type is very limiting: it means elements can’t easily cross the boundaries between types.
  2. It has a link to FreeBase, but to the wrong entity. They link to “Twentieth Century-Fox Film Corporation”, which incidentally has the same name as their entity, but the right element in FreeBase is “20th Century Fox”. I’ve marked the two for merge at FreeBase, as I can’t edit any of the data in LinkedMDB. Let’s see what happens.

In other words, people have to remember that open data is only good if the data inside is dependable. Especially when there is no built-in method for fixing the data.

More into “free” data sources – Swivel

I think I’ve posted about Swivel before on a past blog, but I found myself digging through it again, and I have to say that I was once again disappointed. It might contain good data, but in general it’s a letdown, mostly because on most searches the only thing I can find is noise. It’s either data without enough context for you to understand it, like this one (the second hit when you search for “Seattle”):

Top 10 Increases in Total Crime

Or it’s just something that is probably better classified as “private”:

Elite Activity Membership Growth: Note, before you get trigger-happy and open this link, let me explain what it is about. Elite Activity is apparently some sort of religion, and this graph shows their membership growth for March and May 2008 (not even consecutive months) to be more or less flat. That’s all it has! And how did I find it? It was one of the 4 most viewed data sets today.

So, there are some limitations with the site. But let’s try to look beyond them at what is really missing from it. Here are some suggestions:

  1. Allow adding and filtering by graph metadata: let’s say that I want to get recent data for Seattle. I should be able to specify that I want city data, that the city name is Seattle and that the data should contain the year 2007 or 2008.
  2. Provide the ability to clean up duplicate information.
  3. Somehow cache the source of the data. In many places I tried to click on the source link to understand the data, but I received a 404 or even a DNS error for the site. If they want to let people pull data from different sites and use those sites to authenticate the data, they should make sure that the sites still contain the data.
  4. Provide the ability to easily merge and reconcile apparently redundant data from multiple graphs. Sometimes there is a spike of “fun” data out there and people flock to create their own visualizations of it. After the fun is gone, because the data is already a couple of years old, it would be good to be able to clean it up and combine those visualizations of the same data.
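To illustrate the first suggestion, here is a minimal sketch of what filtering by graph metadata could look like. None of this is Swivel’s actual data model or API: the `DataSet` fields, the `find` function and the sample entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    """Hypothetical metadata record attached to each uploaded data set."""
    title: str
    kind: str          # what the rows describe, e.g. "city" or "country"
    place: str         # the place the data is about
    years: frozenset   # every year that appears in the data


def find(datasets, kind, place, any_year):
    """Return data sets of the given kind and place covering any wanted year."""
    wanted = set(any_year)
    return [
        d for d in datasets
        if d.kind == kind and d.place == place and wanted & d.years
    ]


# Made-up catalogue entries.
CATALOGUE = [
    DataSet("Seattle crime totals", "city", "Seattle", frozenset({2006, 2007})),
    DataSet("Seattle rainfall", "city", "Seattle", frozenset({1999, 2000})),
    DataSet("US crime totals", "country", "United States", frozenset({2007})),
]

for match in find(CATALOGUE, kind="city", place="Seattle", any_year=[2007, 2008]):
    print(match.title)
```

With structured metadata like this, a search for “Seattle” stops being a keyword match against titles and becomes a real query.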

This is just a short list of things that would improve the site. But one thing that really bugs me about it (mostly because of my current open data mindset) is that they are nice enough to let you import data from multiple places, but the only “API” they seem to have is a way to dump it to Excel. What if I want to cross-relate this data with other things in my web service? I’m out of luck.

There is us in everything – can there be computers in us?

Today I spent some time watching a thriller, The Reaping, and it made me think about why it’s sometimes so hard for some people to watch some of these movies. In my opinion, it all goes back to the way we understand the world outside us: we put ourselves in the shoes of the people we are observing. However, we don’t quite get that imagining ourselves in a fantasy world where there are people out there planning on killing us might not be a good idea.

Minds never stop where they should, right? So after giving up on watching the movie (it’s not that it’s a terrible movie, but it was close to 1 AM and I probably should have been going to bed to enjoy a long Sunday) I started thinking about intelligence and how one can simulate it. If most of our ability to understand people comes from this “self-projection” ability, how can we make an algorithm understand people? Are we algorithms too? Is this the best that it can ever hope to achieve?

Let’s consider that our computer is a dog. The dog looks at us and realizes that our face is pointing in the direction of a toy, but we don’t jump on it. So the dog “thinks”: well, if I were this person and was looking at the toy, the only reason I wouldn’t jump on it and bite it is if I were sick. So maybe he is sick and I should stay away from him. Quite a restrictive conclusion.

That’s probably not what we are looking for when we think of computer understanding. So what can be done? Put all our learners in Second Life? 😉
