Archive Page 2

The state-of-the-art of Semantic Web thinking?

Sometimes you go around the blogosphere and you see posts like these:

RDF/Linked Data Standards Not Good Enough for Intelligent Agents? Or Is It the Opposite?

And I have to say that they make me worried. Why? Because they attract people with experience building semantic-web-like applications, and they all say (note, this is my interpretation, not really what they said): well, I’m not sure we know if RDF is enough. Also, we don’t really want to talk about why it’s not enough and what is needed for it to be enough, because we know that if we start saying something, somebody will come around and show how it can be done in RDF, and we will feel silly for claiming we couldn’t use it.

My experience? Well, I have some good experience modeling-wise, but not so good experience performance-wise. But, like them, I’m afraid that my not-so-good experience was more because I didn’t have enough time to fully understand what was going on than actually a problem with the framework. Let me get to some details of my not-so-good experience:

I was building summary data and then a reporting mechanism that provided cross-cut views on this summary data (effectively summarizing the summary). The tricky part is that some of it was hierarchically related, i.e., there were summaries that could be categorized in a hierarchical fashion (all things for consumer electronics products, or TVs, or HDTVs), and I wanted the report configuration to be aware of that hierarchy. Looking around, I decided to use RDFS to represent the data and the hierarchy, and SPARQL to represent the filter that would select the things I wanted.
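To make the idea concrete, here is a minimal sketch of that kind of hierarchical roll-up. It is plain Python, not the Jena/RDFS/SPARQL setup I actually used, and the category names, numbers and helper functions (`ancestors_and_self`, `roll_up`) are all made up for illustration. The child-to-parent map plays the role that `rdfs:subClassOf` plays in RDFS.

```python
from collections import defaultdict

# Hypothetical category hierarchy: child -> parent.
# This stands in for rdfs:subClassOf statements in an RDFS model.
PARENT = {
    "HDTV": "TV",
    "TV": "ConsumerElectronics",
    "Camera": "ConsumerElectronics",
}

# Hypothetical summary rows: (category, value being summarized).
SUMMARIES = [
    ("HDTV", 120),
    ("TV", 80),
    ("Camera", 40),
    ("ConsumerElectronics", 10),
]


def ancestors_and_self(category, parent=PARENT):
    """Yield the category itself and every category above it."""
    seen = set()  # guards against accidental cycles in the hierarchy
    while category is not None and category not in seen:
        seen.add(category)
        yield category
        category = parent.get(category)


def roll_up(summaries, parent=PARENT):
    """Add each value to its own category and to all of its ancestors."""
    totals = defaultdict(int)
    for category, value in summaries:
        for node in ancestors_and_self(category, parent):
            totals[node] += value
    return dict(totals)


totals = roll_up(SUMMARIES)
for name in ("HDTV", "TV", "ConsumerElectronics"):
    print(name, totals[name])
```

A report configured for "TV" then picks up the HDTV summaries without any special-case logic. With an RDF store, the same selection could probably be written as a SPARQL 1.1 property path along the lines of `?item :category/rdfs:subClassOf* :TV`, where `:category` and `:TV` are hypothetical names, not the ones from my system.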

In general it worked great! Very few lines of code were needed to get it working, with pretty much no complicated business logic added anywhere. However, when the real data came in, things weren’t as “pretty” as I had hoped. Now for some numbers: the data had about 50K triples, and the category hierarchy added another 10K triples (it’s a pretty big hierarchy, but a quite small dataset overall). Everything was calculated on a stateless server with 512 MB in the JVM. I used Jena for the RDF serialization, deserialization, representation and querying. The result was that it worked well for simple reports (without much hierarchical aggregation), generating them in about 10s. But more complicated reports started taking 30-60s. And not only that: if multiple reports were generated at the same time (it’s all a web interface, so it’s like opening multiple tabs, one per report), the server would lock up while doing the SPARQL querying and never return.

So what is the current state? Unfortunately, as I mentioned, I haven’t had time to dig much further into it, so I’m not sure what is going on. I’m sure there are some things I can do to improve it, but it’s really sad that such an apparently simple technology generates such bad results out of the box. It’s not that Jena is a new project. SPARQL support (actually ARQ, the SPARQL query engine that ships with Jena) might not be as old as the full framework, but it’s a mature project.

Anyway, one day I’ll get back to that system, figure out what was going on and post about it. Until then, I have to make sure that clients don’t open multiple tabs when looking at their reports.

Linked data – Linked Movie Data Base

So this morning I was playing around with what apparently is a brand new website with RDF data: LinkedMDB. It’s new, so it mostly has structure but no data, and thus I can’t really review it very well. But even with this lack of data I’m already concerned about its quality. I’ll give the only example I’ve checked out so far (a terrible sample size, I know):

I went to “production_company” – “Twentieth Century-Fox Film Corporation (Production Company)”.

Two things struck me as red flags:

  1. The URL shows that it’s production_company number 1. This is very limiting: if URLs are tied to the type, elements can’t easily cross type boundaries.
  2. It has a link to FreeBase, but to the wrong entity. They link to “Twentieth Century-Fox Film Corporation”, which incidentally has the same name as their entity, but the right element in FreeBase is “20th Century Fox”. I’ve marked the two for merge at FreeBase, as I can’t edit any of the data in LinkedMDB. Let’s see what happens.

In other words, people have to remember that open data is only good if the data inside is dependable. Especially when there is no built-in method for fixing the data.

More into “free” data sources – Swivel

I think I’ve posted about Swivel before on a past blog, but I found myself digging through it again. And I have to say that I found myself once again disappointed by it. It might contain good data, but in general it’s a letdown, mostly because on most searches the only thing I can find is noise. Either it’s data without enough information for you to understand it, like this one (the second hit when you search for “Seattle”):

Top 10 Increases in Total Crime

Or it’s just something that is probably better classified as “private”:

Elite Activity Membership Growth : Note, before you become trigger happy and open this link, let me explain what it is about. Elite Activity apparently is some sort of religion, and this graph shows their membership growth for March and May 2008 (not even consecutive months) to be somewhat flat. That’s all it has! And how did I find it? It was one of the 4 most viewed data sets today.

So, there are some limitations with the site. But let’s try to look beyond them at what is really missing. Here are some suggestions:

  1. Allow adding and filtering by graph metadata: let’s say that I want to get recent data for Seattle. I should be able to specify that I want city data, that the city name is Seattle and that the data should contain the year 2007 or 2008.
  2. Provide the ability to clean up duplicate information.
  3. Somehow cache the source of the data. In many places I tried to click on the source link to understand the data, but I received a 404 or even a DNS error for the site. If they want to allow people to get data from different sites and use those sites to authenticate the data, they should make sure that the sites still contain the data.
  4. Provide the ability to easily merge and reconcile apparently redundant data from multiple graphs. Sometimes there are spikes of “fun” data that appear out there, and people probably flock to create their own visualizations of this data. After the fun is gone, because the data is already a couple of years old, it would be good to be able to clean it up and combine those visualizations of the same data.
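As a rough illustration of suggestion 1, here is a small sketch of what filtering by graph metadata could look like. This is not Swivel’s API (as far as I can tell it has nothing like it); the `Dataset` fields, the `find` helper and the sample entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical metadata attached to one uploaded graph."""
    title: str
    kind: str          # e.g. "city" or "country"
    place: str         # e.g. "Seattle"
    years: frozenset   # years covered by the data


def find(datasets, kind, place, any_year):
    """Return datasets of the given kind and place covering at least one of the years."""
    wanted = set(any_year)
    return [
        d for d in datasets
        if d.kind == kind and d.place == place and wanted & d.years
    ]


# Made-up catalog entries, only to exercise the filter.
CATALOG = [
    Dataset("Seattle rainfall", "city", "Seattle", frozenset({2006, 2007})),
    Dataset("Seattle housing prices", "city", "Seattle", frozenset({1999, 2000})),
    Dataset("Portland rainfall", "city", "Portland", frozenset({2007, 2008})),
    Dataset("US population", "country", "United States", frozenset({2008})),
]

hits = find(CATALOG, kind="city", place="Seattle", any_year=[2007, 2008])
print([d.title for d in hits])
```

The point is only that once the metadata is structured, the “recent data for Seattle” question becomes a simple filter instead of a keyword search that returns noise.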

This is just a short list of things that would improve the site. But one thing that really bugs me about it (mostly because of my current open data mindset) is that they are nice enough to let you import data from multiple places, but the only “API” they seem to have is a way to dump it to Excel. What if I want to cross-relate this data with other things in my web service? I’m out of luck.

There is us in everything – can there be computers in us?

Today I spent some time watching a thriller, The Reaping, and it made me think about why it’s sometimes so hard for some people to watch some of these movies. In my opinion, it all goes back to the way we understand the world outside us: we put ourselves in the shoes of the people we are observing. However, we don’t quite get that imagining ourselves in a fantasy world where there are people out there planning on killing us might not be a good idea.

Minds never stop where they should, right? So after giving up on watching the movie (it’s not that it’s a terrible movie, but it was close to 1 AM and I probably should have been going to bed to enjoy a long Sunday) I started thinking about intelligence and how one can simulate it. If most of our ability to understand people comes from this “self-projection” ability, how can we make an algorithm understand people? Are we algorithms too? Is this the best that it can ever hope to achieve?

Let’s consider that our computer is a dog. The dog looks at us and realizes that our face is pointing in the direction of a toy, but we don’t jump on it. So the dog “thinks”: well, if I was this person and was looking at the toy, the only reason I wouldn’t jump on it and bite it is if I was sick. So maybe he is sick and I should stay away from him. Quite a restrictive conclusion.

That’s probably not what we are looking for when we think of computer understanding. So what can be done? Put all our learners in Second Life? ;-)

Current projects in mind

So I think I’ve settled on project plans for now. I’m going back to my early data aggregation project, combining information about things from multiple open database sources with information about what people think is important and how things relate, taken from discussion sources like Twine. The part that I’m not yet settled on is whether

  1. I’ll dig up my early research project and deal with stock market movement annotation (not really prediction, just looking at the past and being able to link specific behavior with something in the news), or
  2. I’ll do what I’ve done the most in the last couple of years and deal with product information gathering and structuring.

If I know myself, I’ll probably go with #2, because I can then apply what I find back to my current work. But #1 might be “easier” (in terms of the number of required data sources and their availability). I’ll see, and I’ll post a more detailed plan of what I intend to do here once I have time for it.

Scrum-yourself

So my current team is doing Scrum. We are on our 2nd iteration, so it’s something quite new, but I’m already seeing some signs that worry me. Before I get to them, though, I’ll document here some interesting reactions I’ve received from people when I mentioned “scrum” (in a completely different context – I had tendinitis not too long ago and I told my techie friends that it was the result of my fight with the scrum master):

“Oh, you do Scrum? I’m sorry… We do it too and the only interesting things we’ve delivered happened when somebody just got outside the iteration and worked on things in a scrum-less fashion”

“Are you doing Scrum? No? Great! Run away!”

“Scrum? Argh…”

Yes, these were real reactions from people who will remain anonymous. So all these people had the experience and had their reservations. Why? Well, now that I have some experience with it I’m starting to get an idea why. But before I get to it, I’ll try to smooth it out by talking a little bit about what I think is great about it.

  1. Visibility: You know what is going on most of the time. If somebody is struggling to get something to work, you know at the time the struggling starts and not 2 weeks later when this person can’t deliver what they said they were going to.
  2. Nimbleness: Because things are sort of decided day-to-day, it’s easier to readjust plans midway. Things change all the time in most industries.
  3. Communication: Sometimes considered the most important aspect of Scrum (or any Agile method). It forces you to discuss things with everybody all the time. It encourages people to argue and move as a group. This facilitates integration and sharing of information and experience throughout the team.
  4. Well-defined end: by forcing deliverables at the end of the iteration, it forces people to just get it done. Considering a project finished, especially when it involves “fuzzy” things like UI and operations support, is hard.

There is more I could add to this list, but I think these are the most important positive aspects. Now onto my concerns:

  1. Distancing of the individual team member from the project: because all projects are “shared”, each team member ends up losing that maternal connection to the code that is generated. It’s not something that is yours anymore. You lose the will to just stay late to get something extra done, as nobody else is doing it anyway. Your personal life might appreciate it, but I’m not sure this fulfills the coder inside you.
  2. Focus on deliverables generates internal operational mess: it’s hard to explain to a client why you chose to take 5 days to do something that you could have done in 2 just because you claim that this will make it easier to deliver something that is only scheduled 2 months from now. Or because you think that if you choose the slower way you will get something that will be easier to deal with in an operational emergency, even though this might only happen 2 years from now. I think I’ve been through enough operational headaches (I was just on call) to know that the more you build with operations and failure modes in mind, the better. You can never predict all failure modes, but covering as many as you can is the least you should do.
  3. Very long iteration planning meetings for something that is very likely to change in the future. This is aggravated if the team is quite efficient and has a lot of things that they can do in the iteration. You plan them all (for example, we have something like 9 projects planned this iteration) with 1-2-day long tasks, taking many hours of your month to do that and then, in about 2 weeks you find out that half of the things you’ve spent all this time to plan will be pushed out because something else showed up. Maybe this is more of a failure of my own team, but I just think sometimes there is too much of a granular task focus and building 50-100 granular tasks in a meeting is tiring and very error-prone. So if it’s going to be wrong, why do it this granular this early on?

Anyway, we’ll be moving on with it. We’ll learn from it, we’ll deliver a lot of things, and then we’ll adapt the process to whatever really makes sense for the problem space we are dealing with.

Playing with Processing

One of the things I’ve been trying to do is to get back to playing around with technology. Getting to know what is out there so that I don’t get stuck on always using the same solutions to different problems. Today’s technology was Processing. Processing is a programmable environment for interactive graphical interfaces. It’s used mostly for data visualization and a little bit of art.

Why did I end up looking at it? Well, I’m always interested in data visualization. I’m interested because it’s an incredibly hard problem to solve. I’ve worked in the past with some really good people in large dataset visualization and they made me see how challenging it is. But, at the same time, how easy it is to see patterns when you just have a way to show the data to people. The smarter your visualization technique is, the easier it is to see the meaningful patterns. But even with a silly visualization, data has this natural ability to always exhibit some sort of organization.

So, what do I want to visualize? I’m not completely sure yet. I was thinking of doing something like Buzztracker or what OnlineJournalismBlog did, but probably focused on trying to see more of the connections between things and the dynamics of the news. It’s going to be hard to get reliable timestamped information, but it’s something that I’ve always been interested in.

Another possibility for visualization is applications at work. Currently we are doing multiple projects to clean up the Amazon catalog, but once we deploy new cleanup rules, we have very poor visibility into what they are doing and where. Having a good interactive visualization mechanism would potentially provide us with data that can be used to improve our algorithms and datasets.

What are my first impressions of the project? I actually liked it. I thought it was straightforward to do some things, and that many of its abstractions for user interaction (the event handling methods and the draw() method) are quite simple to use and can be quite powerful if used correctly. It’s fun how easy it is to write a program that allows you to just click on the screen and get it to draw different shapes depending on the button pressed.

I’m excited so far! More on this when I find more time to play around with it.

So another blog

If I can’t keep one blog, why did I create a second one? Well the goal is to have a better separation of interests and, hopefully, more freedom for me to write whatever I want in the right places. So, what is going to be written here? That’s a good question that I hope to answer soon. My current idea is to leave this blog for more non-personal stuff. Things like projects, tech article reviews and research. My other blog can be focused on how my life is going and what I want to complain about.

Enjoy!
