

Patrick Mueller

"I am not yet seeing an obvious killer app"

But there is no "killer app" for RDBMSs.


Information organisation is the most promising use of the Freebase-like model: the data is structured almost automatically because it is largely self-identifying, in a way that is close to how humans look at the real world. I've been looking to roll it out in content scenarios in media and have found it an exciting prospect, but the killer problem is legacy data. The model works well for new data being added to a system, but existing data, depending on how well or badly it is structured, could cause a lot of trouble, because it would need a fair amount of sanitization and consistency checking.

Within IBM, I can imagine numerous possibilities, but the best idea would be to start with a data set that is smaller in scale (not to be confused with incomplete), make it consistent with the RDF model, generate multiple views, pass it on to something lovely like the Solr framework and get faceted results out of the box.
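To make "faceted results" a little more concrete, here is a minimal sketch in plain Python. It does not use Solr or any RDF library; the sample triples and the function names `facet_counts` and `filter_subjects` are made up for illustration. It only shows the idea: once data is in subject/predicate/object form, facet counts and drill-down filters fall out of simple grouping.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("doc1", "type", "article"),
    ("doc1", "author", "alice"),
    ("doc2", "type", "article"),
    ("doc2", "author", "bob"),
    ("doc3", "type", "video"),
    ("doc3", "author", "alice"),
]


def facet_counts(triples):
    """For each predicate, count how many triples carry each object value."""
    seen = set()
    facets = defaultdict(Counter)
    for s, p, o in triples:
        if (s, p, o) not in seen:  # ignore exact duplicate triples
            seen.add((s, p, o))
            facets[p][o] += 1
    return facets


def filter_subjects(triples, predicate, value):
    """Return the sorted subjects that have the given predicate/value pair."""
    return sorted({s for s, p, o in triples if p == predicate and o == value})


if __name__ == "__main__":
    for predicate, counts in facet_counts(TRIPLES).items():
        print(predicate, dict(counts))
    print(filter_subjects(TRIPLES, "author", "alice"))
```

A real faceting engine would also handle indexing, paging and combined filters; this sketch leaves all of that out.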

The business model, I am afraid, is a bit of an issue right now. But the lovely thing about the Freebase-like model is that the data quality improves and the data enriches itself over time. So what is possible with it can only be found out for sure at that point in time.

Scaling would obviously involve some kind of middleware that does pooling, caching and so on. Left to themselves, I don't think the app servers can handle all the attention.

disclaimer: I am an RDF infant and not even close to being a geek, so consume whatever I've written above at your own risk :-)


there's a free triplestore called Virtuoso that can handle 1 gt (a gigatriple, i.e. a billion triples) no problem. freebase only has 2.8m topics, so figure 10 triples each and that's 0.028 gt. nothing really. Garlik has 8 gt or so of data and measures their query response in the low single-digit ms. plus there are a few others, Allegro and the like.
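The back-of-envelope estimate above can be written out as a few lines of Python. The topic count and the 10-triples-per-topic figure are the ones quoted in the comment (the latter is a guess, not a measurement), and "gt" is read as gigatriples, meaning billions of triples.

```python
TOPICS = 2_800_000        # Freebase topic count quoted in the comment
TRIPLES_PER_TOPIC = 10    # rough guess from the comment, not a measured value
GIGA = 1_000_000_000      # one gigatriple = one billion triples


def gigatriples(topics, triples_per_topic):
    """Estimate the store size in gigatriples."""
    return topics * triples_per_topic / GIGA


estimate = gigatriples(TOPICS, TRIPLES_PER_TOPIC)

if __name__ == "__main__":
    print(f"estimated size: {estimate:.3f} gt")
    print(f"share of a 1 gt store: {estimate:.1%}")
```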

of course, in the semantic web, we might find that the most common predicates only number around 100 or so, so a 'sparse columns' approach like Bigtable could easily work as well.
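A minimal sketch of what a 'sparse columns' layout could look like, in plain Python rather than Bigtable itself: one row per subject, one column per predicate, and a column is stored only for the rows that actually use it. The function names and the sample triples are made up for illustration.

```python
from collections import defaultdict


def to_sparse_rows(triples):
    """Group triples into rows keyed by subject, with predicates as columns.

    A predicate shows up in a row only if that subject uses it, so rows
    stay small even when the full set of predicates is large.
    """
    rows = defaultdict(dict)
    for s, p, o in triples:
        rows[s].setdefault(p, []).append(o)
    return dict(rows)


def lookup(rows, subject, predicate):
    """Return all objects for a subject/predicate pair, or an empty list."""
    return rows.get(subject, {}).get(predicate, [])


if __name__ == "__main__":
    sample = [
        ("a", "name", "A"),
        ("a", "knows", "b"),
        ("a", "knows", "c"),
        ("b", "name", "B"),
    ]
    rows = to_sparse_rows(sample)
    print(rows)
    print(lookup(rows, "a", "knows"))
```

With only a hundred or so common predicates, most queries would touch a handful of columns per row, which is the case this layout is meant for.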

in short, i don't see scalability as a major problem. the larger one is getting people to input semantically annotated data as simply as they fill out a few form fields with plain text, so that there's actually interesting data to aggregate.

personally i'm fine with an FS-based triplestore that uses my own disk. no need to think about billions of records, since i'll never create that many, and no need to rewrite things in Erlang, for that matter.

Kavitha Srinivas

About the business model of Freebase, here are some inklings:

Freebase is just the first application to be built on the Metaweb infrastructure. Metaweb can hold data with any license, including copyrighted material. Companies that wish to use the Metaweb infrastructure to hold that data for their own purposes would pay a fee, particularly at higher volumes. Metaweb would act as a clearinghouse for licensable proprietary data in addition to the open data in Freebase. We also have in mind other services that can be offered to business users of the Metaweb infrastructure. (http://www.openbusiness.cc/2007/05/23/wikipedia-for-data-freebase/)
