Recently, some of my colleagues (David Fallside, Kevin Beyer, Steven Parkes, Kaushik Bhaskar, Howard Ho et al.) have been telling me, "Anant, you have to take a look at Freebase." It turns out to be quite interesting: it takes Metaweb's approach (almost a triple store, RDF plus ontologies) and builds a repository for structured information on the web, using social tools rather than academic approaches to taxonomy generation (in contrast with, say, Cyc). Thus, not that anybody cares, "Anant Jhingran is a person", "born in xyz..", etc. Clearly there is a huge amount of such information that can be organized, and I like their approach. But I have a few questions that I have been pondering since I came across Freebase.
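For concreteness, statements like the ones above map naturally onto subject-predicate-object triples. Here is a minimal sketch in Python; the entity and predicate names are my own illustrative choices, not Freebase's or Metaweb's actual schema:

```python
# A toy in-memory triple store: each fact is a (subject, predicate, object) tuple.
triples = {
    ("anant_jhingran", "type", "person"),
    ("anant_jhingran", "born_in", "xyz"),
    ("anant_jhingran", "works_for", "ibm"),
}

def facts_about(subject):
    """Return every (predicate, object) pair recorded for a subject."""
    return {(p, o) for (s, p, o) in triples if s == subject}

print(facts_about("anant_jhingran"))
```

The appeal of the model is exactly this uniformity: any new kind of fact is just another triple, with no schema change required.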
- The Freebase folks do not reveal much about their scaling. The scale-out models for Google and Wikipedia (where partitioning/replication strategies work quite well) do not quite work for such a networked graph; after all, a query on person="anant" with one or two pointer chases would end up pinging several nodes under any partitioning model. So the question is: if we have billions of pieces of information in a dense graph, how does the query load on the system scale?
- What can the business model be? I would assume that only a very small fraction of people who search do Wikipedia searches, and only a very small fraction of Wikipedia searchers will search Freebase... So advertising, while possible, does not seem truly monetizable.
- Which brings me to enterprises. I think there is promise in a Freebase-like model for organizing enterprise information (and for the Freebase folks to make money on it). I would love to try this inside IBM and see what creative uses people find for both creating and using information. Without use, creation will suffer, of course, so I am assuming some killer apps (extensions of enterprise search, maybe?) might emerge with Freebase in the enterprise. I have to look into what options exist for trying Freebase within IBM; does anybody know?
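The scaling worry in the first point can be sketched concretely: hash-partition the entities across nodes, and even a short pointer chase from a single lookup touches several partitions. The node count, graph, and partitioning function below are all hypothetical, just to illustrate the effect:

```python
# Hypothetical: entities hash-partitioned across 4 nodes; a query chases edges.
NUM_NODES = 4

edges = {
    "anant": ["ibm", "freebase_post"],
    "ibm": ["armonk"],
    "freebase_post": ["metaweb"],
}

def node_for(entity):
    # Toy deterministic hash partitioning by entity name.
    return sum(map(ord, entity)) % NUM_NODES

def nodes_touched(start, depth=2):
    """Which partitions does a query starting at `start` ping,
    chasing pointers out to `depth` hops?"""
    touched, frontier = {node_for(start)}, [start]
    for _ in range(depth):
        frontier = [nbr for e in frontier for nbr in edges.get(e, [])]
        touched.update(node_for(e) for e in frontier)
    return touched

print(nodes_touched("anant"))
```

No partitioning function avoids this in a dense graph: wherever the cut falls, some pointer chases cross it, so every query fans out instead of staying local the way a Google or Wikipedia shard lookup does.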
The broader context for these discussions, of course, is Info 2.0 patterns. Will there be a large class of Web 2.0 applications that benefit from a Freebase-like organization? Perhaps, but in my mind, beyond "this is fun," I am not yet seeing an obvious killer app.
"I am not yet seeing an obvious killer app"
But there is no "killer app" for RDBMSs either.
Posted by: Patrick Mueller | September 23, 2007 at 06:01 PM
Information organisation is the best promise of the Freebase-like model, where the data is structured almost automatically because it is pretty much self-identifying, in a manner that is close to how humans look at the real world. I've been looking to roll it out in content scenarios in media and have found it to be an exciting prospect, but the killer problem is legacy data. It works well on new data being added to a system, but existing data, depending on how well or badly it is structured, could create a whole lot of trouble for you, in the sense that it would need a fair degree of sanitization and consistency checking.
Within IBM, I can imagine numerous possibilities, but the best idea would be to start with a data set that is smaller in scale (not to be confused with incomplete), make it consistent with the RDF model, generate multiple views, pass it on to something lovely like the Solr framework, and get faceted results out of the box.
The business model, I am afraid, is a bit of an issue right now. But the lovely thing about the Freebase-like model is that the data quality improves and the data enriches itself over time. So what is possible with it can only be found out for sure at that point.
Scaling would obviously involve some kind of middleware that does pooling/caching etc. Left to themselves, I don't think the app servers can handle all the attention.
disclaimer: I am an RDF infant and not even close to being a geek, so consume whatever I've written above at your own risk :-)
Posted by: shyam | October 03, 2007 at 11:56 PM
There's a free triple store called Virtuoso that can handle a billion triples (1 GT) no problem. Freebase only has 2.8M topics, so figure 10 triples each: that's 0.028 GT, nothing really. Garlik has around 8 GT of data and measures their query response in the low single-digit milliseconds. Plus there are a few others, Allegro and the like.
Of course, on the semantic web we might find that the most common predicates number only around 100 or so, so a "sparse columns" approach like Bigtable's could easily work as well.
In short, I don't see scalability as a major problem. The larger one is getting people to input semantically annotated data as easily as they fill out a few form fields with plain text, so that there's actually interesting data to aggregate.
Personally, I'm fine with an FS-based triple store that uses my own disk. No need to think about billions of records, since I'll never create that many, and no need for rewriting things in Erlang, for that matter.
Posted by: carmen | October 06, 2007 at 01:27 PM
About the business model of Freebase, here are some inklings:
Freebase is just the first application to be built on the Metaweb infrastructure. Metaweb can hold data with any license, including copyrighted material. Companies that wish to use the Metaweb infrastructure to hold that data for their own purposes would pay a fee, particularly at higher volumes. Metaweb would act as a clearinghouse for licensable proprietary data in addition to the open data in Freebase. We also have in mind other services that can be offered to business users of the Metaweb infrastructure. (http://www.openbusiness.cc/2007/05/23/wikipedia-for-data-freebase/)
Posted by: Kavitha Srinivas | October 07, 2007 at 07:06 PM