First, as I have said, I want to thank Curt Monash for this term -- I had been using the term "schema variance," but I will move to poly-structure, or at least promote it as the way to talk about it. Curt has an excellent blog on it here.
My view is relatively simple (to paraphrase Curt):
if (# entities >> # entity-structures) then structured, else poly-structured. Let me explain. A database with 10 tables but a billion records has 1 billion >> 10, hence it is structured. However, an XML database with a billion rows, each with a different structure, has 1 billion entities and 1 billion entity-structures, hence it is poly-structured. So far, so good?
When # entities >> # entity-structures, we can do more than single-entity puts and gets; we can also "analyze" -- we can ask the "database" system which entities have what properties, and which entities are related to which other entities. We can do that because we can reason and optimize at the entity-structure level, and since the number of entity-structures is much smaller, we can afford these kinds of optimizations.
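To make the distinction concrete, here is a minimal Python sketch -- the sample records and the 100x threshold are illustrative assumptions, not anything from Curt's post -- that treats an entity's set of property names as its "structure," classifies a collection by comparing # entities to # entity-structures, and then reasons at the structure level.

    # Classify a collection as structured vs. poly-structured by comparing
    # the number of entities to the number of distinct entity-structures.
    # Sample records and the 100x threshold are assumptions for illustration.
    from collections import defaultdict

    def entity_structure(entity):
        # Treat the sorted tuple of property names as the entity's "structure".
        return tuple(sorted(entity.keys()))

    def classify(entities, ratio=100):
        structures = {entity_structure(e) for e in entities}
        # "Structured" when # entities >> # entity-structures.
        return "structured" if len(entities) >= ratio * len(structures) else "poly-structured"

    entities = [
        {"id": 1, "name": "Alice", "city": "San Jose"},
        {"id": 2, "name": "Bob", "city": "Boston"},
        {"id": 3, "title": "Poly-structure", "tags": ["nosql", "hadoop"]},
    ]
    print(classify(entities))   # 3 entities, 2 structures -> "poly-structured"

    # Structure-level reasoning: which entities have which properties?
    by_structure = defaultdict(list)
    for e in entities:
        by_structure[entity_structure(e)].append(e)
    for structure, members in by_structure.items():
        print(structure, len(members))

With 10 table-like structures and a billion records, the same check comes out "structured," and downstream reasoning can work over the 10 structures rather than the billion entities.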
Now when # entities ~ # entity-structures, what can you do with the data? Clearly, you can store and retrieve, efficiently. That is the focus of many of the NoSQL databases -- CouchDB to name one. Give a key to an entity, store (put) it. Give the key back, retrieve (get) it. RDF and linked data systems also typically operate at this level -- every entity can have a different set of properties and values (or subjects, predicates, objects -- I can never get the order right!)
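As a concrete illustration of that put/get level of service, here is a small sketch against CouchDB's HTTP document API using the requests library; the host, the database name ("entities"), the document id, and the document itself are all made up for illustration, and the database is assumed to exist already.

    # Minimal put/get sketch against CouchDB's HTTP document API.
    # Host, database name, document id, and contents are illustrative; the
    # "entities" database is assumed to have been created beforehand.
    import requests

    BASE = "http://localhost:5984/entities"

    # put: give a key to an entity, store it
    doc = {"kind": "person", "name": "Alice", "interests": ["nosql"]}
    requests.put(f"{BASE}/entity-42", json=doc)

    # get: give the key back, retrieve the entity
    resp = requests.get(f"{BASE}/entity-42")
    print(resp.json())

Note that nothing here cares what shape the document has -- the next put could carry a completely different set of properties, which is exactly the poly-structured case.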
But what about analytics? Can we ask these databases "how many entities satisfy a given property?" These kinds of analyses are very difficult to do when # entities ~ # entity-structures, since you end up pointer chasing, which makes the computational complexity of these kinds of analytical problems very high. One solution that people are beginning to look at is to use Lucene indexes as side structures that index these poly-structured entities, and then to use recent advances in Lucene to answer these kinds of queries.
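Roughly, the idea is to maintain an inverted index from property (or property/value pair) to entity keys, so that "how many entities satisfy a given property?" becomes an index lookup instead of a pointer chase. The sketch below is a toy in-memory version of that pattern -- not actual Lucene, and the records are made up -- just to show why the side structure helps.

    # Toy sketch of the side-index idea: an inverted index from property name
    # to the keys of the entities that carry it. Not actual Lucene -- the same
    # inverted-index pattern, kept in memory for illustration.
    from collections import defaultdict

    entities = {
        "e1": {"name": "Alice", "city": "San Jose"},
        "e2": {"title": "Poly-structure", "tags": ["nosql"]},
        "e3": {"name": "Bob", "blog": "example"},
    }

    property_index = defaultdict(set)
    for key, entity in entities.items():
        for prop in entity:
            property_index[prop].add(key)

    # "How many entities satisfy a given property?" is now a lookup, not a scan.
    print(len(property_index["name"]))   # 2
    print(property_index["tags"])        # {'e2'}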
That brings me to hadoop. I am a firm believer that there are two roles for hadoop and Big Data. One, where the data itself is poly-structured. Two, where the analysis is poly-structured -- i.e. you are looking for some key dimensions in your structured and poly-structured data, but you do not want to be constrained by presupposing what they are. Either way, you want the flexibility to look at the data in different ways. You could, always, have every analysis scan through all the data and look out for some key dimensions. Or you could start to build some "structure" on poly-structure. Analysis 1 adds some key dimensions to the data. Analysis 2 adds other dimensions, or adds hierarchy to the dimensions that Analysis 1 added.
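In Hadoop terms, "looking out for some key dimensions" without presupposing them can be as simple as a streaming pass over the raw records. Below is a Hadoop Streaming-style mapper sketch; the JSON-lines input format and the candidate field names are assumptions, not anything specific to your data.

    #!/usr/bin/env python
    # Hadoop Streaming-style mapper sketch: scan poly-structured JSON records
    # on stdin and emit whichever candidate "key dimension" each one carries.
    # The JSON-lines input format and candidate field names are assumptions.
    import json
    import sys

    CANDIDATE_DIMENSIONS = ["country", "region", "city"]   # hypothetical

    for line in sys.stdin:
        try:
            record = json.loads(line)
        except ValueError:
            continue                          # skip malformed records
        for dim in CANDIDATE_DIMENSIONS:
            if dim in record:
                # tab-separated key/value pairs, as Hadoop Streaming expects
                print(f"{dim}={record[dim]}\t1")
                break

A reducer would simply sum the counts per key; Analysis 2 could rerun with a different candidate list, or layer hierarchy (say, city within region) onto what Analysis 1 found.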
And guess what, if you want to work that way, you need things like HBase to enable this kind of poly-structured analysis. Poly-structured stuff goes into hdfs. Hadoop analysis happens. Structure or dimensionality gets discovered. It gets added to HBase. More analysis happens. More structure gets discovered. More gets added to HBase. Soon, some folks can do their analysis off HBase, and forgo analysis on the raw poly-structured data completely.
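The "discovered structure goes into HBase" step could look something like the sketch below. It uses the happybase client; the host, the table name ("entity_dimensions"), the column family ("d"), and the row keys are all assumptions, and the table is assumed to have been created already.

    # Sketch of adding discovered dimensions to HBase via the happybase client.
    # Host, table name, column family, and row keys are assumptions; the table
    # is assumed to exist already with column family "d".
    import happybase

    connection = happybase.Connection("localhost")
    table = connection.table("entity_dimensions")

    # Analysis 1 discovered a "country" dimension for some raw entities in hdfs.
    table.put(b"entity-42", {b"d:country": b"US"})

    # Analysis 2 adds another dimension (or a level of hierarchy) to the same row.
    table.put(b"entity-42", {b"d:region": b"CA"})

    # Later analyses can work off HBase directly instead of rescanning raw data.
    for row_key, data in table.scan(columns=[b"d:country"]):
        print(row_key, data)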
And if some dimensions become repeatable and important for daily, hourly analysis, they get promoted to a real database...
So in effect, hadoop and hdfs will over time get surrounded by other data structures that handle structure better -- Lucene indexes, HBase, or in some cases real databases also...
Makes sense?
Hi Anant,
I like your article. May I ask where you see Hive and HiveQL in all this if the data in hdfs is poly-structured? Is it the case that you see Hive as somewhat powerless in that case, i.e. the capabilities of HiveQL could not be brought to bear on the data... after all, poly-structured data is hard to analyze
Posted by: Mikeferguson1 | September 15, 2011 at 08:23 AM