Traditionally, Big Data has focused on Hadoop. However, I am seeing an even bigger "WTH (that is H, not F!) should we do for a database?" issue crop up in people's minds (and it is something that I have been thinking deeply about for Apigee Analytics as well). I am not going to dive into the muckety-muck of NoSQL vs. SQL (I have debated that in the past, and will do so in the future).
I was in Bangalore last week, and I had the pleasure of addressing the MySQL user group there. Those of you who know me know that I like to make people think in my talks, and I chose to (among other things) make them think about how the world of databases is different in the new era.
Assume IBM's version of the 3 V's: volume, variety and velocity. I will only cover how volume impacts database design. The impact of variety and velocity on databases is a separate topic.
1. When we were building DB2 Parallel Edition (that became InfoSphere Warehouse), the sharding (we used to call it partitioning then :) was designed to improve query performance. In the new world, sharding is needed just to keep up with the data rate because one server can only do so many inserts/sec.
What that means is that query processing is likely to be more brute force, and needs to go to many more shards than in the old days. BUT, we have made a conscious choice of favoring insert speeds over queries in the new world.
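To make the trade-off concrete, here is a minimal sketch of insert-oriented sharding. It is an illustration under assumptions, not a description of DB2 Parallel Edition or any particular product: the `ShardedStore` class, the `shard_for` helper and the shard count are all hypothetical, and in-memory lists stand in for servers. Each insert lands on exactly one shard, while a query on anything other than the shard key has to visit every shard.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for illustration


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard using a stable hash (not Python's randomized hash())."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy store: each list stands in for one database server."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        self.num_shards = num_shards
        self.shards = [[] for _ in range(num_shards)]

    def insert(self, key: str, row: dict) -> None:
        # An insert touches exactly one shard, so adding shards adds insert capacity.
        self.shards[shard_for(key, self.num_shards)].append((key, row))

    def query(self, predicate) -> list:
        # A query that is not on the shard key is brute force: scatter to all
        # shards and gather the matching rows.
        results = []
        for shard in self.shards:
            results.extend(row for _, row in shard if predicate(row))
        return results
```

The design choice mirrors the point above: the insert path is as cheap as possible, and the query path pays for it by fanning out.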
Of course the two (optimize for both query and inserts) can be combined leading to
2. In the good old days, vacuuming old data was carefully done through time-range partitioning, multi-dimensional clustering, chunking and others. In the new days, vacuuming old data requires the same things! However, it is now more urgent and important, since volume implies that not everything can be kept on-line forever. Therefore the database design is deeply influenced by vacuuming considerations. In fact, I would assert that after insert rates, vacuuming considerations are the second most important. However, I believe that when combined with the most important consideration, namely inserts, the built-in vacuuming models in the database are not likely to be terribly useful.
I give a simple example here. Postgres has a wonderful table partitioning feature. Wonderful BEFORE triggers on the parent table will end up putting the records in the right "partition" (and in my grad days at Berkeley, I had something to do with it, a very small part :). And one can drop the partitions, redefine the triggers, and everything is hunky dory. EXCEPT, there will be a significant overhead FOR inserts when doing it like this, and remember, inserts are king. So that kills it...
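One possible way around the trigger overhead (my illustration, not necessarily the only answer) is to let the application pick the partition itself and vacuum by dropping whole tables. The sketch below uses Python's built-in `sqlite3` module so that it runs anywhere. The `events_YYYYMMDD` naming scheme, the `DailyPartitions` class and the single `payload` column are assumptions made up for the example.

```python
import sqlite3
from datetime import date, timedelta

PREFIX = "events_"  # hypothetical naming scheme: one table per day


def partition_name(day: date) -> str:
    return f"{PREFIX}{day:%Y%m%d}"


class DailyPartitions:
    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.known = set()  # partitions already created, so most inserts skip DDL

    def insert(self, day: date, payload: str) -> None:
        table = partition_name(day)
        if table not in self.known:
            self.conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (payload TEXT)")
            self.known.add(table)
        # The application chooses the partition; no trigger runs on the insert path.
        self.conn.execute(f"INSERT INTO {table} (payload) VALUES (?)", (payload,))

    def drop_older_than(self, today: date, keep_days: int) -> list:
        # Names sort lexicographically in date order, so a string compare is enough.
        cutoff = partition_name(today - timedelta(days=keep_days))
        names = [r[0] for r in self.conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        dropped = sorted(n for n in names if n.startswith(PREFIX) and n < cutoff)
        for name in dropped:
            # Vacuuming is a cheap DROP of a whole table, not a row-by-row DELETE.
            self.conn.execute(f"DROP TABLE {name}")
            self.known.discard(name)
        return dropped
```

The price is that queries spanning several days now have to name each daily table themselves, which is the same "favor inserts over queries" trade made in point 1.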
3. Finally, there is cost vs. value. In the olden days, data were stored in "expensive" SANs. Therefore for every data element, there was a measure of "what it costs to store and what value it gives". In the new days, data are stored on cheap disks (or EBS volumes). Fundamentally, the equation has changed. Cost/byte/timeunit = ~0, and therefore we store first and worry about value later (of course, we still have to worry about how many timeunits we keep the data for, because that can be infinitely long and therefore infinitely costly; see vacuuming above). Therefore, the databases in the new world are likely to be much, much larger, requiring much better distributed system design to handle failures.
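A back-of-the-envelope calculation shows why the retention period still matters even when the cost per byte is close to zero. The insert rate, row size and retention periods below are made-up numbers for illustration only, and `steady_state_bytes` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def steady_state_bytes(inserts_per_sec: int, bytes_per_row: int, retention_days: int) -> int:
    """Data kept on-line once old data is vacuumed after retention_days."""
    return inserts_per_sec * bytes_per_row * SECONDS_PER_DAY * retention_days


# Hypothetical workload: 10,000 inserts/sec at 200 bytes per row.
for days in (30, 365):
    terabytes = steady_state_bytes(10_000, 200, days) / 1e12
    print(f"{days:>3} days retained: {terabytes:.1f} TB")
```

Volume grows linearly with the retention period, so without vacuuming it grows without bound, however cheap each byte is.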
Net net, volume by itself does not make SQL databases irrelevant. We just have to think about sharding, partitioning and size differently, and accept that the built-in functions become somewhat less useful. BUT never let size make you forget RDBMSs. And even if you do, you still have to think through all of the above issues.