I have talked about the NoSQL as CHARM databases before.
As we have been building Apigee Data and Analytics platform, we have been looking at the tradeoffs of RDBMSs and CHARMs. Here is what we are concluding
- When one gets a large volume of data, one needs a a very "low indexing" strategy because that way one can get sequential write speeds. Any indexing makes writes "random" and therefore instead of 100's of MB/sec/disk (assuming no striping), we end up in 10's. Therefore, the bias is towards low indexing and write optimization.
- When one gets a large volume of data, one needs a sharding strategy, since any one server may not be able to sustain a large rate of writes.
- With write optimization, queries naturally suffer. So now one needs some combination of materialized views, sampling and columnar structures so that one can answer queries by efficient "sequential" block reads.
So these are the facts. How does this then help us look at the tradeoffs of RDBMs and CHARMs?
- With low indexing, some of the advantages of RDBMSs go away, since indexing (and joins) are two features that really make RDBMs stand out.
- If sharding is needed, then one has to design sharding in user space for some of the "open source" databases (not for DB2 and Teradata, of course, and to a lesser degree not for Oracle).
- The automatic "sharding across all nodes" as a claimed advantage for CHARMs is not really a significant advantage since one wants to avoid data movements at all costs (which will happen whenever we grow a system, whether background or not), and one wants to avoid small data on all nodes and one wants to avoid secondary indexes not collocated with the data etc etc. So all of this stuff still has to be very carefully designed.
So for us the above becomes six of one and half a dozen of other. So at least till the part I talked about above, almost who cares what technology we use (I am reminded of this video, ROTFL). But we have chosen RDBMS (Postgres to be precise) because of the relative maturity of the technology, and we are dealing with 100's of TB, not 10's of PB.
However, there are two other aspects that will require us to extend our (current) postgres based implementation with CHARM constructs, and we will roll it out over the next few months.
- Different APIs have different shapes, and managing it in a very well schemaed system such as Postgres is not easy.
- Different analytics "add" discovered metatadata or dimensions, and "ALTER TABLE" in any RDBMS is not for the faint-hearted.
Now one might then ask, why still keep Postgres -- if it is six of one or half a dozen of other, and these two aspects are important, why not switch to a CHARM database?
- Maturity of technology, of course, means that we want to make sure that business users hit the more mature system -- i.e. Postgres.
- More indexing is good for faster query responses and there are many places where we need sub-minute responses, so our Postgres system is heavier indexed (and contains lower volume of data). Of course, all of it can be done in Cassandra, but, man, there are only 24 hours in a day, so why do something when it has been done for you :) So the Postgres system is slightly more balanced towards (semi-random) reads and (semi-random) writes, whereas the Cassandra system is optimized for sequential writes and batch reads.
- And do not forget joins -- we still need them, for lots of other data. And doing joins in application space is no one's idea of fun.
That's it. Summary. For the core, CHARM and RDBMs are a wash, and we went with Postgres for its maturity. But for schema variance, CHARM wins out, and for faster query responses, RDBMS wins out. Therefore we have to keep both.
I am not the only one coming to this observation. As far as I know, many many many other companies have reached similar conclusions, but I wanted to give *our* reasons.
Now why Cassandra, and not HBase or other CHARMs? That is a topic of another discussion.