So we handled volume in the previous post, and how volume requires thinking about database layouts slightly differently. But let us be very clear: volume by itself is not the reason to go for NoSQL databases. It is true that
- NoSQL databases will often relax serializability and therefore achieve higher scale. But relational databases in the high 100's of TBs are very common and, anyway, one can always build a petabyte database as 10's of databases of 100's of TBs each.
- Some NoSQL systems have a better reliability model that enables a better price/availability metric than many relational databases. But availability and reliability have no inherent correlation with the SQL/NoSQL debate.
What is absolutely true is that NoSQL is not inherently faster than relational; in fact quite the opposite might be true. How often have I heard this: "Oh, you are having database performance problems, why don't you switch to HBase and your problems will go away?" That is funny the first time, but the 10th time it makes me want to cry... HBase and its cousins are not solutions to performance problems in databases. When enough preprocessing gets done and the schema gets fixed -- when you have gone from determining what questions to ask to asking the known questions repeatedly -- there is nothing to beat databases, my friends. So play with the HBases of the world when you are living in the wild wild west (chaotic schema, or still determining the key dimensions), but do that in addition to, or before, putting the known stuff in databases. See this post for more discussion...
Before you guys jump up and down and say, "Oh, for Anant, databases are the answer, what is the question?", let me tell you what I told the MySQL group.
When "variety" is the problem, relational systems require much more hand-holding than NoSQL systems do. I am not talking about NoSQL transactional systems (where the database guys would argue that salesforce.com has shown how variety can be handled, and that DB2/Oracle, with their XML, RDF and other features, show that variety can also be handled).
I am talking about NoSQL analytical stores. As I have discussed in the past, analytics deals with "knowns" (schema and dimensions) and "unknowns" (schema chaos and discovery). Relational systems are not great at the latter. Let us say you run an analysis and discover some other fact about a set of records/documents. And let us say that this is the first time you are discovering this hidden dimension... Have you tried ALTER TABLE ADD COLUMN on a 10 TB table? Nobody's idea of fun. Building rental columns a la salesforce.com is also nobody's idea of fun. So relational systems are a poor fit for analytical needs where the schema is varying or evolving rapidly.
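To make the "hidden dimension" point concrete, here is a minimal sketch in Python. It uses a plain dict of dicts as a toy stand-in for a sparse, column-oriented store; it is not a real HBase client, and `SparseStore`, `add_dimension` and the `sentiment` column are names I made up for illustration. The idea it tries to show: a newly discovered column is attached only to the rows that actually have a value for it, and nothing else gets rewritten.

```python
from typing import Any, Dict

# Toy stand-in for a sparse NoSQL store: row key -> {column name: value}.
# Hypothetical model for illustration only, not an HBase API.
SparseStore = Dict[str, Dict[str, Any]]


def add_dimension(store: SparseStore, values: Dict[str, Any], column: str) -> int:
    """Attach a newly discovered column to only the rows that have a value.

    Returns the number of rows touched.
    """
    touched = 0
    for row_key, value in values.items():
        # Rows absent from `values` are left alone: no NULL padding,
        # no table-wide schema change.
        store.setdefault(row_key, {})[column] = value
        touched += 1
    return touched


store: SparseStore = {
    "doc1": {"title": "a"},
    "doc2": {"title": "b"},
    "doc3": {"title": "c"},
}

# An analysis run discovers a "sentiment" fact for one document only.
touched = add_dimension(store, {"doc2": "positive"}, "sentiment")
print(touched, store["doc2"])
```

In a fixed-schema table the same discovery would first mean changing the table definition for every row, whether or not the row has the new fact. How painful that is depends on the engine, so treat this as a picture of the data model and not as a benchmark.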
I see the following: schematized data flows into the RDBMS, unschematized data into some NoSQL system (take your pick -- HBase, Mongo, Couchbase, whatever is your favorite poison or manna!). Hadoop jobs run on the unschematized data and add dimensional structures that get stored back in the flexible NoSQL store. And when the dimensions get fixed (and the questions to ask of them get fixed), the fixed data now flows into the RDBMS, if needed. (Below assumes NoSQL = HBase)
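The flow described above can be roughed out in a few lines of Python. This is only a sketch under loud assumptions: Python lists stand in for the RDBMS and the NoSQL store, `enrich` stands in for a Hadoop job, and every name here (`FIXED_SCHEMA`, `route`, `enrich`, `promote`, the `word_count` dimension) is hypothetical.

```python
from typing import Any, Dict, List, Set, Tuple

Record = Dict[str, Any]

# Hypothetical set of columns the RDBMS already knows about.
FIXED_SCHEMA: Set[str] = {"user_id", "amount"}


def route(record: Record) -> str:
    """Schematized records go to the RDBMS, everything else to NoSQL."""
    return "rdbms" if set(record) == FIXED_SCHEMA else "nosql"


def enrich(record: Record) -> Record:
    """Stand-in for a Hadoop job that discovers a new dimension."""
    out = dict(record)
    out["word_count"] = len(str(record.get("text", "")).split())
    return out


def promote(records: List[Record], columns: List[str]) -> List[Tuple[Any, ...]]:
    """Once the dimensions are fixed, project documents into fixed-shape rows."""
    return [tuple(r.get(c) for c in columns) for r in records]


rdbms_rows: List[Record] = []
nosql_docs: List[Record] = []

incoming: List[Record] = [
    {"user_id": 1, "amount": 9.5},                          # known schema
    {"text": "schema chaos lives here", "source": "log"},   # unknown schema
]

for rec in incoming:
    (rdbms_rows if route(rec) == "rdbms" else nosql_docs).append(rec)

# The "Hadoop" step adds structure and writes back to the flexible store.
nosql_docs = [enrich(d) for d in nosql_docs]

# Later, when the questions are settled, fixed rows flow to the RDBMS.
promoted = promote(nosql_docs, ["source", "word_count"])
print(promoted)
```

The point of splitting `enrich` from `promote` is that discovery can keep adding columns in the flexible store for as long as it likes, and only the columns that have settled down get a fixed relational shape.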
In Part III, I will add "velocity" to this story.