So before I get to velocity, our third V, let me spend some time on cost-vs.-value tradeoffs for Big Data analytics (which I alluded to earlier). Quite simply, storing data, storing lots of data, and storing it for a long time (even if the marginal cost per byte is very, very small) is not without cost. And then querying/analyzing takes CPUs, takes indexes, and reduces the rate of insertions, which, as you will recall from part I, is one of the primary drivers for sharding. All of this is cost incurred before an ounce of value is derived from analytics.
As I have thought this through, there are five parameters that primarily determine "cost." BTW, all of you could say, "we knew this, why is there a blog post on it?" To all of you, I say, "I knew it too, but the exercise of writing it down for Apigee taught me a lot."
C = fn (T, W, I, H, Q)
- T, the transaction rate
- W, the width of the records (assume compressed on missing dimensions, assume columnar, assume whatever pleases you!)
- I, the indexing inflation factor (sequential scans, or primary-key lookups in many NoSQL systems, are no fun for a large number of queries, so this will typically be, say, 1.5x or 2x, indicating how much extra storage we allocate for indexing)
- H, history kept, say, in months
- Q, the inflation factor due to queries -- queries slow inserts down, queries chew up CPUs; don't you wish queries were just not there? :) Aha, but without queries, it is all cost and no value! Q might be 1.2x, or much higher...
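To make the formula concrete, here is a minimal back-of-envelope sketch of C = fn(T, W, I, H, Q). The dollar rates and the split between storage and compute costs are hypothetical placeholders, not AWS prices -- plug in your own provider's numbers, as the spreadsheet exercise below suggests.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # rough month, good enough for an estimate

def storage_bytes(T, W, I, H):
    """Bytes on disk after H months of T records/sec, each W bytes wide,
    inflated by the indexing factor I."""
    return T * W * SECONDS_PER_MONTH * H * I

def monthly_cost(T, W, I, H, Q,
                 storage_price_per_gb_month=0.10,     # hypothetical $/GB-month
                 compute_price_per_million_tx=0.50):  # hypothetical $/1M inserts
    """Rough monthly cost: storage for the retained history, plus compute for
    the insert stream inflated by the query factor Q."""
    storage = (storage_bytes(T, W, I, H) / 1e9) * storage_price_per_gb_month
    inserts_per_month = T * SECONDS_PER_MONTH
    compute = (inserts_per_month / 1e6) * compute_price_per_million_tx * Q
    return storage + compute

# Example: 1,000 tx/sec, 500-byte records, 1.5x indexing overhead,
# 12 months of history, queries inflating compute by 1.2x.
print(round(monthly_cost(T=1000, W=500, I=1.5, H=12, Q=1.2), 2))
```

Even with made-up prices, the shape of the function is instructive: H and I multiply your storage bill linearly, while Q taxes every insert you take -- which is exactly the "cost before value" problem.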
I have built a spreadsheet for these numbers assuming AWS (with EC2, EBS, S3 and all the associated costs), but I encourage all of you to do these computations for yourselves.
For those of you in the analytics provider business, you want to make sure that cost and value go hand in hand, and that you (or your customers) are not burdened with costs before they see an ounce of value.
We will discuss the "variety" part in the next post, folks.