Before I get to velocity, our third V, let me spend some time on the cost vs. value tradeoffs of Big Data analytics (which I alluded to earlier). Quite simply, storing data, storing lots of data, and storing it for a long time is not without cost, even if the marginal cost per byte is very, very small. Querying and analyzing that data then takes CPUs, takes indexes, and reduces the rate of insertions, which, as you will recall from part I, is one of the primary drivers for sharding. All of this is cost incurred before an ounce of value is derived from analytics.
As I have thought this through, there are five parameters that primarily determine cost. By the way, all of you could say, "we knew this; why is there a blog post on it?" To all of you, I say, "I knew it too, but the exercise of writing it down for Apigee taught me a lot."
C = fn (T, W, I, H, Q)
- T, the transaction rate
- W, the width of the records (assume compression of missing dimensions, assume columnar storage, assume whatever pleases you!)
- I, the indexing inflation factor (sequential scans, or primary-key lookups in many NoSQL systems, are no fun for a large number of queries, so this will typically be, say, 1.5x or 2x, indicating how much extra storage we allocate for indexing)
- H, history kept, say, in months
- Q, the inflation factor due to queries -- queries slow inserts down, queries chew up CPUs; don't you wish queries were just not there? :) Ah, but without queries it is all cost and no value! Q might be 1.2x, or much higher...
I have built a spreadsheet for these numbers assuming AWS (with EC2, EBS, S3 and all the associated costs), but I encourage all of you to do these computations for yourselves.
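For concreteness, here is a minimal sketch of the kind of back-of-envelope arithmetic that spreadsheet performs. The function, its parameter names, and especially the unit prices are all illustrative assumptions of mine, not actual AWS rates; plug in your own provider's numbers.

```python
# Back-of-envelope version of C = fn(T, W, I, H, Q).
# All unit prices are illustrative placeholders, NOT actual AWS rates.

SECONDS_PER_MONTH = 30 * 24 * 3600  # ~2.6 million

def monthly_cost(t_tps, w_bytes, i_index, h_months, q_factor,
                 storage_per_gb_month=0.10,    # assumed $/GB-month (placeholder)
                 compute_per_tps_month=0.50):  # assumed $/sustained-TPS-month (placeholder)
    """Rough monthly cost once H months of history have accumulated."""
    # Raw data retained: T inserts/sec * seconds/month * H months * W bytes/record.
    raw_bytes = t_tps * SECONDS_PER_MONTH * h_months * w_bytes
    # Indexes inflate the stored footprint by I (e.g., 1.5x or 2x).
    stored_gb = raw_bytes * i_index / 1e9
    # Queries inflate the compute needed to sustain the insert rate by Q.
    compute_units = t_tps * q_factor
    return (stored_gb * storage_per_gb_month
            + compute_units * compute_per_tps_month)

# Example: 1,000 TPS, 500-byte records, 1.5x indexing, 12 months of history, Q = 1.2
print(f"${monthly_cost(1000, 500, 1.5, 12, 1.2):,.0f} per month")
```

Note how H and I multiply the storage term while Q multiplies the compute term: history and indexing make the data bigger, while queries make serving it hotter.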
For those of you in the analytics provider business, you want to make sure that cost and value go hand in hand, and that you (or your customers) are not burdened with costs before they see an ounce of value.
We will discuss the "variety" part in the next post, folks.
Thanks, Anant, for putting down here in a lucid way what you presented at the Apigee campus. Looking forward to the next post.
Posted by: Account Deleted | December 17, 2011 at 07:20 AM
I'm really enjoying these Big Data blogs. Especially liked "Pig and Hive add data models" -- more please. Hope you can review AsterData, which merges MapReduce + parallel RDBMS (best of both?).
Please add LinkedIn login -- Facebook for adults.
Posted by: Daniel Graham | December 19, 2011 at 03:35 PM
Pretty useful indeed. Should some of these indirect costs also be taken into consideration, since they are proportional to the size of the data?
- Bandwidth costs of transporting the data to its final destination. The assumption is that the environment where the data is generated is not necessarily the same as the environment in which it is processed.
- Intermediate storage (before the data is transported to its final destination: S3, EBS, etc.)
- CPU cost. All the intermediate compute (to run Scribe, collectors, etc.) required to receive the logs, store them temporarily, and send them on to the final destination.
Posted by: Forcecarrier.wordpress.com | January 03, 2012 at 07:11 AM
Thanks for the information, Anant, and for sharing the breakdown of C = fn(T, W, I, H, Q).
Posted by: Vijayakumar Ramdoss | January 28, 2012 at 10:17 AM