Anant Jhingran's Musings


My Defrag talk

I got a chance to elaborate on "At the Edge, it is Broad Data, not Big Data!" today at Defrag.  I will post more details later, but as I was preparing my slides, it occurred to me that one of the best examples of amplifying signal by connecting a broad set of sources, each with a relatively poor signal-to-noise ratio, is Nate Silver and what he did at FiveThirtyEight!  He took a relatively large number of noisy signals (individual polls) and combined them to correctly predict almost everything in the Nov 6th election (he got one Senate seat wrong!).

That, in one meme, is the essence of "Broad Data" put into practice.  And just before I spoke, Jeff Ma (of 21 fame) opined on "It is data, not Big Data," giving his own and Nate Silver's examples, and then after I spoke, @TheUpstartCO tweeted that it was a good day for Nate Silver :)

 

November 15, 2012

My two blogs on why the world of data is changing fast in the new app economy

First, a preview of some of the discussions we'll have at GigaOm's Structure conference in Amsterdam on Oct 17th.

  • Few enterprises are ready for the app economy's explosion of data.  In this, I argue that there are three kinds of new data sources that enterprises now have to deal with -- data around the use of an enterprise's APIs, data around the apps built to those APIs, and data accessed through someone else's APIs.  And such new data requires new capabilities and new business processes within the enterprise.

Second, I argue that in this new world, ETL is dead on arrival.  

  • From ETL to API -- A Changing Landscape for Enterprise Data Integration.  In this, I argue that the shape and form of the data change so rapidly that enterprises are much better off thinking about new system designs.  And no surprises here :), I argue that the future of ETL is API-based data exposure, where different apps that need different data can mix and match through data exposed by the APIs (a minimal sketch follows below).
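
To make the idea concrete, here is a minimal sketch of API-based data exposure, in Python with Flask; the endpoint, fields and data are all hypothetical, not anything Apigee ships.  Each app pulls just the slice it needs, instead of an ETL pipeline pre-shaping the data for it:

```python
# A hypothetical sketch of API-based data exposure (illustrative only):
# apps request the fields and filters they need on demand, rather than
# consuming a pre-shaped ETL extract.
from flask import Flask, request, jsonify

app = Flask(__name__)

# Stand-in for an enterprise data store.
ORDERS = [
    {"id": 1, "region": "EMEA", "amount": 120.0, "status": "shipped"},
    {"id": 2, "region": "APAC", "amount": 80.5, "status": "open"},
]

@app.route("/v1/orders")
def orders():
    # Each app mixes and matches, e.g. ?region=EMEA&fields=id,amount
    region = request.args.get("region")
    fields_param = request.args.get("fields")
    fields = fields_param.split(",") if fields_param else None
    rows = [o for o in ORDERS if region is None or o["region"] == region]
    if fields:
        rows = [{k: o[k] for k in fields if k in o} for o in rows]
    return jsonify(rows)

if __name__ == "__main__":
    app.run()
```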

I have never been so excited about the role of data in this new app economy, and never been so excited about the fact that new capabilities will lead the way.

October 13, 2012

If NoData is not an option, is OData the answer?

A few weeks back, I had the chance to speak at the OData meetup at the Microsoft campus.  There were several interesting discussions there, and I believe that OData has a good chance of becoming the de facto or de jure standard for exposing data APIs, but a few things need to be done before that can happen.

My views on why data APIs are important, what we are observing in the market, and the role of OData are captured in this video of the meetup from Microsoft's Channel 9.

April 26, 2012

More on CHARM vs. RDBMS debate

I have talked about NoSQL as CHARM databases before.

As we have been building the Apigee Data and Analytics platform, we have been looking at the tradeoffs of RDBMSs and CHARMs.  Here is what we are concluding:

  1. When one gets a large volume of data, one needs a very "low indexing" strategy, because that way one can get sequential write speeds.  Any indexing makes writes "random," and therefore instead of 100's of MB/sec/disk (assuming no striping), we end up in the 10's (see the back-of-envelope sketch after this list).  Therefore, the bias is towards low indexing and write optimization.
  2. When one gets a large volume of data, one needs a sharding strategy, since any one server may not be able to sustain a large rate of writes.
  3. With write optimization, queries naturally suffer.  So now one needs some combination of materialized views, sampling and columnar structures so that one can answer queries by efficient "sequential" block reads.
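
To put rough numbers on point 1, here is a back-of-envelope sketch in Python; the disk figures are illustrative assumptions, not benchmarks:

```python
# Back-of-envelope arithmetic for point 1 (illustrative numbers, not benchmarks).
# Sequential writes stream at disk bandwidth; each index touched turns part of
# the workload into random I/O, which is orders of magnitude slower per disk.

SEQ_MB_PER_SEC = 100          # rough sequential write bandwidth of one disk
RANDOM_IOPS = 150             # rough random IOPS of one spinning disk
RECORD_BYTES = 500

def writes_per_sec(num_indexes):
    if num_indexes == 0:
        return SEQ_MB_PER_SEC * 1_000_000 / RECORD_BYTES   # pure sequential append
    # Pessimistically, every index update is a random I/O on the same disk.
    return RANDOM_IOPS / num_indexes

for n in range(4):
    print(f"{n} indexes: ~{writes_per_sec(n):,.0f} records/sec/disk")
```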

So these are the facts.  How does this then help us look at the tradeoffs of RDBMSs and CHARMs?

  1. With low indexing, some of the advantages of RDBMSs go away, since indexing (and joins) are the two features that really make RDBMSs stand out.
  2. If sharding is needed, then one has to design sharding in user space for some of the "open source" databases (not for DB2 and Teradata, of course, and to a lesser degree not for Oracle); a minimal sketch of what that means follows after this list.
  3. The automatic "sharding across all nodes" claimed as an advantage of CHARMs is not really significant, since one wants to avoid data movement at all costs (and it will happen whenever we grow a system, background or not), one wants to avoid small amounts of data on every node, one wants to avoid secondary indexes not collocated with the data, etc.  So all of this still has to be very carefully designed.
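
And for point 2, a minimal sketch of what "sharding in user space" means: the application, not the database, routes each record to a server.  The host names are made up, and a real deployment would also have to handle resharding, replicas and failover:

```python
# A minimal sketch of user-space sharding (point 2 above): the application,
# not the database, decides which server a record lands on.  Hypothetical
# hosts; real systems must also handle resharding, replicas and failover.
import hashlib

SHARDS = ["db-host-0", "db-host-1", "db-host-2", "db-host-3"]

def shard_for(key: str) -> str:
    # Stable hash so the same key always routes to the same shard.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

def insert(record):
    host = shard_for(record["api_key"])
    # In real code: send the INSERT to the connection for `host`.
    print(f"route record {record['id']} -> {host}")

insert({"id": 42, "api_key": "customer-abc"})
```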

So for us, the above becomes six of one, half a dozen of the other.  So, at least for the part I talked about above, it almost doesn't matter which technology we use (I am reminded of this video, ROTFL).  But we have chosen an RDBMS (Postgres, to be precise) because of the relative maturity of the technology, and because we are dealing with 100's of TB, not 10's of PB.

However, there are two other aspects that will require us to extend our (current) Postgres-based implementation with CHARM constructs, and we will roll that out over the next few months.

  1. Different APIs have different shapes, and managing that in a strictly schematized system such as Postgres is not easy.
  2. Different analytics "add" discovered metadata or dimensions, and "ALTER TABLE" in any RDBMS is not for the faint-hearted (see the toy illustration after this list).
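
A toy illustration of point 2 (not our actual code): in a CHARM-style store, a newly discovered dimension is just another key on the row, whereas the schematized equivalent is an ALTER TABLE over billions of rows:

```python
# Why schema variance hurts a strictly schematized system (toy sketch).
# In a CHARM-style wide row, a newly discovered dimension is just another
# key; in an RDBMS it means ALTER TABLE ... ADD COLUMN on a huge table.

wide_row = {"api": "/v1/orders", "calls": 10234}

# An analytics job discovers a new dimension -- just set it:
wide_row["client_geo"] = "EMEA"

# The RDBMS equivalent (painful on a 10 TB table):
#   ALTER TABLE api_facts ADD COLUMN client_geo VARCHAR(8);
print(wide_row)
```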

Now one might then ask: why still keep Postgres?  If it is six of one, half a dozen of the other, and these two aspects are important, why not switch to a CHARM database?

  1. Maturity of technology, of course, means that we want to make sure that business users hit the more mature system -- i.e., Postgres.
  2. More indexing is good for faster query responses, and there are many places where we need sub-minute responses, so our Postgres system is more heavily indexed (and contains a lower volume of data).  Of course, all of it could be done in Cassandra, but, man, there are only 24 hours in a day, so why build something when it has been done for you? :)  So the Postgres system is balanced slightly more towards (semi-random) reads and (semi-random) writes, whereas the Cassandra system is optimized for sequential writes and batch reads.
  3. And do not forget joins -- we still need them, for lots of other data.  And doing joins in application space is no one's idea of fun.

That's it.  Summary: for the core, CHARM and RDBMSs are a wash, and we went with Postgres for its maturity.  But for schema variance, CHARM wins out, and for faster query responses, RDBMS wins out.  Therefore we have to keep both.

I am not the only one coming to this observation.  As far as I know, many, many other companies have reached similar conclusions, but I wanted to give *our* reasons.

Now why Cassandra, and not HBase or other CHARMs?  That is a topic of another discussion.

 

April 25, 2012

Five changes in my thinking in five months since leaving IBM

As many of you know, I left IBM after 21 years to join Apigee.  Foolish, many would say, but an enjoyable ride, nonetheless.  So how has my technical thinking evolved?
  1. Consumers of technology have a very different perspective than producers of technology.  I was the latter before, now I am more of the former.  As an example, see my post on database design...
  2. NoSQL is a bigger deal than I thought while I was at IBM.  While my KnowSQL post is still valid, I now realize that the big 5 in NoSQL (what I call the CHARM databases -- {Cassandra, Couchbase}, HBase, {Riak, Redis} and Mongo) do not exist for no reason.  So NoSQL is legit, and important -- not as a replacement for relational databases, but to solve different problems.  {BTW, how do you like the CHARM term?  I was inspired by the BRIC term for the emerging economies.}
  3. Cloud delivery is measured in months for functions, not years.  I now ask new recruits: what can you do in 6 weeks?  In my previous life of software delivery, yearly or 18-month cycles abounded...
  4. Coding is liberating.  You say, "huh?"  I say, "duh..."  I coded for the first 10+ years of my IBM life, then coded only as a hobby in the last few years.  But coding is great.  And I am loving Python!!
  5. People talk a lot (in a good sense) about the innovations they are doing.  They contribute to open source; they give talks.  And by reading, listening, watching, and most importantly, coding, we all become smarter.  So why is this different from when I was at IBM?  Simply speaking, at IBM I would do all the reading, listening and watching, but I did not appreciate this open common knowledge till I started coding.

That's all folks.  I promise I will not post a new insight every month (or should I?)

March 02, 2012

APIs and Analytics

Sometimes I will blog on the Apigee blog, especially when it deals with API-specific stuff.  More general stuff I will continue posting here.

Check out two postings on the Apigee blog:

http://blog.apigee.com/detail/data_analytics_apis/

{and yes, readers might ask why I am wearing a jacket.  I was suckered into it :) :)}

http://blog.apigee.com/detail/managing_big_data/

 

 

February 27, 2012

In the world of Big Data, what is the role of databases? Part III

So before I get to velocity, our third V, let me spend some time on cost vs. value tradeoffs for Big Data analytics (which I had alluded to before).  Quite simply, storing data, storing lots of data, and storing it for a long time (even if the marginal cost/byte is very, very small) is not without cost.  And then querying/analyzing takes CPUs, takes indexes, and reduces the rate of insertions, which, you will recall from Part I, is one of the primary drivers for sharding.  All this is cost before an ounce of value is derived from analytics.

As I have thought this through, there are five parameters that primarily determine "cost."  BTW, all of you could say, "we knew this, why is there a blog post on it?"  To all of you, I say, "I knew it too, but the exercise of writing it down for Apigee taught me a lot."

C = fn (T, W, I, H, Q)

  1. T, the transaction rate
  2. W, the width of the records (assume compressed on missing dimensions, assume columnar, assume whatever pleases you!)
  3. I, the indexing inflation factor (sequential scans, or primary-key-only lookups as in many NoSQL systems, are no fun for a large number of queries, so this will typically be, say, 1.5x or 2x, indicating how much extra we allocate for indexing)
  4. H, history kept, say, in months
  5. Q, the inflation factor due to queries -- queries slow inserts down, queries chew up CPUs; don't you wish queries were just not there? :)  Aha, but without queries, it is all cost and no value!  Q might be 1.2x, or much higher...

I have built a spreadsheet for these numbers assuming AWS (with EC2, EBS, S3 and all the associated costs), but I encourage all of you to do these computations for yourselves. 
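
For those who want a starting point, here is a minimal sketch of the computation in Python; the unit costs and capacity numbers below are placeholders, not AWS's actual prices -- plug in your own:

```python
# A minimal sketch of C = fn(T, W, I, H, Q); the unit costs below are
# placeholders, not actual AWS prices -- substitute your own numbers.

def monthly_cost(T, W, I, H, Q,
                 storage_per_gb_month=0.10,    # placeholder $/GB/month (EBS-ish)
                 cpu_cost_per_month=500.0,     # placeholder cost of one server
                 inserts_per_server=20_000):   # sustainable inserts/sec/server
    SECONDS_PER_MONTH = 30 * 24 * 3600
    raw_gb = T * SECONDS_PER_MONTH * W * H / 1e9   # H months of history on disk
    stored_gb = raw_gb * I                         # indexing inflation
    servers = (T / inserts_per_server) * Q         # queries chew up capacity too
    return stored_gb * storage_per_gb_month + servers * cpu_cost_per_month

# e.g. 5,000 records/sec, 200-byte records, 1.5x indexing, 6 months, 1.2x query load
print(f"${monthly_cost(T=5000, W=200, I=1.5, H=6, Q=1.2):,.0f}/month")
```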

For those of you in the analytics provider business, you want to make sure that cost and value go hand in hand, and that you (or your customers) are not burdened with costs before they see an ounce of value. 

We will discuss the "variety" part in the next post, folks.

December 15, 2011

(In) The new world of Big Data, how different is the world of databases? Part II

So we handled volume in the previous post, and how volume requires thinking about database layouts slightly differently.  But let us be very clear: volume by itself is not the reason to go for NoSQL databases.  While it is true that

  1. NoSQL databases will often relax serializability and therefore achieve higher scale, relational databases in the high 100's of TBs are very common, and anyway, one can always build a petabyte database as 10's of {100's-of-TBs} databases.
  2. It is also true that some NoSQL systems have a better reliability model that enables a better price/availability metric than many relational databases, but availability/reliability has no inherent correlation with the SQL/NoSQL debate.

What is absolutely true is that NoSQL is not inherently faster than relational; in fact, quite the opposite might be true.  How often have I heard this: "oh, you are having database performance problems, why don't you switch to HBase and your problems will go away?"  That is funny the first time, but the 10th time it makes me want to cry...  HBase and its kin are not solutions to performance problems in databases; in fact, sometimes the opposite is true.  When enough preprocessing gets done and the schema gets fixed -- when you have gone from determining what questions to ask to asking the known questions repeatedly -- there is nothing to beat databases, my friends.  So play with the HBases of the world when you are living in the wild, wild west (chaotic schema, or still determining the key dimensions), but do that in addition to, or before, putting known stuff in databases.  See this post for more discussion...

Before you guys jump up and down and say, "Oh, for Anant, databases are the answer, what is the question?", let me tell you what I told the MySQL group.

It is that when "variety" is the problem, relational systems require much more hand-holding than NoSQL systems do.  I am not talking of NoSQL transactional systems (where the database guys would argue that salesforce.com has shown how variety can be handled, and DB2/Oracle, with their XML, RDF and other features, will show that variety can also be handled).

I am talking about NoSQL analytical stores.  As I have discussed in the past, analytics deals with "knowns" (schema and dimensions) and "unknowns" (schema chaos and discovery).  Relational systems are not great at the latter.  Let us say you run an analysis and discover some other fact about a set of records/documents, and let us say that this is the first time you are discovering this hidden dimension...  Have you tried ALTER TABLE ADD COLUMN on a 10 TB table?  Nobody's idea of fun?  Building rental columns a la salesforce.com is also nobody's idea of fun.  So relational systems cannot really be used for analytical needs where the schema is varying or evolving rapidly.

I see the following: schematized data flows into rdbms, unschematized into some NoSQL system (take your pick -- HBase, Mongo, Couchbase, whatever is your favorite poison or manna!).  Hadoop jobs run on unschematized data and add dimensional structures that get stored back in the flexible NoSQL.  And when the dimensions get fixed (and the questions to ask them get fixed), if needed, the fixed data now flows into RDBMS. (Below assumes NoSQL = HBase)

[Slide06: schematized data flowing into the RDBMS; unschematized data into HBase, with Hadoop jobs adding dimensional structure]
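
In code form, the routing is roughly this (an illustrative Python sketch of the flow, with made-up record types, not a real pipeline):

```python
# A sketch of the data flow described above (illustrative only): records with
# known schema go to the RDBMS; everything else lands in the flexible NoSQL
# store, where Hadoop jobs later attach discovered dimensions.

KNOWN_SCHEMAS = {"order", "invoice"}

def route(record):
    if record.get("type") in KNOWN_SCHEMAS:
        return "rdbms"          # fixed dimensions, known questions
    return "nosql"              # schema chaos, still discovering dimensions

def hadoop_job(nosql_records):
    # Discovery pass: attach a dimension, store it back in the NoSQL system.
    for r in nosql_records:
        r["discovered_geo"] = "EMEA"   # stand-in for a real enrichment
    return nosql_records

print(route({"type": "order"}))        # -> rdbms
print(route({"type": "clickstream"}))  # -> nosql
```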

In Part III, I will add "velocity" to this story. 

December 13, 2011

The Yin and Yang of Cloud and APIs

On Apigee Blog, I have posted two blogs -- one (the Yin) on why clouds need APIs, and the other (Yang) on why APIs need cloud.  

The argument for the Yin is simple -- portals are for people, APIs are for apps.  The aperture for apps is much larger, and allows process-based integration.  So just like websites are becoming clients of backend APIs, cloud portals (login and provision X, or enter a prospect Y, or measure Z) are becoming clients of backend cloud APIs.  And these cloud APIs can serve not only the portals but also, for companies such as Innotas, deliver true enterprise integration.

The argument for the Yang is also simple -- APIs enable the app economy, which has a very different lifecycle than the traditional enterprise application economy.  In the latter, things are planned, and resources can be provisioned accordingly.  In the former, we just cannot plan that way.  Therefore, the cloud as a delivery model makes perfect sense for APIs.  Now one can choose between internal and external clouds, though I lean towards the latter.

Thanks @sramji for pointing me to the Yin and the Yang duality!!  You the man! And Helen Whelan for bringing out the thoughts!

 

December 09, 2011

The new world of Big Data, how different is the world of Databases?

Traditionally, Big Data has focused on Hadoop.  However, I am seeing an even bigger "WTH (that is H, not F!) should we do for the database?" issue crop up in the minds of people (and something that I have been thinking deeply about for Apigee Analytics as well).  I am not going to dive into the muckety-muck of NoSQL vs. SQL (I have debated that in the past, and will do so in the future).

I was in Bangalore last week, and I had the pleasure of addressing the MySQL user group there.  Those of you who know me know that I like to make people think in my talks, and I chose to (among other things) make them think about how the world of databases is different in the new era.

Assume IBM's version of the 3 V's -- volume, variety and velocity.  I will only cover how volume impacts database design; variety and velocity's impact on databases is a separate topic.

1. When we were building DB2 Parallel Edition (which became InfoSphere Warehouse), the sharding (we used to call it partitioning then :) was designed to improve query performance.  In the new world, sharding is needed just to keep up with the data rate, because one server can only do so many inserts/sec.

Illustratively, in the good old days: [Slide08: sharding designed for query performance]

whereas in the new world: [Slide09: sharding designed to keep up with the insert rate]

What that means is that query processing is likely to be more brute force, and needs to go to many more shards than in the old days.  BUT we have made a conscious choice to favor insert speeds over queries in the new world.

Of course, the two (optimizing for both queries and inserts) can be combined, leading to:

[Slide10: a combined design optimizing for both inserts and queries]
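
To make the "brute force" point concrete, here is a toy Python sketch of the scatter-gather that a query layer ends up doing when inserts, not queries, drove the sharding:

```python
# Why queries get more brute force (toy sketch): when data is sharded by
# insert load rather than by query key, a query must fan out to every shard
# and merge the partial results.
from concurrent.futures import ThreadPoolExecutor

SHARDS = [
    [{"api": "a", "calls": 10}, {"api": "b", "calls": 5}],
    [{"api": "a", "calls": 7}],
    [{"api": "b", "calls": 2}, {"api": "a", "calls": 1}],
]

def partial_sum(shard, api):
    return sum(r["calls"] for r in shard if r["api"] == api)

def total_calls(api):
    # Scatter to all shards in parallel, then gather and merge.
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(lambda s: partial_sum(s, api), SHARDS))

print(total_calls("a"))   # 18
```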

2. In the good old days, vacuuming old data was carefully done through time-range partitioning, multi-dimensional clustering, chunking and the like.  In the new days, vacuuming old data requires the same things!  However, it is also more urgent and important, since volume implies that not everything can be kept online forever.  Therefore the database design is deeply influenced by vacuuming considerations.  In fact, I would assert that after insert rates, vacuuming considerations are the second most important.  However, I believe that when combined with the most important consideration, namely inserts, the built-in vacuuming models in databases are not likely to be terribly useful.

I give a simple example here.  Postgres has a wonderful table partitioning feature.  Wonderful BEFORE triggers on the parent table will end up putting the records in the right "partition" (and in my grad days at Berkeley, I had something to do with it -- a very small part :).  And one can drop partitions, redefine the triggers, and everything is hunky-dory.  EXCEPT there will be a significant overhead doing it this way FOR inserts, and remember, inserts are king.  So that kills it...
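
For concreteness, the pattern looks roughly like this -- a sketch of trigger-based partitioning driven from Python with psycopg2, with made-up table names.  The BEFORE trigger fires on every single insert, which is exactly the overhead that kills it:

```python
# Trigger-based Postgres partitioning, sketched via psycopg2 (table and
# column names are made up).  The BEFORE trigger runs on *every* insert,
# which is the per-row overhead the post complains about.
import psycopg2

DDL = """
CREATE TABLE events (ts timestamptz, payload text);
CREATE TABLE events_2011_12 (CHECK (ts >= '2011-12-01' AND ts < '2012-01-01'))
    INHERITS (events);

CREATE OR REPLACE FUNCTION events_insert() RETURNS trigger AS $$
BEGIN
    INSERT INTO events_2011_12 VALUES (NEW.*);   -- route to the partition
    RETURN NULL;                                 -- skip the parent table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER events_route BEFORE INSERT ON events
    FOR EACH ROW EXECUTE PROCEDURE events_insert();
"""

conn = psycopg2.connect("dbname=analytics")   # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute(DDL)
# Vacuuming old data is then just: DROP TABLE events_2011_09;
```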

3. Finally, there is cost vs. value.  In the olden days, data was stored in "expensive" SANs.  Therefore, for every data element, there was a measure of "what it costs to store and what value it gives."  In the new days, data is stored on cheap disks (or EBS volumes).  Fundamentally, the equation has changed: cost/byte/time-unit is ~0, and therefore store first and worry about value later (of course, we still have to worry about how many time units we keep the data for, because that can be infinitely long and therefore cost infinite; see vacuuming above).  Therefore, the databases in the new world are likely to be much, much larger, requiring much better distributed system design to handle failures.

Net net, volume itself does not make SQL databases irrelevant.  We just have to think about sharding, partitioning and size differently, and accept that the built-in functions become somewhat less useful.  BUT never let size make you forget RDBMSs.  And even if you do, you still have to think through all of the above issues.


December 07, 2011
