Anant Jhingran's Musings

In the world of Big Data, what is the role of databases? Part III

So before I get to velocity, our third V, let me spend some time on cost vs. value tradeoffs for Big Data analytics (which I had alluded to before).  Quite simply, storing data, storing lots of data, and storing it for a long time (even if marginal costs/byte are very very small) is not without cost.  And then querying/analyzing takes CPUs, and takes indexes, and reduces the rate of insertions, which, you will recall from Part I, is one of the primary drivers for sharding.  All this is cost before an ounce of value is derived from analytics.

As I have thought this through, there are 5 parameters that primarily determine "cost".  BTW, all of you could say, "we knew this, why is there a blog post on it?"  To all of you, I say, "I knew it too, but the exercise of writing it down for Apigee taught me a lot."

C = fn (T, W, I, H, Q)

  1. T, the transaction rate
  2. W, the width of the records (assume compressed on missing dimensions, assume columnar, assume whatever pleases you!)
  3. I, the indexing inflation factor (sequential scans, or primary key lookups (in many NoSQL systems), are no fun for a large number of queries, so this will typically be, say, 1.5x or 2x, indicating how much extra we allocate for indexing)
  4. H, history kept, say, in months
  5. Q, the inflation factor due to queries -- queries slow inserts down, queries chew up CPUs, don't you wish queries were just not there? :)  aha, but without queries, it is just cost, no value!  Q might be 1.2x, or much higher...

I have built a spreadsheet for these numbers assuming AWS (with EC2, EBS, S3 and all the associated costs), but I encourage all of you to do these computations for yourselves. 
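
As a rough illustration of how the five parameters might combine, here is a minimal Python sketch. The multiplicative form, the 30-day month, and the price per GB-month are all assumptions made for illustration; the actual formula and the spreadsheet's AWS numbers are not given in this post, and the function names are hypothetical.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month


def stored_gb(t_per_sec, width_bytes, index_factor, history_months):
    """Approximate steady-state storage in GB from T, W, I and H."""
    raw_bytes = t_per_sec * width_bytes * SECONDS_PER_MONTH * history_months
    return raw_bytes * index_factor / 1e9


def monthly_cost(t_per_sec, width_bytes, index_factor, history_months,
                 query_factor, dollars_per_gb_month):
    """Assumed model: storage cost inflated by the query factor Q."""
    gb = stored_gb(t_per_sec, width_bytes, index_factor, history_months)
    return gb * dollars_per_gb_month * query_factor
```

For example, `monthly_cost(1000, 200, 1.5, 6, 1.2, 0.10)` models 1,000 transactions/sec of 200-byte records, 1.5x indexing, 6 months of history, a 1.2x query factor, and a placeholder price of $0.10 per GB-month. Plug in your own prices; the point is only that each factor multiplies the others.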

For those of you in the analytics provider business, you want to make sure that cost and value go hand in hand, and that you (or your customers) are not burdened with costs before they see an ounce of value. 

 We will discuss the "velocity" part in the next post then, folks.

December 15, 2011 | Permalink | Comments (3)

(In) The new world of Big Data, how different is the world of databases? Part II

So we handled volume in the previous post, and how volume requires thinking about database layouts slightly differently.  But let us be very clear.  Volume by itself is not the reason to go for NoSQL databases.  While it is true that

  1. NoSQL databases will often relax serializability and therefore achieve higher scale, relational databases of high 100's of TBs are very common and anyway, one can always build petabyte databases as 10's of {100's of TBs} databases.
  2. It is also true that some NoSQL systems have a better reliability model that enables a better price/availability metric than many relational databases, but availability/reliability have no inherent correlation with the SQL/NoSQL debate.

What is absolutely true is that NoSQL is not inherently faster than relational; in fact quite the opposite might be true.  How often have I heard this: "oh, you are having database performance problems, why don't you switch to HBase and your problems will go away?"  That is funny the first time, but the 10th time it makes me want to cry...  HBases and others are not solutions to performance problems in databases; in fact sometimes the opposite is true.  When enough preprocessing gets done, and the schema gets fixed -- when you have gone from determining what questions to ask to asking the known questions repeatedly -- there is nothing to beat databases, my friends.  So play with the HBases of the world when you are living in the wild wild west (chaotic schema, or still determining the key dimensions), but do that in addition to, or before, putting known stuff in databases.  See this post for more discussion...

Before you guys jump up and down and say, "Oh, for Anant, databases are the answer, what is the question?", let me tell you what I told the MySQL group.

That when "variety" is the problem, relational systems require much more hand-holding than the NoSQL systems do.  I am not talking of NoSQL transactional systems (where the database guys would argue that salesforce.com has shown how variety can be handled, and DB2/Oracle with their XML, RDF and other features will show that variety can also be handled).  

I am talking about NoSQL analytical stores.  As I have discussed in the past, analytics deals with "knowns" (schema and dimensions) and "unknowns" (schema chaos and discovery).  Relational systems are not great at the latter.  Let us say you run an analysis and discover some other fact about a set of records/documents.  And let us say that this is the first time you are discovering this hidden dimension...  Have you tried ALTER TABLE ADD COLUMN on a 10 TB table?  Nobody's idea of fun.  Building rental columns a la salesforce.com is also nobody's idea of fun.  So relational systems cannot really be used for analytical needs where the schema is varying or evolving rapidly.

I see the following: schematized data flows into an RDBMS, unschematized data into some NoSQL system (take your pick -- HBase, Mongo, Couchbase, whatever is your favorite poison or manna!).  Hadoop jobs run on the unschematized data and add dimensional structures that get stored back in the flexible NoSQL store.  And when the dimensions get fixed (and the questions to ask of them get fixed), the fixed data now flows, if needed, into the RDBMS. (The slide below assumes NoSQL = HBase.)

Slide06
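
The flow described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: plain lists stand in for the RDBMS and the NoSQL store, `enrich` stands in for a Hadoop job, and `KNOWN_SCHEMA` and all function names are hypothetical.

```python
KNOWN_SCHEMA = {"user_id", "product", "amount"}  # hypothetical fixed dimensions


def route(record, rdbms, nosql):
    """Send records that already match the fixed schema to the RDBMS, the rest to NoSQL."""
    target = rdbms if set(record) == KNOWN_SCHEMA else nosql
    target.append(record)


def enrich(nosql, derive):
    """Stand-in for a Hadoop job: add derived dimensions and store them back in NoSQL."""
    for record in nosql:
        record.update(derive(record))


def promote(nosql, rdbms):
    """Once dimensions are fixed, move the fixed columns of conforming records to the RDBMS."""
    remaining = []
    for record in nosql:
        if KNOWN_SCHEMA <= set(record):
            rdbms.append({key: record[key] for key in KNOWN_SCHEMA})
        else:
            remaining.append(record)
    nosql[:] = remaining
```

The design point is the ordering: discovery happens in the flexible store, and only data whose dimensions have settled is promoted to the relational side.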

In Part III, I will add "velocity" to this story. 

December 13, 2011 | Permalink | Comments (0)

The Yin and Yang of Cloud and APIs

On Apigee Blog, I have posted two blogs -- one (the Yin) on why clouds need APIs, and the other (Yang) on why APIs need cloud.  

The argument for the Yin is simple -- portals are for people, APIs are for apps.  The aperture for apps is much larger, and allows process-based integration.  So just like websites are becoming clients of backend APIs, cloud portals (login and provision X, or enter a prospect Y, or measure Z) are becoming clients for backend cloud APIs.  And these cloud APIs can service not only the portals, but for companies such as Innotas, deliver true enterprise integration.

The argument for Yang is also simple -- APIs enable the app economy, which has a very different lifecycle than the traditional enterprise application economy.  In the latter, things are planned and resources can be planned equally.  In the former, we just cannot.  Therefore, cloud as a delivery model makes perfect sense for APIs.  Now one can choose internal and external cloud, though I lean towards the latter.

Thanks @sramji for pointing me to the Yin and the Yang duality!!  You the man! And Helen Whelan for bringing out the thoughts!

 

December 09, 2011 | Permalink | Comments (0)

The new world of Big Data, how different is the world of Databases?

Traditionally Big Data has focused on Hadoop.  However, I am seeing an even bigger "WTH (that is H, not F!) should we do for databases?" issue crop up in the minds of people (and something that I have been thinking deeply about for Apigee Analytics also).  I am not going to dive into the muckety-muck of NoSQL vs. SQL (I have debated that in the past, and will do so in the future).

I was in Bangalore last week, and I had the pleasure of addressing the MySQL user group there.  Those of you who know me, know that I like to make people think in my talks, and I chose to (among other things) make them think how the world of databases is different in the new era.

Assume IBM's version of the 3 V's -- volume, variety and velocity.  I will only cover how volume impacts database design.  The impact of variety and velocity on databases is a separate topic.

1. When we were building DB2 Parallel Edition (that became InfoSphere Warehouse), the sharding (we used to call it partitioning then :) was designed to improve query performance.  In the new world, sharding is needed just to keep up with the data rate because one server can only do so many inserts/sec.  

Illustratively, in the good old days,  Slide08

whereas in the new world, Slide09

What that means is that query processing is likely to be more brute force, and needs to go to many more shards than in the old days.  BUT, we have made a conscious choice of favoring insert speeds over queries in the new world.

Of course the two (optimize for both query and inserts) can be combined leading to

Slide10
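
To make the insert-first tradeoff concrete, here is a minimal Python sketch. It assumes hash-based placement with `zlib.crc32` and in-memory lists as shards; the class name is hypothetical and this is not how any particular database implements it. Inserts touch exactly one shard, while a query that does not filter on the shard key has to visit every shard.

```python
import zlib


class InsertShardedStore:
    """Toy store that spreads inserts across shards by hashing the key."""

    def __init__(self, num_shards):
        self.shards = [[] for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash, so the same key always lands on the same shard.
        return zlib.crc32(key.encode("utf-8")) % len(self.shards)

    def insert(self, key, record):
        # One shard per insert: total insert capacity grows with shard count.
        self.shards[self._shard_for(key)].append((key, record))

    def query(self, predicate):
        # Brute-force scatter-gather: every shard is scanned.
        results = []
        for shard in self.shards:
            results.extend(rec for _, rec in shard if predicate(rec))
        return results
```

Adding shards raises the aggregate insert rate, but `query` gets no cheaper; that is the conscious choice of favoring inserts over queries.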

 2. In the good old days, vacuuming old data was carefully done through time-range partitioning, multi-dimensional clustering, chunking and other techniques.  In the new days, vacuuming old data requires the same things!  However, it is also more urgent and important, since volume implies that not everything can be kept on-line forever.  Therefore the database design is deeply influenced by vacuuming considerations.  In fact, I would assert that after insert rates, vacuuming considerations are the second most important.  However, I believe that, combined with the most important consideration, namely inserts, the built-in vacuuming models in the database are not likely to be terribly useful.

I give a simple example here.  Postgres has a wonderful table partitioning feature.  Wonderful BEFORE triggers in the parent table will end up putting the records in the right "partition" (and in my grad days at Berkeley, I had something to do with it, a very small part :).  And one can drop the partitions, redefine the triggers, and everything is hunky dory.  EXCEPT, there will be a significant overhead doing it like this FOR inserts and remember, inserts are kings.  So that kills it...
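
One alternative, sketched here in Python purely as an illustration, is to do the routing in the application so that each insert goes straight to its time-range partition and old data is removed by dropping whole partitions. This is not PostgreSQL's mechanism; dictionaries stand in for tables, monthly ranges are an assumption, and the class name is hypothetical.

```python
from datetime import date


class MonthlyPartitions:
    """Toy time-range partitioning with application-side routing."""

    def __init__(self):
        self.partitions = {}  # (year, month) -> list of records

    def insert(self, day, record):
        # The partition is computed directly from the date; no trigger runs per row.
        self.partitions.setdefault((day.year, day.month), []).append(record)

    def drop_older_than(self, cutoff):
        """Drop every partition strictly before the cutoff's month; return the dropped keys."""
        cutoff_key = (cutoff.year, cutoff.month)
        old = sorted(key for key in self.partitions if key < cutoff_key)
        for key in old:
            del self.partitions[key]
        return old
```

Dropping a partition removes a whole month at once instead of deleting row by row, which is why the partition layout and the history kept are so closely tied.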

3. Finally, there is cost vs. value.  In the olden days, data were stored in "expensive" SANs.  Therefore for every data element, there was a measure of "what it costs to store and what value it gives".  In the new days, data are stored on cheap disks (or EBS'es).  Fundamentally, the equation has changed.  Cost/byte/timeunit = ~0, and therefore store first and worry about value later (of course, we still have to worry about how many timeunits we keep the data for, because that can be infinitely long and therefore cost infinite, see vacuuming above).  Therefore, the databases in the new world are likely to be much much larger requiring much better distributed system design to handle failures. 

Net net, volume itself does not make SQL databases irrelevant.  We just have to think about sharding, partitioning and size differently, and accept that the built-in functions become somewhat less useful.  BUT never let size make you forget RDBMSs.  And even if you do, you still have to think through all of the above issues.


December 07, 2011 | Permalink | Comments (0)

What's the big deal about APIs anyway?

All of us are familiar with SOA -- exposing applications, or application components -- so that new apps and new processes can be written by reusing existing functionality.  And we are all familiar with APIs, a set of (sometimes language-specific) bindings that allow new apps to be written utilizing these bindings (Windows APIs for example).

However, the new world of APIs is the world of SOA brought to the end app developers -- often outside the firewalls of the organization.  And that means that many of the SOA concepts have to become accessible -- frictionless app creation is de rigueur.  Every enterprise has a core business -- Gamespy is in gaming platforms, Netflix is in video and movie delivery, Sears is in retailing.  Attracting more transactions into the core business is *the* new growth model, or should be.  Traditional methods have attracted people to the "web front", in the hope that this translates to more core transactions.

However, there is a movement -- mobile apps!  The more mobile apps that tie in the enterprise's core business, the more transactions will flow through and more the monetization will happen.  However, there are two problems that need to be solved for this to happen...

  1. When was the last time you liked a mobile app that only took you to the "web front"?  Exactly.  So you want a direct connection between a mobile app and the backend systems' APIs that will fulfill those requests.  Which means that this is likely to make the IT managers groan -- the systems were designed for a carefully managed class of applications, and now every Tom, Dick and Harry is going to pound on them?  Help.  Mediate, cache, rate control, block, protect these calls.
  2. However gorpy the backend APIs are (and I have written a few of these gorpy ones myself), they have to be made "accessible", i.e. brain-dead simple to use.  Monetization will come not from charging for using the APIs but from the business that the API use generates!  Which means that RESTification is a pre-req, but it does not stop there.  The whole process of using the APIs has to become frictionless.
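
As one example of what "rate control" in front of a backend can look like, here is a minimal token-bucket sketch in Python. The token bucket is a common general technique chosen here for illustration, not a description of any specific product; the class name and parameters are hypothetical, and the caller passes in the current time so the behavior is easy to test.

```python
class TokenBucket:
    """Allow short bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.last = now

    def allow(self, now):
        """Return True if a call arriving at time `now` (in seconds) may pass."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A mediation layer would typically keep one bucket per app or API key and reject, queue, or serve from cache any call for which `allow` returns False, so that a popular app cannot overwhelm the backend.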

So there you have it -- the providers need some guarantees that the popularity of their APIs will not kill them, and the developers want the whole process of using a provider's APIs to be frictionless.  That is what makes for a great API platform.

And that is what we at Apigee are doing!   

October 04, 2011 | Permalink | Comments (0)

API and Data Analytics

We've seen this trend before -- at IBM, we talked of business automation and business optimization.  First the enterprises set an "application agenda" to get their business processes and apps standardized and consolidated.  And then, in a serious way, they got into business optimization using data and analytics.  At IBM, we called it the "information agenda."  It is not just about getting visibility into processes and applications, or reports (which is important, but nowadays is more or less the cost of doing business). Enterprises that master the Information Agenda are getting a handle on information and leveraging it to optimize the business using analytics.

I am seeing the same thing around API's.  {I am assuming that most of you grok the concept that API's connect backend and enterprise systems to (primarily) front-end and mobile developers, and need to be managed and tracked as a valuable asset.}  The current spend is around the transactions that flow through the API's (*, see below). And of course key clients (**, see below) want visibility into APIs -- who is calling what, when it is happening, what errors are reported, etc.  I call this IT and API metrics.

However, as I talk to customers, it is also clear that they want analytics at the business level -- what products are being bought, what money is being spent on games, what is being tagged most frequently?  I call this business metrics.

So the API team, the lines of businesses and the developers of the apps using APIs want analytics on IT and business metrics.  So far, that is natural.  After all, if something is being used, it must be tracked and monitored.

However, the real game changer is when the metrics are used to do prediction and to take action.  Analysis of an IT level metric might automatically lead to a policy change for caching during peak hours/months. Analysis of a business level metric might allow a game platform provider to make the games more sticky by predicting abandonment and automatic triggers of some offers.  APIs are going to be central to any future enterprise.  And predictive analytics on APIs, in my mind, is going to be central to those enterprises (hopefully 100%) that are trying to optimize, not just automate, their business.

There are two other trends that need highlighting.  One, of course, is cloud.  Can such analytics be delivered in a simple but gorgeous way through the cloud?  And second is data, with so much data out there that is external to an enterprise, can that be brought to bear to deliver some wonderful added value to analytics?

So this is what I will be helping Apigee do.  If you are an Apigee customer, then I will be spending time with you.  If you are not, then I hope that our data and analytics technology will convince you to give Apigee a try.

But most importantly, I will always engage in an open and honest dialog with the blogosphere so that we can collectively "raise all boats".

* I am so used to writing API's (note the apostrophe), but at Apigee they will convert me into writing APIs.  I am not sure whether I will ever master that :) and maybe I will convert the rest of my colleagues to my way!

** oops, at Apigee, it is customers, so another thing I will have to retrain myself on

September 27, 2011 | Permalink | Comments (4)

A new chapter in my (work) life

As a few of you know, I left IBM last week to join Apigee.  IBM is an absolutely wonderful place with the best people and the best technology.  I have been fortunate to work there for 21 years, and have had the privilege of working with and for the best and the brightest.  I had friends as bosses, and bosses who became friends.  I had colleagues who respected each other.  And the best part of being an IBMer is that the customers (even when criticizing some feature or capabilities of our products) had admiration for what IBM brings to the table.  In some small ways, I have given back to IBM by helping it grow a few businesses, but what I have gotten far exceeds what I have given back.  Thank you IBM!

But there comes a stage in life when one examines what's next and sometimes concludes that there is something else to prove.  I reached that point this summer and decided that I had to be in a startup that offered me a very different kind of challenge.  IBM succeeds when innovation leverages the IBM engine, and when it does, the success is BIG.  With the help of my colleagues, there are several technologies that we were able to bring to success using that engine.  The recent efforts around IBM's Big Data capabilities are the latest such example.  However, having done that, I wanted to prove to myself that I could leverage a very different engine, hence a startup.

Now my choice of the startup hinged on a few criteria I had set for myself.  

One, it should already have paying customers, a growing revenue and a deep pipeline of prospects.  This was important since the skills required to bootstrap a startup are, in my mind, very different from the ones needed to grow it, and I did not believe that I had the former skills or interest at this time.

Second, I am a firm believer that data and analytics are the defining technologies for the next decade, and if pressed hard, I will grudgingly acknowledge that mobile is a close second.  So I wanted a startup, and a role in that startup,  that would have a good intersection of analytics and mobile.  

Finally, and especially in a startup, one must work with folks whom one respects and likes and who provide complementary skills.

All of this led me to Apigee.  My day 1 was Friday.  I will write about my transition (the shock and the awe and the gradual acclimatization, I hope!) but most importantly, I will write about the technologies I will help create.  Those of you who have followed me for my information management and cloud thoughts might be a little disappointed in my future postings because I will talk about some slightly different things, but data and analytics and cloud will be undercurrents that run through them.  So un-blogroll me if needed!  I would not mind :)

I want to end with how my last week in IBM transpired.  As I was going through this decision, my immediate boss, several senior vice presidents, and many other bosses I have worked for, talked to me about the pros and cons of what I wanted to do.  In the end, they were disappointed (deeply perhaps), but as I talked to them, it became clear to me that their treating of me as a colleague and as a human being is what makes IBM a wonderful place.  And then I talked with my colleagues and that was so hard. We have had 21 wonderful years.  In the end, one of them captured it beautifully, "Anant, we are sorry to see you go and wish you the best of luck.  We know that you will always love IBM and IBMers, and that you also know that IBM is more than one person and will be even stronger in the future, so go realizing that you have helped in some small way towards IBM's success."

September 18, 2011 | Permalink | Comments (15)

Top 10 reasons to #knowsql, not just #nosql!!

First and foremost, see that I am saying "#knowsql, not just #nosql".  I am not saying "#knowsql, not #nosql." Second, I am not putting #hadoop in the #nosql category -- it is not a #nosql engine, it is a "do what you want with it" engine.  Pig and Hive add data models to it, but that is a separate topic.

10. SQL is the language that talks about "select project joins" but uses the word SELECT for project!

9. SQL databases do not ignore ACID properties, unlike most #nosql databases.  Have you tried to reason about "eventual" consistency a la Cassandra?

8. You have choice of implementation, not lock-ins to one technology, one implementation.

7. Your system does what you want and you do not have to tell how you want it done (Declarative vs. Procedural?)

6. Billions of R&D dollars are being spent on improving SQL databases per year.

5. 95% of the hottest database "system technologies" (like hardware acceleration, GPU exploitation, multi-core exploitation) focus on SQL, not #nosql.  {Obvious Anant perspective, can be questioned as to what I mean by "hottest".}  So SQL will get leaps and bounds of performance improvement over the next few years.

4. 95% of cloud projects are on SQL, not #nosql. {Anant perspective, but verifiable.  Working with @sogrady to see what numbers we can come up with.}

3. SQL databases can now handle RDF, XML, key value stores without sacrificing ACID.

2. Most real data persistence needs are sub ~200 TB, easily a sweet spot for SQL databases.

1. Every #nosql in the end either already has, or will, put up an SQL-like layer.  Look no further than GQL (Google).
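
Reason 7 (declarative vs. procedural) is easy to show with a tiny Python sketch using the standard-library `sqlite3` module. The table name, column names, and sample rows are made up for illustration. Both halves compute the same totals; in the second, the query states what is wanted and the engine decides how to compute it.

```python
import sqlite3

rows = [("games", 30), ("video", 50), ("games", 20)]  # made-up sample data

# Procedural: we spell out how to group and accumulate.
totals = {}
for category, amount in rows:
    totals[category] = totals.get(category, 0) + amount

# Declarative: we say what we want; the database picks the plan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (category TEXT, amount INTEGER)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
sql_totals = dict(
    conn.execute("SELECT category, SUM(amount) FROM purchases GROUP BY category")
)
conn.close()
```

Note that the list after SELECT is really the projection, which is reason 10 in action.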

July 14, 2011 | Permalink | Comments (1)

The education part of my keynote...

People who were there at the Hadoop Summit recall the three parts to my keynote -- entertainment (Watson), education (industry use cases we are seeing) and philosophy.  I have talked philosophy before, and also entertainment.  So let me focus on education, which some not-to-be-named folks like JF called "boring" (JF, you know who you are :) but others like Joe Reger Jr. specifically asked for.

So here it goes.  My view is that we are seeing enterprise use cases around the three V's -- volume, variety and velocity, with variety being the most important, and volume not that much so (hence my disdain for the term "Big Data", but that ship has sailed so no sense in being grumpy about it, unlike "data scientist" which is still not an industry term, so maybe some grumpiness from me and others will reduce its chances of becoming an industry term :)

Slide10

and along the three dimensions, the ~100 engagements that IBM is engaged in are exhibiting various characteristics (edit based on comments received: this is not exhaustive, other industries such as telco etc are also using this extensively):

Slide13

I will post later, if you all are interested, the top use cases that elaborate on column 2 above.

July 13, 2011 | Permalink | Comments (4)

Why I do not like the term "data scientist"

Oh my god, am I a heretic among the believers?  Let me explain why:

  1. Geekification might enhance the importance in the engineering culture of Silicon Valley, but where the technologists need to find jobs (i.e. enterprises), it only adds to the perception that this is too hard, so why bother hiring such geeks?
  2. It will also distract the community from what I think is a fundamental commandment, "Thou shalt make the technologies more consumable."  If we continue to foster the concept that this set of capabilities is built by scientists, for scientists, we will lose sight of the fact that they should be built by scientists and engineers and interface folks and domain experts and ...  for scientists and engineers and interface folks and domain experts and business people.
  3. It is an unnecessary rebranding of well-known disciplines.  I do not buy the idea that magically some new skills have to appear.  Yes sure, Hadoop is different.  But skilled people grok new stuff using their existing repertoire: the basic skill of grokking large-scale systems, analytics etc.

Am I too grumpy today? :)

July 06, 2011 | Permalink | Comments (5)
