Anant Jhingran's Musings


"We are an analytics company that happens to do {pick your vertical}"

I have increasingly begun to hear this as I talk to IBM clients.  Google and Yahoo saying it is one thing.  Even program traders saying it is understandable.  But someone in manufacturing?  Or retail?  Or, as Daniel McCaffrey from Zynga put it at the Strata Conf today, "we are an analytics company that happens to be in on-line gaming."  This has begun to warm the cockles of my heart (is that an expression that Americans understand? -- but of course my blog has a following across the world :) :)

Daniel picked up on something that I had said in a panel just before him.  I had said that increasingly companies are being driven from process-centricity to analytical-centricity.  In that transition, they need to be able to determine what questions to ask, not just ask the questions they know to ask.  I have said that before, so I do not want to be a broken record here.  But what Daniel does not realize is that one of the reasons I say this is that I have learnt from Daniel's and Ken Rudin's previous talks on analytics-driven gaming, where I had heard them use this very expression.  So that was a full circle :)

I have blogged in the past about how, at Zynga, the transformation was also organizational, not just technical.  This "embed data scientists right in the team developing the games" is a far-reaching organizational statement that Daniel brought out again.

There is another aspect of Zynga's use of analytics that is very interesting.  While we tend to associate large data with gaming, and hadoop with large data, a lot of the heavy lifting at Zynga is done with traditional, relational techniques.  Volume by itself, as I have said in the past, is not an indicator of the choice of technology.

There were lots of interesting talks today, including the panel I participated in.  More on it later.

May 25, 2011 | Permalink | Comments (0)

Dimensionality of the data vs. dimensionality of the queries

I recently talked about poly-structured data, and its hidden dimensionality.  That is the data side.  But what about the query side?  Let me give an example.

A nice, relational warehouse keeps the data it manages relatively structured.  But what about the queries?  Most data warehouses are used to repeatedly run "questions that one knows to ask."  Yes, sure, data warehouses (remember, as I said in my panel at IBM's Big Data meeting, "I am at the very least a midwife for IBM's data warehousing efforts," so I know what things of beauty they are :) can be and are used to answer ad hoc queries.  But their bread and butter is to be the workhorse of repetition -- i.e., the query pattern is structured and relatively fixed.  Mostly.

There are two kinds of ad-hocness of querying that we have to deal with.  One, an occasional "ad hoc" query, different from what has been seen before -- something that we in the warehousing world know very well.  Second, a repeated, iterative ad-hocness used to derive some key dimensions and/or mine the data for other patterns.  This is typically a discovery process, something I talked about in this post.  For lack of better phraseology, I call this "high dimensional" querying -- the structure of the query/analysis is flexible.  Relational warehouse deployments have not focused on this in any significant way, though the capabilities have always been there.

Hadoop based systems are of course excellent at dealing with poly-structured (high dimensional) data, but they are also good at dealing with high-dimensional querying.  As I and others have said in the past, "Big data systems can be used to determine what questions to ask, and warehouses can be used to answer the questions that you know to ask."
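
To make the distinction concrete, here is a minimal Python sketch (the record layout and field names are invented for illustration, not from any real system): the first function is a "question you know to ask" -- its structure is fixed and only a parameter changes from run to run -- while the second is discovery-style, grouping the same records by whatever candidate dimension you pick at analysis time.

    from collections import Counter

    # Toy stand-in for warehouse fact records (fields invented for illustration).
    records = [
        {"region": "west", "product": "game-credits", "channel": "mobile", "revenue": 4.0},
        {"region": "east", "product": "game-credits", "channel": "web", "revenue": 2.5},
        {"region": "west", "product": "gifts", "channel": "mobile", "revenue": 1.0},
    ]

    def known_question(region):
        """Low dimensional querying: the shape of the question is fixed;
        only the parameter (region) varies between runs."""
        return sum(r["revenue"] for r in records if r["region"] == region)

    def discover(dimension):
        """High dimensional querying: the grouping dimension itself is chosen
        during the analysis, as part of a discovery loop."""
        totals = Counter()
        for r in records:
            totals[r.get(dimension, "unknown")] += r["revenue"]
        return totals

    print(known_question("west"))   # the repeated, structured query
    print(discover("channel"))      # try one candidate dimension...
    print(discover("product"))      # ...then iterate with another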

With systems like Netezza supporting M/R and other in-database analytics, relational systems are beginning to open up low dimensional (structured) data to high dimensional (unstructured) querying.  And with additions like Hive, HBase etc. to the hadoop framework, Big Data systems are beginning to support low dimensional querying on low and high dimensional data.

Hence this figure (vertical arrows being supplanted by diagonals):

[Figure: dimensionality of the data vs. dimensionality of the queries, with the vertical arrows being supplanted by diagonals]

May 23, 2011 | Permalink | Comments (0)

Pain vs. Gain for Different Cloud Layers

I have talked in the past about the fact that cloud decisions are not made in a vacuum; there is a definite tradeoff of pain vs. gain.

On the gain side are obvious reductions in capex and opex, and even increased business flexibility (by making IT available very easily so that more things get done).

On the pain side are increased requirements for standardization, increased application model change, and risks (perceived or real) that come with increased sharing.

I have also talked about the fact that IaaS, PaaS and SaaS are not each one thing.  I will elaborate in more detail in a later post (gotta write it!), but it is something I alluded to here.  For example, there are three PaaS models I see:

  • Pattern based deployment, a la IBM Workload Deployer
  • Standardized Shared but conformant to current application structures, such as Relational-Database-as-a-Service
  • Standardized Shared but requiring new application structures (i.e. rewrites to take advantage) such as Google App Engine's view of databases.

So the key question for all of us is: what does the pain/gain curve look like?  Is it linear, is it above the line, or below the line?  I would be very much interested in your views here.

[Figure: pain vs. gain tradeoff across the cloud layers]

 

May 20, 2011 | Permalink | Comments (1)

More on Poly-Structured Data and its relationship with Hadoop & Big Data

First, as I have said, I want to thank Curt Monash for this term -- I had been using the term "schema variance," but I will move to poly-structure, or at least promote it as the way to talk about it.  Curt has an excellent blog on it here.

My view is relatively simple (to paraphrase Curt):

if (# entities >> # entity-structures) then structured, else poly-structured.  Let me explain.  A database with 10 tables but a billion records has 1 billion >> 10, hence structured.  However, an XML database with a billion rows, each structured differently, has # entities = # entity-structures = 1 billion, hence poly-structured.  So far ok?
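
As a minimal sketch of this heuristic (the records and the structure fingerprint are purely my illustration, not anyone's formal definition), one can fingerprint each entity by the set of fields it carries and compare the number of distinct fingerprints to the number of entities:

    def structuredness(entities):
        """Compare # entities to # entity-structures, where an entity's
        'structure' is approximated by the set of field names it carries."""
        structures = {frozenset(e.keys()) for e in entities}
        n, s = len(entities), len(structures)
        return "structured" if n > 100 * s else "poly-structured"  # the '>>' threshold is arbitrary

    # Many uniform records vs. records that each look different.
    uniform = [{"id": i, "name": "x", "qty": i % 7} for i in range(10000)]
    varied = [{"id": i, "attr_%d" % i: i} for i in range(10000)]

    print(structuredness(uniform))  # structured: many entities, few structures
    print(structuredness(varied))   # poly-structured: # entities ~ # entity-structures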

When # entities >> # entity-structures, we can not only do single-entity puts and gets, we can also "analyze" -- we can ask the "database" system to tell us which entities have what properties, and what entities are related to other entities.  And we can do that because we can reason and optimize at the entity-structure level, and since the # entity-structures is much smaller, we can afford these kinds of optimizations.

Now when # entities ~ # entity-structures, what can you do with the data?  Clearly, you can store and retrieve efficiently.  That is the focus of many of the NoSQL databases -- CouchDB to name one.  Give a key to an entity, store (put) it.  Give a key to an entity, retrieve (get) it.  RDF and linked data systems also typically operate at this level -- every entity can have a different set of properties and values (or subjects, predicates etc. -- I can never get it right!)
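
At this level the interface really is just keyed puts and gets.  Here is a minimal sketch using the python-couchdb client (the database name, document keys and fields are all made up, and a CouchDB server is assumed at the default local port; any key-value or document store would look much the same):

    import couchdb  # python-couchdb client; assumes a CouchDB server on localhost:5984

    server = couchdb.Server("http://localhost:5984/")
    db = server.create("entities")  # hypothetical database name

    # Give a key to an entity, store (put) it -- each document can have its own shape.
    db["order-42"] = {"type": "order", "items": ["plow", "seed"], "coins": 120}
    db["player-007"] = {"type": "player", "level": 9, "guild": "red-barn"}

    # Give a key to an entity, retrieve (get) it.
    doc = db["order-42"]
    print(doc["items"])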

But what about analytics?  Can we ask these databases "how many entities satisfy a given property?"  These kinds of analyses are very difficult when # entities ~ # entity-structures, since you end up pointer chasing, which makes the computational complexity of these analytical problems very high.  One solution that people are beginning to look at is to use Lucene indexing as a side structure over these poly-structured entities, and then to use some advancements in Lucene to answer these kinds of queries.
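
A minimal sketch of that idea, with a plain in-memory inverted index standing in for Lucene and invented entity data: instead of scanning every differently-shaped entity for every question, maintain a side structure from (property, value) to entity keys and answer counts from it.

    from collections import defaultdict

    # Poly-structured entities: each one carries its own set of properties.
    entities = {
        "e1": {"type": "order", "coins": 120},
        "e2": {"type": "player", "level": 9},
        "e3": {"type": "order", "region": "west"},
    }

    # Side structure (Lucene-like inverted index): (property, value) -> entity keys.
    index = defaultdict(set)
    for key, props in entities.items():
        for prop, value in props.items():
            index[(prop, value)].add(key)

    def count_with(prop, value):
        """How many entities satisfy a given property?  Answered from the
        side index rather than by rescanning every entity."""
        return len(index[(prop, value)])

    print(count_with("type", "order"))  # 2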

That brings me to hadoop.  I am a firm believer that there are two roles for hadoop and Big Data.  One, where the data itself is poly-structured.  Second, where the analysis is poly-structured -- i.e. you are looking for some key dimensions in your structured and poly-structured data, but you do not want to be constrained by presupposing what they are.  Either way, you want the flexibility to look at the data in different ways.  You could always have every analysis scan through all the data, looking for some key dimensions.  Or you could start to build some "structure" on poly-structure.  Analysis 1 adds some key dimensions to the data.  Analysis 2 adds other dimensions, or adds hierarchy to the dimension that Analysis 1 had added.

And guess what, if you want to work in that way, you need things like HBase to enable this kind of poly-structured analysis.  Poly-structured stuff goes into hdfs.  Hadoop analysis happens.  Structure or dimensionality gets discovered.  It gets added to HBase.  More analysis happens.  More structure gets discovered.  More gets added to HBase.  Soon, some folks can do their analysis off HBase, and forgo analysis on raw poly-structured data completely.
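
Here is a minimal sketch of that loop using the happybase HBase client (the table name, column family and "discovered" dimensions are all invented for illustration, and an HBase Thrift server is assumed; in practice the discovery itself would be a hadoop job over the raw data in hdfs):

    import happybase  # Thrift-based HBase client; assumes an HBase Thrift server on localhost

    connection = happybase.Connection("localhost")
    table = connection.table("discovered_dimensions")  # hypothetical, pre-created table

    # Pretend output of Analysis 1: a hadoop job over raw poly-structured data
    # in hdfs that discovered a 'spend_tier' dimension for some entities.
    analysis_1 = {"player-007": "whale", "player-042": "minnow"}
    for entity_key, tier in analysis_1.items():
        table.put(entity_key.encode(), {b"dims:spend_tier": tier.encode()})

    # Analysis 2 adds another dimension (or a hierarchy) on top of the first.
    analysis_2 = {"player-007": "asia-pacific"}
    for entity_key, region in analysis_2.items():
        table.put(entity_key.encode(), {b"dims:region": region.encode()})

    # Later analyses can work directly off HBase, skipping the raw poly-structured data.
    for key, data in table.scan(columns=[b"dims:spend_tier"]):
        print(key, data)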

And if some dimensions become repeatable and important for daily, hourly analysis, they get promoted to a real database... 

So in effect, hadoop and hdfs will over time get surrounded by other data structures that handle structure better -- Lucene indexing, HBase, or in some cases real databases also...

[Figure: hadoop/hdfs surrounded over time by structures that handle structure better -- HBase, Lucene indexes, and databases]

Makes sense?

 

May 17, 2011 | Permalink | Comments (1)

IBM's Big Data Capabilities

On May 11th, I had the privilege to be part of a small gathering of analysts, partners and customers to discuss IBM's POV on Big Data.  The gathering was at IBM's TJ Watson Research Center, where the Jeopardy event had been staged.


We talked about the fact that while volume (bigness) is important, there are two other V's that are equally, if not more, important: variety (dealing with all forms of semi-structured and unstructured data, or as Curt Monash told me, "poly-structured" data), and velocity (speed of analytics, not just batch).  In the context of the former, we announced our Apache Hadoop based IBM BigInsights Basic Edition, V1.1.  In the context of the latter, we announced the availability of InfoSphere Streams, V2.

The analysts' tweets (follow #ibmbigdata on twitter.com) built upon that theme, adding "validity" (a point made by a customer speaker from Acxiom) as another V.  Whether we have three V's, or four, or even five, the point that I made in the panel I was on was that volume, which gave the area its Big Data moniker, is not the most important, at least for the clients we are dealing with.

We also had Eric Baldeschwieler from Yahoo speak on what is happening at Yahoo wrt Hadoop, and what is happening in the open source community.  The reason we wanted Eric to speak was to emphasize for our clients that while hadoop originated in the so-called web-facing companies, the problems these companies are tackling and the IT infrastructure that they have (including the presence of warehouses) are not that remarkably different from our own clients' landscapes.  While Yahoo might be at an extreme with respect to the number of nodes running hadoop (over 40K), the problems they have solved to make hadoop more robust and a good analytical infrastructure are equally applicable to our enterprise clients.

But perhaps the most important reason I wanted Eric to speak was for Yahoo and IBM together to emphasize that we are fully behind Apache Hadoop, that we want to prevent a fork, and that we will build upon it and contribute back to it.

We had lots of questions on IBM's differentiation; I will speak to that in a subsequent post, along with more details about our offerings and partnerships.

But as always -- and here I must express deep appreciation for the IBM organizers of these events -- we had a client panel that was extremely well received.  Just like the client panel at the cloud forum that I talked about in the past, this one spoke to various client use cases -- bioinformatics, healthcare, trading and credit rating -- and demonstrated that Big Data is making a difference, here and now, in a wide variety of enterprise use cases.

Just last Monday, the NY Times had an article about a McKinsey report.  I then read through the McKinsey report itself, and it again points to something that has been obvious to many of us for the last year or so: we are at the beginning of a new wave in the enterprise.

And IBM is going to be a very strong participant here, working deeply with the open source community and making its technologies safe for our clients to adopt in the enterprise, since we will stand behind them and innovate along with the community.

May 15, 2011 | Permalink | Comments (0)

Providers and Consumers of the Cloud -- two sides of the same coin

We tend to focus on the economics for the providers.  My last post, on James Hamilton's talk, was an example.  Multiplex, standardize, optimize -- across different layers of the stack -- and voila, a cloud provider (which could be internal IT) does a better (more economical, elastic, scalable) job than acquiring IT per application.

But there is a second side to the story, which we tend to forget (or at least I do).  In the last two weeks I was reminded of how important cloud is to the consumption side.  Get instantaneous resources when needed, spin them up, spin them down (if necessary) -- accelerating the development of new analytics and new apps.  I.e. cloud makes innovation thrive by making the acquisition of IT trivially easy.

The first example came when I was talking with a friend of mine at a large internet company.  Getting better "lift" in the advertising on their site is virtually priceless.  The challenge is how to have a frictionless innovation engine that allows their engineers/researchers/marketers to try different "lift" strategies -- and they have found that the ability to spin up and spin down large map-reduce clusters allows them to innovate on large data and build better machine learning and lift algorithms.  The efficiency of the cluster deployment is important for IT, but in his mind, it is not as important as the innovation it unleashes.

The second example deals with my (our) very own DBaaS.  We focus on multitenancy as *the* driver for efficiencies on the provider side (btw, I think multitenancy is only one trick of the trade, and perhaps not the best one for enterprises).  But the real advantage, my customers tell me, is that it accelerates innovation -- if acquiring a database is much easier, then many more apps will get built that would never have been built before.

I am sure there are many many other examples (such as test and development), but this is important for all of us hanging around the IT side of the cloud.

April 21, 2011 | Permalink | Comments (0)

James Hamilton and Cloud Center Economics

Those who know me know that James' MIX 10 talk is one of my favorites.  He lays out some very compelling arguments, on which I got a refresher course today when I attended his keynote at the Stanford Computer Forum.

Here are some of his key points:

1. Power is not the largest contributor to cost; servers are.

2. Networking is quite a bit more expensive than it should be -- perhaps some open source activities will help.

3. Cooling etc. have not advanced much technologically; there is a lot of scope.  Amazon, Facebook etc. are becoming pretty innovative here.

4. Variability in workload peaks and valleys is a tremendous opportunity for reducing the peak-to-average ratio, thus increasing server efficiencies.  I have made a similar point that "virtualization" (i.e. multiplexing) gives a lot of utilization benefits.  But his argument is even better.  If someone believes in such multiplexing, then one would not invest in shutting off under-utilized servers -- at best, power is 13% of the total data center cost, and at best, 50% of that can be saved when servers are shut down (see the back-of-the-envelope arithmetic after this list).  Instead, focus on getting even more demand -- turn machines on, rather than shut them down, and take on workloads that, perhaps via spot pricing, help you achieve a better (i.e. lower) peak-to-average ratio.

5. Public clouds, with the largest diversity of workloads, represent tremendous opportunities for such data center optimization, whereas private clouds do not.  Of course, he had to paint the private cloud guys as the "old" guys, but I excuse him for painting them (and many of us) that way because everything else he says makes so much sense.
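
Here is that back-of-the-envelope arithmetic as a small Python sketch.  Only the 13% and 50% figures come from the talk as I heard it; the monthly cost, the idle fraction and the spot margin are invented numbers, purely to show the shape of the argument.

    monthly_cost = 1000000.0  # hypothetical total data center cost per month

    power_share = 0.13    # at best, power is ~13% of total cost (per the talk)
    savable_share = 0.50  # at best, ~50% of that power goes away when servers are shut off

    shutdown_savings = monthly_cost * power_share * savable_share
    print("shutting off idle servers saves at most $%.0f" % shutdown_savings)  # ~6.5% of total cost

    # Alternative: keep the servers on and sell the idle capacity (e.g. via spot pricing).
    idle_share = 0.30   # hypothetical fraction of capacity sitting idle off-peak
    spot_margin = 0.40  # hypothetical revenue from spot workloads, as a fraction of cost

    spot_upside = monthly_cost * idle_share * spot_margin
    print("filling the valleys with spot workloads is worth ~$%.0f" % spot_upside)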

So thank you James.  It is great to see you as always, and your talks should be mandatory reading/listening/watching for everyone interested in clouds.

p.s. edited on 13th April with a link to James' slides from his Stanford Forum talk...

April 12, 2011 | Permalink | Comments (0)

PaaS -- it isn't one thing, is it?

At the same WWW conference that I talked about in my last post, I spoke in detail about the programming models for cloud (my interpretation, obviously :)  There are two slides from that talk which I want to emphasize today:

[Slide 14: virtualization and standardization across the layers of the stack]

 

While the above slide is mostly self-explanatory, I want to make a couple of points:

1. Virtualization is a term that has typically been used for "hardware" or "infrastructure".  To me, virtualization is "sharing", and hence applies to any layer of the stack.  Fundamentally, virtualization gives the semblance of dedicated and elastic resources to an application user, even when the resources are shared.  This dedication and elasticity can happen for hardware, middleware or even the application.

2. Standardization is key for cloud economics.  In hardware virtualization (first column), the middleware and app structures are non-standard, keeping the cost structures above the VM level in place.  In middleware virtualization, the middleware (such as the database) is standardized, but the app structures are still disparate.  In the last column, every level is standardized and virtualized.

3. Now a good question to ask is: why not always have the last column -- isn't it clearly better?  It is, partially.  There is a cost to that model: the apps and the users of the apps have to be willing to conform to standardization and sharing, and that cost might be an undue burden for some or many apps.  Therefore different workloads will start and end at different points on the spectrum.

Now point #3 brings me to the next slide.  Even when one looks at the PaaS layer, which attempts to get to a standardized and virtualized platform, can we get some or many of the benefits if we only standardize but do not necessarily virtualize?  So I have created a middle column between the IaaS and PaaS layers, and I call this pattern-based deployment of standardized stacks.  The idea is: if deployment topologies conform to some standard patterns and stacks, then the PaaS layer can do a lot to take the opex out of the system -- for example, automatically scale, take backups, handle failures etc.  (I sketch what such a pattern declaration might look like right after the slide below.)

[Slide 15: pattern-based deployment as a middle column between IaaS and fully virtualized PaaS]
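
As a purely illustrative sketch of that middle column (this is not IBM Workload Deployer's actual format; the pattern name, node roles and policies are all invented), pattern-based deployment amounts to declaring a standard topology plus the operational policies the platform then automates:

    # Hypothetical declaration of a standardized deployment pattern.  The platform,
    # not the app team, acts on the policies (scaling, backups, failover).
    web_app_pattern = {
        "pattern": "three-tier-web",  # invented pattern name
        "nodes": {
            "http": {"image": "standard-httpd", "count": 2},
            "app": {"image": "standard-appsrv", "count": 4},
            "database": {"image": "standard-rdbms", "count": 1},
        },
        "policies": {
            "autoscale": {"tier": "app", "min": 4, "max": 12, "cpu_target": 0.6},
            "backup": {"tier": "database", "schedule": "daily"},
            "failover": {"tier": "database", "standby": True},
        },
    }

    # Because the topology conforms to a known pattern, these chores come off
    # the app team's plate and move to the platform.
    for policy in web_app_pattern["policies"]:
        print("automated by the platform:", policy)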

The design point of IBM's PaaS is columns 2 and 3.  We have just announced the middle column in the form of IBM Workload Deployer, and you will be hearing a lot more about it, as well as about the virtualized PaaS (last column), such as the Database-as-a-Service that I have talked about in the past.

 

April 11, 2011 | Permalink | Comments (0)

NoSQL and Cloud -- marriage made in heaven or breakup before engagement?

In terms of hype, NoSQL is way up there.  Breathless excitement around key-value stores.  And combine it with cloud, and you get a double dose of ecstasy.  Really?

I have blogged earlier on the connection between hadoop and cloud (as in, "barely a connection.")  I got a chance to express some views on NoSQL in a keynote I gave at the WWW 2011 Conference in Hyderabad, India. 

I made the following points:

1. NoSQL databases might be needed if you need the following (though you'd be surprised by how much traditional databases can do, with their investments in RDF and XML support, as examples):

    a. ultra schema flexibility

    b. ultra high scalability

2. However, do not confuse this with cloud:

    a. 99.9% of the cloud applications that use databases will use traditional databases.

    b. 95% of the cloud applications will not need ultra-high scalability.

Finally, if you fall into the 0.1% or the 5% above and absolutely, positively need a scalable NoSQL database, you have a choice of many, including HBase and Cassandra.  I, and other researchers at IBM, are beginning to favor HBase, because of its synergies with the broader hadoop ecosystem and its simpler consistency model.

So anyway, do NoSQL if you must (though often I would ask, "Why?")  But please do not do it just because you are building some cloud application.  And I would be interested in how the HBase/Cassandra choice evolves, though I have given a view on where we are leaning.

April 08, 2011 | Permalink | Comments (3)

Organizational Implications of Cloud

I am going to be addressing a series of cloud topics in my next few posts (and yes, regular bloggers have a right to say that intermittent bloggers are not true bloggers, so guilty as charged).

We just concluded the IBM Cloud Forum in San Francisco.  In addition to a series of announcements by IBM (which I will go over in a separate post), I always find the client panels at such conferences the most interesting.  Suffice it to say, this one did not disappoint.  The client panel was moderated by Lauren States (from IBM), and had panelists Scott Skellenger (from Illumina), Tony Kerrison (from ING) and Carlos Matos (from Kaiser Permanente).

Lots has been written about cloud technologies -- IaaS, PaaS, SaaS, Business Process outcomes -- and lots has been written about cloud deployment as public, private and hybrid.  Sure, there was some discussion of these in the panel.  But Scott and others (and Scott later on in a breakout session) brought home the following points that I found fascinating -- all organizational issues, not technology issues:

1. If the organization does not think of cloud as a radically new way of doing things, and does not align itself to think that way, it will fail down the road -- it will try to do the old things the new way and will not succeed.

2. IT organizations have typically thought in terms of capex and opex.  But now they need special skills in trading these off, skills in vendor management that are different from procurement, and skills in budget cycles that allow for a "hump" (i.e. an increase in investment) while the cloud transition is going on.

3. According to Saugatuck, by 2014 (though I think that is too optimistic), 50% of the apps that IT runs will be cloud-based.  What Scott pointed out was that even if 50% remain non-cloud, the nature of those apps will change.  Instead of being focused solely on some application function, they will increasingly be focused on integration with the other apps (cloud or non-cloud).  So no IT organization can think, "OK, at least 50% will remain the same."  50% will remain, but not remain the same.

We have always known that technology is easy and organizational issues are hard, but the client panel brought into focus some specific organizational issues that I had not thought of before.  Good learning for me.

In subsequent posts, I will talk about some platform activity that I am helping bring out for IBM, and a "holistic" perspective on what IBM is addressing with its cloud offerings.

April 07, 2011 | Permalink | Comments (0)
