Open Collaborative Research and the availability of data
I was recently on a panel at the IIT 2007 conference, as I had written before. The topic was "Cutting Edge Research at Corporations," and I shared the panel with Ashok Chandra (Microsoft), Raghu Ramakrishnan (Yahoo! Research), and Sunil Shenoy (Intel). The panel was moderated by Don Clark of the WSJ (do your favorite search to find articles by him; I could not find a home page for him).
When I raised the topic of "open collaborative" research, Raghu pounced on me. He said that science has been built upon people being able to verify other people's work by duplicating it. However, in computer science (especially in his and my area -- information management systems), we do not make it easy for things to be duplicated. He gave two reasons for it.
- The code is often not released -- it might be a commercial piece of software (papers based on such commercial software are often the most respected), so there is no way to independently verify the system setup.
- The data are also not freely available. Now in many cases, scientific papers in computer science are based on "artificial data" that is reproducible since the papers give the few parameters to generate the artificial data. However, when the papers are based on real data, it is either "customer data" that is not made public, or some synthesis of public data (crawled for example) that is not easily reproducible outside the boundaries of the big search and enterprise software companies.
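To make the point about "artificial data" concrete, here is a minimal sketch of why such data is easy to reproduce: if a paper states the few generator parameters, including the random seed, anyone can regenerate the identical data set. The function name, the parameters, and the choice of distributions below are all hypothetical, picked only for illustration; they do not come from any particular paper or benchmark.

```python
import random


def generate_rows(num_rows, num_keys, seed):
    """Generate (key, value) rows from a few published parameters.

    Hypothetical generator: keys are uniform over range(num_keys) and
    values are drawn from a normal distribution (mean 100, std dev 15).
    """
    # A private generator seeded explicitly, so the output depends only
    # on the three parameters and not on any global random state.
    rng = random.Random(seed)
    rows = []
    for _ in range(num_rows):
        key = rng.randrange(num_keys)
        value = round(rng.gauss(100.0, 15.0), 2)
        rows.append((key, value))
    return rows


# Two independent runs with the same parameters yield the same data.
first_run = generate_rows(num_rows=1000, num_keys=50, seed=2007)
second_run = generate_rows(num_rows=1000, num_keys=50, seed=2007)
print(first_run == second_run)
```

Nothing comparable exists for customer data or a private crawl: there is no short list of parameters that lets an outsider rebuild it.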
So he said, "Anant, why does IBM not make customer data publicly available?" I could have thought of many snap answers for someone from Yahoo!, but I think Ashok started speaking, so I lost my turn to be smart in front of the audience. Raghu also said that he and the SIGMOD and VLDB communities are looking into it seriously, so I said to myself, this man is doing good; this is no time for me to be frivolous. [We had been frivolous for part of the panel, since Ashok, Raghu and I know each other very well and were making fun of each other, which I think also made the panel more fun.]
But that got me thinking: Raghu does have a point. How are we a science if the two conditions above (released code and available data) do not hold, at least outside the theory field? What can a company such as IBM do to advance this? What can Google, Yahoo!, Microsoft, Oracle, and SAP do? Thoughts?