It's Big Data week, yet again. In the last two months we have seen all of the dramas and confusions attendant upon emerging markets, yet none of the emerging clarity which one might expect when a total sea change is taking place in the way in which we extract value from data content. Then this week, with all the aplomb of an elephant determined not to be left behind in a world which has apparently decided that the hula hoop is the only route to sanity, Oracle announced its enterprise Big Data solution. Again. Only now it is called the Big Data Appliance. It started shipping on Tuesday. And the world will never be the same again.

At the heart of the Oracle launch is a Hadoop license. This baby elephant lies at the heart of almost everything. The two Hadoop-based commercializations have both raised finance in the lead-up to 2012: Cloudera ($40m) and Hortonworks ($20m), while other sector players like MapR, who also exploit Hadoop, found 2011 a really good time to raise money. And this had a radiating effect on the whole data handling sector. Neo4j, a database technology for graph storage and resolution (Neo Technology, based in Malmö and Menlo Park), raised $10m in a round led by Fidelity. Meanwhile, Microsoft signed a deal with Hortonworks, IBM said it would launch Hadoop in the Cloud, EMC (Greenplum) went for MapR, Dell announced a Hadoop-based initiative, and the world waits and wonders what Hewlett Packard will do, now that it has Autonomy for analytics.

So now we have plenty of initiatives, and, as usual, not much idea of who the next generation of users will be. The first generation speak for themselves. We can see the benefits that Facebook derive from being able to use Hadoop-based tools to find connections and meanings in their content that would have been impossible to reveal cost-effectively in a prior age. And the same would be true of such unlikely bedfellows as the Department of Homeland Security, or Walmart, or Sony (think Playstation Network), or the Israeli Defence Force, or the US insurance industry (via Lexis Risk), or Lexis Nexis (who announced a Big Data integration with MarkLogic), let alone the two players who effectively started all this: Yahoo! (Hadoop) and Google (MapReduce). So asking where it goes next is a legitimate question, but one which can only be answered if we accept that the next group of users are never going to recreate the Google server farms in order to break into these advantageous processing environments. The next group of intensive users will have their XML content on MarkLogic, or their graphical data on Neo4j. They will want to use the US census data remotely (so will contract with Amazon for processing time on Amazon's web services), and will use a large variety of third party content held in similar ways. Some of their own content will still be held locally on MySQL databases – like Facebook – while others will be working in part or fully in the Cloud, and combining that with their own NoSQL applications. But the essential point here is that no one will be building huge data warehousing operations governed by rigid and mechanistic filing structures. Increasingly, we are leaving the data where it is, and bringing the analytical software to it, in order to produce results that are independent of any single data source.
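For readers who have never peered inside the elephant, the Hadoop model is less exotic than it sounds: a map step, a shuffle, and a reduce step, with the map step scheduled on the machines that already hold the data blocks. The toy, single-machine Python sketch below is mine and purely illustrative; it shows the pattern, not any particular product mentioned above.

```python
# A toy, single-machine sketch of the map -> shuffle -> reduce pattern that
# Hadoop applies at cluster scale. In a real deployment the map step runs on
# the nodes that already hold the data blocks, which is the whole point of
# "bringing the software to the data". Sample records are illustrative only.
from collections import defaultdict


def map_phase(record: str):
    # Emit (key, 1) for every token in a record, e.g. counting term mentions.
    for token in record.lower().split():
        yield token, 1


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    # Combine all values seen for one key into a single result.
    return key, sum(values)


def run_job(records: list[str]) -> dict[str, int]:
    # The "shuffle": group every mapped value under its key before reducing.
    grouped: dict[str, list[int]] = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    sample = ["big data week yet again", "big data appliance ships"]
    print(run_job(sample))  # {'big': 2, 'data': 2, 'week': 1, ...}
```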

And this too produces another sort of revolution. The front door to working in this way is now the organizational software itself. When Lexis Risk announced at the end of last year that they were going to take HPCC open source, a number of critics saw that as turning their back on an exploitation opportunity. Yet it makes very real sense in the context of Oracle, Microsoft and IBM seeking to build their own “solutions”. Some businesses will want to run their own solutions, and will make a choice between open source Hadoop and open source HPCC. Others in systems integration will seek out open source environments to create unique propositions. But since it was always unlikely that Lexis Risk was going to challenge the enterprise software players in their own bailiwick, open source is a way of getting a following, harvesting vital feedback, and earning not insignificant returns in servicing and upgrading users.

I am also delighted to see that another of the likely winners seems to be MarkLogic, since I have been proud to work with them and to speak at their meetings for a number of years. For publishers and information providers, it is now clear that XML remains the route forward. But MarkLogic 5 is clearly being positioned as the information service provider’s socket for plugging into the Big Data environment. Anyone who believes that scientists will NOT want to analyse all data in a segment, or engineers source all relevant briefs with their ancillary information, or lawyers cross-examine all documentation regardless of location, or pharma companies examine research files in the context of contra-indications should stop reading now and take up fishing. My observation is that Big Data is like due diligence: once someone does it, even if the first results are not impressive, all competitors have to do it. The risk of not trying to find the indicative answer by the most advanced methods is too great to take.

The news (BBC, 29 December) that orangutans in Milwaukee are using iPads to watch David Attenborough while covertly observing each other’s behaviour reminds me at once of how “early cycle” our experience of tablet tech still is, and how little we still extract from the experience we have of all digital technologies. So, by way of apologizing for missing last week (minor knee procedure, but the medical authorities advised that no reader of mine could possibly deserve my last thoughts before going under the anaesthetic…) and wishing you all (both…?) a belated happy Christmas, I am going to sort through the December in-tray.

The key trends of 2011 will always be, for me, the landmark strides made towards really incorporating content into the workflow of professionals, and the progress made in associating previously unthinkable data collections (not linked by metadata, structure and/or location) in ways that allowed us to draw out fresh analytical conclusions not otherwise available to us. These are the beginnings of very long processes, but already I think that they have redefined “digital publishing”, or whatever it is that we name the post-format (book, chapter, article, database, file) world we have been living in now for a few years and are at last beginning to recognize. Elsevier recognized it all right with their LIPID MAPS lipid structures App (http://bit.ly/LipidsApp) earlier this month, and I should have been quicker to see this. This App on SciVerse does all of the workflow around lipid metabolism and is thus integral to the research into lipid-based diseases (stroke, cancer, diabetes, Alzheimer’s, arthritis, to name a few). The LIPID MAPS consortium is a multi-institutional, research-based organization which has marshalled into its mapping all of the metadata and nomenclature available – common and systematic names, formula, exact mass, InChIKey, classification hierarchies and links to relevant public databases. Elsevier adds the entity searching that allows the full text and abstracts to support the mapping and, in data analysis terms, to draw the sting from a huge amount of researcher process effort. Whenever I hear the old Newtonian saw about “standing on the shoulders of giants” I replace “shoulders” with “platforms”.
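To give a flavour of what marshalling that metadata and nomenclature might look like in practice, here is a small, hypothetical Python sketch of mine: the record layout and the crude name-matching function are illustrative assumptions, not the actual LIPID MAPS schema or the SciVerse App's entity search.

```python
# A hypothetical lipid record illustrating the kinds of metadata the consortium
# marshals: names, formula, exact mass, InChIKey, a classification hierarchy,
# and links out to public databases. Field names here are my own invention.
from dataclasses import dataclass, field


@dataclass
class LipidRecord:
    common_name: str
    systematic_name: str
    formula: str
    exact_mass: float                  # monoisotopic mass in Daltons
    inchikey: str
    classification: list[str]          # hierarchy, broadest category first
    external_links: dict[str, str] = field(default_factory=dict)


# Example entry (values are illustrative)
cholesterol = LipidRecord(
    common_name="cholesterol",
    systematic_name="cholest-5-en-3beta-ol",
    formula="C27H46O",
    exact_mass=386.3549,
    inchikey="HVYWMOMLDIMFJA-DPAQBDIFSA-N",
    classification=["Sterol Lipids", "Sterols", "Cholesterol and derivatives"],
    external_links={"PubChem": "https://pubchem.ncbi.nlm.nih.gov/compound/5997"},
)


def entity_match(text: str, records: list[LipidRecord]) -> list[LipidRecord]:
    """Crude stand-in for entity searching: flag any record whose common or
    systematic name appears in a passage of full text or abstract."""
    lowered = text.lower()
    return [r for r in records
            if r.common_name.lower() in lowered
            or r.systematic_name.lower() in lowered]


if __name__ == "__main__":
    print(entity_match("Elevated cholesterol is implicated in stroke.",
                       [cholesterol]))
```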

So how do Elsevier pull off a trick like this? By being ready and spending years in the preparatory stages. Elsevier, in my view, has become two companies: alongside a traditional, conservative journal publisher has evolved a high tech science data handling company, conceived in ScienceDirect and reaching, via Scirus and Scopus, a sort of adolescence in SciVerse. This effort now moves beyond pure data into the worktool App, driven by SciVerse Applications (www.applications.sciverse.com) and the network of collaborating third party developers which is increasingly driving these developments (http://developers.sciverse.com). This is and will be a vital component. Not even Elsevier can do all these things alone. The future is collaborative, and here is the market leader showing it understands that, and knows that science goes forward by many players, large and small, acting together. And if developers can find, under the Elsevier technology umbrella, a way of exposing their talents and earning from them (as authors were wont to do with publishers) then another business model extension has been made. There is much evidence here of the future of science “publishing” – and while it may be doubted that many (two?) companies can accomplish these mutations successfully, Elsevier are making their bid to be one of them.

And there is always a nagging Google story somewhere left un-analysed, usually because one could either write a book on the implications or ignore them, on the grounds that they may never happen. But Google is the birthplace of so much that has happened in Big Data that I am loath to neglect BigQuery. With an ordinary sized and shaped company this would all be different. I could say, for example, that LexisNexis is taking its Big Data solution, HPCC (www.hpccsystems.com), open source because it wants to get its product implemented in many vertical market solutions without having to go head to head with IBM, Oracle or SAP. But Google clearly relishes the thought of taking on the major analytics players on the enterprise solutions platforms, and clearly has that in mind with this SQL-based service, which has been around for about a year and now enters beta with a waitlist of major corporate users anxious to test it. And yet, wait a minute: Google, Facebook and Twitter led us into the NoSQL world because the data types, particularly mapping, and the size of the databases involved pushed us into the Big Data age and past the successful solutions created in the previous decade in SQL enquiry. So is what Google is doing here driven mostly by its analysis of the data and capabilities of major corporates (Google doing market research and not giving the market what Google thinks is good for them!), or is this something else: a low-level service environment that may take off and splutter into life, or may beta and burn like so many predecessors? Hard to tell, but worth asking the question of the Google Man Near You. Meanwhile, the closest thing to a Big Data play in publishing markets remains MarkLogic 5.0. Coming back to where I started on Big Data, one of the most significant announcements in a crowded December had Lexis Nexis – law this time, not Risk Solutions – using MarkLogic 5 as the way to bring its huge legal holdings together, search them in conjunction with third party content and mine previously unrecognized connectivities. Except that I should not have said “mine”. Apparently “mining” and “scraping” are now out of favour: now we “extract” as we analyse and abstract!
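Coming back to BigQuery for a moment, the proposition itself is unglamorous: ordinary SQL aggregation run remotely, next to very large tables, instead of a hand-rolled NoSQL or MapReduce job. Below is a minimal sketch using the google-cloud-bigquery Python client against one of Google's public sample tables; the project name is a placeholder of mine, and the client library shown here post-dates the beta service described above.

```python
# A minimal sketch of a BigQuery-style aggregation. Project name is a
# placeholder; the point is simply that this is plain SQL run remotely,
# next to the data, rather than a hand-written MapReduce job.
from google.cloud import bigquery

client = bigquery.Client(project="my-placeholder-project")

sql = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 10
"""

# Submit the query and iterate over the result rows as they stream back.
for row in client.query(sql).result():
    print(row.word, row.total)
```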

However, I send every scraper and miner seeking a way forward every good wish for 2012. And me? Well, I am going to check out those orangutans. They may have rewritten Shakespeare by now.
