My personal voyage in the world of software for search and data service development continues. I had the pleasure last week of hearing a Tableau (http://www.tableausoftware.com/) user talk about the benefits of visualization, and came away with a strong view that we do not need to visualize everything. After all, visualization is either a solution – a way of mapping relationships to demonstrate a point not previously understood – or a way of summarizing results so that we can take them in quickly. I did not think of it as a communication language, and if that is what it is then clearly we are only in the foothills. Pictures do not always sustain narrative, and sometimes we kid ourselves that once we have the data in a graph we all know what it means. Visualization needs a health warning: “The Surgeon General suggests that before inhaling any visualization you should first check the axes.” However, when data visualization gets focussed it becomes really exciting. Check out HG Data (www.hgdata.com), a way of analysing a corporation’s complete span of relationships:

“While LinkedIn tracks the relationships between people in business, HG Data tracks the underlying relationships between the business entities themselves.”

Now that is a seriously big claim, but you can begin here to see plug-in service values from Big Data which will shape the way we look at companies in future. But my real object this week was elsewhere – in deep and shallow Space. A subject of speculation to me over 20 years ago was whether we would ever be able to control analytically the floods of data then beginning to arrive from satellites, which were inundating space research centres. In its day, this was the first “drinking from the firehose” phenomenon, and it would appear to me retrospectively that we never really cracked this one, so much as learnt to live with our inadequacies. In the intervening time we have become experts at handling very large dataflows, because Google was forced to learn how to do it. And in the intervening years the flood has grown past tsunami, ceased to be an issue about space research, and become an issue about how we run Earth.

So first let’s update on the Space side of things. Those few research satellites that I encountered in 1985 have now been joined, according to Frost and Sullivan, by a vast telemetry and measurement exercise in the skies above us which will result in around 927 satellites by 2020. Some 405 will be for communication, with earth observation (151), navigation (including automatic aircraft landing) and reconnaissance figuring high. Only 75 will be devoted to the R&D which initially piqued my interest in this. But since the communication, navigation and observation functions will measure accurately down to one metre, we shall inevitably find our lives governed in similar micro-detail by what these digital observers discover.

Now step over and look at SpaceCurve (http://spacecurve.com/). I had the pleasure of speaking to its founder, Andrew Rogers, a week or so ago and came away deeply impressed by the position they have taken up. Andrew is a veteran of Google Earth (and a survivor of the UK Met Office!). He is also a problem solver, big time. Taking the view that Google may have cracked its own problems but was not going to crack anything of this scale, he left, and the result is SpaceCurve:

“Immediately Actionable Intelligence
SpaceCurve will deliver instantaneous intelligence for location-based services, commodities, defense, emergency services and other markets. The company is developing cloud-based Big Data solutions that continuously store and immediately analyze massive amounts of multidimensional geospatial, temporal, sensor network and social graph data.
The new SpaceCurve geospatial-temporal database and graph analysis tools will enable application developers and organizations to leverage the real-time models required for more powerful geospatial and other classes of applications and to extend existing applications.”

As I understand it, what SpaceCurve is about is solving the next-generation problem before we have rolled out the current partial solution. This is 2.0 launching before 1.0 is fully out of beta. The problems that Andrew and his colleagues solved in interval indexing and graph analysis are not a part of the current Big Data market leaders’ output, but they are very much in line with the demands of geospatial data flows. Here real-time analytics just do not do the job if they depend on column stores that assume an order relationship. The thing to do is to abandon those relationships. SpaceCurve is not just looking at far bigger data environments: it suggests that they cannot be handled in the ways that we currently envisage as being “big data”.
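To make the point about order relationships concrete for myself, here is a minimal sketch in Python. It is emphatically not SpaceCurve’s method, which I have not seen; the readings, their labels and the `overlapping` function are all invented for illustration. Each reading is valid over a time interval, and the question is which readings overlap a given window. Sorting on start time lets us discard readings that begin too late, but every remaining reading must still be checked on its end time, which is exactly where a single sort order stops helping.

```python
from bisect import bisect_left

# Hypothetical sensor readings: (start, end, label), each valid over [start, end).
readings = [
    (0, 100, "long-running track"),
    (10, 12, "short ping A"),
    (40, 45, "short ping B"),
    (90, 95, "short ping C"),
]


def overlapping(sorted_readings, window_start, window_end):
    """Return labels of readings whose interval overlaps [window_start, window_end)."""
    starts = [r[0] for r in sorted_readings]
    # Sorting by start lets us drop everything that begins at or after the window ends...
    cutoff = bisect_left(starts, window_end)
    # ...but every earlier reading must still be tested on its end time.
    return [
        label
        for start, end, label in sorted_readings[:cutoff]
        if end > window_start
    ]


readings.sort()
result = overlapping(readings, 91, 93)
print(result)
```

The long-running track starts first and ends last, so it turns up in almost every query however the short pings are ordered. A purpose-built interval index would avoid the linear scan over candidates; this sketch only shows why ordering on one column is not enough.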

Despite the increased size of content handling, SpaceCurve see themselves searching in a partially federated manner, since many data holders, and in particular governments, will not allow the data off the premises. Governments and corporations share the need to be able to see provenance and determine authenticity, so SpaceCurve’s role in these massive data collections may be in part as an outsourced custodial authority, looking after the data on the owner’s site. And indeed, the problem for SpaceCurve may be one of which markets it chooses first and where the key interest comes from – government and public usage, or the enterprise markets.

The next major release is due in 2013, so we shall soon find out. Meanwhile, it is striking that a major investor here, Reed Elsevier Ventures, has a parent that invested, through Lexis, in Seisint, also a deeply government-aligned environment, and more recently in the open source Big Data environment, HPCC. Investing in the next generation is always going to make sense in these fast-moving markets.

This may be the age of data, but the questions worth asking about the market viability of information service providers are no longer about content. They are about what you do to content-as-data as you seek to add value to it and turn it into some form of solution. So, in terms of Pope’s epigram, we could say that the proper study of Information Man is software. Data has never been more completely available. Admittedly, we have changed tack now on the idea that we could collect all that we need, put it into a silo and search it. Instead, in the age of big data, we prefer to take the programme to the data: structured and unstructured, larger collectively than anything tackled before the emergence of Google and Yahoo!, and then Facebook, and inspired by the data volumes thrown off by those services. And now we have Thomson Reuters and Reed Elsevier knee-deep in the data businesses and throwing up new ways of servicing data appropriate to the professional and business information user. So shall we in future judge the strategic leadership of B2B, STM, financial services or professional information services companies by what they know about the decisions they need to make about implementing which generation of what software to have what strategic effect on their marketplaces? I hope not, since I fear that like me they may be found wanting.

And clearly having a CTO but not knowing the right questions to ask him, or what the answers mean, is not sufficient either. In order to get more firmly into this area myself I wrote a blog last month called “Big Data: Six of the Best”, in which I talked about a variety of approaches to Big Data issues. In media and information markets my first stop has always been MarkLogic, since working with them has taught me a great deal about how important the platform is, and how pulling together existing disparate services onto a common platform is often a critical first step. Anyone watching the London Olympics next month and using BBC Sport to navigate results and entries and schedules, with data, text and video, is looking at a classic MarkLogic 5 job (www.marklogic.com). But this is about scale internally, and about XML. In my six, I wanted to put alongside MarkLogic’s heavy lifting capacities someone with a strong metadata management tradition, and a new entrant with exactly those characteristics is Pingar (www.pingar.com). Arguably, we tend to forget all the wonderful things we said about metadata a decade ago. From being the answer to all questions, it became a very expensive pursuit, with changing expectations from users and great difficulties in maintaining quality control, especially where authors created it, fudging the issue for many information companies.

So Pingar, which started in New Zealand before going global, appropriately started its tools environment somewhere else. Using the progress made in recent years in entity extraction and pattern matching, they have created tools to manage the automatic extraction of metadata at scale and speed. Working with large groups of documents (we are talking about up to 6 terabytes – not “biggest” data but large enough for very many of us), metadata development becomes a batch processing function. The Pingar API effectively unlocks a toolbox of metadata management solutions, from tagging and organization at the levels of consistency that we all now need, to integration of the results with enterprise content management, with communications and with collaboration platforms. SharePoint connectivity will be important for many users, as will the ability to output into CRM tools. Users can import their own taxonomies effectively, though over time Pingar will build facilities to allow taxonomy development from scratch.
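For readers who want to picture metadata development as a batch function, the following is a rough sketch under my own assumptions, not the Pingar API, which I have not reproduced here. The taxonomy, the documents and the function names `tag_document` and `tag_batch` are hypothetical, and simple whole-word matching stands in for real entity extraction. It tags each document against an imported taxonomy and builds a small tag-to-document index of the kind that could then be searched or passed on to other systems.

```python
import re
from collections import defaultdict

# Hypothetical imported taxonomy: preferred term -> surface forms to match.
TAXONOMY = {
    "Geospatial data": ["geospatial", "satellite imagery"],
    "Metadata": ["metadata", "taxonomy"],
}


def tag_document(text, taxonomy=TAXONOMY):
    """Return the sorted taxonomy terms whose surface forms occur in the text."""
    lowered = text.lower()
    tags = set()
    for term, forms in taxonomy.items():
        for form in forms:
            # Whole-word match, so a form does not fire inside a longer token.
            if re.search(r"\b" + re.escape(form) + r"\b", lowered):
                tags.add(term)
                break
    return sorted(tags)


def tag_batch(documents):
    """Tag a batch of {doc_id: text} and build a tag -> [doc_id, ...] index."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for tag in tag_document(text):
            index[tag].append(doc_id)
    return dict(index)


docs = {
    "doc-1": "Satellite imagery feeds a geospatial database.",
    "doc-2": "Authors rarely maintain metadata consistently.",
    "doc-3": "A taxonomy helps organise geospatial records.",
}
index = tag_batch(docs)
print(index)
```

Because every document is tagged by the same rules against the same taxonomy, the consistency problem that arises when authors create their own metadata does not occur, at the cost of whatever the matching rules miss.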

As members of the Pingar team talked me through this, two thoughts persisted. In the first instance, the critical importance of metadata. Alongside Big Data, we will surely find that the fastest way to anything is searching metadata databases. They are not either/or, they are both/and. I am still stuck with the idea that however effective we make Big Data file searching, we will also need retained databases of metadata at every stage. And every time we need to move into some sort of ontology-based environment, the metadata and our taxonomy become critical elements in building out the system. Big Data as a fashion term must not distract us from the idea that we shall be building and extending and developing knowledge-based systems from now until infirmity (or whatever is the correct term for the condition that sparks the next great wave of software services development in 2018!).

And my other notion? If you are in New Zealand you see global markets so much more clearly. Pingar went quickly into Japanese and Chinese, in order to service major clients there, and then into Spanish, French and Italian. Cross-linguistic effort is thus critical. Marc Andreessen is credited with the saying “Software is eating the world” (which always reminds me of an early hero, William Cobbett, saying in the 1820s of rural depopulation through enclosures and grazing around the great heathland that now houses London’s greatest and slowest airport: “Here sheep do eat men”). I am coming to believe that Andreessen is right, and that Pingar is very representative of the best of what we should expect in our future diet.
