Two contrasting views of the future struggle against each other whenever we sit down to talk data strategy. One could be called the Syndication School. It says “forget getting all the data in one environment – use smart tools and go out and search it where it is, using licensing models to get access where it is not public.” And if the data is inside a corporate firewall as distinct from a paywall? MarkLogic’s excellent portal for pharmaceutical companies is an example of an emerging solution.

But what happens if the data is in documented content files with insufficient metadata? Or if that metadata has been applied differently in different sources? Or if three or four different sorts of content as data need to be drawn from differently located source files, which need to be identified and related to each other before being useful in an intelligent study process? Let’s call the answer to these questions the Aggregation School – assembly has to take place before process gets going. But let’s not confuse it with bulk aggregators like ProQuest.

And now put AI out of your mind. The term is now almost as meaningless as a prime ministerial pronouncement in the UK. This morning saw the announcement of three more really exciting new fundings in the Catalyst Awards series from Digital Science. BTP Analytics, Intoolab and MLprior are all clever solutions using intelligent analysis to service real researcher and industry needs. But the need to label everything AI is perverse: those who grew through 25 years of expert systems and neural networks will know the difference between great analytics and breakthrough creative machine intelligence. 

But while we are waiting, real problem-tackling work is going on in the business of aggregating multi-sourced content. The example that I have seen this week is dramatic and needs wider understanding. But let’s start with the issue – the ability, or inability, especially in the life sciences, of one researcher to reproduce the experiments created, enacted and recorded in another lab simply by reading the journal article. The reasons are fairly obvious: the data is not linked to the article, or not published; the methodology section of the article is a bare summary (video could not be published in the article?); the article has only an abbreviated references section; the article does not have metadata coverage sufficient to discover what it does have; the metadata schema used is radically different from that of other aligned articles of interest; the relevant reproducibility data is not in the article but in a pre-print server, in conference proceedings, in institutional or private data repositories, in annotations, in responses to blogs or commentaries, in code repositories, in thesis collections, or even in pre-existing libraries of protocols. And all or any of these may be Open, or paywalled.

In other words, the prior problem of reproducibility is not enacting the experiment by producing the same laboratory conditions – it is in researching and assembling all the evidence around the publication of the experiment. This time-consuming detective work is a waste of research time and a constraint on good science, and calling for AI does not fix it. 

But Profeza claim they are well down the arduous track towards doing so. And it seems to me both a fair claim and an object lesson in the real data handling problems when no magic wand technology can be applied. Profeza, an India-based outfit founded by two microbiologists, started with the grunt work and are now ready to apply the smart stuff. In other words, they have now made a sufficient aggregation of links between the disparate data sources listed above to begin to develop helpful algorithms and begin to roll out services and solutions. The first, CREDIT Suite, will be aimed at publishers who want to attract researchers as users and authors by demonstrating that they are improving reproducibility. Later services will involve key researcher communities, and market support services for pharma and reagent suppliers as well as intelligence feeds for funders and institutions. It is important to remember that whenever we think of connecting dispersed data sets the outcome is almost always multiple service development for the markets thus connected.
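For readers who want the linking step made concrete, here is a minimal sketch in Python. It is not Profeza's method, which I have not seen; the sources, field names and DOIs are all invented for illustration. It assumes each source records the article's DOI somewhere, under a different field name and in a different format, which is the "metadata applied differently" problem in miniature.

```python
from collections import defaultdict

# Hypothetical records: every source names and formats the article DOI differently.
RECORDS = [
    {"source": "journal", "doi": "10.1000/XYZ123"},
    {"source": "preprint_server", "published_doi": "https://doi.org/10.1000/xyz123"},
    {"source": "data_repository", "related_article": "doi:10.1000/xyz123"},
    {"source": "protocol_library", "cites": "10.1000/abc999"},
]

# Assumed mapping: which field in each source points at the article DOI.
DOI_FIELD = {
    "journal": "doi",
    "preprint_server": "published_doi",
    "data_repository": "related_article",
    "protocol_library": "cites",
}


def normalise_doi(raw):
    """Lower-case a DOI and strip common resolver or scheme prefixes."""
    doi = raw.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def link_by_article(records):
    """Group the sources that refer to the same article, keyed by normalised DOI."""
    linked = defaultdict(list)
    for rec in records:
        field = DOI_FIELD.get(rec["source"])
        if field is None or field not in rec:
            continue  # no usable link: this is where the manual grunt work begins
        linked[normalise_doi(rec[field])].append(rec["source"])
    return dict(linked)


if __name__ == "__main__":
    for doi, sources in link_by_article(RECORDS).items():
        print(doi, sources)
```

The point of the sketch is how little of the real job it covers: it only works where an identifier exists at all, and the records skipped by the `continue` line are exactly the ones that need human detective work.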

Twenty years ago publishers would have shrugged and said “if researchers really want this they can do it for themselves.” Today, in the gathering storm of Open, publishers need to demonstrate their value in the supply chain before the old world of journals turns into a pre-print server before our very eyes. And before long we may have reproducibility factors introduced into methodological peer review. While they will certainly have competitors, Profeza have made a big stride forward by recognising the real difficulties, doing the underlying work of identifying and making the data linkages, and then creating the service environment. They deserve the success which will follow.