Two contrasting views of the future struggle against each other whenever we sit down to talk data strategy. One could be called the Syndication School. It says “forget getting all the data in one environment – use smart tools and go out and search it where it is, using licensing models to get access where it is not public.” And if the data is inside a corporate firewall as distinct from a paywall? MarkLogic’s excellent portal for pharmaceutical companies is an example of an emerging solution.

But what happens if the data is in document content files with insufficient metadata? Or if that metadata has been applied differently in different sources? Or if three or four different sorts of content-as-data need to be drawn from differently located source files, which must be identified and related to each other before they are useful in an intelligent study process? Let’s call this the Aggregation School – assembly has to take place before processing gets going. But let’s not confuse it with bulk aggregators like ProQuest.

And now put AI out of your mind. The term is now almost as meaningless as a prime ministerial pronouncement in the UK. This morning saw the announcement of three more really exciting new fundings in the Catalyst Awards series from Digital Science. BTP Analytics, Intoolab and MLprior are all clever solutions using intelligent analysis to serve real researcher and industry needs. But the urge to label everything AI is perverse: those who lived through 25 years of expert systems and neural networks will know the difference between great analytics and breakthrough creative machine intelligence.

But while we are waiting, real problem-tackling work is going on in the business of aggregating multi-sourced content. The example that I have seen this week is dramatic and needs wider understanding. But let’s start with the issue – the ability, or inability, especially in the life sciences, of one researcher to reproduce the experiments created, enacted and recorded in another lab simply by reading the journal article. The reasons are fairly obvious: the data is not linked to the article, or not published at all; the methodology section of the article is a bare summary (video could not be published in the article?); the article has only an abbreviated references section; the article does not have metadata coverage sufficient to discover what it does contain; the metadata schema used is radically different from that of other aligned articles of interest; the relevant reproducibility data is not in the article but on a preprint server, in conference proceedings, in institutional or private data repositories, in annotations, in responses to blogs or commentaries, in code repositories, in thesis collections, or even in pre-existing libraries of protocols. And all or any of these may be Open, or paywalled.

In other words, the prior problem of reproducibility is not enacting the experiment by recreating the same laboratory conditions – it is researching and assembling all the evidence around the publication of the experiment. This time-consuming detective work is a waste of research time and a constraint on good science, and calling for AI does not fix it.
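
To make the scale of that assembly problem concrete, here is a minimal sketch – purely illustrative, with hypothetical names and not any vendor’s actual schema – of how the evidence scattered around a single article might be represented before any clever analysis can even begin.

```python
# Purely illustrative: a minimal way to model the evidence scattered around a
# single published experiment. All names here are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class EvidenceLink:
    source_type: str   # e.g. "dataset", "preprint", "protocol", "code", "thesis"
    location: str      # repository, preprint server, code host, etc.
    identifier: str    # DOI, accession number or URL
    open_access: bool  # Open or paywalled

@dataclass
class ArticleEvidence:
    article_doi: str
    links: List[EvidenceLink] = field(default_factory=list)

    def missing_types(self, required: List[str]) -> List[str]:
        """Which kinds of reproducibility evidence remain unlinked?"""
        present = {link.source_type for link in self.links}
        return [t for t in required if t not in present]

# An article with a linked dataset but, as yet, no protocol or code.
record = ArticleEvidence(
    article_doi="10.1234/example",
    links=[EvidenceLink("dataset", "institutional repository", "doi:10.5555/data", True)],
)
print(record.missing_types(["dataset", "protocol", "code"]))  # -> ['protocol', 'code']
```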

But profeza.com claim they are well down the arduous track towards doing so. And it seems to me both a fair claim and an object lesson in the real data-handling problems that arise when no magic-wand technology can be applied. Profeza, an India-based outfit founded by two microbiologists, started with the grunt work and are now ready to apply the smart stuff. In other words, they have now made a sufficient aggregation of links between the disparate data sources listed above to begin developing helpful algorithms and rolling out services and solutions. The first, CREDIT Suite, will be aimed at publishers who want to attract researchers as users and authors by demonstrating that they are improving reproducibility. Later services will involve key researcher communities, market support services for pharma and reagent suppliers, and intelligence feeds for funders and institutions. It is important to remember that whenever we connect dispersed data sets, the outcome is almost always the development of multiple services for the markets thus connected.
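
For readers who like to see the mechanics, a toy sketch of that aggregation step follows – emphatically not Profeza’s pipeline, just an assumed illustration of what “link the disparate sources, then compute something useful on top” can look like at its very simplest.

```python
# A toy aggregation step, in the spirit of the linking work described above
# but not any real pipeline: merge evidence links harvested from several
# hypothetical feeds into one record per article, then report how much of
# the required evidence each article actually has.
from collections import defaultdict

def aggregate(harvested):
    """harvested: iterable of (article_doi, source_type, identifier) triples
    gathered from different crawlers or feeds; returns links grouped by DOI."""
    by_article = defaultdict(set)
    for doi, source_type, identifier in harvested:
        by_article[doi].add((source_type, identifier))
    return by_article

def coverage(links, required=("dataset", "protocol", "code")):
    """Fraction of the required evidence types present for one article."""
    present = {source_type for source_type, _ in links}
    return sum(t in present for t in required) / len(required)

feeds = [
    ("10.1234/example", "dataset", "doi:10.5555/data"),
    ("10.1234/example", "code", "github.com/lab/analysis"),
    ("10.9876/other", "protocol", "protocols.io/xyz"),
]
for doi, links in aggregate(feeds).items():
    print(doi, round(coverage(links), 2))   # 0.67 and 0.33 respectively
```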

Twenty years ago publishers would have shrugged and said “if researchers really want this they can do it for themselves”. Today, in the gathering storm of Open, publishers need to demonstrate their value in the supply chain before the old world of journals turns into a preprint server before our very eyes. And before long we may have reproducibility factors introduced into methodological peer review. While it will certainly have competitors, Profeza have made a big stride forward by recognising the real difficulties, doing the underlying work of identifying and making the data linkages, and then creating the service environment. They deserve the success which will follow.

Sitting through the summer months beside a misty inlet on the Nova Scotian coast, it is all too easy to lose oneself in the high politics of OA and OER, of the negotiations between a country as large as California and a country as large as Elsevier. Or whether a power like Pearson can withstand a force as large as McGraw with added Cengage. I am in the midst of Churchill’s Marlborough: His Life and Times. There momentous events revolve around a backstairs word at Court. There great armies wheel in the Low Countries as Louis XIV and William of Orange contend for supremacy. Wonderful stuff, but the stuff of history? Nothing about peasants as soldiers, or about harvests and food supplies? Likewise, if we tell the story of the massive changes taking place in the way content is created and intermediated for re-use by scholars and teachers without starting with the foot-soldiers – by which I mean not just researchers and teachers but students and pupils as well – then I think we are in danger of mistaking the momentum as well as the impact of what is happening now.

When our historians look back, hopefully a little more analytically than Churchill, I think they will be amazed by the slowness of it all. We are now 30 years beyond the ARPANET becoming the Internet, and more than 20 years into life in a Web-based world. Phone books are an historical curiosity and newspapers in print are about to follow. Business services have been transformed, and the way most of us work and communicate and entertain ourselves is firmly digital. Yet nothing has been as conservative and loath to change as the academic and educational establishments of the developed world, and they have maintained their success in imposing these constraints on the rest of us. From examination systems to pre-publication peer review, traditional quality markers have remained in place for the assurance, it is held, of governments, taxpayers and all participants in the process. And while the majority of inert content became digital very early in the 30-year cycle of digitisation, workflow and process did not. Thus content providers were held in a hiatus as change took place at the margins: you needed to supply learning systems as well as textbooks (who would have guessed that it would be 2019 before Pearson declared itself Digital First?). And by the same token, who could have imagined that it would be 2019 before eLife’s Reproducible Document Stack made it technically feasible for an “article” to contain video, moving graphics, manipulable graphs and evidential datasets?

It is not hard to identify the forces of conservatism that created this content Cold War, when everyone had to keep things as they had always been, and as a result of which publishing consolidated – and is still consolidating – into two or three big players in each sector. It is harder to detect the forces of change that are turning these markets into an arms race. These factors are mostly not to do with the digital revolution, much as commentators like me would like the opposite to be true. Mostly they are to do with the foot soldiers of Marlborough’s armies, those conscripted peasants, those end users. When we look back we shall see that what made change happen was the revolt of middle-class American parents and their student children against textbook prices, the wish of the Chinese government to get its research recognised globally without a paywall, the wish of science researchers to demonstrate outcomes more quickly in order to secure reliable forward funding, and the wish of all foot soldiers to secure greater interoperability of content in the device-dominated, data-centric world into which they have now emerged.

And how do we know that? You need an instrument of great sensitivity to measure change – or maybe change is best glimpsed as a reflection in the glass plate of some corporate office. Whatever else is said of them, I hold Elsevier to be a hugely knowledgeable reflection of the markets they serve. So I regard their purchase of Parity Computing as a highly significant move. When publishers and information providers buy their suppliers, not their competitors, it says to me that whatever tech development they are doing in their considerable in-house services, it is neither enough nor fast enough. It says that still more must be done to ensure that their content-as-data is ready for intelligent manipulation. It also says that the developments being created by that supplier are too important, and their investment value too great, to risk sharing them with a competitor using the same supplier.

Markets change when users change. But when the demand for change occurs, we usually have the technology to meet that new demand – think of the 20-year migration from expert systems and neural networks to machine learning and AI. The push is rarely the other way round.
