I am back! Three months of infections and operations are over and I am again upright on a brand new knee. Apologies to those who expected a continuous word stream. Even greater apologies to those who were enjoying the silence.

Lying on my back and staring at the ceiling should have been a moment of zen-inspired rehabilitation. Instead it was punctuated by moments of intense annoyance when I read of people self-styled as “publishers” sallying out to defend the “book” and the “article” and the “journal”. These things need no defence. Nor does the codex or the Sumerian clay tablet. Scholarship does not and never has lived by format. Knowledge transmission will always find the appropriate channel, like water round a dam. So eventually I was better, and got up and went to an international publishing day organised by a leading software supplier in three cities simultaneously. But then a representative of an ancient university press got up and used the privilege of a hearing in this gathering to treat us all to a discourse on the advantages and disadvantages of publishing in… books or journals!

I had to take a firm grip. All of a sudden I had been robbed of 35 years of my life. But then, as I made my way home, I remembered another strand of my bedridden reading. The growing strength of the pre-registration movement in science research began to dawn on me when I read that PLoS was adopting pre-registration (https://plos.org/open-science/preregistration/). Then I recalled the eminently sensible investment by EBSCO in protocols.io (https://www.ebsco.com/products/protocols). So here we have a publisher like PLoS recognising that a post-Open reality may be an urge amongst funders and researchers to improve the credibility and reproducibility of scientific findings and results. And the way to ensure that aims and objectives do not distort during the process is to preregister the research objectives, together with a description of the methodologies that will be employed to explore the hypothesis.
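For concreteness, here is a minimal sketch, in Python and with invented field names (this is my illustration, not any registry's actual schema), of what a preregistration record pins down: the hypothesis and methods are declared and dated before any data exist, so they cannot quietly drift once results start to arrive.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Preregistration:
    """Hypothetical preregistration record: the hypothesis and the
    methods are fixed and timestamped before any findings exist."""
    title: str
    hypothesis: str               # the claim to be tested, stated up front
    methods: list[str]            # the protocols that will be used to test it
    registered_on: date = field(default_factory=date.today)
    doi: Optional[str] = None     # persistent identifier assigned at registration

reg = Preregistration(
    title="Effect of X on Y",
    hypothesis="X increases Y under conditions C",
    methods=["sample preparation per protocol P-123",
             "assay A, n=30, two-tailed t-test"],
)
```

Whatever the eventual article reports can then be checked against this dated record.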

If this catches on it will be important. Either the journal publisher or an independent site like protocols.io can then become the repository of comparative research methodologies. At the moment this material normally appears in the front half of most articles. It is variously treated in metadata terms by different publishers and often inadequately edited (I am told that “apply to authors for full details of techniques employed” is still quite common). And the real point for publishers is the time-lag. The preregistration occurs before the research phase, and before any findings or evidential data are available (a period of years). Thus the “article” effectively appears in two parts, at different times. And in different, but linked, places? It is very possible, of course, to link the preregistration site to the site where the findings are described and to the repository where the evidential data is held, but this does not sound, to me at least, very much like the journal as operated by journal publishers today.
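A sketch of what that linking could look like, with made-up identifiers (the DOI suffixes below are invented for illustration): each of the three records lives on a different site, appears at a different time, and points to the others by persistent identifier.

```python
# Invented records: a preregistration, the eventual article, and the
# evidential data, each hosted somewhere different but cross-linked.
records = {
    "prereg":  {"id": "10.17504/prereg.0001", "site": "protocols.io",    "links": ["article"]},
    "article": {"id": "10.1371/journal.0001", "site": "journal",         "links": ["prereg", "data"]},
    "data":    {"id": "10.5281/zenodo.0001",  "site": "data repository", "links": ["article"]},
}

def resolve(key: str) -> list[str]:
    """Follow the outbound links from one record to its companions."""
    return [records[k]["id"] for k in records[key]["links"]]

print(resolve("article"))  # the article points back to its preregistration and out to its data
```

Nothing here resembles a single journal issue; it is a small graph spread across years and hosts.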

We have known for a long time of the heavy, intensive search usage by researchers seeking methodological models and templates in order to pursue a specific enquiry. The gradual removal of that activity into specialist service sites could have effects on usage levels generally, and thus on library journal buying pressures. And another factor comes into play here as well. Methodological improvement is poorly communicated and seldom recognised. If a researcher, in reproducing an experiment, finds a quicker, easier, cheaper way to the same result, be it by a tiny fix or a larger short cut, this does not normally lead to a new article. Often communication is by hearsay, blog or email, and it is not necessarily attached to the searchable corpus of knowledge. Nor is the researcher who has made the improvement normally recognised. Preregistration lends itself to annotating the protocols and recognising the source of the suggested improvements. protocols.io has that vision, of an annotated library of experimental methodology with acknowledgement of suggested changes and their source.
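The sketch below is my own toy rendering of that general idea (invented structure, placeholder identifier, not protocols.io's data model): an amendment is recorded against the specific protocol step it improves, with its source credited rather than lost to hearsay or email.

```python
# A toy annotated protocol: improvements are attached to the step they
# amend and credited to whoever suggested them.
protocol = {
    "id": "P-123",
    "steps": ["prepare sample", "incubate 24h at 37C", "run assay A"],
    "annotations": [],
}

def suggest_improvement(protocol: dict, step: int, change: str, author: str) -> None:
    """Record a credited amendment against one protocol step."""
    protocol["annotations"].append(
        {"step": step, "change": change, "credited_to": author}
    )

suggest_improvement(
    protocol, step=1,
    change="incubation can be cut to 12h with identical yield",
    author="https://orcid.org/0000-0000-0000-0000",  # placeholder identifier
)
```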

During the 1990s and the early years of this century I took part as an evaluator in several rounds of Article of the Future discussions. Elsevier, to their great credit, were prominent in these. Structural improvements were made, metadata improved, huge flexibility introduced around the use of graphs and their manipulation by readers; video was used for the first time but constrained by package size, likewise audio and slide presentations. By the end I had begun to feel that the “article” was becoming as much of a size constraint digitally as it had been, for many years, in print. The format words of the print world seemed, back then, to have outgrown their usefulness. Innovation in end-user performance and expectation was making traditional format terms redundant. It was just that we were too lazy to find new ones.

Two contrasted views of the future struggle against each other whenever we sit down to talk data strategy. One could be called the Syndication School. It says “forget getting all the data in one environment – use smart tools and go out and search it where it is, using licensing models to get access where it is not public.” And if the data is inside a corporate firewall as distinct from a paywall? MarkLogic’s excellent portal for pharmaceutical companies is an example of an emerging solution.

But what happens if the data is in documented content files with insufficient metadata? Or if that metadata has been applied differently in different sources? Or if three or four different sorts of content-as-data need to be drawn from differently located source files, which must be identified and related to each other before being useful in an intelligent study process? Let’s call this the Aggregation School – assembly has to take place before process gets going. But let’s not confuse it with bulk aggregators like ProQuest.
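The Aggregation School's first chore is mundane but decisive: the same conceptual field arrives under different names from different sources, and has to be mapped to one schema before anything clever can happen. A toy illustration (source names and field mappings invented):

```python
# Each source labels the same fields differently; map them to one schema.
FIELD_MAP = {
    "source_a": {"Title": "title", "Auth": "authors", "Yr": "year"},
    "source_b": {"article_title": "title", "creator_list": "authors", "pub_date": "year"},
}

def normalise(record: dict, source: str) -> dict:
    """Rename source-specific fields to the shared schema; drop the rest."""
    mapping = FIELD_MAP[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

a = normalise({"Title": "On X", "Auth": ["Smith"], "Yr": 2019}, "source_a")
b = normalise({"article_title": "On X", "creator_list": ["Smith"], "pub_date": "2019"}, "source_b")
# Even renamed, the values still disagree ("2019" vs 2019): harmonisation,
# not just retrieval, is the real work.
```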

And now put AI out of your mind. The term is now almost as meaningless as a prime ministerial pronouncement in the UK. This morning saw the announcement of three more really exciting new fundings in the Catalyst Awards series from Digital Science. BTP Analytics, Intoolab and MLprior are all clever solutions using intelligent analysis to service real researcher and industry needs. But the need to label everything AI is perverse: those who grew through 25 years of expert systems and neural networks will know the difference between great analytics and breakthrough creative machine intelligence. 

But while we are waiting, real problem-tackling work is going on in the business of aggregating multi-sourced content. The example that I have seen this week is dramatic and needs wider understanding. But let’s start with the issue – the ability, or inability, especially in the life sciences, of one researcher to reproduce the experiments created, enacted and recorded in another lab simply by reading the journal article. The reasons are fairly obvious: data not linked to the article, or not published; the methodology section of the article a bare summary (video could not be published in the article?); the article carrying only an abbreviated references section; the article lacking metadata coverage sufficient to discover what it does contain; the metadata schema used radically different from other aligned articles of interest; the relevant reproducibility data not in the article but on a pre-print server, in conference proceedings, in institutional or private data repositories, in annotations or responses to blogs or commentaries, in code repositories, in thesis collections, or even in pre-existing libraries of protocols. And all or any of these may be Open, or paywalled.

In other words, the prior problem of reproducibility is not enacting the experiment by producing the same laboratory conditions – it is researching and assembling all the evidence around the publication of the experiment. This time-consuming detective work is a waste of research time and a constraint on good science, and calling for AI does not fix it.
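Stated as a data problem, the detective work looks something like the sketch below: given one article, gather every dispersed artefact that bears on reproducing it. The index here is invented, and that is precisely the point – in the real world no such index exists, and building it is the hard part.

```python
# A hypothetical pre-built index of links between articles and the
# artefacts needed to reproduce them; assembling it is the real labour.
ARTEFACT_INDEX = [
    {"article": "10.1371/journal.0001", "kind": "dataset",  "where": "institutional repository"},
    {"article": "10.1371/journal.0001", "kind": "protocol", "where": "protocols.io"},
    {"article": "10.1371/journal.0001", "kind": "code",     "where": "code repository"},
    {"article": "10.9999/other.0002",   "kind": "preprint", "where": "pre-print server"},
]

def evidence_for(article_doi: str) -> list[dict]:
    """Collect every recorded artefact linked to one article."""
    return [a for a in ARTEFACT_INDEX if a["article"] == article_doi]

for artefact in evidence_for("10.1371/journal.0001"):
    print(artefact["kind"], "->", artefact["where"])
```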

But profeza.com claim they are well down the arduous track towards doing so. And it seems to me both a fair claim and an object lesson in the real data handling problems when no magic wand technology can be applied. Profeza, an India-based outfit founded by two microbiologists, started with the grunt work and are now ready to apply the smart stuff. In other words, they have now made a sufficient aggregation of links between the disparate data sources listed above to begin to develop helpful algorithms and begin to roll out services and solutions. The first, CREDIT Suite, will be aimed at publishers who want to attract researchers as users and authors by demonstrating that they are improving reproducibility. Later services will involve key researcher communities, and market support services for pharma and reagent suppliers, as well as intelligence feeds for funders and institutions. It is important to remember that whenever we think of connecting dispersed data sets, the outcome is almost always multiple service development for the markets thus connected.

Twenty years ago publishers would have shrugged and said “if researchers really want this they can do it for themselves”. Today, in the gathering storm of Open, publishers need to demonstrate their value in the supply chain before the old world of journals turns into a pre-print server before our very eyes. And before long we may have reproducibility factors introduced into methodological peer review. While it will certainly have competitors, Profeza have made a big stride forward by recognising the real difficulties, doing the underlying work of identifying and making the data linkages, and then creating the service environment. They deserve the success which will follow.

