Two contrasting views of the future struggle against each other whenever we sit down to talk data strategy. One could be called the Syndication School. It says “forget getting all the data in one environment – use smart tools and go out and search it where it is, using licensing models to get access where it is not public.” And if the data is inside a corporate firewall as distinct from a paywall? MarkLogic’s excellent portal for pharmaceutical companies is an example of an emerging solution.

But what happens if the data is in documented content files with insufficient metadata? Or if that metadata has been applied differently in different sources? Or if three or four different sorts of content as data need to be drawn from differently located source files, which need to be identified and related to each other before being useful in an intelligent study process? Let’s call this the Aggregation School – assembly has to take place before process gets going. But let’s not confuse it with bulk aggregators like ProQuest.

And now put AI out of your mind. The term is now almost as meaningless as a prime ministerial pronouncement in the UK. This morning saw the announcement of three more really exciting new fundings in the Catalyst Awards series from Digital Science. BTP Analytics, Intoolab and MLprior are all clever solutions using intelligent analysis to service real researcher and industry needs. But the need to label everything AI is perverse: those who grew through 25 years of expert systems and neural networks will know the difference between great analytics and breakthrough creative machine intelligence. 

But while we are waiting, real problem-tackling work is going on in the business of aggregating multi-sourced content. The example that I have seen this week is dramatic and needs wider understanding. But let’s start with the issue – the ability, or inability, especially in the life sciences, of one researcher to reproduce the experiments created and enacted and recorded in another lab simply by reading the journal article. The reasons are fairly obvious: data not linked to the article or not published; a methodology section that was a bare summary (video could not be published in the article?); an abbreviated references section; metadata coverage insufficient to discover what the article does have; a metadata schema radically different from that used in other aligned articles of interest; relevant reproducibility data held not in the article but in a pre-print server, in conference proceedings, in institutional or private data repositories, in annotations, in responses to blogs or commentaries, in code repositories, in thesis collections, or even in pre-existing libraries of protocols. And all or any of these may be Open, or paywalled.

In other words, the prior problem of reproducibility is not enacting the experiment by producing the same laboratory conditions – it is in researching and assembling all the evidence around the publication of the experiment. This time-consuming detective work is a waste of research time and a constraint on good science, and calling for AI does not fix it. 

But profeza.com claim they are well down the arduous track towards doing so. And it seems to me both a fair claim and an object lesson in the real data handling problems when no magic wand technology can be applied. Profeza, an India-based outfit founded by two microbiologists, started with the grunt work and are now ready to apply the smart stuff. In other words they have now made a sufficient aggregation of links between the disparate data sources listed above to begin to develop helpful algorithms and begin to roll out services and solutions. The first, CREDIT Suite, will be aimed at publishers who want to attract researchers as users and authors by demonstrating that they are improving reproducibility. Later services will involve key researcher communities, and market support services for pharma and reagent suppliers as well as intelligence feeds for funders and institutions. It is important to remember that whenever we think of connecting dispersed data sets the outcome is almost always multiple service development for the markets thus connected.

Twenty years ago publishers would have shrugged and said “if researchers really want this they can do it for themselves.” Today, in the gathering storm of Open, publishers need to demonstrate their value in the supply chain before the old world of journals turns into a pre-print server before our very eyes. And before long we may have reproducibility factors introduced into methodological peer review. While they will certainly have competitors, Profeza have made a big stride forward by recognising the real difficulties, doing the underlying work of identifying and making the data linkages, and then creating the service environment. They deserve the success which will follow.

There is a moment in the life of every start-up when the entrepreneur realises that what he is selling is not what people are buying. Over 35 years ago this thought stopped me in my tracks outside a solicitor’s office in a small Hampshire town in the New Forest. I was running the innovative legal retrieval start-up called Eurolex for the Thomson Corporation, and my visit to this small legal practice was part of a programme to capture as much baffling insight as I could from early customers of the service. I was selling them enhanced computerised legal information retrieval, more effective than human enquiry, covering a wide range of sources, far bigger than their own library resources and demonstrating modern technology in the law practice. How could I fail? And how was it that people were not buying my better mousetrap in the quantities that my five-year plan required?

So I told the senior partner of this law practice that his partnership really needed my services. I sold hard on innovation and painted a picture of them being able to outstrip big city rivals with smart research covering far more sources than you would normally expect in a small country practice. He was unimpressed. He said that his current manual research activity was “good enough”. He said he was attracted by my trial offer for totally different reasons. He had no more room for books and people in his offices and would have to expensively relocate to expand the practice. He said my trial terms made it cheaper to use me than add more people. He said my billing system was compatible with his and he could simply download time and cost into his invoicing from my service. He was, he made it clear, almost totally uninterested in improved legal information retrieval, but he would definitely buy what I was selling. I sat in the car park for almost an hour afterwards trying to rethink what we were doing as a business and retrofitting what we had invented into the way lawyers actually worked.

I have had the great privilege to be able to meet and talk with many companies who innovate. And I have noticed with interest the moment when they start to realise what their customers are buying as distinct from what they themselves are selling. Last month I had the pleasure of talking to my friends at Morressier (www.morressier.com), the Berlin-based service that gathers up all the information around posters and conferences as indicators of research group activity in scholarly communications. I realise that when I first met them I saw them as a collector of data that had been neglected and not curated. Now I see them as a way of judging the progress of a research project, through its interim activities and ability to describe its early results and objectives. Given the time pressures of scholarly research in a number of scientific disciplines, getting early indicators of potential and being able to gauge how close to completion key projects are becomes a high value component of predictive analysis of research outcomes. In my lifetime we have moved from taking over two years to publish a research report at the conclusion of a project to anticipating its likely outcomes at various stages of research development. Now that Morressier have the data they can begin to apply the analysis. Combine that analysis with all the data about actual findings and you have a treasure trove of analytic feedback for funders, governments, research institutions, universities and research programmes. This company now becomes a potential powerhouse of trend research, alerting services, competitive analysis and consultancy, especially in fields that impact pharma, food science, agriculture, climatology, and any sector where governments and markets seek the earliest indicators of where to place the next bet.

And I have very similar feelings about another important growth point, Katalysis.io. This Amsterdam-based start-up began, to my mind at least, as a technology-centric, as distinct from a data-centric, project built fundamentally on blockchain technology. Today, as it embarks on its next funding round, the impact of the work which it is doing with major players like Springer Nature is beginning to show. Real contact with intermediaries and end users has shown them how the market in information is being framed by concern about impact and dissemination – who downloaded the document, who read it, who passed it to whom? – and what its provenance was – can you trace it to a legitimate source, is it fake news, etc.? In the face of this, the company has become a Track and Trace player, using technologies like data ledger and document tracking to meet these needs. This marks a real shift for those who can recall times when the word “metadata” was always followed by a discussion on discovery. Now it is more often followed by a discussion on impact and dissemination. Katalysis.io have found a rich new seam of need to exploit, which should make their funding round straightforward.

These two young companies are united in their discovery of addressable need. And by something else. They both respond to markets where the need for speed and certainty in information undercuts still prevalent thinking derived from the world of printed journals about acceptable timing and measurement of effect. While scholarly communications was quick to “go digital” it has been slow to “think digital”. These two companies, whose work could flourish as well in any other vertical sector of information markets, are indicators of more profound change in scholarly communications as well.
