CCC – FAIR Foundation Forum

“The evolving role of DATA in the AI era”

18 September 2023, Leiden

“If we regulate AI and get it wrong, then the future of AI belongs to the Chinese.” When you hear a really challenging statement within five minutes of getting through the door, you know that, in terms of conferences and seminars, you are in the right place at the right time. The seminar leaders, supported by the remarkable range of expertise displayed by the speakers, gave a small group with wide data experience exactly the antidote needed to the last nine months of generative AI hype: a cold, clean, refreshing glass of reality. It was time to stop reading press releases and start thinking for ourselves.

FAIR’s leadership, committed to a world where DATA is findable, accessible, interoperable, and reusable, began the debate at the requisite point. While it is satisfying that 40% of scientists know about FAIR and what it stands for, why is it that when we communicate the findings of science, and the claims and assertions which result from experimentation, we produce old-style narratives for human consumption rather than, as a priority, creating data in formats and structures which machines can use, communicate and interact with? After all, we are long past the point where human beings could master the daily flows of new information in most research domains: only in a machine intelligence world can we hope to deploy what we know is known in order to create new levels of insight and value.
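
As an illustration of what machine-actionable description might look like in practice, here is a minimal sketch: a dataset record expressed as schema.org JSON-LD rather than as narrative prose. The names, identifiers and URLs are placeholders of my own, not anything presented at the forum.

```python
# A minimal, illustrative sketch of a machine-actionable dataset description,
# expressed as schema.org JSON-LD rather than narrative prose. Every name,
# identifier and URL below is a placeholder, not a real record.
import json

dataset_record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example assay results",                            # findable: a searchable title
    "identifier": "https://doi.org/10.xxxx/example",            # findable: a persistent identifier (placeholder DOI)
    "license": "https://creativecommons.org/licenses/by/4.0/",  # reusable: an explicit licence
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "https://example.org/data/assay.csv",     # accessible: a resolvable location
        "encodingFormat": "text/csv",                           # interoperable: a declared, standard format
    },
}

print(json.dumps(dataset_record, indent=2))
```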

So do we need to reinvent publishing? The mood in the room was much more in favour of enabling publishers and researchers to live and work in a world where the vital elements of the data that they handled were machine actionable. Discussion of the FAIR enabling resources and of FAIR Digital Objects gave substance to this. The emphasis was on accountability and consistency in a world where the data stays where it is, and we use it by visiting it. Consistency and standardisation therefore become important if we are not to find a silo with the door locked when we arrive. It was important, then, to think about DATA being FAIR “by design” and to treat FAIRification as a normal workflow process.

If we imagine that by enabling better machine-to-machine communication with more consistency we will improve AI accuracy and derive benefits in cost and time terms, then we are probably right. If we think that we are going to reduce mistakes and errors, or eliminate “hallucinations”, then we need to be careful. Some hallucinations at least might well be machine-to-machine communications that we, as humans, do not understand very well! By this time, we were in the midst of discussion on augmenting our knowledge transfer communication processes, not by a new style of publishing, but by what the FAIR team termed “nanopublishing”. Isolating claims and assertions, and enabling them to be uniquely identified and coded as triples, offered huge advantages. These did not end with the ability of knowledge graphs to collate and compare claims. This form of communication had built-in indicators of provenance which could be readily machine assessed. And there was the potential to add indicators which could be used by researchers to demonstrate their confidence in individual findings. The room was plainly fascinated by the way in which the early work of Tobias Kuhn and his colleagues has been developed by Erik Schultes, who outlined it effectively here, and by the GO FAIR team. Some of us even speculated that we were looking at the future of peer review!
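
To make the idea concrete, here is a minimal sketch of a nanopublication-style record, not GO FAIR’s actual tooling: one claim expressed as a single triple, with provenance and publication information kept in separate graphs. It uses the rdflib library, and every URI and property name is illustrative.

```python
# A minimal sketch (not GO FAIR's actual tooling) of the nanopublication idea:
# one claim as a single triple, with provenance and publication information
# kept in separate graphs. Uses rdflib; every URI here is illustrative.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

EX = Namespace("https://example.org/np/")          # hypothetical namespace for this sketch
PROV = Namespace("http://www.w3.org/ns/prov#")     # W3C provenance vocabulary

# 1. Assertion: the isolated, uniquely addressable claim, coded as a triple.
assertion = Graph()
assertion.add((EX.compoundA, EX.inhibits, EX.proteinB))

# 2. Provenance: where the claim came from, in a form machines can assess.
provenance = Graph()
provenance.add((EX.assertion, PROV.wasDerivedFrom,
                URIRef("https://example.org/articles/12345")))    # placeholder source article

# 3. Publication info: who minted the nanopublication and when; a confidence
#    indicator from the researcher could be attached here (illustrative property).
pubinfo = Graph()
pubinfo.add((EX.nanopub, PROV.generatedAtTime,
             Literal("2023-09-18T00:00:00Z", datatype=XSD.dateTime)))
pubinfo.add((EX.nanopub, EX.authorConfidence, Literal(0.8)))

for name, graph in (("assertion", assertion), ("provenance", provenance), ("pubinfo", pubinfo)):
    print(f"# {name}")
    print(graph.serialize(format="turtle"))
```

Because the claim, its provenance and its publication context are separately addressable, a knowledge graph can collate and compare claims from many sources, which is the advantage the discussion turned on.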

Despite the speculative temptations, the thinking in the room remained very practical. How did you ensure that machine interoperability was built in from the beginning of communication processing? FAIR were experimenting with Editorial Manager, seeking to implant nanopublishing within existing manuscript processing workflows. Others believed we needed to go down another layer. Persuade SAP to incorporate it (not huge optimism there)? Incorporate it into the electronic lab notebook? FAIR should not be an overt process, but a protocol as embedded, unconsidered and invisible as TCP/IP. The debate in the room about how best to embed change was intense, although agreement on the necessity of doing so was unanimous.

The last section of the day looked closely at the value, and the ROI, of FAIR. Martin Romacker (Roche) and Jane Lomax (SciBite) clearly had little difficulty, pointing to benefits in cost and time, as well as in a wide range of other factors. In a world where the meaning as well as the acceptance of scientific findings can change over time, certainty in terms of identity, provenance, versioning and relationships became a foundation requirement for everyone working with DATA in science. Calling machines intelligent and then not talking to them intelligently, in a language that they could understand, was not acceptable, and the resolve in the room at the end of the day was palpable. If the AI era is to deliver its benefits, then improving the human-to-machine interface in order to enable the machine-to-machine interface is the vital priority. And did we resolve the AI regulatory issues as well? Perhaps not: maybe we need another forum to do that!

The forum benefited hugely from the quality of its leadership, provided by Tracey Armstrong (CCC) and Barend Mons (GO FAIR). Apart from the speakers mentioned above, valuable contributions were made by Babis Marmanis (CCC), Lars Jensen (NNF Centre for Protein Research) and Lauren Tulloch (CCC).

There is a place in the recent Star series “Dopesick” where the investigators, seeking to validate the manufacturer’s claim that OxyContin is non-addictive, search for the article said to be the source of the claim in the NEJM. They cannot find it in the year cited. What is more, they cannot find it in the previous or subsequent years. This being the first decade of the century, they resort to keyword searching. Eventually they find the source which is said to validate the claim. It turns out to be a five-line letter in the correspondence columns, but the screen shows the thousands of references raised by the keyword searching and the long and harrowing task they set the researchers. As I watched it, I reflected on how far the world had moved on from those haphazard searching days, along with the realisation that we have not progressed quite as far as we sometimes think, as our world of information connectivity keeps having to track back before it goes forward.

The great age of keyword searching was succeeded in the following decade by the great age of AI and machine learning hype. We welcomed the world of unstructured data, in which we could find everything because of the power and majesty of keyword searching and because of our effectiveness in teaching machines to become researchers. And we will never know how much we spent, or how many blind alleys we went up, as we tried prematurely to apply systems that only worked when we built in enough information to enable us to be categorical and certain about the results that we achieved. And in science, and in health science in particular, categorical certainty is what we must achieve at a bare minimum.

We will never know the cost in time and money because it is not in anybody’s interest to tell us. But we can see, as we move forward again, that there is now a wide recognition that fully informative metadata has to be in place, and that identifiers have to point to a knowledge container in order to steer machine intelligence in the right direction. All the work which we did in earlier decades on taxonomies, ontologies and metadata was not wasted, but simply became the foundation of the intelligent and machine-interoperable data society which we seek now to build. We are developing a clear understanding that the route to the knowledge we seek lies in the quality of metadata, and in the ability to rely upon primary data which is fully revealed in that metadata. Many of us now accept that primary data (by which I mean research articles and evidential data sets) will exist in future for most researchers as background: most of their research will take place in the metadata, and they will dip into the primary data only on rare occasions. There is a clear parallel here to other sectors, like the law, where the concepts and commentary become more important than the words of enactment.

And as the age of reading research articles comes to an end, so the business of understanding what they mean has to take further strides forward. This means that existing players in the data marketplace have to reposition, and this week’s announcements from CCC exemplify that repositioning very clearly. CCC is an extremely valuable component of scholarly communications: its history in the protection of copyright and the development of licensing has been vital in helping data move to the right places at the right times. But CCC management know that this is no longer quite enough: their role as independent producers of the knowledge artefacts that will make the emerging data marketplaces succeed is now coming to the fore. In two announcements this week they demonstrate that transition. One is the acquisition of Ringgold, an independent player in the PID marketplace. Amongst all metadata types, PIDs and DOIs, alongside ORCID, have enabled the scope of the communications database marketplace to emerge.

PID means persistent identifier. In the case of Ringgold, it means making sense of organisational chaos, and thus disambiguating some of the most confusing aspects of the sector. Just as ORCID sought to disambiguate the complex world of authorship, so Ringgold seeks to help machines reading machine-readable data understand the nature and type of the organisations and activities being described. And then it extends this one stage further. It is one thing to know these fixed points of metadata knowledge, quite another to use and manipulate them effectively to create the patterns of connectivity upon which a data-driven society depends.
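
For readers who prefer a concrete picture, here is an illustrative sketch of the disambiguation problem a PID addresses; the identifier, organisation name and variants are invented for the example and have nothing to do with Ringgold’s actual data.

```python
# An illustrative sketch of the disambiguation problem a PID addresses: many
# name variants for one organisation collapse onto a single persistent
# identifier, which machines can then use as a fixed point when linking
# records. The identifier and variants below are invented for this example.
from typing import Optional

ORG_PID = "https://pid.example.org/org/001234"   # hypothetical persistent identifier

NAME_VARIANTS = {
    "example institute of technology": ORG_PID,
    "example inst. of technology": ORG_PID,
    "eit": ORG_PID,
}

def resolve_org(raw_name: str) -> Optional[str]:
    """Map a free-text affiliation string to its persistent identifier, if known."""
    return NAME_VARIANTS.get(raw_name.strip().lower())

print(resolve_org("EIT"))                               # -> https://pid.example.org/org/001234
print(resolve_org("Example Institute of Technology"))   # same PID, despite the name variant
```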

While announcing the acquisition of Ringgold, CCC also announced a huge step forward in the developing world of knowledge graphs. Putting labels on things is one thing: turning them into maps of interconnected knowledge points is another. CCC have been brewing this activity for some time, as their Covid research work demonstrates. It is impressive to see this being done, and being done by a not-for-profit with a long history of sector service and of neutrality amid the competitiveness of the sector’s major players. We are increasingly aware that some things have to be done for the benefit of the entire sector if the expectations of researchers, and of the sector as a whole, are to be met.

And of course, those competitive major players are also moving forward. The announcement in the last two weeks by Elsevier that they have made a final bid for the acquisition of Interfolio is a clear indicator. It suggests to me that article production, processing and distribution is no longer where the development marketplace lies. We are now critically aware that the business of producing, editing and presenting research articles may be fully automated within the next decade. With even more development work going into the automation of peer review checking, human intervention in the cycle of research publication may become supervisory, and the bulk of production work may shift to the computer on the researcher’s desk.

There will still be questions of brand, of course. There will still be questions of assessment and valuation of research contributions to be made. And it is this latter area which becomes a major point of interest, as Interfolio takes Elsevier into the administration of universities and of funding in a vitally interesting way. This is the place where new standards of evaluation will be developed. When the world does let go of the hand of the impact factor, which has guided everybody for so long, Elsevier want to be at the centre of the re-evaluation. In other words, if Elsevier as market leader formerly competed most with fellow journal publishers, its key competitor in future development may well be companies like Clarivate.

The critical path of evaluation will be the claims made by researchers for their work, and the way those claims are validated by reproducibility in subsequent work. In this connection we should also note the work of FAIR and its collaborators in developing ideas around “nanopublishing”, using metadata to outline clearly the claims made by researchers, both in the conclusions of articles and in other places as well. If this had been in place, the OxyContin investigators would have found it easy to locate that letter from the beginning of the century.

End
