DavidWorlock.com

Ivan Klimes: a founder of modern science journal publishing

dworlock — Mon, 22 Apr 2024 23:09:05 +0000

It has become fashionable recently to praise the work of Robert Maxwell as a great innovator in science publishing. Articles and memoirs have pointed out how he capitalised upon a wider public interest in science, as well as greater spending on scientific research, to create and commercialise the modern world of science journal publishing. It was typical of Maxwell, though, not just to have discovered an opening, but to have discovered at the same time a man who could capitalise upon it. The genius behind Robert Maxwell’s genius was Ivan Klimes, who died on April 20, 2024 in Oxford. As Publishing Director at Maxwell’s Pergamon Press, he was the mastermind behind the so-called “salami slicing” of journals to create new specialised outlets for the rapidly growing subdivision of disciplines in the sciences, especially in the life sciences. His broad range of scientific interests ensured the coverage of new topics as well as the commissioning of review articles and books, which brought researchers up to date and put them in a position to review all aspects of controversial topics. The great monument to Ivan’s work at this time lies in the continuing market dominance of Elsevier, the final home of the Pergamon Press titles that he had created. The debt that they owed him was never acknowledged by the seller of these journals, or indeed by the buyers.

Ivan’s fascination with science, and science communication, came early and remained throughout his life. He started his career as a science journalist in his native Prague, becoming, in those days behind the Iron Curtain, a member of the Czechoslovakian Academy of Sciences, and the leader of the union of science journalists. In 1968, he was speaking at a conference in Copenhagen, which he attended with his wife, when the Prague Spring turned into a Russian invasion. He described driving out of Denmark and across the German border. Turn left for home – or turn right for the English Channel. All of us who knew him are grateful for the choice that the couple made. In England, he found employment with the IPC Magazine group, working on New Scientist. It was here that Robert Maxwell found him. Two Czech speakers bent on changing the course of global science publishing in English.

Life was always turbulent at Maxwell’s Headington Hill Hall headquarters in Oxford. Ivan left prior to the sale of the company, dismissed by Robert and Kevin Maxwell, then respectively chairman and chief executive of the science publisher. Whether or not this was because of the industrial indiscipline of speaking Czech to the father in front of the son (who did not understand the language) is uncertain. More certain is the fact that they had dropped the pilot who guided them to this point. Soon after, they sold the company and embarked upon the frenzy of acquisition and disposal that became the story of Maxwell Communications Corporation.

For a moment, Ivan was lost, and felt deeply isolated. To their great credit, the International Thomson Organisation (now Thomson Reuters) came forward and a plan was developed to create a new science journal publisher with an entirely new look. Rapid Communications of Oxford would not only move more quickly to fill the niches that Ivan still saw as worth developing, but it would publish science research far more quickly and get information back into the hands of researchers with revolutionary speed. Far from taking the normal 12 to 18 months, the new company promised publication in 6 to 8 weeks. This breakthrough was enabled by the use of a new technology, the fax machine. Typescripts would be faxed to peer reviewers with the request that they be returned the same way. As an advisor to Ivan, both in his previous company and now in this one, I was privileged to watch at first hand the enthusiasm that he applied, and the enthusiasm that he engendered in others as they watched him apply it. His openness and his anxiety to serve the science that he loved were infectious qualities. He founded his new company in one of Oxford’s oldest habitable office buildings, a mediaeval bakehouse at the very edge of the river. Standing upright in his first floor office inevitably meant cracking one’s head against its ancient beams. Yet there he continued with a succession of new launches, including a groundbreaking neuroscience journal, until the folding of his company into the Thomson corporate group, and then its resale to Wolters Kluwer, presaged his own retirement.

In addition to his great skills as an innovative publisher, Ivan was a man of huge personal generosity, with a real capacity for mentoring and friendship. I can testify to both, having served with him when he was a Director of the British Publishers Association cooperative venture, Publishers Databases Ltd, and having enjoyed his company, and his vision and sagacity, while he was a non-executive director of my own company, Electronic Publishing Services Limited. And as a dining companion he left nothing to be desired, especially in the days when he was a routine diner at the Hungarian restaurant, Gay Hussar, where his knowledge of central European cuisines was an education in itself. We should remember a pivotal figure in the creation of the modern science journal, in a publishing world that is now transitioning into scholarly communications, and, for those who knew him, the loss of the most companionable, knowledgeable and empathetic of friends.

Gates opens door to journal exits?

dworlock — Mon, 15 Apr 2024 16:55:06 +0000

When I first talked about open access and the decline of the scientific journal, 20 years ago, it was fortunate that I had Derk Haank available to tell the world not to listen to demented consultants with no skin in the game. When I spoke, some 15 years ago, about the inevitable decline of the subscription science journal, it was pleasing to hear Kent Anderson reassuring us all that I was simply a mad dog out on licence. Now, as I read the strategy revision for their open access policy published by the Gates Foundation on April 7, I am very happy to indulge the Panglossian philosophers of the scholarly communications marketplace once again. While I wait for them to tell us that nothing has really changed and everything will go on just as before in the best of all journal publishing worlds, I am heading down to the marketplace to link arms with Cassandra. We shall chant “O woe! O woe! The day of the open access journal is nearly over, and its end can be told with confidence!”

Of course, this might take another 15 years. I’ve reached an age myself when time is not a very worrying factor. In the 57 years that have passed since I started work in the educational and academic publishing sector, I have been acutely aware that commercial publishers, while being politely prepared to entertain speculation about the future, have necessarily to attend to this year’s financial results and the expectations of investors. When my speculations were deemed too far-fetched, my clients in the boardroom tended to say “our strategies are clear – follow the money!” Today, my response to them would be quick and immediate: “Watch what the funders are doing with the money, and then follow the data!”

Many will argue that Gates is a small funder in terms of article contributions. Its work creates around 4000 articles a year, and through its payment of APCs it contributes a mere $6 million per annum to the coffers of scholarly publishers. But it is an influential player, and in its revised open access strategy it may have detected something which is present in the minds of the larger funders, and eventually of governments themselves. What is the duty of the funder in terms of ensuring that articles detailing research results are available to the community at large? In the time of Henry Oldenburg in the 1660s, the answer would have been to get them into the Philosophical Transactions of the Royal Society. Today, it is to get them onto an authorised pre-print server with a CC-BY license as soon as possible after the research is completed and the article is ready, and to accompany them with linked datasets of the evidential material, on a similar license and on a similarly approved site. Speed is of the essence; access for all is key and critical. Subsequent reuse of the material in a journal, subsequent acts of peer review and downstream reuse are not the key concerns of the funding foundation. By this fresh twist in its open access policy, the Gates Foundation has saved $6 million, which can now go back into the research fund. And by using F1000, who already supply the internal Gates publishing systems, to create F1000 VeriXiv, the pre-print server of choice, they have provided tools which researchers can use (or not) to fulfil the mandate.

If other funders follow this route, then the scholarly communications research community in science faces a choice. For many, more pressurised by getting the next research program underway than anything else, it will be simple to leave things there, and not necessarily press forward to eventual journal publication. For others, given the needs of institutions for publication, to secure tenure or satisfy other funders’ requirements, publication will remain essential until the way in which science results are assessed begins to change. One of the things that I recall from conversations with Eugene Garfield in the 1980s was his repeated assertion that better ways than citation indexing would be found to assess the worth of science research articles. Like Winston Churchill on democracy, he maintained vigorously that what he had created was the “best worst way” of doing the job. The challenge now, I would suggest, is whether some latter day Garfield can repeat his 1956 breakthrough, and create a way of indexing and illuminating what is good science for a modern world. That measurement and indexation has to be available as soon as possible after the first appearance of the claim, wherever it appears in digital form. In the meanwhile, getting the knowledge immediately into the marketplace, and getting the data available to aid reproducibility, supports other research in progress and supports integrity. And that is critical for funders and researchers alike.

Such new systems will emerge in their own time. In the meanwhile, the ways we measure achievement, which have been gamed and manipulated endlessly and need in any case to be renewed or replaced, experience increasing pressure. This applies as much to peer review as anything else. If publishers are to stay in the loop, then they need to change their relationships as well. As the relationship between Gates and F1000 shows, whatever takes place in terms of “publication”, and where it takes place in the ecosystem, may become more important to the institution or the funder than to the researcher or the research lab. In terms of attracting sponsorships, investment, and industrial research cooperation, universities may have more interest in publication than most, especially if the research community sorts out a better way of ranking science than by citation indexing. (Footnote: what a clever man that Vitek Tracz was! The Tesla of science publishing! Long after his retirement, we shall be using the tools he created for white label sponsored publishing!)

So there it is! Cassandra and I have now done a full lap of the forum, and I can feel that the rotten vegetables are getting ready to fly through the air! Next time, if I survive, I plan to “follow the data” myself, and look at the role of publishers as data aggregators, data curators, and data traders. And we shall remember the old saying: “How do you know if the searcher is a person or a machine? Well, only machines read the full article!”

AI and bias: Cui bono?

dworlock — Fri, 29 Mar 2024 18:23:25 +0000

Who benefits is never a bad question to ask. In my mind, after long years in the information industry, it is a question closely related to “follow the money”. And it is very much in my mind at the moment, since I have been reading the UK Information Commissioner’s consultation (https://ico.org.uk/about-the-ico/what-we-do/our-work-on-artificial-intelligence/generative-ai-second-call-for-evidence/) on the use of personal data in AI training sets and research data. The narrative surrounding the consultation invokes, for me, all sorts of ideas about the nature of trust.

Let me try to explain my ideas about trust, since I think the subject is becoming so controversial that each of us needs to state their position before we begin a discussion. For example, I trust in the brand of marmalade to which I am fairly addicted. My father was an advocate of Frank Cooper’s Oxford marmalade, and this is probably the only respect in which I have followed him. We certainly have over 100 years of male Worlock usage of this brand of marmalade. Furthermore, in modern times, the ingredients are listed upon the jar, together with any chemical additives. Should I suffer a serious medical condition as a result of my marmalade addiction, I can clearly follow the trail and find where it was made and the provenance of its ingredients. And in the 60 or so years that I have been enjoying it, it has not varied significantly in flavour, taste or ingredients.

I also believe, being a suspicious countryman, in something that I call “the law of opposites”. Therefore, when people say that they “do no evil” or claim that they practise “effective altruism”, I wonder why they need to tell me this. My bias then becomes the reverse of their intentions: I tend to think that they are telling me that they are good because they are trying to disguise the fact that they are not. This becomes important as we move from what I would term an open trust society – exemplified by the marmalade – into a blind trust society – exemplified by the “black box” technology which, we are told, is what it is, and cannot be tracked, audited or regulated in any of the normal ways.

The UK Information Commissioner has similar problems to mine, but naturally at a greater level of intellectual intensity. In their latest consultation document, his people ask whether personal data can be used in a context without purpose. Under data privacy rules, the use of personal data, where permitted, has to be accompanied by a defined purpose. Whether the data is used to detect shifts in consumer attitudes or to demonstrate the efficacy of a drug therapy, the data use is defined by its purpose. General models of generative AI, with no stated or specific purposes, violate current data protection regulation if they use personal data in any form, and this should set us wondering about the outcomes, and the way in which they should earn our trust.

The psychologist Daniel Kahneman, who died this week, earned his Nobel prize in economics for his work on decision-making behaviours. His demonstration that decisions are seldom made on a purely rational basis, but are usually derived from preferences based on bias and experience (whether relevant or not), should be ever present in our minds when we think about the outputs of generative AI. Our route to trusting those outcomes should begin with questions like: what is the provenance of the data used in the training sets? Do I trust that data and its sources? Can I, if necessary, audit the bias inherent in that data? How can I understand or apply the output from the process if I do not understand the extent and representativeness of the inputs?

I sense that there will be great resistance to answering questions like this. In time there will be regulation. I think it is a good idea now for data suppliers and providers to annotate their data with metadata which demonstrates provenance, and provides a clear record of how it has been edited and utilised, as well as what detectable bias was inherent in its collection. One day, I anticipate, we shall have AI environments that are capable of detecting bias in generative AI environments, but until then we have to build trust in any way that we can. And where we cannot build trust we cannot have trust, and the lack of it will be the key factor in slowing the adoption of technologies that may one day even surpass the claims of the current flood of press releases about them. Meanwhile, cui bono? Mostly, it seems to me, Google, Microsoft, OpenAI, Meta. Are they ethically motivated or are they in it for the money? For myself, I need them to clearly demonstrate in self-regulation that they are as trustworthy as Frank Cooper’s Oxford marmalade.
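For readers who like to see such things written down, the kind of annotation suggested above might look something like the following. This is only a minimal sketch in Python: the field names, the identifier and the example supplier are all invented for illustration and do not follow any particular metadata standard.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
import json


@dataclass
class EditEvent:
    """One entry in the record of how a dataset was edited or used."""
    when: date
    actor: str
    action: str


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance metadata travelling with a dataset."""
    dataset_id: str   # illustrative identifier, not a real PID
    source: str       # who collected or supplied the data
    collected: date
    licence: str
    known_biases: list = field(default_factory=list)
    history: list = field(default_factory=list)

    def log(self, when: date, actor: str, action: str) -> None:
        """Append an edit or usage event to the audit trail."""
        self.history.append(EditEvent(when, actor, action))

    def to_json(self) -> str:
        # default=str renders date objects as ISO-style strings
        return json.dumps(asdict(self), default=str, indent=2)


# All values below are made up for the example.
record = ProvenanceRecord(
    dataset_id="example:survey-2024",
    source="Example Data Supplier Ltd",
    collected=date(2024, 3, 1),
    licence="CC-BY-4.0",
    known_biases=["respondents self-selected online"],
)
record.log(date(2024, 3, 15), "editor", "removed duplicate rows")
print(record.to_json())
```

The point of the sketch is simply that source, edit history and known bias travel with the data in a form that a machine, or an auditor, can read back later.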

50 feet above the surface of the moon

dworlock — Mon, 25 Mar 2024 18:06:08 +0000

In the history of software, as I have suffered it in the past 45 years, the most time-wasting difficulty has been the false dawn syndrome. My first CTO, Norman Nunn Price, was a grizzled Welshman with an unquenchable enthusiasm for the ability of software to solve all problems. As a young man, he had worked on radar in submarines in the Second World War. When, as his more youthful CEO, I sometimes questioned his predictions, the reply often included “look, we won the bloody war using this stuff, didn’t we?” But as the years passed by in our development of a start-up in legal information retrieval, we began to notice that when Norman and his team announced that the job was done, or the fix was in place, or the application was ready and the assignment was completed, we were actually at the beginning of another work phase, and not at a point of implementation. Once, in frustration, I pointed out forcibly to Norman that, despite his optimistic announcement that he had once more brought us successfully to a moon landing, it appeared that I was still 50 feet above the surface with no available mechanism to get me down there. It became a company saying.

I find myself using it regularly as I listen to the way in which data and analytics companies are learning to live with AI. I cannot fault the ambition. It is clear that many service providers are framing solutions that are going to provide really dramatic advances in value to the widest possible range of societal requirements. But once the service design and the value added have been determined, we come to that familiar place which the software engineers describe in terms of ETL – the whole business of extracting, transforming and loading data. It is here that we discover that our data is not like other data. It is too structured, or not structured at all. It has been marked up in a way that makes it difficult to transform, or it has not been marked up at all, which makes it difficult to transform. It either lacks metadata to guide the process, or has too much metadata, or nobody can understand and use the metadata. So we must pause and create a solution.
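For anyone who has not met ETL at close quarters, a toy version shows where the pause tends to come. The sketch below is Python, and everything in it (the sample data, the field names, the in-memory "store") is an assumption made up for illustration; real pipelines are of course far larger.

```python
import csv
import io

# Invented sample input: the second record lacks its year.
RAW = """id,title,year
1,First record,2021
2,Second record,
"""


def extract(text):
    """Extract: read raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise types and set aside rows missing metadata."""
    clean, rejected = [], []
    for row in rows:
        if not row["year"]:
            rejected.append(row)  # cannot be transformed without its metadata
            continue
        clean.append({
            "id": int(row["id"]),
            "title": row["title"].strip(),
            "year": int(row["year"]),
        })
    return clean, rejected


def load(rows, store):
    """Load: write the clean rows into the target store, keyed by id."""
    for row in rows:
        store[row["id"]] = row


store = {}
clean, rejected = transform(extract(RAW))
load(clean, store)
print(f"loaded {len(store)} record(s), rejected {len(rejected)}")
```

Even in this tiny case half the input is set aside at the transform step for want of metadata, which is exactly the 50 feet in question.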

This is a well trodden track. Others have gone before us. The problems of integrating data into cloud data services like Databricks and Snowflake have slowed progress and added to costs for the past five years. It is interesting to see that a small industry has grown up to ease the problem, with companies like prophecy.com emerging with effective solutions. One might imagine that the same will happen with AI. Data transformation will cease in time to be an issue, since a raft of services will have emerged to deal with common problems, and the data creators will have reacted and adapted to the issues that arise when data is ingested into AI environments of all sorts.

But of course, this will not stop the press releases, which will continue to claim that something has happened some time before it might possibly happen. Yet it should moderate our expectations a little bit. Many feel that we have not yet hit the problems of getting first generation AI services fully operational, even if we are talking as if we were rolling out second generation services, tried and tested by legions of users. 50 feet above the moon can be a good place to be if it provides an opportunity to pause for thought, and realign our thinking before we make the slow eventual descent to the lunar surface.

Show me your provenance!

dworlock — Sat, 09 Mar 2024 13:25:34 +0000

When I was forced to temporarily cease blogging a few years ago (see personal note below), AI was a fact of life. Every year we saw improvements in the use of increasingly sophisticated algorithms. We noted the rise and rise of robotic process automation. Those of us with two decades of industrial memories recalled expert systems and neural networks. Those of us with four decades remembered hearing Marvin Minsky at MIT, telling us that he wanted the books in our libraries to speak to each other, to exchange and update knowledge and to build a new knowledge out of that exchange. Yet nothing here prepared us for 2023.

When historians get last year into some perspective they will probably conclude that what happened owed as much to the content creation requirements of online advertising, or the financial services requirement for a new wave of Silicon Valley investment frenzy, as it did to a breakthrough in AI capabilities. Yet what actually happened last year, even without such a perspective, is truly amazing. The year installed AI as a key component in any strategic planning exercise in almost any commercial activity. Hyper-investment and the hyperactivity resulting from it produced tools in generative AI which, a mere year later, had become immensely more powerful. Compare ChatGPT 3 to the current iteration of Gemini: a context window of 122,000 tokens to one of 1 million. Then look at the public recognition factor, and you find a world in which there is now a normal expectation that machine intelligence and machine interoperability will be a part of everyone’s everyday life. It is as if a switch has been flicked on, illuminating a new room into which we have walked for the first time. We all of us know that we can never now go back through that door or switch off that light. Pandora’s Law.

And we should not want to go back, either. What has happened should simply remind us that change does not happen evenly, and that the realisation of change sometimes takes longer to happen than we anticipate. But in 2023 I detected something else as well. A fear of change that was a little beyond normal anxiety. In the world in which I have worked for over 50 years, the idea that content creation through the exercise of machine intelligence could be more threatening than beneficial gained a powerful currency and soon turned into dystopian editorials in both trade and consumer media. As a result we have come out of 2023, the year of AI megahype, with both an enhanced view of the speed and power with which machine intelligence will help, support, and change our society, and a hysterical fear of evils unknown which may result from quantum computers secretly plotting our downfall on the network. Since the invention of the wheel mankind has been learning to accommodate and live with the machine, and we shall surely do so in the world of AI as well. Yet, in the clan to which I belong, the data, services and solutions vendors who called themselves content companies and information providers a few years back (and then before that used to describe themselves in Gutenberg terms as publishers), there has been fear of a different sort. Whether it meant anything or not, they have always embraced the consolation of copyright, the belief that intellectual property can be described and identified and protected, as one of the bulwarks of their commercial viability. The idea that individual creativity could be mirrored by machine intelligence, or that the machine might regurgitate, in whole or in part, content acquired as part of training data, or that the value of content or data once described as “proprietary” could be lost in the machine intelligence age: these ideas are the very stuff of panic.
Then add to them the knowledge that machine intelligence can produce “hallucinations”, that some related answers may not always be accurate and correct, and the long-held belief that machines loaded with garbage do indeed produce rubbish, and we find integrity fears added alongside fear of theft and diminishing valuations.

One of my mentors of many years ago, recommending me to a potential client, commented that “while generally sound on strategy he can be unreliable on copyright”. I have over the years tried to be better behaved, but it is difficult because it takes so long to bring the heavy guns of copyright law to bear on problems that have usually departed long before adequate legislation is available to control them. Early regulation on AI, like the EU AI Act, seems, in any case, more bent on risk control than anything else.

While the copyright lawyers are anxiously seeking reregulation for a machine age, I for one would take the arguments much more seriously if copyright holders paid real attention to marking their works with appropriate metadata and PIDs that indicated ownership and provenance. It is hard to imagine machine interoperable checking on the copyright status of works if those same works are not identified in ways that machines can recognise and understand. Then it becomes more possible to put pressure on AI developers to ensure that they licensed the genuine article, recognised the credentials of the real thing publicly, and increased the integrity of their solutions by showing users that only the real thing was used in the construction of the outcomes desired. This is beginning to happen in some encouraging ways: the fact that both Google and OpenAI now accept C2PA, the coding system developed for images and videos, shows what can be done by persuading people that being licit and responsible is good for business. Rather than have “fake” hung round their necks, it is better to say that you will check and code every image that you use, especially in an American election year. In text and data there are similar emerging conventions. The ISCC – International Standard Content Code – is now a draft ISO standard. The long-established GO FAIR provisions of the FAIR Foundation create metadata standards that render data “findable, accessible, interoperable and reusable”. Data and content owners who make it clear to interested parties and machines what the scope and ownership of their asset entails have a much better chance of working successfully with it in this New World. And in particular, they have a better chance of entering into proper and satisfactory licensing agreements around it.
If we are able to persuade the machine intelligence world that integrity is vital to business success, then we have a far better chance of creating the sort of licensing environments that pioneers like the Copyright Clearance Center have advocated and piloted for years. Businesses in the network have to make for themselves the business conditions that work in the network.

So who will police and patrol all of this until law and regulation finally catch up, if they ever do? The publisher and copyright lawyer, Charles Clark, my fellow delegate to the European Commission Legal Information Observatory, invented the maxim “the answer to the machine lies in the machine”. It was never better applied than at this point. If you want to find bias in machine intelligence then the simplest way to do so is programmatically. If you wish to know whether training data has been derived from legitimate known sources that will vouch for accuracy and currency, ask the machine to interrogate the machine. For the AI companies, the price of reputation may be breaking open the black box and demonstrating good practice in creating answers from the very best inputs.

PERSONAL NOTE: I maintained this blog continuously from 2009 to 2021. I suffered eyesight problems which have left me with some 40% of my vision. My road back to this form of communication has taken three years, during which I’ve had the huge pleasure of writing two books, drafting a third and eventually returning to blogging. Writing in the world of text to speech and speech to text software is different. As I say at the end of all of my communications at work: “If you find errors of syntax, grammar or spelling in what I’ve written, please remember that it is much harder for me to edit than ever before, so try to smile indulgently. On the other hand, if you think that I have written utter gibberish, please contact me immediately!”

The evolving role of DATA in the AI era

dworlock — Thu, 21 Sep 2023 16:22:50 +0000

CCC – FAIR Foundation Forum

“The evolving role of DATA in the AI era”

18 September 2023 Leiden

“If we regulate AI and get it wrong, then the future of AI belongs to the Chinese”. When you hear a really challenging statement within five minutes of getting through the door, then you know that, in terms of conferences and seminars, you are in the right place at the right time. The seminar leaders, supported by the remarkable range of expertise displayed by the speakers, provided a small group with wide data experience with exactly the antidote needed to the last nine months of generative AI hype: a cold, clean, refreshing glass of reality. It was time to stop reading press releases and start thinking for ourselves.

FAIR’s leadership, committed to a world where DATA is findable, accessible, interoperable, and reusable, began the debate at the requisite point. While it is satisfying that 40% of scientists know about FAIR and what it stands for, why is it that when we communicate the findings of science and the claims and assertions which result from experimentation, we produce old style narratives for human consumption rather than, as a priority, creating data in formats and structures which machines can use, communicate and with which they can interact? After all, we are long past the point where human beings could master the daily flows of new information in most research domains: only in a machine intelligence world can we hope to deploy what we know is known in order to create new levels of insight and value.

So do we need to reinvent publishing? The mood in the room was much more in favour of enabling publishers and researchers to live and work in a world where the vital elements of the data that they handled were machine actionable. Discussion of the FAIR enabling resources and of FAIR Digital Objects gave substance to this. The emphasis was on accountability and consistency in a world where the data stays where it is, and we use it by visiting it. Consistency and standardisation therefore become important if we are not to find a silo with the door locked when we arrive. It was important then to think about DATA being FAIR “by design” and think of FAIRification as a normal workflow process.

If we imagine that by enabling better machine to machine communication with more consistency we will improve AI accuracy and derive benefits in cost and time terms, then we are probably right. If we think that we are going to reduce mistakes and errors, or eliminate “hallucinations”, then we need to be careful. Some hallucinations at least might well be machine to machine communications that we, as humans, do not understand very well! By this time, we were in the midst of discussion on augmenting our knowledge transfer communication processes, not by a new style of publishing, but by what the FAIR team termed “nanopublishing”. Isolating claims and assertions, and enabling them to be uniquely identified and coded as triples, offered huge advantages. These did not end with the ability of knowledge graphs to collate and compare claims. This form of communication had built-in indicators of provenance which could be readily machine assessed. And there was the potential to add indicators which could be used by researchers to demonstrate their confidence in individual findings. The room was plainly fascinated by the way in which the early work of Tobias Kuhn and his colleagues was developed by Erik Schultes, who effectively outlined it here, and the GO FAIR team. Some of us even speculated that we were looking at the future of peer review!
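To make the idea concrete, here is a rough sketch in Python of a claim coded as a triple, carrying its own provenance and an optional confidence indicator, and then collated with other claims. It is an illustration only: real nanopublications are expressed in RDF with separate assertion, provenance and publication-information graphs, and every identifier below (the "ex:" names, the labs, the confidence scale) is invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import NamedTuple, Optional, Tuple


class Triple(NamedTuple):
    """A single statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Nanopub:
    """Simplified stand-in for a nanopublication (not the RDF format)."""
    uri: str                          # unique identifier for this publication
    assertion: Triple                 # the isolated claim
    provenance: Tuple[Triple, ...]    # where the claim came from
    confidence: Optional[float] = None  # hypothetical 0-1 indicator


def collate(nanopubs):
    """Group nanopublications by the claim they assert."""
    by_claim = defaultdict(list)
    for pub in nanopubs:
        by_claim[pub.assertion].append(pub)
    return dict(by_claim)


# Invented example: two labs assert the same claim, a third asserts another.
claim = Triple("ex:geneA", "ex:isAssociatedWith", "ex:diseaseB")
other = Triple("ex:geneA", "ex:isAssociatedWith", "ex:diseaseC")
pubs = [
    Nanopub("ex:np1", claim, (Triple("ex:np1", "ex:attributedTo", "ex:lab1"),), 0.9),
    Nanopub("ex:np2", claim, (Triple("ex:np2", "ex:attributedTo", "ex:lab2"),), 0.6),
    Nanopub("ex:np3", other, (Triple("ex:np3", "ex:attributedTo", "ex:lab1"),)),
]

claims = collate(pubs)
for assertion, supporting in claims.items():
    sources = [p.provenance[0].obj for p in supporting]
    print(assertion.obj, "asserted by", sources)
```

Because each claim is small, uniquely identified and carries its source, a machine can count who asserts what, and with how much confidence, without reading a single narrative article.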

Despite the speculative temptations, the thinking in the room remained very practical. How did you ensure that machine interoperability was built in from the beginning of communication processing? FAIR were experimenting with Editorial Manager, seeking to implant nanopublishing within existing manuscript processing workflows. Others believed we needed to go down another layer. Persuade SAP to incorporate it (not huge optimism there)? Incorporate it into the electronic lab notebook? FAIR should not be an overt process, but a protocol as embedded and unconsidered and invisible as TCP/IP. The debate in the room about how best to embed change was intense, although agreement on the necessity of doing so was unanimous.

The last section of the day looked closely at the value, and the ROI, of FAIR. Martin Romacker (Roche) and Jane Lomax (SciBite) clearly had little difficulty, pointing to benefits in cost and time, as well as in a wide range of other factors. In a world where the meaning as well as the acceptance of scientific findings can change over time, certainty in terms of identity, provenance, versioning and relationships became foundation requirements for everyone working with DATA in science. Calling machines intelligent and then not talking to them intelligently in a language that they could understand was not acceptable, and the resolve in the room at the end of the day was palpable. If the AI era is to deliver its benefits, then improving the human to machine interface in order to enable the machine to machine interface was the vital priority. And did we resolve the AI regulatory issues as well? Perhaps not: maybe we need another forum to do that!

The forum benefited hugely from the quality of its leadership, provided by Tracey Armstrong (CCC) and Barend Mons (GO FAIR). Apart from speakers mentioned above, valuable contributions were made by Babis Marmanis (CCC), Lars Jensen (NFF centre for protein research) and Lauren Tulloch (CCC).

CCC Joins the critical digital enablers

dworlock — Wed, 04 May 2022 13:46:12 +0000

There is a place in the recent Star documentary “Dopesick” where the investigators seeking to validate the claim of the manufacturers of OxyContin that it is non-addictive, search for the article which is said to be the source of the claim in the NEJM. They cannot find it in the year cited.

What is more, they cannot find it in the previous or subsequent years. This being the first decade of the century, they resort to keyword searching. Eventually they find the source which is said to validate this claim. It turns out to be a five-line letter in the correspondence columns, but the screen shows the thousands of references raised by the keyword searching and a long and harrowing task for the researchers. As I watched it, I reflected how far the world had moved on from those haphazard searching days, along with the realisation that we have not progressed quite as far as we sometimes think, as our world of information connectivity keeps having to track back before it goes forward.

The great age of keyword searching was succeeded in the following decade by the great age of AI and machine learning hype. We welcomed the world of unstructured data in which we could find everything because of the power and majesty of keyword searching and because of our effectiveness in teaching machines to become researchers. And we will never know how much we spent or how many blind alleys we went up as we tried to apply prematurely systems that only worked when we built in enough information to enable us to be categorical and certain about the results that we achieved. And in science, and in health science in particular, categorical certainty is what we must achieve at a bare minimum.

We will never know the cost in time and money terms because it is not in the interests of anybody to tell us. But we can see, as we move forward again, that there is a wide recognition now that fully informative metadata has to be in place, that identifiers have to point to a knowledge container in order to point machine intelligence in the right direction. All the work which we did in earlier decades on taxonomies, ontologies and metadata was not wasted, but simply became the foundation of the intelligent and machine interoperable data society which we seek now to build. We are developing a clear understanding that the route to the knowledge we seek lies in the quality of metadata, and the ability to rely upon primary data which is fully revealed in the metadata. Many of us now accept that primary data (by which I mean research articles and evidential data sets) will exist in future for most researchers as background: most of their research will take place in the metadata, and they will dip into the primary data only on rare occasions. There is a clear parallel here to other sectors, like the law, where the concepts and commentary become more important than the words of enactment.

And so as the age of reading research articles comes to an end, so the business of understanding what they mean has to take further strides forward. This means that existing players in the data marketplace have to reposition, and this week’s announcements from CCC exemplify that repositioning in a very clear manner. CCC is an extremely valuable component part of scholarly communications: its history in the protection of copyright and the development of licensing has been vital to helping data move to the right places at the right times. But CCC management know that that is not any longer quite enough: their role as independent producers of the knowledge artefacts that will make the emerging data marketplaces succeed is now coming to the fore. In two announcements this week they demonstrate that transition. One is the acquisition of Ringgold, an independent player in the PID marketplace. Amongst all metadata types, PID and DOI, alongside ORCID, have enabled the scope of the communications database marketplace to emerge.

PID stands for persistent identifier. In the case of Ringgold, it means making sense of organisational chaos, thus disambiguating what are the most confusing aspects of the sector. Just as ORCID sought to disambiguate the complex world of authorship, so Ringgold seeks to help machines reading machine readable data understand the nature and type of organisations and activities that are being described. And then extend this one stage further. It is one thing to know these fixed points of metadata knowledge, quite another to use and manipulate them effectively to create the patterns of connectivity upon which a data driven society depends.
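As a rough illustration of what organisational disambiguation involves, the sketch below resolves several spellings of an institution's name to one identifier. The registry, the `org:` identifiers and the institution names are all invented for the example; this is not Ringgold's data model or matching method, which will be far richer than a normalised string lookup.

```python
import re


def normalise(name):
    """Lower-case a name, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


# Hypothetical registry: identifiers and name variants are made up.
REGISTRY = {
    "org:0001": ["University of Example", "Univ. of Example", "Example University"],
    "org:0002": ["Example Institute of Technology", "EIT"],
}

# Reverse lookup from every normalised variant to its identifier.
LOOKUP = {
    normalise(variant): pid
    for pid, variants in REGISTRY.items()
    for variant in variants
}


def resolve(name):
    """Return the identifier for a name as written, or None if unknown."""
    return LOOKUP.get(normalise(name))


for affiliation in ["UNIV OF EXAMPLE", "Example University", "Unknown College"]:
    print(affiliation, "->", resolve(affiliation))
```

Once every variant points at the same identifier, a machine can treat the three spellings as one organisation, which is the fixed point that the later connectivity work depends upon.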

While announcing the acquisition of Ringgold, CCC also announced a huge step forward in the developing world of knowledge graphs. Putting labels on things is one thing: turning them into maps of interconnected knowledge points is another. CCC have been brewing this activity for some time, as their Covid research development work demonstrates. It is impressive to see this being done, and being done by a not-for-profit with a long history of service to the sector and of neutrality towards the competition between its major players. We are increasingly aware that some things have to be done for the benefit of the entire sector if the expectations of researchers, and of the sector as a whole, are to be met.

And of course, those competitive major players are also moving forward. The announcement in the last two weeks by Elsevier that they have made a final bid for the acquisition of Interfolio is a clear indicator. This implies to me a thesis that article production, processing and distribution is no longer in the development marketplace. We are now critically aware that the business of producing, editing and presenting research articles may be fully automated within the next decade. With even more development work going into the automation of peer review checking, human intervention in the cycle of research publication may become supervisory, and the bulk of production work may shift to the computer on the researcher’s desk.

There will still be questions of brand of course. There will still be questions of assessment and valuation of research contributions to be made. And it is this latter area which becomes a major point of interest as Interfolio takes Elsevier into the administration of universities and the administration of funding in a vitally interesting way. This is the place where new standards of evaluation will be developed. When the world does let go of the hand of the impact factor which has guided everybody for so long, then Elsevier want to be in the centre of the revaluation. In other words, if Elsevier as market leader formerly competed most with fellow journal publishers, its key competitor in future development may well be companies like Clarivate.

The critical path of evaluation will be the claims made by researchers for their work, and the way those claims are validated by reproducibility in subsequent work. In this connection we should also note the work of FAIR and its collaborators in developing ideas around “nano publishing”, using metadata to clearly outline claims made by researchers, both in the conclusions of articles and in other places as well. If this had been in place, the OxyContin investigators would have found it easy to find a letter back at the beginning of the century.

End

CCC Announces Robust Knowledge Graph Capabilities Through CCC Expert View

The Data Quality Imperative

After Content : Scholarly Communications After Articles ?

dworlock — Fri, 19 Nov 2021 18:47:11 +0000

KEYNOTE: AFTER CONTENT : THE EMERGING WORLD OF INFORMATION AND INTELLIGENCE

David R Worlock , Chief Research Fellow , Outsell Inc.

Although we are wearing life-jackets , we struggle in the water . The turbulence surrounding climate change and Covid 19 is so great that we are tossed in the wake of these vessels . In just 24 months , our ideas about the Future , the sunlit uplands of our visions of technology-enhanced work and leisure , the improvement of the human condition , the notions of incremental progress and exponential growth , have been shaken . Suddenly the Future is something endangered , something to be preserved , something to be secured – and sometimes something to be feared . This moment calls for courage and decisiveness . We have to clarify our objectives and build towards our desired outcomes regardless of tradition or orthodoxy . It is too late to say that the water is rising : we are in the water already and floundering . What we did before as researchers , librarians , publishers , intermediaries , is less than relevant to what we do next . What we do next will include things we have never tried before , so we will have to learn quickly and move flexibly .

Metaphors can only be stretched so far . In reality a good taste of the Future is already with us , though , as William Gibson so accurately forecast , it is not evenly distributed . While preparing for this Keynote through the autumn , I learnt at the CERN/University of Geneva discussions on innovation in scholarly communications that some scholars already envisage publishing on low cost open platforms managed and run by researchers and their institutions . Yet at the Frankfurt Book Fair , it was easy to sink back into the atmosphere of a scholarly journals world , post-print , but still adhering to the practices and principles of Henry Oldenburg and the Proceedings of the 1660s . Yet everyone was aware that something had happened . 450,000 Covid and Covid-related articles had been published in the previous 24 months . Everyone had seen submission growth during lockdown . Everyone paid lip service to the idea that something impressively vague – usually “AI” – would get us out of the hole . Everyone , as always in the fragmented workflow model of scholarly communications , wanted only to concentrate narrowly on their own piece of the action , regardless of what was happening elsewhere .

If any of the participants were able to take a holistic view of what is happening to researchers , then I think that these conclusions , amongst others , would offer themselves :

KNOWLEDGE IS TOO BIG TO BE CONTAINED. It has been for many years . And it is certainly too big for convenient interrogation and access if we pretend that we are still using physical formats in digital form . And what about the evidential data ? In some disciplines the urgency of searching everything at once meets the wall of paywall science . Researchers looking to test existing methodologies or find new ones want to find the description , not the article . Machine intelligence can only address the issues if the data is organised and accessible .
ARTICLES ARE NOT FOR READING . It is now hard to ask questions like “ Are you up to date ?” Without machine intelligence few people in research can be current in a traditional sense . The accessibility of global knowledge and the huge increases in knowledge discovery output in China and to a lesser extent India , made this difficult a decade ago . Now , without intelligent alerting and increasing use of intelligent machine summarisation , research roles would have been submerged by the struggle to keep abreast .
RESEARCHERS DO NOT WRITE ARTICLES. Indeed , in one sense they never did . In some disciplines articles reporting research have always had pre-formatted sections , and compiling literature reviews or citations has often been semi-automated . Other sections are often drawn from grant applications , or by using machine intelligence to draw and compile results from a laboratory log or a Jupyter lab notebook . The data narrative intelligence that now writes so many of our sports reports and business news analysis can equally well support the workbench productivity of researchers .

These three facets of our future , all available now within plain sight , argue a certain view of change . Yet the change will not be dictated by publishers or indeed librarians . Their roles will develop and alter as a result of the decisions made in the research community about the future . Indeed , change has already taken place as a result of Open Access . Recent reports indicate that some 33% of articles are now published this way , and the STM report forecasts that in research intensive countries – the UK specifically – that figure will reach 90% by the end of 2022. Many , including myself , think that the average may be higher globally , and close to 50%. OA has taken 20 years to arrive , but it has come because of increasing researcher and research institution approval . Yet for many researchers , asking questions about the way science was communicated was the smallest issue at hand , even if it was the most easily addressed . OA was simply the preface to the book entitled “ Open Science “, in which scientists question and debate every aspect of the process of research and discovery . We should be very glad of this . If science is indeed our only hope of rescue from these storm tossed waters in which we bob helplessly , then the very least that we want is for it to be accurate , ethical , constructively competitive where that helps , completely collaborative where that helps , and squarely based on the evidence available to all . That evidence will be largely data . As Barend Mons , Professor of BioSemantics at Leiden and Director of FAIR says , “ Data is more important than articles “ . And this is where the future begins .

“As most article writing is increasingly done using machine intelligence ….the article can be fragmented and each element published when ready .”

The picture painted so far shows machine intelligence intervening to ameliorate human issues with handling content . The future is about building the structures which will accomplish that . As so often , it is not about inventing something wholly new to do the job . Artificial intelligence has been with us in principle for 60 years , and from the Expert Systems and the Neural Networks of the past 20 years we can produce a mass of practical experience . This is not to say that there are no problems : what we can do depends critically on the quality of our inputs . In many sectors of working life , data bias remains a real problem , and algorithms can as a result inaccurately represent the intentions of their creators . The positive facts are that we can plan to use a whole range of AI-based tools to address our issues . Deep learning , machine learning, the now widespread use of NLP , the increasing effectiveness of semantic-based concept searching and comparison, as well as other forms of intelligent analytics have all been deployed effectively over the past five years and intensively in Covid research . And yet , we have not yet entered the Age of Data in Scholarly communications , despite the daily practices of many researchers .

“ Our sense of priorities is upside down . Data is always more important than articles , and will continue to be in the age of machine intelligence “ .

We cannot seem to break away from the notion that communication is narrative . The journal article , a report on an experiment or a research programme , is in itself a narrative . It is a story told by humans to transfer knowledge and information . This means nothing to machine intelligence . The metadata that guides machine to human communication will be far less effective in promoting machine to machine interoperability. If we want to use this interoperability to , for example , rerun an experiment in simulation or in reality , or find every place where similar results have been recorded using similar methodologies but described in different words or languages , then we need an augmented set of sign posts to shorthand the way that two machines speak to each other . And we need protocols and permissions that license two machines to negotiate data across the fragmented universe of science data , and across its innumerable paywalls .

“Considering that most of the readers of scholarly articles are now machines we should prepare those articles so that machines can interact with them even more effectively “

FAIR and GO FAIR have made great strides in making this new world possible . There is a role for publishers and librarians in helping to ensure that data is linked to articles , saved in a safe repository , fully accessible and with efforts being made to develop the business models that improve metadata and thus machine to machine exchange . There is an even bigger role to ensure that all parts of the article are fully discoverable as separate research elements through added metadata to support the full interactivity of machine intelligence . It is predictable that in time most readership of articles will be by machine intelligence , and that much of what researchers know about an article will come from the short synopsis or the infograms provided by alerting services or impact and dissemination players , who will have an important role in signposting change and adding metadata ( Cactus Global’s R Discovery and Impact Science , or Grow Kudos , are good examples . ) Researchers will predictably become adept users of annotation systems ( hypothes.is ) , writing their thoughts directly onto the data and content-as-data to create collaborative routes to discovery . Wider fields of data will become more routinely available , as DeepSearch9 have demonstrated with the deep web and with medical drug trials . Some researchers will desert long form article writing altogether , preferring to attach results summaries directly onto the data and distinguish them with a DOI , as they have done for many years in cell signalling , and as members of the Alliance of Genome Resources do in their Model Organism databases . Here again DOIs and metadata connect short reports ( MicroPublishing ) to the data . And , if we follow the excellent development work of Tobias Kuhn , we shall be publishing explicitly for machine understanding ( “ nano-publishing “.)

Of course , we are still a long way away from the prevalence of this very different world of scholarly communications , at least for the generality of researchers . And if this is really the way we are going then we should expect to see some stress points on the way , some indicators that the main structures of the content world of article publishing are beginning to bend and buckle . We should also expect to see the main concerns of Open Science beginning to have an impact . Every observer of these developments will have their own litmus list of indicative changes : here are mine :

1 . Article Fragmentation . Over the last thirty years I have several times acted as a Judge in contests to create The Article of the Future . Some of these , notably the one created by Elsevier , showed huge technical ability . We are now used to the article that contains video , oral interviews and statements , embedded code , graphs that can be re-run with changed axes , and in healthcare ( OpenClinical) embedded mandates that can be carried over into clinical systems . Some of these artefacts need to be searched in a data driven environment if we are to find exactly the moment in the video where certain statements were made , for example. Articles stored in traditional databases are not normally susceptible to this type of enquiry . I expect to see articles appearing in parts across time , linked to the early stage research activity (morrissier.com, and Cassyni) which makes seminars , conference speeches and other material created prior to the termination of a research project , available and accessible as indicators of early stage research results .

The influence of Open Science on the redevelopment of the article will be acute . Pre-registration , the process by which research teams publish their hypothesis and methodology at the very beginning of the research process , is designed to prevent any subtle recalibration of expectations with results in the process of formulating the published report . PLoS has implemented a service that trials this idea . At the same time Open Science demands that the searchable record should give much better coverage of successful reproduction of previously published findings , as well as coverage of failed experimentation and of failure to reproduce previous results . All of this has obvious value in the scientific argument : little of it is in tune with the practices of most journal publishers . I expect to see journal publishing becoming much more like an academic notice board , with linked DOIs and metadata helping researchers to navigate the inception to maturity track of a research programme , as well as all of the third party commentary associated with it .

2 PrePrints and Peer Review . Critics of what is happening currently as scholarly communications gradually eases itself into a born digital framework for the first time , point to the over-production of research and in particular to the rise of the pre-print as proof of too much uncategorised , lightly peer reviewed material in the system of scholarly communication . There are always voices that want to go back the way we came . Others point out that if we can successfully search the deep web – 90% greater than Google – then searching a few preprint servers should not be too much of a challenge , especially if we get DOIs and metadata right first time . And in thinking about this we should factor in the idea that developing the sophistication of our identifiers , increasing the range and quality of metadata applied throughout the workflow of scholarly communication , and extending the reach of semantic enquiry remain bedrock needs if scholarly communication is going to function , let alone become more effective . By the time that these processes reach maturity , we will have long ceased to refer to any of this material as “articles “. We will simply refer to “research objects “ in a context where such an object might be a methodology , a peer review , a literature review , a conference speech , an hypothesis , an evidential dataset or any other discrete item . Progress in this direction will be the way in which we measure the real “digital transformation “ of scholarly communications .

3 When do we do Peer Review ? In 2021 , two of the physicists who won the Nobel Prize , both over 80 , were distinguished for work accomplished in the 1970s and 1980s . Open Science points out that our current peer review system does not account for changes in appreciation of scholarly results over time . In addition , the current system can shelter orthodoxy from criticism , and in the narrow confines of a small sub-discipline , is open to being ‘gamed ‘, if not corrupted . Many subscription publishers cling to peer review , along with VOR ( Version of Record ) , like a comfort blanket , sensing that this may give them the ‘stickiness’ to remain important in an age of rapid change . It helps that for many publishers peer review is something they organise but do not pay for , leaving an uneasy feeling that it may not survive a reluctance amongst researchers to volunteer ( a shortage is being felt in some disciplines already ) where neither pay nor public recognition is available .

Two factors complicate this issue . One is timing in the publishing process . Do we really need an intensive review at this point? Funders have reviewed the research programme and the appointed team , and will be able to do due diligence against those expectations . The availability of much more information around reproducibility or the lack of it amongst the flow of research objects is important here , but takes time post-publication . The ability of critics and supporters to add commentary within this workflow will become important , providing the critical input of named individuals who are prepared to stand behind their views . The introduction of scoring systems that are able to assess the regard with which a body of work is held , and index changes to that over time will be critical developmental needs . And then the second factor contributes : AI – based analysis has already proved successful in reducing the element of checking and verifying which is part of each peer review . The UNSILO engine , a part of Cactus Global , executes some 25 checks and is widely used to reduce time and effort in publication workflows . As work like this becomes even more sophisticated and intelligent , it will not simply improve the quality of research objects , but will create its own evidential audit trail , reassuring researchers that key elements have been checked and verified .

4 Open Access/Open Platform. The rush to embrace change is so prominent in certain parts of our society that we tend to turn the changed element into the New Orthodoxy well before its maturity date . This is certainly the case with Open Access , when perhaps the question we should be asking is “ How long will Open Access survive ?” OA is a volume based business model . This is important to recall when there is pressure for APCs to reduce , and when Diamond OA becomes a topic of real interest and concern . Diamond OA often relies on the voluntary efforts of researchers and small scholarly societies , and these efforts can prove to be sporadic . Predictably , Open Access will lead to an even greater concentration of resources in a very fragmented industry . While Springer Nature and Elsevier are described as behemoths within scholarly communications , they are far from the size of major media , information or technology players . OA will drive more Hindawis into more Wileys .

Alongside this we must note changes in publishing workflows . As APCs stabilise and tend to decrease , margins will be maintained by the increasing application of RPA , Robotic Process Automation . The technology today which can write a legal contract proves equally adept at reading and correcting a proof , resolving issues in a literature search or creating a citation listing . Yet publishers who today look at process cost reduction as a way of staying in business must also factor in the elimination of barriers to entry that this involves . We shall reach a point where mass self-publishing of research objects , whether still in articles or not , becomes very feasible . The successors of the Writefulls , the Grammarlys and the WeAreFutureProofs of today will become the desktop tools of the working researcher . And then the F1000s of today , or their ORC derivatives , or the Octopus Project recently funded by UKRI , will assume the status of Open Platforms , the on-ramps to move articles and then research objects into the bitstream of scholarly discussion and evaluation . This too will give an opportunity to address the most glaring omission in today’s scheme of process : the lack of a cohesive dissemination element . The irony here is that , for many participants , getting published means ‘ everyone knows about it ‘. Clearly they do not . Some publishers offer large volumes of searchable content behind paywalls , and the whole sector talks learnedly about “ discoverability “. Why , in the age of knowledge graphs and low/no cost communications , a publisher would not feel able to alert every researcher in a given sector to the appearance of fresh materials linked to their research interests , is a mystery . The gap has been partially filled by social media players like ResearchGate , but as long as the social media remains advertising based some researchers will reject this . Players like ResearchSquare , Cactus RDiscovery and Impact Science , and Grow Kudos all address these issues in various ways , but gaining impact from meaningful dissemination remains a blind spot for many publishers .

5 Metrics . It is obvious enough that new systems of metrics will grow out of the evaluated bitstream of scholarly communication . While citation indexation fades for some , it does not go away . Using altmetrics to create new measures , like Dimensions , provides a welcome variation , but is still far from being a standard . If it looked at one point that Clarivate was going to revive ISI to recreate the Impact Factor , then it has also looked in recent years as if Open Science advocates have set their faces against the impact factor as an indexation that can be so easily and obviously gamed . There is then a vacuum at the heart of the digital transformation of scholarly communications : we still do not know how to rank and index contributions so that searchers can see at a glance how colleagues rate and value each other’s work . When we do – and I have jumped the gun by naming it “Triple I “ already , for the Index of Intervention and Influence – it will capture and evaluate every network participation , from grant application to pre-registration intent to early stage poster and seminar and conference contributions , to blogs and reviews and on to the researcher’s own results and their reception and evaluation . Here at last the distortion of the pressure to “ Publish or Perish “ will be laid to rest .

CONCLUSION. I have tried to describe here a world of scholarly communications in motion . We need to watch very carefully in the next few years if we are to validate the direction and judge the speed . As we move into 2022 , the way in which so called “transformative agreements “ are renewed or replaced will offer up plenty of clues . We need to validate experimentation in forms of communication , both long and short term . While many publishers assert that authors will not accept that data leading to reproducibility should be made available, PLoS have maintained one service in which data is linked by reference to the article after being placed in a safe data repository like Dryad or Figshare . They report no resistance to these requests . Unless we all experiment we will never know.

The approaches made by Open Science as a generalised movement for change and reform will be critical , as will the speed and completeness with which these ideas are accepted and implemented , especially by funders . The issues here will be both big and small . Retractions , the way they are notified and the way in which the discovery of retracted material is flagged to users , is a finite area that has required reform for many years . On the other hand , the moves in several countries and many institutions to de-couple article and book production from promotion or preferment in academic institutions has wide implications . Remove “publish or perish “ and one of the main supports of the publishing status quo goes with it . It will not stop researchers being measured on the quality or impact of their contributions to scholarly communications , but it may well be that those contributions can be just one element of a multi-faceted rating .

Data and AI will continue to be central to the possible directions of change. Just as SciHub challenged the paywalls of the industry half a generation ago, so the announcement of the launch of The General Index in October marks a critical moment for researchers. There are alternative means of knowledge access and evaluation. There is nothing illegal about Carl Malamud’s enterprise, but using text and data mining techniques to create an index of terms and five-word expressions of concepts in 107 million scholarly articles – just the beginning, says the team – and making it free to use and downloadable is a huge achievement. It means that the age of going to the source document, the version of record, recedes even further from the researcher’s priorities, except as a last resort or if the wording is of critical importance. For those who have long held the view that most research in the literature would eventually be done only in the metadata, this is an early dawn.

Some will read what I have said and conclude that this is just another “the end of publishing” talk. This would be wrong. I want to reach out to the hundreds and thousands of data scientists, software engineers and architects who have joined what were once traditional publishing houses in the last decade. You have a key role and a huge opportunity as the digital transformation of scholarly communications at last gets underway. The data analytics, the RPA systems, the dissemination environments, the new services summoned up by the Open Science vision – all of these and many more provide opportunities to reboot a service industry and create the support services that researchers need and value.

USEFUL REFERENCES

AI-enabled scholarly workflow tools and other support services:

Scholarcy.com

Scite https://scite.ai

protocols.io

morressier.com

Cassyni.com

UNSILO. https://unsilo.ai/about-us/

Barend Mons. Seven Sins of Open Science (slide set). https://d1rkab7tlqy5f1.cloudfront.net/Library/Themaportalen/Open.tudelft.nl/News%20%26%20Stories/2018/Open%20Science%20symposium/Spreker%204%20open-science-Barend%20Mons_web.pdf

Open Science. The Eight Pillars of Open Science. UCL London https://www.ucl.ac.uk/library/research-support/open-science/8-pillars-open-science

The General Index: the next challenge to the Paywall

dworlock — Wed, 27 Oct 2021 16:51:40 +0000

The General Index poses a publisher question

Science advances by virtue of standing on the shoulders of giants, but sometimes you need a stepladder. Longtime public access activist Carl Malamud believes he is providing one in his newly launched (7 October) General Index, a way of filleting scientific knowledge and spitting out the essential bones which may yet rival SciHub, the Azerbaijan-based pirate site of full text science articles, as the no-cost way to search scientific literature without paying publishers for the privilege. In a world of pinched science budgets this may be appealing. Even more appealing may be the thought of getting to the essence without full text searching and the elimination of false leads and extraneous content.

It used to be a joke that one day the metadata around science research articles would be so good that you could pursue most searches through the metadata without troubling yourself with the text of the article. Indeed, in some fields, like legal information, the full text of cases could be a nuisance, and concordances, citation indexes and other analytical tools could be used to get quickly to the nub of the question. Today these are built into the search mechanism and the search strategy. Mr Malamud has a long history in public and legal information (see Public.Resource.Org, his not-for-profit foundation and publishing platform). At one point he challenged Federal law reporting on cost and campaigned to become US Printer. But he is a very serious computer scientist and his target now is the siloed, paywalled world of non-Open Access science publishing. And the point of attack is both shrewd and powerful.

The weakness of the publishers is that their paywalled information cannot be searched externally in aggregate in a single, comprehensive sweep. Just like SciHub, Mr Malamud enables “global” searching to take place. He has built an index. Currently he covers 107 million articles in all of the major paywalled journals. He has indexed n-grams – single words and words in groups of 2, 3, 4, and 5. He has built metadata, IDs and references to the journals. And, he claims, he has done this without breaching anyone’s copyright. He points out that facts and ideas are not subject to copyright, and that his index entries do not attract copyright since they are too short to be anything else but fair dealing. Publishers will no doubt try to test this legally, probably in the US or UK since common law jurisdictions look more favourably on economic rights. In the meanwhile it is worth pondering the words of part of his publication statement:

“The General Index is non-consumptive, in that the underlying articles are not released, and it is transformative in that the release consists of the extraction of facts that are derived from that underlying corpus. The General Index is available for free download with no restrictions on use. This is an initial release, and the hope is to improve the quality of text extraction, broaden the scope of the underlying corpus, provide more sophisticated metrics associated with terms, and other enhancements.”
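For readers who want to see what indexing n-grams means in practice, here is a minimal sketch in Python. It is an illustration only and not Mr Malamud's code: the function names, the whitespace tokeniser and the two-article corpus with its identifiers are all assumptions made up for the example. It breaks each text into every run of one to five consecutive words and records which articles each run appears in, without storing the article text itself.

```python
from collections import defaultdict


def ngrams(tokens, max_n=5):
    """Yield every run of 1..max_n consecutive tokens as a tuple."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def build_index(articles, max_n=5):
    """Map each n-gram to the set of article IDs in which it occurs."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        # Naive lower-casing and whitespace split: an assumption for the
        # sketch, not the text extraction used by the real General Index.
        tokens = text.lower().split()
        for gram in ngrams(tokens, max_n):
            index[gram].add(article_id)
    return index


# Hypothetical two-article corpus with made-up identifiers.
articles = {
    "doc-1": "gene expression in yeast cells",
    "doc-2": "protein folding in yeast",
}
index = build_index(articles)

# Look up which articles contain the two-word phrase "in yeast".
matches = index[("in", "yeast")]
```

A search against such an index returns identifiers rather than passages, which is the sense in which the approach steers the researcher towards articles without reproducing them.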

It is very clear from this that science publishing, if it attacks the General Index, is going to do so on very tricky grounds. Looking like monopolists is nothing new, but persuading researchers that publishers are instrumental in building reputation and career advancement weakens as an argument when the publisher is being pilloried for restricting access to knowledge. Building a new business in data solutions and analytics is a road that several have taken, but only the largest are very far advanced. This might be a time for the very largest to get together to discuss grouping services for researchers, but free, and without anti-trust implications? Old style subscription journal publishing is getting boxed into a corner, with Open Platform publishing advancing quickly now, with applications like Aperture Neuro (https://www.ohbmbrainmappingblog.com/blog/aperture-neuro-celebrates-one-year-anniversary-with-new-publishing-platform-and-first-published-research-object) and work like the Octopus research project at UKRI that I have mentioned previously.

In all of this, Data, the vital evidential output from research and experimentation, remains neglected. Finding a business model for making data available, marked up, richly enhanced with metadata and fully machine-to-machine interoperable remains a key challenge to everyone in scholarly communications. Even when Mr Malamud’s 5 terabytes of data (compressed from 38) is installed it will only be a front end steering device to guide researchers more quickly to their core concerns – and those will eventually result in looking at the underlying data rather than the article.

The references below include a Nature article with the only comment from a major publisher that I have seen so far. I wonder if they are talking about it in Frankfurt!

Data poetry: Ode to The General Index

https://www.nature.com/articles/d41586-021-02895-8


Is Your Boardroom Discussion Futures Literate?

dworlock — Wed, 20 Oct 2021 12:42:14 +0000

We need to talk seriously about Futures Literacy. And we need to do it now, before it is too late. The decisions being taken in our boardrooms are getting bigger and bigger. And if they are not, then we should be very worried indeed. This month we come to COP26, exposing once again the need to take urgent steps to address climate change. The Board cannot simply leave all of this to the politicians, who will always be guided by what will give them electability. The decisions on climate, upon investment in change and most of all on speed of deployment, will be critical in meeting targets and, eventually, in escaping the worst effects of hundreds of years of exploitation and neglect. Yet for many of us, as we steam towards the Metaverse at ever increasing speed, it seems as if we have a parallel set of concerns. We know that we have to think about investing in the technologies that surround information content and data in the information industry. We also know that next year our customers will have different and enhanced expectations of us. We are sophisticated now as businesses, handling online service functions, raising fresh capital and working cohesively with stakeholders. Then why, O why, is the Pygmy in the room the way that we discuss the Future?

I am now past fifty years of working, as a manager, a Director, a CEO and as an advisor to many boards. My experience of experience is that you do not really learn very much from it in periods of rapid change. When I started it did not matter much if a senior manager could not distinguish Linotype from Monotype. Today it does not matter much if a manager cannot discuss Digital Twins or tell you how a GAN operates. What concerns me is the nature of the dialogue, the discipline of the approach, the “empirical rigour” in the discussion, since these are the necessary supports for planning, and, above all, for planning timing, which are needed if we are to sustain any hope of making sense of what we need to do beyond Q2.

All too often, even at board level, discussion devolves to the anecdotal brilliance of someone’s daughter and the app she found on Google, or the son who downloaded a course and passed Math without needing a tutor, or a visionary whom someone has seen speaking on YouTube, or a book which someone had heard of but never actually read… This anecdotalisation of the Future makes me want to scream. I take it that we sit on Boards because we are charged by the stakeholders, beyond our governance duties, with the maintenance and growth of Value through Time. The Future is thus our mandate, not something to obfuscate around. We need to talk frankly about how we anticipate change, and just as we should be watchful now for bias in data, we need to start with a careful self-audit of our own bias about the future.

The most valuable work that I know in this area comes from UNESCO, and from Riel Miller, their head of Futures Literacy. The case he makes is impressive and has the huge merit of moving us away from an extrapolation-based thought process, where we all try to second guess future trends from what we have experienced in our own lives. In the first instance, our own experiences are collected randomly. In the second, this method gives us no way of testing probability or timing. Far better then to try to develop strategies about the Future by creating, or reframing, our thinking through developing hypotheses, altering all of the variables and testing our assumptions. This sounds to me like a managerial version of scientific method, and a discipline devoutly to be wished for when we come to consider the lazy thinking around much of the Futurism that we read and hear. In the information industry, after all, we say that we are driven by data science. Some attempt to think scientifically may well be overdue.

So how do we go about the business of reframing our corporate thinking about the future? Riel Miller’s suggestion is the Futures Literacy Labs concept, though I would not recommend this in some of our industry corporate frameworks as a board level activity. However, the opportunity to put some senior directors, key managers and some younger fast track recruits into a regular meeting context where a discussion discipline is maintained around forming and testing concepts could both inform board decision making and spark small scale experimentation to test the ideas developed. And this would be especially valuable and useful if the primary concentration was on our users and how they will work. This then forces us to think hard about how we continue to add value for them. It could stop this low level assumptive discussion of generalities – “Of course, AI is the future of everything” – and ground our arguments in the vital qualities that they seem to lack – Context and Timing. Above all, it widens the responsibility for the future – this does not rest with the CEO, the CSO or the CTO. It rests with all of us.