The General Index poses a publisher question

Science advances by virtue of standing on the shoulders of giants, but sometimes you need a stepladder. Longtime public access activist Carl Malamud believes he is providing one in his newly launched (7 October) General Index, a way of filleting scientific knowledge and spitting out the essential bones. It may yet rival SciHub, the Kazakhstan-founded pirate site of full-text science articles, as the no-cost way to search scientific literature without paying publishers for the privilege. In a world of pinched science budgets this may be appealing. Even more appealing may be the thought of getting to the essence without full-text searching, eliminating false leads and extraneous content.

It used to be a joke that one day the metadata around science research articles would be so good that you could pursue most searches through the metadata without troubling yourself with the text of the article. Indeed, in some fields, like legal information, the full text of cases could be a nuisance, and concordances, citation indexes and other analytical tools could be used to get quickly to the nub of the question. Today these are built into the search mechanism and the search strategy. Mr Malamud has a long history in public and legal information (see Public.Resource.Org, his not-for-profit foundation and publishing platform). At one point he challenged the cost of Federal law reporting and campaigned to become Public Printer of the United States. But he is a very serious computer scientist, and his target now is the siloed, paywalled world of non-Open Access science publishing. And the point of attack is both shrewd and powerful.

The weakness of the publishers is that their paywalled information cannot be searched externally in aggregate in a single, comprehensive sweep. Just like SciHub, Mr Malamud enables “global” searching to take place. He has built an index. Currently he covers 107 million articles in all of the major paywalled journals. He has indexed n-grams – single words and words in groups of 2, 3, 4, and 5. He has built metadata, IDs and references to the journals. And, he claims, he has done this without breaching anyone’s copyright. He points out that facts and ideas are not subject to copyright, and that his index entries do not attract copyright since they are too short to be anything but fair dealing. Publishers will no doubt try to test this legally, probably in the US or UK, since common law jurisdictions look more favourably on economic rights. In the meantime it is worth pondering the words of part of his publication statement:

“The General Index is non-consumptive, in that the underlying articles are not released, and it is transformative in that the release consists of the extraction of facts that are derived from that underlying corpus. The General Index is available for free download with no restrictions on use. This is an initial release, and the hope is to improve the quality of text extraction, broaden the scope of the underlying corpus, provide more sophisticated metrics associated with terms, and other enhancements.”
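
For readers unfamiliar with the term, the n-gram idea described above can be sketched in a few lines of Python. This is only an illustration of what "single words and words in groups of 2, 3, 4, and 5" means, not Mr Malamud's actual extraction pipeline: the function name, the lower-casing and the whitespace tokenisation are all assumptions made for the example.

```python
from collections import Counter


def ngrams(text, max_n=5):
    """Count every run of 1..max_n consecutive words in text.

    Hypothetical helper: splitting on whitespace and lower-casing are
    simplifying assumptions, not the General Index's real tokenisation.
    """
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n words across the text.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


sample = "standing on the shoulders of giants"
index = ngrams(sample)
```

Note that the six-word sample never appears in the index as a whole, only as overlapping fragments of up to five words. That is the substance of the claim that the entries are too short to reproduce an article.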

It is very clear from this that science publishing, if it attacks the General Index, is going to do so on very tricky grounds. Looking like monopolists is nothing new, but persuading researchers that publishers are instrumental in building reputation and career advancement becomes a weaker argument when the publisher is being pilloried for restricting access to knowledge. Building a new business in data solutions and analytics is a road that several have taken, but only the largest are very far advanced. This might be a time for the very largest to get together to discuss grouping services for researchers, but free, and without anti-trust implications? Old-style subscription journal publishing is getting boxed into a corner, with Open Platform publishing advancing quickly now, with applications like Aperture Neuro and work like the Octopus research project at UKRI that I have mentioned previously.

In all of this, Data, the vital evidential output from research and experimentation, remains neglected. Finding a business model for making data available, marked up, richly enhanced with metadata and fully machine-to-machine interoperable remains a key challenge to everyone in scholarly communications. Even when Mr Malamud’s 5 terabytes of data (compressed from 38) are installed, they will only be a front-end steering device to guide researchers more quickly to their core concerns – and those will eventually result in looking at the underlying data rather than the article.

The references below include a Nature article with the only comment from a major publisher that I have seen so far. I wonder if they are talking about it in Frankfurt!


We need to talk seriously about Futures Literacy. And we need to do it now, before it is too late. The decisions being taken in our boardrooms are getting bigger and bigger. And if they are not, then we should be very worried indeed. This month we come to COP26, exposing once again the need to take urgent steps to address climate change. The Board cannot simply leave all of this to the politicians, who will always be guided by what will make them electable. The decisions on climate, on investment in change and most of all on speed of deployment, will be critical in meeting targets and, eventually, in escaping the worst effects of hundreds of years of exploitation and neglect. Yet for many of us, as we steam towards the Metaverse at ever-increasing speed, it seems as if we have a parallel set of concerns. We know that we have to think about investing in the technologies that surround information content and data in the information industry. We also know that next year our customers will have different and enhanced expectations of us. We are sophisticated now as businesses, handling online service functions, raising fresh capital and working cohesively with stakeholders. Then why, O why, is the Pygmy in the room the way that we discuss the Future?

I am now past fifty years of working, as a manager, a Director, a CEO and as an advisor to many boards. My experience of experience is that you do not really learn very much from it in periods of rapid change. When I started it did not matter much if a senior manager could not distinguish Linotype from Monotype. Today it does not matter much if a manager cannot discuss Digital Twins or tell you how a GAN operates. What concerns me is the nature of the dialogue, the discipline of the approach, the “empirical rigour” in the discussion, since these are the necessary supports for planning and, above all, for planning timing, which are needed if we are to sustain any hope of making sense of what we need to do beyond Q2.

All too often, even at board level, discussion devolves to the anecdotal brilliance of someone’s daughter and the app she found on Google, or the son who downloaded a course and passed Math without needing a tutor, or a visionary whom someone has seen speaking on YouTube, or a book which someone had heard of but never actually read… This anecdotalisation of the Future makes me want to scream. I take it that we sit on Boards because we are charged by the stakeholders, beyond our governance duties, with the maintenance and growth of Value through Time. The Future is thus our mandate, not something to obfuscate around. We need to talk frankly about how we anticipate change, and just as we should be watchful now for bias in data, we need to start with a careful self-audit of our own bias about the future.

The most valuable work that I know in this area comes from UNESCO, and from Riel Miller, their head of Futures Literacy. The case he makes is impressive and has the huge merit of moving us away from an extrapolation-based thought process, where we all try to second-guess future trends from what we have experienced in our own lives. In the first instance, our own experiences are collected randomly. In the second, this method gives us no way of testing probability or timing. Far better, then, to try to develop strategies about the Future by creating, or reframing, our thinking through developing hypotheses, altering all of the variables and testing our assumptions. This sounds to me like a managerial version of scientific method, and a discipline devoutly to be wished for when we come to consider the lazy thinking around much of the Futurism that we read and hear. In the information industry, after all, we say that we are driven by data science. Some attempt to think scientifically may well be overdue.

So how do we go about the business of reframing our corporate thinking about the future? Riel Miller’s suggestion is the Futures Literacy Labs concept, though I would not recommend this in some of our industry corporate frameworks as a board-level activity. However, the opportunity to put some senior directors, key managers and some younger fast-track recruits into a regular meeting context, where a discussion discipline is maintained around forming and testing concepts, could both inform board decision making and spark small-scale experimentation to test the ideas developed. And this would be especially valuable if the primary concentration was on our users and how they will work. This then forces us to think hard about how we continue to add value for them. It could stop this low-level assumptive discussion of generalities – “Of course, AI is the future of everything” – and ground our arguments in the vital qualities that they seem to lack – Context and Timing. Above all, it widens the responsibility for the future – this does not rest with the CEO, the CSO or the CTO. It rests with all of us.