Jun 24
The conversation often goes like this:
“What do you think are the most important issues for the information industry today?”
“Well, of course it’s AI, and getting these AI developers to act responsibly around data.”
“You mean, act responsibly and transparently, and identify the data used and held in their models?”
“Yes, of course, and acting responsibly also means paying a decent license fee for our data content!”
“Yes, they have to realise that they cannot ignore the powerful legal and moral position of those who hold copyrights in valuable data.”
I too am a firm advocate of data licensing for AI modelling reuse. When IP is used for any purpose, I believe that it has to be recognised, that the usage has to be by consent, and that proper acknowledgement in monetary terms needs to be made to recognise the effort and curation involved. In saying this, of course, I also want to make it clear that I know that most data owned by the B2B organisations that use it in information services was not the original IP of those owners: the current owners obtained the data in the course of creating an information service of some sort or another. In doing this, they often edited it, structured it, improved it, added metadata to it and created value as a result. The original owners – governments, private citizens, research organisations, corporate bodies and so on – create the data by virtue of their existence and their activity, and in some instances need it to be collected and manipulated for reasons of public policy, research and innovation, compliance activity or reputation management. For most information service providers, the data that they have collected is the most valuable commodity in their world – “the oil of the virtual world”. They prize it highly and they think it is unique. They are right to value it, but we are all becoming gradually aware that there is more data in the world than is contained in the world’s commercial databases, the Cloud or even the Internet.
In the course of looking for and trying to map the various emerging data licensing agencies, the breadth of possibility becomes clear. The powerhouse that is CCC, the Copyright Clearance Center (www.copyright.com), is central to everything and is concentrated around scientific and medical data. ProRata (prorata.ai) builds AI-based attribution and monetization technologies and solutions that credit and compensate content owners for the value of their work. Human Native (humannative.ai) says: “Better AI starts with better data. We bring together suppliers of high quality, premium data with reputable AI developers—come join the ecosystem”. Created by Humans (createdbyhumans.ai) calls itself “The AI rights licensing platform for books”, while Narrativ (narritiv.ai) is a licensing site for voices and voice data. And the Data Licensing Alliance, run by Dave Myers, is more than four years old and seeks to build a marketplace of buyers and sellers in STEM data (www.diadata.com).
Yet all of this rich variety exists in the domain of human creativity. The needs of AI models in data terms are not confined to human creativity. The potential use of data derived from machine intelligence now becomes a factor in creating AI models, and just as we have heard about synthetic data in terms of financial services, so we are now beginning to think about synthetic data in terms of AI modelling. The announcement last week of the funding of SandboxAQ by Nvidia takes this former Google startup into new territory.
SandboxAQ (www.sandboxaq.) is, it says, “leading the next wave of enterprise AI with Large Quantitative Models (LQMs) — grounded in physics and built to simulate real-world systems. Across biopharma, chemicals, advanced materials, cybersecurity, healthcare, navigation, and more—LQMs provide the scientific accuracy and computational scale to solve the world’s most complex challenges.” So, in financial services and in scientific research and innovation at least, we can make our own data and not be wholly dependent upon the world of owned and traded data. And as this new scenario becomes apparent, some of us will begin to wonder what its effect will be on real-world data valuations.
The use of AI in this way to create logical extensions of existing knowledge is already well established. I notice that the industry is beginning to refer to “synthetic” data as opposed to the “real world data” (inevitably, RWD) found in books and journals, government reports and newspapers. Of course, AI businesses, large or small, point to the licensing cost of data as a crippling tax which will restrict innovation. So far it does not seem to have strangled the competitive appetites of Silicon Valley, but will it stop start-up innovators in small markets and niche sectors?
It seems that the data industry is thinking about that already. There is serious activity now around the idea of Open Data in this context (it already exists in Open Science), not just as a way of sharing datasets amongst researchers, but also as a way of using Open Data to help small-scale developers build effective models without severe licensing costs. Common Pile v0.1 is a development of this type (https://huggingface.co/blog). The duty of ensuring that data is complete, accurate, and has not been distorted or polluted is a vital one, and ensuring that building effective models is not limited to the developers with the deepest pockets is important as well. The huge collaboration that has built the Common Pile (University of Toronto and Vector Institute, Hugging Face, the Allen Institute for Artificial Intelligence, Teraflop AI, Cornell University, MIT, CMU, Lila Sciences, poolside, University of Maryland, College Park, and Lawrence Livermore National Laboratory) is trying to build public standards in terms of both quality of data and transparency. We should all be grateful for their work.
So now we have data in a variety of forms. Information industry data that can be exchanged and traded shares the business of AI model development with Open Data resources built and released for that very purpose, and with AI-created data built as a way of testing probability and computing the logical extensions of the world we already know. Is this also a pointer towards the ability of the machines to create the resources required by the machines? Perhaps we should be thinking not just about the value of data and data licensing transactions, but also about the duration and lifespan of data licensing markets themselves.