This French start-up just proved OpenAI wrong. It claims you can train AI on non-copyrighted data

The Common Corpus aims to create a space for open science. - Copyright Canva

Published on 02/04/2024 - 10:49 • Updated 11:20


Last year, OpenAI said it was “impossible” to create tools such as ChatGPT without access to copyrighted material. But one French start-up has proved you can.

It comes at a crucial time, as legal battles over copyrighted material multiply. The most prominent case is the New York Times' lawsuit against OpenAI and its investor Microsoft for allegedly using news articles to train ChatGPT.

Now, Common Corpus may have found the solution to legal headwinds as it has unveiled the largest public dataset for training large language models (LLMs).

This international initiative, coordinated by the French start-up Pleias, includes researchers and other open science AI companies such as Hugging Face, Occiglot, EleutherAI, and Nomic AI.

It is also supported by Langu:IA, a project run by the French culture ministry’s French language unit which aims, among other things, to "facilitate access to data in French and in the languages of France for LLM training and specialisation".

The Corpus boasts the largest English-language dataset to date, with 180 billion words, including 21 million digitised newspapers and millions of books. But it is also multilingual, with the largest open datasets in French (110 billion words), German (30 billion words), Spanish, Dutch, and Italian.

“I think [the Corpus is] very important so we can create an incentive for competition [with companies like OpenAI],” Pleias cofounder Pierre-Carl Langlais told Euronews Next.

He said it is good for cooperation because “once you release a corpus you have shared interest to make it better and avoid duplication”.

Some European publishers, such as the French newspaper Le Monde, have entered into agreements with OpenAI to license their content for training.

While specific terms of these agreements remain undisclosed, Langlais said it is “a really big concern because it means that they may have to obey US companies and it’s especially worrying as it’s one of the most important media in France”.

“So it's a big issue that is creating this kind of command system,” he added.

Langlais believes that the Corpus is therefore essential, as it can level the playing field by lowering the value of copyrighted data.

Different types of open content

Common Corpus has limitations, however, because it relies on non-copyrighted material.

In Europe, a text is generally no longer subject to copyright once 70 years have passed since the death of its author. This means that the dataset does not include newer material.

“Obviously, it comes with a range of issues about having the language be up to date…I think also ethical issues may be different, but for now, it's only one part of the open content we have,” Langlais said.

The other two parts, which he said will make the data more recent, are open administrative data, which he says is “actually big in Europe because we have this big commitment to circumvent this [data],” and the open science movement, which makes scientific research available to everyone.

Langlais said another way to improve the Common Corpus is to use synthetic data, which is artificially generated data that replicates the patterns, relationships, and characteristics found in real-world data.

In 2022, MIT researchers found that synthetically trained models performed even better than models trained on real data for videos that have fewer background objects.

But for Langlais, the purpose of the Common Corpus is a shared one: “a common idea is to make it better,” he said.

“And so a lot of our initiative is to ensure that it's going to be richer, it's going to be more diverse, it can be changed,” he said, adding that in the future he hopes to include more European languages in the project.
