KEYWORDS: Copyright Infringement, OpenAI, Data Safety Generative AI Solution, Data Solutions, Multimodal AI dataset, Copyright granted dataset, Licensed data authorization, AI data copyright, High-quality AI dataset, Professional domain corpus dataset
In a recent interview with the WSJ, Murati, the CTO of OpenAI, was asked about the training data used for Sora. She responded by stating that they had used both public and licensed data. However, she remained evasive about the specific sources.
According to the WSJ, Murati said, “I’m actually not sure about that.”[1]
This interview has generated a lot of attention and discussion in the industry.
For over a year, there has been a major copyright controversy surrounding the data used to train AI models, making it a hot topic globally.
Just a few weeks ago, on the heels of the New York Times lawsuit against OpenAI, three U.S. digital news outlets, The Intercept, Raw Story, and AlterNet, filed a copyright infringement lawsuit against OpenAI on February 28.
The lawsuits allege that OpenAI and Microsoft are aware of potential copyright infringement. According to the publications, the companies intentionally removed important copyright information from the training data [2].
The New York Times filed a lawsuit last December claiming that ChatGPT was copying their journalistic work. In September, U.S. novelists sued OpenAI for copyright infringement, while in June, two authors accused OpenAI of using their copyrighted books to train ChatGPT for commercial purposes[3].
Various parties, including news media, actors, journalists, authors, and the Writers Guild of America, have filed lawsuits against OpenAI, Stability AI, Meta, Alphabet, and other companies developing AI-generated content (AIGC) technology over the unauthorized use of copyrighted works for model training. As Large Language Models (LLMs) spread through the market, they face intellectual property challenges in the courts[4].
1. Training Data Copyrights Becoming a Key Concern for Generative AI Commercialization
The class action lawsuit against OpenAI reflects the media industry’s current concerns about Generative AI technology.
Generative AI is short for Generative Artificial Intelligence. Well-known technologies such as Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), and Generative Pre-trained Transformers (GPTs) are integral components of Generative AI.
With the rapid development of Generative AI technology, various data-related problems arise that require effective solutions.
1.1 Copyright Risks of Public Data on the Internet
Training LLMs essentially depends on “feeding” them the massive amount of data that has accumulated on the Internet.
However, a huge amount of content on the Internet has not been authorized for such use. Under existing copyright laws and regulations, accessing and using this data without appropriate authorization carries a risk of copyright infringement.
1.2 Problems of Accuracy and Specialization of Internet Data
By learning from massive amounts of Internet data, LLMs have demonstrated a breadth of knowledge beyond that of most humans.
However, LLMs have obvious limitations in terms of accuracy and specialization, and the answers they provide in specialized areas often do not guarantee accuracy and completeness.
This is due to the lack of high-quality domain data on the Internet. The knowledge systems of different domains are highly specialized, and it may be difficult for an LLM to cover all the details and characteristics of the domain. Even within the same domain, the requirements of different scenarios and tasks may be very different, so LLMs may not be effective enough to solve specific problems.
Therefore, overcoming the challenges of accuracy and expertise is a key aspect of using LLMs in practical applications.
maadaa.ai has previously covered the benefits and challenges of LLMs in different industries, as well as how to cope with data challenges in enterprise scenarios. Those articles are listed under Further Reading at the end of this article.
2. Specialized Domain Data For LLM training: E-books And E-documents
Last June, a lawsuit was filed claiming that the LLMs that power ChatGPT are themselves infringing derivative works. According to the complaint, the AI system cannot function without the information extracted from the plaintiffs’ material, which violates their exclusive rights under the Copyright Act.
The complaint accuses OpenAI of illegally downloading thousands of copyrighted books to train its AI system, and alleges that the company used a dataset of over 7,000 novels from the BookCorpus collection without the consent, credit, or compensation of the authors.
Later versions of OpenAI’s models used larger amounts of copyrighted works. In a 2020 paper, OpenAI disclosed that 15% of its training dataset came from two Internet-based book corpora, “Books1” and “Books2”[5].
Thus, it is worth noting that while a variety of content was used to train LLMs, books were the core corpus material in the training dataset because they provide the best examples of high-quality, long-form writing.
3. Generative AI Data Solutions From maadaa.ai
The lawsuit between the New York Times and OpenAI highlights the importance of ethical considerations in AI development and the responsible use of AI technology in all fields.
maadaa.ai is committed to creating “data-centric” specialized Generative AI data services and a series of Generative AI dataset products, to promote the sustainable development of Generative AI technology and accelerate its industry adoption.
maadaa.ai has officially launched large-scale, high-quality dataset products for Generative AI model development:
- Large-Scale Professional Domain Corpus Dataset — Chinese
- Multi-modal Generative AI Large Datasets — Licensed
3.1 Product Features:
- Licensed Data Authorization: All data are properly licensed to ensure copyright compliance during the training and application of generative AI models.
- Diverse Data Types: The dataset covers a wide range of large-scale data types including text, images, videos, and audio, fully meeting the needs of multimodal AI model development.
- High-Quality Professional Annotation: The dataset includes image-text and video-text corpora, among others, all of which are semantically annotated and professionally calibrated to support accurate Generative AI model training.
- Industry Domain Customizable: Covering nearly 100 industries and application scenarios with specialized datasets, supporting the customization of high-quality datasets for industry-specific Generative AI model development.
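To make the idea of licensed data authorization concrete, here is a minimal sketch of how a training pipeline might keep only records whose license has been cleared. It does not describe maadaa.ai's actual data format or tooling: the `Record` fields, the license labels, and the `split_by_clearance` helper are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Hypothetical license labels; a real dataset would define its own vocabulary.
CLEARED_LICENSES = {"licensed", "public-domain"}


@dataclass(frozen=True)
class Record:
    """One entry in a hypothetical dataset manifest."""
    record_id: str
    modality: str  # e.g. "text", "image", "video", "audio"
    license: str   # e.g. "licensed", "public-domain", "unknown"


def split_by_clearance(records):
    """Split records into (cleared, held_back) based on their license label."""
    cleared = [r for r in records if r.license in CLEARED_LICENSES]
    held_back = [r for r in records if r.license not in CLEARED_LICENSES]
    return cleared, held_back


# Illustrative manifest with made-up entries.
manifest = [
    Record("doc-001", "text", "licensed"),
    Record("img-002", "image", "unknown"),
    Record("vid-003", "video", "public-domain"),
    Record("aud-004", "audio", "scraped"),
]

cleared, held_back = split_by_clearance(manifest)
print([r.record_id for r in cleared])    # ['doc-001', 'vid-003']
print([r.record_id for r in held_back])  # ['img-002', 'aud-004']
```

Treating any unrecognized label as not cleared is a deliberately conservative choice: a record with missing or ambiguous license metadata is held back instead of silently entering the training set.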
3.2 Typical Application Scenarios:
Generative AI-enabled search engine, chatbot, professional Q&A, professional assistants, domain-specific content generation, etc.
With the rapid development of Generative AI technology, the issue of data copyright has become a focus of attention in the industry. In-depth discussions and regulations on the legality and ethics of using data for AI technology have been initiated worldwide.
As a professional data service provider, maadaa.ai strongly supports the sustainable development and industry adoption of Generative AI technology by launching genuinely licensed, high-quality multimodal dataset products.
Further Reading:
- ChatGPT for Enterprise Scenarios — How to cope with the data challenges
- ChatGPT For Fashion Industry: New Opportunities and Challenges
- ChatGPT for E-Commerce — Benefits and Challenges
- Unlocking the Potential of Personalized Fashion with ChatGPT (open datasets included)
- Generative AI is Accelerating the Revolution of the Advertising Industry
- ChatGPT in Finance: Assessing the Benefits and Challenges for Financial Institutions
- How ChatGPT and Generative AI can transform Healthcare (Part.1)
- How ChatGPT and Generative AI can transform Healthcare (Part.2)
Reference List:
1. https://futurism.com/video-openai-cto-sora-training-data
2. https://www.theverge.com/2024/2/28/24085973/intercept-raw-story-alternet-openai-lawsuit-copyright
3. https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
4. https://www.natlawreview.com/article/generative-ai-systems-tee-fair-use-fight
5. https://www.hollywoodreporter.com/business/business-news/authors-sue-openai-novels-1235526462/