AI Model Training
It has been more than a year since the launch of ChatGPT, and the hype that the chatbot has brought to artificial intelligence (“AI”) does not seem to be slowing down. In fact, the momentum appears to be building, with OpenAI, the company behind ChatGPT, recently teasing the release of ChatGPT 5.0, which is promised to deliver overall enhanced performance compared with the current versions, ChatGPT 3.5 and ChatGPT 4.0.
The proliferation of generative AI has also sparked debates on the “legal” training of AI models. As we briefly mentioned in our earlier article on “Addressing Copyright Infringement and Challenges in AI Training”, AI does not inherently develop its own “intelligence”; rather, it requires a substantial amount of data for training. Opinion is divided on whether it is acceptable or legal for companies to train AI models using publicly available information scraped from the internet without the permission of the publishers or authors.
Scraping Public Internet Information for AI Training – Is It Fair?
OpenAI’s stance in this regard is clear – “Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness”. Many, however, do not agree with OpenAI and have taken OpenAI and some other companies to court for using their content or data to train AI models without authorisation.
Opponents of using publicly available information for AI training typically rely on copyright infringement as their main argument. One of their concerns is that they have no control over how their materials and content will be used. Their life’s work could end up becoming the fuel of a generative AI without guardrails, spouting disinformation, biased and discriminatory remarks, sexual and explicit content, or worse – generating actual harmful content such as instructions on how to build an explosive or engineer a fatal car crash.
Given that there is still no certainty on whether scraping copyright-protected content from internet sources to train AI models would constitute copyright infringement, website and database operators have resorted to implementing technical measures to prevent web scraping, as well as contractually prohibiting web scraping in their online terms and conditions.
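One common technical measure of this kind is a robots.txt file that tells specific crawlers to stay away. The sketch below is illustrative only: the robots.txt content and the example URL are hypothetical, and “GPTBot” is used as the crawler name on the assumption that the operator wishes to exclude OpenAI’s published crawler. It uses Python’s standard `urllib.robotparser` module to show how a well-behaved crawler would interpret such a file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a publisher might serve (illustrative only).
# It blocks one named AI crawler and leaves the site open to everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"  # placeholder URL
    print("AI crawler allowed:", is_allowed(ROBOTS_TXT, "GPTBot", url))
    print("Ordinary browser allowed:", is_allowed(ROBOTS_TXT, "Mozilla/5.0", url))
```

It is worth noting that robots.txt is a voluntary convention: it signals the operator’s wishes but does not by itself stop a crawler that chooses to ignore it, which is one reason operators pair it with contractual prohibitions.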
These efforts may be part of the reason why AI companies are increasingly entering into licensing deals with publishers and content providers to use their content and data for AI training. Despite its hard stance, OpenAI has entered into agreements with The Associated Press and Axel Springer SE to license content for AI training. Reddit has also announced a deal to license its content to Google for AI training. The trend does more than allow AI companies to secure legitimate sources of training content; it also creates new revenue streams for organisations with content banks under their management.
Licensing of AI Training Dataset – Key Points to Consider
Whether you are a company looking to license content to train your AI model, or you are a company looking to commercialise your content bank, there are a few considerations that you should be mindful of before entering into a deal:
- i) What is the value of the dataset being licensed: One key consideration is, of course, the licensing fee. This is directly affected by the value of the dataset, which in turn is determined by its size and quality. While it is relatively easy to assess the value of the dataset at the point of licensing, one must not forget that the dataset will grow over the licensing term as new content gets published. The licensing fee to be agreed upon should take into consideration the value of the existing content, as well as future content to be added to the library or database. The services of an IP valuer would be helpful in this regard.
- ii) Exclusivity: Whether the training license is exclusive should be clarified and agreed upon up front. The licensor should be allowed to license its dataset to third parties for AI training, or even to use it to train its own AI, which it may want to deploy in the future as its business diversifies.
- iii) Consequences of Contract Expiration or Termination: Parties often overlook the details of how to unwind a licensing deal. This point is particularly crucial for the licensing of AI training datasets, given the difficulties that come with removing datasets from a trained AI model, also known as “Machine Unlearning”. Complete removal of datasets from a trained AI model may not be possible with current technology. Even where it is possible, it may require retraining of the AI model, which is expensive and time-consuming, or it might affect the efficiency, utility and accuracy of the AI model. The licensor may have to be prepared to leave its datasets with the AI model.
- iv) Continuing Obligation: This is more of a consideration for the licensor. Oftentimes, AI companies expect the underlying dataset to grow as the content under the licensor’s management expands. Where AI companies commit to paying a fixed license fee each year, they may be inclined to impose a service level or an undertaking to expand the dataset. The licensor should carefully consider the reasonableness of the service level or undertaking before agreeing to it.
- v) Rights of the Licensor to Grant License: As with all licensing deals, the licensor’s ability to grant the required license is crucial, and this is often reflected as a key warranty in the agreement. In the context of AI training dataset licensing, the licensors are usually publishers or administrators of a collective database or library. The mandates given to the publishers or administrators may not include the right to grant licenses in respect of the contributors’ materials. As such, the permission of the individual contributors may have to be sought before entering into the licensing deal, along with options for them to opt out of the training regime.
- vi) Terms of Use: It would also be in the licensor’s interest to regulate how the training dataset will be used, particularly the type of AI that will be trained. The last thing that a licensor would want is for its dataset to be used to train an unethical AI.
- vii) Data Privacy and Security: With increasing regulatory scrutiny of data privacy, it is imperative for the parties to clearly address data privacy and security concerns in the licensing agreement. This includes specifying how the data will be processed, transferred, stored, and protected throughout the duration of the agreement. Licensors may even want to go a step further and consider whether to anonymise the dataset before delivering it for AI training. Given how AI training works, it is generally easier to sanitise the training dataset before it is fed to the AI algorithm than to remove data from a trained model afterwards.
- viii) Intellectual Property Rights Ownership: Clarifying the ownership of intellectual property rights related to the trained AI models will be crucial in the licensing agreement. Parties should clearly define who retains ownership of any new intellectual property created as a result of using the licensed dataset for AI training. Clear ownership provisions could help avoid unnecessary future disputes and ensure that each party retains the legal rights it is entitled to under the agreement.
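To illustrate the point made in item vii) about sanitising a dataset before delivery, the following is a minimal sketch of a redaction step. It is an assumption-laden illustration rather than a compliance-grade anonymisation tool: the patterns, the placeholder tokens and the function names (`redact`, `sanitise`) are all hypothetical, and the sketch only catches email addresses and phone-like number sequences. Real personal data takes many more forms and usually calls for dedicated tooling and legal review.

```python
import re

# Illustrative patterns only; they will miss many forms of personal data.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def sanitise(records: list[str]) -> list[str]:
    """Apply redaction to every record before the dataset is delivered."""
    return [redact(record) for record in records]


if __name__ == "__main__":
    sample = [
        "Contact jane.doe@example.com or +60 12-345 6789.",
        "No personal data here.",
    ]
    for line in sanitise(sample):
        print(line)
```

Running the redaction on the licensor’s side, before delivery, keeps the personal data out of the training pipeline altogether, which avoids the unlearning difficulties described in item iii).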
With AI deployment gaining traction, there are bound to be more companies looking to deploy their own AI and in need of training datasets. Given how AI model training works, AI training dataset licensing should not be treated like any other intellectual property licensing deal. The licensing agreement should be carefully crafted to suit the nuances of each deal, so that the parties are able to achieve their respective goals while protecting their interests.
If you are in need of legal advice or assistance in relation to the licensing of content for AI training, or in relation to your next project involving AI, our dedicated team of professionals in the Technology Practice Group is here to help. You may reach out to our partners below to discuss how we may assist. We look forward to working with you on your exciting venture.
About the authors
Lo Khai Yi
Partner
Co-Head of Technology Practice Group
Technology, Media & Telecommunications, Intellectual Property, Corporate/M&A, Projects and Infrastructure, Privacy and Cybersecurity
Halim Hong & Quek
ky.lo@hhq.com.my
Ong Johnson
Partner
Head of Technology Practice Group
Transactions and Dispute Resolution, Technology, Media & Telecommunications, Intellectual Property, Fintech, Privacy and Cybersecurity
johnson.ong@hhq.com.my
More of our Tech articles that you should read:
Exploring the Legal Implications of AI as Inventors: UK Patent Law Perspective
Whether AI-Generated Work Could be Protected by Copyright Law
Addressing Copyright Infringement and Challenges in AI Training
Artificial Intelligence and Cybersecurity: A Double-Edged Sword Fight