
AI, copyright and LLMs
What are the copyright and confidentiality issues arising from use of public and private Large Language Models (LLMs)? Justin Harrington explains.
This is the fourth article in Justin’s AI and Public Sector series.
Large Language Models (LLMs) like ChatGPT, Gemini, and Claude have quickly moved from cutting-edge research into everyday use, including within the UK public sector. These powerful tools, capable of generating human-like text and assisting with a range of administrative, research, and policy tasks, offer real value to public bodies under pressure to deliver more with less.
But while the benefits are exciting, their use raises important legal questions, especially around copyright and confidentiality. For public sector workers and organisations in England and Wales, understanding these risks is crucial to using LLMs responsibly and lawfully.
In this article, we explain the key copyright and confidentiality issues that come with using LLMs, whether public (open, web-based models) or private (internally managed or customised AI systems), and what public sector bodies should do to manage them.
Understanding the difference: Public vs. Private LLMs
At the outset, it’s useful to distinguish between two common types of LLMs:
- Public LLMs are models accessible via the internet, typically run by third parties (e.g. ChatGPT by OpenAI, Gemini by Google). Users submit prompts via web interfaces or APIs, and responses are generated from models trained on vast internet datasets.
- Private LLMs are custom or locally hosted models, often trained on an organisation’s internal data. These may be managed in-house or via a secure cloud provider, offering more control over data handling and outputs.
Each presents different challenges when it comes to copyright and confidentiality.
Copyright Issues
1. Copyright in Training Data
Many public LLMs are trained on massive datasets scraped from the internet—news articles, blogs, books, forums, and more. Much of this content will be copyright material.
Is it lawful to use this material to train an AI model without the rights holder’s permission?
In the UK, we have a copyright exception in Section 29A of the Copyright, Designs and Patents Act 1988 (CDPA), which states that text and data mining (TDM) is allowed for non-commercial research under certain conditions. However, commercial use, such as TDM for training an LLM (whether public or private) for revenue-generating purposes, does not benefit from this exception.
This issue is currently being litigated before the UK courts. Getty Images has brought a claim for copyright, trade mark and database right infringement in the High Court against Stability AI in respect of its AI image generation software. While there have been preliminary decisions already, the case is expected to be heard later this year. Separately, Mumsnet has also brought a claim against OpenAI, claiming infringement on similar grounds.
For now, if you’re using a public LLM, you’re relying on the model provider to have obtained lawful access to the works making up the training data. If you’re using a private LLM, especially one trained on third-party content, you must ensure you have the rights to use that content for training. In both cases, you’ll want the contract with your supplier to deal with this explicitly. Because of the processes they have used and the clear risks involved, you may find that a supplier of a public LLM is reluctant to provide legal assurance on this point.
2. Copyright in AI Outputs
This has two aspects to it.
Firstly, if you ask a model to draft a policy briefing or generate a report, is there a risk that the resulting briefing or report infringes copyright in works included in the model’s training data?
By way of example, a report that an AI tool is asked to create on the reorganisation of council functions may look similar to a report prepared by external consultants for a different authority. It is possible that the tool came up with the output quite independently based on the user’s input, but the degree of similarity may make it difficult to convince a judge of this.
Secondly, is it even possible for there to be copyright in a work produced by an AI system? This is an area of debate because (it is argued) the threshold for originality is not met by a work created by a machine. If that is the case, provisions in your contract with your supplier will become more important. Where there is no copyright in a work, there is scope for the supplier to claim (as many do in their standard terms) that all output is owned by them and/or is subject to other contractual provisions regarding use.
There is an extra wrinkle in the UK, which has legislation specifically relating to works created by AI. Section 9(3) CDPA provides that, where a work is “computer-generated, the author shall be taken to be the person by whom the arrangements necessary for the creation of the work are undertaken.” But so far, courts in the UK have consistently found there to be a human author for all works in respect of which this issue has arisen, leading many commentators to argue that Section 9(3) will never apply.
This therefore remains an issue to be clarified and confirmed.
Confidentiality Issues
The second major area of risk is the handling of confidential or sensitive information, an especially serious concern for public sector organisations dealing with sensitive data, case records, or internal deliberations.
1. Submitting Sensitive Data to Public LLMs
When using a public LLM, anything you enter into the system is sent to a third party. Depending on the provider’s terms of use, your data may be:
- Stored
- Reviewed by humans for quality control
- Used to improve the model
- Subject to processing in countries outside the UK – obviously problematic if your data includes personal data
This creates obvious risks if you input confidential information, such as details of a service user, legal advice, or a council’s draft policy. Once entered, you may lose control over that data entirely. There is also a risk that certain prompts could lead the AI tool to reveal confidential information to third parties, circumventing the rules set by the original developers, as has happened with ChatGPT.
Many LLM providers have released “enterprise” versions with stricter controls (see, for example, Microsoft’s response to this issue: “FAQ: Protecting the Data of our Commercial and Public Sector Customers in the AI Era” on the Microsoft Community Hub), but this does not eliminate the need for careful input discipline. Staff must be trained not to enter confidential or identifiable information into public LLM tools unless a clear agreement is in place that provides safeguards.
2. Use of Confidential Information in Private LLMs
Private LLMs can be trained on internal datasets, but you still need to assess whether you are lawfully entitled to use those datasets. Key questions include:
- Is there a lawful basis for processing if personal data is involved?
- How is the data kept secure?
- Are access controls in place to prevent misuse or unauthorised access by third parties?
Without appropriate governance and oversight, there is a risk of breaching confidentiality obligations or mishandling personal data, potentially leading to complaints, reputational damage and/or enforcement action.
Managing the Copyright and Confidentiality Risks: Practical Steps for Public Sector Bodies
LLMs have the potential to improve productivity and speed up responses. But to avoid copyright and confidentiality issues, public sector organisations should take the following steps:
1. Develop a Clear AI Use Policy
Create internal policies that explain how staff can use LLMs, which AI systems are approved, and what types of information or prompts must not be input.
2. Review Terms of Service Carefully
If using third-party tools, check the provider’s terms with respect to the reuse of data and/or prompts, intellectual property infringement, and copyright in outputs. Prefer tools with enterprise agreements offering confidentiality and copyright warranties and indemnities.
3. Monitor Legal Developments
The UK Government is still shaping its approach to AI regulation, so it is important to stay informed and adjust your approach as the landscape evolves. At the time of writing, the Government has launched a consultation on copyright and AI. It initially made clear that it favoured an approach whereby works would be available for use as training data unless the rights holder opted out, but it now appears this is no longer the Government’s favoured approach.
Conclusion
LLMs offer real promise for innovation in the public sector, but they come with complex copyright and confidentiality challenges. By understanding the risks and taking proactive steps to manage them, public sector organisations can harness AI effectively while meeting their legal and ethical responsibilities.
Justin Harrington is a partner at Geldards.