People are using large language models (LLMs) like ChatGPT to answer questions, help with research, and churn out articles and books. And yes, some people use LLMs for editing, proofreading, and reference formatting, at least for their own work.
My position on text generators is very simple: I will not use them in my writing or editing work.
Why? Read on.
LLMs work by ingesting massive amounts of text, which they then use to produce statistically likely word sequences in response to users’ prompts. That can result in text being reproduced verbatim or closely paraphrased, usually without attribution.[1] That is plagiarism; it’s wrong when it’s done without LLMs, and it’s cost people their careers and degrees. I don’t see it being any different when it’s done by or with the help of LLMs.
Any LLM-produced work that’s meant for publication would need to be scoured for plagiarism. Yes, there are tools that help with that, but they can have problems with false positives and negatives and with identifying even slightly paraphrased text. And yes, if you’re a client working with a new-to-you writer, you probably want to look for plagiarism before you publish their work. So what’s the difference? A human writer at least knows they’ve plagiarized or that they’ve been careless enough in their note-taking that it’s possible something slipped through. An LLM isn’t capable of knowing it’s plagiarized, nor do its users have any way of knowing short of verifying the originality of every passage.
OpenAI hereby assigns to you all its right, title and interest in and to Output. This means you can use Content for any purpose, including commercial purposes such as sale or publication, if you comply with these Terms. OpenAI may use Content to provide and maintain the Services, comply with applicable law, and enforce our policies. You are responsible for Content, including for ensuring that it does not violate any applicable law or these Terms.
That seems clear enough. We own the content that OpenAI’s models generate from our input prompts. But section 3(b) goes on to say, “Due to the nature of machine learning, Output may not be unique across users and the Services may generate the same or similar output for OpenAI or a third party.” That seems to leave those users open to claims of plagiarism from each other (leaving aside for the moment that the content is in effect plagiarized from other sources by their inclusion in the models’ training data).
And, from 9(e), “You may not assign or delegate any rights or obligations under these Terms…”. I’m not a lawyer, but that seems to say that ownership of the content cannot be transferred. But copyright/ownership transfer is often required for publication of written material — especially work done for hire.
So if we generate content with ChatGPT, we may own it, but we may not be able to transfer that ownership, and it’s possible that another ChatGPT user owns output that is identical or similar enough to make the original ownership disputable. That sounds fun.
There are (at least) two general issues related to copyright with LLM-generated text. First, the models are trained on material scraped from the web and books. An April 2023 preprint shows that GPT-4 and ChatGPT have memorized large parts of books like Harry Potter and A Game of Thrones.[3] Much of that work is copyrighted by the original authors or publishers, and there’s no evidence the producers of the LLMs have sought permission to use the material. A long-dead blog of mine is included in The Washington Post’s list of websites used to train ChatGPT, and I know I wasn’t asked for permission for this use.
The other issue is whether the material produced by a text generator can be copyrighted. That hasn’t been firmly nailed down yet, but the U.S. Copyright Office has recently ruled that AI-generated art cannot be copyrighted, though work derived from or containing that art may be copyrightable.[4] So, it may not be possible to assert copyright on text generated by an LLM.
There are numerous reports of LLMs producing inaccurate information: they’ve accused people of sexual harassment without any proof, cited nonexistent papers as sources for their output, and generally made stuff up. Even when the domain is fairly limited, LLMs aren’t accurate. Upwork, following the new logic of “if it’s a textbox, we must connect it to an LLM,” is testing AI-generated job posts. An Upwork client recently posted their experience with it to Reddit: not only did the generator omit key details from the client’s prompt, it created job requirements that weren’t in the original brief.
None of this should be surprising. LLMs are designed to generate each word, phrase, and sentence by choosing what’s statistically likely to come next, given the user’s prompt, the previously generated text, and the material the model was trained on. There’s no attempt to ensure the generated text is true. If it is true, that’s purely by chance.
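To make that concrete, here is a deliberately tiny sketch of "pick a statistically likely next word." It is not how any real LLM is built: real models use neural networks over tokens, not a lookup table of word pairs. The corpus and the function names (`build_bigram_table`, `generate`) are made up for illustration. What it does share with the description above is that nothing in it checks whether the output is true.

```python
import random
from collections import defaultdict


def build_bigram_table(corpus: str) -> dict[str, list[str]]:
    """Map each word to every word that followed it in the corpus.

    Repeats are kept, so a more frequent follower is more likely to be drawn.
    """
    words = corpus.split()
    table = defaultdict(list)
    for current, following in zip(words, words[1:]):
        table[current].append(following)
    return dict(table)


def generate(table: dict[str, list[str]], start: str, length: int, seed: int = 0) -> str:
    """Produce up to `length` words by repeatedly sampling a likely next word."""
    rng = random.Random(seed)  # fixed seed so a run is repeatable
    output = [start]
    for _ in range(length - 1):
        candidates = table.get(output[-1])
        if not candidates:  # the last word never had a follower in the corpus
            break
        # The only criterion is "this word followed that word before."
        # There is no step that asks whether the sentence is accurate.
        output.append(rng.choice(candidates))
    return " ".join(output)


# Hypothetical toy "training data".
corpus = "the cat sat on the mat . the dog sat on the log ."
table = build_bigram_table(corpus)
print(generate(table, "the", 8))
```

Depending on the seed, this can print something like a cat sitting on a log: a fluent-looking sentence that never appeared in the training text and that nobody verified.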
As Neil Gaiman put it, “ChatGPT doesn’t give you information. It gives you information-shaped sentences.”
Worse, the generated text is phrased confidently. It sounds true to people who don’t know the subject well. If you don’t know enough about a topic to know whether the LLM’s output is accurate, you need to verify everything it produces. And if I’m going to do that, why wouldn’t I just do the research myself?
Authorship and Publication Issues
Tainting a manuscript by using an LLM to generate all or part of it may limit publication options. Some academic outlets have already banned or limited AI-generated work (Science[5] and ICML, for example), while others, such as Nature, require LLM use to be documented in the Methods (or similar) section.
Finally, much of the text I’ve seen LLMs generate is pure pablum. The text is generic and vague, and it lacks analysis. Going back to Upwork’s job post generator for a moment: each post it generates sounds exactly the same. Yes, the text can be cleaned up and the wording improved in editing, but how much time would that really save over writing the material myself to begin with?
Any one of these issues would be enough to make me pause before using an LLM in my work. All of them together? Oh hell no.
These words are 100% organic.
- [1] See, e.g., Lee, Jooyoung, Thai Le, Jinghui Chen, and Dongwon Lee. “Do Language Models Plagiarize?” In Proceedings of the ACM Web Conference 2023, 3637–47, 2023. https://doi.org/10.1145/3543507.3583199.
- [2] Retrieved April 9, 2023, and last updated March 14, 2023.
- [3] Chang, Kent K., Mackenzie Cramer, Sandeep Soni, and David Bamman. “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4.” arXiv, April 28, 2023. https://doi.org/10.48550/arXiv.2305.00118.
- [4] Dreben, Ron. “Generative Artificial Intelligence and Copyright Current Issues.” Morgan Lewis, March 23, 2023. https://www.morganlewis.com/pubs/2023/03/generative-artificial-intelligence-and-copyright-current-issues.
- [5] Thorp, H. Holden. “ChatGPT Is Fun, but Not an Author.” Science 379, no. 6630 (January 27, 2023): 313–313. https://doi.org/10.1126/science.adg7879.