Garbage In, Garbage Out: Sabotaging Your Own Content to Thwart AI Training

In its ongoing efforts to satisfy ChatGPT’s colossal appetite for content on which to “train” its algorithms, OpenAI continues to announce major licensing deals with partners ranging from traditional publishing stalwarts like Time and The Atlantic to user-generated content platforms like Reddit and Stack Overflow (and its sister site Stack Exchange; the two names are used interchangeably here). The pace of this dealmaking has accelerated in the wake of lawsuits against OpenAI filed by the likes of The New York Times, The Chicago Tribune and even Game of Thrones author George R.R. Martin, arguing that OpenAI relies on copyright infringement to train its AI “models.”

Though OpenAI’s proactive strategy of securing licensing agreements demonstrates a formal acknowledgment of copyright concerns, it also underscores a growing tension between the creators of user-generated content and the platforms that use this content to train AI. As OpenAI navigates these legal waters with contractual agreements, the response from the user community reveals a spectrum of concerns that extend beyond legal formalities. This shift from corporate deal-making to community reaction highlights the complex interplay between technological advancement and user rights.

Because it cannot beat them, so to speak, OpenAI is joining them instead, with fistfuls of money at the ready.

Not all users of these deal-making platforms are pleased by these developments. Some users of Stack Overflow, a haven for programmers with tens of millions of detailed, user-generated messages filled with code, advice, and encouragement, have turned to sabotage, attempting to delete their prior posts or maliciously “revise” them to include intentional errors and bugs.

Garbage in, garbage out, in other words.

The Power Loom Riots

These desperate efforts to poison the well are reminiscent of a 200-year-old outbreak of rioting instigated by the introduction of an advanced, automated manufacturing process that turned traditional notions of productivity on their head. (Hat tip to Ars Technica and user mdrejhon.)

At the dawn of the nineteenth century, the adoption of the power loom led to riots in northwest England, as skilled craftsmen working on hand looms were displaced by the new technology. Known as the Power Loom Riots, the civil strife included the destruction of newly installed power looms and the activation of the military to quell the unrest. Skilled hand loom weavers went from earning sixteen shillings a day to sixteen shillings a week as the operators of the new power looms became the ascendant techno-elite.

Framework for Legal Analysis

Like the “programmed” punch cards used by the power looms to automate the production of complex woven patterns, which themselves were the inspiration for the progenitor of all modern computers, Babbage’s Analytical Engine, the AI algorithms of GPT and its ilk stand to rewrite our understanding of what it means to be the creator, author, or designer of any particular content—including what it means to “own” that content or have the legal right to control how it is used or whether such use is attributed to the original author.

But just as punch cards served a purpose, using expansive datasets in AI training offers significant benefits. For instance, by analyzing vast amounts of diverse information, AI models like ChatGPT can generate responses that are more accurate, relevant, and contextually appropriate. This capability enhances user experience, making digital assistants more helpful and interactive across various applications. Furthermore, the continual improvement in AI’s ability to process and understand complex datasets drives innovation across industries, leading to smarter, more efficient technological solutions that benefit society at large.

Copyright Law, Terms of Service and User-Generated Content

Under U.S. copyright law, the author of “any original work fixed in a tangible medium” automatically holds copyright. This includes posts and comments made by users on platforms like Reddit and Stack Exchange. Users retain copyright to their content, granting the platform a license rather than transferring ownership.

You would be forgiven, upon hearing that Reddit and Stack Exchange users own the copyright to their contributions, for concluding the users have legal protection against the use or misuse of those contributions without their consent. The problem is that the Terms of Service agreements, which are mandatory and nonnegotiable, and which every single user agrees to as a condition of using either website, grant the platforms an extremely broad license to the users’ content. That license contractually gives the platforms carte blanche to do essentially whatever they want with the content, including in some cases forcibly preventing users from deleting or editing it after the fact.

Reddit’s terms grant the platform a broad license to use, reproduce, and modify user content: “By submitting Content to Reddit, you grant us a worldwide, royalty-free, sublicensable, and transferable license to use, store, display, reproduce, modify, create derivative works.”

Stack Exchange also retains a comprehensive license over user submissions, with users granting “a perpetual, irrevocable, royalty-free, sublicensable, transferable license to use, reproduce, distribute, prepare derivative works of, display, and perform the content.”

User Complaints and Platform Pushback

Reddit users have voiced frustration over their content being used to train ChatGPT without their explicit consent, with one commenting that “all this does is make me want to remove all my posts and discontinue use of Reddit.”

Stack Exchange users have expressed similar discontent in terms evocative of the Power Loom Riots’ sabotage of new technology. Protest messages are not good enough, says user Bongle. Instead, he suggests users edit their prior posts to be “subtly wrong” by purposefully introducing errors and bugs to the code.

These complaints underscore a serious disconnect between the legal permissions granted by terms of service and the expectations of users. This disparity may invite legal scrutiny and demands for more explicit consent mechanisms.

Platform Responses

To its credit, Reddit’s terms of service give users the final say on whether their past contributions are deleted: “When Reddit users delete their posts or other content, the site deletes it everywhere, with no ghostly remnants lingering in unexpected locations.”

Stack Overflow, in stark contrast, “does not let you delete questions that have accepted answers and many upvotes because it would remove knowledge from the community,” further warning that “potentially useful” content “should not be removed except under extraordinary circumstances.”

What about fair use?

As mentioned above, prior to OpenAI’s spate of high-profile deals, the company was more familiar with being sued by content companies than with signing eight- or nine-figure license agreements with them. In these lawsuits, without the double protection of licensing agreements with companies that themselves hold irrevocable licenses to almost all conceivable rights in their users’ content, OpenAI’s fate likely will be determined by courts’ application of the fair use doctrine under copyright law to the novel context of training artificial intelligence. Such decisions will have to reckon with the U.S. Supreme Court’s 2021 decision in Google LLC v. Oracle America, Inc., where the Court found Google’s wholesale use of over 11,000 lines of Oracle’s Java code in Google’s Android mobile operating system was sufficiently transformative to qualify as fair use.

Whether such a transformative, fair use analysis can be applied to justify OpenAI’s ravenous appetite for training content is the subject of a forthcoming Part II to this post.