Sabotaging Your Own Content to Frustrate AI Training


In its ongoing efforts to satisfy ChatGPT’s colossal appetite for content on which to “train” its algorithms, OpenAI continues to announce major licensing deals with partners ranging from traditional publishing stalwarts like Time and The Atlantic to user-generated content platforms like Reddit and Stack Overflow (and its sister site Stack Exchange; this post uses the two names interchangeably). The pace of this dealmaking has accelerated in the wake of lawsuits against OpenAI filed by the likes of The New York Times, The Chicago Tribune and even Game of Thrones author George R.R. Martin, arguing that OpenAI relies on copyright infringement to train its AI “models.”

Though OpenAI’s proactive strategy of securing licensing agreements demonstrates a formal acknowledgment of copyright concerns, it also underscores a growing tension between the creators of user-generated content and the platforms that use this content to train AI. As OpenAI navigates these legal waters with contractual agreements, the response from the user community reveals a spectrum of concerns that extend beyond legal formalities. This shift from corporate deal-making to community reaction highlights the complex interplay between technological advancement and user rights.

Because it cannot beat them, so to speak, OpenAI is joining them instead, with fistfuls of money at the ready.

Not all users of these dealmaking platforms are pleased by these developments. Stack Overflow is a haven for programmers, with tens of millions of detailed, user-generated posts filled with code, advice, and encouragement. Some of its users have turned to sabotage in protest, attempting to delete their prior posts or to maliciously “revise” them to include intentional errors and bugs.

Garbage in, garbage out, in other words.

The Power Loom Riots

These desperate efforts to poison the well are reminiscent of a 200-year-old outbreak of rioting sparked by the introduction of an advanced, automated manufacturing process that turned traditional notions of productivity on their head. (Hat tip to Ars Technica and user mdrejhon.)

At the dawn of the nineteenth century, the adoption of the power loom led to riots in northwest England, as skilled craftsmen working on hand looms were displaced by the new technology. Known as the Power Loom Riots, the civil strife included the destruction of newly installed power looms and the activation of the military to quell the unrest. Skilled hand loom weavers went from earning sixteen shillings a day to sixteen shillings a week as the operators of the new power looms became the ascendant techno-elite.

Framework for Legal Analysis

The power looms automated the production of complex woven patterns using “programmed” punch cards, which themselves inspired Babbage’s Analytical Engine, the progenitor of all modern computers. Like those punch cards, the AI algorithms of GPT and its ilk stand to rewrite our understanding of what it means to be the creator, author, or designer of any particular content, including what it means to “own” that content or to have the legal right to control how it is used and whether such use is attributed to the original author.

But just as punch cards did serve a purpose, using expansive datasets in AI training offers significant benefits. For instance, by analyzing vast amounts of diverse information, AI models like ChatGPT can generate responses that are more accurate, relevant, and contextually appropriate. This capability enhances user experience, making digital assistants more helpful and interactive across various applications. Furthermore, the continual improvement in AI’s ability to process and understand complex datasets drives innovation across industries, leading to smarter, more efficient technological solutions that benefit society at large.

Copyright Law, Terms of Service and User-Generated Content

Under U.S. copyright law, the author of “any original work fixed in a tangible medium” automatically holds copyright. This includes posts and comments made by users on platforms like Reddit and Stack Exchange. Users retain copyright to their content, granting the platform a license rather than transferring ownership.

You would be forgiven, upon hearing that Reddit and Stack Exchange users own the copyright to their contributions, for concluding that the users have legal protection against the use or misuse of those contributions without their consent. The problem is the Terms of Service agreements, which are mandatory and nonnegotiable, and which every single user accepts as a condition of using either website. These agreements grant the platforms an extremely broad license to the users’ content, contractually giving the platforms carte blanche to do essentially whatever they want with it, including in some cases forcibly preventing users from deleting or editing it after the fact.

Reddit’s terms grant the platform a broad license to use, reproduce, and modify user content: “By submitting Content to Reddit, you grant us a worldwide, royalty-free, sublicensable, and transferable license to use, store, display, reproduce, modify, create derivative works.”

Stack Exchange also retains a comprehensive license over user submissions, with users granting “a perpetual, irrevocable, royalty-free, sublicensable, transferable license to use, reproduce, distribute, prepare derivative works of, display, and perform the content.”

User Complaints and Platform Pushback

Reddit users have voiced frustration over their content being used to train ChatGPT without their explicit consent, with one commenting that “all this does is make me want to remove all my posts and discontinue use of Reddit.”

Stack Exchange users have expressed similar discontent in terms evocative of the Power Loom Riots’ sabotage of new technology. Protest messages are not good enough, says user Bongle. Instead, he suggests users edit their prior posts to be “subtly wrong” by purposefully introducing errors and bugs to the code.

These complaints underscore a serious disconnect between the legal permissions granted by terms of service and the expectations of users. This disparity may invite legal scrutiny and demands for more explicit consent mechanisms.

Platform Responses

To its credit, Reddit’s terms of service give users the final say on whether their past contributions are deleted: “When Reddit users delete their posts or other content, the site deletes it everywhere, with no ghostly remnants lingering in unexpected locations.”

Stack Overflow, in stark contrast, “does not let you delete questions that have accepted answers and many upvotes because it would remove knowledge from the community,” further warning that “potentially useful” content “should not be removed except under extraordinary circumstances.”

What about fair use?

As mentioned above, prior to its spate of high-profile deals, OpenAI was more accustomed to being sued by content companies than to signing eight- or nine-figure license agreements with them. In these lawsuits, OpenAI lacks the double protection of licensing agreements with companies that themselves hold irrevocable licenses to almost all conceivable rights in their users’ content. Its fate likely will be determined by courts’ application of the fair use doctrine under copyright law to the novel context of training artificial intelligence. Such decisions will have to reckon with the U.S. Supreme Court’s 2021 decision in Google LLC v. Oracle America, Inc., where the Court found Google’s wholesale use of over 11,000 lines of Oracle’s Java code in Google’s Android mobile operating system was sufficiently transformative to qualify as fair use.

Whether such a transformative, fair use analysis can be applied to justify OpenAI’s ravenous appetite for training content is the subject of a forthcoming Part II to this post.
