AI vs. IP Clash Renews Scrutiny of Tech's Data Gathering Activities
Copyright lawsuits also highlight flaw in popular generative AI products
AI has a "leaky data" problem. Shokri et al. (2017) demonstrated that adversarial "membership inference" attacks can determine with relatively high probability whether a data record was part of the data set used to train a model. Revealing the presence of data records in training sets creates the potential for privacy-type injuries. For instance, learning that a person's medical records were used to train a medical diagnostic model suggests the person has a particular medical condition (assuming, of course, that anonymized data can be de-anonymized and correlated to a particular person using other available data, which is not a simple task).
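The intuition behind a membership inference attack can be sketched in a few lines. The toy model and confidence-threshold test below are illustrative stand-ins, not Shokri et al.'s actual shadow-model technique: an overfit model is suspiciously confident on records it was trained on, and an attacker can exploit that gap.

```python
# Illustrative sketch of membership inference: an overfit "model" gives
# itself away by being far more confident on its own training records.

def train_toy_model(records):
    """A deliberately overfit 1-nearest-neighbor 'model' over numeric records."""
    memory = list(records)
    def predict_confidence(x):
        # Confidence decays with distance to the nearest memorized record.
        nearest = min(abs(x - r) for r in memory)
        return 1.0 / (1.0 + nearest)
    return predict_confidence

def membership_guess(model, record, threshold=0.9):
    """Guess 'member' when the model is suspiciously confident on the record."""
    return model(record) >= threshold

training_set = [2.0, 5.0, 9.0]
model = train_toy_model(training_set)

print(membership_guess(model, 5.0))  # True: the record was in the training set
print(membership_guess(model, 7.3))  # False: confidence drops off-set
```

Real attacks query only the model's published prediction interface, but the principle is the same: the model's behavior leaks which records it has seen.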
A trained model's ability to "remember" or "memorize" its training data can also have other potentially injurious consequences, including economic ones. Two recent copyright infringement lawsuits—one brought by three artists and the other by imagery seller Getty Images—illustrate this point. The lawsuits also put the spotlight once again on tech's seemingly unquenchable appetite for data.
In the first lawsuit, Andersen et al. v. Stability AI et al., No. 3:23-cv-00201 (N.D. Cal. Jan. 13, 2023), the plaintiff artists accuse defendants Stability AI Ltd., Stability AI, Inc., Midjourney, Inc., and DeviantArt, Inc. of violating their copyrights by downloading and storing copies of the plaintiffs' artwork, using the copied works to train the model powering Stable Diffusion, and allowing users of Stable Diffusion to create "fakes" of their works using the model's text-guided generative image feature. Plaintiffs allege that DreamStudio (by Stability), the Midjourney Product, and DreamUp (released by DeviantArt) all use the Stable Diffusion software stack as the engine behind their respective image-generating apps.
In the second lawsuit, Getty Images (US), Inc. v. Stability AI, Inc., No. 1:23-cv-00135 (D. Del. Feb. 3, 2023), Getty raises similar allegations against Stability AI, offering several examples of its copyrighted images as evidence of alleged copying (both lawsuits raise other claims as well, including under trademark, unfair competition, and/or publicity rights law).
Image "generative AI" systems like Stable Diffusion, which can output whimsical graphic and photo-realistic images, have transformed our social fabric in just a matter of months. Generative AI apps, including text generative systems like ChatGPT, are so popular that courses now teach "prompt engineering," the process of crafting just the right set of input phrases to guide a model's output toward one's predilections and interests (to wit, "an image of a pig on an alien planet," as shown above). But those who earn income from creative imagery see image generative AI as a threat to their ability to market their own creative works through other channels.
The question of whether generative AI models retain information contained in their training data is an issue no U.S. court has directly considered to date, and thus a judicial decision in this area could establish a precedent with significant potential effect on data-based technologies like AI. Intellectual property law, specifically copyright law, provides a good framework for evaluating this open question, given the nature of IP rights, i.e., ownership and exclusivity.
Under the U.S. Copyright Act, the owner of copyright in an original work of authorship has the exclusive right to reproduce the copyrighted work in the form of "copies." 17 U.S.C. § 106. A "copy" is a material object in which the work is fixed in a tangible medium by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device. See 17 U.S.C. §§ 101, 102. Works of authorship include pictorial, graphic, and sculptural works, the so-called "visual arts." Relevant to the present lawsuits, "tangible medium" includes hard drives and servers. See Mourabit v. Klein, No. 19-2142-cv, slip op. (2d Cir. Jun. 8, 2020) (citing 1 Nimmer on Copyright § 2.03). That is, the tangible media storing the artists' works and Getty's imagery include data files saved on servers (the relevant artistic expression in images is coded as pixel values from 0 to 255 for red, green, and blue).
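To make the fixation point concrete, here is a minimal sketch of how pictorial expression is "fixed" digitally: each pixel is three integers (red, green, blue), each between 0 and 255, and the resulting bytes are what actually sit on a hard drive or server. The 2x2 "work" and the in-memory buffer standing in for a disk are illustrative, not any party's actual file format.

```python
import io

# A tiny 2x2 "work": each pixel is an (R, G, B) triple of values 0-255.
width, height = 2, 2
pixels = [
    (255, 0, 0), (0, 255, 0),      # red, green
    (0, 0, 255), (255, 255, 255),  # blue, white
]

# Flatten the pixel values into raw bytes -- the form the expression takes
# when fixed in a tangible medium such as a server's storage.
raw = bytes(channel for pixel in pixels for channel in pixel)

storage = io.BytesIO()  # stands in for a file on a hard drive or server
storage.write(raw)

print(len(raw))  # 12 bytes: 2 x 2 pixels x 3 color channels each
```

The work can then be "perceived, reproduced, or otherwise communicated" from those bytes with the aid of a machine, which is precisely what brings such files within the statute's definition of a copy.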
In the Andersen complaint, Plaintiffs allege that Stability AI scraped over five billion images from websites for use as training data for Stable Diffusion, but did not seek consent from either the creators of the images or the website hosts from which they were scraped. Stability AI did this, they contend, using LAION, a publicly available repository of image data. The complaint acknowledges that the LAION datasets do not contain actual image data, but rather lists of URLs where images may be found, together with the "ALT texts" associated with those images (i.e., descriptions of the general content shown in the images). The LAION website states that images and descriptions were downloaded to calculate so-called "CLIP embeddings," which are "similarity scores" between images and their textual descriptions, but the images themselves were subsequently "discarded."
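The "similarity score" idea can be sketched briefly. A CLIP-style system maps an image and a caption into the same vector space and scores their match by cosine similarity; the 4-dimensional embeddings below are made-up stand-ins for illustration, since real CLIP embeddings have hundreds of dimensions and come from trained neural networks.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 means a close match."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

image_embedding = [0.9, 0.1, 0.3, 0.2]        # hypothetical image vector
alt_text_embedding = [0.8, 0.2, 0.4, 0.1]     # hypothetical matching ALT text
unrelated_embedding = [-0.5, 0.9, -0.1, 0.6]  # hypothetical mismatched caption

# A matching caption scores much higher than a mismatched one.
print(cosine_similarity(image_embedding, alt_text_embedding) >
      cosine_similarity(image_embedding, unrelated_embedding))  # True
```

On this account, what LAION publishes is the URL-plus-text pairing and derived scores like these, not the pixel data itself.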
The Andersen Plaintiffs also state that the training images pulled from URLs were then (and currently are) "embedded and stored" as compressed copies "within Stable Diffusion." Stable Diffusion was subsequently made available to the public, and those compressed copies were used in generating output in response to user-supplied text prompts.
Plaintiffs contend that "a trained diffusion model can produce a copy of any of its Training Images—which could number in the billions" and therefore "[a] diffusion model can be considered an alternative way of storing a copy of those images." A "diffusion model uses statistical and mathematical methods to store these images in…[an] efficient and compressed manner," they argue.
For its part, Getty's complaint cites a paper by Nicholas Carlini et al., entitled Extracting Training Data from Diffusion Models, posted to the arXiv preprint server on January 30, 2023 (notably, a mere three days before Getty's court filing). In the paper, Carlini and his co-authors contend that "state-of-the-art diffusion models do memorize and [can] regenerate individual training examples" (the authors provide technical definitions for what they mean by "memorize" in this context). While this assertion appears compelling, Stability AI and the other defendants are not without defenses grounded in precedent.
For example, Stability AI might try to defend itself on the grounds that its alleged use of copyrighted works was "transitory" (it only used images for as long as it took to download and process them) or was a "fair use." Making copies of copyrighted material is a fair use (i.e., not infringement) when the "use by reproduction in copies…or by any other means specified by [§ 106], [is] for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research." 17 U.S.C. § 107. Judicial decisions over the years have expanded and clarified the list of fair uses. In Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015), for example, which tested the boundaries of the fair use defense, Google argued that indexing copyrighted books to enhance user searches for the works was not infringement. The Second Circuit agreed, finding that Google's making of a digital copy of the authors' books to provide a search function was a "transformative use" of the works, one that augments public knowledge by making available information about the plaintiffs' books without providing the public with a substantial substitute for matter protected by the plaintiffs' copyright interests in the original works or derivatives of them.
This is not to suggest that the Google facts mirror those in the Andersen or Getty cases, or that the Google outcome predetermines the results of the present lawsuits. But what can be concluded is that, whatever the outcomes of the Andersen and Getty cases, there could be significant consequences for the AI industry ahead. Court decisions based on copyright law in favor of the plaintiffs in Andersen and Getty could dampen (at least temporarily) other AI developers' data collection and use activities, just as earlier court decisions in the privacy law context altered the trajectory of corporate biometric data collection activities over the last couple of years.