Microsoft’s Megatron AI is under intense legal scrutiny, facing allegations from a group of high-profile authors that the model was trained on nearly 200,000 pirated books. The lawsuit highlights a central point of contention in the AI industry: the ethical and legal implications of data sourcing for large language models. The authors allege that this vast collection of pirated material was used to teach the AI to generate text that closely mimics their original writings.
The plaintiffs, including acclaimed writers Kai Bird and Jia Tolentino, are seeking a court order to block further copyright infringement by Microsoft, along with statutory damages of up to $150,000 per allegedly misused work. They argue that generative AI, which produces various forms of media, relies heavily on such expansive datasets to learn and replicate human creative expression, and their complaint specifically details the role the pirated books played in shaping the AI’s output.
Microsoft has not yet issued a statement regarding the lawsuit, and the authors’ attorney has declined to comment. The legal action follows recent significant rulings in California involving other AI companies, Anthropic and Meta, underscoring the nascent and still-evolving legal framework surrounding AI and copyright.
The scope of copyright challenges against AI developers is broad and growing. Major media organizations, music labels, and photography companies have all filed lawsuits asserting their rights over content used for AI training. Tech companies often invoke the “fair use” doctrine, contending that their AI models produce “transformative” new content and that requiring payment for training data could stifle innovation in the AI sector.