On November 3rd, Matthew Butterick and the Joseph Saveri Law Firm filed a class action lawsuit over Copilot in US federal court in San Francisco. They are challenging whether using open-source code to train OpenAI Codex, the model that powers GitHub Copilot, violated the open source licenses attached to those repositories. They cite 11 popular open source licenses, like MIT, GPL, and Apache, that require attribution of the author and the attached copyright notice in any derivative works. They also allege violations of GitHub's own TOS, DMCA Section 1202, the California Consumer Privacy Act, and a smattering of other laws.
Key to any class action lawsuit will be legal discovery, through which I believe Matthew is hoping to expose the exact training data used to build Copilot's model. The larger question is whether Copilot's suggestions can be considered derivative works that require attribution under the open source licenses attached to the training data. If Copilot is, in fact, simply suggesting code snippets pulled directly from the source material, then we've got what amounts to piracy. If, instead, Copilot is interpreting the source material in a novel way, then things get a bit more murky. What is certainly true, in my experience with Copilot, is that it offers no attribution or reference for where its suggested code snippets come from. At a minimum, I should be able to pull up the source of a suggestion to better understand its context.
Matthew Butterick claims that since the suit launched, he has received an overwhelming response from developers, including a list of other AI companies aping what Copilot does. He has not ruled out adding more companies to the suit if warranted. This is a vast and difficult topic to parse in a simple lightning round post, so I'll leave it here and pick it up in a future Chaos Lever main, once I do a LOT of reading and some nominal amount of thinking.