Meta, Google, and OpenAI used proprietary data to train LLMs, report says

Gary Marcus is a leading AI researcher who is increasingly horrified by what he sees. He founded at least two AI startups, one of which was sold to Uber, and has been researching the subject for over two decades. Just last weekend, the Financial Times called him “Perhaps the loudest AI questioner” and reported that Marcus assumed he was the target of a critical Sam Altman post on X: “Give me the confidence of a mediocre deep learning skeptic.”

Marcus stepped up his criticism the very next day after appearing in the FT, writing on his Substack about “generative AI as a Shakespearean tragedy.” The topic was a bombshell report from The New York Times that OpenAI violated YouTube’s terms of service by scraping over a million hours of user-generated content. What’s worse, Google’s need for data to train its own AI model was so insatiable that it did the same thing, potentially violating the copyrights of the content creators whose videos it used without their consent.

Marcus noted that as far back as 2018 he had expressed doubts about the “data-hungry” training approach that aimed to feed AI models as much content as possible. In fact, he listed eight of his warnings, dating back to his diagnosis of hallucinations in 2001, all coming true like a curse in Macbeth or Hamlet that manifests itself in the fifth act. “What makes this tragic is that many of us tried so hard to warn the field that this is where we would end up,” Marcus wrote.

While Marcus declined to comment to Fortune, the tragedy goes far beyond the fact that no one listened to critics like him and Ed Zitron, another prominent skeptic quoted by the FT. According to the Times, which cites numerous background sources, both Google and OpenAI knew their actions were legally dubious, relying on the fact that copyright law in the age of AI had yet to be litigated, but felt they had no choice but to keep pumping data into their large language models to stay ahead of their competition. And in Google’s case, the company may have been harmed by OpenAI’s massive scraping efforts, but its own rule-bending to scrape the same data left it with a proverbial arm tied behind its back.

Did OpenAI use YouTube videos?

Google employees became aware that OpenAI was using YouTube content to train its models, which would violate both YouTube’s terms of service and potentially the copyright protections of the creators who own the videos. Caught in this bind, Google decided not to publicly denounce OpenAI for fear of drawing attention to its own use of YouTube videos to train AI models, the Times reported.

A Google spokesperson told Fortune the company has “seen unconfirmed reports” that OpenAI used YouTube videos. They added that YouTube’s terms of service prohibit “the unauthorized scraping or downloading” of videos, which the company “has long used technical and legal measures to prevent.”

According to Marcus, the behavior of these big tech companies was predictable because data was the key ingredient in the AI tools they were in an arms race to develop. Without high-quality data like well-written novels, podcasts from knowledgeable hosts, or expertly produced films, chatbots and image generators run the risk of spitting out mediocre content. This idea can be summed up with the data science saying “crap in, crap out.” In a commentary for Fortune, Jim Stratton, chief technology officer at HR software company Workday, said “Data is the lifeblood of AI,” which makes the “need for high-quality, timely data more important than ever.”

Around 2021, OpenAI ran into a data shortage. The company desperately needed more examples of human language to further improve its ChatGPT tool, which was still about a year away from release, and decided to source them from YouTube. Employees discussed the possibility that scraping YouTube videos might not be allowed. Eventually, a group including OpenAI President Greg Brockman went ahead with the plan.

The fact that a high-ranking figure like Brockman was involved in the plan was, according to Marcus, a testament to how groundbreaking such data collection methods were for the development of AI. Brockman did so “most likely knowing he was in a legal gray area — and yet still eager to feed the beast,” Marcus wrote. “If everything falls apart, whether for legal or technical reasons, this image may remain.”

When reached for comment, an OpenAI spokesperson did not respond to specific questions about using YouTube videos to train its models. “Each of our models has a unique dataset that we curate to help them understand the world and remain globally competitive in research,” they wrote in an email. “We leverage multiple sources, including publicly available data and non-public data partnerships, and are exploring synthetic data generation,” they said, referring to the practice of using AI-generated content to train AI models.

Mira Murati, OpenAI’s chief technology officer, was asked in a Wall Street Journal interview whether the company’s new Sora video generator was trained using YouTube videos; she replied, “I’m not actually sure.” Last week, YouTube CEO Neal Mohan responded by saying that while he didn’t know whether OpenAI actually used YouTube data to train Sora or another tool, if it did, it would violate the platform’s rules. Mohan did mention that Google uses some YouTube content to train its AI tools based on contracts with individual YouTubers, a statement that a Google spokesperson reiterated to Fortune in an email.

Meta decides licensing agreements would take too long

OpenAI was not alone in facing a lack of sufficient data. Meta was grappling with the same problem. When Meta realized that its AI products were not as advanced as OpenAI’s, it held numerous meetings with top executives to find ways to secure more data to train its systems. Executives considered options such as paying a $10-per-book royalty on new releases and purchasing the publisher Simon & Schuster outright. At these meetings, executives admitted that they had already used copyrighted material without the authors’ permission. Ultimately, they decided to press ahead, even if it meant possible lawsuits in the future, according to The New York Times.

Meta did not respond to a request for comment.

Meta’s lawyers believed that in the event of litigation they would be protected by a case Google won in 2015 against a consortium of authors. At the time, a judge ruled that Google could use the authors’ books without having to pay a licensing fee because the company used their work to build a search engine that was sufficiently transformative to be considered fair use.

OpenAI is arguing something similar in a case The New York Times brought against it in December. The Times claims that OpenAI used its copyrighted material without compensating it, while OpenAI contends that its use of the materials falls within fair use because they were compiled to train a large language model and not because it is a competing news organization.

For Marcus, the hunger for more data was proof that the entire AI proposition was built on shaky ground. For AI to live up to the hype with which it was billed, it simply requires more data than is available. “All of this happened because of the realization that their systems simply cannot succeed without even more data than the Internet data they were already trained on,” Marcus wrote on Substack.

OpenAI appeared to admit this was the case in written testimony to the House of Lords in December. “It would be impossible to train today’s leading AI models without using copyrighted material,” the company wrote.
