
Reddit says it has made $203 million so far from licensing its data

Reddit's prospects as it moves toward a stock market listing have a lot more to do with relationships with AI vendors like OpenAI than you might expect.

In its IPO prospectus, filed today with the U.S. Securities and Exchange Commission, Reddit repeatedly emphasized how much it believes it can earn (and has earned) from data licensing deals with companies that train AI models on its more than 1 billion posts and more than 16 billion comments.

“In January 2024, we entered into certain data licensing agreements with an aggregate contract value of $203.0 million and terms ranging from two to three years,” the prospectus reads. “We expect a minimum of $66.4 million of revenue to be recognized during the year ending December 31, 2024 and the remainder thereafter.”

For now, it's unclear which AI vendors are licensing Reddit data. Earlier this week, Bloomberg and Reuters reported that a “large anonymous artificial intelligence company” – possibly Google – had entered into a licensing agreement worth about $60 million on an annualized basis. But OpenAI wouldn't be a surprising customer either, especially considering that OpenAI CEO Sam Altman has an 8.7% stake in Reddit (making him the third-largest shareholder) and was once a member of the company's board of directors.

Why is Reddit data valuable? As Reddit explains, AI models “learn” from examples to create essays, code, emails, articles, and more, and providers like OpenAI scour the web for millions or billions of these examples to add to their training sets. Some examples are in the public domain. Others are not, or, in the case of Reddit content, are subject to restrictive licenses that require citation or specific forms of compensation.

Reddit previously did not prevent access to its data for AI training purposes. But last year it changed course, arguing that its data should not be, in the words of CEO Steve Huffman, “[given] to some of the largest companies in the world for free.”

“Reddit data is a critical building block of today's AI technology and many important language models,” the prospectus continues. “We believe that Reddit's huge corpus of data and conversational insights will continue to play a role in training and improving large language models. As our content updates and grows daily, we hope that models will want to reflect these new ideas and update their training using data from Reddit.”

Content producers, from media libraries to news publishers, are increasingly turning to data licensing deals with AI providers as chatbots like OpenAI's ChatGPT threaten to sap traffic. A recent model from The Atlantic found that if a search engine like Google integrated AI into search, it would answer a user's query 75% of the time without the user needing to click through to the publication's website.

Vendors, in turn, have been pushed to seek licensing deals as they face an avalanche of lawsuits alleging they have no legal basis to train their models on data without permission or payment. Recently, The New York Times accused OpenAI of effectively building competitors to news publishers using the publishers' own works, harming their business.

OpenAI has agreements in place with Shutterstock as well as publishers, including Axel Springer, owner of Politico and Business Insider. The licenses are reportedly quite small, however: they max out at $5 million a year.