Large language models can generate text by drawing on word patterns learned from web pages, books, and other bodies of text in their training data. In addition to ChatGPT, the programs make up the guts of search chatbots like Microsoft Bing Chat and Google Bard and underlie a growing number of applications that produce professional and creative copy in an instant. Their counterparts that generate AI-composed illustrations and videos draw patterns from image data sets, such as photos collected from Pinterest and Flickr.
Often the data sets used in AI development are built through unauthorized means, such as deploying software that scrapes content from websites. In the US, this is generally considered legal, although copyright issues and website terms of use that prohibit the practice have left it in dispute.
Some websites, like Reddit and Stack Overflow, have been more welcoming. They offer downloadable “data dumps” or real-time data portals, known as APIs, that help software access their content. In the case of Stack Overflow, LLM developers get data through a combination of dumps, APIs, and scraping, Chandrasekar says, all of which can be done for free today.
But Chandrasekar says that LLM developers are violating Stack Overflow’s terms of service. Users own the content they post on Stack Overflow, as described in its TOS, but it all falls under a Creative Commons license that requires anyone who later uses the data to mention where it came from. When AI companies sell their models to clients, “they can’t attribute to each and every community member whose questions and answers were used to train the model, thus violating the Creative Commons license,” says Chandrasekar.
Neither Stack Overflow nor Reddit has released pricing information. “We’re working on that as we speak,” says Reddit spokesperson Tim Rathschmidt, “and will share more with partners in the coming weeks.” Stack Overflow will study Reddit’s strategy and consult with its own potential customers, some of whom have already reached out about accessing the data, Chandrasekar says.
A potential roadmap for pricing could come from Elon Musk, who this month raised prices for access to Twitter data. They start at $42,000 per month to access 50 million tweets. Approximately three times that volume of tweets had previously been available for free. In a tweet this week, Musk accused Microsoft, a major artificial intelligence developer and close partner of OpenAI, of training algorithms “illegally using data from Twitter.” Without elaborating, he added: “Judgment time.”
Both Stack Overflow and Reddit will continue to license data for free to some individuals and companies. Chandrasekar says that Stack Overflow wants remuneration only from companies that develop LLMs for big business purposes. “When people start charging for products created on community-created sites like ours, that’s where it’s not fair use,” he says.
Steve Huffman, CEO of Reddit, told The New York Times this week that he didn’t want to give a freebie to the biggest companies in the world. “Crawling Reddit, generating value, and not returning any of that value to our users is something we have a problem with,” he said.