Reddit Sues AI Firms Over Alleged User Comment Scraping

Social media giant Reddit has filed a lawsuit against AI company Perplexity and others, alleging widespread, unlawful scraping of user comments for commercial gain.

Date
Oct 22, 2025
Summary

Reddit, the prominent social media platform, has initiated legal action against AI firm Perplexity AI and several other entities. The lawsuit, filed in a New York federal court, accuses these companies of engaging in an extensive, illicit operation to extract millions of Reddit user comments. This alleged scraping is purportedly conducted for commercial purposes, circumventing Reddit's security measures and exploiting its vast repository of human conversation. The legal challenge highlights a growing tension between content platforms and AI developers seeking data for training advanced AI models.

A smartphone displaying the Reddit logo. Credit: westernslopenow.com
Social media platform Reddit has filed a federal lawsuit against artificial intelligence company Perplexity AI and three additional entities, alleging their involvement in an extensive and illicit operation to extract user comments. The lawsuit claims these parties are engaged in an “industrial-scale, unlawful” economy designed to “scrape” the commentary of millions of Reddit users for commercial benefit. This legal action, filed in a New York federal court, underscores a significant and evolving challenge facing content platforms as AI development accelerates.

The complaint targets San Francisco-based Perplexity, known for its AI chatbot and “answer engine” which competes with prominent services like Google and ChatGPT. Also named in the lawsuit are Oxylabs UAB, a data-scraping company based in Lithuania; AWMProxy, described by Reddit as a “former Russian botnet”; and SerpApi, a Texas-based startup that lists Perplexity as a client on its official website. This marks Reddit’s second such lawsuit against a major AI company, following a previous action against Anthropic in June.

The current lawsuit differs from the Anthropic case, however: it targets not only an AI company but also the less-publicized services that the AI industry often relies on to acquire the vast amounts of online text needed to train sophisticated AI chatbots. Reddit’s chief legal officer, Ben Lee, said that these “scrapers bypass technological protections to steal data, then sell it to clients hungry for training material.” Lee added that Reddit is a prime target because it is “one of the largest and most dynamic collections of human conversation ever created.”

Perplexity has indicated it has not yet formally received the lawsuit but stated it “will always fight vigorously for users’ rights to freely and fairly access public knowledge.” The company affirmed its approach remains “principled and responsible” in providing factual answers with accurate AI, vowing not to “tolerate threats against openness and the public interest.” SerpApi’s customer success director, Ryan Schafer, also responded, stating, “We strongly disagree with Reddit’s allegations and intend to vigorously defend ourselves in court.” Oxylabs and AWMProxy have not yet publicly responded to the allegations.

The Allegations: Circumventing Protections and Profiting from Stolen Data

Reddit’s lawsuit portrays the defendants as “would-be bank robbers” who, unable to breach a bank vault directly, instead target an armored truck. This analogy highlights the core accusation: that these companies are circumventing Reddit’s established anti-scraping measures. The complaint further alleges that the defendants are “circumventing Google’s controls and scraping Reddit content directly from Google’s search engine results” to acquire data. This multi-layered approach to data acquisition is central to Reddit’s legal argument.

Ben Lee elaborated on this strategy, explaining that because direct scraping of Reddit content is difficult, these entities “mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search.” Lee then directly implicated Perplexity, stating the AI firm “is a willing customer of at least one of these scrapers, choosing to buy stolen data rather than enter into a lawful agreement with Reddit itself.” In Reddit’s account, Perplexity deliberately chose to procure data through illicit means rather than pursue proper licensing channels.

The current legal battle echoes Reddit’s earlier lawsuit against Anthropic, where similar arguments were made regarding the unauthorized use of its content. That case, initially filed in a California state court, has since been moved to federal court and is scheduled for a hearing in January. The consistent theme across these lawsuits is Reddit’s assertion that AI companies are utilizing its platform’s content without proper authorization or compensation, despite the significant value that content provides in training AI models.

Reddit’s expansive archives, along with other online resources like Wikipedia and digitized books and news articles, represent invaluable sources of human language patterns. These vast datasets are crucial for teaching AI assistants to comprehend and generate human-like text. The ongoing legal disputes highlight a broader struggle within the digital ecosystem to define fair use and intellectual property rights in the age of generative AI, where data is the foundational resource for technological advancement.

The issue of data sourcing for AI training has become a critical point of contention, particularly as AI models grow more sophisticated and demand ever-larger datasets. Reddit has proactively engaged in licensing agreements with several prominent technology companies, including Google and OpenAI. These agreements permit the AI firms to train their systems on the public commentary generated by Reddit’s extensive user base, which exceeds 100 million daily active users. Such partnerships illustrate one pathway for AI developers to legally access and utilize content for training purposes.

These licensing deals have played a significant role in Reddit’s financial strategy, helping the 20-year-old online platform generate revenue ahead of its public listing on Wall Street last year. The ability to monetize its vast content library through such agreements demonstrates the perceived value of Reddit’s data in the burgeoning AI economy. Seen in this context, the current lawsuits defend not only Reddit’s intellectual property but also a crucial revenue stream and business model for the social media giant.

The lawsuits against Perplexity and Anthropic represent a broader trend of content creators and platforms seeking to assert control over their data in the face of widespread AI development. As AI models become increasingly integrated into various industries, the legal and ethical frameworks governing data acquisition and usage are being tested and reshaped. Companies like Reddit are pushing for a clear distinction between publicly accessible information and data that is protected or requires licensing for commercial AI training.

The outcome of these legal proceedings could establish significant precedents for how AI companies operate and acquire training data in the future. It could influence the development of new licensing models, stricter enforcement of anti-scraping technologies, and a re-evaluation of what constitutes “fair use” in the context of advanced AI. The ongoing legal battles underscore the complex interplay between technological innovation, intellectual property rights, and the commercial imperatives of both content platforms and AI developers.