
Digital Event Horizon

The AI Tarpit: How Crawlers are Overwhelming Open Source Infrastructure



The open source community is facing a growing crisis as aggressive AI crawlers overwhelm infrastructure and drive up costs for maintainers. Companies such as Amazon, OpenAI, Anthropic, and Meta are among those responsible for the surge in bot traffic, with some projects reporting as much as 97 percent of their traffic originating from these sources. The impact is significant, with many sites experiencing persistent DDoS attacks that strain already stretched-thin maintainers.

  • The open source community is facing a crisis due to aggressive AI crawlers overwhelming infrastructure and driving up costs for maintainers.
  • Companies such as Amazon, OpenAI, Anthropic, and Meta are among those responsible for the surge in bot traffic, with some projects reporting as much as 97 percent of their traffic originating from these sources.
  • Crawler traffic hits some sites like a persistent DDoS attack, straining both infrastructure and already stretched-thin maintainers.
  • AI-generated bug reports are also being received by open source projects, containing fabricated vulnerabilities and wasting developer time.
  • New defensive tools have emerged to protect websites from unwanted AI crawlers, and one project reported cutting its traffic by 75 percent after blocking them.
  • The situation highlights a broader crisis in the open source community: smaller AI startups have reportedly worked with affected projects, while larger corporations have been unresponsive.
  • The problem threatens the sustainability of essential online resources, which rely on public collaboration and operate with far fewer resources than commercial entities.



    The open source community is facing a growing crisis, as aggressive AI crawlers increasingly overwhelm infrastructure and drive up costs for maintainers. According to recent reports, companies such as Amazon, OpenAI, Anthropic, and Meta are among those responsible for the surge in bot traffic, with some projects reporting as much as 97 percent of their traffic originating from these sources.

    The impact is significant, with many sites experiencing persistent distributed denial-of-service (DDoS) attacks that strain already stretched-thin maintainers. For instance, KDE's GitLab infrastructure was temporarily knocked offline by crawler traffic originating from Alibaba IP ranges, according to LibreNews, citing a KDE Development chat. Similarly, GNOME sysadmin Bart Piotrowski shared on Mastodon that only about 3.2 percent of requests (2,690 out of 84,056) passed their challenge system, suggesting the vast majority of traffic was automated.
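
    The GNOME figure can be checked with a few lines of arithmetic. The sketch below uses only the two request counts quoted above; the function name `pass_rate` is a hypothetical helper, not part of any tool GNOME uses.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the share of requests that passed, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


# Request counts quoted in the article for GNOME's challenge system.
passed, total = 2_690, 84_056

rate = pass_rate(passed, total)
print(f"passed: {rate:.1f}%  did not pass: {100 - rate:.1f}%")
# prints "passed: 3.2%  did not pass: 96.8%"
```

    Roughly 96.8 percent of requests failed or never completed the challenge, which is in line with the 97 percent bot-traffic figure cited earlier.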

    The problem extends beyond infrastructure strain. As LibreNews points out, some open source projects began receiving AI-generated bug reports as early as December 2023, a problem first described by Daniel Stenberg of the curl project in a January 2024 blog post. These reports appear legitimate at first glance but contain fabricated vulnerabilities, wasting valuable developer time.

    The crawlers' behavior suggests several possible motivations. Some may be collecting training data to build or refine large language models, while others could be executing real-time searches when users ask AI assistants for information. The frequency of these crawls is particularly telling, with some companies appearing more aggressive than others.

    In response to these attacks, new defensive tools have emerged to protect websites from unwanted AI crawlers. As Ars reported in January, an anonymous creator identified only as "Aaron" designed a tool called "Nepenthes" to trap crawlers in endless mazes of fake content. Aaron explicitly describes it as "aggressive malware" intended to waste AI companies' resources and potentially poison their training data.
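
    The "endless maze" idea can be illustrated with a short sketch. This is not Nepenthes' actual code, which the article does not describe; it only shows one assumed way to produce an unbounded set of pages without storing any of them, by deriving each page's outgoing links from a hash of its own path. The word list, the function name `maze_links`, and the URL shape are all hypothetical.

```python
import hashlib
import random

# Hypothetical filler vocabulary used to build link names.
WORDS = ["archive", "notes", "index", "draft", "report", "summary", "log"]


def maze_links(path: str, links_per_page: int = 5) -> list[str]:
    """Return deterministic child links for a maze page at `path`.

    The same path always yields the same links, so the maze looks like
    a stable site, yet every link leads to another generated page.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    base = path.rstrip("/")
    return [
        f"{base}/{rng.choice(WORDS)}-{rng.randrange(10_000):04d}"
        for _ in range(links_per_page)
    ]


# Each link can itself be passed back in, so the depth is unbounded.
first_level = maze_links("/maze")
second_level = maze_links(first_level[0])
```

    Because nothing is stored, the cost of serving such a maze stays small for the site, while a crawler that follows every link never runs out of pages.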

    Blocking crawlers can have a large, measurable effect. The Read the Docs project reported that blocking AI crawlers immediately decreased its traffic by 75 percent, from 800GB per day to 200GB per day. The change saved the project approximately $1,500 per month in bandwidth costs, according to its blog post "AI crawlers need to be more respectful."
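
    The Read the Docs numbers are internally consistent, as the short calculation below shows. The 30-day month is an assumption, and the per-gigabyte price is only what the reported figures imply, not a rate Read the Docs published.

```python
# Figures reported by Read the Docs, as quoted in the article.
before_gb_per_day = 800
after_gb_per_day = 200
monthly_savings_usd = 1_500

DAYS_PER_MONTH = 30  # assumption: the article does not state the billing period

saved_gb_per_day = before_gb_per_day - after_gb_per_day
reduction_pct = 100 * saved_gb_per_day / before_gb_per_day
saved_gb_per_month = saved_gb_per_day * DAYS_PER_MONTH

# Implied bandwidth price, derived from the figures above (not reported).
implied_usd_per_gb = monthly_savings_usd / saved_gb_per_month

print(f"traffic reduction: {reduction_pct:.0f}%")
print(f"bandwidth saved:   {saved_gb_per_month:,} GB per month")
print(f"implied price:     ${implied_usd_per_gb:.3f} per GB")
```

    Under these assumptions the project avoided about 18,000 GB of transfer per month, which works out to roughly eight cents per gigabyte.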

    The situation highlights a broader crisis spreading rapidly across the open source community. What appears to be aggressive AI crawling is becoming an insurmountable challenge for maintainers of projects that depend on public collaboration and operate with limited resources. It remains unclear why companies don't adopt more collaborative approaches and, at a minimum, rate-limit their data harvesting runs so they don't overwhelm source websites.
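
    Rate-limiting on the crawler side is not technically demanding. As a minimal sketch, a token bucket caps the request rate per host while still allowing short bursts. Nothing here reflects how any named company's crawler works; the class, the parameter values, and the `fetch` callback are all hypothetical.

```python
import time
from typing import Callable, Iterable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if rate_per_sec <= 0 or capacity <= 0:
            raise ValueError("rate_per_sec and capacity must be positive")
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start with a full bucket
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def try_acquire(self) -> bool:
        """Spend one token if available; return whether a request may go."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the next token becomes available."""
        self._refill()
        return max(0.0, (1.0 - self.tokens) / self.rate)


def crawl_politely(
    urls: Iterable[str],
    fetch: Callable[[str], None],  # hypothetical download function
    bucket: TokenBucket,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Fetch each URL, pausing whenever the bucket is empty."""
    for url in urls:
        while not bucket.try_acquire():
            sleep(bucket.wait_time())
        fetch(url)
```

    With `TokenBucket(rate_per_sec=1.0, capacity=5)`, for example, a crawler could send five requests at once and then settle to one request per second against a given site.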

    The problem also extends beyond the individual projects affected by these crawlers. A disproportionate burden on open source infrastructure threatens the sustainability of essential online resources, which rely on public collaboration and typically operate with limited resources compared to commercial entities.

    As one Hacker News user put it, AI firms are operating from a position that "goodwill is irrelevant" with their "$100bn pile of capital." The discussions there depict a split between smaller AI startups that have worked collaboratively with affected projects and larger corporations that have been unresponsive despite allegedly forcing thousands of dollars in bandwidth costs on open source project maintainers.

    It remains to be seen how this crisis will be resolved. Industry players, including Amazon, OpenAI, Anthropic, and Meta, have yet to respond to requests for comment on the issue. The situation is likely to escalate further unless regulators impose meaningful rules or AI firms show self-restraint.



    Related Information:
  • https://www.digitaleventhorizon.com/articles/The-AI-Tarpit-How-Crawlers-are-Overwhelming-Open-Source-Infrastructure-deh.shtml

  • https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/

  • https://technewstube.com/ars-technica/1716121/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-entire/


  • Published: Tue Mar 25 19:59:49 2025 by llama3.2 3B Q4_K_M











    © Digital Event Horizon. All rights reserved.
