For three decades, the Wayback Machine digital archive has preserved over a billion web pages, serving as an essential tool for journalists, historians, and lawyers. Its current threat comes not from governments or hackers but from media outlets themselves. According to the Nieman Foundation, at least 241 media outlets in nine countries, including The Guardian, The New York Times, and Le Monde, block the archive's crawlers from accessing their content.
The Technical Dilemma Between Preservation and Data Protection 🛡️
The outlets block the archive because they fear that artificial intelligence companies such as OpenAI or Google will use archived material to train their models without permission or compensation. The New York Times has alleged that AI firms are using its archived content in violation of copyright. In addition, AI bots send tens of thousands of requests per second to archive.org's servers, overwhelming its infrastructure. The Internet Archive, which advocates for an open internet, must now uphold that philosophy while protecting itself from these practices.
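The article does not say which mechanism the outlets use, but one common way for a site to turn away specific crawlers is a robots.txt file. The following is a minimal sketch under that assumption, using Python's standard urllib.robotparser. The robots.txt content, the news.example URL, and the ExampleReaderBot name are hypothetical, and the ia_archiver and GPTBot tokens are used only as illustrative crawler names, not as a statement of what any outlet actually publishes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: refuses one archive crawler and one AI crawler,
# and leaves the site open to every other user agent.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def check_access(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each user agent, whether the rules permit fetching the URL."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


access = check_access(
    ROBOTS_TXT,
    ["ia_archiver", "GPTBot", "ExampleReaderBot"],  # last name is made up
    "https://news.example/article",                 # placeholder URL
)
for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Note that robots.txt is a voluntary convention: it only stops crawlers that choose to honor it, which may be one reason a publisher would shut out the archive itself rather than rely on every AI bot to comply.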
The Irony of Biting the Hand That Has Your Back 😅
It is paradoxical that outlets like USA Today, which used the archive to recover their own lost articles, are now closing the door. It is as if a firefighter saved your house and you then banned them from entering because you feared they would steal your sofa. Meanwhile, the AI bots keep queuing up, and archive.org, caught between its altruistic mission and reality, resembles the host of a party that everyone wants to attend but no one wants to pay for.