The early vision of a decentralized, community-driven network has given way to a model in which users perform microtasks that benefit large corporations. Figures like Luis von Ahn, with projects like reCAPTCHA and Duolingo, were key to this transition. They turned everyday actions, such as proving we are not robots or learning a language, into mechanisms for generating valuable data. This unpaid work now sustains the development of artificial intelligence and other commercial services.
The code behind data capture: from interaction to dataset 🤖
Technically, these systems rely on ingenious human-computer interaction (HCI) designs that mask data collection. The original reCAPTCHA, for example, presented two words: a control word already known to the system and a second word scanned from a book that needed digitizing. A correct answer to the control word verifies the user, and their reading of the second word is recorded as a candidate transcription. Duolingo structures its lessons as bidirectional translation exercises, where each response contributes to training language models. This data, anonymized and aggregated, forms datasets for training OCR or machine translation algorithms.
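The two-word mechanism can be sketched roughly as follows. This is a minimal illustration, not the real reCAPTCHA code: the names `submit` and `consensus`, the in-memory `votes` store, and the agreement thresholds are all assumptions made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory store: unknown word id -> human transcriptions so far.
votes = defaultdict(list)


def normalize(text):
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()


def submit(control_answer, control_truth, unknown_id, unknown_answer):
    """Check the control word; record the unknown-word guess only on a pass."""
    if normalize(control_answer) != normalize(control_truth):
        return False  # failed the known word, so the other guess is discarded
    votes[unknown_id].append(normalize(unknown_answer))
    return True


def consensus(unknown_id, min_votes=3, min_share=0.75):
    """Return the agreed transcription, or None if agreement is too weak.

    min_votes and min_share are assumed thresholds, chosen for illustration.
    """
    answers = votes.get(unknown_id, [])
    if len(answers) < min_votes:
        return None
    word, count = Counter(answers).most_common(1)[0]
    return word if count / len(answers) >= min_share else None
```

The idea is that the user cannot tell which word is the control, so passing the known word is taken as evidence that the second reading is made in good faith. Once enough independent users agree, the word is treated as digitized.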
Welcome to the most fun workplace in the world (it doesn't pay) 🦉
It's curious to think that our free time has become the most widely distributed production line on the planet. While we thought we were downloading a meme or proving our humanity to a text box, we were actually putting in a shift at a data factory. The next time Duolingo's crying owl reminds you to practice Spanish, remember that you're not only learning, but also polishing the AI model that a company will later rent out. At least we don't have to punch a time card.