Digital Event Horizon
As artificial intelligence (AI) development advances, a concerning trend has emerged in how training data is collected and used. A recent study by the Data Provenance Initiative found that AI's data sources are concentrating power overwhelmingly in the hands of a few dominant technology companies, raising important questions about the ethics and accessibility of AI development.
AI systems are built on vast amounts of data, which serve as the foundation for training models to perform various tasks. Despite the importance of this data, researchers and developers are often in the dark about its origins and provenance, and this lack of transparency has significant implications for AI development.
To shed light on this issue, a group of over 50 researchers from both academia and industry joined forces as part of the Data Provenance Initiative. Their primary goal was to investigate the sources of data used in AI models and uncover any patterns or trends that may be emerging. After auditing nearly 4,000 public data sets spanning over 600 languages, 67 countries, and three decades, their findings revealed a concerning trend.
The audit suggests that AI's data practices are concentrating power overwhelmingly in the hands of a few dominant technology companies. In the early 2010s, data sets came from a variety of sources, including encyclopedias, parliamentary transcripts, earnings calls, and even weather reports. These diverse sources allowed for curation and collection tailored to individual tasks. However, with the advent of transformers in 2017, the architecture underpinning modern language models, performance began to scale with ever larger data sets.
Today, most AI data sets are constructed by indiscriminately scraping material from the internet, a shift toward a monoculture in which YouTube dominates as a source for video and audio data. Since 2018, the web has been the dominant source of data for audio, images, and video alike, and the result is a widening gap between scraped data and more carefully curated data sets.
This concentration of power raises critical questions about how the human experience is portrayed in these data sets and what kinds of models are built from them. People upload content to platforms like YouTube with specific audiences in mind, and what they do in those videos reflects particular goals. The concern is whether data sets drawn from such material accurately capture the nuance and complexity of humanity.
Another significant challenge is that the companies behind AI models typically do not share information about their data sources. One reason for this secrecy is to protect competitive advantages; another lies in the complex way data sets are bundled, packaged, and distributed. As a consequence, these organizations often lack complete knowledge of the constraints on how their data may be used or shared.
Furthermore, many data sets have restrictive licenses attached, limiting their use for commercial purposes, such as training AI models. This inconsistency across the data lineage makes it challenging for developers to make informed decisions about which data to utilize and increases the likelihood of inadvertently training models with copyrighted material.
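In practice, a developer who wants to avoid the licensing pitfalls described above has to filter candidate data sets by their stated license before training. A minimal sketch of that idea, with illustrative (hypothetical) data-set names, license strings, and an allow-list that are not drawn from any real catalog:

```python
# Hypothetical sketch: exclude data sets whose license does not clearly
# permit commercial use (e.g. training a commercial AI model).
# All names and license identifiers below are illustrative examples.

# Licenses commonly understood to permit commercial use (illustrative list).
COMMERCIAL_OK = {"cc-by-4.0", "mit", "apache-2.0", "cc0-1.0"}

datasets = [
    {"name": "parliamentary-transcripts", "license": "cc-by-4.0"},
    {"name": "scraped-web-text", "license": "unknown"},       # provenance missing
    {"name": "qa-pairs", "license": "cc-by-nc-4.0"},          # non-commercial only
]

def usable_for_training(record: dict) -> bool:
    """Keep only data sets whose license clearly allows commercial use.

    An 'unknown' license is excluded: as the audit notes, provenance
    information is often missing, so the safe default is to skip it.
    """
    return record["license"].lower() in COMMERCIAL_OK

usable = [d["name"] for d in datasets if usable_for_training(d)]
print(usable)  # ['parliamentary-transcripts']
```

The conservative default (rejecting anything with missing or unclear license metadata) mirrors the article's point: inconsistent data lineage forces developers to err on the side of exclusion or risk training on copyrighted material.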
Exacerbating these issues, companies like OpenAI and Google have struck exclusive data-sharing deals with publishers, major forums such as Reddit, and social media platforms. These partnerships not only concentrate power in the hands of these large tech corporations but also partition the internet into distinct zones where access is granted only to specific entities.
The trend benefits the biggest AI players, who can afford such deals, at the expense of researchers, nonprofits, and smaller companies that struggle to obtain access because of limited resources and asymmetric information. As Shayne Longpre, an MIT researcher who is part of the Data Provenance Initiative, puts it, "This is a new wave of asymmetric access that we haven't seen on the open web."
The implications of this concentration of power are profound and raise questions about the ethics and accessibility of AI development in the future. As AI continues to evolve and play an increasingly significant role in various aspects of our lives, understanding its data sources and ensuring transparency becomes crucial for creating models that accurately reflect the human experience.
Related Information:
https://www.technologyreview.com/2024/12/18/1108796/this-is-where-the-data-to-build-ai-comes-from/
Published: Wed Dec 18 05:49:09 2024 by llama3.2 3B Q4_K_M