Digital Event Horizon
The Xet team at Hugging Face is working to improve its data storage architecture by optimizing Parquet file dedupe-ability. Through experiments and proposed format changes, the team aims to make storage more efficient and reduce storage costs for its users.
Parquet files are widely used in big data analytics and machine learning applications due to their ability to efficiently compress and transport large datasets. Data deduplication is crucial for optimizing Parquet file storage, particularly when updating datasets on a regular basis. Currently, Hugging Face hosts nearly 11PB of datasets with over 2.2PB in Parquet files, highlighting the need for efficient dedupe-ability. The current byte-level Content-Defined Chunking algorithm generally works well for insertions and deletions but encounters challenges with row modifications. Proposed solutions include using relative offsets and supporting content-defined chunking on row groups to improve Parquet file dedupe-ability.
Parquet files have become an essential component of modern data storage, particularly in big data analytics and machine learning applications. These files are widely used to store structured data due to their ability to efficiently compress and transport large datasets. However, with the increasing amount of data being stored and processed, there is a growing need to optimize Parquet file dedupe-ability.
According to recent research conducted by the Xet team at Hugging Face, optimizing Parquet storage is crucial for efficiency when users want to update their datasets on a regular basis. The researchers found that most Parquet files are bulk exports from various data analysis pipelines or databases and often appear as full snapshots rather than incremental updates. Data deduplication becomes critical in this scenario, as it enables the storage of all versions of a growing dataset with only a little more space than the size of its largest version.
Currently, Hugging Face hosts nearly 11PB of datasets, with Parquet files accounting for over 2.2PB of that storage. Given the sheer volume of data being stored, optimizing Parquet file dedupe-ability is essential to improve efficiency and reduce storage costs.
To address this challenge, the researchers conducted experiments using a 2GB Parquet file with 1,092,000 rows from the FineWeb dataset. By appending new rows, modifying existing rows, and deleting rows, they measured how well each edited variant deduplicates against the original file.
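As an illustration of how such variants can be produced, here is a minimal sketch assuming pyarrow and a locally downloaded copy of the file; the file names, row counts, and the "text" column are assumptions for illustration, not the team's actual test harness.

# Hypothetical sketch: generating "append", "delete", and "modify" variants
# of a Parquet file with pyarrow. Paths, row counts, and column names are assumed.
import pyarrow as pa
import pyarrow.parquet as pq

base = pq.read_table("fineweb_slice.parquet")   # assumed local copy of the base file
n = base.num_rows

# Append: repeat the first 10,000 rows at the end of the table.
appended = pa.concat_tables([base, base.slice(0, 10_000)])
pq.write_table(appended, "appended.parquet")

# Delete: drop a block of 10,000 rows from the middle.
deleted = pa.concat_tables([base.slice(0, n // 2), base.slice(n // 2 + 10_000)])
pq.write_table(deleted, "deleted.parquet")

# Modify: change one value in a single row near the middle of the table.
idx = base.schema.get_field_index("text")       # "text" is an assumed column name
values = base.column(idx).to_pylist()
values[n // 2] = values[n // 2] + " [edited]"
modified = base.set_column(idx, "text", pa.array(values))
pq.write_table(modified, "modified.parquet")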
The results of these experiments suggest that the current byte-level Content-Defined Chunking (CDC) algorithm generally works well for insertions and deletions but struggles with row modifications. With appends, for example, the new chunks are confined to the end of the file, and nearly the entire original Parquet file dedupes against the new version.
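For readers unfamiliar with CDC, the sketch below illustrates the basic idea with a Gear-style rolling hash: a chunk boundary is declared wherever the hash matches a bit mask, so boundaries follow the content and a local edit only disturbs nearby chunks. This is a simplified illustration with made-up parameters, not the actual chunker used by the Xet storage backend.

# Simplified content-defined chunking with a Gear-style rolling hash.
# A boundary is cut wherever the low bits of the hash are all zero (subject to
# minimum/maximum chunk sizes), so boundaries depend on content, not position.
# The mask and size limits here are illustrative, not the Xet team's values.
import hashlib
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]   # fixed random value per byte
MASK = (1 << 13) - 1                                  # ~8 KiB average chunk size
MIN_CHUNK, MAX_CHUNK = 2_048, 65_536
WORD = (1 << 64) - 1

def cdc_chunks(data: bytes):
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & WORD   # older bytes age out as their bits shift away
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def unique_chunk_bytes(versions):
    # Bytes that would actually be stored after dedupe across all file versions.
    store = {}
    for blob in versions:
        for chunk in cdc_chunks(blob):
            store[hashlib.sha256(chunk).digest()] = len(chunk)
    return sum(store.values())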
Row modifications are a different story. Parquet splits a table into row groups and compresses each column within a row group, and the resulting column chunks are located by absolute file offsets stored in the metadata. Changing even a single row alters the size of its compressed row group, which shifts those offsets and forces the column headers throughout the rest of the file to be rewritten. The result is that, even though most of the file still dedupes well, it is peppered with many small, regularly spaced sections of new data.
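The effect of this layout can be inspected directly from the file footer. The sketch below, assuming pyarrow and a hypothetical local file name, prints the per-column-chunk offsets that get rewritten whenever an earlier row group changes size:

# Print the row-group / column-chunk layout recorded in a Parquet footer.
# The absolute offsets shown here shift whenever any earlier row group grows
# or shrinks, which is why a small row edit touches metadata across the file.
import pyarrow.parquet as pq

meta = pq.ParquetFile("example.parquet").metadata   # "example.parquet" is an assumed path
print(f"{meta.num_row_groups} row groups, {meta.num_columns} columns")

for rg in range(meta.num_row_groups):
    group = meta.row_group(rg)
    for col in range(group.num_columns):
        chunk = group.column(col)
        print(f"row group {rg:3d}  {chunk.path_in_schema:<20}  "
              f"data page offset {chunk.data_page_offset:>12}  "
              f"compressed size {chunk.total_compressed_size:>10}")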
To address this challenge, the researchers propose two potential solutions: using relative offsets instead of absolute offsets for file structure data, and supporting content-defined chunking of row groups. The former would make Parquet structures position-independent and easy to "memcpy" around, while the latter could enable dedupe across compressed Parquet files.
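As a rough illustration of the second proposal, the sketch below cuts row groups wherever a hash of a designated key column matches a mask, rather than every fixed number of rows, so unchanged runs of rows tend to land in identical row groups even after edits elsewhere. This is a hypothetical writer built on pyarrow's public API under assumed parameters, not an existing Parquet feature or the team's implementation.

# Hypothetical content-defined row-group boundaries: a row group ends wherever
# the hash of a key column matches a mask (within min/max limits), mirroring
# CDC at the row level. Mask, size limits, and the key column are assumptions.
import hashlib
import pyarrow as pa
import pyarrow.parquet as pq

MASK = (1 << 10) - 1                  # ~1,024 rows per group on average
MIN_ROWS, MAX_ROWS = 256, 8_192

def content_defined_row_groups(table: pa.Table, key_column: str):
    keys = table.column(key_column).to_pylist()
    slices, start = [], 0
    for i, key in enumerate(keys):
        h = int.from_bytes(hashlib.sha256(str(key).encode()).digest()[:8], "little")
        size = i - start + 1
        if (size >= MIN_ROWS and (h & MASK) == 0) or size >= MAX_ROWS:
            slices.append(table.slice(start, size))
            start = i + 1
    if start < table.num_rows:
        slices.append(table.slice(start))
    return slices

def write_with_cdc_row_groups(table: pa.Table, path: str, key_column: str):
    with pq.ParquetWriter(path, table.schema) as writer:
        for group in content_defined_row_groups(table, key_column):
            writer.write_table(group)      # each small slice becomes one row group

Because the boundary decision depends only on row content, inserting or deleting rows perturbs only the row groups near the edit; boundaries further down the table re-synchronize, leaving the corresponding bytes unchanged and dedupe-able.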
Additionally, the researchers suggest exploring other common file types to assess how well the dedupe process handles them. By collaborating with the Apache Arrow project, they hope to implement some of these ideas in the Parquet/Arrow codebase and improve Parquet storage performance.
In conclusion, optimizing Parquet dedupe-ability is crucial for efficient data storage in modern big data analytics and machine learning applications. Through their experiments and proposed solutions, the researchers have demonstrated the challenges faced by existing byte-level Content-Defined Chunking algorithms when it comes to row modifications. By exploring new approaches and collaborating with other projects, they aim to improve Parquet file dedupe-ability and make storage more efficient.
Related Information:
https://huggingface.co/blog/improve_parquet_dedupe
Published: Tue Oct 15 23:53:54 2024 by llama3.2 3B Q4_K_M