Parquet

Overview

Parquet is a columnar storage file format optimized for big data processing frameworks, allowing efficient storage and retrieval of large datasets. Maintained as an Apache Software Foundation project, Parquet supports complex nested data structures and provides efficient compression and encoding schemes.

Key Concepts

Columnar storage: organizes data by column rather than by row, so analytical queries read only the columns they need.
Row groups: horizontal partitions of the data (batches of rows) that can be processed in parallel.
Column chunks: the values of a single column within one row group.
Encoding: applies algorithms suited to each data type, such as dictionary, run-length (RLE), and delta encoding.
Compression: applies a codec (Snappy, Gzip, ZSTD) to the data in each column chunk.
Schema: defines column data types and nested structure; nesting is encoded using definition and repetition levels.
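The concepts above can be illustrated with a toy model in plain Python. This is a minimal sketch, not how any Parquet library is implemented: the sample records, the function names, and the group size of 3 are all assumptions made for illustration, and real Parquet encodings are binary formats with more detail than shown here.

```python
from itertools import groupby

# Hypothetical sample records, stored row by row as a CSV-like format would.
rows = [
    {"city": "Oslo", "temp": 3},
    {"city": "Oslo", "temp": 4},
    {"city": "Oslo", "temp": 4},
    {"city": "Lima", "temp": 19},
    {"city": "Lima", "temp": 21},
    {"city": "Lima", "temp": 22},
]

def to_row_groups(rows, rows_per_group):
    """Split rows into row groups, then pivot each group into column chunks."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        batch = rows[start:start + rows_per_group]
        # One column chunk per column: all of that column's values in this group.
        groups.append({name: [r[name] for r in batch] for name in batch[0]})
    return groups

def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary, index_of, indices = [], {}, []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]

def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

row_groups = to_row_groups(rows, rows_per_group=3)
dictionary, indices = dictionary_encode([r["city"] for r in rows])

print(row_groups[0])                           # column chunks of the first row group
print(dictionary, run_length_encode(indices))  # dictionary plus RLE over the indices
print(delta_encode([r["temp"] for r in rows])) # small deltas instead of raw values
```

Grouping values by column is what makes these encodings pay off: values in one column share a type and often repeat or change slowly, which is rarely true of adjacent fields in a row.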

Advantages

Advantage           Description
Compression         Often 3-10x smaller than equivalent CSV, depending on the data and codec
Column pruning      Reads only the needed columns
Predicate pushdown  Filters data during the read
Type preservation   Maintains data types
Nested data         Supports complex structures
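Column pruning and predicate pushdown can be sketched with a toy in-memory model. This is an illustration under stated assumptions, not a real reader: `make_row_group`, `scan`, and the sample values are hypothetical, and the per-column minimum/maximum statistics stand in for the statistics a Parquet file keeps in its metadata.

```python
# Toy "file": a list of row groups, each holding column chunks plus
# min/max statistics per column. All names and values are hypothetical.
def make_row_group(columns):
    stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
    return {"columns": columns, "stats": stats}

file_row_groups = [
    make_row_group({"id": [1, 2, 3], "temp": [3, 4, 4], "city": ["Oslo"] * 3}),
    make_row_group({"id": [4, 5, 6], "temp": [19, 21, 22], "city": ["Lima"] * 3}),
]

def scan(row_groups, columns, filter_column, min_value):
    """Return the requested columns for rows where filter_column >= min_value."""
    result = {name: [] for name in columns}
    groups_read = 0
    for group in row_groups:
        _, col_max = group["stats"][filter_column]
        # Predicate pushdown: skip the whole row group when its statistics
        # prove that no row can match.
        if col_max < min_value:
            continue
        groups_read += 1
        keep = [v >= min_value for v in group["columns"][filter_column]]
        # Column pruning: touch only the requested column chunks.
        for name in columns:
            chunk = group["columns"][name]
            result[name].extend(v for v, k in zip(chunk, keep) if k)
    return result, groups_read

result, groups_read = scan(file_row_groups, ["id", "city"], "temp", 20)
print(result, groups_read)
```

In this sketch the first row group is never read because its maximum `temp` is below the threshold, and the `temp` values themselves are not returned because only `id` and `city` were requested.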

Appendix

Created: 2025-12-13 | Modified: 2025-12-13
