Monday, March 29, 2021

Parquet format at Twitter

Compressed columnar format for Hadoop (row groups typically kept under 1 GB)

Efficient scans with compression. Parquet supports deeply nested structures, efficient encoding and column compression schemes, and is designed to be compatible with a variety of higher-level type systems.

https://www.youtube.com/watch?v=Qfp6Uv1UrA0
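
The nested-structure support is easy to see from Python. A minimal sketch (not from the talk; the file path and field names are made up) that writes a deeply nested record with pyarrow and reads its schema back:

import pyarrow as pa
import pyarrow.parquet as pq

# Nested data: a struct column holding a list of structs.
table = pa.table({
    'user': [
        {'name': 'ada', 'addresses': [{'city': 'London', 'zip': 'N1'}]},
        {'name': 'bob', 'addresses': []},
    ]
})
pq.write_table(table, '/tmp/nested.parquet')
print(pq.read_table('/tmp/nested.parquet').schema)  # shows the nested schema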


Hybrid storage model: data is partitioned horizontally into row groups and vertically into column chunks.

https://www.youtube.com/watch?v=1j8SdS7s_NY

Row group filtering through predicate pushdown. Each row group stores min/max statistics for each column, so a reader can skip row groups that cannot match a predicate. Dictionary filtering additionally lets readers filter by exact values.
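
A quick way to see these statistics is to read the file metadata with pyarrow. A sketch (the file path is hypothetical) that prints the per-row-group, per-column min/max values predicate pushdown relies on:

import pyarrow.parquet as pq

pf = pq.ParquetFile('/tmp/events.parquet')  # hypothetical file
for rg in range(pf.metadata.num_row_groups):
    for col in range(pf.metadata.num_columns):
        chunk = pf.metadata.row_group(rg).column(col)
        stats = chunk.statistics
        if stats is not None:
            print(rg, chunk.path_in_schema, stats.min, stats.max)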

Optimization: avoid many small files (per-file footer/stats overhead) and a few huge files (roughly > 1 GB); repartition the data to land in a reasonable size range.

Delta Lake can automate this repartitioning (file compaction, e.g. via its OPTIMIZE command).
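
In plain Spark the repartitioning is manual. A sketch, with assumed paths and an assumed partition count, that compacts a folder of small files into a fixed number of larger ones:

df = spark.read.parquet('/tmp/raw')  # hypothetical input folder of small files
df.repartition(16).write.mode('overwrite').parquet('/tmp/compacted')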


Spark Reading and Writing to Parquet Storage Format

https://www.youtube.com/watch?v=-ra0pGUw7fo

# PySpark: write selected columns as Parquet, partitioned by 'mycol'
df.select('mycol', 'bla').write.partitionBy('mycol').mode('append').format('parquet').save('/tmp/foo')
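
Reading the partitioned output back, a filter on the partition column lets Spark prune whole partition directories. A sketch (the filter value is hypothetical):

df2 = spark.read.parquet('/tmp/foo')
df2.where(df2.mycol == 'some_value').show()  # scans only mycol=some_value/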


How Adobe Does 2 Million Records per second using Apache Spark

https://www.youtube.com/watch?v=rPgoPHAEYAM

// Scala: Structured Streaming source reading from Kafka
val kafka = spark.readStream.format("kafka").option(...).load()
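
The talk's code is Scala; here is the same pipeline as a PySpark sketch, with hypothetical broker, topic, and paths, reading from Kafka and sinking to Parquet with Structured Streaming:

kafka = (spark.readStream.format('kafka')
         .option('kafka.bootstrap.servers', 'broker:9092')  # hypothetical broker
         .option('subscribe', 'events')                     # hypothetical topic
         .load())

query = (kafka.selectExpr('CAST(value AS STRING) AS value')
         .writeStream.format('parquet')
         .option('path', '/tmp/stream_out')
         .option('checkpointLocation', '/tmp/stream_ckpt')
         .start())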


Read Parquet Files in Python

https://www.youtube.com/watch?v=XFO5jdGsMek

pip install pandas

pip install pyarrow

import pandas as pd

df = pd.read_parquet(parquet_file, engine='auto')

*parquet_file can be a single parquet file or a folder of parquet files
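
With the pyarrow engine, pandas can also push column selection and simple row filters down to the reader. A sketch (path, column names, and value are hypothetical):

import pandas as pd

df = pd.read_parquet(
    '/tmp/foo',                               # single file or a folder
    engine='pyarrow',
    columns=['mycol', 'bla'],                 # read only these columns
    filters=[('mycol', '==', 'some_value')],  # skip non-matching row groups
)
print(df.head())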
