Parquet is a compressed columnar storage format for the Hadoop ecosystem (each row group typically kept under 1 GB).
It enables efficient scans with compression. Parquet supports deeply nested structures, efficient encoding and column compression schemes, and is designed to be compatible with a variety of higher-level type systems.
https://www.youtube.com/watch?v=Qfp6Uv1UrA0
Hybrid storage model: data is partitioned horizontally into row groups and vertically into column chunks within each row group; a pyarrow sketch below inspects this layout.
https://www.youtube.com/watch?v=1j8SdS7s_NY
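To see the hybrid layout concretely, pyarrow can dump the row-group and column-chunk metadata of any Parquet file. A minimal sketch (the path data.parquet is only a placeholder):

import pyarrow.parquet as pq

pf = pq.ParquetFile('data.parquet')  # placeholder path
meta = pf.metadata
print(meta.num_row_groups, 'row groups,', meta.num_columns, 'columns')
rg = meta.row_group(0)               # one horizontal slice of the data
for i in range(rg.num_columns):
    chunk = rg.column(i)             # one column chunk inside that row group
    print(chunk.path_in_schema, chunk.compression, chunk.statistics)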
Row-group filtering through predicate pushdown: each row group stores min/max statistics for each column, so row groups whose ranges cannot match a predicate are skipped. Dictionary filtering can additionally prune by exact values.
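As a hedged illustration of how those statistics are used from Python, pyarrow's read_table accepts a filters argument and skips row groups whose min/max ranges cannot satisfy the predicate (the column name price and the path are placeholders):

import pyarrow.parquet as pq

# Only row groups whose 'price' min/max range can contain values > 100 are actually read.
table = pq.read_table('data.parquet', filters=[('price', '>', 100)])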
Optimization: avoid many small files (per-file footer and statistics overhead) and a few huge files (roughly over 1 GB); the data needs to be repartitioned to reach a sensible file size.
Delta Lake can automate this repartitioning (file compaction); a manual PySpark sketch is shown below.
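A manual version of that repartitioning in PySpark, as a minimal sketch (the partition count of 16 and the paths are illustrative, and an existing SparkSession named spark is assumed):

df = spark.read.parquet('/tmp/raw')
df.repartition(16).write.mode('overwrite').parquet('/tmp/compacted')  # 16 similarly sized output files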
Spark Reading and Writing to Parquet Storage Format
https://www.youtube.com/watch?v=-ra0pGUw7fo
df.select('mycol', 'bla').write.partitionBy('mycol').mode('append').format('parquet').save('/tmp/foo')
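Reading the partitioned output back lets Spark prune whole partition directories on the partition column; a short follow-on sketch using the same placeholder path:

out = spark.read.parquet('/tmp/foo')
out.where("mycol = 'a'").show()  # only the mycol=a directory is scanned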
How Adobe Does 2 Million Records per second using Apache Spark
https://www.youtube.com/watch?v=rPgoPHAEYAM
val kafka = spark.readStream.format("kafka").option(...).load()
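The end-to-end shape of such a pipeline, written as a hedged PySpark Structured Streaming sketch (brokers, topic, and paths are placeholders, not Adobe's actual setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('kafka-to-parquet').getOrCreate()

kafka = (spark.readStream.format('kafka')
         .option('kafka.bootstrap.servers', 'localhost:9092')  # placeholder brokers
         .option('subscribe', 'events')                        # placeholder topic
         .load())

# Kafka keys/values arrive as binary; cast to strings before writing.
records = kafka.selectExpr('CAST(key AS STRING)', 'CAST(value AS STRING)')

query = (records.writeStream.format('parquet')
         .option('path', '/tmp/events_parquet')                   # placeholder sink path
         .option('checkpointLocation', '/tmp/events_checkpoint')  # required by the file sink
         .start())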
Read Parquet Files in Python
https://www.youtube.com/watch?v=XFO5jdGsMek
pip install pandas
pip install pyarrow
import pandas as pd
pd.read_parquet(parquet_file, engine='auto')
*parquet_file can be a single parquet file or a folder of parquet files
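Putting those pieces together, a self-contained sketch that writes and reads a Parquet file with pandas (the file name example.parquet is a placeholder):

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
df.to_parquet('example.parquet', engine='pyarrow')        # write one Parquet file

back = pd.read_parquet('example.parquet', engine='auto')  # also accepts a folder of Parquet files
print(back)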