This article explains how to set up a playground environment for a data warehousing solution based on Apache Spark and Delta Lake.
- Install a Python version that matches the Delta Lake compatibility matrix.
- Check the Java version (the JDK used in your development environment, not just the runtime).
- Create a new Python virtual environment if required.
- pip install pyspark, pinned to a version compatible with Delta Lake per the matrix.
- pip install delta-spark; the version should again be based on the compatibility matrix.
- Activate the virtual environment and run the following command: pyspark
- The PySpark shell should load without any issues.
- Ensure the JAVA_HOME, PYTHON_HOME, and PYSPARK_HOME environment variables are defined.
- Stop pyspark using Ctrl+C (a consolidated example of all these steps follows this list).
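
The steps above can be collected into a short, runnable sequence. A minimal sketch for a Unix-like shell, assuming the version pins used elsewhere in this article (PySpark 3.3.4 and Delta Lake 2.1.0 per the compatibility matrix) and placeholder paths you should adjust for your machine:

# Check prerequisites against the compatibility matrix
python --version
java -version    # Spark 3.3.x runs on Java 8, 11, or 17

# Create and activate a virtual environment
python -m venv delta-playground
source delta-playground/bin/activate

# Install matrix-compatible versions
pip install pyspark==3.3.4
pip install delta-spark==2.1.0

# Environment variables referenced above (placeholder values; adjust for your machine)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
export PYTHON_HOME="$(dirname "$(which python)")"
export PYSPARK_HOME="$(python -c 'import pyspark, os; print(os.path.dirname(pyspark.__file__))')"

# Smoke test: the PySpark shell should start cleanly; stop it with Ctrl+C
pyspark
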
Testing the Delta Lake setup and environment
- Run the following command:
pyspark --packages io.delta:delta-core_2.12:2.1.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
- This should start a Spark session with Delta Lake enabled (a script-based alternative is sketched below).
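
If you would rather configure the session from a Python script than pass flags to pyspark, delta-spark provides a helper that applies the same two settings and resolves the Delta package for you. A minimal sketch following the pattern in the official getting-started guide (the app name is arbitrary):

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Same two settings as the pyspark command above, applied via the builder
builder = (
    SparkSession.builder.appName("delta-playground")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# Resolves the matching Delta artifact and builds the session
spark = configure_spark_with_delta_pip(builder).getOrCreate()
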
- Validate storage by writing a small Delta table (a read-back check follows these steps):
- data = spark.range(0, 5)
- data.write.format("delta").save("/tmp/delta-table")
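
To confirm the write produced a usable Delta table, read it back in the same session; the path matches the save call above:

# Read the table back and display the five rows written above
df = spark.read.format("delta").load("/tmp/delta-table")
df.show()
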
References
- Delta Lake getting-started guide (official documentation): https://delta.io/learn/getting-started
- delta-spark on PyPI: https://pypi.org/project/delta-spark
- pyspark 3.3.4 on PyPI: https://pypi.org/project/pyspark/3.3.4
- Apache Spark / Delta Lake compatibility matrix: https://stackoverflow.com/questions/76066363/unable-to-write-df-in-delta-format-on-hdfs