This article explains how to set up a playground environment for a data warehousing solution based on Apache Spark and Delta Lake.
- Install a Python version that matches the Delta Lake compatibility matrix.
- Check the Java version (the JDK used in your development environment, not just the runtime).
- Create a new Python virtual environment if required.
- pip install pyspark, pinned to a version compatible with Delta Lake per the matrix.
- pip install delta-spark; the version should again be based on the compatibility matrix.
- Activate the virtual environment and run the following command: pyspark
- The PySpark shell should load without any issues.
- Ensure the JAVA_HOME, PYTHON_HOME, and PYSPARK_HOME environment variables are defined.
- Stop pyspark using Ctrl+C (a consolidated example of all these steps follows this list).
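
The steps above can be collected into a short, runnable sequence. A minimal sketch for a Unix-like shell, assuming the version pins used elsewhere in this article (PySpark 3.3.4 and Delta Lake 2.1.0 per the compatibility matrix) and placeholder paths you should adjust for your machine:

# Check prerequisites against the compatibility matrix
python --version
java -version    # Spark 3.3.x runs on Java 8, 11, or 17

# Create and activate a virtual environment
python -m venv delta-playground
source delta-playground/bin/activate

# Install matrix-compatible versions
pip install pyspark==3.3.4
pip install delta-spark==2.1.0

# Environment variables referenced above (placeholder values; adjust for your machine)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
export PYTHON_HOME="$(dirname "$(which python)")"
export PYSPARK_HOME="$(python -c 'import pyspark, os; print(os.path.dirname(pyspark.__file__))')"

# Smoke test: the PySpark shell should start cleanly; stop it with Ctrl+C
pyspark
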
Testing the Delta Lake setup and environment
- Run the following command:
pyspark --packages io.delta:delta-core_2.12:2.1.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
- This should start a Spark session with Delta Lake enabled (a script-based alternative is sketched below).
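
If you would rather configure the session from a Python script than pass flags to pyspark, delta-spark provides a helper that applies the same two settings and resolves the Delta package for you. A minimal sketch following the pattern in the official getting-started guide (the app name is arbitrary):

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Same two settings as the pyspark command above, applied via the builder
builder = (
    SparkSession.builder.appName("delta-playground")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# Resolves the matching Delta artifact and builds the session
spark = configure_spark_with_delta_pip(builder).getOrCreate()
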
- Validate storage by writing a small Delta table (a read-back check follows these steps):
- data = spark.range(0, 5)
- data.write.format("delta").save("/tmp/delta-table")
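
To confirm the write produced a usable Delta table, read it back in the same session; the path matches the save call above:

# Read the table back and display the five rows written above
df = spark.read.format("delta").load("/tmp/delta-table")
df.show()
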
References
- Delta Lake getting-started guide (official documentation): https://delta.io/learn/getting-started
- delta-spark on PyPI: https://pypi.org/project/delta-spark
- pyspark 3.3.4 on PyPI: https://pypi.org/project/pyspark/3.3.4
- Apache Spark / Delta Lake compatibility matrix: https://stackoverflow.com/questions/76066363/unable-to-write-df-in-delta-format-on-hdfs