This article explains how to set up a playground environment for a data warehousing solution built on Apache Spark and Delta Lake.

  1. Install a Python version that matches the Delta Lake compatibility matrix.
  2. Check the Java version (the development kit, not just the runtime).
  3. Create a new virtual environment if required.
  4. pip install pyspark, pinning the version that the matrix lists as compatible with Delta Lake (e.g. pip install pyspark==3.3.4).
  5. pip install delta-spark, again taking the version from the compatibility matrix (e.g. pip install delta-spark==2.1.0).
  6. Activate the virtual environment and run the following command: pyspark
    • pyspark should load without any errors (see the sanity check after this list).
    • Ensure the JAVA_HOME, PYTHON_HOME, and PYSPARK_HOME environment variables are defined.
  7. Stop pyspark using Ctrl+C.
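
Before launching the shell, you can confirm from Python that the virtual environment picks up the expected versions. This is a minimal sketch; the 3.3.4 in the comment assumes the pyspark pin suggested in step 4.

    import os

    import pyspark

    # Should print the version pinned from the compatibility matrix, e.g. 3.3.4
    print(pyspark.__version__)

    # Should point at a JDK installation; Spark will not start without one
    print(os.environ.get("JAVA_HOME"))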

Testing the Delta Lake setup and environment

  1. Run the following command:
    • pyspark --packages io.delta:delta-core_2.12:2.1.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
  2. This should start the Spark session with the Delta Lake package loaded.
  3. Validate storage by writing a small Delta table (a read-back check follows this list):
    • data = spark.range(0, 5)
    • data.write.format("delta").save("/tmp/delta-table")
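
The same test can be driven from a standalone script instead of the pyspark shell: the delta-spark package provides a configure_spark_with_delta_pip helper that attaches the matching delta-core artifact to the session builder. The sketch below applies the two --conf settings from step 1, rewrites the table from step 3, and reads it back; the app name and the mode("overwrite") call are illustrative choices, not part of the original steps.

    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    # Same two Delta settings that the command line passes via --conf
    builder = (
        SparkSession.builder.appName("delta-playground")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Write the sample range as a Delta table, then read it back to verify
    data = spark.range(0, 5)
    data.write.format("delta").mode("overwrite").save("/tmp/delta-table")
    spark.read.format("delta").load("/tmp/delta-table").show()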

Reference

  • Delta Lake getting-started documentation: https://delta.io/learn/getting-started
  • delta-spark on PyPI: https://pypi.org/project/delta-spark
  • pyspark 3.3.4 on PyPI: https://pypi.org/project/pyspark/3.3.4
  • Apache Spark / Delta Lake compatibility matrix (Stack Overflow discussion): https://stackoverflow.com/questions/76066363/unable-to-write-df-in-delta-format-on-hdfs

