Running Spark on Docker using bitnami image

  apache-spark, docker, pyspark, python

I’m trying to run Spark in a Docker container from a Python app located in another container. This is my docker-compose.yml:

version: '3'
services:
  spark-master:
    image: docker.io/bitnami/spark:2
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - type: bind
        source: ./conf/log4j.properties
        target: /opt/bitnami/spark/conf/log4j.properties
    ports:
      - '8080:8080'
      - '7077:7077'
    networks:
      - spark
    container_name: spark
  spark-worker-1:
    image: docker.io/bitnami/spark:2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - type: bind
        source: ./conf/log4j.properties
        target: /opt/bitnami/spark/conf/log4j.properties
    ports:
      - '8081:8081'
    container_name: spark-worker
    networks:
      - spark
    depends_on:
      - spark-master
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    ports:
      - 22181:2181
    container_name: zookeeper
    networks: 
      - rmoff_kafka
  kafka:
    image: confluentinc/cp-kafka:5.5.0
    depends_on:
      - zookeeper
    ports:
      - 9092:9092
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    container_name: kafka
    networks: 
      - rmoff_kafka
  app:
    build:
      context: ./
    depends_on: 
      - kafka
    ports:
      - 5000:5000
    container_name: app
    networks: 
      - rmoff_kafka

networks:
  spark:
    driver: bridge
  rmoff_kafka:
    name: rmoff_kafka
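For reference, the session code below reads SPARK_MASTER_URL and SPARK_DRIVER_HOST from the environment, but neither is set on the app service, and app is only attached to the rmoff_kafka network, so the hostname spark would not resolve from it. A fragment like this (values assumed to match what the code expects) could supply them:

```yaml
  app:
    build:
      context: ./
    depends_on:
      - kafka
    environment:
      # assumed names, matching the os.environ.get() calls in the Python code
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_DRIVER_HOST=app
    ports:
      - 5000:5000
    container_name: app
    networks:
      - rmoff_kafka
      - spark   # without this, "spark" does not resolve from the app container
```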

When I try to create a SparkSession:

import os

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.setAll(
    [
        (
            "spark.master",
            os.environ.get("SPARK_MASTER_URL", "spark://spark:7077"),
        ),
        ("spark.driver.host", os.environ.get("SPARK_DRIVER_HOST", "local[*]")),
        ("spark.submit.deployMode", "client"),
        ("spark.ui.showConsoleProgress", "true"),
        ("spark.driver.bindAddress", "0.0.0.0"),
        ("spark.app.name", app_name),
    ]
)

spark_session = SparkSession.builder.config(conf=conf).getOrCreate()

I get an error related to Java:

JAVA_HOME is not set
Exception: Java gateway process exited before sending its port number

I suppose I have to install Java or set the JAVA_HOME environment variable, but I don’t know exactly how to tackle the problem. Should I install Java in the Spark container or in the container from which I run the Python script?
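For what it’s worth: PySpark launches a JVM in the same container as the driver, so the “Java gateway process exited before sending its port number” error comes from the app container, not from the Spark containers (the bitnami images already ship Java). A minimal Dockerfile sketch for installing a JRE in the app image, assuming a Debian-based Python base image and an app.py entrypoint (both assumptions — adjust to the actual project):

```dockerfile
FROM python:3.9-slim

# The PySpark driver spawns a local JVM, so this image needs Java too.
# Package name and JAVA_HOME path vary by Debian release and architecture
# (e.g. openjdk-17-jre-headless on newer releases).
RUN apt-get update && \
    apt-get install -y --no-install-recommends openjdk-11-jre-headless && \
    rm -rf /var/lib/apt/lists/*
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "app.py"]
```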

Source: Docker Questions
