Python: Connect to a Spark cluster running in Docker

I am quite new to the Spark world and used online resources to stand up my Spark cluster with the following docker-compose file:

version: "3"
services:
  spark-master:
    image: fiziy/spark:latest
    container_name: spark-master
    hostname: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    networks:
      - spark-network
    environment:
      - "SPARK_LOCAL_IP=spark-master"
      - "SPARK_MASTER_PORT=7077"
      - "SPARK_MASTER_WEBUI_PORT=8080"
    command: "/run_master.sh"
  spark-worker:
    image: fiziy/spark:latest
    depends_on:
      - spark-master
    ports:
      - 8080
    networks:
      - spark-network
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
      - "SPARK_WORKER_WEBUI_PORT=8080"
    command: "/run_worker.sh"
networks:
  spark-network:
    driver: bridge
    ipam:
      driver: default

This brings up my cluster, and I can see the Spark web UI on localhost:8080, which tells me that the master is running at spark://spark-master:7077.
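
As a sanity check that the port mapping actually works from the host, something as simple as a plain TCP probe can be run on the Windows side (nothing Spark-specific here):

import socket

# Plain TCP check from the Windows host that the published master port answers.
# This only confirms the 7077 port mapping works; it says nothing about Spark itself.
with socket.create_connection(("localhost", 7077), timeout=5):
    print("Master port 7077 is reachable from the host")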

This is all well and good, but I don't know where to go from here. What I want is to connect to it using a Python script running locally on my Windows machine. I tried something like the following:

from pyspark.sql import SparkSession
import random

spark = SparkSession.builder \
    .master("spark://spark-master:7077") \
    .appName("Spark Test App") \
    .getOrCreate()

NUM_SAMPLES = 9

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = spark.sparkContext.parallelize(range(0, NUM_SAMPLES)) \
             .filter(inside).count()

print('Pi is roughly {}'.format(4.0 * count / NUM_SAMPLES))

This generates a bunch of warnings and errors, foremost of which is that it could not connect to the master. All the examples I have seen online are for connecting through a Jupyter notebook running in the same Docker setup, or for baking a Python app into the image itself.

My question is whether I can do iterative development locally using PyCharm or another IDE while connecting to the cluster running in Docker. I'll be really grateful if you can point me to some online resources or add to my code to make it work. Thanks.
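
For reference, this is the kind of driver-side configuration I suspect is needed when the driver runs on the Windows host and the executors run in containers, but I'm not sure these are the right settings – spark.driver.host and spark.driver.bindAddress are just my guess from the docs, and host.docker.internal is the Docker Desktop name for the host:

from pyspark.sql import SparkSession

# Sketch only: connect through the published port (localhost:7077 from the host)
# and advertise an address the containers can use to reach the driver back.
spark = (
    SparkSession.builder
    .master("spark://localhost:7077")
    .appName("Spark Test App")
    .config("spark.driver.host", "host.docker.internal")  # reachable from Docker Desktop containers
    .config("spark.driver.bindAddress", "0.0.0.0")        # bind locally on all interfaces
    .getOrCreate()
)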

UPDATE:

I changed spark://spark-master:7077 to spark://localhost:7077 and now I get a different error.

First I get:

20/02/27 21:52:16 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
20/02/27 21:52:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
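
(From what I've read, this first error just means winutils.exe is missing on Windows. What I would try is pointing HADOOP_HOME at a folder containing bin\winutils.exe before creating the session – C:\hadoop below is only an assumed location.)

import os

# Assumed layout: winutils.exe downloaded to C:\hadoop\bin\winutils.exe.
# On Windows, Spark/Hadoop look for %HADOOP_HOME%\bin\winutils.exe.
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] = os.environ["PATH"] + os.pathsep + r"C:\hadoop\bin"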

And then I get:

  File "C:devspark_clusterappvenvlibsite-packagespysparksqlutils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'Unsupported class file major version 55'

while in the Docker logs I can see executors being launched and then killed:

spark-worker_2  | 20/02/28 02:52:18 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.8-openjdk/bin/java" "-cp" "/spark/conf/:/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=55384" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@host.docker.internal:55384" "--executor-id" "0" "--hostname" "172.21.0.5" "--cores" "2" "--app-id" "app-20200228025218-0004" "--worker-url" "spark://Worker@172.21.0.5:39557"
spark-worker_3  | 20/02/28 02:52:18 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.8-openjdk/bin/java" "-cp" "/spark/conf/:/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=55384" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@host.docker.internal:55384" "--executor-id" "2" "--hostname" "172.21.0.3" "--cores" "2" "--app-id" "app-20200228025218-0004" "--worker-url" "spark://Worker@172.21.0.3:34405"
spark-worker_1  | 20/02/28 02:52:19 INFO Worker: Asked to kill executor app-20200228025218-0004/1
spark-worker_1  | 20/02/28 02:52:19 INFO ExecutorRunner: Runner thread for executor app-20200228025218-0004/1 interrupted
spark-worker_1  | 20/02/28 02:52:19 INFO ExecutorRunner: Killing process!
spark-master    | 20/02/28 02:52:19 INFO Master: Received unregister request from application app-20200228025218-0004
spark-master    | 20/02/28 02:52:19 INFO Master: Removing app app-20200228025218-0004
spark-master    | 20/02/28 02:52:19 INFO Master: 172.21.0.1:52272 got disassociated, removing it.
spark-master    | 20/02/28 02:52:19 INFO Master: host.docker.internal:55384 got disassociated, removing it.
spark-worker_2  | 20/02/28 02:52:19 INFO Worker: Asked to kill executor app-20200228025218-0004/0
spark-worker_2  | 20/02/28 02:52:19 INFO ExecutorRunner: Runner thread for executor app-20200228025218-0004/0 interrupted
spark-worker_2  | 20/02/28 02:52:19 INFO ExecutorRunner: Killing process!
spark-worker_3  | 20/02/28 02:52:19 INFO Worker: Asked to kill executor app-20200228025218-0004/2
spark-worker_3  | 20/02/28 02:52:19 INFO ExecutorRunner: Runner thread for executor app-20200228025218-0004/2 interrupted
spark-worker_3  | 20/02/28 02:52:19 INFO ExecutorRunner: Killing process!
spark-worker_2  | 20/02/28 02:52:19 INFO Worker: Executor app-20200228025218-0004/0 finished with state KILLED exitStatus 143
spark-master    | 20/02/28 02:52:19 WARN Master: Got status update for unknown executor app-20200228025218-0004/0

According to this question – Pyspark error – Unsupported class file major version 55 – the answer lies in having Java 8, but I already have that. What I have is the following (a quick check of the local JVM is sketched after the list):

  • openjdk:8-alpine (docker)
  • spark-2.4.5-bin-hadoop2.7 (docker)
  • python 3.7.6 (docker)
  • Python 3.7.6 with pyspark 2.4.5 (running locally)
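
For what it's worth, this is how I would double-check which JVM pyspark launches locally, in case the version mismatch is on the driver side rather than inside the containers; the JDK path below is only an assumption:

import os
import subprocess

# "java -version" writes to stderr, so that is where the version string appears.
print(subprocess.run(["java", "-version"], capture_output=True, text=True).stderr)

# If that reports Java 11, pointing JAVA_HOME at a local JDK 8 before the
# SparkSession is created should make pyspark 2.4.5 launch that JVM instead.
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_241"  # assumed install path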

Source: StackOverflow