Implementing a Dask scheduler and workers in Docker containers

  dask, dask-distributed, docker, parquet, python

I need to train a scikit-learn RandomForestClassifier across multiple processes in parallel. For that, I'm looking into setting up a Dask scheduler with N workers, where the scheduler and each worker run in a separate Docker container.
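The layout I have in mind could be sketched as a Docker Compose file along these lines; the service names, image tag, and worker count here are illustrative assumptions, not a fixed design:

```yaml
# Sketch of the intended topology: one scheduler, N workers (shown scaled via replicas).
# The daskdev/dask image ships both the dask-scheduler and dask-worker entrypoints.
services:
  scheduler:
    image: daskdev/dask
    command: dask-scheduler
    ports:
      - "8786:8786"   # scheduler port the client and workers connect to
      - "8787:8787"   # dashboard

  worker:
    image: daskdev/dask
    command: dask-worker tcp://scheduler:8786
    deploy:
      replicas: 3     # N workers
    depends_on:
      - scheduler
```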

The client application, which also runs in a separate Docker container, will first connect to the scheduler and then launch the scikit-learn training inside a `with joblib.parallel_backend('dask'):` block.
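In the client container, that would look roughly like the sketch below. The scheduler address is an assumption based on the Compose-style service name; it needs a running scheduler and workers to actually execute:

```python
# Sketch of the client-side code; assumes a Dask scheduler is reachable
# at tcp://scheduler:8786 (hypothetical service name) and that
# dask, distributed, scikit-learn, and joblib are installed.
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Connect to the scheduler container; registers the Dask joblib backend.
client = Client("tcp://scheduler:8786")

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)

# Route joblib's internal parallelism through the Dask workers.
with joblib.parallel_backend("dask"):
    clf.fit(X, y)
```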

The data is stored as Parquet files inside the client application's Docker container. What is the best practice for giving the workers access to this data?

Source: Docker Questions