I need to run a scikit-learn RandomForestClassifier with multiple processes in parallel. For that, I'm looking into deploying a Dask scheduler with N workers, where the scheduler and each worker run in separate Docker containers.
The client application, which also runs in a separate Docker container, will first connect to the scheduler and then initiate the scikit-learn training.
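A minimal sketch of what that client code could look like, using joblib's Dask backend to dispatch the per-tree fits to the workers. The scheduler address (e.g. `tcp://scheduler:8786`) depends on your Docker network and service names, so it is an assumption; this sketch falls back to an in-process cluster so it runs standalone.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# In the real deployment this would point at the scheduler container,
# e.g. Client("tcp://scheduler:8786") -- the hostname is hypothetical and
# depends on your compose/network setup. Here we use an in-process cluster
# so the sketch is runnable on its own.
client = Client(processes=False)

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)

# joblib's "dask" backend sends the parallel work (individual tree fits)
# to the Dask workers instead of local processes.
with joblib.parallel_backend("dask"):
    clf.fit(X, y)

client.close()
```

With a real scheduler address, the only change is the `Client(...)` line; the training call itself stays the same.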
The data is stored as Parquet files in the client application's Docker container. What is the best practice for giving the workers access to this data?
Source: Docker Questions