Trying to distribute data processing across a cluster and then aggregate the results on the master

Right now I have a Python application that runs 50 threads to process data. It takes an xlsx file, processes a list of values, and outputs a simple CSV.
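To give an idea of the shape of it, the current app is roughly like this (the function and column choices here are just illustrative, and I'm assuming pandas + openpyxl for the xlsx read):

    # Rough sketch of the current single-machine app.
    from concurrent.futures import ThreadPoolExecutor
    import csv

    import pandas as pd

    NUM_THREADS = 50

    def process_value(value):
        # stand-in for the real per-value work (lookups, HTTP requests, etc.)
        return value, str(value).upper()

    def main():
        df = pd.read_excel("dataset.xlsx")      # requires openpyxl
        values = df.iloc[:, 0].tolist()         # the column of values to process

        with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
            results = list(pool.map(process_value, values))

        with open("out.csv", "w", newline="") as f:
            csv.writer(f).writerows(results)

    if __name__ == "__main__":
        main()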

I said to myself: since this is a simple Python app with 50 threads, how can I create a cluster to distribute the data processing even further? For example: have each worker node process a subset given to it by the master. That sounds easy enough: have the master slice up the dataset and push the pieces to the workers with load balancing.
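The slicing part I can picture; something like this on the master (just a sketch, the chunk count is arbitrary and the actual "push" mechanism is exactly the part I'm unsure about):

    # Master-side slicing sketch: split the rows into NUM_WORKERS chunks.
    # How each chunk reaches its worker (HTTP, queue, shared volume) is open.
    import math

    import pandas as pd

    NUM_WORKERS = 4

    def slice_dataset(path="dataset.xlsx", num_workers=NUM_WORKERS):
        df = pd.read_excel(path)
        chunk_size = math.ceil(len(df) / num_workers)
        return [df.iloc[i * chunk_size:(i + 1) * chunk_size]
                for i in range(num_workers)]

    if __name__ == "__main__":
        for i, chunk in enumerate(slice_dataset()):
            # placeholder: write each chunk somewhere worker i can pick it up
            chunk.to_csv(f"chunk_{i}.csv", index=False)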

How do I get the results back, though? I would want to take all the results (out.csv in this case), return them to the master, and merge them into a single master_out.csv.
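The merge itself seems like the easy part, assuming every worker drops its own out_&lt;i&gt;.csv somewhere the master can read (the results/ path here is made up):

    # Master-side merge sketch: concatenate every worker's CSV into one file.
    import glob

    import pandas as pd

    def merge_results(pattern="results/out_*.csv", target="master_out.csv"):
        parts = [pd.read_csv(p, header=None) for p in sorted(glob.glob(pattern))]
        pd.concat(parts, ignore_index=True).to_csv(target, index=False, header=False)

    if __name__ == "__main__":
        merge_results()

The hard part is getting the files to a place where that merge can run.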

At first I was thinking of Docker Swarm, but no one I know uses it; everything beyond a simple Docker container gets offloaded to Kubernetes.

Right now, I have a simple file structure:

app/
  __init__.py (everything is in this file)
  dataset.xlsx
  out.csv

I was thinking of creating a Docker image so I could move this app into it, run the usual update/upgrade, install Python 3 if it isn't there already, and then just run the application.

As I got deeper into the processing side, I realized there are probably some built-in ways to handle this: create a Flask app on each worker to handle ingestion, another Flask app on the master to accept the files on completion, and so on. But then the master needs to know about all the workers, etc.
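For instance, the worker side could be as simple as this (purely a sketch; the /process route, port, and processing function are made up, and it still leaves the "master needs to know all the workers" problem open):

    # Minimal Flask worker sketch: accept a chunk of rows as CSV,
    # process them, and send the resulting CSV back in the response.
    import csv
    import io

    from flask import Flask, Response, request

    app = Flask(__name__)

    def process_value(value):
        # stand-in for the real per-value work
        return value, str(value).upper()

    @app.route("/process", methods=["POST"])
    def process_chunk():
        rows = csv.reader(io.StringIO(request.data.decode("utf-8")))
        out = io.StringIO()
        writer = csv.writer(out)
        for row in rows:
            if not row:
                continue
            writer.writerow(process_value(row[0]))
        return Response(out.getvalue(), mimetype="text/csv")

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)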

  • I was thinking of creating a cluster.
  • The master node has access to a volume which contains the file I need to process.
  • Load balancing pushes a part of the file (ROWS / NUM_WORKERS) to each node (see the sketch after this list).
  • After the workers finish, the master aggregates the resulting CSV files into a master file.
  • master_out.csv will exist in the folder for consumption.
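The worker's share of that list is the part I can already picture in code: if each worker knows its own index and the total worker count (WORKER_INDEX and NUM_WORKERS below are just assumed env vars; a Kubernetes Indexed Job, for instance, exposes a completion index as JOB_COMPLETION_INDEX), it can compute its own slice of the rows and write out_&lt;index&gt;.csv to the shared volume, so nothing actually needs to be pushed:

    # Worker-side sketch: compute my row range from my index and the worker
    # count, process just that slice, and write out_<index>.csv to the
    # shared volume (/data is an assumed mount point) for the master to merge.
    import csv
    import math
    import os

    import pandas as pd

    def process_value(value):
        # stand-in for the real per-value work
        return value, str(value).upper()

    def main():
        index = int(os.environ.get("WORKER_INDEX", "0"))
        workers = int(os.environ.get("NUM_WORKERS", "1"))

        df = pd.read_excel("/data/dataset.xlsx")
        chunk_size = math.ceil(len(df) / workers)
        my_rows = df.iloc[index * chunk_size:(index + 1) * chunk_size]

        with open(f"/data/out_{index}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            for value in my_rows.iloc[:, 0]:
                writer.writerow(process_value(value))

    if __name__ == "__main__":
        main()

The master would then only need the merge step from above once all the workers have finished.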

So the cluster would turn on, run everything once it's ready, and then tear down at the end. Since the cluster will likely need to be distributed, I am not sure how that would work, because the processing has IP address limitations: it seems like this won't work on a local cluster, because the machines being referenced will hit a Cloudflare (or similar) wall after enough requests from the same IP, so I'm trying to think of a unique-IP solution.

I have an idea for the architecture, but I'm not sure if I should create a Dockerfile for this and then figure out how Kubernetes can handle all of it for me. Though I think that in the Kubernetes config files we can put remote AWS instance login credentials so it will spin up all the remote servers.

While I have been doing some work with Swarm, it seems that Kubernetes is where the real work gets done, as Swarm seems better suited to other things.

I'm trying to think of how I would approach this from a Kubernetes (or Swarm) perspective.

Given all of that, this concept reminds me less of load balancing (because of the data aggregation step) and more of something like Kubeflow, where you create a cloud specifically for ML, except instead of ML it would be any kind of distributed processing.

Source: StackOverflow