I am trying to run a distributed computation using Dask on an AWS Fargate cluster (using the dask.cloudprovider API), and I am running into the exact same issue as this question. Based on the partial answers to the linked question, and on things like this, I strongly suspect the cause is an outdated pandas version on my workers; indeed, the official Dask Dockerfile specifies an old-ish version of pandas.
By contrast, when I run my computation locally (using a distributed.LocalCluster) with pandas 1.2.2, it works fine. Btw, it is a call to the categorize method on a Dask DataFrame that triggers the error in the Fargate cluster case.
As a workaround, I would like to choose the pandas version in the image deployed to the workers myself, either by rewriting the Dockerfile or through some other method. Is there a way to achieve this?
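To make the question concrete, here is a rough sketch of the kind of workaround meant, not a confirmed solution. It assumes that FargateCluster accepts `image` and `environment` keyword arguments, and that the official daskdev/dask image installs whatever is listed in an `EXTRA_PIP_PACKAGES` environment variable at container start-up. The helper names `fargate_kwargs` and `start_cluster` are hypothetical, and the import path may differ between dask-cloudprovider versions.

```python
# Sketch only: the helper names are made up for illustration, and the
# image/environment behaviour described here is an assumption to verify.

PANDAS_VERSION = "1.2.2"  # the version that works with LocalCluster


def fargate_kwargs(pandas_version: str = PANDAS_VERSION, n_workers: int = 2) -> dict:
    """Build keyword arguments that pin pandas on scheduler and workers."""
    return {
        # Either the stock image, or a custom image built from a modified Dockerfile.
        "image": "daskdev/dask:latest",
        # Assumed: the stock image pip-installs these packages at start-up.
        "environment": {"EXTRA_PIP_PACKAGES": f"pandas=={pandas_version}"},
        "n_workers": n_workers,
    }


def start_cluster():
    """Create the cluster; needs dask-cloudprovider and AWS credentials."""
    # Imported lazily so the module can be loaded without the package.
    # Older releases may expose this as `from dask_cloudprovider import FargateCluster`.
    from dask_cloudprovider.aws import FargateCluster

    return FargateCluster(**fargate_kwargs())
```

The alternative would be to build a custom image with the desired pandas version and pass its name as `image`, which avoids installing packages every time a container starts.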