AWS Deep Learning Container – Insufficient Memory


I’m running a training job in an AWS Deep Learning Container on a p3 EC2 instance running the AWS Deep Learning AMI.

The EC2 instance has one GPU with 12 GB of GPU memory and 60 GB of RAM.

When I try to run my training job, I get an insufficient memory error as soon as training starts.

The built Docker image is 9.73 GB, but as I understand it, the container has full access to the host’s memory and storage, so I’m not sure why I’m getting this error.
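For reference, here is a minimal sketch of what I could add at the top of train.py to log how much memory the container actually sees (this assumes psutil is installed, which it isn’t in my requirements.txt yet; the helper name is just for illustration):

import psutil
import torch

def log_memory():
    # RAM as seen from inside the container (should match the host if no limit is set)
    vm = psutil.virtual_memory()
    print("RAM: {:.1f} GB total, {:.1f} GB available".format(vm.total / 1e9, vm.available / 1e9))
    # GPU memory currently held by PyTorch (function names as of PyTorch 1.2)
    if torch.cuda.is_available():
        print("GPU: {:.0f} MB allocated, {:.0f} MB cached".format(
            torch.cuda.memory_allocated() / 1e6, torch.cuda.memory_cached() / 1e6))

log_memory()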

I’ve got a very simple Dockerfile.

# AWS Deep Learning Containers base image: PyTorch 1.2.0, Python 3.6, CUDA 10.0, Ubuntu 16.04
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.2.0-gpu-py36-cu100-ubuntu16.04

# Copy the project into the image and work from its src directory
ADD . "/home/app/"
WORKDIR "/home/app/src"

# Install project dependencies on top of the base image
RUN apt-get update
RUN pip install --no-cache-dir -r requirements.txt

ENTRYPOINT ["python", "train.py"]

I’m reading my dataset from S3 via PyTorch’s Dataset class.

# ImageTransformer, DatasetDownloader, read_image and read_mask come from
# elsewhere in the project; the standard imports are shown for completeness.
import os
import shutil

import boto3
from torch.utils.data import Dataset


class SegmentationDataset(Dataset, ImageTransformer):

    def __init__(self, dataset_path, split, target_size, classes):
        super().__init__(split=split, target_size=target_size, classes=classes)

        # Recreate a clean local directory for this split.
        self.img_dir = "{}/{}".format(dataset_path, 'train' if self.split == 'train' else 'val')
        if os.path.exists(self.img_dir):
            shutil.rmtree(self.img_dir)
        os.mkdir(self.img_dir)

        # Collect the S3 keys for images and labels; nothing is downloaded here.
        downloader = DatasetDownloader(self.img_dir)
        self.img_paths, self.lbl_paths = downloader.read_training_paths()

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, index):
        # Download the image for this index from S3 into /tmp.
        client = boto3.resource("s3")
        bucket = client.Bucket("bucket")
        img_path = "/tmp/{}".format(os.path.basename(self.img_paths[index]))
        bucket.download_file(self.img_paths[index], img_path)

        img = read_image(img_path)
        if self.split == 'test':
            return self.prepare_image(img), img_path, img.shape

        # Download the corresponding label mask.
        lbl_path = "/tmp/{}".format(os.path.basename(self.lbl_paths[index]))
        bucket.download_file(self.lbl_paths[index], lbl_path)
        lbl = read_mask(lbl_path, binarize=True, zeroone=True)

        return self.prepare_image(img, lbl)

self.img_paths is just a list of S3 file paths. Nothing is actually downloaded until __getitem__ is called, and even then the downloads happen batch by batch, yet I get the error after downloading only a few images.
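For completeness, this is roughly how the dataset is consumed in train.py (a sketch only; the constructor arguments and the batch_size / num_workers values here are placeholders, not my exact settings):

from torch.utils.data import DataLoader

# Placeholder arguments for illustration; the real values live in train.py
dataset = SegmentationDataset(dataset_path="data", split="train",
                              target_size=(512, 512), classes=["foreground"])
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=4)

for batch in loader:
    # Each batch triggers __getitem__ (and therefore the S3 downloads)
    # only for the indices in that batch
    ...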

I run my Docker container with docker run --gpus all <image id>.

Any help would be much appreciated.
