How can I run a Python script inside a Docker container through Nextflow seamlessly, without any path- or environment-related issues?

  aws-batch, bash, docker, nextflow, python

I am trying to run a Python script using Nextflow and Docker. I build the Docker image from the Dockerfile shown below, and the Nextflow script simply launches the Python script. When I run the same Python command from within the Docker container (in interactive mode), it works fine. But when I launch it through Nextflow with the Docker container, it throws an error.

Dockerfile:

#!/usr/local/bin/docker
# -*- version: 20.10.2 -*-

############################################
## MULTI-STAGE CONTAINER CONFIGURATION ##
FROM python:3.6.2
RUN apt-get update && apt-get install -y \
    apt-transport-https \
    software-properties-common \
    unzip \
    curl
RUN wget -O- https://apt.corretto.aws/corretto.key | apt-key add - && \
    add-apt-repository 'deb https://apt.corretto.aws stable main' && \
    apt-get update && \
    apt-get install -y java-1.8.0-amazon-corretto-jdk


############################################
## PHEKNOWLATOR (PKT_KG) PROJECT SETTINGS ##
# create needed project directories
WORKDIR /PKT
RUN mkdir -p /PKT
RUN mkdir -p /PKT/resources
RUN mkdir -p /PKT/resources/construction_approach
RUN mkdir -p /PKT/resources/edge_data
RUN mkdir -p /PKT/resources/knowledge_graphs
RUN mkdir -p /PKT/resources/node_data
RUN mkdir -p /PKT/resources/ontologies
RUN mkdir -p /PKT/resources/processed_data
RUN mkdir -p /PKT/resources/relations_data

# copy scripts/files needed to run pkt_kg
COPY pkt_kg /PKT/pkt_kg
COPY Main.py /PKT
COPY setup.py /PKT
COPY README.rst /PKT
COPY resources /PKT/resources

# download and copy needed data
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/edge_source_list.txt && mv edge_source_list.txt resources/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ontology_source_list.txt && mv ontology_source_list.txt resources/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/resource_info.txt && mv resource_info.txt resources/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/subclass_construction_map.pkl && mv subclass_construction_map.pkl resources/construction_approach/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/PheKnowLator_MergedOntologies.owl && mv PheKnowLator_MergedOntologies.owl resources/knowledge_graphs/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/node_metadata_dict.pkl && mv node_metadata_dict.pkl resources/node_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/DISEASE_MONDO_MAP.txt && mv DISEASE_MONDO_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ENSEMBL_GENE_ENTREZ_GENE_MAP.txt && mv ENSEMBL_GENE_ENTREZ_GENE_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt && mv ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt && mv GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/HPA_GTEx_TISSUE_CELL_MAP.txt && mv HPA_GTEx_TISSUE_CELL_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/MESH_CHEBI_MAP.txt && mv MESH_CHEBI_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/PHENOTYPE_HPO_MAP.txt && mv PHENOTYPE_HPO_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/STRING_PRO_ONTOLOGY_MAP.txt && mv STRING_PRO_ONTOLOGY_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt && mv UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/INVERSE_RELATIONS.txt && mv INVERSE_RELATIONS.txt resources/relations_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/RELATIONS_LABELS.txt && mv RELATIONS_LABELS.txt resources/relations_data/

# install needed python libraries
RUN pip install --upgrade pip setuptools
WORKDIR /PKT
RUN pip install .


############################################
## GLOBAL ENVIRONMENT SETTINGS ##
# copy files needed to run docker container
COPY entrypoint.sh /PKT

# update permissions for all files
RUN chmod -R 755 /PKT

# set OWLTools memory (set to a high value, system will only use available memory)
ENV OWLTOOLS_MEMORY=500g
RUN echo $OWLTOOLS_MEMORY

# set python environment encoding
# (ENV persists into the running container; RUN export only affects that one build step)
ENV PYTHONIOENCODING=utf-8

Name of the Docker image: pkt:2.0.0

Nextflow script:

process run_PKTBaseRun{

echo True

container 'pkt:2.0.0'
publishDir "${params.outDir}", mode: 'copy'

output:
file '*' into output_ch

script:
"""
which python
echo $PWD
pwd
python /PKT/Main.py --onts /PKT/resources/ontology_source_list.txt \
            --edg /PKT/resources/edge_source_list.txt \
            --res /PKT/resources/resource_info.txt \
            --out /PKT/resources/knowledge_graphs --app subclass --kg full --nde yes --rel yes --owl no
"""


}

Now when I execute:

nextflow run main.nf

Then this gives an error related to the glob.glob module, as it does not list the files inside the Docker container as it should.

However, when I simply run the Python command above inside the Docker container, it runs seamlessly.

> docker run -it pkt:2.0.0 /bin/bash

/PKT> python Main.py --onts resources/ontology_source_list.txt \
            --edg resources/edge_source_list.txt \
            --res resources/resource_info.txt \
            --out resources/knowledge_graphs --app subclass --kg full --nde yes --rel yes --owl no

Only when I combine Nextflow with Docker does this code throw errors.
I have ensured that the Python being used is the one inside the container.

Questions:

  1. Any ideas/thoughts to make it work?

Interestingly,
the output of which python -> python within the container
BUT,
the output of $PWD -> directory from where Nextflow is launched
the output of pwd -> work directory of Nextflow
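
If the code behind Main.py builds its glob patterns relative to the current working directory (an assumption; those internals are not shown here), the difference in pwd alone could explain the empty listings. Below is a minimal Python sketch of that effect. The temporary directories are hypothetical stand-ins for /PKT and the Nextflow work directory, and the helper names are made up for illustration:

```python
import glob
import os
import tempfile


def list_relative(pattern):
    """Glob a relative pattern; the result depends on the current working directory."""
    return sorted(glob.glob(pattern))


def list_anchored(base_dir, pattern):
    """Glob the same pattern anchored at an explicit base directory."""
    matches = glob.glob(os.path.join(base_dir, pattern))
    return sorted(os.path.relpath(path, base_dir) for path in matches)


def demo():
    """Compare relative and anchored globbing from two different working directories."""
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as project, tempfile.TemporaryDirectory() as workdir:
        # 'project' stands in for /PKT, 'workdir' for the Nextflow work directory
        os.makedirs(os.path.join(project, "resources"))
        open(os.path.join(project, "resources", "edge_source_list.txt"), "w").close()
        try:
            os.chdir(project)
            from_project = list_relative("resources/*.txt")
            os.chdir(workdir)
            from_workdir = list_relative("resources/*.txt")
            anchored = list_anchored(project, "resources/*.txt")
        finally:
            # always restore the original directory before the temp dirs are removed
            os.chdir(start)
    return from_project, from_workdir, anchored


if __name__ == "__main__":
    print(demo())
```

The relative pattern only finds the file while the working directory is the project directory; the anchored variant finds it from anywhere. Whether this matches what pkt_kg actually does would need to be checked against its source.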

  2. When we add a container to the Nextflow process, are the commands inside the process (run_PKTBaseRun) not run from the container's workdir? If so, should the value of pwd not be the container's workdir rather than the Nextflow work directory?

All the required files have been added to the docker image.

  3. Is there a way to ensure that the commands within the script section of the Nextflow process are run from the Docker root/workdir?

The idea behind combining Nextflow and Docker is to eventually run this on AWS Batch using awscli. But before running it on AWS Batch, I want to ensure that it runs fine on the local server.

Looking forward to your suggestions and ideas. Thank you.
