Keras stuck randomly while adding the first layer inside a Docker container

  docker, fastapi, keras, tensorflow, uvicorn

I designed a classification model using Python 3.9.5, Keras 2.4.3 and tensorflow-cpu 2.5.0. The model works fine in my Windows development environment, but when I deploy it to a server using Docker, FastAPI and uvicorn, it gets stuck at random while adding the first layer. Sometimes this happens on the 4th training run, sometimes only after the 20th. Nothing is printed to the logs (screenshot attached below). Training runs in a separate process; when I check the Docker stats, the processes are still alive and appear to be stuck on the same step.

Source code / logs

Model Structure

# Assumed imports for this snippet (TF 2.x Keras API):
import tensorflow as tf
from tensorflow.keras import initializers
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential

try:
    log.info("Initializing Sequential model")
    model = Sequential()

    log.info("Initializing GlorotNormal")
    initializer = initializers.GlorotNormal()

    log.info("Adding LSTM as input layer")
    model.add(LSTM(100, input_shape=train_x.shape[1:], return_sequences=False))

    log.info("Adding hidden dense layer")
    model.add(Dense(64, activation='selu', name="layer2",
                    kernel_initializer=initializer))

    log.info("Adding Dropout")
    model.add(Dropout(rate=0.5))

    log.info("Adding output layer")
    model.add(Dense(len(intent_tags), activation='softmax', name="layer3"))

    log.info("Generating model summary")
    model.summary()

    log.info("Compiling model")
    model.compile(loss='categorical_crossentropy',
                  optimizer=tf.keras.optimizers.Adamax(learning_rate=0.005),
                  metrics=['accuracy'])

    log.info("Model compiled successfully")
except Exception:
    log.exception("Model creation failed")
    raise
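As a sanity check on what `model.summary()` should report for this stack, the parameter counts can be computed by hand. A minimal pure-Python sketch; `F` here is a hypothetical feature dimension standing in for `train_x.shape[2]`:

```python
def lstm_params(units: int, input_dim: int) -> int:
    # Each of the 4 LSTM gates has a kernel (input_dim weights per unit),
    # a recurrent kernel (units weights per unit) and one bias per unit.
    return 4 * units * (input_dim + units + 1)

def dense_params(units: int, input_dim: int) -> int:
    # One weight per input per unit, plus one bias per unit.
    return units * (input_dim + 1)

F = 50  # hypothetical; substitute train_x.shape[2]
print(lstm_params(100, F))    # LSTM(100) layer -> 60400
print(dense_params(64, 100))  # Dense(64) fed by the 100-unit LSTM -> 6464
```

If the summary printed in the container ever disagrees with these numbers, the hang is happening before the graph is fully built.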

Model fit:

model: Sequential = create_training_model(train_x, train_y, intent_tags)
log.info("Model created")

add_into_queue: LambdaCallback = LambdaCallback(
    on_epoch_end=lambda epoch, _: queue.put({
        "type": "progress",
        "sub_type": "training_progress",
        "progress": f"EPOCHS: {epoch + 1}/{configuration_epochs}"}))
es: EarlyStopping = EarlyStopping(monitor='loss', mode='min', verbose=1,
                                  patience=30, min_delta=1)
log.info("Fitting model")

history = model.fit(train_x, train_y, epochs=200, batch_size=5,
                    verbose=1, validation_data=(test_x, test_y),
                    callbacks=[es, add_into_queue])
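Note that `min_delta=1` means an epoch only counts as an improvement when the monitored loss drops by more than 1.0 below the best value seen so far, so early stopping may fire even while the loss is still slowly decreasing. A pure-Python sketch of the stopping rule (an approximation of Keras's `EarlyStopping` logic, not the real implementation):

```python
def early_stop_epoch(losses, patience=30, min_delta=1.0):
    """Return the 0-based epoch at which training would stop, or None.

    Mimics EarlyStopping(monitor='loss', mode='min'): stop once `patience`
    consecutive epochs pass without the loss beating the best value by
    more than `min_delta`.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

print(early_stop_epoch([10, 9.5, 9.0], patience=2, min_delta=1.0))  # -> 2
```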

 
if es.stopped_epoch:
    training_completed_message: str = (
        f"Training completed at epoch {es.stopped_epoch}/{configuration_epochs}, "
        "early stopping applied")
    log.info(training_completed_message)

    progress_data: dict = {"type": "progress", "sub_type": "training_completed",
                           "progress": training_completed_message}
    queue.put(progress_data)
else:
    progress_data: dict = {"type": "progress", "sub_type": "training_completed",
                           "progress": str(configuration_epochs)}
    queue.put(progress_data)
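Keras passes a zero-based epoch index to `on_epoch_end`, hence the `epoch + 1` in the callback above. Factored into a standalone helper (a sketch for illustration, not part of the original code), the payload looks like:

```python
def progress_message(epoch: int, total_epochs: int) -> dict:
    # `epoch` is the zero-based index Keras passes to on_epoch_end,
    # so the human-readable counter starts at 1.
    return {
        "type": "progress",
        "sub_type": "training_progress",
        "progress": f"EPOCHS: {epoch + 1}/{total_epochs}",
    }

print(progress_message(0, 200)["progress"])  # -> EPOCHS: 1/200
```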

FastAPI websocket code snippet for training the model:

try:
    configuration["TRAINING_COUNT"] += 1
    log.info(f"Training Count: {configuration['TRAINING_COUNT']}")
    log.info("Starting training in a separate process")
    multi_process = Process(target=chatbot_training,
                            args=(qestions_answers, training_type, client_id,
                                  saved_file_path, queue),
                            name=f"training_process_{client_id}")
    multi_process.start()

    log.info("Initializing thread to send training progress")
    data_progress_thread = threading.Thread(target=send_data_progress_call,
                                            args=[websocket, queue],
                                            name="data_progress_thread")
    data_progress_thread.daemon = True
    data_progress_thread.start()
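`send_data_progress_call` is not shown in the question; a minimal stdlib sketch of the consumer pattern it presumably implements, with a plain callable standing in for the websocket's send method and `None` as a hypothetical shutdown sentinel (the real code uses a multiprocessing queue, but the loop is the same):

```python
import queue
import threading

def send_data_progress_call(send, progress_queue):
    # Forward each progress dict produced by the training process
    # to `send` until a None sentinel arrives.
    while True:
        message = progress_queue.get()
        if message is None:
            break
        send(message)

q = queue.Queue()
received = []
t = threading.Thread(target=send_data_progress_call,
                     args=(received.append, q),
                     name="data_progress_thread", daemon=True)
t.start()
q.put({"type": "progress", "sub_type": "training_progress",
       "progress": "EPOCHS: 1/200"})
q.put(None)
t.join(timeout=5)
print(received)
```

Keeping the thread a daemon, as in the snippet above, means it will not block interpreter shutdown if the queue is never drained.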

Dockerfile

FROM python:3.9.5-slim-buster
COPY ./ /app
WORKDIR /app
RUN pip install -r requirements.txt && \
    python -m nltk.downloader punkt && \
    python -m nltk.downloader wordnet && \
    python -m nltk.downloader averaged_perceptron_tagger && \
    python -m pip cache purge
ENV PYTHONHASHSEED=100
CMD ["python", "./starfighter/app.py"]

Docker container logs: (screenshot)

Result of docker stats container_name: (screenshot)

Result of docker top container_name: (screenshot)

Logs on development environment: (screenshot)

Steps to reproduce:

Train the model 20-40 times to reproduce the error; to save time, use a small dataset.

Environment information

Server OS = CentOS 7
docker base image = python:3.9.5-slim-buster
Python Version = 3.9.5
tensorflow-cpu==2.5.0
keras==2.4.3
nltk==3.5
pyspellchecker==0.6.2
pandas==1.2.4
fastapi==0.65.1
aiofiles==0.7.0
openpyxl==3.0.7
websockets==9.0.2
numpy==1.19.5
strictyaml
uvicorn==0.13.4
PyYAML==5.4.1

Source: Docker Questions
