Before we start: I am running everything in a Python container on AWS Fargate, hence some limitations.
I am using Flask to run a web server, and I spawn a permanent child process with the multiprocessing module to run repetitive background tasks. Everything works fine, but at night the child process randomly gets stuck, with no logs or traces. I even run a custom health check to make sure it is alive, and it kind of is, but it stops doing work: CPU usage drops and I can't understand what is going on. It makes a lot of network calls, but I would expect those to terminate on a timeout, not hang. CPU usage sits flat at ~40% while it is working and RAM is constant at 18%. It could potentially be running out of file descriptors, but why would it?
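For context, the health check is roughly this shape (an illustrative sketch, not the exact code; the names are made up): the child writes a shared heartbeat timestamp on every iteration, and the check compares that timestamp against the clock.

```python
import time
from multiprocessing import Process, Value

# Shared heartbeat timestamp ('d' = double), updated by the child
# on every completed iteration.
heartbeat = Value('d', time.time())

def poll(beat):
    while True:
        # ... the actual background work would go here ...
        beat.value = time.time()  # prove we finished an iteration
        time.sleep(1)

def is_alive(beat, max_age=5.0):
    """Health check: the child counts as healthy only if it has
    completed an iteration within the last `max_age` seconds."""
    return time.time() - beat.value < max_age

if __name__ == '__main__':
    p = Process(target=poll, args=(heartbeat,), daemon=True)
    p.start()
    time.sleep(2)
    print(is_alive(heartbeat))  # healthy while the loop keeps beating
```

The catch is that this only tells me the process exists and recently ticked; once it wedges inside a network call, the timestamp goes stale but nothing explains why.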
The code looks like this:
```python
from flask import Flask
from multiprocessing import Process
import time

def poll():
    while True:
        # blablabla (the actual work happens here)
        time.sleep(60)

p = Process(target=poll)
p.start()
# note: no p.join() here -- joining a process that loops forever
# would block before Flask ever starts

app = Flask(__name__)

@app.route('/', methods=['GET'])
def java():
    return app.send_static_file('java.html')
```
I can't put a timeout on the child process itself because it is meant to run indefinitely, and I can't spawn a fresh process every few minutes because the container's kernel would run out of PIDs fairly soon. I don't see how a try/except would help either, since nothing is failing; the process just goes unresponsive.
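To illustrate the try point, this is the kind of wrapper I mean (a sketch; `do_network_calls` is a stand-in for the real work). The except clause never fires, because when the calls hang nothing actually raises:

```python
import time

def do_network_calls():
    # Stand-in for the real work; in production, these calls
    # apparently block forever instead of raising anything.
    pass

def poll_once():
    try:
        do_network_calls()
    except Exception as exc:
        # Never reached when a call simply hangs: no exception is
        # raised, so there is nothing for except to catch.
        print(f"poll failed: {exc}")
        return False
    return True

def poll():
    while poll_once():
        time.sleep(60)
```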
Technically, I could run another container for this sub-job, but I wonder if there is a better solution?