I’m running a NestJS web application, using Fastify as the HTTP server, on Kubernetes.
I split my application into multiple zones and deploy it to k8s clusters in different physical locations (Cluster A and Cluster B).
Everything goes well except for Zone X in Cluster A, which has the highest traffic of all zones.
(Here is a 2-day metrics dashboard for Zone X during normal operation.)
The problem only happens in Zone X in Cluster A and never in any other zone or cluster.
At first, some 499 responses appear in Cluster A’s Ingress dashboard, and soon afterwards the memory of the pods suddenly grows to the memory limit, one pod after another.
It seems that the 499 status is caused by the pods not sending responses back to the client.
At the same time, the other zones in Cluster A work normally.
To avoid affecting users, I switch all network traffic to Cluster B and everything works properly, which rules out dirty data as the cause.
I tried to kill and redeploy all pods of Zone X in Cluster A, but when I switch traffic back to Cluster A, the problem occurs again. However, if I wait 2-3 hours and then switch the traffic back, the problem disappears!
Since I don’t know what causes this, the only thing I can do is switch traffic and check whether everything is back to normal.
I’ve tried multiple variations of node memory issues, but none of them seems to cause this problem. Any ideas or pointers on what could cause this?
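To check my guess that requests pile up without a response before the memory grows, something like the following could be used. It is only a minimal sketch: `InFlightTracker` is a hypothetical helper of my own, not part of NestJS or Fastify, and the injected clock is there only to make it easy to test.

```typescript
// Hypothetical helper (not a NestJS/Fastify API): counts requests that have
// arrived but whose response has not closed yet, and how long the oldest
// of them has been waiting.
class InFlightTracker {
  private readonly startedAt = new Map<number, number>();
  private readonly now: () => number;
  private nextId = 0;

  // The clock is injectable so the tracker can be tested deterministically.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Call when a request arrives; returns a token to pass to finish().
  start(): number {
    const id = this.nextId++;
    this.startedAt.set(id, this.now());
    return id;
  }

  // Call when the response closes, whether completed or aborted.
  finish(id: number): void {
    this.startedAt.delete(id);
  }

  // Number of open requests and the age of the oldest one in milliseconds.
  snapshot(): { inFlight: number; oldestMs: number } {
    const current = this.now();
    let oldestMs = 0;
    this.startedAt.forEach((startedAtMs) => {
      oldestMs = Math.max(oldestMs, current - startedAtMs);
    });
    return { inFlight: this.startedAt.size, oldestMs };
  }
}
```

Assuming access to the underlying Fastify instance, `start()` could be called from an `onRequest` hook and `finish()` from a `close` listener on `reply.raw` (the Node.js `ServerResponse`, which emits `close` both on completion and on a premature disconnect). Logging `snapshot()` together with `process.memoryUsage()` every few seconds should show whether `inFlight` and `oldestMs` start climbing before the 499s and the memory growth appear.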
Source: Docker Questions