Getting many DecisionTaskTimedOut events after scaling out the Uber Cadence matching service in a Docker Swarm cluster

  cadence-workflow, docker, docker-swarm, uber-cadence

I’m trying to run each Cadence service independently so that I can scale them in and out easily.
My team is using Docker Swarm, and we manage everything through the Portainer UI.
So far, I’ve been able to scale the frontend service to two replicas, but if I do the same with the matching service, I get a lot of DecisionTaskTimedOut events during a workflow execution. The execution eventually finishes successfully, but only after a long time: about 2 minutes with two matching service replicas, compared with 7 seconds with just one.

This is a test environment. I’m using a dockerized Cassandra database (we cannot use a real one due to budget restrictions). Maybe that’s the problem? The Docker image is configured with the following environment variables:

RINGPOP_BOOTSTRAP_MODE=dns
KEYSPACE=cadence
BIND_ON_IP=0.0.0.0
SKIP_SCHEMA_SETUP=false
VISIBILITY_KEYSPACE=cadence_visibility
CASSANDRA_HOSTNAME=soap_cassandra
RINGPOP_SEEDS=soap_cadence_frontend:7933,soap_cadence_history:7934,soap_cadence_worker:7939
CADENCE_HOME=/etc/cadence
SERVICES=matching

You can assume the default values for any other environment variable not listed above.

The RINGPOP_SEEDS values are the service names assigned to each Cadence service. Docker Swarm creates a DNS entry for each of them, as well as a load balancer if more than one replica is declared.
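
To see what those DNS entries return from inside a container, a small check along the following lines may help. It is only a sketch: the helper names `resolve_ips` and `compare_service_dns` are made up for illustration, and it assumes the stack uses Swarm's default virtual-IP endpoint mode, where the plain service name resolves to one load-balanced address while `tasks.<service>` resolves to the individual replicas.

```python
import socket


def resolve_ips(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses that hostname resolves to."""
    try:
        infos = resolver(hostname, None)
    except socket.gaierror:
        # Name does not resolve (for example when run outside the overlay network).
        return []
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def compare_service_dns(service, resolver=socket.getaddrinfo):
    """Resolve both the service name and the per-task name for a Swarm service."""
    return {
        service: resolve_ips(service, resolver),
        "tasks." + service: resolve_ips("tasks." + service, resolver),
    }
```

Calling `compare_service_dns("soap_cadence_frontend")` from a container attached to the same overlay network would show whether a seed name maps to a single address or to one address per replica.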

The matching service seems to start correctly. Logs:

{"level":"info","ts":"2021-02-18T22:47:36.296Z","msg":"Created RPC dispatcher and listening","service":"cadence-matching","address":"0.0.0.0:7935","logging-call-at":"rpc.go:81"},
{"level":"warn","ts":"2021-02-18T22:47:36.321Z","msg":"Failed to fetch key from dynamic config","key":"system.advancedVisibilityWritingMode","error":"unable to find key","logging-call-at":"config.go:68"},
{"level":"info","ts":"2021-02-18T22:47:36.336Z","msg":"Add new peers by DNS lookup","address":"0.0.0.0","addresses":"[0.0.0.0:7933]","logging-call-at":"clientBean.go:321"},
{"level":"info","ts":"2021-02-18T22:47:36.321Z","msg":"Creating RPC dispatcher outbound","service":"cadence-frontend","address":"0.0.0.0:7933","logging-call-at":"clientBean.go:277"},
{"level":"info","ts":"2021-02-18T22:47:36.441Z","msg":"Starting service matching","logging-call-at":"server.go:217"},
{"level":"warn","ts":"2021-02-18T22:47:36.441Z","msg":"Failed to fetch key from dynamic config","key":"matching.throttledLogRPS","error":"unable to find key","logging-call-at":"config.go:68"},
{"level":"info","ts":"2021-02-18T22:47:36.441Z","msg":"Creating RPC dispatcher outbound","service":"cadence-frontend","address":"127.0.0.1:7933","logging-call-at":"clientBean.go:277"},
{"level":"info","ts":"2021-02-18T22:47:36.442Z","msg":"Add new peers by DNS lookup","address":"127.0.0.1","addresses":"[127.0.0.1:7933]","logging-call-at":"clientBean.go:321"},
{"level":"info","ts":"2021-02-18T22:47:36.713Z","msg":"matching starting","service":"cadence-matching","logging-call-at":"service.go:90"},
{"level":"info","ts":"2021-02-18T22:47:36.734Z","msg":"RuntimeMetricsReporter started","service":"cadence-matching","logging-call-at":"runtime.go:169"},
{"level":"info","ts":"2021-02-18T22:47:36.734Z","msg":"PProf not started due to port not set","logging-call-at":"pprof.go:64"},
{"level":"info","ts":"2021-02-18T22:47:36.799Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-matching","addresses":"[[::]:7935]","logging-call-at":"rpServiceResolver.go:246"},
{"level":"info","ts":"2021-02-18T22:47:36.799Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-worker","addresses":"[[::]:7939]","logging-call-at":"rpServiceResolver.go:246"},
{"level":"info","ts":"2021-02-18T22:47:36.800Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-frontend","addresses":"[[::]:7933]","logging-call-at":"rpServiceResolver.go:246"},
{"level":"info","ts":"2021-02-18T22:47:36.814Z","msg":"service started","service":"cadence-matching","logging-call-at":"resourceImpl.go:383"},
{"level":"info","ts":"2021-02-18T22:47:36.814Z","msg":"matching started","service":"cadence-matching","logging-call-at":"service.go:99"}
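
One detail in the startup logs above is that the peer and member addresses are wildcard or loopback values (`0.0.0.0`, `127.0.0.1`, `[::]`) rather than container IPs. Whether that is related to the timeouts is not established here. The following sketch just lists such addresses from a log dump so they are easy to spot; the function names are hypothetical, and it assumes each line is a JSON object, possibly followed by a trailing comma as in the paste.

```python
import ipaddress
import json


def parse_log_line(line):
    """Parse one JSON log line, tolerating the trailing comma seen in the paste."""
    line = line.strip().rstrip(",")
    return json.loads(line) if line else None


def non_routable_addresses(lines):
    """Yield (service, address, kind) for wildcard or loopback addresses."""
    for line in lines:
        entry = parse_log_line(line)
        if not entry or "addresses" not in entry:
            continue
        # The field is a string such as "[[::]:7935]" or "[0.0.0.0:7933]".
        raw = entry["addresses"]
        if raw.startswith("[") and raw.endswith("]"):
            raw = raw[1:-1]
        for addr in raw.split():
            # Drop the port, then the brackets used around IPv6 hosts.
            host = addr.rsplit(":", 1)[0].strip("[]")
            try:
                ip = ipaddress.ip_address(host)
            except ValueError:
                continue  # Not an IP literal (for example a hostname).
            if ip.is_unspecified:
                yield entry.get("service"), addr, "wildcard"
            elif ip.is_loopback:
                yield entry.get("service"), addr, "loopback"
```

Feeding the log file to it, for example `list(non_routable_addresses(open("matching.log")))` with a hypothetical file name, gives one tuple per suspicious address.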

I can see the following errors in the logs when the workflow is executing:

{"level":"error","ts":"2021-02-18T22:17:07.281Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"ae85d0ac1629:f8102a0f-406a-4fc7-8abf-e4b3fd66a278","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: ae85d0ac1629:f8102a0f-406a-4fc7-8abf-e4b3fd66a278, taskListType: 0, rangeID: 14, db rangeID: 15","wf-task-list-name":"ae85d0ac1629:f8102a0f-406a-4fc7-8abf-e4b3fd66a278","wf-task-list-type":0,"number":1300001,"next-number":1300001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Errornt/cadence/common/log/loggerimpl/logger.go:134ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoopnt/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:52:03.740Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: 8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca, taskListType: 0, rangeID: 16, db rangeID: 17","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"number":1500002,"next-number":1500002,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Errornt/cadence/common/log/loggerimpl/logger.go:134ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoopnt/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:10:10.971Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"FeaTaskList","wf-task-list-type":1,"store-operation":"create-task","error":"Failed to create task. TaskList: FeaTaskList, taskListType: 1, rangeID: 94, db rangeID: 95","wf-task-list-name":"FeaTaskList","wf-task-list-type":1,"number":9300001,"next-number":9300001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Errornt/cadence/common/log/loggerimpl/logger.go:134ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoopnt/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:09:53.345Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: 8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca, taskListType: 0, rangeID: 14, db rangeID: 15","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"number":1300001,"next-number":1300001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Errornt/cadence/common/log/loggerimpl/logger.go:134ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoopnt/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:53:56.145Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: 8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca, taskListType: 0, rangeID: 17, db rangeID: 18","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"number":1600001,"next-number":1600001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Errornt/cadence/common/log/loggerimpl/logger.go:134ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoopnt/cadence/service/matching/taskWriter.go:176"}
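
In every error above, the `rangeID` reported by the matching instance is exactly one behind the `db rangeID` (14 vs 15, 16 vs 17, 94 vs 95, 17 vs 18). A short script can group these by task list to show how often each one is affected. This is a sketch only: `range_id_conflicts` is a hypothetical helper, it assumes one JSON object per line with an optional trailing comma, and reading the mismatch as two instances competing for the same task list is an assumption, not something the logs confirm.

```python
import json
import re
from collections import defaultdict

# Matches the tail of the "Failed to create task" error message.
RANGE_RE = re.compile(r"rangeID: (\d+), db rangeID: (\d+)")


def range_id_conflicts(lines):
    """Group (timestamp, rangeID, db rangeID) tuples by task list name."""
    conflicts = defaultdict(list)
    for line in lines:
        line = line.strip().rstrip(",")
        if not line:
            continue
        entry = json.loads(line)
        match = RANGE_RE.search(entry.get("error", ""))
        if match:
            held, stored = int(match.group(1)), int(match.group(2))
            conflicts[entry["wf-task-list-name"]].append((entry["ts"], held, stored))
    # ISO-8601 timestamps sort chronologically as plain strings.
    for events in conflicts.values():
        events.sort()
    return dict(conflicts)
```

Printing the result for the five lines above would list each task list with its time-ordered mismatches, which makes it easier to see whether the stored `rangeID` keeps advancing while the execution is running.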

The docker image version I’m currently using is: ubercadence/server:0.15.1

Is there any way to resolve this issue?

Source: Docker Questions
