I have a GKE cluster which has suddenly stopped being able to pull my Docker images from GCR; both are in the same GCP project. It had been working well for several months with no issues pulling images, and has now started throwing errors without my having made any changes.
(NB: I’m generally the only one on my team who accesses Google Cloud, though it’s entirely possible that someone else on the team made changes, perhaps inadvertently.)
I’ve seen a few other posts on this topic, but the solutions offered there haven’t helped. Two of these posts stood out to me in particular, as both were posted around the same day my issues started, ~13/14 days ago. Whether that is coincidence or not, who knows.
The first has the same issue as me; I’m unsure whether the posted comments helped them resolve it, but they haven’t fixed it for me. The second also seemed to be the same issue, but the poster says it resolved by itself after waiting some time.
I first noticed the issue on the cluster a few days ago, when I went to deploy a new image by pushing it to GCR and then bouncing the pods with kubectl rollout restart deployment. The pods all then came back with ImagePullBackOff, saying that they couldn’t get the image from GCR:
kubectl get pods:
XXX-XXX-XXX 0/1 ImagePullBackOff 0 13d
XXX-XXX-XXX 0/1 ImagePullBackOff 0 13d
XXX-XXX-XXX 0/1 ImagePullBackOff 0 13d
...
kubectl describe pod XXX-XXX-XXX:
Normal BackOff 20s kubelet Back-off pulling image "gcr.io/<GCP_PROJECT>/XXX:dev-latest"
Warning Failed 20s kubelet Error: ImagePullBackOff
Normal Pulling 8s (x2 over 21s) kubelet Pulling image "gcr.io/<GCP_PROJECT>/XXX:dev-latest"
Warning Failed 7s (x2 over 20s) kubelet Failed to pull image "gcr.io/<GCP_PROJECT>/XXX:dev-latest": rpc error: code = Unknown desc = failed to pull and unpack image "gcr.io/<GCP_PROJECT>/XXX:dev-latest": failed to resolve reference "gcr.io/<GCR_PROJECT>/XXX:dev-latest": unexpected status code [manifests dev-latest]: 403 Forbidden
Warning Failed 7s (x2 over 20s) kubelet Error: ErrImagePull
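For a cluster with many pods, the affected ones can be picked out of the kubectl get pods listing mechanically. The following is only a sketch: the function name failing_pull_pods is made up for illustration, and it assumes the default column order (NAME READY STATUS RESTARTS AGE) printed with --no-headers.

```shell
# Print the names of pods whose STATUS column shows an image-pull failure.
# Expects `kubectl get pods --no-headers` style output on stdin
# (hypothetical helper; assumes STATUS is the third column).
failing_pull_pods() {
  awk '$3 == "ImagePullBackOff" || $3 == "ErrImagePull" { print $1 }'
}

# Usage (needs cluster access, so it is left as a comment here):
#   kubectl get pods --no-headers | failing_pull_pods
```

Each name it prints can then be passed to kubectl describe pod to read the kubelet events for that pod.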
Troubleshooting steps followed from other posts:
I know that the image definitely exists in GCR –
- I can pull the image to my own machine (also removed all docker images from my machine to confirm it was really pulling)
- I can see the tagged image in the GCR UI in Chrome.
I’ve SSH’d into one of the cluster nodes and tried to docker pull manually, with no success:
docker pull gcr.io/<GCP_PROJECT>/XXX:dev-latest
Error response from daemon: unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication
(I also did a docker pull of a public mongodb image, which worked, so the problem is specific to GCR.)
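While on the node, it may also be worth checking which identity the node itself presents, since image pulls on GKE are normally made with the node's service account rather than with a personal login. This is a rough sketch under that assumption. The metadata server URL and the Metadata-Flavor header are the standard Compute Engine ones, but the helper names node_sa_attr and has_storage_read_scope are invented here, and the scope list is my reading of which OAuth scopes allow reading the Cloud Storage bucket behind gcr.io.

```shell
# Metadata server path for the node's default service account.
# Only reachable from inside a Compute Engine VM / GKE node.
METADATA_URL="http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default"

# Fetch one attribute of the node's service account, e.g. "email" or "scopes".
# (hypothetical helper name)
node_sa_attr() {
  curl -s -H "Metadata-Flavor: Google" "$METADATA_URL/$1"
}

# Read a newline-separated list of OAuth scopes on stdin and succeed if any
# of them permits reading from Cloud Storage (assumed requirement for GCR pulls).
has_storage_read_scope() {
  grep -Eq '/auth/(cloud-platform|devstorage\.(read_only|read_write|full_control))$'
}

# Usage on a node (left as comments because it needs the metadata server):
#   node_sa_attr email
#   node_sa_attr scopes | has_storage_read_scope && echo "scope ok" || echo "scope missing"
```

If the scope check passes, the email it prints is the account whose IAM roles on the project (and on the registry's storage bucket) would be the next thing to inspect.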
So this leads me to believe it’s an issue with the service account not having the correct permissions, as in the cloud docs under the ‘Error 400/403’ section. This seems to suggest that the service account has either been deleted, or edited manually.
During my troubleshooting, I tried to find out exactly which service account GKE was using to pull from GCR. In the steps outlined in the docs, it says that:
The name of your Google Kubernetes Engine service account is as follows, where PROJECT_NUMBER is your project number:
I found the service account and checked the policies; it did have one for roles/container.serviceAgent, but nothing specifically mentioning kubernetes as I would expect from the description in the docs, ‘the Kubernetes Engine Service Agent role’ (unless that is the one they’re describing, in which case I’m no better off than before anyway).
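The lookup described above can be scripted roughly as follows. This is a sketch, not a definitive procedure: the function names gke_service_agent and roles_for_member are hypothetical, the service agent address follows the service-PROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com pattern from the GKE docs, and the --flatten/--filter/--format combination is the commonly used way to list one member's roles. As far as I can tell, roles/container.serviceAgent is the role ID whose display title is "Kubernetes Engine Service Agent", so seeing it in the output would be the expected result.

```shell
# Build the GKE service agent address for a project ID.
# (hypothetical helper; relies on `gcloud projects describe`)
gke_service_agent() {
  project_number=$(gcloud projects describe "$1" --format='value(projectNumber)') || return 1
  printf 'service-%s@container-engine-robot.iam.gserviceaccount.com\n' "$project_number"
}

# List the roles bound to one member (matched by email) in the project policy.
# $1 = project ID, $2 = service account email
roles_for_member() {
  gcloud projects get-iam-policy "$1" \
    --flatten='bindings[].members' \
    --filter="bindings.members:$2" \
    --format='value(bindings.role)'
}

# Usage (needs gcloud credentials, so left as comments):
#   agent=$(gke_service_agent "<GCP_PROJECT>")
#   roles_for_member "<GCP_PROJECT>" "$agent"
```

The same roles_for_member call can be pointed at any other account, for example the one the nodes run as, to compare what each is allowed to do.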
Assuming it must not have had the correct roles, I then followed the steps to re-enable it (disable then enable the Kubernetes API). Running gcloud projects get-iam-policy <GCP_PROJECT> again and diffing the two outputs (before/after), the only difference is that a service account for ‘@cloud-filer…’ has been deleted.
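A before/after comparison like this is easier to read if each binding is flattened to one sorted line first. A minimal sketch, assuming gcloud's value format prints the two fields tab-separated; snapshot_iam and diff_iam are hypothetical helper names, and the file paths are placeholders.

```shell
# Write a sorted "role<TAB>member" listing of the project's IAM bindings.
# $1 = project ID, $2 = output file (hypothetical helper)
snapshot_iam() {
  gcloud projects get-iam-policy "$1" \
    --flatten='bindings[].members' \
    --format='value(bindings.role,bindings.members)' | sort > "$2"
}

# Print only the bindings that differ between two snapshots:
# lines starting with "<" exist only in the first file (removed),
# lines starting with ">" exist only in the second (added).
diff_iam() {
  diff "$1" "$2" | grep '^[<>]'
}

# Usage (needs gcloud credentials, so left as comments):
#   snapshot_iam "<GCP_PROJECT>" before.txt
#   ... disable and re-enable the API ...
#   snapshot_iam "<GCP_PROJECT>" after.txt
#   diff_iam before.txt after.txt
```

Keeping the snapshots as plain files also makes it possible to compare against a policy saved from a time when pulls still worked, if one exists.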
Thinking maybe the error was something else, I thought I would try spinning up a new cluster. Same error – can’t pull images.
I’ve been racking my brains to try to troubleshoot, but I’m now out of ideas! Any and all help much appreciated!