In this blog post, I will walk through my journey investigating and resolving an issue where my certificate request from ZeroSSL, using Cert-Manager, remained in a "not ready" state for over two days. I'll cover the tools involved, provide background information, and show how I eventually identified and fixed the problem.
ACME (Automatic Certificate Management Environment) is a protocol developed by the Internet Security Research Group (ISRG) to automate the process of obtaining and managing SSL/TLS certificates from Certificate Authorities (CAs). It simplifies the traditionally manual steps involved in certificate issuance by using an automated process. ACME is widely used by services like Let’s Encrypt and ZeroSSL to secure websites with HTTPS.
The ACME protocol automates interactions between a client (such as Cert-Manager or Certbot) and a CA, allowing the client to request, renew, and manage certificates without human intervention. ACME operates through a series of challenges that prove ownership of the domain for which a certificate is requested. Once the ownership is verified, the CA can issue the certificate.
In the HTTP-01 challenge, the client proves ownership of a domain by hosting a specific file at a designated path (e.g., under http://gmem.cc/.well-known/acme-challenge/). The CA attempts to retrieve this file, and if it succeeds, the challenge is validated. This challenge is commonly used for publicly accessible web servers.
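For illustration, the file served in an HTTP-01 challenge contains a "key authorization": the challenge token joined with a thumbprint of the ACME account key (RFC 8555, section 8.1, with the thumbprint defined in RFC 7638). The following is a minimal Python sketch, assuming a hypothetical EC account key in JWK form; the token and the x and y values are placeholders, not real data.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without '=' padding, as ACME requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """RFC 7638 thumbprint of an EC public key in JWK form."""
    # Only the required members, in lexicographic order, with no whitespace.
    required = {k: jwk[k] for k in ("crv", "kty", "x", "y")}
    canonical = json.dumps(required, sort_keys=True, separators=(",", ":"))
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def http01_response(token: str, jwk: dict) -> tuple[str, str]:
    """Return (URL path, file body) that the client must serve."""
    path = "/.well-known/acme-challenge/" + token
    body = token + "." + jwk_thumbprint(jwk)
    return path, body


# Hypothetical account key and token, for illustration only.
example_jwk = {"kty": "EC", "crv": "P-256", "x": "placeholder-x", "y": "placeholder-y"}
path, body = http01_response("example-token", example_jwk)
print(path)
print(body)
```

Cert-Manager does this work internally; the sketch only shows what the CA expects to find when it fetches the path.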
In the DNS-01 challenge, the client proves domain ownership by creating a special DNS TXT record under the domain (for example, _acme-challenge.gmem.cc). The CA queries the DNS record to confirm ownership. DNS-01 is required for wildcard certificates and is also useful when the server is not publicly accessible, because it doesn’t rely on serving files over HTTP.
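As a rough sketch of what ends up in DNS: per RFC 8555 (section 8.4), the TXT value is the base64url-encoded SHA-256 digest of the key authorization, and for a wildcard name the record is published under the base domain. The key authorization string below is a placeholder; a real one is the challenge token joined with the account key thumbprint.

```python
import base64
import hashlib


def dns01_record(domain: str, key_authorization: str) -> tuple[str, str]:
    """Return (record name, TXT value) for an ACME DNS-01 challenge."""
    # A wildcard such as *.gmem.cc is validated against the base domain.
    base = domain[2:] if domain.startswith("*.") else domain
    name = "_acme-challenge." + base
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return name, value


# The thumbprint part is a placeholder; a real one comes from the account key.
name, value = dns01_record("*.gmem.cc", "example-token.example-thumbprint")
print(name)   # _acme-challenge.gmem.cc
print(value)
```

This is why a certificate for *.gmem.cc results in a TXT record named _acme-challenge.gmem.cc rather than anything containing the asterisk.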
The TLS-ALPN-01 challenge requires the client to prove control of a domain by configuring a TLS server with a special certificate during the ACME validation process. The CA then connects to the server via TLS and checks the certificate. This challenge is less common and is usually used in specialized environments.
Cert-Manager is an open-source Kubernetes add-on that automates the management, issuance, and renewal of certificates within Kubernetes clusters. It integrates with various Certificate Authorities (CAs) and protocols, including ACME (used by providers like Let’s Encrypt and ZeroSSL). Cert-Manager is widely used to ensure that certificates remain valid and secure without manual intervention.
When deploying Cert-Manager in Kubernetes, several key components work together to handle certificate management.
The controller is the core component of the Cert-Manager system. It manages the lifecycle of certificates and interacts with Issuers (such as ACME servers like Let’s Encrypt or ZeroSSL). It runs as a Kubernetes controller and performs the actual management of certificates, from creation to renewal, ensuring that issued certificates are stored securely as Kubernetes Secrets.
The CA Injector (cainjector) is an additional component that works alongside Cert-Manager to inject CA data into other Kubernetes resources. It primarily operates on Kubernetes ValidatingWebhookConfiguration and MutatingWebhookConfiguration resources, injecting CA certificates into them automatically. The cainjector is critical in environments where certain Kubernetes components (e.g., webhooks) require their certificates to be signed by a trusted CA.
The Webhook component provides an admission controller that validates Cert-Manager resources like Certificate, Issuer, ClusterIssuer, and CertificateRequest upon creation or update. By checking that these resources are correctly configured, the webhook helps catch configuration issues early, improving the reliability of certificate management workflows.
cert-manager-webhook-dnspod is a community-maintained webhook that handles DNS-01 challenges for domains hosted in Tencent Cloud’s DNSPod. When Cert-Manager requests a certificate using the DNS-01 challenge, it needs to create a DNS TXT record in the domain's DNS zone. The webhook does this by calling the DNSPod API to manage DNS records.
Cert-Manager invokes this webhook when it needs to solve a DNS-01 challenge using DNSPod as the DNS provider. The webhook receives instructions from Cert-Manager, communicates with the DNSPod API to create or delete DNS TXT records, and reports back to Cert-Manager when the challenge is complete.
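The contract described above can be modelled in a few lines. Cert-Manager's webhook solver interface (written in Go) asks a solver to "present" a TXT record and later to "clean up" the same record. The Python sketch below is a toy model of that contract, not the DNSPod webhook's code: the class names are hypothetical, and an in-memory dict stands in for the DNS provider API.

```python
from dataclasses import dataclass, field


@dataclass
class ChallengeRequest:
    """Subset of what cert-manager sends to a DNS-01 webhook solver."""
    resolved_fqdn: str   # e.g. "_acme-challenge.gmem.cc."
    key: str             # TXT value the CA expects to find


@dataclass
class InMemorySolver:
    """Hypothetical solver; a dict replaces the real DNS provider API."""
    txt_records: dict = field(default_factory=dict)

    def present(self, ch: ChallengeRequest) -> None:
        # Create the TXT record; repeated calls must be harmless.
        self.txt_records.setdefault(ch.resolved_fqdn, set()).add(ch.key)

    def clean_up(self, ch: ChallengeRequest) -> None:
        # Remove only the value for this challenge, leaving others in place.
        values = self.txt_records.get(ch.resolved_fqdn, set())
        values.discard(ch.key)
        if not values:
            self.txt_records.pop(ch.resolved_fqdn, None)


solver = InMemorySolver()
request = ChallengeRequest("_acme-challenge.gmem.cc.", "example-txt-value")
solver.present(request)
print(solver.txt_records)
solver.clean_up(request)
print(solver.txt_records)
```

Keeping both operations idempotent matters because a controller may retry either call while reconciling a Challenge.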
In this post the webhook was created with the following manifest:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager-webhook-dnspod
  namespace: argocd
spec:
  destination:
    namespace: cert-manager
    server: https://kubernetes.default.svc
  project: default
  source:
    repoURL: https://github.com/imroc/cert-manager-webhook-dnspod
    targetRevision: master
    path: charts/cert-manager-webhook-dnspod
    helm:
      releaseName: cert-manager-webhook-dnspod
      values: |
        groupName: acme.dnspod.tencent.com
        clusterIssuer:
          name: letsencrypt-issuer
          secretId: DNSPOD_SECRETID
          secretKey: DNSPOD_SECRETKEY
          email: gmem@me.com
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
    automated:
      prune: true
      selfHeal: true
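Cert-Manager relies on several types of custom resources that work together to manage the certificate lifecycle, including Issuer and ClusterIssuer, Certificate, CertificateRequest, Order, and Challenge. An ACME issuer is configured with one or more challenge solvers: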
The HTTP-01 challenge is used when the domain is publicly accessible over HTTP. The CA validates ownership of the domain by requesting a specific file over HTTP. Cert-Manager sets up the challenge response using a Kubernetes Ingress resource.
The DNS-01 challenge is used when domain ownership must be validated via DNS records. This method is often preferred for wildcard certificates or domains that are not publicly accessible over HTTP.
I created a ClusterIssuer and a Certificate for the wildcard domain *.gmem.cc. However, after waiting for more than two days, the certificate status still showed as not ready, with the following condition in the certificate's status:
status:
  conditions:
  - lastTransitionTime: "2024-10-14T06:02:10Z"
    message: Issuing certificate as Secret does not exist
    observedGeneration: 1
    reason: DoesNotExist
    status: "False"
    type: Ready
I checked on DNSPod and found a TXT record _acme-challenge.gmem.cc with the TTL set to 600 seconds.
To diagnose the issue, I checked the status of the related Cert-Manager resources: CertificateRequest, Order, and Challenge. These are the key resources that Cert-Manager uses to interact with ACME and handle certificate issuance. Here’s an overview of my findings:
CertificateRequest:
status:
  conditions:
  - lastTransitionTime: "2024-10-14T06:02:14Z"
    message: Certificate request has been approved by cert-manager.io
    reason: cert-manager.io
    status: "True"
    type: Approved
  - lastTransitionTime: "2024-10-14T06:02:14Z"
    message: 'Waiting on certificate issuance from order istio-system/wildcard-ssl-5pgsr-3353861729: "pending"'
    reason: Pending
    status: "False"
    type: Ready
The certificate request was approved, but the system was waiting for the certificate issuance to complete.
Order:
status:
  state: pending
  authorizations:
  - challenges:
    - token: iQMwrfsFRmJ_MytUY3N4NW6QehtTn0-IEvJWAmYEw_k
      type: dns-01
      url: https://acme.zerossl.com/v2/DV90/chall/DDRBMBd9jnJo_W4EcQfSWQ
    wildcard: true
The order was still in a "pending" state, and the DNS-01 challenge had not been completed yet.
Challenge:
status:
  presented: true
  processing: true
  reason: 'Waiting for DNS-01 challenge propagation: DNS record for "gmem.cc" not yet propagated'
  state: pending
The challenge was waiting for DNS propagation, even though the TXT record mentioned above had been created two days earlier.
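The same walk down the chain (Certificate, CertificateRequest, Order, Challenge) can be scripted. The sketch below is only an illustration: it assumes the status objects have already been fetched (for example with kubectl and JSON output) and uses abbreviated copies of the statuses shown above. The function names are my own.

```python
def ready_condition(status: dict):
    """Return the Ready condition of a Certificate/CertificateRequest status, if any."""
    for cond in status.get("conditions", []):
        if cond.get("type") == "Ready":
            return cond
    return None


def explain(chain: list[tuple[str, dict]]) -> list[str]:
    """Summarise each resource in the chain as one line."""
    lines = []
    for kind, status in chain:
        cond = ready_condition(status)
        if cond is not None:
            lines.append(f"{kind}: Ready={cond['status']} ({cond['reason']})")
        else:
            # Order and Challenge report a plain state instead of conditions.
            reason = status.get("reason", "")
            lines.append(f"{kind}: state={status.get('state')} {reason}".rstrip())
    return lines


# Statuses abbreviated from the outputs shown above.
chain = [
    ("Certificate", {"conditions": [
        {"type": "Ready", "status": "False", "reason": "DoesNotExist"}]}),
    ("CertificateRequest", {"conditions": [
        {"type": "Approved", "status": "True", "reason": "cert-manager.io"},
        {"type": "Ready", "status": "False", "reason": "Pending"}]}),
    ("Order", {"state": "pending"}),
    ("Challenge", {"state": "pending", "presented": True, "processing": True,
                   "reason": "Waiting for DNS-01 challenge propagation"}),
]
for line in explain(chain):
    print(line)
```

The last line of the output is usually the most informative one, because the Challenge sits closest to the actual failure.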
DNS propagation can be delayed for many reasons. To check whether the TXT record _acme-challenge.gmem.cc is visible worldwide, we can use whatsmydns.net:
https://www.whatsmydns.net/#TXT/_acme-challenge.gmem.cc
In this case, the check result was that the record had been fully propagated.
I then checked the logs for the Cert-Manager pod to see if any errors were being reported. The relevant and repeating error message from the logs was:
E1014 05:52:58.647839 1 sync.go:190] "cert-manager/challenges: propagation check failed" err="DNS record for \"gmem.cc\" not yet propagated"
At this point, it seemed that Cert-Manager was unable to see the propagated DNS records, even though I had confirmed their existence.
After enabling verbose logging (the --v=5 log level) in Cert-Manager, I finally discovered the root cause. The detailed logs revealed that Cert-Manager was following a CNAME and checking for the DNS record at the CNAME's target:
I1014 06:20:33.919849 1 dns.go:116] "cert-manager/challenges/Check: checking DNS propagation"
I1014 06:20:33.921190 1 wait.go:90] Updating FQDN: _acme-challenge.gmem.cc. with its CNAME: lb-db1nok14-foh4te5vrj0dya3c.clb.sg-tencentclb.com.
I1014 06:25:35.886683 1 wait.go:298] Searching fqdn "lb-db1nok14-foh4te5vrj0dya3c.clb.sg-tencentclb.com." using seed nameservers [10.231.18.121:53]
I1014 06:25:35.886696 1 wait.go:329] Returning cached zone record "sg-tencentclb.com." for fqdn "lb-db1nok14-foh4te5vrj0dya3c.clb.sg-tencentclb.com."
I1014 06:20:33.987786 1 wait.go:141] Looking up TXT records for "lb-db1nok14-foh4te5vrj0dya3c.clb.sg-tencentclb.com."
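The challenge was failing because of the wildcard record *.gmem.cc, a CNAME pointing to another domain, lb-db1nok14-foh4te5vrj0dya3c.clb.sg-tencentclb.com. As the log shows, the CNAME lookup for _acme-challenge.gmem.cc was answered from that wildcard, so Cert-Manager followed the CNAME and searched for the TXT record under the load balancer's hostname, where it did not exist.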
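The failure mode can be reproduced with a toy model. The Python sketch below is not Cert-Manager's code: it uses an in-memory dict as the zone and assumes that the resolver answers a CNAME query from the wildcard entry when there is no exact match, which is what the log line "Updating FQDN ... with its CNAME" suggests happened. The TXT value is a placeholder.

```python
def lookup(zone: dict, fqdn: str, rtype: str) -> list[str]:
    """Answer a query from a toy zone, falling back to a wildcard entry."""
    exact = zone.get((fqdn, rtype))
    if exact:
        return exact
    # Assumption: the resolver answers from "*.<parent>" when there is no
    # exact match for this record type, as the logs above suggest happened.
    parent = fqdn.split(".", 1)[1]
    return zone.get(("*." + parent, rtype), [])


def propagation_check(zone: dict, fqdn: str, expected: str) -> tuple[str, bool]:
    """Follow a CNAME if one is returned, then look for the TXT value."""
    cname = lookup(zone, fqdn, "CNAME")
    if cname:
        fqdn = cname[0]          # the TXT lookup now targets the CNAME target
    return fqdn, expected in lookup(zone, fqdn, "TXT")


LB = "lb-db1nok14-foh4te5vrj0dya3c.clb.sg-tencentclb.com."
zone = {
    ("*.gmem.cc.", "CNAME"): [LB],
    ("_acme-challenge.gmem.cc.", "TXT"): ["example-txt-value"],
}

print(propagation_check(zone, "_acme-challenge.gmem.cc.", "example-txt-value"))
del zone[("*.gmem.cc.", "CNAME")]   # remove the wildcard record
print(propagation_check(zone, "_acme-challenge.gmem.cc.", "example-txt-value"))
```

With the wildcard in place the check looks under the load balancer hostname and fails; once it is removed, the check stays on _acme-challenge.gmem.cc. and finds the value.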
The solution was to remove the wildcard record *.gmem.cc, which was a CNAME to a Tencent Cloud load balancer address. After a few minutes, the cached DNS answer expired and Cert-Manager finally logged something different:
I1014 06:25:45.974354 1 wait.go:298] Searching fqdn "_acme-challenge.gmem.cc." using seed nameservers [10.231.18.121:53]
I1014 06:25:46.453434 1 wait.go:383] Returning discovered zone record "gmem.cc." for fqdn "_acme-challenge.gmem.cc."
I1014 06:25:46.454660 1 wait.go:316] Returning authoritative nameservers [c.dnspod.com., a.dnspod.com., b.dnspod.com.]
I1014 06:25:46.462705 1 wait.go:141] Looking up TXT records for "_acme-challenge.gmem.cc."
These lines indicated that Cert-Manager was now looking up the correct TXT record. The subsequent logs also revealed some details of how Cert-Manager works:
I1014 06:25:47.050500 1 dns.go:128] "cert-manager/challenges/Check: waiting DNS record TTL to allow the DNS01 record to propagate for domain" resource_name="wildcard-ssl-5pgsr-3353861729-3026093647" resource_namespace="istio-system" resource_kind="Challenge" resource_version="v1" dnsName="gmem.cc" type="DNS-01" resource_name="wildcard-ssl-5pgsr-3353861729-3026093647" resource_namespace="istio-system" resource_kind="Challenge" resource_version="v1" domain="gmem.cc" ttl=60 fqdn="_acme-challenge.gmem.cc."
This line shows that after Cert-Manager has validated the TXT record itself, it waits for the record's TTL (60 seconds here) before proceeding, in case the ZeroSSL server cannot see the record yet.
I1014 06:26:47.051650 1 sync.go:359] "cert-manager/challenges/acceptChallenge: accepting challenge with ACME server" resource_name="wildcard-ssl-5pgsr-3353861729-3026093647" resource_namespace="istio-system" resource_kind="Challenge" resource_version="v1" dnsName="gmem.cc" type="DNS-01"
I1014 06:26:47.051665 1 logger.go:81] "cert-manager/acme-middleware: Calling Accept"
I1014 06:26:49.708221 1 sync.go:376] "cert-manager/challenges/acceptChallenge: waiting for authorization for domain" resource_name="wildcard-ssl-5pgsr-3353861729-3026093647" resource_namespace="istio-system" resource_kind="Challenge" resource_version="v1" dnsName="gmem.cc" type="DNS-01"
I1014 06:26:49.708250 1 logger.go:99] "cert-manager/acme-middleware: Calling WaitAuthorization"
I1014 06:26:50.194610 1 logs.go:199] "cert-manager/controller: Event(v1.ObjectReference{Kind:\"Challenge\", Namespace:\"istio-system\", Name:\"wildcard-ssl-5pgsr-3353861729-3026093647\", UID:\"a4b2f2d7-6185-4072-92f0-b7a89418cdb1\", APIVersion:\"acme.cert-manager.io/v1\", ResourceVersion:\"463570645\", FieldPath:\"\"}): type: 'Normal' reason: 'DomainVerified' Domain \"gmem.cc\" verified with \"DNS-01\" validation"
After the wait, Cert-Manager called the ZeroSSL server's Accept and WaitAuthorization operations, and the server verified that we owned the domain name.
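The TTL wait can be checked against the log timestamps. The small Python sketch below parses the time of day from the "waiting DNS record TTL" line and the "accepting challenge" line (messages shortened here) and computes the gap between them.

```python
from datetime import datetime


def klog_time(line: str) -> datetime:
    """Parse the time of day from a klog line such as 'I1014 06:25:47.050500 ...'."""
    # klog headers carry month/day and time but no year, so only the
    # time of day is parsed here; both lines are from the same day.
    return datetime.strptime(line.split()[1], "%H:%M:%S.%f")


waiting = "I1014 06:25:47.050500 1 dns.go:128] waiting DNS record TTL"
accepting = "I1014 06:26:47.051650 1 sync.go:359] accepting challenge"

elapsed = (klog_time(accepting) - klog_time(waiting)).total_seconds()
print(f"{elapsed:.3f} seconds")   # 60.001 seconds
```

The gap matches the ttl=60 value in the log almost exactly, which supports reading that line as a deliberate one-TTL pause.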
In this case, the root cause of the certificate issuance failure was a CNAME record interfering with Cert-Manager's DNS-01 challenge. It had nothing to do with the ACME server but was linked to Cert-Manager's internal implementation.
To avoid similar issues in the future, we need to stop using wildcard DNS records for the domain.