Investigating and Solving the Issue of Failed Certificate Request with ZeroSSL and Cert-Manager
In this blog post, I will walk through my journey investigating and resolving an issue where my certificate request from ZeroSSL, using Cert-Manager, remained in a "not ready" state for over two days. I'll cover the tools involved, provide background information, and show how I eventually identified and fixed the problem.
ACME (Automatic Certificate Management Environment) is a protocol developed by the Internet Security Research Group (ISRG) to automate the process of obtaining and managing SSL/TLS certificates from Certificate Authorities (CAs). It simplifies the traditionally manual steps involved in certificate issuance by using an automated process. ACME is widely used by services like Let’s Encrypt and ZeroSSL to secure websites with HTTPS.
The ACME protocol automates interactions between a client (such as Cert-Manager or Certbot) and a CA, allowing the client to request, renew, and manage certificates without human intervention. ACME operates through a series of challenges that prove ownership of the domain for which a certificate is requested. Once the ownership is verified, the CA can issue the certificate.
In the HTTP-01 challenge, the client proves ownership of a domain by hosting a specific file at a designated path (e.g., http://gmem.cc/.well-known/acme-challenge/<token>). The CA attempts to retrieve this file, and if successful, the challenge is validated. This challenge is commonly used for publicly accessible web servers.
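To make the HTTP-01 mechanics concrete, here is a minimal Python sketch of what the client has to serve, based on RFC 8555 section 8.3. The function name is hypothetical, and the token and thumbprint values are made-up placeholders; in a real exchange the CA supplies the token and the client derives the thumbprint from its ACME account key.

```python
def http01_challenge(domain: str, token: str, thumbprint: str) -> tuple[str, str]:
    """Return (url, body) that the CA expects for an HTTP-01 challenge.

    thumbprint is the base64url-encoded SHA-256 JWK thumbprint of the
    ACME account key (assumed to be computed elsewhere).
    """
    url = f"http://{domain}/.well-known/acme-challenge/{token}"
    body = f"{token}.{thumbprint}"  # the "key authorization" string
    return url, body


# Placeholder values, for illustration only.
url, body = http01_challenge("example.com", "tok123", "thumb456")
print(url)
print(body)
```

The CA fetches the URL over plain HTTP on port 80 and compares the response body with the key authorization it computed on its side.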
In the DNS-01 challenge, the client proves domain ownership by creating a special DNS TXT record for the domain. The CA checks the DNS record to confirm ownership. DNS-01 is typically used for wildcard certificates or when the server is not publicly accessible because it doesn’t rely on serving files over HTTP.
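As a rough illustration of what that TXT record contains, the sketch below derives the record name and value the way RFC 8555 section 8.4 describes: the value is the unpadded base64url encoding of the SHA-256 digest of the key authorization. The function name and the key authorization string are placeholders, not real challenge data.

```python
import base64
import hashlib


def dns01_record(domain: str, key_authorization: str) -> tuple[str, str]:
    """Return (record_name, txt_value) for a DNS-01 challenge."""
    # A wildcard name is validated at its base domain.
    base = domain[2:] if domain.startswith("*.") else domain
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    # base64url encoding with the trailing "=" padding removed
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"_acme-challenge.{base}", value


# Placeholder key authorization ("<token>.<account key thumbprint>").
name, value = dns01_record("*.gmem.cc", "tok123.thumb456")
print(name)   # _acme-challenge.gmem.cc
print(value)
```

Note that a wildcard name and its base domain share the same record name, which is why the TXT record in this post lives at _acme-challenge.gmem.cc.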
The TLS-ALPN-01 challenge requires the client to prove control of a domain by configuring a TLS server with a special certificate during the ACME validation process. The CA then connects to the server via TLS and checks the certificate. This challenge is less common and usually used in specialized environments.
Cert-Manager is an open-source Kubernetes add-on that automates the management, issuance, and renewal of certificates within Kubernetes clusters. It integrates with various Certificate Authorities (CAs) and protocols, including ACME (used by providers like Let’s Encrypt and ZeroSSL). Cert-Manager is widely used to ensure that certificates remain valid and secure without manual intervention.
When deploying Cert-Manager in Kubernetes, several key components work together to handle certificate management.
The Cert-Manager controller is the core component of the system. It manages the lifecycle of certificates and interacts with Issuers (such as ACME servers like Let’s Encrypt or ZeroSSL). It runs as a Kubernetes controller and is responsible for:
- Watching Certificate, CertificateRequest, Issuer, ClusterIssuer, Order, and Challenge resources.
- Requesting certificates from CAs.
- Automatically renewing certificates before expiration.
- Handling the interactions with external CAs (via ACME, Vault, Venafi, etc.).
This component performs the actual management of certificates, from creation to renewal, ensuring that the requested certificates are stored securely as Kubernetes Secrets.
The CA Injector is an additional component that works alongside Cert-Manager to inject CA data into other Kubernetes resources. It primarily operates on Kubernetes ValidatingWebhookConfiguration and MutatingWebhookConfiguration resources, injecting certificates into them automatically. This is necessary for:
- Mutating admission controllers that require TLS certificates for secure communication.
- Ensuring that Kubernetes components relying on CA certificates have up-to-date CA data.
The cainjector is critical in environments where certain Kubernetes components (e.g., webhooks) require their certificates to be signed by a trusted CA.
The Webhook component provides an admission controller that validates Cert-Manager resources like Certificate, Issuer, ClusterIssuer, and CertificateRequest upon creation or update. It ensures that the Cert-Manager resources are correctly configured by:
- Validating resources before they are accepted into the Kubernetes API (syntax and structure).
- Mutating resources to provide defaults (for example, setting default values in a Certificate resource).
- Providing a layer of security and correctness by ensuring invalid configurations are caught early.
The webhook helps catch configuration issues early, improving the reliability of certificate management workflows.
cert-manager-webhook-dnspod is a community-maintained webhook that handles DNS-01 challenges for domains hosted in Tencent Cloud’s DNSPod. When Cert-Manager requests a certificate using the DNS-01 challenge, it needs to create a DNS TXT record in the domain's DNS zone. The webhook does this by calling the DNSPod API to manage DNS records.
Cert-Manager invokes this webhook when it needs to solve a DNS-01 challenge using DNSPod as the DNS provider. The webhook receives instructions from Cert-Manager, communicates with the DNSPod API to create or delete DNS TXT records, and reports back to Cert-Manager when the challenge is complete.
In this post the webhook was created with the following manifest:
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: cert-manager-webhook-dnspod
      namespace: argocd
    spec:
      destination:
        namespace: cert-manager
        server: https://kubernetes.default.svc
      project: default
      source:
        repoURL: https://github.com/imroc/cert-manager-webhook-dnspod
        targetRevision: master
        path: charts/cert-manager-webhook-dnspod
        helm:
          releaseName: cert-manager-webhook-dnspod
          values: |
            groupName: acme.dnspod.tencent.com
            clusterIssuer:
              name: letsencrypt-issuer
              secretId: DNSPOD_SECRETID
              secretKey: DNSPOD_SECRETKEY
              email: gmem@me.com
      syncPolicy:
        syncOptions:
        - CreateNamespace=true
        automated:
          prune: true
          selfHeal: true
Cert-Manager relies on several types of custom resources that work together to manage the certificate lifecycle:
- Issuer/ClusterIssuer:
  - An Issuer or ClusterIssuer is a Cert-Manager resource that defines how certificates should be requested from a Certificate Authority (CA). The difference between them is that an Issuer is scoped to a single namespace, whereas a ClusterIssuer can be used cluster-wide.
  - Issuers can be configured to use different CA backends, such as ACME, Vault, or self-signed certificates.
- Certificate: The Certificate resource defines which certificates should be issued and managed by Cert-Manager. It specifies details such as the domain names, issuer to use, renewal period, and where to store the certificate (usually in a Kubernetes secret).
- CertificateRequest: When a Certificate resource is created, Cert-Manager generates a CertificateRequest. This resource represents the actual request to the Issuer for a certificate. Cert-Manager manages these requests and handles the approval and signing process.
- Order: When Cert-Manager requests a certificate from an ACME-based Issuer, it creates an Order resource to track the status of the certificate issuance. The Order keeps track of challenges and interactions with the ACME server.
- Challenge: A Challenge resource represents the ACME challenge issued by the CA (such as DNS-01 or HTTP-01). The challenge proves domain ownership by requiring the client (Cert-Manager) to respond to a domain validation request from the CA.
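To show how these resources reference each other, here is a minimal sketch of a Certificate for the wildcard domain, written as a Python dict so it can be generated and piped to `kubectl apply -f -` (which accepts JSON). The field names follow the cert-manager.io/v1 API. The Certificate name and namespace are inferred from the Order name that appears later in this post, while the Secret name and the issuer name are assumptions for illustration only.

```python
import json

certificate = {
    "apiVersion": "cert-manager.io/v1",
    "kind": "Certificate",
    # Name and namespace inferred from the Order name in this post.
    "metadata": {"name": "wildcard-ssl", "namespace": "istio-system"},
    "spec": {
        "secretName": "wildcard-ssl-tls",  # hypothetical Secret name
        "dnsNames": ["*.gmem.cc"],
        "issuerRef": {
            "name": "letsencrypt-issuer",  # assumed: issuer name from the chart values
            "kind": "ClusterIssuer",
            "group": "cert-manager.io",
        },
    },
}

manifest = json.dumps(certificate, indent=2)
print(manifest)
```

Once such a Certificate is applied, Cert-Manager creates the CertificateRequest, Order, and Challenge resources on its own; none of them needs to be written by hand.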
The HTTP-01 challenge is used when the domain is publicly accessible over HTTP. The CA validates ownership of the domain by requesting a specific file over HTTP. Cert-Manager sets up the challenge response using a Kubernetes Ingress resource.
- Create an Issuer or ClusterIssuer: The Issuer defines the connection to the CA (such as Let’s Encrypt or ZeroSSL) and specifies that ACME should be used with the HTTP-01 challenge.
- Create a Certificate Resource: A Certificate resource is created that specifies the domain names for which a certificate is needed and the Issuer to use.
- Cert-Manager Creates a CertificateRequest: Cert-Manager generates a CertificateRequest based on the Certificate resource.
- Cert-Manager Creates an Order: Cert-Manager creates an Order resource to track the status of the certificate request with the ACME server.
- Cert-Manager Creates an HTTP-01 Challenge: An HTTP-01 challenge is created, and Cert-Manager configures an Ingress resource to serve the challenge response at the path /.well-known/acme-challenge/.
- ACME Server Attempts to Validate: The CA attempts to access the challenge file via the HTTP URL (e.g., http://example.com/.well-known/acme-challenge/<token>). If successful, the challenge is validated.
- Certificate Issued: Once the challenge is validated, the CA issues the certificate, and Cert-Manager stores it in the specified Kubernetes Secret.
The DNS-01 challenge is used when domain ownership must be validated via DNS records. This method is often preferred for wildcard certificates or domains that are not publicly accessible over HTTP.
- Create an Issuer or ClusterIssuer: The Issuer specifies that ACME should be used with the DNS-01 challenge and includes configuration for interacting with the DNS provider (e.g., AWS Route53, Cloudflare, or a custom webhook like dnspod).
- Create a Certificate Resource: A Certificate resource is created that defines the domains for which the certificate is needed and the Issuer to use.
- Cert-Manager Creates a CertificateRequest: Cert-Manager generates a CertificateRequest based on the Certificate resource.
- Cert-Manager Creates an Order: Cert-Manager creates an Order resource to track the status of the certificate request with the ACME server.
- Cert-Manager Creates a DNS-01 Challenge: A DNS-01 challenge is created, and Cert-Manager interacts with the configured DNS provider to automatically create a special TXT record for the domain (e.g., _acme-challenge.example.com).
- ACME Server Attempts to Validate: The CA checks for the presence of the _acme-challenge.example.com TXT record in the domain’s DNS records. If the correct value is found, the challenge is validated.
- Certificate Issued: Once the DNS challenge is validated, the CA issues the certificate, and Cert-Manager stores it in the specified Kubernetes Secret.
I created a ClusterIssuer and a Certificate for the wildcard domain *.gmem.cc. However, after waiting for more than two days, the certificate status still showed as not ready, with the following condition in the certificate's status:
    status:
      conditions:
      - lastTransitionTime: "2024-10-14T06:02:10Z"
        message: Issuing certificate as Secret does not exist
        observedGeneration: 1
        reason: DoesNotExist
        status: "False"
        type: Ready
I checked on DNSPod and found a TXT record _acme-challenge.gmem.cc with the TTL set to 600 seconds.
To diagnose the issue, I checked the status of the related Cert-Manager resources: CertificateRequest, Order, and Challenge. These are the key resources that Cert-Manager uses to interact with ACME and handle certificate issuance. Here’s an overview of my findings:
CertificateRequest:
    status:
      conditions:
      - lastTransitionTime: "2024-10-14T06:02:14Z"
        message: Certificate request has been approved by cert-manager.io
        reason: cert-manager.io
        status: "True"
        type: Approved
      - lastTransitionTime: "2024-10-14T06:02:14Z"
        message: 'Waiting on certificate issuance from order istio-system/wildcard-ssl-5pgsr-3353861729: "pending"'
        reason: Pending
        status: "False"
        type: Ready
The certificate request was approved, but the system was waiting for the certificate issuance to complete.
Order:
    status:
      state: pending
      authorizations:
      - challenges:
        - token: iQMwrfsFRmJ_MytUY3N4NW6QehtTn0-IEvJWAmYEw_k
          type: dns-01
          url: https://acme.zerossl.com/v2/DV90/chall/DDRBMBd9jnJo_W4EcQfSWQ
        wildcard: true
The order was still in a "pending" state, and the DNS-01 challenge had not been completed yet.
Challenge:
    status:
      presented: true
      processing: true
      reason: 'Waiting for DNS-01 challenge propagation: DNS record for "gmem.cc" not yet propagated'
      state: pending
The challenge was waiting for DNS propagation, even though the TXT record mentioned above had been created two days earlier.
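Walking this chain by hand gets tedious, so the same check can be scripted. The sketch below assumes each resource has already been loaded into a Python dict (for example from `kubectl get ... -o json`); the helper names are hypothetical and the sample data is copied from the status blocks above.

```python
from typing import Optional


def ready_condition(resource: dict) -> Optional[dict]:
    """Return the Ready condition of a resource, if it has one."""
    for cond in resource.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond
    return None


def blocking_reason(resource: dict) -> str:
    """Summarise in one line why a resource is not ready yet."""
    cond = ready_condition(resource)
    if cond is not None:
        return f'{cond["reason"]}: {cond["message"]}'
    # Order and Challenge resources report state/reason instead of conditions.
    status = resource.get("status", {})
    return f'{status.get("state", "unknown")}: {status.get("reason", "")}'


certificate_status = {"status": {"conditions": [{
    "type": "Ready", "status": "False", "reason": "DoesNotExist",
    "message": "Issuing certificate as Secret does not exist"}]}}
challenge_status = {"status": {
    "presented": True, "processing": True, "state": "pending",
    "reason": 'Waiting for DNS-01 challenge propagation: '
              'DNS record for "gmem.cc" not yet propagated'}}

for res in (certificate_status, challenge_status):
    print(blocking_reason(res))
```

Reading the output from top to bottom mirrors the manual investigation: the Certificate only says the Secret is missing, while the Challenge at the end of the chain names the actual blocker.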
DNS propagation can be delayed for many reasons. To check whether the TXT record _acme-challenge.gmem.cc is visible around the world, we can use whatsmydns.net:
https://www.whatsmydns.net/#TXT/_acme-challenge.gmem.cc
In this case, the check result was that the record had been fully propagated.
I then checked the logs for the Cert-Manager pod to see if any errors were being reported. The relevant and repeating error message from the logs was:
E1014 05:52:58.647839 1 sync.go:190] "cert-manager/challenges: propagation check failed" err="DNS record for \"gmem.cc\" not yet propagated"
At this point, it seemed that Cert-Manager was unable to see the propagated DNS records, even though I had confirmed their existence.
After enabling verbose logging in Cert-Manager (log level 5, set with --v=5), I finally discovered the root cause. The detailed logs revealed that Cert-Manager was replacing the challenge name with a CNAME target before looking up the TXT record:
I1014 06:20:33.919849 1 dns.go:116] "cert-manager/challenges/Check: checking DNS propagation"
I1014 06:20:33.921190 1 wait.go:90] Updating FQDN: _acme-challenge.gmem.cc. with its CNAME: lb-db1nok14-foh4te5vrj0dya3c.clb.sg-tencentclb.com.
I1014 06:25:35.886683 1 wait.go:298] Searching fqdn "lb-db1nok14-foh4te5vrj0dya3c.clb.sg-tencentclb.com." using seed nameservers [10.231.18.121:53]
I1014 06:25:35.886696 1 wait.go:329] Returning cached zone record "sg-tencentclb.com." for fqdn "lb-db1nok14-foh4te5vrj0dya3c.clb.sg-tencentclb.com."
I1014 06:20:33.987786 1 wait.go:141] Looking up TXT records for "lb-db1nok14-foh4te5vrj0dya3c.clb.sg-tencentclb.com."
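The challenge was failing because the wildcard record *.gmem.cc was a CNAME pointing to another domain, lb-db1nok14-foh4te5vrj0dya3c.clb.sg-tencentclb.com. Cert-Manager's propagation check resolved _acme-challenge.gmem.cc, received that CNAME, followed it, and then looked for the TXT record at the load balancer's name, where it did not exist.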
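The failure mode can be reproduced with a toy model. The sketch below is not Cert-Manager's code; it is a simplified in-memory imitation of the behaviour seen in the logs, where a CNAME answer for the challenge name is followed before the TXT lookup. The wildcard matching is deliberately naive and models what these logs show rather than DNS wildcard semantics in general. All function names and the TXT value are placeholders.

```python
LB = "lb-db1nok14-foh4te5vrj0dya3c.clb.sg-tencentclb.com."

# Toy zone data modelled on the post.
RECORDS = {
    ("_acme-challenge.gmem.cc.", "TXT"): ["<challenge-value>"],
    ("*.gmem.cc.", "CNAME"): [LB],
}


def lookup(records: dict, fqdn: str, rtype: str) -> list:
    """Exact match first, then a one-level wildcard match (simplified)."""
    if (fqdn, rtype) in records:
        return records[(fqdn, rtype)]
    parent = fqdn.split(".", 1)[1]
    return records.get((f"*.{parent}", rtype), [])


def txt_lookup_name(records: dict, fqdn: str, max_hops: int = 8) -> str:
    """Follow CNAMEs from fqdn and return the name whose TXT gets checked."""
    for _ in range(max_hops):
        cname = lookup(records, fqdn, "CNAME")
        if not cname:
            break
        fqdn = cname[0]
    return fqdn


# With the wildcard CNAME in place, the check lands on the load balancer name.
checked = txt_lookup_name(RECORDS, "_acme-challenge.gmem.cc.")
print(checked, lookup(RECORDS, checked, "TXT"))

# Without it, the check stays on the challenge name and finds the record.
fixed = {k: v for k, v in RECORDS.items() if k[0] != "*.gmem.cc."}
checked_fixed = txt_lookup_name(fixed, "_acme-challenge.gmem.cc.")
print(checked_fixed, lookup(fixed, checked_fixed, "TXT"))
```

In the first case the TXT lookup returns nothing, which corresponds to the "not yet propagated" message, even though the record exists at the original name.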
The solution was to remove the wildcard record *.gmem.cc, which was a CNAME to the Tencent Cloud load balancer address. After a few minutes, the cached DNS answer expired and Cert-Manager finally logged something different:
I1014 06:25:45.974354 1 wait.go:298] Searching fqdn "_acme-challenge.gmem.cc." using seed nameservers [10.231.18.121:53]
I1014 06:25:46.453434 1 wait.go:383] Returning discovered zone record "gmem.cc." for fqdn "_acme-challenge.gmem.cc."
I1014 06:25:46.454660 1 wait.go:316] Returning authoritative nameservers [c.dnspod.com., a.dnspod.com., b.dnspod.com.]
I1014 06:25:46.462705 1 wait.go:141] Looking up TXT records for "_acme-challenge.gmem.cc."
These lines show Cert-Manager querying the correct name and finding the TXT record. The subsequent logs also reveal some details of how Cert-Manager works:
I1014 06:25:47.050500 1 dns.go:128] "cert-manager/challenges/Check: waiting DNS record TTL to allow the DNS01 record to propagate for domain" resource_name="wildcard-ssl-5pgsr-3353861729-3026093647" resource_namespace="istio-system" resource_kind="Challenge" resource_version="v1" dnsName="gmem.cc" type="DNS-01" resource_name="wildcard-ssl-5pgsr-3353861729-3026093647" resource_namespace="istio-system" resource_kind="Challenge" resource_version="v1" domain="gmem.cc" ttl=60 fqdn="_acme-challenge.gmem.cc."
This line shows that after Cert-Manager has verified the TXT record itself, it waits for the record's TTL (60 seconds here) before proceeding, in case the ZeroSSL servers cannot see the record yet.
I1014 06:26:47.051650 1 sync.go:359] "cert-manager/challenges/acceptChallenge: accepting challenge with ACME server" resource_name="wildcard-ssl-5pgsr-3353861729-3026093647" resource_namespace="istio-system" resource_kind="Challenge" resource_version="v1" dnsName="gmem.cc" type="DNS-01"
I1014 06:26:47.051665 1 logger.go:81] "cert-manager/acme-middleware: Calling Accept"
I1014 06:26:49.708221 1 sync.go:376] "cert-manager/challenges/acceptChallenge: waiting for authorization for domain" resource_name="wildcard-ssl-5pgsr-3353861729-3026093647" resource_namespace="istio-system" resource_kind="Challenge" resource_version="v1" dnsName="gmem.cc" type="DNS-01"
I1014 06:26:49.708250 1 logger.go:99] "cert-manager/acme-middleware: Calling WaitAuthorization"
I1014 06:26:50.194610 1 logs.go:199] "cert-manager/controller: Event(v1.ObjectReference{Kind:\"Challenge\", Namespace:\"istio-system\", Name:\"wildcard-ssl-5pgsr-3353861729-3026093647\", UID:\"a4b2f2d7-6185-4072-92f0-b7a89418cdb1\", APIVersion:\"acme.cert-manager.io/v1\", ResourceVersion:\"463570645\", FieldPath:\"\"}): type: 'Normal' reason: 'DomainVerified' Domain \"gmem.cc\" verified with \"DNS-01\" validation"
After the wait, Cert-Manager called Accept and then WaitAuthorization against the ZeroSSL server, and the server verified that we owned the domain name.
In this case, the root cause of the certificate issuance failure was a CNAME record interfering with Cert-Manager's DNS-01 challenge. The ACME server was never at fault: Cert-Manager's own propagation check follows CNAMEs, so the challenge stalled before ZeroSSL was even asked to validate it.
To avoid similar issues in the future, we need to stop using wildcard DNS records such as the *.gmem.cc CNAME on domains validated with DNS-01.