From ece9de83cbee03a45ea933d4712037d75b532d1d Mon Sep 17 00:00:00 2001
From: Thanatat Tamtan
Date: Thu, 18 Jun 2026 12:17:32 +0700
Subject: [PATCH 1/2] Document deployment + cert cleanup behavior
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add two sections describing the automatic cleanup behaviors:

- deployments/overview.md "Automatic error cleanup" — a deployment with
  no ready pod for 15 minutes is marked error and torn down; covers
  what's excluded (cronjob / scale-to-zero / paused / rollouts) and how
  a failed rollout differs.
- networking/domains.md "When a certificate can't be issued" — a cert
  that never issues within 24 hours is reclaimed and the domain flips
  to error; covers causes and recovery.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 content/deployments/overview.md | 28 ++++++++++++++++++++++++++++
 content/networking/domains.md   | 23 +++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/content/deployments/overview.md b/content/deployments/overview.md
index d4db4a2..5fb1dd0 100644
--- a/content/deployments/overview.md
+++ b/content/deployments/overview.md
@@ -68,6 +68,34 @@ configured replica range. Metrics flow into the dashboard and usage to billing.
 If readiness never passes, the rollout fails and the previous revision keeps
 serving — you don't get a broken deployment because of a bad image.
 
+## Automatic error cleanup
+
+A deployment that *should* keep pods running but has **no ready pod for 15
+minutes** is automatically marked **error** and its workload is torn down. This
+catches a deployment that applied cleanly but then can't stay up — a
+crash-looping image, an image that never pulls, or a readiness probe that never
+passes — so a dead deployment doesn't sit consuming a slot indefinitely.
+
+It only ever acts on a deployment that is *supposed* to have a running pod, so it
+leaves these alone:
+
+- **Scheduled jobs (CronJob)** — they have no standing pods between runs.
+- **Scale-to-zero** deployments — zero pods is the configured intent.
+- **Paused** deployments, in-flight rollouts, and freshly-deployed revisions
+  (which get a grace period to pull the image and start up).
+
+To recover, fix the image or configuration and deploy again — the deployment is
+recreated from its spec. While it's torn down its URL stops serving, so a
+redeploy is what brings it back.
+
+{{< callout type="note" >}}
+A failed *rollout* is different: if a new revision can't become ready, the
+**previous revision keeps serving** and nothing is torn down. Cleanup only fires
+when there is no ready pod at all for the full grace window — and it backs off
+during a cluster-wide incident, so a bad node pool or a registry outage doesn't
+mass-error your deployments.
+{{< /callout >}}
+
 ## How to drive it
 
 Anything you can do from this page, you can do from the [CLI](/automation/cli/)
diff --git a/content/networking/domains.md b/content/networking/domains.md
index 8865a9d..6814a97 100644
--- a/content/networking/domains.md
+++ b/content/networking/domains.md
@@ -64,6 +64,29 @@ Wildcard domains require **DNS-01** verification, so you'll also need to add a
 verify with HTTP-01 and don't need that record.
 {{< /callout >}}
 
+## When a certificate can't be issued
+
+A verified domain normally gets its TLS certificate within a minute or two. Once
+in a while issuance keeps failing — Let's Encrypt is rate-limiting the account, a
+`CAA` record blocks Let's Encrypt, or (for wildcards) the `_acme-challenge` CNAME
+isn't in place — and the certificate stays **issuing** without ever completing.
+
+If a certificate stays unissued for **more than 24 hours**, the platform reclaims
+it: the stale request is removed and the domain flips to **error**. This stops a
+permanently-failing request from burning Let's Encrypt quota and surfaces the
+problem instead of leaving the domain silently without HTTPS.
+
+To recover, fix the underlying cause — clear the `CAA` restriction, add the
+`_acme-challenge` CNAME the console shows, or wait out a Let's Encrypt
+rate-limit — and the platform re-requests the certificate automatically. The
+domain returns to **active** once the certificate issues.
+
+{{< callout type="note" >}}
+After a reclaim the platform keeps retrying about once a day, so a domain that
+becomes issuable later (a rate-limit clears, a missing record is added) recovers
+on its own — you don't need to re-create it.
+{{< /callout >}}
+
 ## Routing traffic
 
 Creating a domain alone doesn't send any traffic to a deployment — you still

From a4992791e276d6ede939ec41c87c9fe5954a926e Mon Sep 17 00:00:00 2001
From: Thanatat Tamtan
Date: Thu, 18 Jun 2026 12:33:17 +0700
Subject: [PATCH 2/2] docs: drop scale-to-zero from auto-error exclusions

deploys.app doesn't offer scale-to-zero, so don't list it as an excluded
case.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 content/deployments/overview.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/content/deployments/overview.md b/content/deployments/overview.md
index 5fb1dd0..129e20f 100644
--- a/content/deployments/overview.md
+++ b/content/deployments/overview.md
@@ -80,7 +80,6 @@ It only ever acts on a deployment that is *supposed* to have a running pod, so i
 leaves these alone:
 
 - **Scheduled jobs (CronJob)** — they have no standing pods between runs.
-- **Scale-to-zero** deployments — zero pods is the configured intent.
 - **Paused** deployments, in-flight rollouts, and freshly-deployed revisions
   (which get a grace period to pull the image and start up).
 
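
Reviewer's note (not part of either patch): the two rules these docs describe can be read as simple predicates. The Python below is a minimal sketch under stated assumptions, not the platform's implementation. Only the 15-minute and 24-hour thresholds and the exclusion list (after PATCH 2/2) come from the documentation text; the `Deployment` fields, the function names, and the `cluster_incident` flag are hypothetical, and treating "backs off" as "never fires during an incident" is a simplification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Thresholds quoted from the documentation text in the patches.
NO_READY_POD_WINDOW = timedelta(minutes=15)
CERT_ISSUE_WINDOW = timedelta(hours=24)


@dataclass
class Deployment:
    # All field names are hypothetical; they mirror the documented exclusions.
    is_cronjob: bool = False
    paused: bool = False
    rollout_in_flight: bool = False
    in_startup_grace: bool = False
    # When the last ready pod disappeared; None means a pod is ready now.
    no_ready_pod_since: Optional[datetime] = None


def should_mark_error(d: Deployment, now: datetime,
                      cluster_incident: bool = False) -> bool:
    """Return True when the documented auto-cleanup rule would apply."""
    # Excluded cases: nothing here is supposed to have a standing ready pod yet.
    if d.is_cronjob or d.paused or d.rollout_in_flight or d.in_startup_grace:
        return False
    # Assumption: back-off during a cluster-wide incident is modeled as "skip".
    if cluster_incident:
        return False
    if d.no_ready_pod_since is None:
        return False
    return now - d.no_ready_pod_since >= NO_READY_POD_WINDOW


def should_reclaim_certificate(requested_at: datetime, issued: bool,
                               now: datetime) -> bool:
    """Return True when an unissued certificate is older than 24 hours."""
    return not issued and now - requested_at > CERT_ISSUE_WINDOW


if __name__ == "__main__":
    now = datetime(2026, 6, 18, 12, 0)
    crashing = Deployment(no_ready_pod_since=now - timedelta(minutes=20))
    print(should_mark_error(crashing, now))  # True
```

Keeping both checks as pure functions of a timestamp makes the exclusions easy to compare against the prose: every bullet in the "leaves these alone" list corresponds to one early return.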