From ece9de83cbee03a45ea933d4712037d75b532d1d Mon Sep 17 00:00:00 2001
From: Thanatat Tamtan <acoshift@gmail.com>
Date: Thu, 18 Jun 2026 12:17:32 +0700
Subject: [PATCH 1/2] Document deployment + cert cleanup behavior
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add two sections describing the automatic cleanup behaviors:
- deployments/overview.md "Automatic error cleanup" — a deployment with no
  ready pod for 15 minutes is marked error and torn down; covers what's
  excluded (cronjob / scale-to-zero / paused / rollouts) and how a failed
  rollout differs.
- networking/domains.md "When a certificate can't be issued" — a cert that
  never issues within 24 hours is reclaimed and the domain flips to error;
  covers causes and recovery.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 content/deployments/overview.md | 28 ++++++++++++++++++++++++++++
 content/networking/domains.md   | 23 +++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/content/deployments/overview.md b/content/deployments/overview.md
index d4db4a2..5fb1dd0 100644
--- a/content/deployments/overview.md
+++ b/content/deployments/overview.md
@@ -68,6 +68,34 @@ configured replica range. Metrics flow into the dashboard and usage to billing.
 If readiness never passes, the rollout fails and the previous revision keeps
 serving — you don't get a broken deployment because of a bad image.
 
+## Automatic error cleanup
+
+A deployment that *should* keep pods running but has **no ready pod for 15
+minutes** is automatically marked **error** and its workload is torn down. This
+catches a deployment that applied cleanly but then can't stay up — a
+crash-looping image, an image that never pulls, or a readiness probe that never
+passes — so a dead deployment doesn't sit consuming a slot indefinitely.
+
+It only ever acts on a deployment that is *supposed* to have a running pod, so it
+leaves these alone:
+
+- **Scheduled jobs (CronJob)** — they have no standing pods between runs.
+- **Scale-to-zero** deployments — zero pods is the configured intent.
+- **Paused** deployments, in-flight rollouts, and freshly-deployed revisions
+  (which get a grace period to pull the image and start up).
+
+To recover, fix the image or configuration and deploy again — the deployment is
+recreated from its spec. While it's torn down, its URL stops serving, so a
+redeploy is what brings it back.
+
+{{< callout type="note" >}}
+A failed *rollout* is different: if a new revision can't become ready, the
+**previous revision keeps serving** and nothing is torn down. Cleanup only fires
+when there is no ready pod at all for the full 15 minutes. It also backs off
+during a cluster-wide incident, so a bad node pool or a registry outage doesn't
+mass-error your deployments.
+{{< /callout >}}
+
 ## How to drive it
 
 Anything you can do from this page, you can do from the [CLI](/automation/cli/)
diff --git a/content/networking/domains.md b/content/networking/domains.md
index 8865a9d..6814a97 100644
--- a/content/networking/domains.md
+++ b/content/networking/domains.md
@@ -64,6 +64,29 @@ Wildcard domains require **DNS-01** verification, so you'll also need to add a
 verify with HTTP-01 and don't need that record.
 {{< /callout >}}
 
+## When a certificate can't be issued
+
+A verified domain normally gets its TLS certificate within a minute or two. Once
+in a while issuance keeps failing — Let's Encrypt is rate-limiting the account, a
+`CAA` record blocks Let's Encrypt, or (for wildcards) the `_acme-challenge` CNAME
+isn't in place — and the certificate stays **issuing** without ever completing.
+
+If a certificate stays unissued for **more than 24 hours**, the platform reclaims
+it: the stale request is removed and the domain flips to **error**. This stops a
+permanently failing request from burning Let's Encrypt quota and surfaces the
+problem instead of leaving the domain silently without HTTPS.
+
+To recover, fix the underlying cause — clear the `CAA` restriction, add the
+`_acme-challenge` CNAME the console shows, or wait out a Let's Encrypt
+rate-limit — and the platform re-requests the certificate automatically. The
+domain returns to **active** once the certificate issues.
+
+{{< callout type="note" >}}
+After a reclaim, the platform keeps retrying about once a day, so a domain that
+becomes issuable later (a rate-limit clears, a missing record is added) recovers
+on its own — you don't need to re-create it.
+{{< /callout >}}
+
 ## Routing traffic
 
 Creating a domain alone doesn't send any traffic to a deployment — you still

From a4992791e276d6ede939ec41c87c9fe5954a926e Mon Sep 17 00:00:00 2001
From: Thanatat Tamtan <acoshift@gmail.com>
Date: Thu, 18 Jun 2026 12:33:17 +0700
Subject: [PATCH 2/2] docs: drop scale-to-zero from auto-error exclusions

deploys.app doesn't offer scale-to-zero, so don't list it as an excluded case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 content/deployments/overview.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/content/deployments/overview.md b/content/deployments/overview.md
index 5fb1dd0..129e20f 100644
--- a/content/deployments/overview.md
+++ b/content/deployments/overview.md
@@ -80,7 +80,6 @@ It only ever acts on a deployment that is *supposed* to have a running pod, so i
 leaves these alone:
 
 - **Scheduled jobs (CronJob)** — they have no standing pods between runs.
-- **Scale-to-zero** deployments — zero pods is the configured intent.
 - **Paused** deployments, in-flight rollouts, and freshly-deployed revisions
   (which get a grace period to pull the image and start up).