diff --git a/content/deployments/overview.md b/content/deployments/overview.md
index d4db4a2..129e20f 100644
--- a/content/deployments/overview.md
+++ b/content/deployments/overview.md
@@ -68,6 +68,33 @@ configured replica range. Metrics flow into the dashboard and usage to billing.
 If readiness never passes, the rollout fails and the previous revision keeps
 serving — you don't get a broken deployment because of a bad image.
 
+## Automatic error cleanup
+
+A deployment that *should* keep pods running but has **no ready pod for 15
+minutes** is automatically marked **error** and its workload is torn down. This
+catches a deployment that applied cleanly but then can't stay up — a
+crash-looping image, an image that never pulls, or a readiness probe that never
+passes — so a dead deployment doesn't sit consuming a slot indefinitely.
+
+It only ever acts on a deployment that is *supposed* to have a running pod, so it
+leaves these alone:
+
+- **Scheduled jobs (CronJob)** — they have no standing pods between runs.
+- **Paused** deployments, in-flight rollouts, and freshly-deployed revisions
+  (which get a grace period to pull the image and start up).
+
+To recover, fix the image or configuration and deploy again — the deployment is
+recreated from its spec. While it's torn down its URL stops serving, so a
+redeploy is what brings it back.
+
+{{< callout type="note" >}}
+A failed *rollout* is different: if a new revision can't become ready, the
+**previous revision keeps serving** and nothing is torn down. Cleanup only fires
+when there is no ready pod at all for the full grace window — and it backs off
+during a cluster-wide incident, so a bad node pool or a registry outage doesn't
+mass-error your deployments.
+{{< /callout >}}
+
 ## How to drive it
 
 Anything you can do from this page, you can do from the [CLI](/automation/cli/)
diff --git a/content/networking/domains.md b/content/networking/domains.md
index 8865a9d..6814a97 100644
--- a/content/networking/domains.md
+++ b/content/networking/domains.md
@@ -64,6 +64,29 @@ Wildcard domains require **DNS-01** verification, so you'll also need to add a
 verify with HTTP-01 and don't need that record.
 {{< /callout >}}
 
+## When a certificate can't be issued
+
+A verified domain normally gets its TLS certificate within a minute or two. Once
+in a while issuance keeps failing — Let's Encrypt is rate-limiting the account, a
+`CAA` record blocks Let's Encrypt, or (for wildcards) the `_acme-challenge` CNAME
+isn't in place — and the certificate stays **issuing** without ever completing.
+
+If a certificate stays unissued for **more than 24 hours**, the platform reclaims
+it: the stale request is removed and the domain flips to **error**. This stops a
+permanently-failing request from burning Let's Encrypt quota and surfaces the
+problem instead of leaving the domain silently without HTTPS.
+
+To recover, fix the underlying cause — clear the `CAA` restriction, add the
+`_acme-challenge` CNAME the console shows, or wait out a Let's Encrypt
+rate-limit — and the platform re-requests the certificate automatically. The
+domain returns to **active** once the certificate issues.
+
+{{< callout type="note" >}}
+After a reclaim the platform keeps retrying about once a day, so a domain that
+becomes issuable later (a rate-limit clears, a missing record is added) recovers
+on its own — you don't need to re-create it.
+{{< /callout >}}
+
 ## Routing traffic
 
 Creating a domain alone doesn't send any traffic to a deployment — you still