27 changes: 27 additions & 0 deletions content/deployments/overview.md
@@ -68,6 +68,33 @@ configured replica range. Metrics flow into the dashboard and usage to billing.
If readiness never passes, the rollout fails and the previous revision keeps
serving — you don't get a broken deployment because of a bad image.

## Automatic error cleanup

A deployment that *should* keep pods running but has **no ready pod for 15
minutes** is automatically marked **error** and its workload is torn down. This
catches a deployment that applied cleanly but then can't stay up — a
crash-looping image, an image that never pulls, or a readiness probe that never
passes — so a dead deployment doesn't sit consuming a slot indefinitely.

It only ever acts on a deployment that is *supposed* to have a running pod, so it
leaves these alone:

- **Scheduled jobs (CronJob)** — they have no standing pods between runs.
- **Paused** deployments, in-flight rollouts, and freshly deployed revisions
  (which get a grace period to pull the image and start up).
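
The rules above can be read as a single predicate. The following is a minimal sketch of that reading, not the platform's implementation: the `Deployment` fields, the `"cronjob"` kind value, and the 10-minute startup grace are assumptions made for illustration (the text states only the 15-minute window, not the length of the grace period).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

NO_READY_POD_WINDOW = timedelta(minutes=15)  # stated in the text above
STARTUP_GRACE = timedelta(minutes=10)        # assumed; the real length is not documented here


@dataclass
class Deployment:
    """Hypothetical view of a deployment; field names are illustrative."""
    kind: str                          # "service" or "cronjob" (assumed values)
    paused: bool
    rollout_in_flight: bool
    deployed_at: datetime              # when the current revision was deployed
    last_ready_at: Optional[datetime]  # last time any pod was ready, None if never


def should_mark_error(d: Deployment, now: datetime) -> bool:
    """Return True if the deployment qualifies for automatic error cleanup."""
    if d.kind == "cronjob":
        return False  # no standing pods between runs
    if d.paused or d.rollout_in_flight:
        return False  # not currently expected to have a ready pod
    if now - d.deployed_at < STARTUP_GRACE:
        return False  # fresh revision: still pulling the image or starting up
    # Measure the outage from the last ready pod, or from deploy time if none ever was.
    not_ready_since = d.last_ready_at or d.deployed_at
    return now - not_ready_since >= NO_READY_POD_WINDOW


now = datetime.now(timezone.utc)
crash_looping = Deployment("service", False, False, now - timedelta(hours=1), None)
print(should_mark_error(crash_looping, now))
```

Checking the exclusions first keeps the 15-minute comparison from ever running against a deployment that was never supposed to have a pod.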

To recover, fix the image or configuration and deploy again; the deployment is
recreated from its spec. While it's torn down, its URL stops serving, so a
redeploy is what brings it back.

{{< callout type="note" >}}
A failed *rollout* is different: if a new revision can't become ready, the
**previous revision keeps serving** and nothing is torn down. Cleanup only fires
when there is no ready pod at all for the full 15-minute window. It also backs
off during a cluster-wide incident, so a bad node pool or a registry outage
doesn't mass-error your deployments.
{{< /callout >}}
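
One way to picture the incident back-off is a guard that refuses to run cleanup when too many deployments look unhealthy at once. This is only a sketch under assumptions: the docs do not say how the platform detects a cluster-wide incident, and the function name and the 20% threshold are hypothetical.

```python
def cleanup_allowed(total: int, without_ready_pod: int,
                    max_unhealthy_fraction: float = 0.2) -> bool:
    """Return True if cleanup may run; the threshold is an assumed value.

    When a large share of deployments has no ready pod at the same time, the
    cause is more likely shared infrastructure (node pool, registry) than each
    individual deployment, so cleanup is skipped for this pass.
    """
    if total == 0:
        return False  # nothing to clean up
    return without_ready_pod / total <= max_unhealthy_fraction


print(cleanup_allowed(total=100, without_ready_pod=3))   # isolated failures
print(cleanup_allowed(total=100, without_ready_pod=60))  # looks like an incident
```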

## How to drive it

Anything you can do from this page, you can do from the [CLI](/automation/cli/)
23 changes: 23 additions & 0 deletions content/networking/domains.md
@@ -64,6 +64,29 @@ Wildcard domains require **DNS-01** verification, so you'll also need to add a
verify with HTTP-01 and don't need that record.
{{< /callout >}}

## When a certificate can't be issued

A verified domain normally gets its TLS certificate within a minute or two. Once
in a while issuance keeps failing — Let's Encrypt is rate-limiting the account, a
`CAA` record blocks Let's Encrypt, or (for wildcards) the `_acme-challenge` CNAME
isn't in place — and the certificate stays **issuing** without ever completing.

If a certificate stays unissued for **more than 24 hours**, the platform reclaims
it: the stale request is removed and the domain flips to **error**. This stops a
permanently failing request from burning Let's Encrypt quota and surfaces the
problem instead of leaving the domain silently without HTTPS.

To recover, fix the underlying cause: clear the `CAA` restriction, add the
`_acme-challenge` CNAME the console shows, or wait out a Let's Encrypt rate
limit. The platform then re-requests the certificate automatically, and the
domain returns to **active** once the certificate issues.

{{< callout type="note" >}}
After a reclaim the platform keeps retrying about once a day, so a domain that
becomes issuable later (a rate limit clears, a missing record is added) recovers
on its own; you don't need to re-create it.
{{< /callout >}}
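
The reclaim-and-retry cycle described in this section can be sketched as a small decision function. The status names (**issuing**, **error**, **active**) and the 24-hour and roughly daily intervals come from the text; the function, its parameters, and the returned action names are hypothetical and do not describe the platform's actual code.

```python
from datetime import datetime, timedelta, timezone

RECLAIM_AFTER = timedelta(hours=24)  # "more than 24 hours" unissued
RETRY_INTERVAL = timedelta(days=1)   # "about once a day" after a reclaim


def next_action(status: str, requested_at: datetime,
                last_attempt_at: datetime, now: datetime) -> str:
    """Return "reclaim", "retry", "wait", or "none" for a domain's certificate."""
    if status == "issuing":
        # A request that has been pending too long is removed and the domain errors.
        return "reclaim" if now - requested_at > RECLAIM_AFTER else "wait"
    if status == "error":
        # A reclaimed domain is re-requested periodically so it can recover on its own.
        return "retry" if now - last_attempt_at >= RETRY_INTERVAL else "wait"
    return "none"  # "active": the certificate has issued


now = datetime.now(timezone.utc)
stuck_since = now - timedelta(hours=30)
print(next_action("issuing", stuck_since, stuck_since, now))
```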

## Routing traffic

Creating a domain alone doesn't send any traffic to a deployment — you still