Add config option for REPLICA IDENTITY FULL on PostgreSQL tables

## Context

Databricks Lakehouse Sync and other logical-replication-based CDC tools require PostgreSQL tables to have \`REPLICA IDENTITY FULL\` set. Without it, Lakehouse Sync silently skips the table — no error, just missing data downstream:

> \"Tables without REPLICA IDENTITY FULL will be skipped. Run ALTER TABLE ... REPLICA IDENTITY FULL to include them.\"

This is mandatory for Lakehouse Sync — there is no alternative. This issue requests a generic, opt-in mechanism in \`datajoint-python\` so Postgres-on-Lakebase deployments and other CDC-driven setups can configure it without per-table manual ALTER statements.

## What REPLICA IDENTITY FULL does

PostgreSQL's WAL records every change. \`REPLICA IDENTITY\` controls how much of the **old row** is logged for UPDATE and DELETE:

| Setting | What's logged (old row) | WAL cost |
|---------|------------------------|----------|
| **DEFAULT** | Primary key columns only | Minimal |
| **FULL** | Entire old row, all columns | Higher for UPDATE/DELETE on TOASTed columns |

Key facts:
- INSERTs are unaffected — full row always logged regardless of setting
- UPDATEs log the full old row including TOAST (bytea) columns even if unchanged
- The ALTER is **metadata-only, instant, idempotent**, can be applied at any time
- This is **PostgreSQL-only** — MySQL has no equivalent (MySQL uses binlog-based CDC)

For DataJoint's overwhelmingly insert-append workload, the real-world cost of FULL is negligible. The one cost scenario is bulk \`delete()\` on tables with \`<blob>\` columns (e.g., 100 rows × 10 MB blobs ≈ 1 GB WAL burst). Transient and infrequent.

## Proposed design

Add a config option \`database.replica_identity\` with allowed values \`\"default\"\` (default) and \`\"full\"\`. Environment variable: \`DJ_REPLICA_IDENTITY\`.

Two mechanisms:

### 1. New tables — applied automatically via \`post_ddl\` during \`declare()\`

```sql
ALTER TABLE \"schema_name\".\"table_name\" REPLICA IDENTITY FULL;
```

### 2. Existing tables — utility method

```python
schema.set_replica_identity(\"full\")  # iterates schema's tables, runs ALTER for each
```

Reason a utility is needed: when \`@schema\` decorates a class for an already-existing table, \`declare()\` is **not** called. \`post_ddl\` only fires on table creation. Users who add the config after their pipeline exists need a separate mechanism.

### MySQL behavior

If the config is set on a MySQL connection, emit a \`UserWarning\` consistent with the existing \`database.name\` warning pattern (PR #1426). REPLICA IDENTITY is a PostgreSQL concept; setting it on MySQL is a no-op.

## Compliance and security considerations

Under DEFAULT, only PK values appear in WAL. Under FULL, entire rows — including any PHI, PII, or sensitive data — appear in:
- WAL segments on disk (transient, recycled after checkpoint)
- Logical replication streams
- Anything subscribed to those streams

For self-hosted PostgreSQL, this is a real consideration. For managed PostgreSQL with controlled WAL access (Lakebase, RDS with logical replication only to a specific subscriber), the WAL stays inside the managed environment's security boundary.

This is why the option must be **opt-in**, not on-by-default.

## Why a generic upstream feature, not Lakebase-specific

CDC consumers besides Lakehouse Sync also benefit (Debezium and ClickHouse ClickPipes can use FULL when PKs aren't sufficient; Azure CDC; logical replication to other Postgres). Keeping the config generic lets any Postgres-based deployment configure it without backend-specific code paths.

The Lakebase-specific decision (\"Lakebase deployments default to FULL\") lives in the deployment configuration of the consumer, not in this upstream code. Reference: [DataJoint × Databricks DECISIONS.md](https://github.com/datajoint-company/datajoint-databricks/blob/main/DECISIONS.md) — internal repo, summary available on request.

## Industry context

| CDC Tool | FULL Required? | Recommendation |
|----------|---------------|----------------|
| Debezium | No (for tables with PKs) | DEFAULT preferred |
| Azure CDC | No | Recommends against FULL |
| ClickHouse ClickPipes | No (if PK exists) | DEFAULT preferred |
| **Databricks Lakehouse Sync** | **Yes, mandatory** | No alternative |

Most CDC tools prefer DEFAULT for performance. Databricks mandates FULL for SCD Type 2 history. The opt-in design serves all of them.

## References

- [PostgreSQL: Logical Replication — Replica Identity](https://www.postgresql.org/docs/current/logical-replication-publication.html)
- [Databricks: Lakehouse Sync](https://docs.databricks.com/aws/en/oltp/projects/lakehouse-sync)
- [Xata: Performance impact of REPLICA IDENTITY FULL](https://xata.io/blog/replica-identity-full-performance)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add config option for REPLICA IDENTITY FULL on PostgreSQL tables #1447

Context

What REPLICA IDENTITY FULL does

Proposed design

1. New tables — applied automatically via `post_ddl` during `declare()`

2. Existing tables — utility method

MySQL behavior

Compliance and security considerations

Why a generic upstream feature, not Lakebase-specific

Industry context

References

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Setting	What's logged (old row)	WAL cost
DEFAULT	Primary key columns only	Minimal
FULL	Entire old row, all columns	Higher for UPDATE/DELETE on TOASTed columns

CDC Tool	FULL Required?	Recommendation
Debezium	No (for tables with PKs)	DEFAULT preferred
Azure CDC	No	Recommends against FULL
ClickHouse ClickPipes	No (if PK exists)	DEFAULT preferred
Databricks Lakehouse Sync	Yes, mandatory	No alternative

Add config option for REPLICA IDENTITY FULL on PostgreSQL tables #1447

Description

Context

What REPLICA IDENTITY FULL does

Proposed design

1. New tables — applied automatically via `post_ddl` during `declare()`

2. Existing tables — utility method

MySQL behavior

Compliance and security considerations

Why a generic upstream feature, not Lakebase-specific

Industry context

References

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions