Skip to content

Add config option for REPLICA IDENTITY FULL on PostgreSQL tables #1447

@dimitri-yatsenko

Description

@dimitri-yatsenko

Context

Databricks Lakehouse Sync and other logical-replication-based CDC tools require PostgreSQL tables to have `REPLICA IDENTITY FULL` set. Without it, Lakehouse Sync silently skips the table — no error, just missing data downstream:

"Tables without REPLICA IDENTITY FULL will be skipped. Run ALTER TABLE ... REPLICA IDENTITY FULL to include them."

This is mandatory for Lakehouse Sync — there is no alternative. This issue requests a generic, opt-in mechanism in `datajoint-python` so Postgres-on-Lakebase deployments and other CDC-driven setups can configure it without per-table manual ALTER statements.

What REPLICA IDENTITY FULL does

PostgreSQL's WAL records every change. `REPLICA IDENTITY` controls how much of the old row is logged for UPDATE and DELETE:

Setting What's logged (old row) WAL cost
DEFAULT Primary key columns only Minimal
FULL Entire old row, all columns Higher for UPDATE/DELETE on TOASTed columns

Key facts:

  • INSERTs are unaffected — full row always logged regardless of setting
  • UPDATEs log the full old row including TOAST (bytea) columns even if unchanged
  • The ALTER is metadata-only, instant, idempotent, can be applied at any time
  • This is PostgreSQL-only — MySQL has no equivalent (MySQL uses binlog-based CDC)

For DataJoint's overwhelmingly insert-append workload, the real-world cost of FULL is negligible. The one cost scenario is bulk `delete()` on tables with `` columns (e.g., 100 rows × 10 MB blobs ≈ 1 GB WAL burst). Transient and infrequent.

Proposed design

Add a config option `database.replica_identity` with allowed values `"default"` (default) and `"full"`. Environment variable: `DJ_REPLICA_IDENTITY`.

Two mechanisms:

1. New tables — applied automatically via `post_ddl` during `declare()`

ALTER TABLE \"schema_name\".\"table_name\" REPLICA IDENTITY FULL;

2. Existing tables — utility method

schema.set_replica_identity(\"full\")  # iterates schema's tables, runs ALTER for each

Reason a utility is needed: when `@schema` decorates a class for an already-existing table, `declare()` is not called. `post_ddl` only fires on table creation. Users who add the config after their pipeline exists need a separate mechanism.

MySQL behavior

If the config is set on a MySQL connection, emit a `UserWarning` consistent with the existing `database.name` warning pattern (PR #1426). REPLICA IDENTITY is a PostgreSQL concept; setting it on MySQL is a no-op.

Compliance and security considerations

Under DEFAULT, only PK values appear in WAL. Under FULL, entire rows — including any PHI, PII, or sensitive data — appear in:

  • WAL segments on disk (transient, recycled after checkpoint)
  • Logical replication streams
  • Anything subscribed to those streams

For self-hosted PostgreSQL, this is a real consideration. For managed PostgreSQL with controlled WAL access (Lakebase, RDS with logical replication only to a specific subscriber), the WAL stays inside the managed environment's security boundary.

This is why the option must be opt-in, not on-by-default.

Why a generic upstream feature, not Lakebase-specific

CDC consumers besides Lakehouse Sync also benefit (Debezium and ClickHouse ClickPipes can use FULL when PKs aren't sufficient; Azure CDC; logical replication to other Postgres). Keeping the config generic lets any Postgres-based deployment configure it without backend-specific code paths.

The Lakebase-specific decision ("Lakebase deployments default to FULL") lives in the deployment configuration of the consumer, not in this upstream code. Reference: DataJoint × Databricks DECISIONS.md — internal repo, summary available on request.

Industry context

CDC Tool FULL Required? Recommendation
Debezium No (for tables with PKs) DEFAULT preferred
Azure CDC No Recommends against FULL
ClickHouse ClickPipes No (if PK exists) DEFAULT preferred
Databricks Lakehouse Sync Yes, mandatory No alternative

Most CDC tools prefer DEFAULT for performance. Databricks mandates FULL for SCD Type 2 history. The opt-in design serves all of them.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions