Context
Databricks Lakehouse Sync and other logical-replication-based CDC tools require PostgreSQL tables to have `REPLICA IDENTITY FULL` set. Without it, Lakehouse Sync silently skips the table — no error, just missing data downstream:
"Tables without REPLICA IDENTITY FULL will be skipped. Run ALTER TABLE ... REPLICA IDENTITY FULL to include them."
This is mandatory for Lakehouse Sync — there is no alternative. This issue requests a generic, opt-in mechanism in `datajoint-python` so Postgres-on-Lakebase deployments and other CDC-driven setups can configure it without per-table manual ALTER statements.
What REPLICA IDENTITY FULL does
PostgreSQL's WAL records every change. `REPLICA IDENTITY` controls how much of the old row is logged for UPDATE and DELETE:
| Setting |
What's logged (old row) |
WAL cost |
| DEFAULT |
Primary key columns only |
Minimal |
| FULL |
Entire old row, all columns |
Higher for UPDATE/DELETE on TOASTed columns |
Key facts:
- INSERTs are unaffected — full row always logged regardless of setting
- UPDATEs log the full old row including TOAST (bytea) columns even if unchanged
- The ALTER is metadata-only, instant, idempotent, can be applied at any time
- This is PostgreSQL-only — MySQL has no equivalent (MySQL uses binlog-based CDC)
For DataJoint's overwhelmingly insert-append workload, the real-world cost of FULL is negligible. The one cost scenario is bulk `delete()` on tables with `` columns (e.g., 100 rows × 10 MB blobs ≈ 1 GB WAL burst). Transient and infrequent.
Proposed design
Add a config option `database.replica_identity` with allowed values `"default"` (default) and `"full"`. Environment variable: `DJ_REPLICA_IDENTITY`.
Two mechanisms:
1. New tables — applied automatically via `post_ddl` during `declare()`
ALTER TABLE \"schema_name\".\"table_name\" REPLICA IDENTITY FULL;
2. Existing tables — utility method
schema.set_replica_identity(\"full\") # iterates schema's tables, runs ALTER for each
Reason a utility is needed: when `@schema` decorates a class for an already-existing table, `declare()` is not called. `post_ddl` only fires on table creation. Users who add the config after their pipeline exists need a separate mechanism.
MySQL behavior
If the config is set on a MySQL connection, emit a `UserWarning` consistent with the existing `database.name` warning pattern (PR #1426). REPLICA IDENTITY is a PostgreSQL concept; setting it on MySQL is a no-op.
Compliance and security considerations
Under DEFAULT, only PK values appear in WAL. Under FULL, entire rows — including any PHI, PII, or sensitive data — appear in:
- WAL segments on disk (transient, recycled after checkpoint)
- Logical replication streams
- Anything subscribed to those streams
For self-hosted PostgreSQL, this is a real consideration. For managed PostgreSQL with controlled WAL access (Lakebase, RDS with logical replication only to a specific subscriber), the WAL stays inside the managed environment's security boundary.
This is why the option must be opt-in, not on-by-default.
Why a generic upstream feature, not Lakebase-specific
CDC consumers besides Lakehouse Sync also benefit (Debezium and ClickHouse ClickPipes can use FULL when PKs aren't sufficient; Azure CDC; logical replication to other Postgres). Keeping the config generic lets any Postgres-based deployment configure it without backend-specific code paths.
The Lakebase-specific decision ("Lakebase deployments default to FULL") lives in the deployment configuration of the consumer, not in this upstream code. Reference: DataJoint × Databricks DECISIONS.md — internal repo, summary available on request.
Industry context
| CDC Tool |
FULL Required? |
Recommendation |
| Debezium |
No (for tables with PKs) |
DEFAULT preferred |
| Azure CDC |
No |
Recommends against FULL |
| ClickHouse ClickPipes |
No (if PK exists) |
DEFAULT preferred |
| Databricks Lakehouse Sync |
Yes, mandatory |
No alternative |
Most CDC tools prefer DEFAULT for performance. Databricks mandates FULL for SCD Type 2 history. The opt-in design serves all of them.
References
Context
Databricks Lakehouse Sync and other logical-replication-based CDC tools require PostgreSQL tables to have `REPLICA IDENTITY FULL` set. Without it, Lakehouse Sync silently skips the table — no error, just missing data downstream:
This is mandatory for Lakehouse Sync — there is no alternative. This issue requests a generic, opt-in mechanism in `datajoint-python` so Postgres-on-Lakebase deployments and other CDC-driven setups can configure it without per-table manual ALTER statements.
What REPLICA IDENTITY FULL does
PostgreSQL's WAL records every change. `REPLICA IDENTITY` controls how much of the old row is logged for UPDATE and DELETE:
Key facts:
For DataJoint's overwhelmingly insert-append workload, the real-world cost of FULL is negligible. The one cost scenario is bulk `delete()` on tables with `` columns (e.g., 100 rows × 10 MB blobs ≈ 1 GB WAL burst). Transient and infrequent.
Proposed design
Add a config option `database.replica_identity` with allowed values `"default"` (default) and `"full"`. Environment variable: `DJ_REPLICA_IDENTITY`.
Two mechanisms:
1. New tables — applied automatically via `post_ddl` during `declare()`
2. Existing tables — utility method
Reason a utility is needed: when `@schema` decorates a class for an already-existing table, `declare()` is not called. `post_ddl` only fires on table creation. Users who add the config after their pipeline exists need a separate mechanism.
MySQL behavior
If the config is set on a MySQL connection, emit a `UserWarning` consistent with the existing `database.name` warning pattern (PR #1426). REPLICA IDENTITY is a PostgreSQL concept; setting it on MySQL is a no-op.
Compliance and security considerations
Under DEFAULT, only PK values appear in WAL. Under FULL, entire rows — including any PHI, PII, or sensitive data — appear in:
For self-hosted PostgreSQL, this is a real consideration. For managed PostgreSQL with controlled WAL access (Lakebase, RDS with logical replication only to a specific subscriber), the WAL stays inside the managed environment's security boundary.
This is why the option must be opt-in, not on-by-default.
Why a generic upstream feature, not Lakebase-specific
CDC consumers besides Lakehouse Sync also benefit (Debezium and ClickHouse ClickPipes can use FULL when PKs aren't sufficient; Azure CDC; logical replication to other Postgres). Keeping the config generic lets any Postgres-based deployment configure it without backend-specific code paths.
The Lakebase-specific decision ("Lakebase deployments default to FULL") lives in the deployment configuration of the consumer, not in this upstream code. Reference: DataJoint × Databricks DECISIONS.md — internal repo, summary available on request.
Industry context
Most CDC tools prefer DEFAULT for performance. Databricks mandates FULL for SCD Type 2 history. The opt-in design serves all of them.
References