A minimal setup using Python, PostgreSQL, and DuckLake that runs locally and in CI without taxing your budget.
Motivation
The widespread adoption of cloud vendors has led to easy access to scalable infrastructure and the accompanying services that manage your needs. One such need might be data replication, in order to protect your production database from analytical queries. On AWS, one such setup could look as follows: spin up an AWS DMS instance to replicate your WAL to AWS S3, then use Snowflake’s Snowpipe to automatically ingest the data into Snowflake tables as your bronze layer. From there, you start your analytics journey.
Each step of this pipeline can probably be replaced with the corresponding technology of your favorite cloud vendor. Two things about this setup trouble me a bit:
- The amount of money per unit of data that’s needed to get this setup rolling for a small team.
- You have to pay AWS and Snowflake all over again if you’d like to do even minimal testing alongside your setup.
With the advent of DuckLake, there finally seems to be a trivial alternative on the horizon: an analogous setup that runs during local development and in your CI. So I’ve tried to lay out a minimal design that gives up on streaming and switches to what I would call “mini-batches”. The idea is as follows: use trigger-based CDC in PostgreSQL and write those changes in mini-batches into DuckLake. This should be largely language agnostic; in the companion repository, I’ve used Python though. So, let’s go!
Outline
- ⚖️ Weighing Up the Alternatives
- 🐘 Trigger-based CDC in PostgreSQL
- 🦆 Efficient Transfer of Changes to DuckLake
- 🤔 Closing Thoughts
⚖️ Weighing Up the Alternatives
Choosing is about alternatives, so let’s quickly throw out some buzzwords and put them into context.
Alternative Replication Mechanisms
Let’s start with the elephant in the room and ask which alternative approaches to trigger-based CDC exist in PostgreSQL:
- Logical replication is based on the write-ahead log (WAL) and enables replication via tools like Debezium. It requires additional, permanently allocated CPU and RAM, though. The outbox pattern advertised around it feels more fitting for microservice architectures, and of course for cases where we actually need streaming. On top of that, logical replication slots in PostgreSQL can be a beast, and I’d rather shy away from the resulting complexity.
- Query-based CDC requires you to add the CDC logic to your application by juggling the start and end timestamps of each row version. But even after deleting the versions that have ended, their tuples remain in the table’s pages and bloat the table until vacuum picks them up. Additionally, this requires an index on the timestamps and permanently increases our overall table size.
A quick search turns up others, but “table differencing” or “read and parse the WAL yourself” sound like alternatives for special needs.
To me, the very idea of query-based CDC is a motivation for the trigger-based one, which works passively in the background without bloating your application logic or your tables. Still, it leaves you with the same guarantee: being able to replicate every state your data has ever been in.
Alternative Data Lake Formats
Without being in this business at all, there seem to be two widely used formats: Delta Lake and Apache Iceberg. There are tons of articles comparing the two.
From my perspective, the main issue is that I’d be forced to switch the compute engine: while Delta Lake relies on Spark, Apache Iceberg gives me a bit more choice, but would still massively bloat my CI pipeline with the added dependencies and setup.
Remember, the goal here is a setup that has a small footprint, is easy and fast to spin up, and is cloud agnostic.
The Emergence of DuckLake
Less than two weeks ago, DuckLake was launched. To me, it feels like they’ve managed to hit the same sweet spot they already hit with DuckDB itself: a simple setup built on the stable pillars of existing databases. It also looks a bit like they’re calling Snowflake’s and Databricks’ bluff of selling data catalogs to people. That said, I fully acknowledge that those platforms offer a myriad of additional features that aren’t feasible with a trivial DuckLake setup. At the same time, I’d guess that not everyone is in need of their well-priced features.
At the time of writing, I couldn’t spot many blog posts that weren’t marketing material, trivial re-runs of the blog post accompanying the release, or largely AI generated. One that stood out to me, though it covers other topics as well, was:
- Salesforce’s $8B Informatica Buy and DuckDB’s DuckLake Could Reshape Data Engineering | Here’s What to Watch: the section “Open Source Strikes Back” is, in my perception, very much on point.
🐘 Trigger-based CDC in PostgreSQL
From my perspective, the goal should be the most trivial setup possible in your production database, pushing the complexity to your analytics team. Thus, I’ve set up a single table to collect the insert, update, and delete changes for my tables of interest:
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE SCHEMA IF NOT EXISTS audit;
CREATE UNLOGGED TABLE IF NOT EXISTS audit.cdc (
    cdc_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    cdc_ts TIMESTAMP NOT NULL DEFAULT (now() AT TIME ZONE 'UTC'),
    cdc_op CHAR(1) NOT NULL,
    table_schema TEXT NOT NULL,
    table_name TEXT NOT NULL,
    data JSONB NOT NULL
);
Note that, as soon as PostgreSQL 18 is released, we can switch to uuidv7 here and drop the cdc_ts column, but that time has not yet come. Otherwise, we just crunch all our row content into a single data JSONB value. We don’t care about storage efficiency here, as the sole purpose of this table is to be dumped into the data lake. I’m not perfectly sure about this choice, and it’s probably debatable: at least, I’ve found some blog posts that explicitly spell out the columns instead. From my perspective, we don’t want to get our hands dirty with schema evolution at this stage.
The important part here is to create this as an UNLOGGED table, in order to not bloat the WAL and to reduce the footprint on our transactions a bit.
Now we need two more ingredients: a function that inserts into our cdc table, and the corresponding triggers. Thanks to using a single table with a single data column, we only need a single function:
CREATE OR REPLACE FUNCTION audit.record_change()
RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO audit.cdc (table_schema, table_name, cdc_op, data)
        VALUES (TG_TABLE_SCHEMA, TG_TABLE_NAME, 'D', to_jsonb(OLD));
    ELSE
        INSERT INTO audit.cdc (table_schema, table_name, cdc_op, data)
        VALUES (TG_TABLE_SCHEMA, TG_TABLE_NAME, substr(TG_OP, 1, 1), to_jsonb(NEW));
    END IF;
    RETURN NULL;
END;
$$;
The more critical part is to generate the corresponding trigger each time the schema is updated or you want to add new tables to your change data capture mechanism. I’ll leave that to you, but the generic procedure looks like this:
DO $$
DECLARE
    tbl RECORD;
BEGIN
    FOR tbl IN
        SELECT table_schema, table_name
        FROM information_schema.tables
        WHERE table_schema = 'public'  -- restrict to schema of choice.
          AND table_name = 'dummy'     -- restrict to tables of choice.
          AND table_type = 'BASE TABLE'
    LOOP
        BEGIN
            EXECUTE format(
                -- 1) %1$I is the full trigger name (including prefix), fully quoted
                -- 2) %2$I is the schema
                -- 3) %3$I is the table name
                'CREATE TRIGGER %1$I
                 AFTER INSERT OR UPDATE OR DELETE ON %2$I.%3$I
                 FOR EACH ROW
                 EXECUTE FUNCTION audit.record_change();',
                concat('trg_audit_', tbl.table_name),
                tbl.table_schema,
                tbl.table_name
            );
        EXCEPTION
            WHEN duplicate_object THEN
                NULL;  -- trigger already exists, skip
        END;
    END LOOP;
END;
$$;
With that, our PostgreSQL audit schema is in place. You might have already spotted that my code example only uses a single table, public.dummy, but in the end this should be flexible enough to extend to any set of tables. DuckLake’s transactions will allow us to replicate the changes of our PostgreSQL tables in a single transaction.
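To see the triggers in action, a quick smoke test could look like the sketch below. It assumes a hypothetical public.dummy(id, payload) table with the audit objects from above already applied, placeholder connection parameters, and psycopg as the driver; after one insert, one update, and one delete, audit.cdc should contain three rows with the operations I, U, and D.

import psycopg  # assumption: psycopg 3 as the driver; psycopg2 works analogously

# Placeholder connection parameters and a hypothetical public.dummy(id, payload) table,
# with the audit schema, function, and trigger from above already in place.
PG_DSN = "host=localhost dbname=postgres user=postgres password=postgres"

with psycopg.connect(PG_DSN) as conn, conn.cursor() as cur:
    cur.execute("INSERT INTO public.dummy (id, payload) VALUES (1, 'hello')")
    cur.execute("UPDATE public.dummy SET payload = 'world' WHERE id = 1")
    cur.execute("DELETE FROM public.dummy WHERE id = 1")

    # Each statement above should have produced one row in the capture table.
    cur.execute("SELECT cdc_op, data FROM audit.cdc ORDER BY cdc_ts")
    for cdc_op, data in cur.fetchall():
        print(cdc_op, data)  # expected ops: I, U, D; data holds the full row as JSON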
🦆 Efficient Transfer of Changes to DuckLake
We’ve already got our PostgreSQL database in place, so we’ll use it as the catalog database for our DuckLake. Since my intention is to use this setup in CI, we’ll store the resulting Parquet files on local disk:
INSTALL ducklake;
ATTACH 'ducklake:postgres:...' AS dl (DATA_PATH './dl');
The … represents the PostgreSQL connection parameters; I’ve left them out here to reduce bloat, but a spelled-out sketch follows below. Additionally, we need a single table in our DuckLake to host the captured PostgreSQL changes:
CREATE TABLE dl.cdc (
    dl_ts DATETIME NOT NULL,
    cdc_uuid UUID NOT NULL,
    cdc_ts DATETIME NOT NULL,
    cdc_op CHAR(1) NOT NULL,
    table_schema TEXT NOT NULL,
    table_name TEXT NOT NULL,
    data JSON NOT NULL
);
Nothing special here: I’ve mirrored the types and added a dl_ts column, so we know the time of ingestion next to the time of capture recorded in cdc_ts.
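For reference, a fully spelled-out attach from Python could look like the sketch below. All connection parameters are placeholders, and which extensions have to be installed explicitly may vary; the separately attached pg database is what the transfer in the next step reads from.

import duckdb

# Placeholder connection string for the catalog database; replace with your own parameters.
PG_DSN = "host=localhost dbname=postgres user=postgres password=postgres"

con = duckdb.connect()
con.execute("INSTALL ducklake;")
con.execute("LOAD ducklake;")
con.execute("INSTALL postgres;")
con.execute("LOAD postgres;")

# DuckLake with PostgreSQL as the catalog and local Parquet storage.
con.execute(f"ATTACH 'ducklake:postgres:{PG_DSN}' AS dl (DATA_PATH './dl')")

# The source database itself, attached as pg, so that audit.cdc_old can be read later on.
con.execute(f"ATTACH '{PG_DSN}' AS pg (TYPE postgres)")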
In the next step, we’ll take care of moving the changes into DuckLake. The important part here is to avoid fetching data twice and to free the storage of data that has already been captured. To achieve this, we use a single transaction in PostgreSQL that copies the table definition and swaps out the table the triggers write to:
BEGIN;
CREATE TABLE audit.cdc_new (LIKE audit.cdc INCLUDING ALL);
ALTER TABLE audit.cdc RENAME TO cdc_old;
ALTER TABLE audit.cdc_new RENAME TO cdc;
COMMIT;
For PostgreSQL, that’s basically a no-op, independent of your data’s size. Now we’ve isolated our current batch into audit.cdc_old, which can be transferred and dropped from within our DuckDB session, assuming that we’ve attached our PostgreSQL database as pg:
INSERT INTO dl.cdc (
dl_ts, cdc_uuid, cdc_ts, cdc_op, table_schema, table_name, data
)
SELECT
now(), cdc_uuid, cdc_ts, cdc_op, table_schema, table_name, data
FROM pg.audit.cdc_old;
DROP TABLE pg.audit.cdc_old;
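In Python, the glue for one such mini-batch could look like the sketch below, assuming the DuckDB connection has dl and pg attached as shown earlier; the connection string is again a placeholder.

import duckdb
import psycopg  # assumption: psycopg 3; any PostgreSQL driver works here

PG_DSN = "host=localhost dbname=postgres user=postgres password=postgres"  # placeholder

def transfer_mini_batch(con: duckdb.DuckDBPyConnection) -> None:
    # 1) Rotate the capture table inside a single PostgreSQL transaction.
    with psycopg.connect(PG_DSN) as conn, conn.cursor() as cur:
        cur.execute("CREATE TABLE audit.cdc_new (LIKE audit.cdc INCLUDING ALL)")
        cur.execute("ALTER TABLE audit.cdc RENAME TO cdc_old")
        cur.execute("ALTER TABLE audit.cdc_new RENAME TO cdc")

    # 2) Copy the isolated batch into DuckLake, then drop it from PostgreSQL.
    con.execute("""
        INSERT INTO dl.cdc (dl_ts, cdc_uuid, cdc_ts, cdc_op, table_schema, table_name, data)
        SELECT now(), cdc_uuid, cdc_ts, cdc_op, table_schema, table_name, data
        FROM pg.audit.cdc_old
    """)
    con.execute("DROP TABLE pg.audit.cdc_old")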
Et voilà, we’re done for today 🙃
🤔 Closing Thoughts
- First of all, I’m regularly annoyed by tutorials that present snippets that don’t work without some glue in between, so I’ve uploaded a main.py to be fully transparent about what is hidden here and to provide a template: https://gitlab.com/jawerg/mini-cdc-dl
- I’ve mentioned CI compatibility multiple times. I’ve used the testcontainers library, which reduces the packages needed to run the full example code to testcontainers-postgres and duckdb (see the sketch after this list). This should allow us to test our mechanisms for schema evolution and to test further processing in DuckLake itself.
- I feel like it’s time to bring some more software engineering to data engineering. We need to find a way to build our own software on top of infrastructure that can be employed in unit tests, leaving out the cloud vendors. But maybe I’m just missing how to do this right.
- The PostgreSQL modifications presented here are compatible with Alembic, and thus, I’d guess, with other migration frameworks as well.
- I think there’s no merit in discussing the shortcomings of the DuckLake format right now, as it has just been released. Time will tell; the same goes for performance. We have to recognize that not every one of us has petabytes of data, and if you’re paying $200,000+ per year just to store yours, there’s probably money for a few hires as well to build you a tailored solution. We still need a trivial default, though, as well as starting points.
- Reading the existing blog posts, I felt there is too little reproducible tinkering and too much DevRel/marketing material or trivial replication of the docs. I wonder how we could improve here. Not everyone has to spawn a package, but presenting use cases in a minimal setup should become feasible with something like the setup presented here. Maybe it was just too hard to do so until now. At least, that’s my hope.
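As referenced in the CI bullet above, here is a minimal sketch of how the two packages could fit together in a test. It assumes a local Docker daemon and that the attribute and method names match recent testcontainers-python releases.

import duckdb
from testcontainers.postgres import PostgresContainer

# Spin up a throwaway PostgreSQL instance for the test run.
with PostgresContainer("postgres:16") as pg_container:
    # Assumption: these attributes/methods exist in recent testcontainers-python releases.
    dsn = (
        f"host={pg_container.get_container_host_ip()} "
        f"port={pg_container.get_exposed_port(5432)} "
        f"dbname={pg_container.dbname} "
        f"user={pg_container.username} "
        f"password={pg_container.password}"
    )

    con = duckdb.connect()
    con.execute("INSTALL ducklake;")
    con.execute("LOAD ducklake;")
    con.execute(f"ATTACH 'ducklake:postgres:{dsn}' AS dl (DATA_PATH './dl')")

    # From here on: apply the audit DDL, exercise the triggers, and run the transfer
    # from the previous section to assert on the contents of dl.cdc.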
I’m happy to receive feedback on my approach, or on any blind spots and misunderstandings I might have here. Thanks for your time and attention!