Data Plus Architecture: Integrating Data into Design
I’ve watched three organizations try to migrate their data at scale this past year. All three started the same way — someone said “we need better analytics.” All three hit the same wall: the architecture wasn’t built for data flow, so extracting anything meant duct-taping brittle pipelines onto systems that fought back every time you queried them.
This is how it always goes. Data gets treated like exhaust — a byproduct of the “real” work of building apps. You ship features, hit deadlines, and then someone from analytics shows up asking for a warehouse. That’s when you realize your microservices scattered data across a dozen schemas, and cross-service queries are now a nightmare you can’t wake up from.
The integration pattern landscape #
Let me untangle the terminology, because everyone’s using these acronyms differently.
ETL (Extract, Transform, Load) is the old way. Pull from sources, transform into some analytical schema, load into a warehouse. Batch jobs, usually nightly. The catch? Your transforms are handcuffed to both the source schema and the destination schema. Change either one, and something breaks.
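To see why that coupling bites, here's a toy nightly ETL job in Python. Every table and field name is hypothetical; the point is that the transform hard-codes both the source schema and the warehouse schema, so a change to either side breaks the pipeline:

```python
# Toy nightly ETL job: extract -> transform -> load in one pass.
# All field names are hypothetical; note how the transform is
# handcuffed to BOTH the source and the destination schemas.

def extract(source_rows):
    # Tied to the source schema: expects "ord_id", "cust", "amt_cents".
    return [r for r in source_rows if r.get("ord_id") is not None]

def transform(rows):
    # Tied to both schemas: reads source names, writes warehouse names.
    return [
        {
            "order_id": r["ord_id"],
            "customer_id": r["cust"],
            "total_amount": r["amt_cents"] / 100,  # cents -> currency units
        }
        for r in rows
    ]

def load(warehouse, rows):
    # Tied to the destination schema: the warehouse table must match.
    warehouse.setdefault("orders", []).extend(rows)

warehouse = {}
source = [
    {"ord_id": 1, "cust": "c-42", "amt_cents": 1999},
    {"ord_id": None, "cust": "c-7", "amt_cents": 500},  # dropped by extract
]
load(warehouse, transform(extract(source)))
print(warehouse["orders"])  # one clean row reaches the warehouse
```

Rename `amt_cents` at the source, or `total_amount` at the destination, and this job fails at 2 AM. That fragility is the "something breaks" in practice.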
ELT (Extract, Load, Transform) flips the script. Dump raw data into the warehouse first, transform it there. This only became practical when BigQuery and Snowflake made storage cheap and compute elastic. Now dbt handles the “T” — you write SQL transformations as version-controlled models, which feels almost civilized after the ETL era.
CDC (Change Data Capture) is the one I’m actually excited about. Instead of batch extracts, you stream changes straight from the database’s transaction log in real time. Debezium on Kafka Connect is the usual stack. Your warehouse stays current within seconds, not hours. It’s not magic — just better engineering.
```sql
-- dbt model example: transform raw CDC events into a clean orders table
SELECT
    payload.after.id AS order_id,
    payload.after.customer_id,
    payload.after.total_amount,
    payload.after.status,
    payload.ts_ms AS event_timestamp
FROM {{ source('cdc', 'orders_events') }}
WHERE payload.op IN ('c', 'u') -- creates and updates only
```

The shift from ETL to ELT to CDC isn’t just technical fashion. Expectations changed. Business stakeholders don’t accept day-old data anymore — they want dashboards that refresh in real time, alerts that fire immediately, operational analytics that feed back into the app itself. Batch processing can’t touch that.
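To make that concrete, here's a minimal Python sketch of folding CDC change events into a current-state view. The envelope fields (`op`, `before`, `after`, `ts_ms`) follow Debezium's event format; the consumer itself is a hypothetical stand-in for whatever actually sits on the Kafka topic:

```python
# Fold Debezium-style change events into a current-state view.
# Envelope fields ("op", "before", "after", "ts_ms") follow Debezium's
# format; the consumer logic is a hypothetical sketch.

def apply_event(state, event):
    op = event["op"]
    if op in ("c", "u", "r"):      # create, update, snapshot read
        row = event["after"]
        state[row["id"]] = row
    elif op == "d":                # delete: drop the row by its old key
        state.pop(event["before"]["id"], None)
    return state

events = [
    {"op": "c", "before": None,
     "after": {"id": 1, "status": "placed"}, "ts_ms": 1000},
    {"op": "u", "before": {"id": 1, "status": "placed"},
     "after": {"id": 1, "status": "shipped"}, "ts_ms": 2000},
    {"op": "d", "before": {"id": 1, "status": "shipped"},
     "after": None, "ts_ms": 3000},
]

state = {}
for e in events:
    apply_event(state, e)
print(state)  # create and update are cancelled by the delete
```

Each event carries enough context to replay history in order, which is exactly why a CDC-fed warehouse can stay seconds behind the source instead of a day.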
Data mesh: decentralize ownership #
Zhamak Dehghani published her data mesh principles in late 2020, and the idea’s been picking up steam through 2021. Her core argument: centralized data teams don’t scale. When one team owns the warehouse, they become a bottleneck for every analytical question in the company.
Data mesh says treat data as a product, owned by the domain teams that produce it. Payments team owns payments data. Inventory team owns inventory data. Each team publishes documented, discoverable interfaces — same as how you’d publish an API.
Four principles: domain ownership, data as a product, self-serve data platform, and federated computational governance.
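What "data as a product" might look like as an actual published artifact is a contract the domain team commits to. Here's one sketch in Python; every name, field, and SLA value here is a hypothetical illustration, not a standard:

```python
# Sketch of a data product contract a domain team might publish.
# All names, fields, and SLA values are hypothetical illustrations.
from dataclasses import dataclass

@dataclass(frozen=True)
class DataProductContract:
    name: str                   # discoverable product name
    owner_team: str             # domain team accountable for quality
    schema: dict                # field name -> type: the published interface
    freshness_sla_minutes: int  # how stale consumers may expect data to be
    version: str = "1.0.0"      # bumped on breaking schema changes

settled_transactions = DataProductContract(
    name="payments.settled_transactions",
    owner_team="payments",
    schema={
        "transaction_id": "string",
        "amount": "decimal",
        "settled_at": "timestamp",
    },
    freshness_sla_minutes=15,
)
print(settled_transactions.name, settled_transactions.version)
```

The contract is the point, not the class: a consumer can discover the schema, know who to call when quality slips, and see when a version bump signals a breaking change — the same promises an API publishes.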
I’m sympathetic to the organizational argument. Centralized data teams do become bottlenecks — I’ve been stuck in that queue myself. But I’m skeptical about execution. Domain teams need tooling, training, and actual incentives to build quality data products. Most organizations can’t get teams to write decent API documentation; asking them to own data quality, schema evolution, and SLA guarantees is… optimistic.
The parallel to DDD is intentional, and I think it actually holds up. If you’ve done the work of identifying bounded contexts and drawing context maps, you’ve already figured out your domain boundaries. Those same lines should define who owns what data.
Data fabric: unify access #
Where data mesh pushes ownership outward, data fabric pulls access together. Gartner’s been heavy on this concept, and yeah, the marketing is thick — but strip that away and there’s something useful underneath.
A data fabric is an architectural layer that gives you consistent access to data living in different places. Your app queries one interface; the fabric figures out where the data actually lives, applies governance, handles caching, translates schemas.
Think of it as an anti-corruption layer for data. Consumers don’t need to know if the data sits in PostgreSQL, a data lake, some third-party API, or all three. The fabric hides that mess.
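Stripped of the marketing, a narrow fabric is mostly a routing facade. Here's a toy Python sketch; the datasets and backends are hypothetical stand-ins for PostgreSQL, a data lake, and a third-party API:

```python
# Toy data-fabric facade: one query interface, multiple backends.
# Datasets and backends are hypothetical stand-ins for PostgreSQL,
# a data lake, and a third-party API.

class DataFabric:
    def __init__(self):
        self._routes = {}  # dataset name -> callable that fetches rows

    def register(self, dataset, fetch):
        self._routes[dataset] = fetch

    def query(self, dataset, **filters):
        # Consumers see one interface; the fabric knows where data lives.
        if dataset not in self._routes:
            raise KeyError(f"unknown dataset: {dataset}")
        rows = self._routes[dataset]()
        return [r for r in rows
                if all(r.get(k) == v for k, v in filters.items())]

fabric = DataFabric()
fabric.register("customers",
                lambda: [{"id": 1, "region": "eu"}, {"id": 2, "region": "us"}])
fabric.register("stock_levels",
                lambda: [{"sku": "A-1", "on_hand": 7}])

print(fabric.query("customers", region="eu"))
```

A real fabric adds governance, caching, and schema translation behind that `query` call, but the shape is the same: consumers name the dataset, the fabric handles the rest.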
In practice, I’ve only seen this work when the scope stays narrow. A full enterprise fabric covering every data source is a multi-year project that usually dies in committee. A domain-specific fabric unifying three or four sources for one business capability? That ships. That delivers value.
Where DDD meets data architecture #
The connection between domain-driven design and data architecture isn’t cosmetic — it’s structural.
Context maps show how bounded contexts relate: customer-supplier, shared kernel, anti-corruption layer. Those same relationships govern data integration. When orders consumes from inventory, the pattern you choose — sync or async, push or pull, tight or loose schema coupling — should follow the context map you already drew for the application.
Anti-corruption layers matter especially here. When you integrate with a legacy system or third-party data source, you don’t want their schema leaking into your domain model. An ACL translates external data into your language. Same principle whether the data arrives via API, CDC stream, or some batch file that shows up at 3 AM.
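A minimal sketch of such a translation in Python — the external payload shape and the domain event type are both hypothetical:

```python
# Anti-corruption layer sketch: translate an external system's stock
# event into the order domain's own language. The legacy field names
# and the domain type are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class StockChanged:
    sku: str
    available: int  # the order domain only cares about availability

def translate_legacy_stock_event(raw: dict) -> StockChanged:
    # The legacy system speaks in "WH_ITEM_CD" and "QTY_ON_HND";
    # nothing downstream of this function ever sees those names.
    return StockChanged(
        sku=raw["WH_ITEM_CD"].strip().upper(),
        available=max(0, int(raw["QTY_ON_HND"])),
    )

event = translate_legacy_stock_event(
    {"WH_ITEM_CD": " ab-100 ", "QTY_ON_HND": "12"}
)
print(event)
```

The translation function is the entire boundary: normalization, defensive parsing, and renaming all happen in one place, so the legacy schema can change without rippling through the domain model.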
```
┌──────────────────┐          ┌──────────────────┐
│  Orders Domain   │   CDC    │ Inventory Domain │
│                  │◄─────────│                  │
│ ACL translates   │          │ Publishes stock  │
│ stock events     │          │ change events    │
│ into order       │          │                  │
│ domain language  │          │                  │
└──────────────────┘          └──────────────────┘
```

The practical takeaway #
Data architecture isn’t something you bring in after the app’s built — it’s a constraint that should shape your design from day one.
When you’re drawing service boundaries, ask: what data will this produce that other teams need? When you’re picking a database, ask: how will we extract change events from this thing? When you’re designing APIs, ask: can this schema evolve without breaking everyone downstream?
None of these are hard questions. But teams rarely ask them at the right time. Most discover their data problems six months after shipping, when someone needs a report joining three services and realizes the only path involves four custom ETL scripts and a prayer.
Start with the data flow. The architecture gets better when you do.