Building Cloud-Native Data Architecture

Charles Mingus said making the simple complicated is commonplace, but making the complicated simple is creativity. That's the entire job in cloud data architecture. Anyone can spin up forty services. The work is figuring out which six you actually need and getting the other thirty-four off the bill.

Cloud-native data architecture has been written about so many times that the genre has its own clichés. Storage and compute, separated. Managed services, preferred. Serverless, where possible. Diagrams with little arrows pointing at S3. None of it is wrong, exactly. It's just that reading it doesn't help anyone build a system that works, because the hard parts of cloud architecture are not the parts that fit on a diagram.

I want to write about the parts that don't fit on the diagram, based on the MSBA Financial Group project where I designed an end-to-end AWS pipeline for risk modeling. S3 for storage, Glue for ETL, Redshift for warehousing, SageMaker for the model. On paper it's the textbook architecture. In practice the textbook part took about a week. The other six weeks went into the things the textbook never mentioned.

The first thing nobody warns you about is cost. Cloud bills are designed to be opaque. AWS gives you 200 services, each priced on a different axis, and the bill arrives at the end of the month with line items that are technically accurate and operationally useless. The MSBA pipeline cost roughly four times what I'd estimated in week one. Not because the architecture was wrong, but because the assumptions about how often jobs would run, how much data would move between services, and how much idle compute would sit around waiting were all off by half an order of magnitude. The fix wasn't redesigning the system. It was instrumenting the cost layer the same way you'd instrument latency. Once each step in the pipeline had a dollar amount attached, decisions about which step to optimize became obvious.

The second thing is the seam between storage and compute, which everyone celebrates as the great virtue of cloud-native design. The seam is real and the virtue is real, and also the seam is where most of the bugs live. Data sitting in S3 in one schema, getting transformed by Glue into another schema, landing in Redshift in a third schema. Three places where the column types can drift, three places where a null can become a string of the word "null," three places where a date format can quietly change from ISO to something the downstream consumer doesn't expect. Every cloud-native system I've worked on has at least one of these silent corruptions running, and finding them is mostly a matter of building enough validation between the layers that the corruption becomes loud instead of quiet.

The third thing is managed services. The pitch is that managed services let the team focus on the data instead of the infrastructure. The actual experience is that managed services let you focus on the data right up until the managed service does something you didn't expect, at which point the absence of access to the underlying infrastructure becomes the entire problem. Glue jobs that fail with errors that point to internal AWS components nobody can see. Redshift query plans that change overnight because the optimizer got an update. SageMaker endpoints that scale in ways the documentation doesn't quite cover. Managed services are great until they aren't, and the right way to use them is to assume that one day, you will need to debug something you can't see, and to architect accordingly. Logging everything. Versioning everything. Keeping the pipeline modular enough that any one managed service can be swapped out without rebuilding the whole thing.

The fourth thing is security, which gets one paragraph in every cloud architecture article and deserves about five. Most cloud security failures are not exotic attacks. They're misconfigured S3 buckets, IAM roles with too many permissions because nobody had time to scope them properly, and credentials in environment variables that ended up in a Slack channel during a debugging session. The architecture decisions that make a system secure are not the ones about encryption-at-rest. They're the ones about how access is granted, how it's audited, and how it's revoked. Most of those decisions get made implicitly in the first week of a project, by whoever happens to be setting up the IAM roles, and live with the project for years.

A few patterns I've come around to:

Default to the boring service. AWS has 200 services and you need about 12 of them. The temptation is to use the new one. The new one will save you time in week three and cost you time in month nine. Use the boring one until the boring one is genuinely the wrong tool, which is rarer than the AWS marketing team would like you to believe.

Build the cost dashboard before you build anything else. Not after. Before. If the cost surface is invisible at the start, it will be invisible when the bill triples, and by the time anyone notices, the architectural decisions that caused the problem will be three layers deep.

Treat IAM like production code. Review it. Version it. Don't let it accumulate. The biggest risk in most cloud systems is not a sophisticated attack. It's an old role that nobody remembered to revoke.

Decouple aggressively. Every place where two services talk directly to each other is a place where, eventually, one of them is going to change in a way the other doesn't expect. The architecture pattern that survives is the one with queues, message buses, and explicit contracts in between. The architecture pattern that doesn't is the one where Glue calls Redshift calls SageMaker in a chain, and one upgrade breaks all three.

Document the system that exists, not the system you wish existed. The diagram in the original RFC is almost always wrong by month three. The runbook that matches the actual production system is the one the on-call engineer needs at 3 a.m., and the discrepancy between the two is where outages happen.

The deeper thing about cloud-native data architecture is that it's not really about the cloud. It's about distributed systems, with all the failure modes distributed systems have always had, dressed up in a UI that makes it look easier than it is. The architects who build well in the cloud are the ones who respect what's actually happening underneath, even when the platform is trying to abstract it away. The ones who struggle take the abstractions at face value and get surprised when reality leaks through.

Mingus made the same point about composition that applies here. The hard work is in the editing, not the addition. Most cloud architectures fail because someone kept adding services. The good ones survive because someone kept taking them away.

Building Cloud-Native Data Architecture

Continue Reading

Building Intelligent Automation Systems

Data Science in Practice: Real-world Applications