IaC Tools Got Us Here. Can They Get Us There?

Tools like Terraform, Ansible, Puppet and Chef are powerful solutions that changed the way we organize and deploy cloud infrastructure. Even as they continue to facilitate the creation of ever more powerful and therefore complex systems, they may end up becoming victims of their own success.

By moving cloud infrastructure configuration to code, we could suddenly manage configurations via Git and deploy them using a continuous deployment service. Lo and behold, the idea of Infrastructure as Code (IaC) was born. This innovation made cloud system deployment dramatically easier, but it left us with a new problem: fragmented and static configuration code.

When configuration code is removed from its architectural context, it becomes very difficult to grasp the purpose it serves in the broader system, which limits our ability to make edits mindfully. And we’re not just talking about a few scattered configuration files; we’re talking about hundreds of them. With every change, from one deployment to the next, engineers have more and more configuration files to find, examine and edit.

To combat this, IaC tools have introduced an array of solutions to scale up configuration management, including templates, Terraform modules, Pulumi stacks and other features to increase efficiency. But this gain in efficiency doesn’t fundamentally change the nature of the problem.

The Current State of Affairs in Cloud Deployment

Where Are These Configuration Files Coming From?

Configuration files grow as a result of a product becoming more complex in support of the business it serves. When your product complexity increases, your cloud system complexity increases, and so do your deployments.

Unlike horizontal scale, in which a system is expanded to accommodate more users and traffic, system complexity is the result of increased functionality. For example, a brand new application doesn’t need user account management features. In the beginning, an engineer can fix a user's account by manipulating user data directly in the production database. But this is risky and time-consuming, so before long, the system needs a customer account management application.

Of course, customer account management is not just a few features. Extended billing capabilities and payment options, invoicing, compliance, GDPR data exports, user data deletion, data privacy, improved security like 2FA, tax, marketing insights… all of this requires adding new databases, queues, file stores, and many other services to a cloud system. It's been only a few paycheck cycles, and suddenly, a simple system with "just" 50-60 cloud configuration files balloons into the 100-200 range. And it just keeps growing.

Configuration Code is Static and Fragmented

Applying the term “code” to configurations is a bit of a misnomer. Configurations are most often just data that represents desired parameters. On their own, configuration files have very little ability to link values from one file to another.

Imagine a spreadsheet without formulas or links between sheets. Each cell is an independent, isolated value. Would you want to go through the pain of cutting and pasting values from one sheet to another as changes are made? You probably wouldn’t. The very same goes for cloud configuration files. Cascading changes between dependent services are often made manually. Cutting and pasting, surprisingly, is common practice.
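The spreadsheet analogy can be made concrete with a small Python sketch. It is only an illustration: the service names, host, port, and the render_api_config helper are hypothetical and not taken from any particular IaC tool. The first pair of dictionaries plays the role of cells without formulas, while the helper plays the role of a formula that derives one value from another.

```python
# Hypothetical settings for two dependent services, written as independent data.
# The database host and port appear twice and must be kept in sync by hand.
static_db = {"host": "db.internal", "port": 5432}
static_api = {"database_url": "postgres://db.internal:5432/app"}


def render_api_config(db: dict) -> dict:
    """Derive the API settings from the database settings, like a spreadsheet formula."""
    return {"database_url": f"postgres://{db['host']}:{db['port']}/app"}


# Change the port in one place only...
static_db["port"] = 6432

# ...and the hand-copied value is now stale, while the derived value follows along.
linked_api = render_api_config(static_db)
print("copied: ", static_api["database_url"])
print("derived:", linked_api["database_url"])
```

The copied value still points at the old port until someone remembers to edit it, which is the manual cascade described above.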

Deployments Multiply Configurations

We always need multiple deployments, including local, testing, staging, and production. Each deployment environment contains a completely new set of independent, fragmented files. Remember, they are not linked. Changes to one environment do not propagate to another without manual intervention.

We can reuse some code that's been distributed through modules, which slows the growth in configuration lines needed for each deployed environment. Modules also help by reusing the same configuration for the same type of service. But you still end up copy-pasting other parts of the configuration code to instantiate that service in the cloud. Again, IaC tools do provide some management options to help, but they require significant resources to train teams and create new policies and workflows, all for the sake of managing the explosion of configuration files.

Each deployed service in a given environment has independent files that have been either written from scratch or copy-pasted and edited. Understanding and accurately capturing the dependencies between configurations is hard enough in a single deployment. Spread across several deployments, the manual work (and the potential for errors) multiplies for even the simplest of changes.
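To illustrate how independent per-environment files drift apart, here is a minimal Python sketch that compares two environment configurations loaded as plain dictionaries. The find_drift function and the sample staging and production values are assumptions made up for this example. The sketch also treats every difference as drift, even though some differences between environments are intentional.

```python
def find_drift(base: dict, other: dict, prefix: str = "") -> list:
    """Return dotted paths of keys that are missing from, or different in, `other`."""
    drift = []
    for key, value in base.items():
        path = f"{prefix}{key}"
        if key not in other:
            drift.append(f"{path}: missing")
        elif isinstance(value, dict) and isinstance(other[key], dict):
            # Recurse into nested sections so the report points at the exact key.
            drift.extend(find_drift(value, other[key], prefix=f"{path}."))
        elif value != other[key]:
            drift.append(f"{path}: {value!r} != {other[key]!r}")
    return drift


# Hypothetical configurations: a cache was added to staging, and the retry
# count was changed there, but nobody carried either change to production.
staging = {"queue": {"name": "orders", "retries": 5}, "cache": {"ttl": 60}}
production = {"queue": {"name": "orders", "retries": 3}}

drift = find_drift(staging, production)
for entry in drift:
    print(entry)
```

A check like this can report that two environments differ, but it cannot say which differences are deliberate. That judgment still falls to an engineer who understands the architecture.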

"The Single Source of Truth" Argument

One of the key benefits of IaC, or so we hear in the industry, is that configuration files provide a single source of truth for how a cloud system operates. And it is true, to an extent. But the source of truth for what? For the software system or the cloud infrastructure? What impact does IaC have on defining the CI/CD system? What impact does it have on the development environment?

More importantly, how do configurations as a single source of truth help keep all deployment environments in sync? With IaC as our tool, nothing inherently ensures that changes to one deployment are properly reflected in all the others. For example, when we add a database to the staging environment, nothing guarantees that a production-grade database serving the same architectural purpose, with properly configured supporting services, is also provisioned in the production environment. IaC, as it works today, requires human contextual understanding to know when and where configurations apply. And then a human needs to apply them.
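The staging database example hints at what is missing: a record of architectural purpose. The Python sketch below assumes each resource is tagged with a hypothetical role label, then checks that every role present in staging has a production-grade counterpart of the same kind. The Resource class, its fields, and the tier names are invented for illustration and do not come from any IaC tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    role: str  # hypothetical label for the architectural purpose, e.g. "orders-db"
    kind: str  # e.g. "database" or "queue"
    tier: str  # e.g. "dev" or "production-grade"


def missing_in_production(staging, production):
    """Return staging roles that have no production-grade counterpart of the same kind."""
    covered = {(r.role, r.kind) for r in production if r.tier == "production-grade"}
    return sorted({r.role for r in staging if (r.role, r.kind) not in covered})


# Hypothetical environments: the orders database exists only in staging.
staging = [
    Resource("orders-db", "database", "dev"),
    Resource("events", "queue", "dev"),
]
production = [
    Resource("events", "queue", "production-grade"),
]

gaps = missing_in_production(staging, production)
print(gaps)
```

The check only works because each resource carries its role explicitly. Plain configuration files usually carry no such information, which is why a human has to supply it.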

The Wrong Explanatory Framework

It is worth restating that IaC tools have changed the way we deploy cloud systems for the better. Infrastructure as Code is a very helpful next step in the evolution of cloud software systems, but it is not the last step. As we have seen through this discussion, the ability to treat configurations as code allows us to do new things, but it does not remove the limitations of working at the configuration level.

Cloud system “management-by-configuration” leaves us in the weeds of complex systems, and because of that, the complexity always catches up with us. As dependencies multiply, deployments become more challenging, and over time, create MORE manual work for developers, not less.

This is why we argue that IaC is, in effect, just a faster horse, and not the breakthrough cloud systems need. Instead, we need to elevate our frame of reference from configurations to components and systems. We already think in terms of components and systems, but our day-to-day work doesn’t reflect that, creating inefficiencies that ripple through every stage of the development cycle.

To learn about how Torque Framework changes the nature of managing cloud system architectures, check out this piece about our breakthrough that enables Self-Configuring Systems.