Terraform made our API better

For years, our API and the clients that used it agreed completely. We built our dashboard and our backend side by side, so both ends shared the same assumptions about every field, every type, and every response — and from the outside, everything just worked. That close agreement is also what kept a handful of inconsistencies in the contracts out of sight: a field typed one way going out and another coming back, a "success" that meant "accepted" rather than "finished." None of it ever caused a problem, because every client already spoke the API's exact dialect. It helped that a person in a browser is a forgiving client too — the form submits a string, the table renders a number, and "443" and 443 look identical on screen.

Then we built a Terraform provider — a client that hadn't been in the room when any of those contracts were agreed — and the inconsistencies it couldn't paper over came straight to the surface.

Terraform is, as far as we can tell, the most demanding API client that exists. It needs strict types, because it stores your infrastructure as a typed state file and diffs it on every run. It needs real idempotency, because it will read back what it just wrote and compare. It needs deletes that actually wait for completion, because the next apply assumes the thing is gone. And it needs stable resource identity, because the whole model depends on knowing the ID of the resource it just created. A client that already knows an API's quirks works around them without noticing. Terraform won't — it holds a contract to exactly what the contract says. Neither, it turns out, will an AI agent.

Building the provider didn't just teach us how to integrate with Terraform. It surfaced inconsistencies that had sat latent for years — harmless to every client we had, but real — and gave us the occasion to tighten the contract underneath all of them. Here are the ones worth telling.

[Figure: terraform apply · american cloud provider. Apply, then re-plan: 'No changes.' Integer ports in, integer ports out, nothing drifts.]

Ports were strings on the way in, integers on the way out

When you created a firewall, egress, or network ACL rule, the API accepted startPort and endPort as strings. When you read that same rule back, it returned them as integers. A person in the dashboard never saw this: "443" and 443 render identically on screen, so nobody notices the seam.

Terraform notices immediately. It writes "443", reads back 443, and then has to decide whether the resource drifted. A value whose type changes between write and read can never match what was stored, so Terraform sees a perpetual change it cannot reconcile and reports your infrastructure as permanently out of sync. In 1.3.0 we made startPort and endPort integers (1–65535) on create and update, matching the integer values the API had always returned on read. It's a breaking change on paper. In practice it just made the contract say what was already true on one side of it.
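The mismatch is easy to reproduce with a strict equality check, which is roughly what a typed state diff amounts to. This is a minimal sketch in Python, not the provider's actual code; has_drift and validate_port are hypothetical helpers written for illustration.

```python
def has_drift(written, read_back):
    """Report drift the way a strictly typed client would:
    both the type and the value have to match."""
    return type(written) is not type(read_back) or written != read_back


def validate_port(value):
    """Accept only integers in the 1-65535 range described above."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"port must be an integer, got {type(value).__name__}")
    if not 1 <= value <= 65535:
        raise ValueError(f"port must be between 1 and 65535, got {value}")
    return value


# Old contract: string on write, integer on read. Every comparison reports drift.
print(has_drift("443", 443))  # True

# New contract: integer both ways. The round trip is stable.
print(has_drift(validate_port(443), 443))  # False
```

A forgiving client would compare the two values after converting them to strings and never see the difference. A typed client has no reason to do that conversion.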

"unlimited" was a string pretending to be a number

Object storage had a field that could be a number of buckets, or the literal string "unlimited". To a human reading a settings page, "unlimited" is perfectly clear. To a type system, it's a landmine: a field that is usually a number and occasionally the word "unlimited" can't be modeled as a number, so every consumer has to special-case the string before doing arithmetic on it. Most consumers don't, which means the "no limit" case is exactly where the bug hides.

We changed maxBuckets and limitKb to be a number or null, where null means no limit. The "unlimited" string is gone. null is the honest way to say "no value here," and every type system already knows how to handle it. Stringly-typed sentinels are convenient for the person who writes them and a tax on everyone who reads them.
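With a number-or-null field, the "no limit" case becomes an ordinary optional value instead of a string to special-case. The sketch below assumes a JSON payload shaped like {"maxBuckets": null}; the helper names remaining_buckets and can_create_bucket are hypothetical and exist only to show the pattern.

```python
import json
from typing import Optional


def remaining_buckets(max_buckets: Optional[int], used: int) -> Optional[int]:
    """Return how many buckets are left, or None when there is no limit."""
    if max_buckets is None:  # JSON null decodes to None: no limit
        return None
    return max(max_buckets - used, 0)


def can_create_bucket(max_buckets: Optional[int], used: int) -> bool:
    """True when another bucket fits under the limit, or when no limit applies."""
    remaining = remaining_buckets(max_buckets, used)
    return remaining is None or remaining > 0


# Illustrative payloads, not captured from the real API.
limited = json.loads('{"maxBuckets": 10}')
unlimited = json.loads('{"maxBuckets": null}')

print(remaining_buckets(limited["maxBuckets"], 3))    # 7
print(remaining_buckets(unlimited["maxBuckets"], 3))  # None
```

A type checker will flag any arithmetic on max_buckets that skips the None check, which is the protection a string sentinel could not offer.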

Create told you it succeeded but not what you'd made

Creating an object storage unit used to return a generic success acknowledgement. Fine for a human — you created the thing, you can see it in the list, you click into it. But Terraform has to record the ID of the resource it just created, immediately, or it loses track of it forever. A "success" message with no identifier means the very next operation has nothing to operate on.

Now the create call returns the created unit: its storageUnitId, createdAt, and maxBuckets. The storageUnitId it hands back is the same identifier every other object-storage call takes. This is the difference between an API that confirms an action happened and one that hands you the handle to the thing you made. The first is good enough for a person. Only the second works for automation.
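On the client side, that response can be read straight into a typed record. The following is a sketch under assumptions: the field names storageUnitId, createdAt, and maxBuckets come from the text above, but the sample values, the string type of the identifier, and the StorageUnit and parse_create_response names are illustrative only.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StorageUnit:
    storage_unit_id: str        # assumed to be a string for this sketch
    created_at: str
    max_buckets: Optional[int]  # None means no limit


def parse_create_response(body: str) -> StorageUnit:
    """Turn a create response into a record the caller can keep in state."""
    data = json.loads(body)
    unit_id = data.get("storageUnitId")
    if not unit_id:
        # Without an identifier there is nothing to read, update, or delete later.
        raise ValueError("create response did not include storageUnitId")
    return StorageUnit(unit_id, data["createdAt"], data.get("maxBuckets"))


# Made-up payload for illustration.
sample = '{"storageUnitId": "su-example-123", "createdAt": "2024-01-01T00:00:00Z", "maxBuckets": null}'
unit = parse_create_response(sample)
print(unit.storage_unit_id)  # su-example-123
```

A bare acknowledgement such as {"success": true} would raise here, which is the failure an automated client hits when a create call returns no handle.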

Delete reported success on submission, not on completion

This one was the most insidious, because it looked the most correct. Deleting an isolated network or a VPC returned success the moment the request was accepted — not when the teardown actually finished. A human deletes a network, sees the confirmation, and goes to lunch. By the time they look again, it's gone, and the gap between "accepted" and "gone" was invisible.

Run terraform destroy followed by a fresh apply and that gap becomes a hard failure. Terraform deletes the network, gets "success," and immediately tries to recreate one with the same name — while the old one is still tearing down in the background. The create fails because the name is still taken. The optimistic "success" was actively lying about the state of the world. In 1.3.0, isolated-network and VPC deletes wait for teardown to complete and surface the failure if teardown fails, instead of reporting success on submission. (We also dropped an internal job identifier that used to leak out of the delete response — a consumer should never have to poll our internals to find out whether their delete worked.)
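The race can be shown with a small in-memory model. FakeNetworkApi below is a hypothetical stand-in, not the real service or its SDK: it only captures the difference between a delete that returns on submission and one that returns after teardown.

```python
class FakeNetworkApi:
    """In-memory stand-in for a network API, for illustration only."""

    def __init__(self, wait_for_teardown: bool):
        self.wait_for_teardown = wait_for_teardown
        self.names_in_use = set()
        self.tearing_down = set()

    def create(self, name):
        if name in self.names_in_use:
            raise RuntimeError(f"name {name!r} is still taken")
        self.names_in_use.add(name)

    def delete(self, name):
        self.tearing_down.add(name)
        if self.wait_for_teardown:
            self.finish_teardown()  # return only once the resource is gone
        return "success"

    def finish_teardown(self):
        # Stands in for the background job that actually releases the name.
        self.names_in_use -= self.tearing_down
        self.tearing_down.clear()


def destroy_then_apply(api, name):
    """Create, delete, then immediately recreate under the same name."""
    api.create(name)
    api.delete(name)
    try:
        api.create(name)
        return "recreated"
    except RuntimeError as exc:
        return f"failed: {exc}"


print(destroy_then_apply(FakeNetworkApi(wait_for_teardown=False), "net-a"))
# failed: name 'net-a' is still taken
print(destroy_then_apply(FakeNetworkApi(wait_for_teardown=True), "net-a"))
# recreated
```

In the first run both calls return "success", but only the second run leaves the name free by the time the client acts on that answer.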

A 500 where a 400 or 404 belonged

Creating a snapshot of a volume that couldn't be snapshotted returned a 500. So did an operation aimed at a resource that no longer existed. A 500 means "the server broke," and a human reading it shrugs, maybe retries once, maybe files a support ticket. But to an automated client — Terraform, a CI pipeline, an AI agent — a 500 means "transient, try again." So it tries again. And again. A 500 where the real answer is "you asked for something impossible" turns a clear no into an infinite retry loop.

The fix was to tell the truth with the status code. Snapshotting a volume that can't be snapshotted now returns a descriptive 400: "this request is invalid, and here's why." An operation against a resource that has been deleted now returns a 404: "the thing you're addressing isn't here." A 4xx tells the client "this will never work, stop asking," which is exactly the information an automated caller needs to fail fast and tell its operator what's wrong, instead of hammering an endpoint that was never going to say yes.
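The retry behavior described above fits in a few lines. This is a deliberately simplified sketch: should_retry and call_with_retries are hypothetical names, the policy retries every 5xx and nothing else, and real clients usually add backoff and treat a few other codes specially.

```python
def should_retry(status_code: int) -> bool:
    """Retry server-side failures; treat every other status as final."""
    return 500 <= status_code <= 599


def call_with_retries(send, max_attempts=3):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is any zero-argument callable that returns an HTTP status code.
    Returns the last status and the number of attempts made.
    """
    attempts = 0
    status = None
    while attempts < max_attempts:
        attempts += 1
        status = send()
        if not should_retry(status):
            break
    return status, attempts


# A 500 for an impossible request burns every attempt.
print(call_with_retries(lambda: 500))  # (500, 3)
# A 400 or 404 stops the client after one try.
print(call_with_retries(lambda: 400))  # (400, 1)
print(call_with_retries(lambda: 404))  # (404, 1)
```

The cap on attempts is what keeps the first case from looping forever. A client without one behaves as the paragraph above describes.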

The ripple: every consumer below found bugs the one above couldn't

Here is the part we didn't expect. We didn't fix these bugs for Terraform alone. We have a public OpenAPI specification, SDKs in TypeScript, Python, and Go generated from it, and an MCP server that lets AI agents drive the platform. They're stacked: the SDKs are generated from the spec, the MCP server is built on the SDKs, and the Terraform provider sits on the same contract. Each layer is a stricter consumer than the one above it, and each one surfaced inconsistencies the previous layer's users had quietly absorbed.

The SDKs forced the types to be honest, because a generated client can't fudge string-or-number. The MCP server forced the errors and the create-returns-the-resource behavior to be honest, because an AI agent reacts to status codes literally and needs the ID of what it just made. The Terraform provider forced the delete semantics to be honest, because nothing else models the full lifecycle of a resource the way state-based infrastructure does. AI agents, it turns out, are ruthless API testers for the same reason Terraform is: there's no human intuition in the loop to paper over an inconsistency. They take the contract at its word, and when the contract is wrong, they break loudly instead of quietly compensating.

What's underneath

The corrected contract is now the foundation that everything else stands on. The OpenAPI spec describes an API whose types don't shift between read and write. The SDKs generate cleanly from it. The MCP server gives AI agents a surface that fails fast and honestly. And our Terraform provider — the client that started all of this — is now on the Terraform Registry, built on the idempotency and identity guarantees it forced into existence.

If you maintain an API and you want to know where it's lying to you, build a Terraform provider for it. Or point an AI agent at it and watch what breaks. You will not enjoy the first week. But every inconsistency a forgiving human has been silently fixing for you will surface at once, and you'll end up with a contract that means what it says — which is the only kind worth building on.

You can see the work in the open. The API reference is public, our SDKs and tooling are on GitHub, and you can read how to put an AI agent on top of the platform in the MCP server overview.