You should have lots of AWS accounts

I know. You need another AWS best practice like you need another covid exposure. But you really should have lots of AWS accounts. Lots of AWS accounts working together in harmony will net you a more secure, more reliable, and more compliant cloud infrastructure.

All of the reasons to have lots of AWS accounts boil down to different forms of desirable isolation. As the original form of isolation in AWS, predating IAM entirely, AWS accounts are the most complete form of isolation on offer. And so, thanks to AWS’ legendary commitment to their public APIs, AWS accounts became the foundation of AWS Organizations. They remain the best way to achieve maximum isolation even today. AWS accounts can create strong security boundaries, reduce the blast radius of changes, improve your control over AWS service limits, avoid high-stakes tagging requirements, and more.

Powerful as it is, the AWS IAM policy language makes it easy to grant access to all resources in an account (EC2 instances, security groups, DNS names, load balancers, IP addresses, databases, etc. — resource is the catch-all term for a thing in AWS) and hard to grant access to a subset of those resources. Granting access to all resources can be a risky security posture. An attacker who gets a foothold would have wide-ranging access and an easy time moving laterally to gain persistence or exfiltrate your most valuable data.

Isolation begets reliability, too. Without sufficient isolation, increased traffic to one service can cause it to become a noisy neighbor to others, exhausting CPU, memory, or I/O capacity or even pushing the whole account beyond an AWS service limit, resulting in an outage. And the same IAM policy shortcuts discussed in a security context above can become reliability risks when changes are made too broadly or too quickly. Here again, changes to one insufficiently isolated environment or service might impact another and cause an outage.

Compliance regimes demand isolation, too. SOC 2 programs almost always include a control declaring that production data is never used in development or testing and numerous controls describing how human access to sensitive systems and data is stratified and managed. Without meaningful isolation, all data must be protected as carefully as the most sensitive data.

Secure, reliable, and compliant systems are built on isolation. In AWS, isolation is built on lots of AWS accounts.

If you want a tool that can manage all your accounts for you, making it easy to create accounts and roles and to bootstrap your Terraform infrastructure, take a look at Substrate, or contact us for a demo.

What’s so wrong with only having one AWS account?

Imagine you’re trying to create the kind of isolation necessary to deliver the security, reliability, and compliance that business customers demand in one AWS account. Your SOC 2 compliance program says that the staging and production environments are isolated from each other. You’ve had outages before in which deploying one service took down another, so you’re hoping to achieve some isolation between services, too.

You’ve already created separate Terraform state files for your staging and production environments but know that terraform apply always uses the same all-powerful IAM role. That means there’s nothing stopping a Terraform run that nominally applies to staging from modifying production resources, a scary thought. Worse, this is but one case in which software in staging can impact production.

Your solution is to use a different IAM role to run Terraform in staging than you do in production. Likewise, each service will use a different IAM role (in its EC2 instance profile) in staging than in production. You get to work defining these roles and, right away, realize you’re in over your head. AWS IAM policies can allow (or deny) access to certain resources but the language for specifying those resources is very limited. IAM policies support adding a clumsy boolean expression of conditions but you can’t just allow access to all resources with names ending in “-staging” and support for allowing access to all resources tagged with Environment=staging is far from universal. It’s impossible to enumerate all the resources ahead of time, too, since many will be created and destroyed every time you deploy. Many APIs, e.g. ec2:DescribeInstances, cannot be meaningfully constrained, so you’re left with a lot of overprivileged IAM roles in your one big AWS account.
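
To make the problem concrete, here is a minimal sketch of the tag-based approach described above, written in Python only so the policy JSON can be generated and inspected. The function name staging_policy and the specific action lists are illustrative assumptions, not a recommended policy. Notice that ec2:DescribeInstances has to live in a second statement with no condition at all, which is exactly the kind of leftover privilege this paragraph is complaining about.

```python
import json


def staging_policy(taggable_actions, unconstrained_actions):
    """Build an IAM policy document that tries to confine a role to staging.

    taggable_actions: actions assumed to honor the aws:ResourceTag condition key.
    unconstrained_actions: actions that cannot be scoped and must be allowed broadly.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "TaggedStagingResourcesOnly",
                "Effect": "Allow",
                "Action": sorted(taggable_actions),
                "Resource": "*",
                # Only matches resources that carry the tag Environment=staging.
                "Condition": {
                    "StringEquals": {"aws:ResourceTag/Environment": "staging"}
                },
            },
            {
                "Sid": "CannotBeConstrained",
                "Effect": "Allow",
                "Action": sorted(unconstrained_actions),
                "Resource": "*",
            },
        ],
    }


policy = staging_policy(
    taggable_actions=["ec2:StartInstances", "ec2:StopInstances", "ec2:TerminateInstances"],
    unconstrained_actions=["ec2:DescribeInstances"],
)
print(json.dumps(policy, indent=2))
```

Every new action you add forces the same question: does it honor the tag condition or not? In one big account, each wrong answer is either an outage or an overprivileged role.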

You encounter this same dynamic when you try to create the roles your engineers will use to operate the system. It’s easy to allow or deny folks access to specific APIs but hard, for all the reasons discussed above, to restrict that access to only the services for which they’re responsible. And that really matters. It’s perfectly reasonable to allow service owners to use ec2:AuthorizeSecurityGroupIngress to configure the load balancer for their services. It’s highly inappropriate for folks on Team X to use that same API to surreptitiously authorize their clients to connect directly to Team Y’s database.

Nearly everything in AWS is subject to quotas (EC2 instances of a certain type, public IP addresses, RDS databases) and these quotas are enforced per account, and usually per region. It’s all too common, when you only have one AWS account, to encounter a QuotaExceededException only to find that a different service is using 97% of the quota. Nonetheless, it’s you who’ll have to wait for AWS support to raise your quota (and you’re likely to have to wait quite a while if you’re on Basic Support).

If you choose to use Kubernetes to manage how your software is running in EC2 (which is a fine decision) then you also have to manage upgrades to Kubernetes itself. If you choose to populate your one AWS account with only one Kubernetes cluster then you’ve implicitly chosen a risky all-at-once upgrade process, too.

Creating isolation within one AWS account is a relentless game of whack-a-mole, of ensuring that IAM policies enumerate everything properly, that tags are being applied thoroughly, and that systems are assuming the appropriate IAM roles.

How do you even use multiple AWS accounts at the same time?

This section is for folks who’ve used AWS before but only ever in one account (at a time). You’re no doubt familiar with S3 buckets, one or a few ways to buy compute capacity, VPC networking, some of their managed database offerings, etc. But you’re unsure how you’d continue to offer a single, integrated product experience that’s somehow served by lots of AWS accounts.

Principals in IAM policies

When you have lots of AWS accounts, you have to plan who can access what. In the AWS IAM policy language a principal is how we talk about who is attempting to take an action. A principal in one account must be explicitly authorized to take an action in another account — isolated by default with explicit exceptions. Some AWS services support cross-account actions directly. For example, KMS allows cross-account use of encryption keys and S3 allows cross-account access to object storage. In these cases, you write an IAM policy that allows certain principals (in other AWS accounts) access to the AWS APIs specified in the policy that act on the e.g. KMS key or S3 bucket in question.
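
As a rough illustration of the S3 case, the sketch below builds a bucket policy that lets principals in other accounts put objects into a bucket. The function name, the bucket name, and the account numbers are all made up for the example. Python is used only to generate the JSON.

```python
import json


def cross_account_put_policy(bucket, writer_account_ids):
    """Build an S3 bucket policy allowing other accounts to put objects."""
    # "arn:aws:iam::<account>:root" names the whole account as a principal;
    # that account's own IAM policies then decide which of its roles may use it.
    principals = [f"arn:aws:iam::{account_id}:root" for account_id in writer_account_ids]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCrossAccountPut",
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


# Hypothetical bucket name and account numbers, for illustration only.
bucket_policy = cross_account_put_policy(
    "example-shared-uploads", ["111111111111", "222222222222"]
)
print(json.dumps(bucket_policy, indent=2))
```

The bucket policy is only half of the handshake. The role doing the writing in the other account still needs its own IAM policy allowing s3:PutObject on this bucket.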

Most AWS services, though, punt on cross-account access and ask you to first assume a role in the other account. That role may allow certain principals in other AWS accounts to assume it and then be allowed access to the AWS APIs specified in the policy. You use the sts:AssumeRole API to trade the access key ID, secret access key, and session token associated with your initial AWS account for a temporary new set associated with the other account. Then you use these new credentials to make further AWS API requests. All of the AWS SDKs make assuming roles and handling these changing credentials really simple.
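
Here is a minimal sketch of that credential trade in Python. It assumes you pass in an STS client such as the one returned by boto3.client("sts"); the helper name assume_role_credentials and the role and session names are hypothetical. The helper takes the client as an argument so it can be exercised with a stand-in object and no AWS access.

```python
def assume_role_credentials(sts_client, account_id, role_name, session_name):
    """Trade the caller's credentials for temporary ones in another account.

    sts_client is expected to behave like boto3.client("sts").
    Returns a dict shaped like the keyword arguments boto3 clients accept.
    """
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
    )
    credentials = response["Credentials"]
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }
```

With boto3 installed, usage might look like boto3.client("ec2", **assume_role_credentials(boto3.client("sts"), "111111111111", "Deployer", "deploy-session")), where the account number and role name are placeholders. The temporary credentials expire, so long-running programs are usually better served by the SDK's built-in role support than by a hand-rolled helper like this one.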

Cross-account networking

Offering your single, integrated product is more than just cross-account access to AWS services, though. Your various services need to be able to work together, too.

You could create a VPC in every one of your accounts and peer them using VPC peering or a Transit Gateway but this would place a tax on every byte that crosses the network between services and make zonal architectures hard to reason about (because AWS maps availability zone names to physical zones differently in each account). You could get even fancier and connect your services via PrivateLink but this, too, would place a tax on every byte.

My favorite way to create a network between all my services hosted in different AWS accounts is to share a VPC from a network account into all my service accounts and use security groups to authorize service-to-service communication. There’s no per-byte tax, zonal architectures are easy to reason about, and security groups work just like you expect.

With shared VPCs, networking ends up feeling very familiar, even with lots of AWS accounts. The biggest difference will be that most tasks will begin with an sts:AssumeRole request to get into the right account. If you forget (it happens to everyone, especially at first, don’t worry), you’ll be confused for a moment and then find your way to the right account. Rest assured, assuming a role becomes second nature pretty quickly.

How is having lots of AWS accounts better?

Now let’s reimagine meeting your business customers’ security, reliability, and compliance demands, this time with lots of AWS accounts. You’re still on the hook to isolate environments and very interested in isolating services, too. You create one set of accounts for staging and a parallel set for production. A few critical services will each get their own account in each environment.

In order to keep using Terraform, you create an all-powerful IAM role in each AWS account and configure Terraform to assume that role. Right away, this massively improves isolation by ensuring that Terraform code meant for staging can’t impact production and code meant for one service can’t impact another. Now you can terraform apply without fear.

You also create an IAM role in each account for EC2 instances there to use. Since each of these accounts hosts a single service in a single environment, you don’t need to write any complicated conditions, nor do you need to enumerate any resources in your IAM policies. "Resource": "*" will do fine. Because you have lots of accounts, “all the resources in this account” is a far cry from “all your resources.”

Authorizing access for engineers is similarly easier. You can list each engineer who’s responsible for a service, and not any of your other engineers, as principals who can access that service’s AWS account(s). In staging, you’ll probably make these engineers all-powerful. In production, you might restrict access to certain APIs. Here, too, "Resource": "*" will do fine because the account hosts only that one service.
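
A sketch of what that might look like for one service account follows. It assumes engineers are IAM users in some central account (the account number, user names, and function name are invented for the example) and builds the two documents a role needs: a trust policy naming who may assume it, and a permissions policy saying what it may do.

```python
import json


def engineer_role_policies(engineer_arns, allowed_actions):
    """Return (trust_policy, permissions_policy) for a per-service engineer role."""
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Only the engineers responsible for this service are listed.
                "Principal": {"AWS": sorted(engineer_arns)},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    permissions_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(allowed_actions),
                # Safe here because the account hosts one service in one environment.
                "Resource": "*",
            }
        ],
    }
    return trust_policy, permissions_policy


# Hypothetical engineers in a hypothetical central account.
trust, permissions = engineer_role_policies(
    engineer_arns=[
        "arn:aws:iam::999999999999:user/alice",
        "arn:aws:iam::999999999999:user/bob",
    ],
    allowed_actions=["*"],  # staging: all-powerful; narrow this list for production
)
print(json.dumps(trust, indent=2))
print(json.dumps(permissions, indent=2))
```

Compare this with the single-account version: no conditions, no resource enumeration, and the list of people is the only thing that changes from service to service.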

You’ll probably want to name or tag your AWS accounts so that your engineers don’t have to copy and paste 12-digit AWS account numbers all over everywhere. AWS SDK profiles can help, though sharing and synchronizing them between all your engineers becomes your problem. AWS account names and tags can be better, since they’re a part of the account and not merely a part of the API client.
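
If you do go the profile route, one way to ease the synchronization problem is to generate the profiles from a single list of accounts. The sketch below renders profile sections in the format the AWS CLI and SDKs read from their shared config file, using the role_arn and source_profile settings. The account names, account numbers, role name, and function name are all assumptions made up for the example.

```python
import configparser
import io


def render_profiles(accounts, role_name, source_profile="default"):
    """Render one named profile per account as AWS shared-config text.

    accounts maps a friendly profile name to a 12-digit account number.
    """
    config = configparser.ConfigParser()
    for name, account_id in sorted(accounts.items()):
        config[f"profile {name}"] = {
            "role_arn": f"arn:aws:iam::{account_id}:role/{role_name}",
            "source_profile": source_profile,
        }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Hypothetical accounts, for illustration only.
profiles_text = render_profiles(
    {"staging-api": "111111111111", "production-api": "222222222222"},
    role_name="Engineer",
)
print(profiles_text)
```

The output can be checked into a repository and appended to each engineer's config file, which at least keeps the 12-digit numbers in one place.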

Each of your AWS accounts gets its own quotas. This has two effects. In aggregate, since you’ve defined six AWS accounts, your limits start out six times higher than when you had only one AWS account. More importantly, when you do actually hit a limit, it’s your own fault, because the only thing that could be hogging a quota in your account is part of your own service, which you can fix.

Just like when you only had one AWS account, choosing Kubernetes (still a fine choice) means choosing to deal with Kubernetes upgrades. An EKS cluster lives entirely within one AWS account, so giving each account its own cluster automatically carries the same isolation you’re enjoying elsewhere into your Kubernetes upgrade process. Per-service upgrades are far less scary than global upgrades.

You’ve achieved isolation, and quite easily, but you’re probably not done. You may need to add IAM policy statements in one account to allow the other two to put objects into an S3 bucket hosted there. You may need to authorize one account to invoke a Lambda function hosted in another. You may need to authorize a few network paths from EC2 instances in two of your accounts to load balancers hosted in a third, made easy by the VPC you shared between all your staging accounts and the other VPC you shared between all your production accounts.

With lots of AWS accounts, isolation is the default. Security, reliability, and compliance come more easily and your engineers can move faster.

One more thing: Automatic cost categorization

There’s no monetary cost to each additional AWS account you open in your organization. There is, though, a very monetary reason to have lots of them: Cost management. No matter how thoroughly or sloppily you tag resources, every single line item on your AWS bill is associated with an AWS account number and you can use that to figure out where your money’s going.

The more AWS accounts you have, the more meaningful this cost categorization becomes. An account per service group, per environment can buy you many years of understanding your AWS bill without ever tagging a single resource. In some situations, an account per customer might make sense and would allow you to understand exactly how much it costs you to serve each and every customer.
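
The arithmetic is simple enough to sketch. Assuming you have exported billing line items as pairs of account number and cost (the input shape, the account names, and the dollar figures below are invented for illustration and are not a real AWS export format), grouping by account is all the categorization you need:

```python
from collections import defaultdict
from decimal import Decimal


def cost_by_account(line_items, account_names):
    """Sum line-item costs per account, labeled with friendly names when known.

    line_items: iterable of (account_id, cost_string) pairs.
    account_names: dict mapping account_id to a human-readable name.
    """
    totals = defaultdict(Decimal)
    for account_id, cost in line_items:
        label = account_names.get(account_id, account_id)
        totals[label] += Decimal(cost)
    return dict(totals)


# Made-up line items, for illustration only.
example_totals = cost_by_account(
    [
        ("111111111111", "120.50"),
        ("222222222222", "980.00"),
        ("111111111111", "30.25"),
        ("333333333333", "5.00"),
    ],
    {"111111111111": "staging-api", "222222222222": "production-api"},
)
for name, total in sorted(example_totals.items()):
    print(f"{name}: ${total}")
```

No tags are consulted anywhere in that function, which is the point: the account number alone says which service and environment spent the money.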

Cost categorization by account number saves everyone on your team the time and energy they would otherwise have to devote to diligent tagging. It’s a cherry on top of all the benefits isolation brings to your infrastructure.

Having lots of AWS accounts puts you on the fast path to robust isolation between environments and services, which is an important part of delivering a secure, reliable, and compliant product to business customers.

For a head start using IAM, VPC networking, and Terraform with lots of AWS accounts, check out Substrate, a set of command-line tools for managing lots of AWS accounts to deliver secure, reliable, and compliant cloud infrastructure.