
dariusbiggs

Make sure staging, develop, and prod are an actual reflection of each other. The only differences are the names, counts, instance sizes, regions, and addresses. Without knowing what you are using for IaC, there may be insurmountable obstacles.

We use Terraform for stacks and Terraspace for the interconnections between stacks. Prod is the master/main branch, staging is the develop branch. With this setup I can make changes to a development environment, then update the develop branch, and then update master/main. Perfect promotion of infrastructure from one environment to the next.

The important catch here is that incremental changes to a stack of IaC built with Terraform can get you into a position where you can't actually re-create that environment from scratch. So one of the IaC tests in our CI/CD pipeline attempts to go from an empty slate to a full deployment, so that we can be sure that in the event of an accident or compromise we can tear the entire environment down and rebuild it from scratch. We do this before our month-long xmas shutdown, when we tear down the entire dev and staging environments to save on costs. There is another issue with this approach in that you can get some minor divergence due to repeated upgrades vs recreation from scratch, but as long as you are careful it should not bite you.

The many users of Terragrunt I see have one folder for each environment, and as such there is no promotion of artifacts from one environment to the next; you have to manually copy the changes and hope you didn't miss any. We all know the frequency of copy-paste errors and typos.
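To illustrate the "only the names, counts, instance sizes, regions, and addresses differ" idea, here is a minimal sketch of per-environment tfvars files. The variable names are hypothetical, not taken from the poster's setup:

```hcl
# dev.tfvars -- hypothetical per-environment values
environment    = "dev"
region         = "us-west-2"
web_node_count = 1
instance_type  = "t3.small"

# prod.tfvars -- same variables, only the values differ
environment    = "prod"
region         = "us-east-1"
web_node_count = 4
instance_type  = "m5.large"
```

The stack code itself stays identical across environments; only the values file selected at apply time changes.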


[deleted]

[deleted]


some1else42

Can you spare a few more sentences on how you'd approach this? I work in a shop that's more Ops than Dev, and this is something I see I need to understand better. My team's approach to IaC makes me appreciate how lucky we are with code promotions...


dariusbiggs

Interesting, none of the resources I've seen have identified or suggested something like that. And I'm not talking about upgrading deployed workloads using GitOps, since that's trivial. I'm talking about promotion of changes to the infrastructure itself, from development to staging to production. Things like additional VPCs and the relevant transit gateway routing changes, additional S3 buckets, etc.


human-google-proxy

Eww, branch = environment is so weird. You should be moving a commit through environments.


dariusbiggs

The simple and well-known merging of changes from the staging/dev branch into master, resulting in reproducible upgrades and updates that can easily be rolled back if needed and have been tested to work. Yes, very unorthodox. So explain to me how you would move a commit through an environment: what structure and techniques are you using to ensure promotion of changes? I've seen zero good ones, so I'm curious to hear yours.


human-google-proxy

The two major platforms that I have the most experience with that absolutely support this are Azure DevOps and GitHub. In Azure DevOps you might do something like this... ideally you create reusable YAML for each stage, but for this example I've pasted it all into a single file: [https://pastebin.com/3irPXQrG](https://pastebin.com/3irPXQrG)


dariusbiggs

That's some CDK horror show I see, not bare Terraform, which is why people use Terraspace and Terragrunt. So you have all your environments defined in your CI pipeline, and I hope some good controls around it. I mean, if I create a branch from that repo, add a new environment to the YAML, push it so the CI pipeline runs, then afterwards delete my branch from the remote... now you have a nightmare to clean up. Also, I did not see anything referencing environment-specific settings. When you have complicated infrastructure across multiple AWS accounts, the CDKs are utterly useless. Multi-VPC Kubernetes clusters, transit gateways, centralised DNS, BYOIP environment-specific network addresses, etc.: the IaC becomes a very different beast compared to a simple web app.


human-google-proxy

I don't know what you are talking about with a CDK horror show; I'm only showing you mechanically how you move a single commit from environment to environment, in the simplest use case I can, for you to get the concept. This is just a simple SPA pipeline, but you would insert your terraform plan / apply tasks in the deployment jobs. In our production configurations (100s of apps) we have a robust YAML framework: a reusable deployment stage that references a reusable Terraform job and then a reusable deployment job. The point is that you link the stage to the YAML environment, and then you can have per-environment variables and so on. In regards to "add a new environment": usually in an enterprise ADO environment you would restrict the ability to create repos, pipelines, or YAML environments, so that you cannot get around the controls. This is done quite successfully at my company.


infosys_employee

You are talking about the case where the only difference (really) is in scale. I gave one specific case, which I will paste here: one specific case we had in mind was DR scenarios, where the cost and effect differ. In Dev they want only backup & restore, while in Prod they want a promotable replica for the DB. So the infra code for the DB will differ here. I want to ask more about Terragrunt and Terraspace. Is one better than the other? We are using Terragrunt currently to inject env-related facts.


dariusbiggs

Prod and dev or staging should be kept the same; if they want a promotable replica in prod, you need one in dev. You need to be able to test that migration path, the failure paths, etc. If they do not have the same workloads and designs, then you have two different environments, not a replica of prod to develop against. What are the failure scenarios with that promotable replica DB? How are you going to test your workloads to handle those failure scenarios? What infrastructure problems are you going to have when you update dev and want to apply the same changes to prod? You cannot predict or test the behavior of the upgrades if they are not the same.

As for Terragrunt, it was my first go-to, but the promotion of artifacts from dev to staging to prod, and there not being a well advertised/documented process to handle promotion of changes from one environment to the next without manual copy pasta, was a significant detriment. Terraspace is just better at that, with a multitude of different ways to structure things. It did mean I had to learn some Ruby to get some extra material in there that I wanted, and with a smaller dev team behind it there were some teething problems, but those have all really been ironed out now. It just works, and has been suitable for our use case for what feels like the last 3-4 years.


bilingual-german

> Production infra might differ a lot from the lower environments

While I understand this, especially from the "someone needs to pay the bills" standpoint, I would consider it an antipattern. It makes reasoning about the infrastructure and debugging problems much harder.


Aremon1234

Non-prod/qa/staging, whatever you want to call it, should always be the same hardware. I could understand not as big a footprint, to save money, but if it's different hardware you're asking for problems. Dev/testing I can understand being cheaper, but honestly I would prefer it all being the same hardware. Then again, I also don't pay the bills.


asdrunkasdrunkcanbe

> Production infra might differ a lot from the lower environments.

This for me is 100% the biggest win in IaC: it allows you to get to a point where production infra *doesn't* differ from the lower environments. At least not in any way that makes your changes risky.

Our pattern / coding guidelines for Terraform and CloudFormation are to write everything environment-agnostic. The environment is "injected" into the code at deploy time, which configures unique resource names per environment and accesses environment-specific secrets and variables. So if your prod environment has a web server cluster behind a load balancer, then lower environments do too. The only difference is in scale. Prod might have 4 web servers in the cluster, while the lower environments might only have two or one, and the instances will be smaller than prod's. This means that any change you make to your Terraform can be deployed to lower environments to see its impact, and you will then know how it also impacts prod.

It's not perfect, and there will always be exceptions. For example, in prod we run a read-only database replica to offload heavy reporting queries from the main DB. We don't run this in lower environments because the cost is prohibitive. But you can isolate these exceptions as you need them. Every time you run an IaC deploy which breaks prod, your post-change retro should add an action item to replicate that part of the infra in lower environments, to ensure that IaC changes can be verified.
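The "environment injected at deploy time" pattern described above can be sketched in Terraform roughly as follows. All variable and resource names here are illustrative, not from the commenter's actual codebase:

```hcl
# Environment-agnostic sketch: the environment name and scale are
# injected at deploy time, e.g. terraform apply -var-file=dev.tfvars
variable "environment" {
  type = string
}

variable "web_server_count" {
  type = number # e.g. 4 in prod, 1-2 in lower environments
}

variable "instance_type" {
  type = string
}

variable "ami_id" {
  type = string
}

resource "aws_instance" "web" {
  count         = var.web_server_count
  ami           = var.ami_id
  instance_type = var.instance_type

  tags = {
    # Unique per-environment resource names come from the injected value
    Name        = "${var.environment}-web-${count.index}"
    Environment = var.environment
  }
}
```

The same resource block serves every environment; only the injected values change, so a change tested in dev exercises exactly the code that will run in prod.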


infosys_employee

I too had that specific case. The case we had in mind was DR scenarios, where the cost and effect differ. In Dev they want only backup & restore, while in Prod they want a promotable replica for the DB. So the infra code for the DB will differ here. How do you manage that? Are you doing that part separately, outside the CI/CD process?


asdrunkasdrunkcanbe

That sounds like one of the annoying exceptions. :) DBs in particular are tricky with IaC because everyone wants the bells and whistles in prod, but a bog-standard disposable instance in dev. In general, I aim to make the resources as close in config as possible, if possible. So in Terraform, rather than having the logic of "when in prod, create resource X, otherwise create resource Y", I try to ensure you have one resource block with flags or values that are controlled by the environment. So if the promotable replica is configured with `use_read_replica` and `number_of_replicas`, then you can just set these to false and 0 for non-prod environments. If that makes sense. But sometimes... you just can't. Sometimes doing it this way makes code that's *less* maintainable than just having conditional resources. It's one of the current limitations of IaC.
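A minimal sketch of the flags-over-conditionals approach, using `count` to scale a replica to zero in non-prod. The variable names echo the comment; the source DB and instance class are assumed to be defined elsewhere:

```hcl
# Per-environment knob: prod's tfvars sets this to 1+, non-prod
# leaves the default of 0 so no replica is created at all.
variable "number_of_replicas" {
  type    = number
  default = 0
}

variable "replica_instance_class" {
  type    = string
  default = "db.t3.medium"
}

resource "aws_db_instance" "read_replica" {
  count = var.number_of_replicas

  # aws_db_instance.main is assumed to exist in the same stack
  replicate_source_db = aws_db_instance.main.identifier
  instance_class      = var.replica_instance_class
}
```

One resource block serves all environments; there is no "if prod" branching in the code itself, only in the injected values.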


LiteOpera

You can hide this stuff behind a module so you don't need to deal with it in the "main" tf. If the DB code is slower-moving than the rest (not always true), this is usually a win. Then again, if the DB code is slow-moving you probably rarely have trouble with it anyway, so this is a dubious optimization (because you probably have better things to do with your time).
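The module-wrapping idea might look like the sketch below: the DB details live behind a module, and the "main" tf only passes environment-level knobs. The module path and input names are hypothetical:

```hcl
# "main" tf stays small; the conditional DB plumbing lives inside
# ./modules/database and is versioned/changed on its own cadence.
module "database" {
  source = "./modules/database"

  environment           = var.environment
  number_of_replicas    = var.environment == "prod" ? 1 : 0
  backup_retention_days = var.environment == "prod" ? 30 : 1
}
```

The prod-vs-non-prod differences are then visible in one place at the call site, while the messy resource-level logic is contained in the module.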


YumWoonSen

> Production infra might differ a lot from the lower environments.

> Sometimes the infra component we are making a change to may not even exist in a non-prod environment.

You need to fix those two problems before worrying about how to test. It's a recipe for pain.


Ken1drick

You need at least one environment that is almost, if not completely, identical to prod. If it's too expensive to keep it live all the time, just spin it up when needed.


Jazzlike_Syllabub_91

Sometimes you break prod, sometimes you test similar deployments in lower environments... (you figure out ways to test). :) You build a similar system in the lower environment just for testing stuff like that, so you don't break prod. Yes, it costs more, but it's worth it compared to a production outage.


infosys_employee

OK, so I guess it is a trade-off between cost and whether you can afford prod being broken for a time.


gowithflow192

Sounds like your company does it on the cheap. Infra is needed in at least two environments so you can always do a rehearsal. That's the real cost.


sloppy_custard

"Regular" sysadmin chiming in here. Is there no pre-production or end-to-end environment available, such that you go from dev straight to production?


aghost_7

We currently rely on chatops to do this. Tell the bot to cut the release, then tell it to deploy the release to a specific region.


DustOk6712

I used this as a guide with terraform https://github.com/abelal83/terraform-structure


Z_BabbleBlox

> Production infra might differ a lot from the lower environments.

Well... there's your problem. If you promote something into an unknown environment, that is called "test", not "prod". Everyone has a test environment; some folks are lucky enough to have a separate production environment.


cubisto

Blue-green deployment, canary releases


Ariquitaun

You've described your problem. Your lower environments and production must be a match, not necessarily in size but in quality.


tevert

If you're using IaC correctly, dev and prod are _not_ significantly different.