Our SRE Journey — Part IV

Todd Palmer
5 min read · Jun 23, 2021

This is Part IV of many about the TrainingPeaks journey in SRE. If you haven’t read them yet, Part I is how we started, which continues in Part II and Part III.

[Image: open rusted lock, by Kerstin Riemer from Pixabay]

Data and Security Policies

Modernizing legacy systems tends to focus intensely inward: improving existing infrastructure and tooling to support shipping improvements and embracing change and risk. That changed when our team faced challenges driven entirely by external forces.

In January 2020, one of our executives received an email from a white hat researcher who believed they had found publicly exposed data. At this point we had no policies or procedures in place to deal with this type of incident.

We take all security-related inquiries and incidents seriously, prioritizing any critical work ahead of everything else. The first priority is to stop the bleeding, and this incident was all hands on deck. Once the patient was stable and the incident closed, we followed up with a deep investigation into our policies, systems, and workflows. Security can't be an afterthought; it should be part of your software lifecycle.

Security is always balanced against individual and team productivity. We determined that our organic growth, with its focus on local efficiency, had left us exposed in some legacy storage and workflows. We had neither sufficient guardrails to prevent these scenarios nor the education and policies to enable our teams to do their jobs effectively yet securely. In response, we ran a number of education campaigns and built processes to increase security awareness and integrate security into our software lifecycle. One of those guardrails is sketched below.
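To make "guardrail" concrete: one of the simplest automated checks is a periodic scan for storage that could be publicly exposed. Here is a minimal sketch of that idea, assuming AWS S3 and the boto3 library; it illustrates the pattern and is not our actual tooling.

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def find_buckets_missing_public_access_block():
    """Flag buckets that don't block every form of public access."""
    flagged = []
    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        try:
            config = s3.get_public_access_block(Bucket=name)[
                "PublicAccessBlockConfiguration"
            ]
            # A bucket is only safe if every public-access toggle is on.
            if not all(config.values()):
                flagged.append(name)
        except ClientError as err:
            if err.response["Error"]["Code"] == "NoSuchPublicAccessBlockConfiguration":
                flagged.append(name)  # no public-access block configured at all
            else:
                raise

    return flagged

if __name__ == "__main__":
    for name in find_buckets_missing_public_access_block():
        print(f"REVIEW: s3://{name} allows potential public access")
```

A job like this can run on a schedule and page or file a ticket when it finds anything, turning a one-time audit into a standing guardrail.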

[Image: bridge piers, by Wonita & Troy Janzen from Pixabay]

Building Capacity

If you have been around software for a while, you have most likely encountered a very popular misconception: that your team should be shooting for 100% efficiency. Lean theory shows the opposite, with wait times approaching infinity as efficiency approaches 100%, and the high degree of variability in software work compounds the effect. Instead of efficiency, your team should be working toward effectiveness. To be effective, your efficiency should be no more than 80%, leaving at least 20% slack in your system. Unexpected incidents (are there any other kind?) add to the problem. We quickly learned that our ability to respond is directly correlated with spare capacity, capacity our team did not have but needed to build.
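The math behind that claim comes from queueing theory. A minimal sketch using the M/M/1 model (a simplifying assumption; real teams are messier) shows how quickly wait times blow up as utilization climbs:

```python
# Back-of-envelope queueing math (M/M/1): average queue wait grows as
# rho / (1 - rho), so it approaches infinity as utilization nears 100%.
def expected_wait(utilization: float, service_time: float = 1.0) -> float:
    """Mean time a task waits in queue, in units of the service time."""
    return service_time * utilization / (1.0 - utilization)

for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"{rho:.0%} busy -> avg wait = {expected_wait(rho):5.1f}x service time")
```

At 50% busy, work waits about one service time; at 80%, four; at 99%, ninety-nine. That is why the last 20% of "efficiency" is so expensive.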

When your SRE team is doing well, it has a null-dataset problem: when things are going well, there are fewer incidents, and in the absence of major events a dedicated team looks expensive. Safety research tells a different story: a long stretch without significant events is highly correlated with the next incident being a major one. To build capacity there are only a few levers you can pull: reduce scope, reduce quality, or add resources. Our team was focused on reducing the work and impact of incidents through automation, but our system was growing in both usage and complexity, and we had tackled our easier, smaller items first, leaving the larger projects. Basic calculations, sketched below, told us we needed to add resources, so we started the process of adding another SRE.
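Those "basic calculations" were nothing fancier than comparing the hours demanded of the team against the hours available at a sane efficiency. A sketch with made-up numbers (illustrative placeholders, not our actual figures):

```python
# Hypothetical back-of-envelope capacity check. Every number here is an
# illustrative placeholder, not TrainingPeaks' actual data.
team_size = 3
raw_hours = 40 * team_size               # engineer-hours per week
target_efficiency = 0.80                 # leave ~20% slack (see above)
interrupt_load = 45                      # on-call, incidents, interruptions
project_demand = 70                      # planned reliability work

available = raw_hours * target_efficiency
demand = interrupt_load + project_demand
print(f"available: {available:.0f}h/week, demanded: {demand}h/week")
if demand > available:
    print("over capacity -> reduce scope, reduce quality, or add people")
```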

Hiring to go from three engineers to four was a bigger and more time-consuming effort than we anticipated, and it cut into our team's capacity in the short term. The DevOps/SRE space is also a minefield of extremely varied skill sets and titles. Our journey to add to our team took six months.

[Image: Rosie the Riveter, by Here and now, unfortunately, ends my journey from Pixabay]

Fully Immutable and Repeatable Deploys

When we first moved to AWS we did a traditional lift and shift, spinning up EC2 servers and shoe-horning our existing on-prem processes into cloud servers. As we grew we were able to chip away at many pieces of this old infrastructure, but large components remained essentially the same as when they were first created. Those components burdened deploys, patching, and general maintenance, and became a bottleneck for growth.

We made a plan to move everything over to immutable infrastructure, which promised a few benefits, namely simpler deployments and easier maintenance. Instead of rotating servers out of a load balancer, deploying new code onto them, and rotating them back in, we could simply spin up a server that pulls the version of code we want and add it to the load balancer. Instead of scheduling maintenance windows to patch servers, or patching during a deployment, we could run regular jobs to build new machine images that were fully updated and patched. We built the process so that each step could be run independently and idempotently; a sketch of the deploy step follows.
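In code, the deploy step reduces to "launch from a pre-baked image, pin a code version, rotate into the load balancer." A minimal sketch with boto3; the AMI, bootstrap script path, and instance type are placeholders, not our production setup:

```python
import boto3

ec2 = boto3.client("ec2")
elbv2 = boto3.client("elbv2")

def deploy(ami_id: str, code_version: str, target_group_arn: str) -> str:
    """Launch a fresh instance from a patched AMI, pin it to one code
    version via user data, then rotate it into the load balancer."""
    user_data = f"#!/bin/bash\n/opt/app/bootstrap.sh --version {code_version}\n"
    instance = ec2.run_instances(
        ImageId=ami_id,               # pre-baked, fully patched image
        InstanceType="t3.medium",     # placeholder size
        MinCount=1,
        MaxCount=1,
        UserData=user_data,
    )["Instances"][0]
    instance_id = instance["InstanceId"]

    # Wait until the instance is running; the target group's health
    # checks then gate when it actually starts taking traffic.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    elbv2.register_targets(
        TargetGroupArn=target_group_arn,
        Targets=[{"Id": instance_id}],
    )
    return instance_id
```

Retiring the old instances with deregister_targets and terminate_instances is the mirror image, which is what lets each step be retried or rolled back on its own.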

As with all the changes we attempt, we started small, working on one small portion of our background workers as a proof of concept. This gave us a safe, focused target for iteration and learning: we could respond quickly to issues as they arose and ensure no significant impact on developer velocity. Once we had proven the concept with the first batch of background workers, we moved on to the next set of workers and continued all the way through our web-facing tiers. We completed the project with minimal to no customer impact.

This work enabled our ultimate goal of moving to a more continuous delivery model: away from weekly deployments and toward developers owning their code from inception to production.

As we made these changes, we implemented as much as possible with infrastructure-as-code, giving us the freedom and power to keep iterating on our own processes. We have since added more features, including auto-scaling groups, target group rotations, and even tests for our own deployment code (an example of which is sketched below). We now have a solid foundation that enables continued growth and scalability of our product and our operations alike.
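What does a test for deployment code look like? Keeping the logic in pure functions makes it testable without touching AWS at all. A hypothetical example with pytest (render_user_data is an invented helper echoing the deploy sketch above, not our actual code):

```python
import pytest

# Hypothetical helper mirroring the deploy sketch above.
def render_user_data(code_version: str) -> str:
    """Render the boot script that pins an instance to one code version."""
    if not code_version.strip():
        raise ValueError("code_version must be non-empty")
    return f"#!/bin/bash\n/opt/app/bootstrap.sh --version {code_version}\n"

def test_user_data_pins_requested_version():
    assert "--version 1.42.0" in render_user_data("1.42.0")

def test_blank_version_is_rejected():
    with pytest.raises(ValueError):
        render_user_data("   ")
```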

[Photo by Mika Baumeister on Unsplash]

Next: Boring, Enabling Work

Give us a follow and read our final article in this series, about how we focus on boring, enabling work.

