Our SRE Journey — Part II

Todd Palmer
5 min read · Apr 16, 2021

This is Part II of many about the TrainingPeaks journey in SRE. If you haven’t read it, start with Part I and find out how we started.

Terracotta Army
Photo by Kevin Jackson on Unsplash

Know The Enemy, And Know Yourself

“If you know the enemy and know yourself, you need not fear the result of a hundred battles. If you know yourself but not the enemy, for every victory gained you will also suffer a defeat. If you know neither the enemy nor yourself, you will succumb in every battle.”

— Sun Tzu, The Art of War

Our 2019 team plan was to tackle the limiters of our system by making continuous small improvements, with just enough planning and continuous prioritization as we learned. First up was getting useful feedback from our systems.

In case of emergency, break glass
Photo by John Cafazza on Unsplash

Actionable Alerts Only Please

Our systems had basic monitoring and alerting set up, and we had a lot of alerts. We started tracking incidents and analyzing what we found. We determined that the overwhelming majority of our alerts were not actionable. High noise, low signal. These alerts amounted to “something seems a little weird”, yet each one required an engineer to investigate. Most self-resolved and required no action from the engineers. They were highly disruptive with very little value. So we turned them off.

Any alert that was not a strong signal requiring an engineer to take action was either modified until it was or removed. This sounds simple. However, when you have a system you do not fully trust, and some of those alerts once provided clues to past systemic problems, there is a tendency to see them as a safety net against larger problems. We evaluated each alert based on impact and risk, and if other monitoring or alerting already covered the same failure, we removed the alert.
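As a rough sketch of that kind of triage (the fields and categories below are illustrative, not pulled from our actual alert inventory):

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    KEEP = "keep as-is"
    MODIFY = "rewrite so it fires only on real, actionable impact"
    REMOVE = "turn it off"


@dataclass
class Alert:
    name: str
    actionable: bool         # does a human need to do something when it fires?
    strong_signal: bool      # does firing reliably mean user-facing impact?
    covered_elsewhere: bool  # does another alert or monitor catch the same failure?


def triage(alert: Alert) -> Verdict:
    # Noise that other monitoring already covers: turn it off.
    if alert.covered_elsewhere and not alert.actionable:
        return Verdict.REMOVE
    # Weak or non-actionable signals: tighten them until they mean something.
    if not (alert.strong_signal and alert.actionable):
        return Verdict.MODIFY
    # Strong, actionable signal: the only kind of alert we want paging us.
    return Verdict.KEEP
```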

This project not only removed noise and interruptions, but also improved the quality of our alerting. The number of alerts went down, and when an alert did occur, we knew what to look at and how to respond. In the process we clawed back a little time to focus on larger problems.

Fire fighters equipment lined up and ready to go
Photo by Philipp Berg on Unsplash

Incident Response

Getting our alerts under control let us focus on the next aspect of those alerts: incident response. TrainingPeaks has traditionally had a very small engineering team, and production problems were left for engineers to “figure out” without a formal policy or plan. We handed you a parachute and nudged you out of the plane. Your path to success had to be cut with a machete through a jungle filled with poisonous snakes. There was no tracking, consistency, training, or continuity. As our system, traffic, and number of teams grew, this became a limiter to growth.

We set up guides, set expectations, created templates for reporting incidents, started training for on-call, and, hardest of all, documented everything we could. We jumped in and helped teams with incidents. We kept everything transparent, with public Slack channels for managing incidents and tracking alerts, and a weekly email that included current work and incidents.

Tracking let us see trends, prioritize work, and focus on the most problematic parts of the system. Eventually the number and severity of our incidents began to decrease.
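A minimal sketch of the kind of record keeping that makes those trends visible (the fields and helpers here are illustrative, not our actual tracking system):

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Incident:
    opened: date
    severity: str   # e.g. "sev1", "sev2", "sev3"
    component: str  # the part of the system that broke


def monthly_trend(incidents: list[Incident]) -> Counter:
    """Count incidents per (year, month, severity) to see whether things are improving."""
    return Counter((i.opened.year, i.opened.month, i.severity) for i in incidents)


def hot_spots(incidents: list[Incident], top: int = 5) -> list[tuple[str, int]]:
    """The components causing the most incidents -- where to focus next."""
    return Counter(i.component for i in incidents).most_common(top)
```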

Overall we made big strides, but this one never goes away. As the saying goes, if accidents were preventable, they wouldn’t call them accidents, would they? Our incident response will need to change and adapt to handle all our new and improved accidents.

small man yells at clouds
Photo by Jason Rosewell on Unsplash

Blameless Postmortem Culture

While alerts and incident response were a big part of getting a handle on our systems, one of the biggest challenges we had was less technical and much more cultural: becoming a learning organization. People, teams, and organizations learn from failure. We know this, but we still tend to share our decisions only when they have a positive outcome, and then attribute that outcome in hindsight to a key decision.

In addition, failures tend to be reported and reacted to with proximity to the sharp end — the people with direct contact with the work. In reality, understanding and learning from failures requires examining the larger system within which people work. This means also looking at the blunt end — the organization that both supports and constrains the activities at the sharp end.

If we don’t consider the whole, disruptive and expensive lessons get transformed into isolated hiccups. A blameless postmortem culture is one way to combat this problem: putting your failures on display and including uncomfortable details that go beyond a single individual or “root cause”.

This is hard. We did it anyway, first by writing and publishing our own postmortems. A written postmortem is the culmination of a reflective process, and it takes time. Sending that written document to the entire company, detailing your latest escapades down the unholy halls of “maybe this will work? It can’t get any worse LOL”, is another level indeed. The first one sent was a clencher for sure, and it was terrible. But it got better and maybe a bit easier. We championed this one from the inside out: doing them ourselves, creating a template based on what we learned, and helping other teams write their own. Getting started down this path was hard, and we are still working on it.

Nope We Ain’t Done, Not Even Close

I get knocked down
But I get up again
You’re never gonna keep me down

— Chumbawamba, “Tubthumping”

Unlike Chumbawamba, for SRE to work it can’t be a one-hit wonder. So give us a follow and read Part III, where our journey continues.


Todd Palmer

Husband, Father, Software Developer, Cyclist, White Gold Wielder