Our SRE Journey — Part I

Published in TrainingPeaks Product Development, Apr 13, 2021

Fits and Starts

It’s December 2018, and our system has just gone down hard two weekends in a row. We’ve survived so far through massive engineering effort and upsized hardware. Our systems and libraries are years out of date, some past end-of-life (EOL) support. Our traffic grows 40% each year starting in December, and the next few months are shaping up to be constant firefighting.

Until this point we’ve had a small engineering team and an even smaller operations team, both above capacity. Delivery is slowing down, and our AWS bill is going up. This is not an unusual story for a software company that has been around for 20 years. Executive approval is easier when the dumpster fires are raging. Never waste a good crisis. Born of inspiration and desperation, the TrainingPeaks SRE team was formed around two software engineers and an operations engineer.

Okay, Now What?

Well, titles of course! Okay, besides new titles, it was time to get to work. We’d like to say that it was “all better now”, but it wasn’t. Our journey was just beginning. The next nine months certainly involved a lot of firefighting, but it was time to move fast and unbreak things. So we set about learning and planning.

Hope is not a strategy
— Traditional SRE saying

First up, as a team we read the Google SRE Book, The DevOps Handbook, and the AWS Well-Architected Framework. For each chapter, each of us was required to bring at least one thing we had learned that we could implement that week to make our systems better, and we implemented them. Along the way we learned one big lesson:

We are not Google. We couldn’t just imitate them; we’d have to figure out how to make this work for our team and organization.

We had just started on our journey to measure, analyze, decide, act, and repeat, with the clear goal of making this a systematic culture shift across our engineering organization.

Our First Team Charter

Our team exists to improve the resiliency of our systems and teams, and to create opportunities for improvement based on these main pillars:

Operational Excellence

Share insights throughout the organization
Continuously improve the system
Improve feedback for faster decision making
Create opportunity for improvement to happen
Maximize system efficiency (costs, people, time)

Embracing Risk

Encourage and embrace change
Maximize throughput by advocating for agile / lean methodologies and failing fast and loud
Learn from experience
Failures are the best teaching tool

Automated and Automatic Processes

Reduce toil (manual work)
Infrastructure as code
Monitor everything
Actionable alerts
Automate everything
Safe to fail

Education and Advocacy

Incident management and readiness response
Continuously ready for production.
Production belongs to our customers.
Champions of infrastructure best practices. Visible and predictable.

Next Steps: From Theory to Practice

Knowledge without practice is useless. Practice without knowledge is dangerous.
— Confucius

For our team, the next steps were clear: apply our learning to our environment using our charter as a guide to inform our decisions, systematically break down the work, and use feedback from each step to improve our engineering organization.

Give us a follow and read Part II, where our journey continues and we’ll talk about our biggest challenges and some of those projects.

Have a passion for engineering or SRE?

We are always looking for amazing engineers and SREs to join our team at TrainingPeaks.