Our SRE Journey — Part III

Todd Palmer
6 min read · Apr 29, 2021

This is Part III of many about the TrainingPeaks journey in SRE. If you haven't read them yet, Part I covers how we started, and the story continues in Part II.

Chess pieces
Photo by Randy Fath on Unsplash

Best Laid Plans

In preparing for battle I have always found that plans are useless, but planning is indispensable.

— Dwight D. Eisenhower

It’s still 2019 and the team is new. We’ve spent some time learning as a team, coming up with a plan, and working on a few problems to help inform where to go next. It turns out we had a few more challenges ahead, some of them quite unexpected.

bridge building
Image by Erich Westendarp from Pixabay

Critical System Upgrades

Aging infrastructure is not unusual or unexpected for most companies. There is always that service, instance, or solution that works well enough to keep getting pushed down the priority list until it reaches a critical point. Software and infrastructure are more like gardening than building bridges: a system built in the last five years has a much shorter life expectancy than one built in the ten years before. Sometimes these systems become limiters or blockers to accomplishing a goal, and sometimes they become performance problems. We won the lottery and got to deal with both! Our top three:

First up was our Elasticsearch cluster. Hand rolled and spun up for a single feature in 2015, it had accumulated new features and functionality along with massive data growth over the years. It was undersized, performing poorly, and five major versions out of date, far enough behind that it could not be upgraded in place and required upgrades of incompatible access libraries due to protocol changes. A months-long, multi-disciplinary, cross-team effort migrated the data using dual writes and service refactoring. As we worked through each service, we made small incremental optimizations that reduced complexity and load on other database infrastructure. The new infrastructure was built as Infrastructure as Code (IaC) with automation for scaling and maintenance. Logging, monitoring, alerting, and access tooling were added or improved. The result was a new cluster that cut response times in half and was more reliable, cheaper to operate, and easier to maintain.
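To make the dual-write step concrete, here is a minimal Python sketch of the pattern, not our actual migration code; the client objects, index name, and class names are placeholders:

```python
# A minimal sketch of dual writes during a search-cluster migration.
# The client objects are assumed to expose an Elasticsearch-style
# index(index=..., id=..., body=...) call; everything here is illustrative.
import logging

log = logging.getLogger("search.migration")


class DualWriter:
    """Write to the legacy cluster first, then best-effort to the new one.

    Reads stay on the legacy cluster until the new cluster is backfilled
    and verified; failures on the new cluster are logged, not raised,
    and repaired later by a backfill job.
    """

    def __init__(self, legacy_client, new_client, index: str):
        self.legacy = legacy_client
        self.new = new_client
        self.index = index

    def index_document(self, doc_id: str, body: dict) -> None:
        # The legacy write is still the source of truth.
        self.legacy.index(index=self.index, id=doc_id, body=body)
        try:
            self.new.index(index=self.index, id=doc_id, body=body)
        except Exception:
            # Missed writes are reconciled by the backfill, so don't fail the request.
            log.exception("dual write to new cluster failed for %s", doc_id)
```

Once the backfill catches up and spot checks pass, reads flip to the new cluster and the legacy write becomes the one you eventually drop.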

Our main database server also received a makeover. We upgraded it two major versions and completed an extensive series of database performance projects, driven by the engineering teams, to tune and eliminate queries, remove data, and modify schema for efficiency. Once completed, database response times plummeted, and blocking and deadlocks were virtually eliminated. This knocked the biggest risk off our top-five list and gave us breathing room to execute other critical work. We also learned some painful lessons, including the need to schedule regular maintenance windows and downtime, and the need for a larger effort to analyze and prioritize a longer-term solution.
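As one illustration of the visibility this work required, here is a minimal sketch, assuming a SQL Server backend and the pyodbc driver (an assumption for the example, not a statement about our stack), of surfacing currently blocked requests:

```python
# Illustrative only: list requests currently blocked by another session,
# assuming SQL Server dynamic management views and the pyodbc driver.
import pyodbc

BLOCKING_QUERY = """
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;
"""


def report_blocking(conn_str: str) -> None:
    """Print any requests currently blocked by another session."""
    with pyodbc.connect(conn_str) as conn:
        for row in conn.cursor().execute(BLOCKING_QUERY):
            print(f"session {row.session_id} blocked by {row.blocking_session_id} "
                  f"({row.wait_type}, {row.wait_time} ms): {(row.sql_text or '')[:120]}")
```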

We rolled the knowledge from our Elasticsearch project into a complete overhaul of a separate ELK cluster used for logging, monitoring, reporting, and alerting. Internal tooling upgrades have different requirements and a different customer: your own team. That adds work in the form of internal advocacy, education, and documentation. Internal tooling projects also require a different approach than your typical business decision, since returns are more difficult to quantify and end-customer outcomes are not visible. These are projects that help your organization deliver to your customers even without direct, constant daily use. They pay back immediately when something doesn’t go quite as expected: letting teams assess, diagnose, and fix problems in a fraction of the time massively improves your mean time to recovery (MTTR).
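A small example of what improved logging can look like in practice: a minimal sketch of emitting structured JSON logs that an ELK-style pipeline can index without extra parsing. The service name is a placeholder and this is illustrative, not our exact setup.

```python
# Emit one JSON object per log line so Logstash/Elasticsearch can index
# fields directly instead of grokking free-form text. Illustrative sketch.
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "service": "example-service",  # hypothetical service name
        })


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("workouts").info("sync completed")
```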

Gold coins
Image by aleksandra85foto from Pixabay

AWS Costs

All businesses seek to manage costs and expenses with an eye towards profitability and growth. For cloud-based software companies, your provider bill is typically your second-largest cost behind your people, and it is easy to be surprised by large, unexpected bills. Getting a handle on our provider costs was our next initiative, and we approached it like all the others: measure, analyze, decide, act, and repeat.
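As a sketch of the "measure" step, something like the following pulls cost per AWS service from Cost Explorer with boto3; the dates and the top-10 report are illustrative, not our actual tooling:

```python
# Illustrative sketch: month-over-month unblended cost per AWS service,
# pulled from Cost Explorer via boto3.
import boto3


def monthly_cost_by_service(start: str, end: str) -> dict:
    """Return {service_name: unblended_cost} for the period (dates as YYYY-MM-DD)."""
    ce = boto3.client("ce")
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    costs: dict = {}
    for result in resp["ResultsByTime"]:
        for group in result["Groups"]:
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            costs[group["Keys"][0]] = costs.get(group["Keys"][0], 0.0) + amount
    return costs


if __name__ == "__main__":
    report = monthly_cost_by_service("2019-09-01", "2019-10-01")
    for service, cost in sorted(report.items(), key=lambda kv: -kv[1])[:10]:
        print(f"{service:50s} ${cost:,.2f}")
```

Even a simple report like this, run regularly, turns the annual budgeting surprise into a weekly conversation.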

Our provider costs had drifted upward and been normalized as increasing linearly with our traffic and number of customers. Balancing costs against scalable, reliable infrastructure is a constant concern. Budgeting and forecasting are mostly seen as a chore, increasingly so when done annually or semi-annually, and that cadence is also a very slow feedback loop. If workloads are not consistent and predictable, the result is reactionary, disruptive cost-cutting projects whenever concerns are raised.

An integrated approach, with awareness and practices to continuously manage and optimize costs across the organization, was our aspirational goal. In practice, we knew reaching it would be harder, take longer, and involve multiple teams. We set about implementing small, focused projects for immediate improvements. Additionally, we planned larger projects, requiring greater engineering resources, that we could prioritize. Finally, we began making operational changes and investing in education and enablement to drive the longer-term goals.

Focusing on areas where costs were growing rapidly, such as our storage usage, let us make an immediate impact on our bill. As we evaluated storage we also had to evaluate our disaster recovery plans, focusing on failure points and mean time to recovery (MTTR) for all cases across a large variety of infrastructure, software, and services. This effort let us set up automated policies that better reflected those goals and had a large effect on managing costs. We are determined to regularly review those decisions and policies as we grow and our needs change.
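For illustration, an automated storage policy of the kind we mean might look like the sketch below: an S3 lifecycle rule that tiers and then expires objects. The bucket name, prefix, and retention windows here are hypothetical, not our actual settings.

```python
# Illustrative sketch: apply an S3 lifecycle rule that moves cold objects
# to cheaper storage classes and expires them past the recovery window.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-workout-archive",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
                # Keep nothing beyond the recovery window we actually need.
                "Expiration": {"Days": 730},
            }
        ]
    },
)
```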

Our move into AWS in 2014 was a lift and shift, and the landscape has changed dramatically since. Cloud infrastructure, together with engineering effort, lets you optimize costs by taking advantage of services tailored to your workload and usage. We targeted services with infrequent or high-volume/low-frequency workloads to reduce traditional overprovisioning and standby capacity. We prioritized, swarmed, and completed projects to move these to usage-based AWS services such as AWS Lambda and AWS Fargate. Each of these projects involved working with our engineering teams and could be the subject of its own blog post.
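As a rough sketch of the shape these migrations took, an infrequent job that once sat on an always-on instance becomes a small handler on a usage-based service; the event shape, queue trigger, and names below are hypothetical:

```python
# Illustrative sketch of a workload moved from an always-on poller to a
# usage-based AWS Lambda handler triggered by a queue.
import json


def handler(event, context):
    """Process a single queued export request instead of polling on an idle box."""
    # Assumes an SQS-style trigger; the payload fields are hypothetical.
    request = json.loads(event["Records"][0]["body"])
    # ... do the actual work here: fetch, transform, and store the export ...
    print(f"processed export {request.get('export_id', 'unknown')}")
    return {"status": "ok"}
```

The win isn't the handler itself; it's that you stop paying for the 23 hours a day the old instance sat idle.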

The net result of our efforts was a 40% reduction in our AWS bill, saving over $25K a month. The even bigger impact was the awareness around costs, how those costs grow with usage, and the fact that they were now growing at a vastly reduced rate.

School Bus Road
Image by StockSnap from Pixabay

Change is Hard: Getting People On the Bus

The concepts and ideas of DevOps and SRE are not new or even cutting edge; they have been around for over ten years in seminal books such as Continuous Delivery, The DevOps Handbook, The Phoenix Project, and The Unicorn Project, to name a few. Accelerate, and the real-world, detailed data analysis of the DORA projects, shows clear advantages for organizations that adopt these practices. Despite this, old ideas and mindsets are prevalent and, sadly, the norm for most software organizations. Proven techniques are met with skepticism, cynicism, and even hostility from all levels of the organization. Grassroots efforts that are not fully understood and supported at all levels of the organization are going to be rocky, and we were woefully unprepared to navigate these waters.

The hardest thing in the world is to change the minds of people who keep saying, ‘But we’ve always done it this way.’

— Admiral Grace Hopper

Progress required us to change how things were currently done, and change is hard. Even as a team we were skeptical of some of the recommendations and changes, but willing to disagree and commit. True change, as opposed to platitudes like a “DevOps Transformation,” requires continuous change, continuous investment, and a long-term commitment to improvement. We were committed to change. Some folks got right on the bus; that was easy. Some got on the bus despite their skepticism that we were headed in the right direction. Some never got on the bus.

Next: External Pressures

Pressure is what you feel when you don’t know what’s going on.

— Chuck Noll

Give us a follow and read Part IV to hear about our struggles to figure out what’s going on, our failures, and what we learned.


Todd Palmer

Husband, Father, Software Developer, Cyclist, White Gold Wielder