Ensuring stability and scaling grocery delivery services during the COVID-19 pandemic
Monitoring at scale with Cortex
Careful monitoring is one of the key factors in keeping our microservices at REWE digital stable. For the majority of our backend services, we use the open source software Prometheus to do so. Prometheus is super easy to operate until you want to run it at scale. Over the last few years REWE digital grew from 30 employees to more than 600, and the number of microservices grew along with it.
Besides the continuous growth, we’ve seen a massive increase in demand for our food pickup and delivery services during the COVID-19 pandemic. Thanks to the adoption of Cortex last year, our monitoring team has been able to keep our monitoring system stable at a rapidly growing scale. Let’s have a quick look at the background story!
Point of no return
With great growth comes great responsibility. We answered the organizational challenges that came with this growth by introducing the Spotify tribe model. This brought us four different tribes: ECOM, FULFILMENT, CONTENT and PLATFORM. Each of these tribes has its own platform, maintained by a dedicated platform team.
Four separate platforms also sometimes meant four different solutions for the same infrastructure components (e.g. Kafka, Prometheus, Grafana). To relieve these platform teams, we strove for a more SaaS-like approach in which one team solves a challenge, such as monitoring, for all platforms.
With our rapid growth, we realized that we needed a better solution for our Prometheus setup. Each of our tribes had its own Prometheus pair, and even though we were already running them on beefy machines, we still saw out-of-memory kills here and there. Queries spanning several days caused excessive memory spikes, which sometimes got our Prometheus instances killed. In 2018 we decided to tackle these challenges organization-wide.
Our tribe solutions had many differences, but also similarities: all platforms were already using Prometheus v2. With that as our starting point, we considered VictoriaMetrics, M3 and Thanos for our requirements.
And then there was Cortex, which had just been accepted as a CNCF project. It was started by Tom Wilkie and Julius Volz in June 2016 and is actively developed by several developers from different companies. A perfect fit as a horizontally scalable monitoring tool? Spoiler alert: sometimes we had to dive deep into the code, as there was little documentation at the time. The project has matured since then, and Cortex is now on its way to the Incubation stage.
What really excited us was Cortex’s multi-tenant support, including the protection mechanisms that limit each tenant’s usage so that a single tenant can’t degrade performance for the others. Our team, Quokka, provides Cortex (among other things) for everyone else in the company. That sounds a lot like a Software as a Service (SaaS) approach, right?
So how did we do it? We started adopting Cortex at a very early stage, well before it had semantically versioned releases. The implementation began with what was then our smallest tribe, which has since been merged into other tribes: the (coincidentally named) Big Data Tribe.
We configured Prometheus to write metrics to Cortex and pointed our Grafana dashboards at Cortex for queries. In case of any failures or errors, we could simply switch back to Prometheus, so we could always mitigate problems quickly if something went wrong during the migration. So far, so good.
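The Prometheus side of such a setup boils down to a remote_write entry pointing at Cortex. Here is a minimal sketch in Python that renders one as JSON (which is also valid YAML). The URL and the tenant name are hypothetical. The post doesn't say how the tenant ID was attached, so sending the X-Scope-OrgID header (the header Cortex reads the tenant from) through Prometheus's remote_write headers option is an assumption; an authenticating proxy in front of Cortex is another common choice.

```python
import json


def remote_write_config(cortex_url: str, tenant_id: str) -> dict:
    """Build the remote_write section of a Prometheus config for one tenant."""
    return {
        "remote_write": [
            {
                # Hypothetical Cortex push endpoint; adjust to your deployment.
                "url": cortex_url,
                # Assumption: Prometheus sets the tenant header itself.
                # Cortex identifies the tenant via X-Scope-OrgID.
                "headers": {"X-Scope-OrgID": tenant_id},
            }
        ]
    }


config = remote_write_config(
    "https://cortex.example.internal/api/v1/push",  # hypothetical URL
    "bigdata",  # hypothetical tenant ID
)
print(json.dumps(config, indent=2))
```

Because Prometheus keeps scraping and storing locally while it remote-writes, pointing Grafana back at Prometheus remains a valid fallback, which is what made the switch-back described above possible.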
The logical next step after migrating to Cortex was to offer Grafana as an internal SaaS product as well. We decided to run a single, organization-wide Grafana instance and use Grafana’s organizations feature to offer multi-tenancy.
We’re now onboarding the rest of our teams to our new SaaS Grafana and Cortex, and exploring Grafana Cloud Agent for the tribes that aren’t using Prometheus Alertmanager. With the agent we hope to reduce the Prometheus resource footprint, as it contains only the Prometheus functionality we actually need to send metrics to Cortex as our remote-write backend.
One other thing on our list is switching the storage engine used by Cortex. Currently we use a BigTable cluster, and we would like to migrate to a new storage engine based on Google Cloud Storage (GCS) to tackle some of the latency peaks for large queries. While we can scale BigTable’s read and write capacity within a few minutes, GCS scales on demand (per query), which is a great advantage for queries over long time ranges. Because GCS scales on demand, chances are good that this architecture will also be cheaper than BigTable.
Other main advantages with our new setup are:
+Providing monitoring services to small Kubernetes clusters is as easy as adding a Prometheus pair with a remote write config (no more NGINX and DNS setups for a separate Grafana instance needed)
+Higher retention with Cortex (from seven days to 60 days)
+Much more stable and horizontally scalable system
+No more gaps in our Grafana dashboards/graphs
+Being able to aggregate metrics across two clusters using the same tenant ID
+Super fast, preloaded dashboards thanks to all queries being cached
+Far fewer OOM kills, as read queries target the Cortex cluster and the write load is consistent
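To illustrate the cross-cluster aggregation point: when two clusters write under the same tenant ID, a single PromQL query sent for that tenant sees the series from both. The sketch below only builds such a request and makes several assumptions: the base URL, the tenant name, the metric and the cluster label are hypothetical, and /prometheus/api/v1/query is Cortex's default Prometheus-compatible query path, which a deployment may serve under a different prefix.

```python
from urllib.parse import urlencode


def build_query(base_url: str, tenant_id: str, promql: str) -> tuple:
    """Return (url, headers) for an instant PromQL query against Cortex."""
    # The query expression is passed URL-encoded in the `query` parameter.
    url = f"{base_url}/prometheus/api/v1/query?{urlencode({'query': promql})}"
    # Cortex selects the tenant from the X-Scope-OrgID header.
    headers = {"X-Scope-OrgID": tenant_id}
    return url, headers


# Hypothetical metric and label: both clusters are assumed to attach a
# `cluster` external label so their series stay distinguishable.
promql = "sum by (cluster) (rate(http_requests_total[5m]))"
url, headers = build_query("https://cortex.example.internal", "ecom", promql)
print(url)
print(headers)
```

The pair can be passed to urllib.request.Request(url, headers=headers) to actually send the query. In day-to-day use Grafana does the same thing through its data source settings.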
Furthermore, the fantastic support from the community in debugging some problems, which usually turned out to be misconfigurations, led to a smooth start, and everything worked out for us. We’ll therefore continue migrating all of our tribes to Cortex and the new Grafana setup with a good feeling.
Thanks to Cortex’s horizontal scaling, we were able to support the increased demand for REWE’s grocery and food delivery services with a robust solution for monitoring all backend services. During the COVID-19 pandemic, far more metrics were ingested, and we noticed a significant increase in PromQL queries from developers looking at their Grafana dashboards more often. Keeping up with that additional load was as easy as scaling up the number of replicas in our Cortex deployment.
We’re also hopeful that the positive results will lead to further adoption throughout the REWE Group. But you know what the biggest plus is for us?
We now have a team at REWE digital that can offer monitoring as an internal service. And you know what’s even better? You can be part of it! Spend time with us to learn how to run Cortex and become an expert within the company, because... we’re hiring!