Evaluating Predictive Autoscaling in Kubernetes

Over the past 6 months I’ve developed the Custom Pod Autoscaler Framework (CPA), an open source framework for building autoscalers in Kubernetes, similar to the Horizontal Pod Autoscaler. As part of this I built the Predictive Horizontal Pod Autoscaler (PHPA), currently in pre-release, which provides predictive autoscaling functionality on top of the Horizontal Pod Autoscaler using statistical models.

At the time of writing this Predictive Autoscaler supports two statistical models: ‘Linear Regression’ and ‘Holt-Winters’. The ‘Holt-Winters’ model can make seasonal predictions; for example, if there is generally a higher load between 9:00 and 17:00, or a spike at 13:00, the model can learn this pattern by observation and then predict it in future - allowing better, proactive responses rather than purely reactive ones.
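To make the idea concrete, here is a minimal sketch of an additive seasonal Holt-Winters forecast using statsmodels; this is purely illustrative and not the PHPA’s actual implementation - the synthetic load series, hourly sampling, 24 hour season, and smoothing values are all assumptions made for the example (a recent statsmodels release is assumed for the keyword argument names).

```python
# Illustrative only: an additive Holt-Winters seasonal forecast with statsmodels.
# The data and parameters are made up; this is not the PHPA's implementation.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic hourly "load" for 3 days: higher between 09:00 and 17:00 each day.
hours = np.arange(24 * 3)
load = np.where((hours % 24 >= 9) & (hours % 24 < 17), 80.0, 30.0)
load += np.random.normal(0, 5, hours.size)

# Additive trend and seasonality, with a 24 hour (24 sample) season.
model = ExponentialSmoothing(load, trend="add", seasonal="add", seasonal_periods=24)
fit = model.fit(smoothing_level=0.1, smoothing_trend=0.1, smoothing_seasonal=0.9)

# Forecast the next 24 hours; the 09:00-17:00 bump reappears in the forecast,
# which is what lets a predictive autoscaler scale up before the load arrives.
print(fit.forecast(24).round(1))
```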

I wanted to test this out and gather some data to see whether it would actually be useful, and whether it could reduce latency at peak times and in the lead-up to them.

Experiment Overview

I will outline the experiment here; you can see the full experiment, and run it yourself if you want, in the Custom Pod Autoscaler experiments repository.

This experiment will directly compare the Kubernetes Horizontal Pod Autoscaler (HPA) and the Predictive Horizontal Pod Autoscaler (PHPA).

This experiment will run for 3 days, and is designed to have the PHPA and HPA running in their own clusters. Everything else will be kept the same between the two autoscalers: they will both manage the same application, and the load will be driven by the same load testing logic.

Each test will have three elements: the autoscaler, an application to manage, and the load testing application. The autoscaler will be the only part that changes. The application will be a simple example web server that responds OK! to a GET request at path /; it is the k8s.gcr.io/hpa-example image used in the Kubernetes autoscaling walkthroughs. The load testing application will be a Python script that invokes Locust load testing at set intervals, varying the load applied based on the time of day. The load tester will also periodically record how many replicas the deployment has.
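As a rough sketch of the replica-recording part of that script (the experiments repository has the real thing), assuming the official kubernetes Python client and a deployment named hpa-example in the default namespace - both assumptions made for illustration:

```python
# Sketch: periodically record the managed deployment's replica count.
# Assumes the official kubernetes Python client; names are placeholders.
import time
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
apps = client.AppsV1Api()

while True:
    deployment = apps.read_namespaced_deployment(name="hpa-example", namespace="default")
    replicas = deployment.status.replicas or 0
    print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')},{replicas}", flush=True)
    time.sleep(15)
```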

See these diagrams for overviews of the test:

Predictive Horizontal Pod Autoscaler Experiment

K8s Horizontal Pod Autoscaler Experiment

The load applied will be the same for each autoscaler (a sketch of this schedule in code follows the list):

  • High load (40 users) will be applied between 15:00 and 17:00.
  • Medium load (25 users) will be applied between 9:00 and 12:00.
  • Low load (15 users) will be applied for all other times.
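Here is a minimal sketch of the load scheduling logic, assuming a headless Locust run against the example service; the Locust invocation, host address, and run duration are assumptions - the real script lives in the experiments repository.

```python
# Sketch: pick a Locust user count from the time of day and run a headless burst.
# The user counts and hours match the experiment; everything else is assumed.
import subprocess
from datetime import datetime

def users_for_hour(hour: int) -> int:
    if 15 <= hour < 17:
        return 40  # high load
    if 9 <= hour < 12:
        return 25  # medium load
    return 15      # low load

users = users_for_hour(datetime.now().hour)
subprocess.run([
    "locust", "--headless",
    "--users", str(users), "--spawn-rate", str(users),
    "--run-time", "5m",
    "--host", "http://hpa-example.default.svc.cluster.local",
    "-f", "locustfile.py",
])
```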

The Horizontal Pod Autoscaler will be configured with the following options:

  • Minimum replicas: 1.
  • Maximum replicas: 20.
  • Sync Period: 15s (default).
  • Downscale Stabilization: 5m (default).
  • Tolerance: 0.1 (default).
  • CPU Initialization Period: 5m (default).
  • Initial Readiness Delay: 30s (default).
  • Metrics: Resource metric targeting CPU usage, with a target average utilization of 50%.

The Predictive Horizontal Pod Autoscaler will have the same settings as the Horizontal Pod Autoscaler:

  • Minimum replicas: 1.
  • Maximum replicas: 20.
  • Sync Period: 15 (equivalent to 15s) (default).
  • Downscale Stabilization: 300 (equivalent to 5m) (default).
  • Tolerance: 0.1 (default).
  • CPU Initialization Period: 300 (equivalent to 5m) (default).
  • Initial Readiness Delay: 30 (equivalent to 30s) (default).
  • Metrics: Resource metric targeting CPU usage, with a target average utilization of 50%.

The Predictive Horizontal Pod Autoscaler will also have the following configuration settings for tuning the Holt-Winters algorithm:

  • Model: Holt-Winters
    • Per Interval: 1 (Run every interval)
    • Alpha: 0.1
    • Beta: 0.1
    • Gamma: 0.9
    • Season Length: 5760 (24 hours in 15 second intervals)
    • Stored Seasons: 4 (store the last 4 days of data)
    • Method: additive
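For reference, the season length follows directly from sampling a 24 hour season at the 15 second sync period; a quick sanity check (the variable names are just for illustration):

```python
# Quick check of the Holt-Winters tuning numbers (names are illustrative only).
sync_period_seconds = 15
season_seconds = 24 * 60 * 60                    # one season = one day
season_length = season_seconds // sync_period_seconds
print(season_length)                             # 5760 intervals per season

stored_seasons = 4
print(season_length * stored_seasons)            # 23040 data points kept (4 days of history)
```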

Hypothesis

The Predictive Horizontal Pod Autoscaler using the Holt-Winters prediction method will pre-emptively scale, reacting earlier than the standard Kubernetes Horizontal Pod Autoscaler. This will manifest as scaling up earlier and reaching higher replica counts sooner, with the result of lower average and maximum latency and fewer failed requests - primarily around the transition from lower load levels to high load. This effect will only be apparent after at least one full season (24 hours); for the first season, as the predictor won’t yet have the data to make a prediction, performance will be largely the same as the standard Kubernetes Horizontal Pod Autoscaler.

Results

Replica comparison

You can see from the graph above when the predictive model of the PHPA kicks in: from the second day onwards its replica counts diverge from the HPA’s. The PHPA also shows more erratic rescaling, which I will discuss later.

Average latency comparison

Looking at the average latency comparison, the results seem to confirm my hypothesis: the HPA consistently sees large average latency spikes when transitioning from low to high load, while for the PHPA these spikes are reduced as the experiment progresses.

Average latency comparison day 1

Taking a detailed look at the first day’s comparison, the average latency appears much the same, with both the PHPA and HPA showing a prominent spike in average latency when transitioning from low to high load. This is before the predictive model element kicks in, as it only operates once a full season (24 hours) of data is available.

Average latency comparison day 2

The second day’s average latency comparison reveals that once the predictive model comes into effect, this spike in average latency is reduced for the PHPA, showing how the predictive autoscaler can proactively scale to meet predicted demand.

Average latency comparison day 3

The third day’s average latency comparison follows the same lines as the second day’s, again showing a reduction in the average latency spike for the PHPA.

Maximum latency comparison

Looking now at the maximum latency comparison gives more mixed results: the first two days behave as expected and hypothesised, but the third day exposes some erratic and unwanted behaviour.

Maximum latency comparison day 1

The first day follows the hypothesis, with both the HPA and PHPA performing similarly and showing a spike in maximum latency on the transition from low to high load.

Maximum latency comparison day 2

The second day then shows that when the predictive model comes into effect there is a reduction in maximum latency spikes.

Maximum latency comparison day 3

The unwanted behaviour appears in day 3, with massive spikes in maximum latency throughout the day. I believe this is caused by the erratic fluctuations in replica count visible on the third day; this thrashing is most likely the result of poor tuning of the Holt-Winters model on my part. With better tuning and a more data-driven approach, I believe this issue could be fixed.

Conclusion

For seasonal, repeating loads the PHPA outperforms the HPA, reducing the average latency spikes caused by increased load. The PHPA can be a valuable tool for proactive autoscaling, and if applied to regular, predictable and repeating user loads it can provide a more effective autoscaling solution than the standard Kubernetes HPA. However, the key to effective use of the PHPA is that it needs to be data driven, and as such it requires tuning to be effective and useful. The PHPA should be applied in specific circumstances where it makes sense; the decision to apply it should be driven by an understanding of the system it is being applied to, and backed with data to allow for better tuning and a more useful autoscaling solution. Otherwise, unwanted results such as erratic scaling behaviour may arise from poor tuning decisions.

The PHPA should not be treated as a silver bullet that can be applied anywhere, but with careful planning, understanding of the system, and tuning, I think it could be a very useful tool. The maximum latency spikes shown in this experiment highlight some of the issues that poor tuning can cause.

If you want to try out the Predictive Horizontal Pod Autoscaler yourself, check out the GitHub repo here or the wiki here.

Written on March 27, 2020