Application Performance Analysis with Keptn
In the dynamic world of DevOps and continuous delivery, keeping applications reliable and high-performing is a top priority.
Site reliability engineers (SREs) rely on Service Level Objectives (SLOs) to set the standards that the Service Level Indicators (SLIs) of an application must meet, like response time, error rate, or any other metric that might be relevant to the application.
The use of SLOs is not a new concept, but integrating them into an application comes with its own set of issues:
- Figuring out which SLIs and SLOs to use. Do you get the SLI values from one monitoring source or from multiple sources? This complexity makes it harder to use them effectively.
- Defining SLO priorities. Imagine a new version of a service that fixes a concurrency problem but slows down response time. This may be a valid trade-off, and the new version should not be rejected because of the increased response time, given that the error rate will decrease. Situations like these call for a grading logic where different priorities can be assigned to SLOs.
- Defining and storing SLOs. It's crucial to clearly define and store these goals in one central place, ideally as a declarative resource in a GitOps repository, where each change can easily be traced back.
In this article, we'll explore how Keptn tackles these challenges with its new Analysis feature. We will deploy a demo application onto a Kubernetes cluster to show how Keptn helps SREs gather and make sense of SLOs, making the whole process more straightforward and efficient.
The example application will provide some metrics by itself by serving them via its Prometheus endpoint, while other data will come from Dynatrace.
Defining data providers
Everything in Keptn is configured via Kubernetes Custom Resources. We notify Keptn about our monitoring data sources by adding two KeptnMetricsProvider resources to our Kubernetes cluster - one for our Prometheus instance, the other one for our Dynatrace tenant.
apiVersion: metrics.keptn.sh/v1alpha3
kind: KeptnMetricsProvider
metadata:
  name: my-prometheus-provider
  namespace: simple-go
spec:
  targetServer: <prometheus-url>
  type: prometheus
---
apiVersion: metrics.keptn.sh/v1alpha3
kind: KeptnMetricsProvider
metadata:
  name: my-dynatrace-provider
  namespace: simple-go
spec:
  targetServer: "https://<tenant-id>.live.dynatrace.com"
  type: dynatrace
  secretKeyRef:
    name: dt-api-token
    key: DT_TOKEN
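The Dynatrace provider reads its API token from the Kubernetes secret referenced in secretKeyRef. If that secret does not exist yet, it can be created along these lines (replace <dynatrace-api-token> with a token that is allowed to read metrics):

$ kubectl create secret generic dt-api-token -n simple-go --from-literal=DT_TOKEN=<dynatrace-api-token>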
Defining SLIs
Now that we have defined our data sources, let's tell Keptn what SLIs we want to monitor and how to retrieve them from Prometheus and Dynatrace. This is done by applying AnalysisValueTemplate resources to the cluster.
If you have worked with Keptn in the past, you will notice that the structure of these resources is similar to KeptnMetrics resources (see this article if you would like to learn more about KeptnMetrics and how to use them to automatically scale your workloads). The difference between KeptnMetrics and AnalysisValueTemplates is:
- KeptnMetrics are monitored and updated continuously, meaning that they always represent the latest known value of the given metric. This makes them a good candidate for being observed by a HorizontalPodAutoscaler to make scaling decisions (a minimal example is sketched right after this list).
- AnalysisValueTemplates provide the means to get the value of a metric during a concrete time window. This makes them well-suited for tasks such as analyzing the results of a load test that has been executed after the deployment of a new version.
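For contrast, a continuously evaluated KeptnMetric for the error rate of our workload might look roughly like this (a minimal sketch; the resource name and fetch interval are arbitrary example values):

apiVersion: metrics.keptn.sh/v1alpha3
kind: KeptnMetric
metadata:
  name: error-rate-latest
  namespace: simple-go
spec:
  provider:
    name: my-prometheus-provider
  query: "rate(http_requests_total{status_code='500', job='simple-go-service'}[1m]) or on() vector(0)"
  fetchIntervalSeconds: 30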
In our case, we will create two AnalysisValueTemplate resources.
The first one measures the error rate of our workload, using data from Prometheus:
apiVersion: metrics.keptn.sh/v1alpha3
kind: AnalysisValueTemplate
metadata:
  name: error-rate
  namespace: simple-go
spec:
  provider:
    name: my-prometheus-provider
  query: "rate(http_requests_total{status_code='500', job='{{.workload}}'}[1m]) or on() vector(0)"
As a second metric, we measure the memory usage of our application using the following AnalysisValueTemplate:
apiVersion: metrics.keptn.sh/v1alpha3
kind: AnalysisValueTemplate
metadata:
  name: memory-usage
  namespace: simple-go
spec:
  provider:
    name: my-dynatrace-provider
  query: 'builtin:kubernetes.workload.memory_working_set:filter(eq("dt.entity.cloud_application",CLOUD_APPLICATION-3B2BD00402B933C2)):splitBy("dt.entity.cloud_application"):sum'
As can be seen in the spec.query field of the resource above, AnalysisValueTemplate resources support the Go templating syntax. With that, you can include placeholders in the query that are substituted at the time the concrete values for the metrics are retrieved. This comes in handy when, for example, the query is nearly identical for different workloads and only differs slightly, perhaps due to different label selectors being used for each workload. This way you do not need to create one AnalysisValueTemplate resource per workload, but can reuse one for different workloads and pass in the value for the actual workload at the time you perform an Analysis.
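For example, if a second workload exposed the same metric (say, a hypothetical checkout-service), the error-rate template could be reused as is; only the args passed to the Analysis would change, and the substituted queries would look like this:

workload = simple-go-service -> rate(http_requests_total{status_code='500', job='simple-go-service'}[1m]) or on() vector(0)
workload = checkout-service  -> rate(http_requests_total{status_code='500', job='checkout-service'}[1m]) or on() vector(0)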
Defining SLOs
The next step is to define our SLOs, i.e. the goals we would like our SLIs to meet. This is done via an AnalysisDefinition resource like the following:
apiVersion: metrics.keptn.sh/v1alpha3
kind: AnalysisDefinition
metadata:
  name: my-analysis-definition
  namespace: simple-go
spec:
  objectives:
    - analysisValueTemplateRef:
        name: memory-usage
      keyObjective: false
      target:
        failure:
          greaterThan:
            fixedValue: 30M
      weight: 1
    - analysisValueTemplateRef:
        name: error-rate
      keyObjective: true
      target:
        failure:
          greaterThan:
            fixedValue: 0
      weight: 3
  totalScore:
    passPercentage: 100
    warningPercentage: 75
This AnalysisDefinition resource has two objectives, both of which refer to the AnalysisValueTemplate resources we created previously. If you inspect them closely, you will notice that they differ in the weights they have been assigned, meaning that the goal for the error rate has a higher priority than the one for memory consumption. In combination with the target scores defined in the totalScore object, this means that passing the error-rate objective is mandatory for an analysis to be successful, or at least to achieve the warning state. The latter would be achieved if, for example, the error-rate objective is passed, but the memory consumption exceeds the defined limit of 30M.
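To make the grading concrete, here is how the scores work out for this definition (a worked example based on the weights and percentages above, not output produced by Keptn):

maximum score = 1 (memory-usage) + 3 (error-rate) = 4
- Both objectives pass: 4 / 4 = 100%, which meets passPercentage (100), so the analysis passes.
- Only error-rate passes: 3 / 4 = 75%, which meets warningPercentage (75), so the analysis ends in the warning state.
- error-rate fails: the score is at most 1 / 4 = 25%, and because error-rate is marked as a key objective (keyObjective: true), failing it causes the whole analysis to fail regardless of the percentage.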
Also, note that even though we use values coming from different data sources, i.e. Prometheus and Dynatrace, in the AnalysisDefinition, we do not need to consider any implementation-specific details when referring to them. You only need to provide the name of the AnalysisValueTemplate, and the metrics-operator determines where to retrieve the data from, based on the information in the KeptnMetricsProvider resources.
Executing an Analysis
Now, it is time to trigger an Analysis. This is done by applying an Analysis resource which looks as follows:
apiVersion: metrics.keptn.sh/v1alpha3
kind: Analysis
metadata:
  name: service-analysis
  namespace: simple-go
spec:
  timeframe:
    recent: 10m
  args:
    "workload": "simple-go-service"
  analysisDefinition:
    name: my-analysis-definition
Applying this resource causes Keptn to retrieve the values of the AnalysisValueTemplate resources referenced in the AnalysisDefinition that is used for this Analysis instance. After all required values have been retrieved, the objectives of the AnalysisDefinition are evaluated, and the overall result is computed.
This analysis uses the values of the last ten minutes (due to spec.timeframe.recent being set to 10m), but you can also specify a concrete timeframe using the spec.timeframe.from and spec.timeframe.to properties.
We also provide the argument workload to the analysis, using the spec.args property. Arguments passed to the analysis via this property are used when computing the actual query, using the templating string of the AnalysisValueTemplate resource. In our case, we use this in the error-rate AnalysisValueTemplate, where we set the query to rate(http_requests_total{status_code='500', job='{{.workload}}'}[1m]) or on() vector(0).
For our Analysis with spec.args.workload set to simple-go-service, the resulting query is:
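rate(http_requests_total{status_code='500', job='simple-go-service'}[1m]) or on() vector(0)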
Inspecting the results
After applying an Analysis resource, we can do a quick check of its state using kubectl:
$ kubectl get analysis -n simple-go
NAME               ANALYSISDEFINITION       STATE       WARNING   PASS
service-analysis   my-analysis-definition   Completed             true
The output of that command tells us if the Analysis has been completed already. As seen above, this is the case, and we can already see that it has passed. So now it's time to dive deeper into the results and see what information we get in the status of the resource:
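One way to do this is via kubectl, using the resource name and namespace from our example:

$ kubectl get analysis service-analysis -n simple-go -o yaml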
This command gives us the complete YAML representation of the Analysis:
apiVersion: metrics.keptn.sh/v1alpha3
kind: Analysis
metadata:
  name: service-analysis
  namespace: simple-go
spec:
  analysisDefinition:
    name: my-analysis-definition
  args:
    workload: simple-go-service
  timeframe:
    recent: 10m
status:
  pass: true
  raw: '…'
  state: Completed
  timeframe:
    from: "2023-11-15T08:15:15Z"
    to: "2023-11-15T08:25:15Z"
As you can see, this already gives us a lot more information, with the meatiest piece being the status.raw field. This is a JSON representation of the retrieved values and the goals we have set for them. However, this raw information is not easily digestible for our human eyes, so let's format it.
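One way to do this, assuming jq is available, is to extract the status.raw field via kubectl's jsonpath output and pipe it through jq:

$ kubectl get analysis service-analysis -n simple-go -o jsonpath='{.status.raw}' | jq .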
Giving us the following as a result:
{
  "objectiveResults": [
    {
      "result": {
        "failResult": {
          "operator": {
            "greaterThan": {
              "fixedValue": "30M"
            }
          },
          "fulfilled": false
        },
        "warnResult": {
          "operator": {},
          "fulfilled": false
        },
        "warning": false,
        "pass": true
      },
      "objective": {
        "analysisValueTemplateRef": {
          "name": "memory-usage"
        },
        "target": {
          "failure": {
            "greaterThan": {
              "fixedValue": "30M"
            }
          }
        },
        "weight": 1
      },
      "value": 25978197.333333,
      "query": "builtin:kubernetes.workload.memory_working_set:filter(eq(\"dt.entity.cloud_application\",CLOUD_APPLICATION-3B2BD00402B933C2)):splitBy(\"dt.entity.cloud_application\"):sum",
      "score": 1
    },
    {
      "result": {
        "failResult": {
          "operator": {
            "greaterThan": {
              "fixedValue": "0"
            }
          },
          "fulfilled": false
        },
        "warnResult": {
          "operator": {},
          "fulfilled": false
        },
        "warning": false,
        "pass": true
      },
      "objective": {
        "analysisValueTemplateRef": {
          "name": "error-rate"
        },
        "target": {
          "failure": {
            "greaterThan": {
              "fixedValue": "0"
            }
          }
        },
        "weight": 3,
        "keyObjective": true
      },
      "value": 0,
      "query": "rate(http_requests_total{status_code='500', job='simple-go-service'}[1m]) or on() vector(0)",
      "score": 3
    }
  ],
  "totalScore": 4,
  "maximumScore": 4,
  "pass": true,
  "warning": false
}
In the JSON object, we see:
- A list of the objectives that we defined earlier in our AnalysisDefinition
- The values of the related metrics
- The actual query that was used for retrieving their data from our monitoring data sources
Based on that, each objective is assigned a score, which is equal to the weight of that objective if the objective has been met. If not, the objective gets a score of 0.
Note: You can specify warning criteria in addition to failure criteria. In that case, if the value of a metric does not violate the failure criteria but does violate the warning criteria, the objective gets a score equal to half of its weight. This allows you to be even more granular with the grading of your analysis.
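For illustration, the memory-usage objective could be extended with such a warning threshold (the numbers here are only example values; since higher memory usage is worse, the warning limit sits below the failure limit):

- analysisValueTemplateRef:
    name: memory-usage
  target:
    warning:
      greaterThan:
        fixedValue: 30M
    failure:
      greaterThan:
        fixedValue: 50M
  weight: 1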
In our case, both objectives have been met, so we get the full score
and therefore pass the evaluation with flying colors.
Summary
To summarize, we have seen how we can define multiple monitoring data sources and let Keptn fetch the data we are interested in and provide us with a unified way of accessing this data. In the next step, we created a clear set of criteria we expect from that data to decide whether the related application is healthy or not. Finally, we have seen how we can easily perform an analysis and interpret its result. We did all this by using Kubernetes manifests and placing them in a GitOps repository next to our application's manifests.
If you would like to try out Keptn and its analysis capabilities yourself, feel free to head over to the Keptn docs and follow the guides to install Keptn, if you haven't done so already, and try out the Analysis example.