Automated Operations

Challenge: In modern microservices environments, you have to deal with systems that can exhibit unpredictable behavior because of the large number of interdependencies. For example, changing the configuration of one component might affect a different part of the system. Problems are also dynamic: their nature and impact can change drastically over time.

Keptn addresses this challenge by introducing the concept of micro-operations that declare remediation actions for resolving certain problem types or triggering any operational tasks. Micro-operations follow a declarative approach, are atomic building blocks, and get triggered by events.

Declarative Operations as Code

Keptn follows a declarative approach for configuring remediation and operations workflows as code at the level of individual microservices (rather than applications). Consequently, this declaration is versioned next to the operational config and deployed with each new version of the microservice.

Below is an example of a declarative remediation.yaml file as used in Keptn. The file defines two problem types and their respective remediation actions. In case of a response time degradation, the ReplicaSet is scaled up by one instance; in case of a failure rate increase, a feature is disabled. To learn more about the remediation configuration, please continue here.

version: 0.2.0
kind: Remediation
metadata:
  name: remediation-service-carts
spec:
  remediations:
  - problemType: Response time degradation
    actionsOnOpen:
    - name: Scaling ReplicaSet by 1
      description: Scaling the ReplicaSet of a Kubernetes Deployment by 1
      action: scaling
      value:
        increment: +1
  - problemType: Failure rate increase
    actionsOnOpen:
    - name: Toggle feature flag
      action: featuretoggle
      description: Toggle feature flag PromotionCampaign from ON to OFF.
      value:
        PromotionCampaign: off

  • This remediation file is interpreted by the provided remediation-service and versioned in a Git repository.

  • This remediation file declares what needs to be done and leaves all the details to other components.

  • The remediation actions are defined by the developer for all services that are created. These operational instructions become additional metadata for each service.

  • With a declarative approach, developers do not need to worry about the actual execution details. They can leave these details to the platform engineering teams while leveraging the functionality.
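
To illustrate what "interpreting" such a declaration could involve, the following Python sketch looks up the actions declared for a given problem type. It is not Keptn's implementation: the dictionary is a hand-written, parsed form of a remediation declaration, and the key names `remediations` and `actionsOnOpen` as well as the helper `actions_for` are assumptions made for this example.

```python
from typing import Any

# Parsed form of a remediation declaration (structure assumed for illustration).
REMEDIATION: dict[str, Any] = {
    "remediations": [
        {
            "problemType": "Response time degradation",
            "actionsOnOpen": [
                {
                    "name": "Scaling ReplicaSet by 1",
                    "action": "scaling",
                    "value": {"increment": "+1"},
                },
            ],
        },
        {
            "problemType": "Failure rate increase",
            "actionsOnOpen": [
                {
                    "name": "Toggle feature flag",
                    "action": "featuretoggle",
                    "value": {"PromotionCampaign": "off"},
                },
            ],
        },
    ]
}


def actions_for(remediation: dict[str, Any], problem_type: str) -> list[dict[str, Any]]:
    """Return the actions declared for a problem type, or an empty list."""
    for entry in remediation.get("remediations", []):
        if entry.get("problemType") == problem_type:
            return list(entry.get("actionsOnOpen", []))
    return []


if __name__ == "__main__":
    # The declaration only states what to do; how it is done is left to other components.
    for action in actions_for(REMEDIATION, "Failure rate increase"):
        print(action["action"], action["value"])
```

The lookup returns plain data and executes nothing, which mirrors the split described above: the declaration names the action, and a separate component carries it out.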

Atomic Building Blocks

In Keptn, a remediation action or operational task is implemented as a micro-operation. Such a micro-operation is kept as small as possible, meaning that it is designed to execute a single action. This action is implemented for a single microservice rather than an entire application. Consequently, operational procedures are written on a per-microservice basis, and you can select and combine them as needed.

A micro-operation is implemented by an action-provider, which is a Keptn-service with a dedicated purpose. This type of service is responsible for executing an action (i.e., a micro-operation) and might use another tool to do so. An action-provider starts working when it receives a Keptn CloudEvent of type sh.keptn.event.action.triggered. To learn more about the implementation of a micro-operation by an action-provider, please continue here.
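
As a rough illustration of this behavior, the Python sketch below builds an event handler that reacts only to events of type sh.keptn.event.action.triggered and executes exactly one kind of action. The event layout (`data.service`, `data.action.action`, `data.action.value`), the reply event type, the `result` field, and the helper `make_action_provider` are assumptions made for this example; they are not taken from the Keptn specification.

```python
from typing import Any, Callable, Optional

ACTION_TRIGGERED = "sh.keptn.event.action.triggered"  # event type named in the text
ACTION_FINISHED = "sh.keptn.event.action.finished"  # assumed name for the reply event


def make_action_provider(
    supported_action: str,
    execute: Callable[[str, dict[str, Any]], bool],
) -> Callable[[dict[str, Any]], Optional[dict[str, Any]]]:
    """Build a handler that executes exactly one kind of action."""

    def handle(event: dict[str, Any]) -> Optional[dict[str, Any]]:
        if event.get("type") != ACTION_TRIGGERED:
            return None  # not an action request
        data = event.get("data", {})
        action = data.get("action", {})
        if action.get("action") != supported_action:
            return None  # a different action-provider is responsible
        service = data.get("service", "")
        # Delegate the actual work, possibly to another tool.
        succeeded = execute(service, action.get("value", {}))
        # Report the outcome back (payload layout is hypothetical).
        return {
            "type": ACTION_FINISHED,
            "data": {
                "service": service,
                "action": supported_action,
                "result": "pass" if succeeded else "fail",
            },
        }

    return handle


if __name__ == "__main__":
    def scale(service: str, value: dict[str, Any]) -> bool:
        print(f"scaling {service} by {value.get('increment')}")
        return True

    provider = make_action_provider("scaling", scale)
    print(provider({
        "type": ACTION_TRIGGERED,
        "data": {
            "service": "carts",
            "action": {"action": "scaling", "value": {"increment": "+1"}},
        },
    }))
```

Keeping each provider limited to one action is what makes the building blocks atomic: a provider ignores every request it is not responsible for instead of trying to handle it.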

Event-driven Choreography

Assuming a developer has deployed a new artifact with a remediation file, the task sequence of an automated remediation looks as follows:

  1. The process gets triggered by a problem event sent out by a monitoring solution.

  2. Keptn receives this problem event and retrieves the remediation file from the Git repository.

  3. An internal Keptn-service interprets the remediation file and sends out events for action-providers.

  4. Depending on the problem type, the responsible action-provider executes its action and informs Keptn about the execution.

  5. Keptn triggers a re-evaluation of the quality gate in this stage.

  6. Based on the result of this evaluation, Keptn sends out an event to escalate the problem or to mark it as resolved.

Task sequence of an automated remediation
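
The task sequence can be condensed into a small Python sketch. Everything in it is hypothetical: the function `run_remediation`, the simplified mapping from problem types to action names, and the callables standing in for action-providers and for the quality gate evaluation. Escalating early when no action is declared or an action fails is also an assumption of this sketch; the steps above only describe the outcome of the final evaluation.

```python
from typing import Callable


def run_remediation(
    problem_type: str,
    declared_actions: dict[str, list[str]],
    providers: dict[str, Callable[[], bool]],
    quality_gate_passes: Callable[[], bool],
) -> str:
    """Run the declared actions for a problem, then re-evaluate the quality gate."""
    # Steps 1-3: a problem event arrives and the declared actions are looked up.
    actions = declared_actions.get(problem_type, [])
    if not actions:
        return "escalated"  # assumption: nothing declared, so a human must step in

    # Step 4: each action is executed by the provider responsible for it.
    for action in actions:
        provider = providers.get(action)
        if provider is None or not provider():
            return "escalated"  # assumption: a failed action is escalated at once

    # Steps 5-6: the quality gate decides between resolved and escalated.
    return "resolved" if quality_gate_passes() else "escalated"


if __name__ == "__main__":
    declared = {"Response time degradation": ["scaling"]}
    providers = {"scaling": lambda: True}
    print(run_remediation("Response time degradation", declared, providers, lambda: True))
```

In an event-driven setup, these would not be direct function calls: each step would be an event that the next participant reacts to, which is what allows action-providers to be added or replaced independently.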