Unbreakable Delivery Pipeline

This use case shows how to implement a delivery pipeline that prevents bad code changes from impacting your end users.

About this use case

The initial goal of the Unbreakable Delivery Pipeline is to implement a pipeline that prevents bad code changes from impacting your end users. This pipeline relies on three concepts known as Shift-Left, Shift-Right, and Self-Healing.

  • Shift-Left: Ability to pull data for specific entities (processes, services, or applications) through an automation API and feed it into the tools that decide whether to stop the pipeline or keep it running.

  • Shift-Right: Ability to push deployment information and metadata to your monitoring solution (e.g., to differentiate BLUE vs GREEN deployments), to push the build or revision number of a deployment, or to notify it about configuration changes (see the sketch after this list).

  • Self-Healing: Ability to apply smart auto-remediation that addresses the root cause of a problem rather than its symptoms.
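
As an example of the Shift-Right concept, deployment metadata can be pushed to a monitoring solution such as Dynatrace via its Events API. The following is only a minimal sketch, assuming a tenant URL, an API token with event-ingest permissions, and a tag that identifies the carts service; all placeholder values (DT_TENANT, DT_API_TOKEN, the tag, and the version) need to be adapted to your environment.

    $ # Minimal sketch: push a CUSTOM_DEPLOYMENT event to Dynatrace (Events API v1).
    $ # DT_TENANT, DT_API_TOKEN, and the tag/version values are placeholders.
    $ curl -X POST "https://${DT_TENANT}/api/v1/events" \
        -H "Authorization: Api-Token ${DT_API_TOKEN}" \
        -H "Content-Type: application/json" \
        -d '{
              "eventType": "CUSTOM_DEPLOYMENT",
              "deploymentName": "carts release",
              "deploymentVersion": "1.1",
              "source": "Jenkins",
              "attachRules": {
                "tagRule": [{
                  "meTypes": ["SERVICE"],
                  "tags": [{ "context": "CONTEXTLESS", "key": "app", "value": "carts" }]
                }]
              }
            }'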

To illustrate the scenario this use case addresses, two situations are demonstrated:

  1. The source code of a service of the Dynatrace Sockshop is changed, and the service is deployed to the development environment. Although the service passes the quality gates in the development environment, it does not pass the quality gate in staging due to an increase in response time detected by a performance test. This demonstrates an early break of the delivery pipeline based on automated quality gates.

  2. To show the self-healing capabilities, a faulty service version is deployed to the production environment and traffic is routed to this new version. Consequently, an issue is detected in production and a problem ticket is opened. To auto-remediate the bad deployment, the traffic routing is changed to redirect traffic to the previous (non-faulty) version.

Step 1: Simulate an early pipeline break

In this step you will release a service to staging that was not verified by performance tests in development. The service is intentionally slowed down so that it fails the end-to-end check in the staging pipeline.

  1. Introduce a slowdown in the carts service.

    1. In the directory ~/keptn/repositories/carts/, open the file ./src/main/resources/application.properties
    2. Change the value of delayInMillis from 0 to 1000
    3. Commit and push the changes to your GitHub repository carts:

      $ git add .
      $ git commit -m "Property changed"
      $ git push
      
  2. You need the new version of the carts service in the staging namespace. Therefore, create a new release branch in the carts repository using the Jenkins pipeline create-release-branch:

    1. Go to Jenkins and sockshop.
    2. Click on create-release-branch pipeline and Schedule a build with parameters.
    3. For the parameter SERVICE, enter the name of the service you want to create a release for. In this case: carts

      The pipeline does the following (a shell sketch approximating these steps can be found at the end of this step):

      1. Reads the current version of the microservice.
      2. Creates a release branch with the name release/version.
      3. Increments the current version by 1.
      4. Commits/Pushes the new version to the Git repository.
      (Screenshot: Pipeline create-release-branch)
  3. After the create-release-branch pipeline has finished, trigger the build pipeline for the carts release and follow the pipeline:

    1. Go to sockshop, carts, and click on Scan Multibranch Pipeline Now.
    2. Open the release build by clicking on its build number.
    3. In the Console Output, wait for Starting building: k8s-deploy-staging and click on that link.
    4. The pipeline should fail because the response time is too high.
    5. Click on Performance Report to see the average response time of the URI: _cart - add to cart

      (Screenshot: Break early)
  4. Remove the slowdown in the carts service

    1. In the directory ~/keptn/repositories/carts/, open the file ./src/main/resources/application.properties
    2. Change the value of delayInMillis from 1000 to 0
    3. Commit and push the changes to your GitHub repository carts:

      $ git add .
      $ git commit -m "Set delay to 0"
      $ git push
      
  5. Build this new release

    1. Go to Jenkins and sockshop.
    2. Click on create-release-branch pipeline and Schedule a build with parameters.
    3. For the parameter SERVICE, enter the name of the service you want to create a release for. In this case: carts
    4. After the create-release-branch pipeline has finished, trigger the build pipeline for the carts release.
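
The create-release-branch pipeline used above essentially automates a handful of Git operations. The following shell sketch approximates those steps for the carts repository; the name of the version file and the commit message are assumptions for illustration, not the pipeline's actual implementation.

    $ # Rough sketch of what create-release-branch automates (illustrative only).
    $ cd ~/keptn/repositories/carts
    $ VERSION=$(cat version)                        # 1. read the current version (file name is an assumption)
    $ git checkout -b "release/${VERSION}"          # 2. create the release branch release/<version>
    $ git push origin "release/${VERSION}"
    $ git checkout master                           # 3. increment the current version by 1
    $ echo $((VERSION + 1)) > version
    $ git commit -am "Increment version to $((VERSION + 1))"
    $ git push                                      # 4. push the new version to the Git repository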

Step 2: Set up self-healing action for production deployment

In this step you will use an Ansible Tower job to release a deployment in a canary release manner. Additionally, a second job switches traffic back to the old version in case the canary (i.e., the new version of front-end) misbehaves.

  1. Login to your Ansible Tower instance.

    Retrieve the public IP of your Ansible Tower:

    $ kubectl get services -n tower
    NAME            TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)         AGE
    ansible-tower   LoadBalancer   ***.***.***.**   xxx.143.98.xxx   443:30528/TCP   1d
    

    Copy the EXTERNAL-IP into your browser and navigate to https://xxx.143.98.xxx

  2. (If you haven’t entered the license yet, see submit the Ansible Tower license.)

  3. (If you haven’t integrated Ansible Tower into Dynatrace, see Integration Ansible Tower runbook in Dynatrace.)

  4. Your login is:

    • Username: admin
    • Password: dynatrace
  5. Verify the existing job template for canary release in Ansible Tower by navigating to Templates and canary.

    • Name: canary
    • Job Type: Run
    • Inventory: inventory
    • Project: self-healing
    • Playbook: scripts/playbooks/canary.yaml
    • Skip Tags: canary_reset
    • Extra Variables:
      ---
      jenkins_user: "admin"
      jenkins_password: "AiTx4u8VyUV8tCKk"
      jenkins_url: "http://1**.2**.3**.4**/job/k8s-deploy-production-canary/build?delay=0sec"
      remediation_url: "https://5**.6**.7**.8**/api/v2/job_templates/xx/launch/"
    • Remarks:
      • The IP 1**.2**.3**.4** in jenkins_url is the IP of your Jenkins.
      • The IP 5**.6**.7**.8** in remediation_url is the IP of your Ansible Tower.
      • The xx before /launch/ is the ID of the job template shown in the next step (a sketch of calling this launch endpoint can be found at the end of this step).

    After this step, your job template for canary should look as shown below:

    (Screenshot: Ansible job template for canary)

  6. Verify the existing job template for canary-reset in Ansible Tower by navigating to Templates and canary-reset.

    • Name: canary-reset
    • Job Type: Run
    • Inventory: inventory
    • Project: self-healing
    • Playbook: scripts/playbooks/canary.yaml
    • Job Tags: canary_reset
    • Remarks:
      • The IP 1**.2**.3**.4** in jenkins_url is the IP of your Jenkins.

    After this step, your job template for canary reset should look as shown below:

    (Screenshot: Ansible job template for canary-reset)
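
The remediation_url configured in the canary template points at Ansible Tower's REST API endpoint for launching a job template. As a sketch, such a launch call could look like the following; the Tower IP and the template ID (xx) are the placeholders from the tables above, and the credentials are the ones listed in this step.

    $ # Sketch: launch an Ansible Tower job template via its REST API.
    $ # Replace 5**.6**.7**.8** with your Ansible Tower IP and xx with the job template ID.
    $ curl -k -X POST "https://5**.6**.7**.8**/api/v2/job_templates/xx/launch/" \
        -u admin:dynatrace \
        -H "Content-Type: application/json"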

Step 3: Introduce a failure into front-end and deploy to production

In this step you will introduce a JavaScript error into the front-end service. This version will be deployed as version v2.

  1. Open the file server.js in the master branch of the ~/keptn/repositories/front-end repository and set the property response-error-probability to 20:

    ...
    global.acmws['request-latency'] = 0;
    global.acmws['request-latency-catalogue'] = 500; 
    global.acmws['response-error-probability'] = 20;
    ...
    
  2. Save changes to that file.

  3. Commit your changes and push them to the remote repository.

    $ git add .
    $ git commit -m "Changes in the server component"
    $ git push
    
  4. You need the new version of the front-end service in the staging namespace before you can start with a blue-green or canary deployment. Therefore, create a new release branch in the front-end repository using the Jenkins pipeline create-release-branch:

    1. Go to Jenkins and sockshop.
    2. Click on create-release-branch pipeline and Schedule a build with parameters.
    3. For the parameter SERVICE, enter the name of the service you want to create a release for. In this case: front-end

      The pipeline does the following:

      1. Reads the current version of the microservice.
      2. Creates a release branch with the name release/version.
      3. Increments the current version by 1.
      4. Commits/Pushes the new version to the Git repository.
      (Screenshot: Pipeline create-release-branch)
  5. After the create-release-branch pipeline has finished, trigger the build pipeline for the front-end service and wait until the new artifact is deployed to the staging namespace.

    • Wait until the release/version build has finished.
  6. Deploy the new front-end to production

    1. Go to your Jenkins and click on k8s-deploy-production.update.
    2. Click on master and Build with Parameters:
      • SERVICE: front-end
      • VERSION: v2
    3. Hit Build and wait until the pipeline shows: Success. (You can verify the deployment with the sketch below.)
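
To verify that the new version actually reached the production namespace, you can inspect the workloads there. This is only a sketch; the exact deployment names depend on how the k8s-deploy-production pipeline names the blue/green deployments in your setup.

    $ # Sketch: inspect the front-end workloads in the production namespace.
    $ # With -o wide, the IMAGES column shows which versions are currently deployed.
    $ kubectl get deployments -n production -o wide | grep front-end
    $ kubectl get pods -n production | grep front-end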

Step 4: Simulate a bad production deployment

In this step, you will launch the Ansible job described above, which redirects all traffic to the new version of front-end in a canary release manner. Since the new front-end contains a failure, Dynatrace will open a problem and automatically invoke an auto-remediation action.
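
Under the hood, shifting traffic between the two front-end versions is done with Istio routing rules. The snippet below is only a conceptual sketch of such a weighted VirtualService (the gateway name, the subsets, which would be defined in a corresponding DestinationRule, and the weights are assumptions); the actual routing configuration is generated and applied by the canary playbook, so there is no need to apply this yourself.

    $ # Conceptual sketch only: a weighted VirtualService splitting traffic between front-end v1 and v2.
    $ # Gateway name, subsets, and weights are assumptions; the canary playbook applies its own configuration.
    $ cat <<'EOF' > canary-example.yaml
    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: sockshop
    spec:
      hosts:
      - "*"
      gateways:
      - sockshop-gateway
      http:
      - route:
        - destination:
            host: front-end.production.svc.cluster.local
            subset: v1
          weight: 50
        - destination:
            host: front-end.production.svc.cluster.local
            subset: v2
          weight: 50
    EOF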

  1. Run the kubectl get svc istio-ingressgateway -n istio-system command to get the EXTERNAL-IP of your Gateway.

    $ kubectl get svc istio-ingressgateway -n istio-system
    NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                                      AGE
    istio-ingressgateway   LoadBalancer   172.21.109.129   1**.2**.1**.1**  80:31380/TCP,443:31390/TCP,31400:31400/TCP   17h
    
  2. Simulate real-user traffic

    1. In your Dynatrace tenant, go to Synthetic and click on Create a synthetic monitor
    2. Click on Create a browser monitor
    3. Type in the EXTERNAL-IP of your ingress gateway and give your monitor a name (e.g., Sockshop Monitor).
    4. At Frequency and locations set Monitor my website every 5 minutes.
    5. Select all locations and finally click on Monitor single URL and Create browser monitor.
    6. Now, wait a couple of minutes for the synthetic traffic.
  3. Run the job template in Ansible Tower

    1. Go to Ansible Tower.
    2. Start the job template canary to trigger a canary release of front-end v2.
      (Screenshot: Ansible job execution)
  4. (Optional) Adjust sensitivity of anomaly detection

    1. In your Dynatrace tenant, go to Transactions & services and click on front-end.production.
    2. Click on the button in the top right corner and select Edit.
    3. Go to Anomaly Detection and enable the switch for Detect increases in failure rate.
      • Select using fixed thresholds
      • Alert if 2% custom error rate threshold is exceeded during any 5-minute period.
      • Sensitivity: High
    4. Go back to your service.
  5. Now, you need to wait until a problem appears in Dynatrace. (If you prefer to check for open problems from the command line, see the sketch at the end of this step.)

  6. When Dynatrace opens a problem, it automatically invokes the remediation action that was defined by the canary playbook. This remediation action refers to the remediation playbook, which in turn triggers the canary-reset playbook. Consequently, you can see the executed playbooks when navigating to Ansible Tower and Jobs. Moreover, the failure rate of the front-end service should decrease, since new traffic is routed to the previous version of front-end.

    (Screenshot: Ansible canary-reset job)
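
If you prefer to check for open problems from the command line instead of the Dynatrace UI, a sketch using the Dynatrace Problems API could look like this; DT_TENANT and DT_API_TOKEN are placeholders, and the token needs permission to read problem data.

    $ # Sketch: list currently open problems via the Dynatrace Problems API (v1).
    $ curl -s -H "Authorization: Api-Token ${DT_API_TOKEN}" \
        "https://${DT_TENANT}/api/v1/problem/feed?status=OPEN"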

Step 5: Cleanup

  1. Apply the configuration of the VirtualService to route all traffic to v1 only (you can verify the result with the sketch at the end of this step).

    $ cd ~/keptn/repositories/k8s-deploy-production/istio
    $ kubectl apply -f virtual_service.yml
    virtualservice.networking.istio.io/sockshop configured
    
  2. Remove the failure from the front-end service.

    1. Open the file server.js in the master branch of the ~/keptn/repositories/front-end repository and set the property response-error-probability to 0:

      ...
      global.acmws['request-latency'] = 0;
      global.acmws['request-latency-catalogue'] = 500; 
      global.acmws['response-error-probability'] = 0;
      ...
      
    2. Save the changes to that file.

    3. Commit your changes and push them to the remote repository.

      $ git add .
      $ git commit -m "Fixed issue in server component"
      $ git push
      
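To double-check that all traffic is routed to v1 again, you can inspect the applied VirtualService with kubectl; the namespace used below is an assumption, so adjust it to wherever the sockshop VirtualService lives in your setup.

    $ # Sketch: inspect the routing rules of the sockshop VirtualService (namespace is an assumption).
    $ kubectl get virtualservice sockshop -n production -o yaml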

Understanding what happened

In this use case, you intentionally introduced a failure into a service to demonstrate an early pipeline stop, early enough to keep a faulty service version from reaching production.

If a bad deployment does make it into a production environment, there must be a mechanism that triggers an auto-remediation action based on a problem notification. This action takes care of achieving a desired state. To experience this scenario, you created a failure in the front-end service, which was deployed to a production environment. As Dynatrace detected an increase in the failure rate, a problem notification automatically triggered a runbook in Ansible Tower that took care of rerouting traffic to the previous version.