Automated Driving

Data acquisition and analysis in DevOps project

Martin Scholz: How do you make sure that everything relevant gets documented?

Quirin Kögl: We create a ticket for every incident to prevent any loss of information. All the necessary information is documented in this ticket. In addition, we have an online documentation page in Confluence for each product, which provides an overview of all incidents. There is a link to the respective ticket. We use this overview as a knowledge database.
For each product, we have documented the first steps for analyzing and troubleshooting the different components. This way, every team member can address problems in any product, even one they do not normally work on. In addition, communication with the customer takes place only via shared mailboxes, which every team member can view. This way, no information is lost when someone is absent.

Martin Scholz: Your project is very well organized and has an overview of the current incidents at all times. Metrics and alerts are essential to know the current status of the products in live operation. How do you determine these metrics for incidents and system messages to fix malfunctions quickly?

Quirin Kögl: We work with two different customer Azure environments.
In one of the environments, we use Azure’s own services to display the technical and functional metrics of our products and components in several dashboards. In these product-specific dashboards, we can see the status of our products and components at a glance. This is where we run Big Data applications and most of our Azure resources.
In the second environment, provided by the customer, our interfaces to the vehicle backend run as microservices. Here, we rely on the logging system provided by the customer, where we define alerts and dashboards based on Micrometer metrics. In addition, we use Azure services that are approved for use with the data. On this basis, we set up various alerts that automatically inform us about faulty behavior in our products.
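
As an illustration of a metric-based alert, here is a minimal sketch of a threshold check on an error rate. It is not taken from the project: the function name, the counters, and the 5% threshold are assumptions chosen for the example, and a real setup would define such a rule in the monitoring system rather than in application code.

```python
def error_rate_exceeded(error_count: int, request_count: int,
                        threshold: float = 0.05) -> bool:
    """Return True if the share of failed requests is above the threshold.

    All names and the default threshold are hypothetical.
    """
    if request_count == 0:
        # No traffic in the window: nothing to judge, so do not alert here.
        # (Missing traffic is better covered by a separate availability check.)
        return False
    return error_count / request_count > threshold
```

Keeping the rule this simple makes it easy to answer, for each alert, what it means when the rule fires.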

Martin Scholz: When is an alert useful to you?

Quirin Kögl: Basically, our goal is to use alerts to learn about potential problems in our products as quickly as possible, and before users do. Alerts need to be useful and not overused, so that the important issues don't get drowned out by noise. Therefore, we have defined the following questions in the project, which we answer for each alert:

What happens if we ignore an alert?
If we can ignore the alert and nothing bad happens, then we don't need the alert.

How can we analyze the problem that occurred?
Documentation helps us to get to the cause of the problem faster, especially if the problem occurs rarely.

How can we fix the problem that occurred?
This information saves us from working out the solution from scratch every time and gets us back to normal operation faster. If the solution is to restart the affected product, we can also automate that.
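
The three questions above could be captured as a small record that is reviewed before an alert goes live. This is only a sketch under assumptions: the class `AlertDefinition`, its fields, and the `review` helper are hypothetical names, not the project's actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertDefinition:
    """Hypothetical record answering the three questions for one alert."""
    name: str
    impact_if_ignored: str      # What happens if we ignore the alert?
    analysis_steps: str         # How can we analyze the problem?
    fix_steps: str              # How can we fix the problem?
    auto_restart: bool = False  # True if the fix is a restart that can be automated


def review(alert: AlertDefinition) -> list[str]:
    """Return the reasons why an alert definition is not ready yet."""
    problems = []
    if not alert.impact_if_ignored.strip():
        problems.append("no impact if ignored: the alert is probably not needed")
    if not alert.analysis_steps.strip():
        problems.append("analysis steps are not documented")
    if not alert.fix_steps.strip():
        problems.append("fix steps are not documented")
    return problems
```

An alert with all three answers filled in passes the review with an empty list; an alert with no stated impact is a candidate for deletion.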

Martin Scholz: I guess you started with zero alerts. How many alerts do you currently have? In what cases did you decide to create new alerts and do you ever delete alerts?

Quirin Kögl: We started with a few standard alerts, such as heartbeat alerts, which check the availability of the application. We are now at close to a hundred alerts across all products. We mostly create alerts iteratively: when a new issue comes up and we expect that it may come up again, we add an alert for it.
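
A heartbeat alert of the kind mentioned here can be sketched as a check on how long an application has been silent. This is a minimal illustration, assuming the application reports a timestamp at regular intervals; the function name and the five-minute limit are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_missed(last_seen: datetime, now: datetime,
                     max_silence: timedelta = timedelta(minutes=5)) -> bool:
    """Return True if the application has been silent longer than allowed."""
    return now - last_seen > max_silence


# Usage: compare the last reported heartbeat with the current time.
if __name__ == "__main__":
    current = datetime.now(timezone.utc)
    print(heartbeat_missed(current - timedelta(minutes=10), current))
```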

Recently, six alerts were actually triggered. However, the problem was not with us, but with various interface partners. We see potential here to automatically forward such issues to the affected product, if the non-availability of a neighboring system affects us.
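
The forwarding idea could look roughly like the following sketch. It assumes each alert carries the name of the failing system; the function, the `"own-team"` label, and the contact mapping are hypothetical and only illustrate the routing decision.

```python
def route_alert(failing_system: str,
                own_systems: set[str],
                partner_contacts: dict[str, str]) -> str:
    """Decide who should receive an alert, based on which system failed.

    Returns "own-team" for our own systems and for unknown systems
    (so that someone still triages them manually); otherwise returns
    the contact of the responsible interface partner.
    """
    if failing_system in own_systems:
        return "own-team"
    return partner_contacts.get(failing_system, "own-team")
```

Falling back to the own team for unknown systems keeps an alert from being silently dropped when no partner contact is registered.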

Martin Scholz: Can you give an example of a central service in the products?

Quirin Kögl: We are talking about products that can each consist of several components. For us, these are interface services or distributed computing pipelines – in our case, Spark pipelines. One important product is our Topology Map. This is the base map on which both the route clearances for highly automated driving and, for example, recognized road works are stored. The product consists of a Spark pipeline that generates the map on a regular basis and a REST service that provides the map to external systems.

Martin Scholz: What is the conclusion from this project?

Quirin Kögl: We have achieved a lot. We have a fully automated CI/CD process and provision our infrastructure, monitoring, and alerts automatically with Terraform (infrastructure as code). This high level of automation results in fewer errors and allows us to focus more on product development with the customer.

As a result, we have gained a lot of trust from the customer. Through this trust, we take on a lot of responsibility. In return, however, we are also given the freedom and authority to live up to this responsibility. For example, we take on the functional design of the products.

Martin Scholz: Thank you for your insight into the complex topic of DevOps using incidents and alerting as an example. In the next part of the interview, we want to talk about efficient team organization of operations in a large DevOps project. I’m looking forward to it.
