Click to learn more about author Yuval Dror.
Many companies today have adopted the new norm of rapid iteration in software development and now live by Mark Zuckerberg’s famous motto, “Move fast and break things.” This mentality has led to the growth in popularity of a service-oriented architecture (SOA) approach to software design. In particular, we’ve seen the rise of microservices, which are an SOA-style approach to software development where companies deploy business logic in small, independent services.
While the microservices approach has several advantages, such as reduced risk, faster deployment, and better scalability, it also brings its own set of challenges.
Because software development teams often deploy tens, hundreds, or even thousands of features each day, one of the main operational challenges with microservices is making sure that a new feature does not break the microservice it ships in and, more importantly, that a change to one microservice does not break other, dependent microservices.
In this article, we’ll discuss one of the technologies used to address this complexity: anomaly detection for service mesh.
What is Service Mesh?
Service-oriented architectures require dedicated tools that control service-to-service communication. In particular, as network communication between microservices grows in scale and complexity, it becomes impossible to manually manage deployments, troubleshoot issues, and maintain cluster security. Service mesh technologies give you an additional layer of insight and improve observability, traffic management, and deployment management, while also enhancing security within the mesh. Many tools and standards have been created to address service mesh complexity; these are summarized on the Layer5 website. CNCF projects such as OpenTelemetry, Envoy, and Prometheus have become especially popular:
- OpenTelemetry: OpenTelemetry describes itself as an open-source observability framework. In particular, it provides a single set of APIs, libraries, agents, and collector services to capture distributed traces and metrics from your application.
- Envoy Proxy: Originally built at Lyft, Envoy is an open-source edge and service proxy designed specifically for cloud-native applications. It sets out to solve two of the main issues with microservices that we’ve discussed: networking and observability.
- Prometheus: Prometheus is another open-source solution for event monitoring and alerting. It collects real-time metrics from configured targets, evaluates rule expressions, displays results, and can trigger alerts.
Drawbacks of the Service Mesh Monitoring Paradigm
One of the main issues with service mesh monitoring tools is that once you have a large number of microservices, observing them manually becomes impractical.
In the current paradigm of service mesh monitoring, the tools include components that collect the telemetry needed to check whether service-level agreements are being met. For example, the service mesh Istio collects the following types of measurement in order to provide overall service mesh observability:
- Metrics: These are generated from the Envoy Proxy statistics. Istio bases its service metrics on the four “golden signals” of monitoring (latency, traffic, errors, and saturation).
- Distributed Traces: Istio also generates distributed trace spans for each service.
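To make the golden signals concrete, here is a minimal Python sketch that derives three of them (latency, traffic, and errors) from a window of request records. The `Request` record, the `golden_signals` function, and the sample data are hypothetical names invented for illustration; they are not part of Istio or Envoy, which expose these metrics through their own telemetry. Saturation is left out because it depends on resource usage data rather than on request records.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    """One observed request (hypothetical record, not an Istio type)."""
    service: str
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_seconds):
    """Summarize traffic, error rate, and p95 latency for one time window."""
    if not requests:
        return {"traffic_rps": 0.0, "error_rate": 0.0, "p95_latency_ms": None}
    latencies = sorted(r.latency_ms for r in requests)
    # Count server-side failures (5xx) as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    if len(latencies) > 1:
        # n=20 yields 19 cut points; the last one is the 95th percentile.
        p95 = quantiles(latencies, n=20, method="inclusive")[-1]
    else:
        p95 = latencies[0]
    return {
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p95_latency_ms": p95,
    }


# Ten requests observed over a five-second window; the slowest one failed.
sample = [Request("checkout", 10.0 * i, 500 if i == 10 else 200) for i in range(1, 11)]
signals = golden_signals(sample, window_seconds=5)
print(signals)
```

Numbers like these are what a dashboard would plot per service. Multiplying them by hundreds of services is what makes manual monitoring hard to sustain.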
Open-source projects like Istio are very useful for collecting metrics that allow developers to create dashboards. This process works well if you’re dealing with a smaller application and there’s a dedicated team monitoring and adjusting alerts. If you’re working on a project with large-scale deployment, however, these manual processes are much less effective.
Because it is not feasible to visually monitor multiple clusters, service mesh technologies need to go beyond “observing” and move towards automated anomaly detection.
Anomaly Detection for Service Mesh
Anomaly detection that employs machine learning has many benefits over traditional monitoring methods, such as automatically learning the behavioral patterns of each new microservice and automatically sending alerts when significant changes are detected. These features lower the time it takes to detect anomalies and help prevent them from spreading further.
AI-based anomaly detection integrates with the service mesh as a whole in order to track high-level KPIs as well as the most granular signals from each microservice.
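As a rough illustration of what "learning the behavioral pattern" of a signal can mean, the sketch below keeps a rolling baseline for one metric stream and flags values that deviate strongly from it. This is a deliberately simple statistical stand-in (a rolling z-score), not the method used by any particular product, and the class name, parameters, and sample values are all assumptions made for this example.

```python
from collections import deque
from statistics import mean, stdev


class RollingBaseline:
    """Flags values that deviate strongly from a rolling window of history."""

    def __init__(self, window=30, threshold=3.0, min_samples=10):
        self.values = deque(maxlen=window)  # recent normal observations
        self.threshold = threshold          # allowed deviation, in standard deviations
        self.min_samples = min_samples      # history needed before alerting

    def observe(self, value):
        """Return True if value is anomalous relative to the learned baseline."""
        is_anomaly = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = stdev(self.values)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        # Learn only from normal points so an incident does not become the baseline.
        if not is_anomaly:
            self.values.append(value)
        return is_anomaly


# Hypothetical latency samples (ms) for one microservice, with one spike.
latencies = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 99, 102, 500, 100]
detector = RollingBaseline()
flags = [detector.observe(v) for v in latencies]
print(flags)
```

In practice one such baseline would exist per signal per service. Raising `threshold` reduces false alarms at the cost of missing smaller deviations.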
Anomaly detection for service mesh monitoring is still an emerging field, but if you’re reviewing the available solutions, here are a few considerations to keep in mind:
- Fully Autonomous: As mentioned, the service mesh of a large-scale deployment is impossible to monitor manually, so the first consideration is whether the solution can independently track and learn from data in real time.
- False Positive Rate: Next, look for a solution with a low false-positive rate, since too many false alarms create unnecessary noise and lead to alert fatigue.
- Correlation: Finally, an AI-based anomaly detection solution should be able to automatically learn the topology of the mesh and correlate related anomalies across services.
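The correlation point can be sketched with a simple heuristic: given the mesh topology and the set of services currently flagged as anomalous, treat an anomalous service as a likely root candidate when none of the services it calls are anomalous themselves. The function name, the dependency map, and the service names below are hypothetical, and a real solution would learn the topology from traffic rather than take it as input.

```python
def root_candidates(dependencies, anomalous):
    """Return anomalous services whose own dependencies all look healthy.

    dependencies maps each service to the set of services it calls.
    anomalous is the collection of services currently flagged.
    """
    anomalous = set(anomalous)
    roots = set()
    for service in anomalous:
        downstream = set(dependencies.get(service, ()))
        # If nothing this service calls is anomalous, the problem likely starts here.
        if not (downstream & anomalous):
            roots.add(service)
    return roots


# Hypothetical topology: frontend -> checkout -> payments, inventory
mesh = {
    "frontend": {"checkout"},
    "checkout": {"payments", "inventory"},
    "payments": set(),
    "inventory": set(),
}
print(root_candidates(mesh, {"frontend", "checkout", "payments"}))
```

Grouping three alerts under one probable cause in this way is the kind of noise reduction that correlation is meant to provide.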
With an anomaly detection solution, you not only get alerted about critical incidents but can also see a chronological list of correlated anomalies. This means you can trace an incident back to its root cause and help ensure it doesn’t happen again.
As we’ve discussed, service mesh monitoring has become an essential part of managing microservices because it provides insight into service-to-service communication. As the number of deployed microservices grows, however, manual observability becomes increasingly impractical.
Pairing service mesh technologies with an AI-based anomaly detection solution addresses this challenge by enabling you to detect incidents in real time and reduce your time to resolution.