Distributed tracing with Grafana Tempo

In this post, we look at distributed tracing and at Grafana Tempo and its uses. As technology evolves, many companies are moving from monolithic architectures to microservices. Monolithic applications are hard to maintain, scale, and deploy, whereas microservices give developers the freedom to choose the language and framework best suited to each piece of functionality. Because these services call one another, a single request can pass through many of them, and the services are often managed by different teams within an organization. It therefore becomes important to find the point of failure quickly and raise an issue with the team that can fix it, which is where distributed tracing comes into play.

What is Distributed Tracing?

Distributed tracing is a method of observing requests as they flow through distributed systems or services. Tracing data helps us monitor microservices, understand the flow of requests, identify the source of and reason for a failure, see the relevant infrastructure information, and determine what is affecting performance. Let's see how Tempo does it.
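Before turning to Tempo, it may help to make the idea concrete. The following is a minimal sketch of a trace as a set of spans that share one trace ID, with each child span pointing at its parent. The Span class, the start_span helper, and the three service names are hypothetical and only illustrate the concept; they are not how any particular tracing library represents spans.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str              # shared by every span of the same request
    span_id: str               # unique to this unit of work
    parent_id: Optional[str]   # None for the root span
    service: str
    name: str
    attributes: dict = field(default_factory=dict)


def start_span(service: str, name: str, parent: Optional[Span] = None) -> Span:
    """Start a root span, or a child span that inherits the parent's trace ID."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    return Span(trace_id, secrets.token_hex(8), parent_id, service, name)


# One request passing through three hypothetical services.
root = start_span("frontend", "GET /checkout")
cart = start_span("cart", "load-cart", parent=root)
pay = start_span("payments", "charge-card", parent=cart)

for span in (root, cart, pay):
    print(span.trace_id, span.service, span.name, "parent:", span.parent_id)
```

Because all three spans carry the same trace ID, a backend can stitch them back together into one request, and the parent IDs give the call order.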

In Tempo, traces are discovered through logs or exemplars.

Trace Discovery through logs
Trace Discovery through exemplars

An exemplar is a specific trace representative of a repeated data pattern in a given time interval. It helps you identify higher-cardinality metadata from specific events within time series data.
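To illustrate the idea of an exemplar, here is a rough sketch of a latency histogram that remembers one trace ID per bucket. The LatencyHistogram class, the bucket bounds, and the trace IDs are assumptions made up for this example; real metrics libraries have their own exemplar APIs, which this does not reproduce.

```python
from bisect import bisect_left

BUCKETS_S = [0.1, 0.5, 1.0]  # hypothetical bucket upper bounds, in seconds


class LatencyHistogram:
    """Counts observations per bucket and keeps the latest trace ID for each."""

    def __init__(self):
        # One extra slot at the end for values above the largest bound.
        self.counts = [0] * (len(BUCKETS_S) + 1)
        self.exemplars = [None] * (len(BUCKETS_S) + 1)

    def observe(self, seconds: float, trace_id: str) -> None:
        index = bisect_left(BUCKETS_S, seconds)  # first bucket with bound >= value
        self.counts[index] += 1
        # The exemplar ties this aggregated bucket back to one concrete trace.
        self.exemplars[index] = {"trace_id": trace_id, "value": seconds}


hist = LatencyHistogram()
hist.observe(0.05, trace_id="aaa")
hist.observe(0.07, trace_id="bbb")
hist.observe(2.30, trace_id="ccc")  # a slow outlier

print(hist.counts)
print(hist.exemplars[-1])
```

The slow bucket now carries the trace ID of a request that actually landed in it, which is the link that lets you jump from a metric spike to a trace.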

What is Grafana Tempo?

Tempo is an open-source, easy-to-use, high-scale distributed tracing backend. Through Grafana, it lets you visualize the lifecycle of a request as it passes through a set of applications. It is cost-efficient, requiring only object storage to operate, and is deeply integrated with Grafana, Prometheus, and Loki. It can ingest traces from OpenTelemetry, Zipkin, and Jaeger. Traces are stored under their own unique trace ID.
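As a toy illustration of what "stored by trace ID" means for lookups, the sketch below keeps spans in a dictionary keyed by trace ID. The TraceStore class and its methods are hypothetical and say nothing about Tempo's actual storage format; the point is only that a store indexed this way can answer "give me trace X" but not "give me all slow traces".

```python
from collections import defaultdict


class TraceStore:
    """Toy in-memory stand-in for a backend that indexes traces only by ID."""

    def __init__(self):
        self._spans_by_trace = defaultdict(list)

    def write(self, span: dict) -> None:
        self._spans_by_trace[span["trace_id"]].append(span)

    def find_by_id(self, trace_id: str) -> list:
        # The only supported query: the caller must already know the ID.
        return list(self._spans_by_trace.get(trace_id, []))


store = TraceStore()
store.write({"trace_id": "abc", "name": "GET /checkout"})
store.write({"trace_id": "abc", "name": "charge-card"})
store.write({"trace_id": "def", "name": "GET /health"})

print([s["name"] for s in store.find_by_id("abc")])
print(store.find_by_id("unknown"))
```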

Features of Tempo

Grafana Agent

Earlier, with Tempo, we had to instrument our code and do some custom work to collect all the traces and forward them to Grafana Cloud or another tracing backend. The Grafana Agent can be configured to run a set of tracing pipelines that collect data from your applications and write it to Tempo. It is a push-based agent built using OpenTelemetry: it collects the traces and forwards them to the tracing backend or Grafana Cloud.

Grafana Agent also brings features like:

Pipeline processing

The Grafana Agent processes tracing data as it flows through the pipeline to make the distributed tracing system more reliable and leverage the data for other purposes such as trace discovery, tail-based sampling, and generating metrics.
Batching

The Agent supports batching of traces. Batching helps compress the data better, reduces the number of outgoing connections, and is a recommended best practice.
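The effect of batching can be sketched in a few lines. The batch_spans function and its max_batch_size parameter are hypothetical names chosen for this example, and the sketch batches by size only; a real batcher would typically also flush after a timeout.

```python
from typing import Iterable, Iterator, List


def batch_spans(spans: Iterable[dict], max_batch_size: int) -> Iterator[List[dict]]:
    """Group spans into lists of at most max_batch_size items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batch: List[dict] = []
    for span in spans:
        batch.append(span)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, partially filled batch
        yield batch


spans = [{"span_id": i} for i in range(7)]
batches = list(batch_spans(spans, max_batch_size=3))
print([len(b) for b in batches])  # [3, 3, 1]
```

Seven spans leave in three outgoing requests instead of seven, which is where the saving in connections comes from.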
Receiving traces

The Grafana Agent supports multiple ingestion receivers: OTLP (OpenTelemetry), Jaeger, Zipkin, OpenCensus and Kafka.

Each tracing pipeline can be configured to receive traces in all these formats. Traces that arrive at a pipeline go through the receivers, processors, and exporters defined for that pipeline.
Attributes manipulation

The Grafana Agent allows for general manipulation of attributes on spans that pass through it. A common use is to add an environment or cluster attribute.

Automatic logging: trace discovery through logs

Tempo can find a trace only if you know its trace identifier, so it relies on other tools, such as logs and metrics, to discover traces.

Automatic logging provides an easy and fast way of getting trace discovery through logs. It writes a well-formatted log line to a Loki instance or stdout for each span, root span, or process that passes through the tracing pipeline. On top of that, we also get metrics from traces using Loki. This automatically builds a mechanism for trace discovery.

Automatic logging searches for a given set of attributes in the spans and logs them as key-value pairs. This allows searching by those key-value pairs in Loki.
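The general shape of this can be sketched as follows. The span_to_log_line function, the field names (traceID, span, dur_ms), and the sample span are all assumptions for illustration and do not reproduce the Agent's real output format. The attribute keys and the attributes to log are both supplied by the caller, as in the description above.

```python
def format_value(value) -> str:
    """Quote values that contain spaces, equals signs, or quotes."""
    text = str(value)
    if " " in text or "=" in text or '"' in text:
        return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'
    return text


def span_to_log_line(span: dict, logged_attributes: list) -> str:
    """Render one span as a key=value log line that includes its trace ID."""
    pairs = [
        ("traceID", span["trace_id"]),
        ("span", span["name"]),
        ("dur_ms", span["duration_ms"]),
    ]
    for key in logged_attributes:
        if key in span["attributes"]:  # only log attributes the span carries
            pairs.append((key, span["attributes"][key]))
    return " ".join(f"{key}={format_value(value)}" for key, value in pairs)


span = {
    "trace_id": "4bf92f3577b34da6",
    "name": "GET /checkout",
    "duration_ms": 120,
    "attributes": {"http.status_code": 500, "internal.debug": "x"},
}
line = span_to_log_line(span, logged_attributes=["http.status_code", "user.id"])
print(line)
```

A log query for http.status_code=500 would then surface this line, and the traceID field on it is what you paste into the trace lookup.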
Tail-based sampling

Probabilistic sampling strategies are easy to implement but run the risk of discarding relevant data. Tempo aims to provide an inexpensive solution that makes 100% sampling possible. However, constraints such as runtime or egress traffic costs sometimes make a lower sampling percentage necessary or desirable.

Tail-based sampling makes the sampling decision at the end of the workflow, after the spans of a trace have been seen, rather than at the start of the request.
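Here is a minimal sketch of that idea, assuming every span of a trace has already arrived. The tail_sample function, its policy (keep traces with an error or a slow span, plus a random baseline), and the parameter names are hypothetical choices for this example, not the Agent's actual policies.

```python
import random
from collections import defaultdict


def tail_sample(spans, latency_threshold_ms, baseline_rate, rng=random.random):
    """Return the spans of every trace that the policy decides to keep."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)

    kept = []
    for trace_spans in traces.values():
        # The decision is made per trace, after looking at all of its spans.
        has_error = any(s["error"] for s in trace_spans)
        is_slow = max(s["duration_ms"] for s in trace_spans) >= latency_threshold_ms
        if has_error or is_slow or rng() < baseline_rate:
            kept.extend(trace_spans)
    return kept


spans = [
    {"trace_id": "a", "duration_ms": 20, "error": False},
    {"trace_id": "a", "duration_ms": 35, "error": True},    # error -> keep trace a
    {"trace_id": "b", "duration_ms": 15, "error": False},   # fast and healthy
    {"trace_id": "c", "duration_ms": 900, "error": False},  # slow -> keep trace c
]
# baseline_rate=0.0 disables the random component so the result is deterministic.
kept = tail_sample(spans, latency_threshold_ms=500, baseline_rate=0.0)
print(sorted({s["trace_id"] for s in kept}))
```

A head-based probabilistic sampler would have had to decide on trace a before its error span existed, which is exactly the risk described above.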
Metrics from spans

Span metrics allow you to generate metrics from your tracing data automatically. Span metrics aggregate request, error, and duration (RED) metrics from span data. Metrics are exported in Prometheus format.

There are two options available for exporting metrics: using remote write to a Prometheus compatible backend or serving the metrics locally and scraping them.

Span metrics generate a counter that counts requests and a histogram that records the durations of operations.

Span metrics can provide in-depth monitoring of your system. The generated metrics give application-level insight wherever tracing propagates through your applications.
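The aggregation itself can be sketched like this. The span_metrics function, the bucket bounds, and the span fields are assumptions for illustration; the sketch keeps plain per-bucket counts in memory and does not produce real Prometheus output (Prometheus histograms, for instance, use cumulative buckets).

```python
from bisect import bisect_left
from collections import Counter, defaultdict

BUCKETS_MS = [50, 100, 250, 500, 1000]  # hypothetical bucket upper bounds


def span_metrics(spans):
    """Aggregate request counts, error counts, and duration buckets per operation."""
    calls = Counter()
    errors = Counter()
    # One extra slot at the end for durations above the largest bound.
    histogram = defaultdict(lambda: [0] * (len(BUCKETS_MS) + 1))
    for span in spans:
        key = (span["service"], span["name"])
        calls[key] += 1
        if span["error"]:
            errors[key] += 1
        # Index of the first bucket whose upper bound is >= the duration.
        histogram[key][bisect_left(BUCKETS_MS, span["duration_ms"])] += 1
    return calls, errors, dict(histogram)


spans = [
    {"service": "cart", "name": "load-cart", "duration_ms": 40, "error": False},
    {"service": "cart", "name": "load-cart", "duration_ms": 120, "error": True},
    {"service": "cart", "name": "load-cart", "duration_ms": 2000, "error": False},
]
calls, errors, histogram = span_metrics(spans)
key = ("cart", "load-cart")
print(calls[key], errors[key], histogram[key])
```

From these three series you can derive request rate, error rate, and latency distribution per operation without touching the application code.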
Exporting spans

The Grafana Agent can export traces to multiple backends for every tracing pipeline. Exporting is built on the OpenTelemetry Collector's OTLP exporter, and the Agent supports exporting traces in OTLP format.

Aside from endpoint and authentication settings, the exporter also provides mechanisms for retrying failed sends and implements a queue that buffers data during transient failures, such as networking issues.
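How a queue and retries work together can be sketched as below. The QueuedExporter class, its parameters, and the flaky_send function are hypothetical and written only for this example; they do not mirror the OTLP exporter's real options or behaviour.

```python
import time
from collections import deque


class QueuedExporter:
    """Buffers batches in a bounded queue and retries failed sends with backoff."""

    def __init__(self, send, max_queue_size=100, max_retries=3,
                 initial_backoff_s=0.5, sleep=time.sleep):
        self.send = send
        self.queue = deque()
        self.max_queue_size = max_queue_size
        self.max_retries = max_retries
        self.initial_backoff_s = initial_backoff_s
        self.sleep = sleep
        self.dropped = 0

    def enqueue(self, batch) -> bool:
        if len(self.queue) >= self.max_queue_size:
            self.dropped += 1  # queue is full: drop the new batch
            return False
        self.queue.append(batch)
        return True

    def flush(self) -> int:
        """Send queued batches in order; return how many were delivered."""
        sent = 0
        while self.queue:
            if not self._send_with_retry(self.queue[0]):
                break  # leave the batch queued for a later flush
            self.queue.popleft()
            sent += 1
        return sent

    def _send_with_retry(self, batch) -> bool:
        backoff = self.initial_backoff_s
        for attempt in range(self.max_retries + 1):
            try:
                self.send(batch)
                return True
            except ConnectionError:
                if attempt < self.max_retries:
                    self.sleep(backoff)
                    backoff *= 2  # exponential backoff between attempts
        return False


attempts = {"count": 0}
delivered = []


def flaky_send(batch):
    """Simulated backend that fails the first two calls, then recovers."""
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise ConnectionError("simulated network issue")
    delivered.append(batch)


exporter = QueuedExporter(flaky_send, sleep=lambda seconds: None)  # no real waiting
exporter.enqueue(["span-1", "span-2"])
exporter.enqueue(["span-3"])
sent = exporter.flush()
print(sent, delivered)
```

The two simulated failures are absorbed by the retries, and nothing is lost because the batch stays at the head of the queue until a send succeeds.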

Looking into the future of Tempo

Currently, a trace in Tempo can only be looked up by its trace ID, and we can find the trace ID only through application logs and exemplars. That makes our tracing journey a bit tough, because we have to rely on other sources of data to find the trace ID.

Upcoming releases may bring native search to Tempo. The approach would be based on a query language inspired by PromQL and LogQL, and searching would be oriented towards finding traces rather than spans.

Exemplars are also due for some great updates: long-term storage, support for more languages and platforms, and automatic instrumentation and library support.

Tempo has improved a lot in scale and efficiency since its release. The internal tracing volume has increased from 170K to 2.2 million spans per second, and the goal is to reach 10 to 20 million spans per second. Tempo's efficiency, measured as spans per second per CPU core, has increased from 2K to 7.5K spans/s/core.