Distributed Tracing: A Powerful Approach to Debugging Complex Systems

Distributed tracing – idea

Imagine a customer complains about slow checkout times in an e-commerce app. With more than 50 microservices involved in the transaction, where should we start looking? It’s hard, isn’t it?

This is where distributed tracing shines. Among observability tools, distributed tracing stands out for providing end-to-end visibility into how requests traverse microservices, ETL pipelines, and other distributed systems, enabling teams to pinpoint bottlenecks and failures with precision.

In this post, we’ll dive into the what, why, and how of distributed tracing, complete with practical examples and tools to get started.


Understanding Distributed Tracing

Distributed tracing is a technique for monitoring and observing requests as they travel through the components of a distributed system, such as microservices or data pipelines. It provides a comprehensive view of system interactions, enabling debugging, performance optimization, and root cause analysis.


Tools required for Distributed Tracing

OpenTelemetry

OpenTelemetry is an open-source framework designed to provide observability for software systems by collecting, processing, and exporting telemetry data such as traces, metrics, and logs. It helps teams understand the health and performance of a system, making it easier to diagnose issues, optimize performance, and improve user experience.

It consists of a set of APIs, libraries, agents, and instrumentation for distributed tracing and metrics collection. A huge benefit of using OpenTelemetry is that it is not tied to any specific observability vendor: it can send telemetry data to a variety of backends such as Jaeger, Zipkin, Prometheus, Datadog, and others.
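
As a minimal sketch in Python (assuming the opentelemetry-sdk and OTLP exporter packages are installed, and a Jaeger instance accepting OTLP gRPC on an illustrative localhost endpoint), wiring the SDK to a backend looks roughly like this:

# Minimal OpenTelemetry setup sketch (Python). Assumes opentelemetry-sdk and
# opentelemetry-exporter-otlp are installed, and that Jaeger accepts OTLP gRPC
# on localhost:4317 (illustrative endpoint and service name).
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify this service in the backend UI.
resource = Resource.create({"service.name": "order-service"})

# Wire the SDK: provider -> batch processor -> OTLP exporter (vendor-neutral).
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("order.items", 3)  # example attribute

Swapping the backend means swapping the exporter; the instrumentation code itself stays unchanged.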

Tracing Backend

Common tracing backends used to collect tracing data include Jaeger and Zipkin. Jaeger is a distributed tracing system that gathers and visualizes trace data, enabling the monitoring and troubleshooting of complex applications and architectures. It can be utilized in any distributed system where tracing is beneficial. Jaeger offers tools to visualize traces, identify performance bottlenecks, and track the flow of requests across different components of the system.


How Request Tracing Works in Microservices

The main components of distributed tracing are the trace ID, spans, context propagation, and visualization:

  • Trace ID: A unique identifier for a single end-to-end request across services. Generated at the entry point of the system (e.g., by the orchestrator or an API gateway).
  • Span: An individual operation (e.g., a service call or a database query) within the trace. Each span includes metadata such as duration, service name, operation name, and tags.
  • Context Propagation: Trace IDs and other metadata are passed between services via headers (e.g., HTTP headers) or messaging queues (e.g., Kafka).
  • Visualization: Tools like Jaeger, Zipkin, or AWS X-Ray visualize traces to show how requests propagate through the system.
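
To make these pieces concrete, here is a small sketch (Python) that opens a parent span and a nested child span and prints their IDs; both share one trace_id while each gets its own span_id. It assumes a tracer configured as in the setup sketch above.

# Sketch: a parent span and a nested child span share one trace_id, while each
# gets its own span_id. For real (non-zero) IDs the SDK must be configured as
# in the setup sketch above.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("place_order") as parent:
    with tracer.start_as_current_span("charge_payment") as child:
        p_ctx, c_ctx = parent.get_span_context(), child.get_span_context()
        print(f"trace_id (shared): {p_ctx.trace_id:032x}")
        print(f"parent span_id:    {p_ctx.span_id:016x}")
        print(f"child span_id:     {c_ctx.span_id:016x}")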

Trace ID and Span ID propagation

Distributed Tracing Flow: Spans and Trace Context Propagation Across Microservices

Consider an e-commerce application where a request is made to place an order. It involves three microservices:

  1. Order Service
  2. Payment Service
  3. Inventory Service

The trace context flows through these services as follows:

  1. When the Order Service starts a trace, it generates a trace_id (the same for the entire transaction) and a span_id for its own span.
  2. The Order Service’s span_id is then passed to the Payment Service as the parent span. The Payment Service generates its own unique span_id (ab23cd67ef890123) for its span and passes the context on to the next service, the Inventory Service.

This hierarchy ensures the trace provides an accurate view of the request flow.
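
As a minimal sketch (Python, with illustrative service URLs and span names), this propagation can be done explicitly through the OpenTelemetry propagation API, which carries the trace_id and the caller’s span_id in the W3C traceparent header:

# Sketch of explicit trace-context propagation between two services.
# Service URLs and span names are illustrative.
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# --- Order Service: start the trace and pass the context downstream ---
def place_order():
    with tracer.start_as_current_span("order-service.place_order"):
        headers = {}
        inject(headers)  # adds the traceparent header (trace_id + caller span_id)
        requests.post("http://payment-service/pay", headers=headers)

# --- Payment Service: continue the same trace as a child span ---
def handle_payment(incoming_headers):
    ctx = extract(incoming_headers)  # rebuild the caller's trace context
    with tracer.start_as_current_span("payment-service.charge", context=ctx):
        ...  # charge the customer, then call the Inventory Service the same way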

OpenTelemetry provides an HTTP instrumentation library for Java (with equivalents for other languages such as Python and Go) that integrates seamlessly with RestTemplate, automatically propagating the trace context for distributed tracing.
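
In Python, a comparable (hedged) sketch would use the opentelemetry-instrumentation-requests package, which patches outgoing HTTP calls so the trace context is injected without any manual header handling:

# Sketch: auto-instrumenting outgoing HTTP calls in Python.
# Assumes the opentelemetry-instrumentation-requests package is installed.
import requests
from opentelemetry.instrumentation.requests import RequestsInstrumentor

RequestsInstrumentor().instrument()  # patch requests to propagate trace context

# Any call made with requests now carries the traceparent header automatically.
requests.get("http://inventory-service/stock/42")  # illustrative URL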


Distributed Tracing in Data Application Systems

Data applications often involve ETL (Extract, Transform, Load) pipelines, machine learning workflows, and real-time processing. These workflows include multiple interdependent tasks, such as:

  • Data extraction from databases.
  • Data transformation and cleaning.
  • Model training and evaluation.
  • Data storage or serving results to downstream systems.

Distributed tracing provides a detailed view of these operations, enabling better monitoring and debugging. Tracing tools like Jaeger and OpenTelemetry provide graphical timelines that display the sequence and duration of tasks. These visualizations help in understanding:

  • Execution order of tasks.
  • Parallelism in the workflow.
  • Idle times between tasks, indicating possible inefficiencies.

Example Application – Spark-based ETL pipeline

I implemented a simple Spark-based ETL pipeline, integrated with Docker, Airflow, and Jaeger, to demonstrate how distributed tracing can provide actionable insights into processing times and task dependencies.

Tools: OpenTelemetry, Airflow, Spark, Jaeger, Docker

  1. Integrate a Tracing Library: Use a library like OpenTelemetry to instrument the code, as in this example Spark job (a sketch of this kind of instrumentation appears below). For example:
    • Annotate tasks with span names (e.g., data_extraction, data_transformation).
    • Add contextual metadata to spans (e.g., dataset size, parameters), typically via span.set_attribute("key", value).
  2. Deploy a Tracing Backend: Set up a backend like Jaeger or Zipkin to collect and visualize traces. In this example, I used Jaeger.
  3. Analyze Traces: Use the tracing backend to inspect task durations, execution paths, and errors.

Spark job — instrumentation to enable tracing
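
The exact job lives in the repository; below is only a rough sketch of what this kind of instrumentation can look like in PySpark, with illustrative span names, paths, and attributes, assuming the same OTLP-to-Jaeger tracer setup shown earlier:

# Rough sketch of an instrumented PySpark ETL job (illustrative, not the exact
# job from the repository). Assumes the tracer/OTLP setup shown earlier; paths
# and column names are placeholders.
from opentelemetry import trace
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

tracer = trace.get_tracer("spark-etl")
spark = SparkSession.builder.appName("traced-etl").getOrCreate()

with tracer.start_as_current_span("etl_pipeline"):
    with tracer.start_as_current_span("data_extraction") as span:
        df = spark.read.csv("/data/input.csv", header=True)
        span.set_attribute("rows.read", df.count())

    with tracer.start_as_current_span("data_transformation") as span:
        cleaned = df.dropna().withColumn("amount", F.col("amount").cast("double"))
        span.set_attribute("rows.after_cleaning", cleaned.count())

    with tracer.start_as_current_span("data_load") as span:
        cleaned.write.mode("overwrite").parquet("/data/output/")
        span.set_attribute("output.format", "parquet")

spark.stop()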

Notes on setup: View traces in the Jaeger UI, available at http://localhost:16686.

Once the Docker containers are up and running, the Jaeger UI provides a detailed view of each task’s duration. This breakdown enables quick identification of bottlenecks within workflows.

Highlighted task annotations and durations captured in Jaeger UI

These insights are particularly critical for jobs with strict SLAs, as they allow teams to pinpoint and address delays proactively. With proper instrumentation in place, distributed tracing ensures that these performance metrics are readily available, fostering informed decisions for optimization.

If you’re interested in replicating (and extending) this setup locally, feel free to check out my GitHub repository for detailed instructions.


Why Aren’t Logging Systems Sufficient?

Distributed tracing may be overkill when there are only a few services, or when services rarely face heavy traffic. However, knowing the difference between logging systems and distributed tracing is still crucial.

In a microservices architecture, using logging systems like Splunk or Elasticsearch for monitoring and debugging is quite common. However, integrating distributed tracing with tools like OpenTelemetry and Jaeger offers significant advantages over relying solely on logs for monitoring.

Logs are the basic building blocks for monitoring. They are textual records created by an application (or middleware) that provide insights into the system’s behavior. These logs might contain error messages, stack traces, or business metrics, and they are usually exported to centralized systems like Splunk or Elasticsearch.

  • Logs often lack context across multiple services or requests.
  • Logs are typically isolated events; they do not provide a full end-to-end trace of a user request as it moves through microservices, nor do they show task dependencies in an intuitive way.
  • Manual correlation is required to establish the end-to-end flow of a given request across services.

Wrapping Up!

Distributed tracing offers invaluable insights into request propagation in complex systems such as microservices, task durations in ETL pipelines, dependencies in real-time processing, and machine learning workflows. The ability to visualize task dependencies and durations provides clarity on execution paths, helping to streamline processes, meet SLAs, and optimize resource utilization.

As a data platform software engineer, I can’t emphasize enough how game-changing this will be. With this information readily accessible, engineers will no longer need to waste time digging through logs. Instead, they can focus on addressing real issues and enhancing applications.

As always, this was a good learning experience for me, and I hope it serves a useful purpose for those who want to enable distributed tracing.