Observability — OpenTelemetry (OTel) and Related Technologies
Prerequisites: Familiarity with distributed systems concepts (microservices, HTTP/gRPC APIs, containers). Basic experience with at least one of Python, Node.js, or Go. Understanding of what a web server, database call, and message queue are. No prior observability tooling experience required.
What Is Observability and Why It Matters
Observability is the ability to understand a system's internal state by examining its external outputs — logs, metrics, and traces. The concept originates from control theory, formalized by Hungarian-American engineer Rudolf Kálmán in 1960: a system is observable if you can fully determine its internal state from its outputs alone. Applied to software, this means you shouldn't need to deploy new code or attach a debugger to understand what's happening inside your services.
This is a fundamentally different posture than traditional debugging. Debugging is reactive — something breaks, you form a hypothesis, reproduce the issue, and poke around until you find the cause. Observability is proactive. It lets you ask arbitrary, novel questions about your system's behavior without having anticipated them in advance. As Charity Majors, co-founder of Honeycomb, puts it:
"Observability is about being able to ask arbitrary questions about your production environment without having to know ahead of time what you wanted to ask."
The Mindmap of Observability
Before diving deeper, here's a high-level view of observability — what it is, why it matters, how it works, and the benefits it delivers.
```mermaid
mindmap
  root((Observability))
    Why
      Unknown unknowns
      Distributed complexity
      Ephemeral infrastructure
      Polyglot systems
    What
      External outputs
        Logs
        Metrics
        Traces
      Reveal internal state
    How
      Three pillars
      High-cardinality data
      High-dimensionality data
      Correlation across signals
    Benefits
      Faster MTTR
      Proactive debugging
      SLO-driven operations
      Ask novel questions
```
Why Observability Became Essential
A decade ago, monitoring was often enough. You had a handful of monolithic applications running on known servers. You could SSH into a box, tail a log file, and reason about what went wrong. The set of failure modes was relatively small, and you could write an alert for each one.
That world no longer exists. Modern systems are built from dozens or hundreds of microservices, running in ephemeral containers on platforms like Kubernetes. Services are written in different languages. A single user request may fan out across ten services, three databases, a message queue, and an external API — each with its own failure modes, latencies, and retry logic. The Google SRE book captures this shift well:
"Monitoring a complex distributed system is fundamentally different from monitoring a single machine or a small collection of machines."
The shift to microservices, containers, and cloud-native architectures created an explosion of distributed state that traditional monitoring simply cannot handle. Here's why:
- Microservices: A request's fate is determined by the interaction of many independent services, not a single process.
- Ephemeral containers: The server you want to debug may have already been replaced. There's no box to SSH into.
- Polyglot systems: Teams use different languages, frameworks, and data stores — each emitting different signals in different formats.
- Dynamic orchestration: Kubernetes reschedules pods, autoscalers add or remove instances, and load balancers shift traffic constantly.
The "Unknown Unknowns" Problem
Traditional monitoring is built around known failure modes. You know the database might run out of connections, so you alert on the connection pool. You know the disk might fill up, so you alert at 90% capacity. These are known unknowns — you don't know when they'll happen, but you know what to watch for.
The real danger in distributed systems is the unknown unknowns — failures you never anticipated. A subtle interaction between a retry storm in Service A, a slow garbage collection pause in Service B, and a connection pool exhaustion in Service C produces a cascading failure that no single alert would catch. You can't write dashboards for problems you've never imagined.
Monitoring tells you when something is broken. Observability helps you understand why — even for failure modes you never predicted. Monitoring is a subset of observability, not a replacement for it.
Ben Sigelman, co-creator of Dapper (Google's distributed tracing system) and co-founder of LightStep, frames the distinction sharply:
"Monitoring is for known-unknowns. Observability is for unknown-unknowns."
Observability vs. Debugging: A Comparison
| Dimension | Traditional Debugging | Observability |
|---|---|---|
| Posture | Reactive — triggered by an incident | Proactive — continuous understanding |
| Prerequisite | Reproduce the issue locally or in staging | Interrogate production data directly |
| Failure types | Known failure modes with predefined alerts | Novel, emergent failures (unknown unknowns) |
| Data model | Aggregated metrics, static dashboards | High-cardinality, high-dimensionality events |
| Iteration speed | Minutes to hours (deploy, reproduce, inspect) | Seconds (slice, dice, correlate live data) |
High-Cardinality, High-Dimensionality Data
Observability depends on a specific kind of data. It's not enough to know that your p99 latency spiked. You need to know which users, hitting which endpoints, from which regions, using which build version, experienced that latency spike. This is where cardinality and dimensionality come in.
What these terms mean
- High cardinality refers to the number of unique values a field can take. A `user_id` field with millions of unique values is high-cardinality. A `status_code` field with five possible values (200, 301, 400, 404, 500) is low-cardinality. Observability requires you to slice data by high-cardinality fields — individual user IDs, request IDs, container IDs, trace IDs — to isolate specific behaviors.
- High dimensionality refers to the number of fields (dimensions) attached to each event. A single request event might carry 50+ attributes: user ID, endpoint, HTTP method, status code, latency, region, deployment version, feature flags, database query count, and more. The more dimensions you capture, the more "angles" from which you can interrogate your system.
Traditional monitoring tools struggle with high-cardinality data because they rely on pre-aggregation. They compute averages and percentiles ahead of time, which destroys the ability to drill down into individual events. Observability tools store raw, high-cardinality, high-dimensionality events and let you query them on the fly.
When instrumenting your services, capture wide, structured events rather than narrow, pre-aggregated metrics. Every request event should carry as much context as possible — user ID, feature flags, build SHA, region. You can always aggregate later; you can't disaggregate data that was averaged at write time.
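The advice above can be sketched with nothing but the standard library. This is a minimal, hypothetical helper — the name `emit_request_event` and the exact field set are illustrative, not a real API — showing what "one wide, structured event per request" looks like in practice:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

def emit_request_event(user_id, endpoint, status_code, latency_ms, **dimensions):
    """Emit one wide, structured event per request (hypothetical helper)."""
    event = {
        "timestamp": time.time(),
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,
        "endpoint": endpoint,
        "status_code": status_code,
        "latency_ms": latency_ms,
        # Arbitrary extra dimensions: feature flags, build SHA, region, ...
        **dimensions,
    }
    log.info(json.dumps(event))
    return event

event = emit_request_event(
    "user-12345", "/checkout", 200, 87.5,
    build_sha="a1b2c3d", region="us-east-1", cart_size=4,
)
```

Because every dimension travels with the raw event, any of them can become a `GROUP BY` or filter later — no new metric, no redeploy.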
The Bottom Line
Observability isn't a product you buy or a dashboard you build — it's a property of your system. A system is observable when its telemetry data is rich enough and queryable enough that engineers can answer novel questions about production behavior without shipping new code. In the chapters ahead, you'll learn about the specific signals (traces, metrics, logs) and the tooling (OpenTelemetry) that make this possible.
Observability vs Traditional Monitoring: A Paradigm Shift
Monitoring and observability are often used interchangeably, but they represent fundamentally different philosophies about understanding production systems. The distinction matters because it changes how you design instrumentation, how you respond to incidents, and ultimately how well you can reason about complex distributed systems.
Traditional monitoring asks: "Is the system working?" Observability asks: "Why is the system not working for this specific user, this specific request, at this specific moment?" That difference — from aggregate health checks to fine-grained exploratory analysis — is the paradigm shift.
Known-Knowns vs Unknown-Unknowns
Traditional monitoring is built around known-knowns. You anticipate failure modes in advance, set up dashboards for them, and configure threshold-based alerts. CPU above 90%? Alert. Disk at 95%? Alert. Error rate above 1%? Page someone. This works when you can predict what will go wrong.
The problem is that modern distributed systems fail in ways you cannot predict. A single request might traverse dozens of microservices, hit multiple caches and databases, and pass through load balancers, service meshes, and API gateways. When a user in São Paulo reports that checkout is slow on Tuesdays at 3pm, no pre-built dashboard will show you why.
Observability addresses unknown-unknowns — the failures you didn't anticipate. Instead of pre-defining what questions you can ask, you instrument your systems to emit rich, high-cardinality telemetry data. Then you explore that data interactively, slicing and dicing by any dimension: user ID, request ID, geographic region, feature flag, tenant, cart size — whatever the investigation demands.
Cardinality refers to the number of unique values a field can have. A status_code field (200, 404, 500) is low-cardinality. A user_id field with millions of unique values is high-cardinality. Traditional monitoring tools choke on high-cardinality data because they pre-aggregate metrics. Observability platforms are designed to handle it — and that's precisely what enables you to debug individual user experiences.
The Key Differences
The following table captures the core distinctions between the two approaches. These aren't just aesthetic differences — they shape tooling choices, team workflows, and incident response times.
| Dimension | Traditional Monitoring | Observability |
|---|---|---|
| Core question | "Is the system up?" | "Why is it broken for this request?" |
| Failure model | Known-knowns (anticipated failures) | Unknown-unknowns (novel failures) |
| Data model | Pre-aggregated metrics, low-cardinality tags | High-cardinality events, traces, structured logs |
| Query style | Pre-built dashboards, static queries | Ad-hoc exploratory queries, drill-down |
| Alert philosophy | Threshold-based (CPU > 90%, errors > 1%) | SLO-based (error budget burn rate) |
| Debugging workflow | Check dashboard → scan logs → guess | Start from symptom → slice by dimensions → find root cause |
| Instrumentation | Agent-based, black-box, per-component | Code-level, white-box, request-scoped |
| Cost driver | Number of hosts/services monitored | Volume and cardinality of telemetry data |
| Scalability | Works well for monoliths and small clusters | Designed for distributed microservice architectures |
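To make the SLO-based alerting row concrete, here is a minimal sketch of an error-budget burn-rate check. The 99.9% target and the 14.4× fast-burn threshold are illustrative values (the threshold is commonly associated with multiwindow burn-rate alerting), not a fixed standard:

```python
SLO_TARGET = 0.999             # e.g. 99.9% of requests must succeed
ERROR_BUDGET = 1 - SLO_TARGET  # so 0.1% of requests may fail

def burn_rate(errors: int, requests: int) -> float:
    """How fast the error budget is being consumed; 1.0 = exactly on budget."""
    if requests == 0:
        return 0.0
    return (errors / requests) / ERROR_BUDGET

# A threshold-based monitor pages on any error-rate blip; an SLO-based
# monitor pages only when the budget is burning dangerously fast.
rate = burn_rate(errors=20, requests=1000)  # 2% errors against a 0.1% budget
should_page = rate > 14.4                   # illustrative fast-burn threshold
```

A burn rate of 1.0 would exhaust the budget exactly at the end of the SLO window; a sustained burn rate of 20 exhausts it twenty times faster, which is worth a page.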
The Evolution: SNMP to Modern Observability
The shift didn't happen overnight. Monitoring tooling has evolved through distinct generations, each responding to the limitations of the previous era and the growing complexity of infrastructure.
Generation 1: SNMP and Network Monitoring (1990s)
Simple Network Management Protocol (SNMP) gave operators a way to poll network devices for health data — interface traffic, error counts, uptime. Tools like MRTG generated graphs from SNMP data. This was sufficient when "infrastructure" meant routers, switches, and a handful of servers. The data model was entirely device-centric: you monitored boxes, not applications or user experiences.
Generation 2: Nagios and Check-Based Monitoring (2000s)
Nagios introduced the concept of service checks — small scripts that test whether something is working and return OK, WARNING, or CRITICAL. This was a leap forward because it let you monitor application-level concerns (is the web server responding? is the database accepting connections?). But Nagios was fundamentally binary: things were either green or red. It had no concept of trends, no way to answer "is this getting worse?", and scaling it to hundreds of services required heroic configuration management.
Generation 3: Graphite, StatsD, and Metrics Aggregation (2010s)
The rise of StatsD (born at Etsy in 2011) and Graphite gave engineers the ability to emit custom metrics from application code. For the first time, you could track business-level signals — orders per minute, payment processing latency, cache hit ratios — alongside infrastructure metrics. This era also brought time-series databases (InfluxDB, Prometheus) and sophisticated dashboarding (Grafana). The limitation? Everything was still pre-aggregated. You could see average request latency, but not the latency of a specific request. You could see error rates by endpoint, but not by individual user.
Generation 4: Modern Observability Platforms (2018–present)
Modern observability platforms (Honeycomb, Jaeger, Tempo, the OpenTelemetry ecosystem) store individual events and traces rather than pre-aggregated metrics. This preserves the full context of each request, enabling you to GROUP BY any field, filter to any dimension, and follow a single request across your entire distributed system. The three pillars — traces, metrics, and logs — are correlated, so you can jump from a latency spike on a dashboard directly into the specific traces that caused it.
The Dashboard-Driven Trap
If your incident response starts with "let me check the dashboards," you're likely operating in a monitoring mindset rather than an observability mindset. Dashboards aren't inherently bad, but over-reliance on them creates three distinct failure modes:
1. Alert Fatigue
Threshold-based alerts on pre-aggregated metrics produce noise. A brief CPU spike at 3am that self-resolves, a momentary uptick in 5xx errors during a deploy — these generate pages that train on-call engineers to ignore alerts. Once a team fields more than a few pages per on-call shift, engineers start treating every alert as low-priority. The signal drowns in noise.
2. Dashboard Rot
Dashboards proliferate over time. Someone creates one during an incident, another team clones it with modifications, a third version gets built for a quarterly review. Within a year, you have dozens of dashboards — many showing stale metrics for services that have been renamed or decommissioned. Nobody knows which dashboard to trust during an incident, so engineers waste critical minutes hunting for the "right" one.
3. The Inability to Ask New Questions
This is the most fundamental limitation. A dashboard can only answer the questions it was designed to answer. If your checkout-latency dashboard breaks down by endpoint and region, you cannot suddenly ask "what's the latency for users with more than 50 items in their cart?" without modifying instrumentation, adding a new metric, deploying it, and waiting for data to accumulate. In an observability-first system, that query is immediate — because the raw event data already contains the cart-size attribute.
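That immediacy comes from querying raw events directly. A toy sketch with an in-memory list of wide request events (the data is made up) shows why the cart-size question needs no new instrumentation:

```python
# Raw, wide request events — each one carries many dimensions, including
# cart_size, because it was captured at write time.
events = [
    {"endpoint": "/checkout", "region": "sa-east-1", "cart_size": 55, "latency_ms": 2400},
    {"endpoint": "/checkout", "region": "us-east-1", "cart_size": 3,  "latency_ms": 95},
    {"endpoint": "/checkout", "region": "sa-east-1", "cart_size": 72, "latency_ms": 3100},
    {"endpoint": "/cart",     "region": "eu-west-1", "cart_size": 51, "latency_ms": 180},
]

# The "new question" — checkout latency for carts with more than 50 items —
# is just a filter over data that already exists.
big_cart_checkouts = [
    e for e in events
    if e["endpoint"] == "/checkout" and e["cart_size"] > 50
]
worst_latency = max(e["latency_ms"] for e in big_cart_checkouts)
```

A real observability backend does the same thing at scale: it stores the raw events and evaluates the filter at query time instead of aggregation time.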
Don't throw away your dashboards. They're valuable for situational awareness — a quick glance at system health. The mistake is treating them as your primary debugging tool. Use dashboards to detect that something is wrong; use observability tooling to figure out what and why.
From Reactive Firefighting to Proactive Understanding
The most impactful change observability enables isn't technical — it's cultural. With traditional monitoring, the incident lifecycle looks like this: alert fires → engineer opens dashboards → engineer scans logs → engineer forms a hypothesis → engineer deploys a fix → hope it works. Each step is reactive, and the debugging process relies heavily on tribal knowledge ("oh, when that dashboard looks like that, it usually means the database connection pool is exhausted").
With proper observability, the workflow shifts. You start from a symptom — a spike in error budget burn rate, a slow trace, a user report — and explore the telemetry data interactively. You don't need to know in advance which service is at fault. You slice the data by attributes, compare baselines, and follow the evidence. This is closer to scientific investigation than pattern matching.
More importantly, observability enables proactive work. Because you can query your telemetry data freely, you can ask questions before incidents happen: "Are there any endpoints where p99 latency has been creeping up over the past week?" or "Which tenants are seeing the highest error rates, even if overall error rates look healthy?" These questions surface problems before they become pages — and that's the real paradigm shift.
You don't need to rip out all your monitoring and replace it overnight. Start by adding rich, high-cardinality instrumentation (using OpenTelemetry) to one critical service. Practice exploratory debugging with that data. As the team builds muscle memory for the observability workflow, expand to other services incrementally.
The Three Pillars: Logs, Metrics, and Traces
Observability rests on three complementary signal types — logs, metrics, and traces. Each one captures a different facet of what your system is doing at any given moment. Understanding what each pillar does well (and what it doesn't) is the key to building systems you can actually debug under pressure.
```mermaid
graph LR
    U["👤 User Request"] --> GW["API Gateway"]
    GW --> SVC["Application Service"]
    SVC --> L["📝 Logs<br/>Discrete event records"]
    SVC --> M["📊 Metrics<br/>Numeric aggregates"]
    SVC --> T["🔗 Traces<br/>Distributed spans"]
    L --> OB["Observability Backend<br/>(Correlation & Analysis)"]
    M --> OB
    T --> OB
    OB --> D["Dashboards & Alerts"]
    OB --> I["Investigation & Root Cause"]
    style L fill:#2d6a4f,stroke:#40916c,color:#fff
    style M fill:#e76f51,stroke:#f4a261,color:#fff
    style T fill:#457b9d,stroke:#a8dadc,color:#fff
    style OB fill:#6c567b,stroke:#c9b1d0,color:#fff
```
Logs: The Narrative Record
A log is a discrete, timestamped record of something that happened — a request arrived, a database query ran, an error was thrown. Logs are the oldest observability signal and the one most developers reach for first. They give you the why behind a problem: the full error message, the malformed input, the stack trace.
Logs come in two flavors: unstructured (free-form text) and structured (key-value pairs, typically JSON). Structured logs are dramatically easier to search and aggregate, which is why every modern logging library defaults to them.
```
2024-03-15 14:32:07 ERROR PaymentService - Failed to charge card ending 4242 for order #8812: gateway timeout after 30s
```

```json
{
  "timestamp": "2024-03-15T14:32:07.341Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "span-789",
  "order_id": "8812",
  "card_last4": "4242",
  "error": "gateway_timeout",
  "duration_ms": 30004,
  "message": "Failed to charge card: gateway timeout"
}
```
Notice the structured version includes a trace_id and span_id. These fields are what let you correlate a log entry with a specific distributed trace — a connection that becomes critical during incident investigation.
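Injecting those IDs doesn't require anything exotic. Here is a standard-library-only sketch using a `logging.Filter` and a `ContextVar`; in a real OpenTelemetry setup the IDs would come from the active span context rather than a hand-set variable:

```python
import contextvars
import logging

# Stand-in for the active trace context; a tracing SDK would set this
# per request. The default "none" marks logs outside any trace.
current_trace_id = contextvars.ContextVar("trace_id", default="none")

class TraceContextFilter(logging.Filter):
    """Copy the active trace_id onto every log record so the formatter
    can emit it alongside the message."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(
    '{"level": "%(levelname)s", "trace_id": "%(trace_id)s", '
    '"message": "%(message)s"}'))

log = logging.getLogger("payment-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

current_trace_id.set("abc123def456")
log.error("Failed to charge card: gateway timeout")
```

Every log line the service emits during that request now carries the trace ID, so a single search in the log backend recovers the full narrative for one distributed trace.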
Standard Severity Levels
Most logging frameworks follow a common severity hierarchy. The OpenTelemetry specification defines severity numbers that map onto these familiar levels:
| Level | When to Use | Example |
|---|---|---|
| `TRACE` / `DEBUG` | Development-time detail; disabled in production | SQL query text, request/response bodies |
| `INFO` | Normal operational events worth recording | Server started, order placed, user logged in |
| `WARN` | Unexpected but recoverable conditions | Retry succeeded on second attempt, cache miss |
| `ERROR` | Failures that affect a single operation | Payment declined, upstream 5xx response |
| `FATAL` | Unrecoverable failures; process is shutting down | Out of memory, configuration missing |
Metrics: The Quantitative Pulse
Metrics are numeric measurements aggregated over time. Unlike logs (one record per event), metrics compress thousands of events into a single number: "there were 1,247 requests in the last minute" or "p99 latency is 340ms." This compression makes metrics extremely cheap to store and fast to query — perfect for dashboards and alerting.
There are three fundamental metric types you'll encounter everywhere:
| Type | What It Measures | Examples |
|---|---|---|
| Counter | Monotonically increasing total | Total requests served, total errors, bytes transferred |
| Gauge | Point-in-time value that goes up or down | Current CPU usage, active connections, queue depth |
| Histogram | Distribution of values across buckets | Request latency distribution, response size distribution |
Here's what emitting these looks like in practice with the OpenTelemetry SDK:
```python
from opentelemetry import metrics

meter = metrics.get_meter("checkout-service")

# Counter — total number of checkout attempts
checkout_counter = meter.create_counter(
    name="checkout.attempts",
    description="Total checkout attempts",
)

# Histogram — latency distribution of checkout operations
checkout_duration = meter.create_histogram(
    name="checkout.duration",
    unit="ms",
    description="Time taken to complete checkout",
)

# UpDownCounter — a gauge-like instrument for values that rise and fall
queue_depth = meter.create_up_down_counter(
    name="checkout.queue_depth",
    description="Items waiting in the checkout queue",
)
```
A spike in your checkout.duration histogram tells you something is slow, but it won't tell you which service or which database query is the bottleneck. That's where traces come in.
Traces: The Distributed Story
A distributed trace follows a single request as it moves across service boundaries — from the frontend to the API gateway, through the order service, into the payment provider, and back. Each segment of work is called a span, and spans are linked together by a shared trace ID to form a tree that represents causality and timing.
Traces answer the question "where did the time go?" With a single trace, you can see that a checkout request took 3.2 seconds total, and 2.8 seconds of that was spent waiting for the payment gateway.
```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def process_checkout(order):
    with tracer.start_as_current_span("process_checkout") as span:
        span.set_attribute("order.id", order.id)
        span.set_attribute("order.total", order.total)

        # Child span for inventory check
        with tracer.start_as_current_span("check_inventory"):
            inventory_ok = inventory_service.check(order.items)

        # Child span for payment processing
        with tracer.start_as_current_span("charge_payment") as pay_span:
            pay_span.set_attribute("payment.method", order.payment_method)
            result = payment_service.charge(order)

        return result
```
This code produces a trace with three spans: a parent process_checkout span and two children (check_inventory and charge_payment). In a trace visualization tool like Jaeger or Grafana Tempo, you'd see these as a waterfall, instantly revealing which operation consumed the most time.
How the Pillars Complement Each Other
No single signal type is sufficient on its own. Each pillar answers a different question during an incident, and the real power comes from using them together:
| Signal | Question Answered | Strength | Limitation |
|---|---|---|---|
| Metrics | "Is something wrong?" | Cheap, fast, great for alerting | No detail about individual requests |
| Traces | "Where is the problem?" | Shows causality across services | Typically sampled; doesn't capture every request |
| Logs | "Why did this happen?" | Rich detail, full context | Expensive at scale, hard to aggregate |
The investigation flow is almost always the same: metrics fire an alert → you use traces to pinpoint the slow or failing service → you read the logs from that service to understand the root cause. This is sometimes summarized as: metrics tell you something is wrong, traces show you where, logs tell you why.
Practical Example: Diagnosing a Slow Checkout
Let's walk through a realistic incident to see how all three pillars work together. Imagine you run an e-commerce platform, and customers start complaining that checkout is slow.
1. Metrics alert fires. Your dashboard shows that `checkout.duration` p99 has jumped from 400ms to 3,200ms. The `checkout.attempts` counter is steady, so traffic hasn't spiked — the slowness is internal. You also notice `payment_service.error_rate` has climbed from 0.1% to 12%.

2. Traces reveal the bottleneck. You open your tracing UI and filter for slow checkout traces (duration > 2s). The waterfall view makes it obvious: the `charge_payment` span is taking 2,800ms instead of the usual 150ms. Upstream spans like `check_inventory` are fine.

3. Logs explain the root cause. You copy the `trace_id` from a slow trace and search your log backend. The payment service logs tell the full story:

   ```json
   {
     "timestamp": "2024-03-15T14:32:07.341Z",
     "level": "WARN",
     "service": "payment-service",
     "trace_id": "abc123def456",
     "message": "Connection pool exhausted, waiting for available connection",
     "pool_size": 10,
     "active_connections": 10,
     "wait_time_ms": 2650
   }
   {
     "timestamp": "2024-03-15T14:32:09.991Z",
     "level": "ERROR",
     "service": "payment-service",
     "trace_id": "abc123def456",
     "message": "Payment gateway request failed: TLS handshake timeout",
     "gateway_host": "api.payments.example.com",
     "retry_attempt": 2
   }
   ```

   Root cause: the payment gateway is experiencing TLS handshake slowdowns, which exhausts the connection pool and cascades into timeouts for all checkout requests.
Without metrics, you wouldn't know there was a problem until customers complained. Without traces, you'd be guessing which service was slow. Without logs, you'd know the payment service was the bottleneck but not that it was a connection pool issue triggered by TLS handshake failures.
The Emerging Fourth Signal: Profiles
Continuous profiling is increasingly recognized as the fourth pillar of observability. While traces show you that a service is slow, profiles show you exactly which function or line of code is consuming CPU, allocating memory, or blocking on I/O. Tools like Pyroscope, Parca, and Grafana's profiling integration let you attach flame graphs directly to trace spans.
OpenTelemetry added profiling support as a signal type in 2024, cementing its place alongside logs, metrics, and traces. The investigation chain becomes: metrics → traces → profiles → logs — narrowing from "something is wrong" all the way down to a specific hot code path.
If you're building observability from scratch, prioritize metrics (for alerting) and traces (for debugging) first. Add structured logging with trace context correlation next. Continuous profiling is valuable but addresses a narrower set of problems — add it once the first three pillars are solid.
OpenTelemetry: Origin Story and Project Structure
OpenTelemetry didn't appear from nowhere. It's the product of a hard-won merger between two competing open-source observability projects — OpenTracing and OpenCensus — each of which solved part of the instrumentation puzzle but neither of which could win the ecosystem alone. Understanding that history explains many of OTel's design decisions today.
The Two Predecessors
OpenTracing (2016)
OpenTracing launched in 2016 as a vendor-neutral API specification for distributed tracing. It defined a standard set of interfaces — spans, contexts, and propagation — that library authors could instrument against without coupling their code to a specific tracing backend like Jaeger, Zipkin, or Datadog. The key idea: instrument once, choose your backend later.
OpenTracing was API-only. It didn't ship an SDK, a collector, or any wire protocol. Vendors and open-source projects provided the actual implementations. This made it lightweight but also meant there was no "batteries included" experience — you always needed a third-party library to actually do anything with the traces.
OpenCensus (2018)
OpenCensus came out of Google in 2018 and took a different approach. Rather than being API-only, it provided a complete package: API, SDK, and built-in exporters for both tracing and metrics. You could drop OpenCensus into your Go or Java service and immediately start shipping data to Prometheus, Stackdriver, Jaeger, or Zipkin without stitching together multiple libraries.
The trade-off was tighter coupling. OpenCensus was more opinionated — the exporters lived in-tree, and the project's scope was broader. It also originated from Google's internal Census library, which meant it carried certain design assumptions from that world.
Why They Merged
By 2018, the observability community had a real problem: two credible, CNCF-adjacent projects solving overlapping problems with incompatible APIs. Library maintainers had to choose which standard to instrument for — or worse, instrument for both. Vendors had to support two integration paths. End users were confused about which project to adopt.
The fragmentation created three specific pain points:
- Library authors couldn't pick a winner. If you maintained a popular HTTP framework, instrumenting for OpenTracing meant OpenCensus users got nothing (and vice versa).
- Duplicate engineering effort. Both projects were building context propagation, span management, and exporter pipelines — the same fundamental problems solved twice.
- Vendor fatigue. Backend vendors (Jaeger, Datadog, Lightstep, etc.) had to maintain parallel integrations for two standards that were supposed to reduce fragmentation in the first place.
In May 2019, the two projects announced their merger into OpenTelemetry. The new project entered the CNCF as a Sandbox project, aiming to combine OpenTracing's clean API design with OpenCensus's batteries-included approach. The goal: a single, unified standard for all telemetry signals — traces, metrics, and eventually logs.
OpenTelemetry explicitly provides bridge packages for both OpenTracing and OpenCensus. If you have existing instrumentation using either predecessor, you can migrate incrementally — you don't have to rip and replace everything at once.
Timeline of Major Milestones
| Date | Milestone | Significance |
|---|---|---|
| Nov 2016 | OpenTracing joins CNCF | First vendor-neutral tracing API standard |
| Jan 2018 | OpenCensus launched by Google | Combined tracing + metrics with built-in exporters |
| May 2019 | OpenTelemetry merger announced | Two projects unite under one umbrella |
| May 2019 | CNCF Sandbox admission | Official CNCF home for the merged project |
| Aug 2021 | CNCF Incubation | Recognized project maturity and adoption |
| Feb 2022 | Tracing specification reaches GA | Stable APIs/SDKs for traces across major languages |
| May 2023 | Metrics specification reaches GA | Stable APIs/SDKs for metrics (Go, Java, .NET first) |
| Apr 2024 | Logs specification reaches Stable | All three signals now stable — the "triple play" |
| 2024 | 2nd most active CNCF project | Behind only Kubernetes in contributor activity |
Project Structure
OpenTelemetry is not a single repository or a single binary. It's an ecosystem of coordinated components, all governed by a central specification. Understanding how these pieces fit together is essential before you start instrumenting anything.
```mermaid
graph TD
    SPEC["📜 OTel Specification<br/>(Language-agnostic rules)"]
    SEMCONV["📖 Semantic Conventions<br/>(Standard attribute names)"]
    OTLP["📡 OTLP Protocol<br/>(Wire format for telemetry)"]
    SDK_GO["Go SDK"]
    SDK_JAVA["Java SDK"]
    SDK_PY["Python SDK"]
    SDK_JS["JS/TS SDK"]
    SDK_DOTNET[".NET SDK"]
    SDK_OTHER["Ruby, Rust, C++, …"]
    COLLECTOR["🔧 OTel Collector<br/>(Receive, process, export)"]
    CONTRIB["📦 Contrib Packages<br/>(Community extensions)"]
    SPEC --> SEMCONV
    SPEC --> OTLP
    SPEC --> SDK_GO
    SPEC --> SDK_JAVA
    SPEC --> SDK_PY
    SPEC --> SDK_JS
    SPEC --> SDK_DOTNET
    SPEC --> SDK_OTHER
    SPEC --> COLLECTOR
    OTLP --> COLLECTOR
    OTLP --> SDK_GO
    OTLP --> SDK_JAVA
    SEMCONV --> SDK_GO
    SEMCONV --> SDK_JAVA
    SEMCONV --> SDK_PY
    SEMCONV --> SDK_JS
    COLLECTOR --> CONTRIB
    SDK_GO --> CONTRIB
    SDK_JAVA --> CONTRIB
```
The Specification
Everything in OpenTelemetry starts with the specification — a language-agnostic document that defines how telemetry data should be created, processed, and exported. It specifies API interfaces (what instrumentation authors call), SDK behavior (how telemetry data is processed and batched), and data model semantics (what a span, metric point, or log record looks like).
Language implementations must conform to the specification. This is what makes OpenTelemetry truly portable: a span created in a Go service and a span created in a Python service share the same structural contract, enabling seamless end-to-end distributed traces.
Language-Specific APIs and SDKs
Each supported language has its own repository containing two layers:
- API — A minimal, zero-dependency interface that library authors instrument against. The API is safe to call even when no SDK is installed (it becomes a no-op).
- SDK — The full implementation that application owners install. It handles sampling, batching, resource detection, and export. You configure the SDK at application startup.
The major language implementations (Go, Java, Python, JavaScript/TypeScript, .NET) have reached GA stability for traces and metrics. Others like Ruby, Rust, C++, Swift, and Erlang/Elixir are in varying stages of maturity.
OTLP — OpenTelemetry Protocol
OTLP is OpenTelemetry's native wire protocol for transmitting telemetry data. It supports all three signals (traces, metrics, logs) over both gRPC and HTTP/protobuf transports. OTLP is now widely supported — most observability backends (Grafana, Datadog, Honeycomb, New Relic, Dynatrace, etc.) accept OTLP natively, which means you can often skip vendor-specific exporters entirely.
Semantic Conventions
Semantic conventions define standard names and values for telemetry attributes. For example, an HTTP server span should use http.request.method for the HTTP method, url.path for the path, and server.port for the port number. Without these conventions, every team would invent their own attribute names, making cross-service queries and dashboards a nightmare.
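For instance, an HTTP server span's attributes under the stable HTTP semantic conventions would look like the following. This is a plain dict for illustration; the attribute names come from the conventions, while the values are made-up example data:

```python
# Attribute names from the stable HTTP semantic conventions;
# the values here are illustrative example data.
http_server_attributes = {
    "http.request.method": "GET",
    "url.path": "/users/42",
    "url.scheme": "https",
    "server.port": 443,
    "http.response.status_code": 200,
    "http.route": "/users/{id}",  # low-cardinality route template, not the raw path
}
```

Because every service emits these exact names, a single backend query like `http.response.status_code >= 500` works across your whole fleet regardless of language or framework.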
The Collector
The OTel Collector is a standalone binary that acts as a telemetry pipeline. It receives data (from SDKs or other sources), processes it (filtering, batching, enriching, sampling), and exports it to one or more backends. You can deploy it as a sidecar, a DaemonSet, or a standalone gateway. The Collector decouples your application's instrumentation from your backend choice — your app always sends OTLP to the Collector, and the Collector handles the rest.
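As a sketch, a minimal Collector pipeline configuration might look like this. The component names (`otlp` receiver, `batch` processor, `otlphttp` and `debug` exporters) are standard core components; the endpoint values are placeholders you would replace with your own:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318  # placeholder backend
  debug: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp, debug]
```

The pipeline reads top to bottom: data arrives at a receiver, flows through processors in order, and fans out to every listed exporter.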
Contrib Repositories
Each language SDK and the Collector have a corresponding -contrib repository. These contain community-maintained extensions: auto-instrumentation libraries for popular frameworks (Express, Flask, Spring), additional exporters (Prometheus, Zipkin), and Collector processors/receivers contributed by vendors and the community. The contrib repos keep the core lean while enabling a broad ecosystem.
When evaluating language SDK maturity, check the OTel status page. Each signal (traces, metrics, logs) has its own stability level per language — "stable" in one language doesn't automatically mean stable in all.
Governance and the Path to CNCF Graduation
OpenTelemetry's governance is designed for a project of its scale — over 1,000 contributors across dozens of repositories. Three layers keep things organized:
- Technical Committee (TC) — A small elected group that owns the specification and resolves cross-cutting technical decisions. They ensure consistency across language implementations.
- Governance Committee (GC) — Handles non-technical project governance: community health, contributor experience, CNCF relationship, and project-wide policies.
- Special Interest Groups (SIGs) — Each language implementation, the Collector, and major feature areas (like semantic conventions or logging) have their own SIG with dedicated maintainers and regular meetings. SIGs operate with significant autonomy within the bounds of the spec.
As of 2024, OpenTelemetry is the 2nd most active CNCF project (behind Kubernetes) by contributor count and commit volume. With all three telemetry signals now stable, the project is on the path toward CNCF Graduation — the highest maturity level, signaling production readiness and sustainable governance. The project has completed or is finalizing due diligence for this milestone.
CNCF Graduation requires an independent security audit, a documented governance model, and demonstrated adoption by multiple organizations. OpenTelemetry's adoption — spanning cloud providers, SaaS vendors, and enterprises — positions it well for this milestone.
The API/SDK Separation: OTel's Key Design Decision
OpenTelemetry's most important architectural choice isn't about how it collects data — it's about how it separates concerns. The project splits itself into two distinct layers: a lightweight API for instrumenting code, and a heavier SDK for processing and exporting telemetry. This separation is what makes OTel viable as a universal standard rather than just another vendor library.
The API is a thin, stable interface that defines how you create spans, record metrics, and emit logs. By itself, it does nothing — every call is a no-op. The SDK is the engine that gives those calls meaning: it collects spans, batches metrics, and ships everything to your backend of choice. Understanding why these are separate packages is the key to understanding how OTel works in practice.
The Architecture at a Glance
The layered design ensures that instrumentation code (in libraries and applications) only depends on the API, while the SDK — with its heavier dependencies on exporters, processors, and configuration — is wired up once at the application's entry point.
graph TD
subgraph Your Code
APP["Application Code"]
LIB["Library Code<br/>(e.g., HTTP client, ORM)"]
end
subgraph OTel API Layer
TAPI["Tracer API"]
MAPI["Meter API"]
LAPI["Logger API"]
end
subgraph OTel SDK Layer
TP["TracerProvider"]
MP["MeterProvider"]
LP["LoggerProvider"]
SP["SpanProcessors"]
MR["MetricReaders"]
end
subgraph Exporters
OTLP["OTLP Exporter"]
JAEG["Jaeger Exporter"]
PROM["Prometheus Exporter"]
CONS["Console Exporter"]
end
APP -->|"instruments with"| TAPI
APP -->|"instruments with"| MAPI
LIB -->|"instruments with"| TAPI
LIB -->|"instruments with"| MAPI
LIB -->|"instruments with"| LAPI
TAPI -->|"delegates to"| TP
MAPI -->|"delegates to"| MP
LAPI -->|"delegates to"| LP
TP --> SP
MP --> MR
SP --> OTLP
SP --> JAEG
MR --> PROM
SP --> CONS
Why Separate API from SDK?
The separation solves a real dependency management problem. Imagine you maintain a popular HTTP client library and want to add tracing. If you depend on a full SDK, you've just forced every consumer of your library to pull in OTLP exporters, gRPC dependencies, and configuration machinery — even if they don't want telemetry at all. That's a non-starter for library authors.
With the API/SDK split, your library depends only on the API package — a handful of interfaces with zero transitive dependencies. Application owners then choose whether to install the SDK. If they do, your library's instrumentation lights up. If they don't, every tracing call silently does nothing with near-zero overhead.
| Concern | API Package | SDK Package |
|---|---|---|
| Who uses it | Library authors & app developers | Application owners (at the entry point) |
| Dependencies | Near-zero (interfaces only) | Heavier (exporters, processors, gRPC/HTTP) |
| Stability | Very stable — rarely changes | Evolves more frequently |
| Default behavior | No-op (does nothing) | Configurable pipelines |
| Vendor lock-in | None | None (but you choose exporters here) |
The Provider Bridge: TracerProvider, MeterProvider, LoggerProvider
The API and SDK are connected through providers. Each telemetry signal has one: TracerProvider for traces, MeterProvider for metrics, and LoggerProvider for logs. The API defines provider interfaces; the SDK supplies concrete implementations.
When your code calls tracer.start_span("process-order"), the API looks up the globally registered TracerProvider. If an SDK has been configured, that provider creates a real span with timing data, attributes, and context propagation. If no SDK is installed, the global provider is a no-op implementation that returns a dummy span — no allocations, no side effects, no overhead.
The no-op default isn't a fallback — it's a deliberate design goal. It means any library in the ecosystem can add OTel instrumentation without imposing runtime cost on users who haven't opted into observability. The API package typically adds less than 100 KB to your dependency tree, and the no-op code paths are optimized to be effectively free.
In Practice: Library Code vs. Application Code
The split creates two distinct roles in any codebase. Library code instruments with the API only — it creates spans and records metrics but never configures where that data goes. Application code (your main.py or index.ts) sets up the SDK: it chooses providers, attaches exporters, and defines resource attributes that identify the service.
Library Code — API Only
A library author depends solely on the API package. The code creates spans and records metrics without any knowledge of how (or whether) they'll be exported.
# my_http_library/client.py
# Depends ONLY on: opentelemetry-api
from opentelemetry import trace
# Get a tracer scoped to this library's name and version
tracer = trace.get_tracer("my-http-library", "1.2.0")
def fetch(url: str, method: str = "GET") -> Response:
    with tracer.start_as_current_span(
        f"{method} {url}",
        kind=trace.SpanKind.CLIENT,
        attributes={"http.method": method, "http.url": url},
    ) as span:
        response = _do_request(url, method)
        span.set_attribute("http.status_code", response.status_code)
        if response.status_code >= 400:
            span.set_status(trace.Status(trace.StatusCode.ERROR))
        return response
// my-http-library/src/client.ts
// Depends ONLY on: @opentelemetry/api
import { trace, SpanKind, SpanStatusCode } from "@opentelemetry/api";
// Get a tracer scoped to this library's name and version
const tracer = trace.getTracer("my-http-library", "1.2.0");
export async function fetch(url: string, method = "GET"): Promise<Response> {
  return tracer.startActiveSpan(
    `${method} ${url}`,
    { kind: SpanKind.CLIENT, attributes: { "http.method": method, "http.url": url } },
    async (span) => {
      try {
        const response = await doRequest(url, method);
        span.setAttribute("http.status_code", response.status);
        if (response.status >= 400) {
          span.setStatus({ code: SpanStatusCode.ERROR });
        }
        return response;
      } finally {
        // End the span even if doRequest throws
        span.end();
      }
    }
  );
}
Notice that neither example imports anything from an SDK package. There's no mention of exporters, processors, or backends. If a user installs this library without an OTel SDK, tracer.start_as_current_span() returns a no-op span and the with block (or callback) runs with negligible overhead.
Application Code — SDK Setup
The application owner installs the SDK and configures the full pipeline. This is typically done once, at startup, before any instrumented code runs. Here you choose your exporters, define resource attributes (like service name), and register providers globally.
# app/main.py
# Depends on: opentelemetry-sdk, opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
# 1. Define resource attributes that identify this service
resource = Resource.create({
"service.name": "order-service",
"service.version": "2.4.1",
"deployment.environment": "production",
})
# 2. Create a TracerProvider with the resource
provider = TracerProvider(resource=resource)
# 3. Attach a BatchSpanProcessor with an OTLP exporter
otlp_exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
# 4. Register the provider globally — this "activates" all API instrumentation
trace.set_tracer_provider(provider)
# Now, any library using the OTel API (like my-http-library) will
# produce real spans that get exported to your collector.
// app/tracing.ts — import this before anything else
// Depends on: @opentelemetry/sdk-trace-node, @opentelemetry/exporter-trace-otlp-grpc
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { Resource } from "@opentelemetry/resources";
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from "@opentelemetry/semantic-conventions";
import { trace } from "@opentelemetry/api";
// 1. Define resource attributes that identify this service
const resource = new Resource({
[ATTR_SERVICE_NAME]: "order-service",
[ATTR_SERVICE_VERSION]: "2.4.1",
"deployment.environment": "production",
});
// 2. Create a TracerProvider with the resource
const provider = new NodeTracerProvider({ resource });
// 3. Attach a BatchSpanProcessor with an OTLP exporter
const exporter = new OTLPTraceExporter({ url: "http://otel-collector:4317" });
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
// 4. Register the provider globally — this "activates" all API instrumentation
provider.register();
// Now, any library using @opentelemetry/api will produce real spans
// that get batched and exported to your collector via gRPC.
Always initialize the SDK before importing any instrumented code. In Node.js, use the --require or --import flag to load your tracing setup file first: node --require ./tracing.js app.js. In Python, configure at the top of your entry point before other imports. If the SDK registers after instrumented code runs, those early spans are lost to the no-op provider.
How the Pieces Connect at Runtime
The global registration step (trace.set_tracer_provider(provider) in Python, provider.register() in TypeScript) is where the magic happens. Before that call, the API's global state holds a no-op provider. After it, every get_tracer() call returns a real tracer backed by your configured BatchSpanProcessor and exporter.
This means library code doesn't need to be "aware" of the SDK at all — not at compile time, not at import time. The API uses a service-locator pattern: it looks up the current global provider at the moment you request a tracer. Libraries written months before your application was configured will still produce telemetry, as long as the SDK is initialized before their code paths execute.
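The mechanics can be shown with a toy, stdlib-only sketch. This is not the real OpenTelemetry API, just the shape of the pattern: a module-level provider slot that defaults to a no-op, with instrumentation looking it up lazily at call time.

```python
# Toy illustration of the API/SDK service-locator pattern.
# NOT the real OpenTelemetry API, just the shape of it.

class NoOpSpan:
    def set_attribute(self, key, value):
        pass  # deliberately does nothing

class NoOpTracer:
    def start_span(self, name):
        return NoOpSpan()  # no side effects, negligible cost

class RecordingTracer:
    """Stand-in for an SDK-backed tracer that actually records spans."""
    def __init__(self):
        self.spans = []
    def start_span(self, name):
        span = {"name": name, "attributes": {}}
        self.spans.append(span)
        return span

# Global slot: holds a no-op until an "SDK" registers a real provider.
_tracer_provider = NoOpTracer()

def set_tracer_provider(provider):
    global _tracer_provider
    _tracer_provider = provider

def get_tracer():
    # Looked up at call time, so libraries imported earlier still benefit
    # from a provider registered later (before their code paths run).
    return _tracer_provider

# Library code, written against the "API" only:
def instrumented_work():
    return get_tracer().start_span("work")

# Without an SDK: the call is a no-op.
noop = instrumented_work()

# Application entry point "installs the SDK", then the same call records.
sdk = RecordingTracer()
set_tracer_provider(sdk)
real = instrumented_work()
```

Note that `instrumented_work` never changes: only the globally registered provider does, which is exactly why libraries can instrument eagerly without knowing whether telemetry is enabled.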
The API/SDK split works identically for metrics (MeterProvider) and logs (LoggerProvider). Each signal has its own provider, processor, and exporter chain. You can enable just traces, just metrics, or all three — the pattern is the same. Libraries that record metrics via the API will see the same no-op-by-default behavior until you register a MeterProvider with the SDK.
OTLP: The OpenTelemetry Protocol and Unified Data Model
OTLP (OpenTelemetry Protocol) is the native wire protocol for transmitting traces, metrics, and logs from instrumented applications to backends. Unlike vendor-specific formats that lock you into a particular ecosystem, OTLP provides a single, well-defined protocol that every OpenTelemetry SDK speaks natively — no translation layer required.
Understanding OTLP matters because it dictates how your telemetry data is structured, serialized, and delivered. The protocol defines both the transport mechanism (how bytes move over the network) and the data model (what those bytes represent). Let's break down both.
Transport Options
OTLP supports three transport variants, each suited to different deployment scenarios. The choice of transport affects performance, compatibility, and debuggability.
| Transport | Content Type | Best For | Trade-offs |
|---|---|---|---|
| gRPC | application/grpc | Production workloads, high throughput | Bidirectional streaming, HTTP/2 multiplexing; may be blocked by some proxies/firewalls |
| HTTP/protobuf | application/x-protobuf | Firewall-restricted environments | Same binary efficiency as gRPC; works through HTTP/1.1 proxies and load balancers |
| HTTP/JSON | application/json | Debugging, manual inspection | Human-readable; 5–10× larger payload than protobuf; not recommended for production |
gRPC is the default and recommended transport. It leverages HTTP/2 for multiplexed streams — a single TCP connection can carry traces, metrics, and logs concurrently without head-of-line blocking. The bidirectional streaming capability also enables efficient flow control between the client and the collector.
When gRPC isn't an option (corporate proxies, load balancers without gRPC support, browser environments), HTTP/protobuf gives you the same binary encoding efficiency over plain HTTP/1.1. HTTP/JSON is the last resort — useful when you need to curl an endpoint or inspect payloads during development.
gRPC uses service definitions (e.g., opentelemetry.proto.collector.trace.v1.TraceService/Export). HTTP transports use fixed paths: /v1/traces, /v1/metrics, and /v1/logs. If your collector isn't receiving data, check that you're hitting the right port and path.
The OTLP Data Model: Three Nesting Layers
Every OTLP payload — whether it carries traces, metrics, or logs — follows the same three-level nesting pattern. This isn't arbitrary; it's designed to avoid redundant data on the wire. Instead of stamping every single span with the service name and SDK version, that metadata is declared once at the top level and inherited by everything below it.
Resource
A Resource identifies the entity producing the telemetry: a service, a process, a container. It carries key-value attributes like service.name, service.version, host.name, and cloud.region. Every span, metric, and log record within a given export request shares the same Resource, so it's sent exactly once per batch.
InstrumentationScope
The InstrumentationScope identifies the library or module that generated the telemetry. If your application uses an HTTP client library that auto-instruments requests and a database library that instruments queries, each gets its own scope with a name and version. This lets backends filter and attribute telemetry to specific instrumentation libraries.
Signal-Specific Data
Below the scope layer, you find the actual telemetry records — Spans for traces, Metrics (with their data points) for metrics, and LogRecords for logs. Each signal type has its own protobuf message structure, but they all slot into the same Resource → Scope → Data hierarchy.
Traces: The Protobuf Structure
A trace export request wraps spans in ResourceSpans → ScopeSpans → Span. Here's the simplified protobuf schema showing the key fields on a Span:
message ExportTraceServiceRequest {
repeated ResourceSpans resource_spans = 1;
}
message ResourceSpans {
Resource resource = 1; // service.name, host, etc.
repeated ScopeSpans scope_spans = 2;
}
message ScopeSpans {
InstrumentationScope scope = 1; // library name + version
repeated Span spans = 2;
}
message Span {
bytes trace_id = 1; // 16-byte globally unique trace ID
bytes span_id = 2; // 8-byte unique span ID
bytes parent_span_id = 4; // empty for root spans
string name = 5; // e.g., "GET /api/users"
SpanKind kind = 6; // CLIENT, SERVER, PRODUCER, CONSUMER, INTERNAL
fixed64 start_time_unix_nano = 7;
fixed64 end_time_unix_nano = 8;
repeated KeyValue attributes = 9;
repeated Event events = 11; // timestamped annotations
repeated Link links = 13; // cross-trace references
Status status = 15; // OK, ERROR, UNSET
}
The trace_id ties all spans in a distributed transaction together. Each span has its own span_id, and the parent_span_id establishes the causal tree. Events are timestamped annotations attached to a span (e.g., an exception event with a stack trace), while Links connect spans across trace boundaries (useful for batch processing where one span triggers many traces).
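To make the ID relationships concrete, here's a stdlib-only sketch that mints spans the way an SDK conceptually does: one shared trace_id, a fresh span_id per span, and parent_span_id wiring. The dict shape is illustrative, not the real SDK's span object:

```python
import secrets

def new_trace_id() -> str:
    return secrets.token_hex(16)   # 16 bytes -> 32 hex chars

def new_span_id() -> str:
    return secrets.token_hex(8)    # 8 bytes -> 16 hex chars

def start_span(name, parent=None):
    # Root spans mint a new trace_id; children inherit the parent's.
    trace_id = parent["trace_id"] if parent else new_trace_id()
    return {
        "trace_id": trace_id,
        "span_id": new_span_id(),
        "parent_span_id": parent["span_id"] if parent else "",
        "name": name,
    }

root = start_span("GET /api/users")
db = start_span("SELECT users", parent=root)

# Both spans share a trace_id; the child points back at the root.
assert db["trace_id"] == root["trace_id"]
assert db["parent_span_id"] == root["span_id"]
assert root["parent_span_id"] == ""
```

A backend reconstructs the call tree by grouping on trace_id and following parent_span_id edges upward until it reaches the span with an empty parent: the root.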
Metrics: Data Points and Temporality
Metrics follow the same nesting: ResourceMetrics → ScopeMetrics → Metric. But a Metric isn't a single value — it wraps typed data points that carry the actual measurements.
message Metric {
string name = 1; // e.g., "http.server.request.duration"
string description = 2;
string unit = 3; // e.g., "ms", "By", "{request}"
oneof data {
Gauge gauge = 5;
Sum sum = 7; // counter or up-down counter
Histogram histogram = 9;
ExponentialHistogram exp_histogram = 10;
Summary summary = 11; // legacy, avoid
}
}
message NumberDataPoint {
repeated KeyValue attributes = 7;
fixed64 time_unix_nano = 3;
oneof value {
double as_double = 4;
sfixed64 as_int = 6;
}
}
The oneof data field means each Metric message carries exactly one type of measurement. A Sum with is_monotonic = true represents a counter; a Histogram carries bucket boundaries and counts. Each data point includes its own set of attributes (dimensions/labels) and a timestamp.
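As an illustration of how a histogram data point aggregates raw measurements into bucket counts, here's a hand-rolled sketch (not the SDK's aggregator) using OTLP's upper-inclusive bucket semantics:

```python
import bisect

def aggregate_histogram(values, boundaries):
    """Count values into buckets the way an OTLP Histogram data point does:
    N explicit boundaries define N+1 buckets. Bucket i holds values in
    (boundaries[i-1], boundaries[i]]; the last bucket holds the overflow."""
    counts = [0] * (len(boundaries) + 1)
    for v in values:
        # bisect_left gives the upper-inclusive bucket index directly
        counts[bisect.bisect_left(boundaries, v)] += 1
    return {
        "count": len(values),
        "sum": sum(values),
        "explicit_bounds": boundaries,
        "bucket_counts": counts,
    }

# Request durations in milliseconds against typical latency boundaries
point = aggregate_histogram([3, 7, 12, 48, 250], boundaries=[5, 10, 50, 100])
# point["bucket_counts"] -> [1, 1, 2, 0, 1]
```

Only `count`, `sum`, and the bucket counts travel on the wire, which is why histograms stay cheap no matter how many raw measurements they summarize.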
Logs: Bridging Structured and Unstructured
Logs complete the three-signal data model with ResourceLogs → ScopeLogs → LogRecord. The LogRecord is designed to accommodate both structured observability logs and legacy unstructured text.
message LogRecord {
fixed64 time_unix_nano = 1;
SeverityNumber severity_number = 2; // 1–24 (TRACE to FATAL)
string severity_text = 3; // "INFO", "ERROR", etc.
AnyValue body = 5; // the log message itself
repeated KeyValue attributes = 6;
bytes trace_id = 9; // correlation with traces
bytes span_id = 10; // correlation with spans
}
The trace_id and span_id fields are what make OTel logs fundamentally more useful than traditional log aggregation. When a log record carries these IDs, your backend can jump directly from a log line to the exact trace and span that produced it — no regex parsing of correlation IDs required.
Wire Efficiency: Compression and Batching
OTLP is designed for efficiency on the wire. The three-layer nesting deduplicates Resource and Scope metadata across potentially thousands of spans or data points in a single request. Beyond structural efficiency, OTLP supports gzip compression on all transports via the Content-Encoding: gzip header (or gRPC's built-in compression). In practice, gzip reduces payload size by 70–90% for typical telemetry data.
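You can see how well repetitive telemetry compresses with a quick stdlib experiment on a mock span batch (the payload shape here is illustrative, not the actual OTLP JSON schema):

```python
import gzip
import json

# A mock batch: resource declared once, many structurally similar spans.
payload = {
    "resource": {"service.name": "order-service", "service.version": "2.4.1"},
    "spans": [
        {
            "name": f"GET /users/{i}",
            "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
            "attributes": {"http.method": "GET", "http.status_code": 200},
        }
        for i in range(500)
    ],
}

raw = json.dumps(payload).encode()
compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes "
      f"({100 * len(compressed) / len(raw):.0f}% of original)")
# Telemetry batches like this, full of repeated keys and similar values,
# typically compress to a small fraction of the raw size.
```

Real exporters compress protobuf rather than JSON, but the underlying reason gzip works so well is the same: telemetry batches are highly repetitive.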
On the SDK side, the OTLP exporter batches telemetry before sending. The default batch size is 512 spans (configurable via OTEL_BSP_MAX_EXPORT_BATCH_SIZE), and the exporter flushes every 5 seconds or when the batch is full — whichever comes first.
Retries and Partial Success
OTLP defines explicit retry semantics. When the collector returns a retryable status code (HTTP 429, 502, 503, 504, or gRPC UNAVAILABLE), the exporter retries with exponential backoff. The response may include a Retry-After header or a retry_info in the gRPC status details that the exporter should respect.
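A hedged sketch of the retry loop an exporter implements (simplified; real SDKs add jitter, cap total elapsed time, and honor Retry-After):

```python
import time

RETRYABLE = {429, 502, 503, 504}  # the OTLP spec's retryable HTTP codes

def export_with_retry(send, max_attempts=5, base_delay=1.0):
    """send() returns an HTTP-like status code. Retry retryable codes
    with exponential backoff; give up immediately on anything else."""
    for attempt in range(max_attempts):
        status = send()
        if status == 200:
            return True
        if status not in RETRYABLE:
            return False  # permanent failure; don't hammer the collector
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return False  # retries exhausted; caller decides whether to drop

# Simulate a collector that is briefly unavailable, then recovers.
responses = iter([503, 503, 200])
ok = export_with_retry(lambda: next(responses), base_delay=0.01)
```

The key distinction is between retryable and permanent failures: retrying a 400 Bad Request forever would just resend the same malformed batch.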
// The collector can accept some items and reject others
message ExportTracePartialSuccess {
int64 rejected_spans = 1; // number of spans rejected
string error_message = 2; // human-readable reason
}
Partial success is a distinctive OTLP capability. A collector can accept 900 out of 1,000 spans and tell the exporter exactly how many were rejected and why. This prevents the all-or-nothing failure mode where a single bad span causes an entire batch to be dropped. Note that the response carries only a count and a message, not the identities of the rejected items, so the exporter's job is to surface the loss (via logs or its own telemetry) rather than to selectively retry.
To inspect what your application actually sends, temporarily switch the exporter to HTTP/JSON and point it at a local listener: OTEL_EXPORTER_OTLP_PROTOCOL=http/json OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318. Then run nc -l 4318 or use a tool like otel-desktop-viewer to see the raw JSON payload.
Configuring the OTLP Exporter
All OpenTelemetry SDKs support a standard set of environment variables to configure OTLP export without code changes. Here are the most important ones:
# Protocol: grpc (default), http/protobuf, or http/json
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
# Endpoint — gRPC uses port 4317, HTTP uses port 4318
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
# Enable gzip compression (highly recommended)
export OTEL_EXPORTER_OTLP_COMPRESSION=gzip
# Timeout for each export request (default: 10s)
export OTEL_EXPORTER_OTLP_TIMEOUT=10000
# Custom headers (e.g., for authentication)
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer tok123,X-Tenant=team-a"
You can also set per-signal overrides. For example, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT overrides the endpoint for traces only, while OTEL_EXPORTER_OTLP_METRICS_PROTOCOL lets you use HTTP/protobuf for metrics even if traces use gRPC. This flexibility is useful when traces go to one backend and metrics to another.
Distributed Tracing: Spans, Traces, and Trace Context
A trace represents the complete journey of a single request as it moves through your distributed system. Structurally, a trace is a directed acyclic graph (DAG) of spans, where each span captures one unit of work — an HTTP handler, a database query, a message publish. The edges in this graph encode causal relationships: span B was triggered by span A.
Every span in the same trace shares a common trace_id. Parent-child relationships between spans are established via parent_span_id, which lets backend systems reconstruct the full call graph and render it as a waterfall timeline.
sequenceDiagram
participant C as Client
participant GW as API Gateway
participant US as User Service
participant DB as Database
Note over C: trace_id: abc123 generated
C->>GW: GET /users/42<br/>traceparent: 00-abc123-span01-01
activate GW
Note over GW: Creates span02<br/>parent: span01
GW->>US: GET /internal/users/42<br/>traceparent: 00-abc123-span02-01
activate US
Note over US: Creates span03<br/>parent: span02
US->>DB: SELECT * FROM users WHERE id=42<br/>traceparent: 00-abc123-span03-01
activate DB
Note over DB: Creates span04<br/>parent: span03
DB-->>US: Row result
deactivate DB
US-->>GW: 200 OK { user data }
deactivate US
GW-->>C: 200 OK { user data }
deactivate GW
Note over C,DB: All 4 spans share trace_id abc123
In this flow, the client generates a trace_id and the first span_id. Each downstream service receives the trace context via the traceparent HTTP header, creates its own span, and sets the incoming span_id as its parent_span_id. The result is a four-span trace that captures the full request lifecycle.
Anatomy of a Span
A span is the fundamental building block of a trace. Each span captures a discrete operation with precise timing, identity, and contextual metadata. Here are the core fields every span carries:
| Field | Size / Type | Description |
|---|---|---|
| trace_id | 128-bit (32 hex chars) | Globally unique identifier shared by all spans in the same trace |
| span_id | 64-bit (16 hex chars) | Unique identifier for this specific span |
| parent_span_id | 64-bit (16 hex chars) | The span_id of the parent; empty for root spans |
| name | string | Operation name (e.g., GET /users/{id}, SELECT users) |
| kind | enum | Role of the span: CLIENT, SERVER, PRODUCER, CONSUMER, or INTERNAL |
| start_time | nanosecond timestamp | When the operation began |
| end_time | nanosecond timestamp | When the operation completed |
| attributes | key-value pairs | Structured metadata (e.g., http.method, db.system) |
| events | timestamped list | Annotations at specific moments within the span's lifetime |
| links | list of span contexts | References to spans in other traces (e.g., batch triggers) |
| status | enum | OK, ERROR, or UNSET |
Here's what a span looks like as structured data when exported to a tracing backend:
{
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "00f067aa0ba902b7",
"parent_span_id": "b3e4a2f8cd91d5a0",
"name": "GET /users/{id}",
"kind": "SERVER",
"start_time": "2024-11-15T09:30:00.000000000Z",
"end_time": "2024-11-15T09:30:00.047000000Z",
"status": { "code": "OK" },
"attributes": {
"http.method": "GET",
"http.url": "https://api.example.com/users/42",
"http.status_code": 200,
"http.route": "/users/{id}"
}
}
Span Kind
The kind field describes the role a span plays in the overall trace topology. Getting this right matters — tracing backends use span kind to correctly pair client and server spans, calculate service-to-service latency, and build dependency graphs.
| Span Kind | Description | Example |
|---|---|---|
| CLIENT | The span initiates an outbound request to a remote service | An HTTP client call, a gRPC stub invocation |
| SERVER | The span handles an inbound request from a remote client | An HTTP handler, a gRPC service method |
| PRODUCER | The span creates a message for later async processing | Publishing to Kafka, enqueuing to RabbitMQ |
| CONSUMER | The span processes a message produced by a PRODUCER | Kafka consumer handler, SQS message processor |
| INTERNAL | An internal operation that doesn't cross a process boundary | Business logic computation, in-memory cache lookup |
When Service A calls Service B, you get two spans for that single hop: a CLIENT span in Service A and a SERVER span in Service B. Both share the same trace_id, and the SERVER span's parent_span_id points to the CLIENT span. The time difference between them reveals network latency.
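Given the timestamps on a paired CLIENT/SERVER span, the network overhead of the hop can be estimated with simple arithmetic (illustrative numbers, nanosecond timestamps as in OTLP, and assuming reasonably synchronized clocks):

```python
# Paired spans for one hop: Service A's CLIENT span wraps the network
# round trip plus Service B's SERVER span.
client_span = {"start_time_unix_nano": 1_000_000_000, "end_time_unix_nano": 1_060_000_000}
server_span = {"start_time_unix_nano": 1_020_000_000, "end_time_unix_nano": 1_050_000_000}

client_ms = (client_span["end_time_unix_nano"] - client_span["start_time_unix_nano"]) / 1e6
server_ms = (server_span["end_time_unix_nano"] - server_span["start_time_unix_nano"]) / 1e6

# Whatever the client waited beyond the server's own processing time is
# network transit plus serialization overhead.
network_overhead_ms = client_ms - server_ms
# 60.0 ms total minus 30.0 ms of server work -> 30.0 ms outside the handler
```

Tracing backends run this same calculation across thousands of hops to surface which service-to-service edges are dominated by network time rather than application work.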
Events, Links, and Status
Events are timestamped annotations attached to a span. They mark specific moments within the span's lifetime — things like "connection acquired from pool" or "retry attempted." The most common event type is the exception event, which OTel libraries record automatically when an error is caught.
{
"name": "exception",
"timestamp": "2024-11-15T09:30:00.035000000Z",
"attributes": {
"exception.type": "ConnectionTimeoutError",
"exception.message": "Connection to db-primary:5432 timed out after 5000ms",
"exception.stacktrace": "at DbPool.acquire (pool.js:142)\n at UserRepo.findById ..."
}
}
Links connect spans across separate traces. Unlike the parent-child relationship, a link is a lateral reference — it says "this span is related to that other span, but wasn't caused by it." This is essential for batch processing: a single consumer span that processes 50 messages can link back to each of the 50 producer spans, even though they belong to different traces.
Status has three possible values. UNSET is the default and means the span completed without the instrumentation explicitly setting a status. OK means the operation was explicitly validated as successful. ERROR means the operation failed. Server frameworks typically set ERROR for 5xx responses and leave status UNSET for 4xx, since a 404 is not a server error.
W3C Trace Context
For traces to survive across service boundaries, there must be a standardized way to encode and transmit trace identity. The W3C Trace Context specification defines two HTTP headers that solve this: traceparent and tracestate.
The traceparent Header
This header carries the essential trace identity in a fixed format with four fields separated by hyphens:
traceparent: {version}-{trace_id}-{parent_id}-{trace_flags}
Example:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
        version            trace_id                 parent_id     flags (01 = sampled)
| Field | Length | Description |
|---|---|---|
| version | 2 hex chars | Format version; currently always 00 |
| trace_id | 32 hex chars (128-bit) | The trace this span belongs to |
| parent_id | 16 hex chars (64-bit) | The span_id of the calling span |
| trace_flags | 2 hex chars | Bit field; 01 = sampled, 00 = not sampled |
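Parsing the header is mechanical. Here's a stdlib sketch (not a spec-complete validator, which would also reject all-zero IDs and non-hex characters):

```python
def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields and decode
    the sampled bit. Raises ValueError on malformed input."""
    parts = header.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")
    version, trace_id, parent_id, flags = parts
    if len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError("bad field lengths")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        # trace_flags is a bit field; bit 0 is the sampled flag
        "sampled": bool(int(flags, 16) & 0x01),
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
# ctx["sampled"] -> True
```

A receiving service uses the extracted trace_id for its own spans and sets parent_id as their parent_span_id, continuing the trace across the process boundary.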
The tracestate Header
The tracestate header carries vendor-specific trace data as a list of key-value pairs. Each tracing vendor can store its own context without clobbering others. This allows systems using Datadog, Dynatrace, and OpenTelemetry to coexist on the same request path.
tracestate: dd=s:1;t.dm:-4,ot=th:256
# Format: vendor1=value1,vendor2=value2
# Each vendor manages its own key; unknown keys are forwarded unchanged
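Splitting the list is straightforward. A sketch (the spec additionally caps the list at 32 entries, constrains key syntax, and gives ordering significance):

```python
def parse_tracestate(header: str) -> dict:
    """Parse 'vendor1=value1,vendor2=value2' into a dict, leaving each
    vendor's opaque value untouched."""
    entries = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        entries[key] = value
    return entries

state = parse_tracestate("dd=s:1;t.dm:-4,ot=th:256")
# state -> {"dd": "s:1;t.dm:-4", "ot": "th:256"}
```

The crucial behavior is forwarding: a service that doesn't recognize a vendor key must pass it along unchanged rather than drop it.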
Context Propagation in Practice
Trace context must propagate across every boundary in your system — not just HTTP. The mechanism changes depending on the transport, but the principle is always the same: inject context on the sending side, extract it on the receiving side.
HTTP — Headers
HTTP propagation is the most straightforward case. OTel SDKs automatically inject traceparent and tracestate into outgoing request headers and extract them from incoming requests. Here's what manual propagation looks like in Python:
import httpx

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

# --- Receiving side (server/consumer) ---
# Extract trace context from incoming HTTP headers
# (`request` comes from your web framework)
ctx = extract(carrier=request.headers)
with trace.get_tracer(__name__).start_as_current_span(
    "handle_request", context=ctx, kind=trace.SpanKind.SERVER
):
    # --- Sending side (client/producer) ---
    # Inject the current trace context into outgoing headers
    headers = {}
    inject(carrier=headers)
    response = httpx.get("http://user-service/users/42", headers=headers)
Message Queues — Message Attributes
For asynchronous messaging (Kafka, RabbitMQ, SQS), trace context travels as message attributes or headers rather than HTTP headers. The producer injects context when publishing; the consumer extracts it when processing. This creates a PRODUCER → CONSUMER span relationship that can span minutes or hours.
# Producer: inject context into Kafka message headers
with tracer.start_as_current_span("publish_order", kind=trace.SpanKind.PRODUCER):
    kafka_headers = {}
    inject(carrier=kafka_headers)
    producer.send("orders", value=order_data, headers=kafka_headers)

# Consumer: extract context from Kafka message headers
def handle_message(message):
    ctx = extract(carrier=dict(message.headers))
    with tracer.start_as_current_span(
        "process_order", context=ctx, kind=trace.SpanKind.CONSUMER
    ):
        process(message.value)
Async Boundaries — In-Process Propagation
Within a single process, trace context is stored in a context object tied to the current execution (thread-local in Java, contextvars in Python, AsyncLocalStorage in Node.js). When you spawn threads, goroutines, or async tasks, you must explicitly pass the context — otherwise the new execution unit starts with an empty trace context and creates a disconnected trace.
const { context, trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-service');

tracer.startActiveSpan('parent-operation', (parentSpan) => {
  // Capture the current context before entering the async boundary
  const currentCtx = context.active();

  setTimeout(() => {
    // Explicitly restore context inside the async callback
    context.with(currentCtx, () => {
      tracer.startActiveSpan('delayed-child', (childSpan) => {
        // This child correctly appears under parent-operation
        childSpan.end();
      });
    });
  }, 1000);

  parentSpan.end();
});
The number one cause of disconnected traces in production is lost context propagation. If you use thread pools, goroutine dispatchers, or Promise.all() patterns, verify that trace context is being carried across those boundaries. Auto-instrumentation handles most HTTP cases, but custom async patterns almost always need manual context passing.
Putting It All Together: Trace Waterfall View
When a tracing backend (Jaeger, Tempo, Honeycomb) receives all spans for a trace, it reconstructs the DAG and renders it as a waterfall — a timeline where each span is a horizontal bar, indented under its parent. Here's how the four-span trace from the sequence diagram above would appear:
Trace: 4bf92f3577b34da6a3ce929d0e0e4736 Duration: 47ms
Service Operation Timeline (0ms ────────────────── 47ms)
─────────────────────────────────────────────────────────────────────────────
api-gateway GET /users/42 ██████████████████████████████████████ 0–47ms
user-service GET /internal/users/42 ██████████████████████████████████ 3–45ms
user-service SELECT users ██████████████████████████ 8–40ms
postgres db.query ████████████████████ 12–38ms
Reading a waterfall, you can immediately see that the database query (26ms) dominates the total latency. The 3ms gap between the gateway span and the user-service span is network latency. Each indent level represents a parent-child relationship — the same structure encoded by parent_span_id in the raw span data.
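The arithmetic behind that read can be automated. Here's an illustrative self-time calculation over the four spans above — timings hardcoded from the waterfall; a real backend computes this from span start/end timestamps:

```python
# Spans from the waterfall above (times in ms); parent links by name
# for simplicity — real spans link by parent_span_id
spans = [
    {"name": "GET /users/42",          "parent": None,                     "start": 0,  "end": 47},
    {"name": "GET /internal/users/42", "parent": "GET /users/42",          "start": 3,  "end": 45},
    {"name": "SELECT users",           "parent": "GET /internal/users/42", "start": 8,  "end": 40},
    {"name": "db.query",               "parent": "SELECT users",           "start": 12, "end": 38},
]

def self_time(span, all_spans):
    """Span duration minus time covered by its direct children.

    Assumes children don't overlap each other, which holds for this
    sequential call chain.
    """
    children = [s for s in all_spans if s["parent"] == span["name"]]
    child_total = sum(c["end"] - c["start"] for c in children)
    return (span["end"] - span["start"]) - child_total

for s in spans:
    print(s["name"], self_time(s, spans))  # db.query carries 26 ms of self time
```

Self time is what waterfall UIs shade differently: it tells you where the latency actually lives rather than which spans merely contain it.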
Use low-cardinality operation names like GET /users/{id} rather than GET /users/42. High-cardinality names (with unique IDs baked in) make it impossible to aggregate spans into meaningful groups. Put the specific values in span attributes instead: http.route = "/users/{id}" and user.id = 42.
Metrics Instruments, Aggregation, and Temporality
OpenTelemetry defines a set of metric instruments — typed objects you use to record measurements. Each instrument type carries specific semantics about what kind of data it records and how that data should be aggregated. Choosing the right instrument is not a stylistic preference; it determines what your backend can compute and display.
There are two categories: synchronous instruments (you call them inline in your code) and asynchronous/observable instruments (you register a callback that OTel invokes at collection time). Let's walk through each one.
Counter
A Counter records monotonically increasing values — things that only go up. Total HTTP requests served, bytes sent, errors encountered. You call add() with a non-negative value, and the SDK accumulates the sum.
from opentelemetry import metrics
meter = metrics.get_meter("my.service", "1.0.0")
request_counter = meter.create_counter(
    name="http.server.request.count",
    description="Total number of HTTP requests received",
    unit="1",
)
# In your request handler:
request_counter.add(1, {"http.method": "GET", "http.route": "/users"})
The aggregated value only goes up. If you need to track something that can decrease — like active connections or queue depth — use an UpDownCounter instead.
UpDownCounter
An UpDownCounter tracks a value that can both increase and decrease. Think active connections, items in a queue, or allocated memory pools. You call add() with positive or negative deltas.
active_connections = meter.create_up_down_counter(
    name="http.server.active_connections",
    description="Number of currently active HTTP connections",
    unit="1",
)
# Connection opened
active_connections.add(1, {"server.address": "0.0.0.0"})
# Connection closed
active_connections.add(-1, {"server.address": "0.0.0.0"})
Histogram
A Histogram records the distribution of values. This is the instrument you reach for when you care about percentiles, not just averages — request latency, response payload sizes, query durations. The SDK sorts each recorded value into configured buckets and also tracks the sum, count, min, and max.
import time
request_duration = meter.create_histogram(
    name="http.server.request.duration",
    description="Duration of HTTP requests in seconds",
    unit="s",
)
start = time.perf_counter()
# ... handle request ...
elapsed = time.perf_counter() - start
request_duration.record(elapsed, {
    "http.method": "POST",
    "http.route": "/orders",
    "http.status_code": 201,
})
The default bucket boundaries ([0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000]) are often too coarse for sub-second latencies. You'll learn how to customize them with Views later in this section.
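To see how a recorded value lands in a bucket, here's a sketch using custom sub-second boundaries (the boundary list here is an example, not a default). OTel's explicit-bucket histograms treat each boundary as an upper-inclusive edge, which `bisect_left` reproduces directly:

```python
import bisect

# Example sub-second boundaries, in seconds (illustrative, not a default)
boundaries = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]

def bucket_index(value: float) -> int:
    """Index of the histogram bucket a value falls into.

    Bucket i counts values in (boundaries[i-1], boundaries[i]];
    values above the last boundary land in the overflow bucket.
    """
    return bisect.bisect_left(boundaries, value)

print(bucket_index(0.003))  # 0  -> (-inf, 0.005]
print(bucket_index(0.1))    # 4  -> (0.05, 0.1], upper edge inclusive
print(bucket_index(7.5))    # 10 -> overflow bucket above 5.0
```

Since a percentile estimate can only be as precise as the bucket it falls into, boundaries that straddle your actual latency range are what make p99 numbers meaningful.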
Gauge
A Gauge records a point-in-time snapshot of a value that doesn't aggregate meaningfully over time. CPU usage percentage, current temperature, memory utilization. In the OTel Python SDK, you set the value directly.
cpu_gauge = meter.create_gauge(
    name="system.cpu.utilization",
    description="Current CPU utilization as a fraction",
    unit="1",
)
# Periodically record the current value
import psutil
cpu_gauge.set(psutil.cpu_percent() / 100.0)
Observable (Async) Instruments
Observable instruments flip the control model. Instead of calling add() or record() from your code, you register a callback function that the SDK invokes on each collection cycle. This is ideal for metrics you poll from an external source — system stats, connection pool sizes, or values from a shared data structure you don't want to instrument inline.
There are observable variants for Counter, UpDownCounter, and Gauge: ObservableCounter, ObservableUpDownCounter, and ObservableGauge.
import psutil
from opentelemetry.metrics import Observation
def cpu_usage_callback(options):
    """Called by the SDK on each collection interval."""
    usage = psutil.cpu_percent(percpu=True)
    for idx, pct in enumerate(usage):
        yield Observation(
            value=pct / 100.0,
            attributes={"cpu.id": str(idx)},
        )

meter.create_observable_gauge(
    name="system.cpu.utilization",
    callbacks=[cpu_usage_callback],
    description="Per-core CPU utilization",
    unit="1",
)
Use synchronous instruments when the measurement happens at a known point in your code (e.g., inside a request handler). Use observable instruments when the value exists independently of your code flow and you just need to sample it periodically (e.g., system metrics, connection pool stats).
Quick Reference: All Instrument Types
| Instrument | Sync/Async | Monotonic | Example Use Case | Default Aggregation |
|---|---|---|---|---|
| Counter | Sync | Yes | Total requests, bytes sent | Sum |
| UpDownCounter | Sync | No | Active connections, queue depth | Sum |
| Histogram | Sync | — | Request latency, payload size | Explicit bucket histogram |
| Gauge | Sync | — | CPU usage, temperature | Last value |
| ObservableCounter | Async | Yes | System CPU time, disk I/O totals | Sum |
| ObservableUpDownCounter | Async | No | Process thread count | Sum |
| ObservableGauge | Async | — | Per-core CPU utilization | Last value |
Aggregation Temporality: Cumulative vs Delta
When the SDK exports metric data, it must decide what time range each data point covers. This is aggregation temporality, and it has two modes:
- Cumulative — each data point represents the running total since the process started (or since the metric stream began). The value at time T includes everything from time 0 to T.
- Delta — each data point represents only the change since the last successful export. The value covers the interval between the previous export and now.
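The relationship between the two modes is easy to see on concrete numbers. Here's a sketch of cumulative-to-delta conversion — conceptually what the Collector's cumulativetodelta processor does, heavily simplified:

```python
def to_deltas(cumulative):
    """Convert a cumulative counter series to per-interval deltas.

    A drop in the cumulative value signals a counter reset (process
    restart), in which case the new cumulative value IS the delta.
    """
    deltas = []
    prev = 0
    for value in cumulative:
        deltas.append(value - prev if value >= prev else value)
        prev = value
    return deltas

print(to_deltas([5, 12, 12, 20]))  # [5, 7, 0, 8]
print(to_deltas([5, 12, 3, 10]))   # [5, 7, 3, 7] — reset detected after 12
```

Going the other direction (delta to cumulative) is a running sum, but it requires the converter to hold state per metric stream, which is why the deltatocumulative processor needs memory proportional to series cardinality.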
This is not an abstract concern. Your backend dictates which temporality it expects, and sending the wrong one causes incorrect results.
| Backend | Expected Temporality | Why |
|---|---|---|
| Prometheus | Cumulative | Prometheus computes rates from cumulative counters using rate() and increase(). It expects monotonically increasing values and detects resets. |
| Datadog | Delta | Datadog's intake API expects pre-computed deltas. Sending cumulative values results in double-counting. |
| StatsD | Delta | StatsD aggregates deltas on the server side. Each flush is treated as an increment. |
| OTLP (generic) | Either | OTLP supports both. The OTel Collector can convert between them using the cumulativetodelta or deltatocumulative processors. |
You configure temporality on the exporter, not on individual instruments. Here's how to set delta temporality for a PeriodicExportingMetricReader:
from opentelemetry.sdk.metrics import (
    Counter,
    Histogram,
    MeterProvider,
    ObservableCounter,
    ObservableGauge,
    ObservableUpDownCounter,
    UpDownCounter,
)
from opentelemetry.sdk.metrics.export import (
    AggregationTemporality,
    PeriodicExportingMetricReader,
)
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import (
    OTLPMetricExporter,
)

# Delta temporality for monotonic instruments and histograms (e.g., for
# Datadog); UpDownCounters stay cumulative to avoid negative deltas
delta_temporality = {
    Counter: AggregationTemporality.DELTA,
    UpDownCounter: AggregationTemporality.CUMULATIVE,
    Histogram: AggregationTemporality.DELTA,
    ObservableCounter: AggregationTemporality.DELTA,
    ObservableUpDownCounter: AggregationTemporality.CUMULATIVE,
    ObservableGauge: AggregationTemporality.CUMULATIVE,
}

exporter = OTLPMetricExporter(
    preferred_temporality=delta_temporality,
)
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=10000)
provider = MeterProvider(metric_readers=[reader])
Even when targeting a delta-preferring backend, UpDownCounter and ObservableUpDownCounter typically remain cumulative. A delta UpDownCounter can produce negative values that many backends misinterpret. Keep these cumulative unless your backend explicitly requires delta for all types.
Views: Customizing Aggregation
Views let you override how an instrument's data is aggregated without changing the instrumentation code. This is powerful: a library author records a Histogram with default buckets, and you — the application operator — reconfigure the bucket boundaries, drop unwanted attributes, or even change the aggregation type entirely.
The most common use case is adjusting histogram bucket boundaries for sub-second latencies:
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.view import (
    ExplicitBucketHistogramAggregation,
    View,
)

# Custom buckets for request latency (in seconds)
latency_view = View(
    instrument_name="http.server.request.duration",
    aggregation=ExplicitBucketHistogramAggregation(
        boundaries=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
    ),
)

# Drop high-cardinality attributes to reduce series count
drop_user_id_view = View(
    instrument_name="http.server.request.duration",
    attribute_keys=["http.method", "http.route", "http.status_code"],
    # Only these attributes are kept; "user.id" and others are dropped
)

provider = MeterProvider(
    metric_readers=[reader],
    views=[latency_view, drop_user_id_view],
)
Views match instruments by name (exact or wildcard), instrument type, meter name, or meter schema URL. You can register multiple views, and they apply independently — a single instrument can produce multiple metric streams if matched by more than one view.
Exemplars: Bridging Metrics and Traces
Metrics tell you what is happening (p99 latency spiked to 2.3s). Traces tell you why (a specific database query took 1.8s). Exemplars are the bridge — they attach sampled trace context (trace ID, span ID) directly to individual metric data points.
When your Histogram records a latency value that falls into a particular bucket, an exemplar can capture the trace ID of the request that produced that value. In your metrics backend, you click from the spike in your p99 chart directly into the offending trace.
from opentelemetry.sdk.metrics import MeterProvider

# Note: in current opentelemetry-python releases the exemplar filters
# live under an internal module path
from opentelemetry.sdk.metrics._internal.exemplar import (
    AlwaysOnExemplarFilter,
    TraceBasedExemplarFilter,
)

# TraceBasedExemplarFilter: attach exemplars only when a sampled
# trace context is active (recommended for production)
provider = MeterProvider(
    metric_readers=[reader],  # reader configured as shown earlier
    exemplar_filter=TraceBasedExemplarFilter(),
)

# AlwaysOnExemplarFilter: attach exemplars for every measurement
# (useful for debugging, high overhead in production)
Exemplars work automatically when both tracing and metrics are configured in the same process. The SDK captures the active span's trace ID and span ID at the moment a measurement is recorded. No changes to your instrumentation code are needed — just configure the exemplar filter on the MeterProvider.
Grafana (with Tempo + Mimir/Prometheus) and Datadog both support exemplar-based metric-to-trace navigation. Verify your backend supports exemplars before enabling them — if it doesn't, the exemplar data is exported but silently ignored, adding overhead for no benefit.
Structured Logging and Log-Trace Correlation
Most applications start with plain-text log lines like INFO 2024-03-15 Order placed for user 42. This works when you're tailing a single log file on a single server. It falls apart the moment you have dozens of services, thousands of requests per second, and a need to answer questions like "show me every log line related to this failed checkout."
Unstructured logs are expensive to parse at query time, impossible to filter reliably with simple string matching, and completely disconnected from the traces and metrics that give them context. Structured logging and log-trace correlation solve all three problems.
Why Unstructured Logs Break at Scale
Consider a traditional log line:
2024-03-15 09:22:17 ERROR [PaymentService] Failed to charge card for order #8812 — timeout after 30s
A human can read this. A machine cannot — at least not without a fragile regex that breaks the next time someone changes the log format. You can't reliably extract the order ID, you can't filter by severity without parsing the word "ERROR" out of a free-text blob, and you have zero connection to the request trace that triggered this payment attempt.
Structured logging solves the parsing problem by emitting log records as key-value pairs in a machine-readable format. The two dominant formats are JSON and logfmt.
Structured Formats: JSON vs logfmt
| Format | Example | Strengths |
|---|---|---|
| JSON | {"level":"error","msg":"charge failed","order_id":8812} | Universal parser support, nested values, wide ecosystem |
| logfmt | level=error msg="charge failed" order_id=8812 | Human-readable, compact, easy to scan in a terminal |
JSON is the safer default for production systems — every log aggregator (Elasticsearch, Loki, Datadog) natively parses it. logfmt shines in local development and CLI tools where you read logs with your eyes. Both formats let you query on specific fields (order_id=8812) without regex gymnastics.
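You don't need a dedicated library to get started — Python's stdlib logging can emit JSON with a custom formatter. A minimal sketch (field names and the `extra` keys are illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Carry through structured fields passed via `extra=`
        # (illustrative allow-list; real formatters often diff against
        # the default LogRecord attributes instead)
        for key in ("order_id", "user_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payment")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.error("charge failed", extra={"order_id": 8812})
# {"level": "error", "logger": "payment", "msg": "charge failed", "order_id": 8812}
```

Every field in the output is now queryable by name in your log backend — no regex required.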
The OTel Logs Data Model
OpenTelemetry defines a standard LogRecord structure that goes beyond simple key-value logging. Every log record carries a consistent set of fields that connect it to the broader observability picture — traces, resources, and instrumentation scopes.
The key fields in a LogRecord are:
| Field | Type | Purpose |
|---|---|---|
| timestamp | uint64 (nanoseconds) | When the log was emitted |
| observed_timestamp | uint64 (nanoseconds) | When the collector received the log |
| severity_number | int (1–24) | Numeric severity (maps to TRACE, DEBUG, INFO, WARN, ERROR, FATAL) |
| severity_text | string | Human-readable severity like "ERROR" |
| body | any | The actual log message (string, map, or array) |
| attributes | key-value map | Structured metadata — order_id, user_id, http.status_code |
| resource | key-value map | The entity producing the log: service.name, host.name, k8s.pod.name |
| trace_id | bytes (16) | Links this log to a distributed trace |
| span_id | bytes (8) | Links this log to a specific span within that trace |
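severity_number deserves a closer look: the 24 values map onto the six text severities in blocks of four, which lets backends sort and filter logs numerically even when sources use different severity vocabularies. A sketch of that mapping:

```python
_SEVERITIES = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"]

def severity_text(severity_number: int) -> str:
    """Map an OTel severity_number (1-24) to its text name.

    Each text severity owns a block of four numbers: TRACE=1-4,
    DEBUG=5-8, INFO=9-12, WARN=13-16, ERROR=17-20, FATAL=21-24.
    """
    if not 1 <= severity_number <= 24:
        raise ValueError(f"severity_number out of range: {severity_number}")
    return _SEVERITIES[(severity_number - 1) // 4]

print(severity_text(9))   # INFO
print(severity_text(17))  # ERROR — matches the example record below
```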
The trace_id and span_id fields are what make log-trace correlation possible. Without them, logs and traces live in completely separate silos. With them, you can click a log line in Grafana and jump directly to the exact trace — and vice versa.
The Log Bridge API — Integrate, Don't Replace
OTel does not ask you to throw away your existing logging library. Instead, it provides a Log Bridge API that sits between your application's logger and the OTel SDK. Your code continues to call logging.error() in Python, logger.error() in Log4j, or Log.Error() in Serilog. The bridge intercepts those calls, converts them into OTel LogRecord objects, enriches them with trace context and resource attributes, and exports them via OTLP.
This design has two major advantages. First, you get zero-disruption adoption — no rewriting of application code. Second, the bridge automatically injects the active trace_id and span_id from the current context, which is something you'd otherwise have to wire up manually in every log call.
flowchart LR
A["Application Code\ncalls logger.error()"] --> B["Standard Logger\n(Python logging, Log4j, Serilog)"]
B --> C["OTel Log Bridge\nLoggingHandler / Appender"]
C --> D["Inject trace_id\n& span_id from\nactive context"]
D --> E["Enrich with\nResource Attributes\n(service.name, host, etc.)"]
E --> F["OTLP Exporter"]
F --> G["Backend\n(Grafana, Jaeger, Datadog)"]
G --> H["Query logs\nby trace_id"]
Code Example: Python Logging with OTel Log Bridge
The following example sets up the OTel log bridge for Python's built-in logging module. After this setup, every log call automatically includes trace context — no changes needed in your application code.
import logging
from opentelemetry import trace
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.sdk.resources import Resource
# 1. Define the service resource
resource = Resource.create({"service.name": "payment-service"})
# 2. Set up the OTel LoggerProvider with an OTLP exporter
logger_provider = LoggerProvider(resource=resource)
logger_provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint="http://localhost:4317"))
)
# 3. Attach OTel's LoggingHandler to Python's root logger
handler = LoggingHandler(level=logging.DEBUG, logger_provider=logger_provider)
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.DEBUG)
# 4. Application code — unchanged, uses standard logging
logger = logging.getLogger("payment")
logger.info("Payment processing started", extra={"order_id": 8812})
The LoggingHandler is the bridge. It converts every Python LogRecord into an OTel LogRecord, automatically pulling trace_id and span_id from the active span context. The extra dict fields become OTel log attributes.
Auto-Injection of Trace Context into Logs
Once the bridge is wired up, trace correlation happens automatically within any active span. Here's what it looks like in practice — you create a span, and every log call inside that span carries the trace context.
tracer = trace.get_tracer("payment")
logger = logging.getLogger("payment")

def process_payment(order_id: int, amount: float):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", order_id)
        logger.info("Charging card", extra={"order_id": order_id, "amount": amount})
        try:
            result = charge_card(order_id, amount)
            logger.info("Payment succeeded", extra={"order_id": order_id})
            return result
        except TimeoutError:
            logger.error("Payment gateway timeout", extra={"order_id": order_id})
            span.set_status(trace.StatusCode.ERROR, "gateway timeout")
            raise
Every logger.info() and logger.error() call inside the with block automatically gets the trace_id and span_id of the process_payment span. The exported log records look like this:
{
  "timestamp": "2024-03-15T09:22:17.384Z",
  "severity_text": "ERROR",
  "severity_number": 17,
  "body": "Payment gateway timeout",
  "attributes": {
    "order_id": 8812
  },
  "resource": {
    "service.name": "payment-service"
  },
  "trace_id": "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4",
  "span_id": "f6e5d4c3b2a1f6e5"
}
With this record in your backend, you can query all logs for trace a1b2c3d4... and immediately see every log emitted during that request — across every service in the chain. In Grafana, this is a single click from a log panel to the trace timeline view.
If you also want trace_id in your console output during development, add it to your log formatter: logging.Formatter('%(asctime)s %(levelname)s [trace=%(otelTraceID)s] %(message)s'). The LoggingInstrumentor from the opentelemetry-instrumentation-logging package injects otelTraceID and otelSpanID as attributes on the Python LogRecord so formatters can reference them.
How Log-Trace Correlation Works End-to-End
The correlation chain has three links. First, the OTel tracing SDK creates a span with a unique trace_id and span_id, stored in the current context. Second, when your code emits a log, the Log Bridge reads these IDs from the context and stamps them onto the LogRecord. Third, the backend indexes logs and traces by trace_id, enabling bidirectional navigation.
This means your debugging workflow changes fundamentally. Instead of grep-searching through gigabytes of text, you start from an error log, click through to the trace, see exactly which service and span failed, inspect the span's attributes and events, and find every other log line emitted during that same request. The trace_id becomes the universal join key across all three signals — logs, traces, and metrics.
Context Propagation and Baggage Across Services
Distributed tracing only works if every service in a request chain knows which trace it belongs to. The mechanism that makes this possible is context propagation — the automatic passing of trace identity and metadata across process boundaries via HTTP headers, message headers, or other transport mechanisms.
Without context propagation, each service would create isolated, unrelated traces. With it, you get a single connected trace that follows a request from ingress to the deepest downstream dependency.
The OTel Context Object
At the heart of propagation is the Context object — an immutable key-value store that carries two critical pieces of data through your application:
- Span Context — the trace ID, span ID, trace flags, and trace state that identify the current position in a distributed trace
- Baggage — arbitrary key-value pairs you attach for cross-service communication (covered in detail below)
Context is immutable by design. Every time you add or modify a value, a new Context is returned. This prevents race conditions in concurrent code and ensures that each operation sees a consistent snapshot of the propagation state.
Context is a carrier, not a trace element. A Span is something you create and end. Context is the invisible thread that links spans together across goroutines, threads, and network boundaries. Think of it as the envelope; the span context and baggage are the letters inside.
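The copy-on-write discipline behind that immutability can be sketched in plain Python — this toy context is illustrative, not the real OTel Context API:

```python
from types import MappingProxyType

def set_value(ctx, key, value):
    """Return a NEW context with the value set; the original is untouched.

    Mirrors the immutable-context idea: readers holding ctx never see
    later modifications, so concurrent code gets consistent snapshots.
    """
    updated = dict(ctx)
    updated[key] = value
    return MappingProxyType(updated)  # read-only view, like a frozen context

root = MappingProxyType({})
ctx1 = set_value(root, "span_id", "00f067aa0ba902b7")
ctx2 = set_value(ctx1, "baggage.tenant-id", "acme")

print("baggage.tenant-id" in ctx1)  # False — ctx1 is unchanged
print(ctx2["span_id"])              # 00f067aa0ba902b7 — inherited from ctx1
```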
How Propagation Works: Injection and Extraction
Propagation involves two complementary operations that happen at service boundaries:
- Injection — On the outgoing side, the propagator serializes the current Context into a carrier (e.g., HTTP request headers). This happens automatically when you use OTel-instrumented HTTP clients or messaging libraries.
- Extraction — On the incoming side, the propagator deserializes the carrier back into a Context object. Your server middleware or framework integration does this before your handler code runs.
sequenceDiagram
    participant A as Service A (Order API)
    participant B as Service B (Billing API)
    participant C as Service C (Notification API)
    Note over A: Creates root span<br/>Sets baggage: tenant-id=acme
    A->>A: inject(context, headers)
    A->>B: POST /charge<br/>traceparent: 00-abc123-span01-01<br/>baggage: tenant-id=acme
    Note over B: extract(headers) → context
    B->>B: Read baggage: tenant-id=acme<br/>Create child span (parent=span01)
    B->>B: inject(context, headers)
    B->>C: POST /notify<br/>traceparent: 00-abc123-span02-01<br/>baggage: tenant-id=acme
    Note over C: extract(headers) → context
    C->>C: Read baggage: tenant-id=acme<br/>Create child span (parent=span02)
Notice how the traceparent header carries the same trace ID (abc123) across all three services, but the parent span ID updates at each hop. The baggage header travels unchanged, making the tenant-id available everywhere without each service needing to look it up independently.
Propagators: The Serialization Formats
A propagator defines how context is serialized into and deserialized from a carrier. OTel supports multiple propagator formats to interoperate with different tracing ecosystems.
| Propagator | Headers Used | When to Use |
|---|---|---|
| W3CTraceContextPropagator | traceparent, tracestate | Default. Use unless you have a reason not to. W3C standard supported by all major vendors. |
| W3CBaggagePropagator | baggage | Carries baggage key-value pairs. Must be explicitly added alongside the trace context propagator. |
| B3Propagator | b3 or X-B3-TraceId, etc. | Zipkin compatibility. Use when migrating from Zipkin or communicating with Zipkin-instrumented services. |
| CompositePropagator | Combines multiple | Use when your system needs multiple formats simultaneously (e.g., W3C + B3 during migration). |
In most setups, you want a composite propagator that includes both the trace context and baggage propagators. Here's how to configure that:
const { propagation } = require('@opentelemetry/api');
const {
  CompositePropagator,
  W3CBaggagePropagator,
  W3CTraceContextPropagator,
} = require('@opentelemetry/core');

// Register both propagators so baggage travels with traces
propagation.setGlobalPropagator(
  new CompositePropagator({
    propagators: [
      new W3CTraceContextPropagator(),
      new W3CBaggagePropagator(),
    ],
  })
);
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from opentelemetry.baggage.propagation import W3CBaggagePropagator

set_global_textmap(
    CompositePropagator([
        TraceContextTextMapPropagator(),
        W3CBaggagePropagator(),
    ])
)
import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

otel.SetTextMapPropagator(
    propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ),
)
Baggage: Cross-Service Key-Value Pairs
Baggage is the OTel mechanism for passing arbitrary key-value metadata alongside trace context. Unlike span attributes (which are local to a single span), baggage entries propagate across every service boundary in the request chain. This makes baggage ideal for values that multiple services need without making independent lookups.
Common Use Cases
- Tenant ID — Multi-tenant systems can route, filter, and label telemetry per tenant without each service querying a database
- Feature flags — Propagate which feature variant is active so downstream services can branch behavior consistently
- A/B test group — Ensure the entire request chain knows which experiment cohort the user belongs to
- Request priority — Let downstream services adjust timeout or queue priority based on the origin's classification
Setting and Reading Baggage
const { propagation, context } = require('@opentelemetry/api');

// --- Service A: Setting baggage at the edge ---
const baggage = propagation.createBaggage({
  'tenant-id': { value: 'acme' },
  'ab-group': { value: 'experiment-42' },
});
const ctxWithBaggage = propagation.setBaggage(context.active(), baggage);

// Run downstream calls within this context
context.with(ctxWithBaggage, () => {
  // Any HTTP call made here will include the baggage header
  fetch('http://billing-api/charge', { method: 'POST', body });
});

// --- Service B: Reading baggage downstream ---
app.post('/charge', (req, res) => {
  const currentBaggage = propagation.getBaggage(context.active());
  const tenantId = currentBaggage?.getEntry('tenant-id')?.value;
  console.log(`Processing charge for tenant: ${tenantId}`);
  // tenantId === 'acme'
});
import requests

from opentelemetry import baggage, context

# --- Service A: Setting baggage at the edge ---
ctx = baggage.set_baggage("tenant-id", "acme")
ctx = baggage.set_baggage("ab-group", "experiment-42", context=ctx)

# Attach the context so outgoing calls propagate it
token = context.attach(ctx)
try:
    requests.post("http://billing-api/charge", json=payload)
finally:
    context.detach(token)

# --- Service B: Reading baggage downstream ---
tenant_id = baggage.get_baggage("tenant-id")
print(f"Processing charge for tenant: {tenant_id}")
# tenant_id == "acme"
import (
    "go.opentelemetry.io/otel/baggage"
)

// --- Service A: Setting baggage at the edge ---
tenantMember, _ := baggage.NewMember("tenant-id", "acme")
abMember, _ := baggage.NewMember("ab-group", "experiment-42")
bag, _ := baggage.New(tenantMember, abMember)
ctx = baggage.ContextWithBaggage(ctx, bag)

// Pass ctx to your HTTP client — headers are injected automatically
req, _ := http.NewRequestWithContext(ctx, "POST", "http://billing-api/charge", body)
client.Do(req)

// --- Service B: Reading baggage downstream ---
bag := baggage.FromContext(ctx)
tenantID := bag.Member("tenant-id").Value()
fmt.Printf("Processing charge for tenant: %s\n", tenantID)
// tenantID == "acme"
Baggage Gotchas and Security Concerns
Baggage is powerful but comes with real trade-offs you need to understand before using it in production. Because baggage entries are serialized into HTTP headers on every request, they're subject to both size constraints and security risks.
Baggage values are transmitted as HTTP headers in cleartext. Never put sensitive data — passwords, tokens, PII, or API keys — in baggage. Any proxy, load balancer, or service in the chain can read (and log) these values. Treat baggage like a URL query parameter: assume everyone can see it.
Key Constraints
| Constraint | Detail |
|---|---|
| Header size limits | Most HTTP servers and proxies enforce header size limits (e.g., Nginx defaults to 8 KB total headers). A bloated baggage header can cause 431 Request Header Fields Too Large errors. |
| Propagation overhead | Every baggage entry is serialized and deserialized at every hop. Keep entries small (short keys, short values) and few (under 10 entries is a good rule of thumb). |
| No automatic cleanup | Baggage accumulates. If Service A sets 5 entries and Service B adds 3 more, Service C sees all 8. There is no built-in TTL or scoping mechanism. |
| Cross-trust boundaries | Baggage from external callers flows into your system. Validate and sanitize baggage values at trust boundaries to prevent injection attacks or unexpected behavior. |
Baggage entries don't automatically appear in your trace backend. If you want to query traces by tenant-id, write a custom SpanProcessor that reads baggage from the current context and copies selected entries into span attributes at span start. This keeps baggage lean for propagation while making the data queryable in your observability platform.
Adding B3 Compatibility During Migration
If you're migrating from Zipkin to OTel (or need to communicate with services that still use B3 headers), you can configure a composite propagator that handles both formats. Incoming requests with either traceparent or b3 headers will be understood, and outgoing requests will include both.
const { CompositePropagator, W3CTraceContextPropagator, W3CBaggagePropagator } = require('@opentelemetry/core');
const { B3Propagator, B3InjectEncoding } = require('@opentelemetry/propagator-b3');
const { propagation } = require('@opentelemetry/api');
propagation.setGlobalPropagator(
new CompositePropagator({
propagators: [
new W3CTraceContextPropagator(),
new W3CBaggagePropagator(),
new B3Propagator({ injectEncoding: B3InjectEncoding.MULTI_HEADER }),
],
})
);
// Outgoing requests now include: traceparent, baggage, X-B3-TraceId, etc.
Once all services are migrated to W3C Trace Context, you can safely remove the B3 propagator from the composite and drop the extra headers.
Auto-Instrumentation vs Manual Instrumentation
OpenTelemetry gives you two distinct paths to instrument your applications: auto-instrumentation, which hooks into common frameworks and libraries without touching your code, and manual instrumentation, where you explicitly create spans, record metrics, and emit logs using the OTel API. In practice, most production systems use both — auto-instrumentation as the foundation, manual instrumentation for business-specific detail.
Auto-Instrumentation: Zero-Code Telemetry
Auto-instrumentation works by intercepting calls to well-known libraries — HTTP clients, database drivers, web frameworks, messaging systems — and automatically creating spans, propagating context, and recording attributes. You get distributed tracing across service boundaries without writing a single line of instrumentation code.
Each language ecosystem provides its own mechanism for this. Here's how you set it up across four major languages:
Install the auto-instrumentation package and run your app with the opentelemetry-instrument wrapper:
# Install the SDK and auto-instrumentation packages
pip install opentelemetry-distro opentelemetry-exporter-otlp
# Install instrumentation libraries for detected packages
opentelemetry-bootstrap -a install
# Run your Flask/Django/FastAPI app — no code changes needed
OTEL_SERVICE_NAME=order-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
opentelemetry-instrument python app.py
The opentelemetry-bootstrap command inspects your installed packages and installs matching instrumentation libraries (e.g., opentelemetry-instrumentation-flask, opentelemetry-instrumentation-psycopg2). The opentelemetry-instrument command monkey-patches these libraries at startup.
Download the Java agent JAR and attach it to your JVM process:
# Download the latest agent JAR
curl -L -o opentelemetry-javaagent.jar \
https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar
# Attach the agent to your Java application
java -javaagent:opentelemetry-javaagent.jar \
-Dotel.service.name=order-service \
-Dotel.exporter.otlp.endpoint=http://localhost:4318 \
-jar my-app.jar
The Java agent uses bytecode manipulation to instrument over 100 libraries — Spring, JAX-RS, JDBC, Hibernate, Kafka, gRPC, and more. No recompilation required. It's the most mature auto-instrumentation in the OTel ecosystem.
Install the meta-package and register it before your application loads:
# Install SDK and auto-instrumentation packages
npm install @opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http
// tracing.js — load this BEFORE your app code
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");
const sdk = new NodeSDK({
serviceName: "order-service",
traceExporter: new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" }),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
# Start your app with the tracing setup loaded first
node --require ./tracing.js app.js
.NET offers a NuGet-based approach with OpenTelemetry.AutoInstrumentation:
# Download and install the auto-instrumentation package
dotnet tool install --global OpenTelemetry.AutoInstrumentation
# Set environment variables and run your app
OTEL_SERVICE_NAME=order-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
OTEL_DOTNET_AUTO_HOME=$HOME/.otel-dotnet-auto \
CORECLR_ENABLE_PROFILING=1 \
CORECLR_PROFILER="{918728DD-259F-4A6A-AC2B-B85E1B658571}" \
CORECLR_PROFILER_PATH=${OTEL_DOTNET_AUTO_HOME}/linux-x64/OpenTelemetry.AutoInstrumentation.Native.so \
dotnet run
The .NET agent uses the CLR profiling API to inject instrumentation into ASP.NET Core, HttpClient, SqlClient, Entity Framework, and other common libraries at runtime.
Manual Instrumentation: Fine-Grained Control
Auto-instrumentation covers the infrastructure layer — HTTP requests, database queries, message queue operations. But it knows nothing about your business logic. When you need spans for "process payment," "validate inventory," or "calculate shipping cost," you reach for manual instrumentation.
Manual instrumentation uses the OpenTelemetry API directly. You acquire a tracer, create spans, set attributes, and record events. This gives you full control over span names, hierarchies, and the metadata attached to each operation.
from opentelemetry import trace
tracer = trace.get_tracer("order-service", "1.0.0")
def process_order(order):
with tracer.start_as_current_span("process-order") as span:
span.set_attribute("order.id", order.id)
span.set_attribute("order.item_count", len(order.items))
span.set_attribute("order.total_usd", order.total)
with tracer.start_as_current_span("validate-inventory"):
check_stock(order.items)
with tracer.start_as_current_span("charge-payment") as payment_span:
result = charge_card(order.payment_method, order.total)
payment_span.set_attribute("payment.provider", result.provider)
payment_span.set_attribute("payment.status", result.status)
span.add_event("order.completed", {"order.id": order.id})
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
Tracer tracer = GlobalOpenTelemetry.getTracer("order-service", "1.0.0");
public void processOrder(Order order) {
Span span = tracer.spanBuilder("process-order").startSpan();
try (Scope scope = span.makeCurrent()) {
span.setAttribute("order.id", order.getId());
span.setAttribute("order.item_count", order.getItems().size());
// Child spans are linked automatically via context
validateInventory(order.getItems());
chargePayment(order.getPaymentMethod(), order.getTotal());
span.addEvent("order.completed");
} catch (Exception e) {
span.recordException(e);
span.setStatus(StatusCode.ERROR, e.getMessage());
throw e;
} finally {
span.end();
}
}
const { trace, SpanStatusCode } = require("@opentelemetry/api");
const tracer = trace.getTracer("order-service", "1.0.0");
async function processOrder(order) {
  return tracer.startActiveSpan("process-order", async (span) => {
    try {
      span.setAttribute("order.id", order.id);
      span.setAttribute("order.item_count", order.items.length);
      await tracer.startActiveSpan("validate-inventory", async (child) => {
        await checkStock(order.items);
        child.end();
      });
      await tracer.startActiveSpan("charge-payment", async (child) => {
        const result = await chargeCard(order.paymentMethod, order.total);
        child.setAttribute("payment.status", result.status);
        child.end();
      });
      span.addEvent("order.completed");
    } catch (err) {
      span.recordException(err);
      // SpanStatusCode is a top-level export of @opentelemetry/api
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}
The first argument to getTracer() is the instrumentation scope name — typically your service or library name. This isn't just a label; backends use it to group and filter telemetry. Always use a meaningful, consistent name rather than an empty string.
The Hybrid Approach: Best of Both
The most effective strategy is layering manual instrumentation on top of auto-instrumentation. Auto-instrumentation gives you the full picture of how requests flow through infrastructure — HTTP handlers, database calls, external API requests. Manual instrumentation fills in the business logic gaps that sit between those infrastructure calls.
Here's what the hybrid looks like in a Python Flask application. Auto-instrumentation handles the Flask route and the database query. You add manual spans for the domain logic in between:
from flask import Flask, request, jsonify
from opentelemetry import trace
app = Flask(__name__)
tracer = trace.get_tracer("order-service")
@app.route("/orders", methods=["POST"])
def create_order():
# AUTO: Flask instrumentation creates a span for "POST /orders"
data = request.get_json()
# MANUAL: Custom span for business validation
with tracer.start_as_current_span("validate-order") as span:
span.set_attribute("order.customer_id", data["customer_id"])
errors = validate_business_rules(data)
if errors:
span.set_attribute("validation.passed", False)
return jsonify({"errors": errors}), 400
span.set_attribute("validation.passed", True)
# MANUAL: Custom span for pricing calculation
with tracer.start_as_current_span("calculate-pricing") as span:
pricing = compute_total(data["items"], data.get("coupon_code"))
span.set_attribute("pricing.subtotal", pricing.subtotal)
span.set_attribute("pricing.discount_pct", pricing.discount)
# AUTO: SQLAlchemy instrumentation captures the INSERT query
order = save_order_to_db(data, pricing)
# AUTO: requests instrumentation captures the outgoing HTTP call
notify_warehouse(order)
return jsonify({"order_id": order.id}), 201
The resulting trace contains a clean hierarchy: the auto-generated Flask span at the root, your manual business spans nested inside, and auto-generated database and HTTP spans beneath those. You see the full story — infrastructure and domain logic — in one trace.
Trade-offs at a Glance
| Factor | Auto-Instrumentation | Manual Instrumentation |
|---|---|---|
| Setup effort | Minutes — add an agent or package, set env vars | Hours to days — instrument each operation explicitly |
| Code changes | Zero (or one bootstrap file) | Requires modifying application code |
| Coverage | Common frameworks and libraries only | Anything you choose to instrument |
| Granularity | Infrastructure-level (HTTP routes, SQL queries) | Business-level (domain operations, custom attributes) |
| Span naming | Generic (e.g., GET /api/orders/:id) | Domain-specific (e.g., apply-loyalty-discount) |
| Performance overhead | Slightly higher — instruments everything it can | You control exactly what's traced |
| Maintenance | Updates with agent/library versions | Must update code when logic changes |
| Startup impact | Adds 1–5s startup time (especially Java agent) | Negligible |
Don't try to manually instrument everything on day one. Deploy auto-instrumentation first, observe your traces in a backend like Jaeger or Grafana Tempo, then identify the gaps — the "black box" spans where you know business logic runs but can't see what happened. Add manual spans there. This incremental approach gives you the fastest time-to-value.
Auto-Instrumentation Library Support
Not all languages have the same breadth of auto-instrumentation support. Here's a snapshot of coverage for common libraries:
| Library / Framework | Python | Java | Node.js | .NET |
|---|---|---|---|---|
| HTTP server (Express, Flask, Spring, etc.) | ✅ | ✅ | ✅ | ✅ |
| HTTP client (requests, HttpClient) | ✅ | ✅ | ✅ | ✅ |
| PostgreSQL / MySQL | ✅ | ✅ (JDBC) | ✅ | ✅ |
| Redis | ✅ | ✅ | ✅ | ✅ |
| Kafka | ✅ | ✅ | ✅ | Partial |
| gRPC | ✅ | ✅ | ✅ | ✅ |
| GraphQL | Limited | ✅ | ✅ | Limited |
| MongoDB | ✅ | ✅ | ✅ | ✅ |
Auto-instrumentation agents can conflict with other bytecode manipulation tools (APM agents, security agents), increase memory usage, and occasionally break on major library version upgrades. Test auto-instrumentation in staging before production rollout, and pin your agent versions in CI/CD pipelines to avoid surprises.
Sampling Strategies: Head-Based, Tail-Based, and Beyond
A moderately busy microservices application can generate millions of spans per minute. Storing and indexing every single one of them is expensive — both in network bandwidth leaving your services and in backend storage costs. The uncomfortable truth is that most traces are uninteresting: a successful 50ms GET request looks nearly identical to the one before it and the one after it.
Sampling is how you keep your observability costs under control without losing the traces that actually matter. The core question every sampling strategy answers is: which traces do we keep, and when do we decide?
flowchart LR
subgraph head ["Head-Based Sampling"]
direction TB
A1["Request arrives"] --> A2{"Sampler decides\nimmediately"}
A2 -->|"Sampled ✓"| A3["Spans collected\n& exported"]
A2 -->|"Not sampled ✗"| A4["Spans dropped\nat creation"]
end
subgraph tail ["Tail-Based Sampling"]
direction TB
B1["Request arrives"] --> B2["All spans collected\ninto buffer"]
B2 --> B3["Trace completes"]
B3 --> B4{"Tail sampler\nevaluates full trace"}
B4 -->|"Interesting"| B5["Trace kept"]
B4 -->|"Boring"| B6["Trace dropped"]
end
Head-Based Sampling
Head-based sampling makes the keep-or-drop decision at the very start of a trace, when the root span is created. The decision propagates through the entire call chain via the sampled flag in the W3C traceparent header, so every service in a distributed transaction agrees on whether to record or not. This is cheap and simple: unsampled spans become lightweight no-ops, so you never buffer or export data you won't keep.
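Downstream SDKs read that decision from the trace-flags field at the end of the incoming traceparent header. A minimal sketch of the check (real SDKs also validate the version, trace ID, and span ID fields):

```python
def is_sampled(traceparent: str) -> bool:
    """Check the 'sampled' bit of a W3C traceparent header.

    Format: version-traceid-parentid-flags, e.g.
    00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
    """
    flags = traceparent.rsplit("-", 1)[-1]
    return int(flags, 16) & 0x01 == 0x01

print(is_sampled("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))  # True
print(is_sampled("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00"))  # False
```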
OpenTelemetry SDKs ship with four built-in head-based samplers:
| Sampler | Behavior | Use Case |
|---|---|---|
| `AlwaysOn` | Keeps 100% of traces | Development, low-traffic services |
| `AlwaysOff` | Drops 100% of traces | Disabling tracing without removing instrumentation |
| `TraceIdRatioBased` | Keeps a fixed percentage based on a hash of the trace ID | Steady-state production sampling at a known rate |
| `ParentBased` | Respects the sampling decision of the upstream (parent) span | Almost always — wraps another sampler to maintain consistency |
TraceIdRatioBased works by hashing the 128-bit trace ID and checking whether the result falls below a threshold. Because the hash is deterministic, every service that sees the same trace ID makes the same decision — no coordination needed. ParentBased wraps any other sampler and delegates to it only for root spans; for child spans, it simply inherits the parent's decision.
from opentelemetry.sdk.trace.sampling import (
ParentBasedTraceIdRatio,
)
from opentelemetry.sdk.trace import TracerProvider
# Keep ~10% of traces; respect parent decisions for child spans
sampler = ParentBasedTraceIdRatio(0.1)
provider = TracerProvider(sampler=sampler)
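To see why no coordination is needed, here is a simplified sketch of the deterministic ratio decision described above. This is an illustration only; the SDK's actual algorithm differs in detail, but the key property is the same: the decision is a pure function of the trace ID.

```python
TRACE_ID_LIMIT = 1 << 64  # decision uses the low 64 bits of the 128-bit trace ID

def ratio_decision(trace_id: int, ratio: float) -> bool:
    """Keep the trace if its ID falls below the ratio's threshold."""
    bound = int(ratio * TRACE_ID_LIMIT)
    return (trace_id & (TRACE_ID_LIMIT - 1)) < bound

# The same trace ID always produces the same decision, in every service
tid = 0x4BF92F3577B34DA6A3CE929D0E0E4736
print(ratio_decision(tid, 0.1) == ratio_decision(tid, 0.1))  # True
```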
Head-based sampling's fundamental limitation is that you can't know if a trace is "interesting" at the moment it starts. A request that will eventually fail with a 500 error or take 30 seconds looks identical to a fast, successful one at creation time. You'll inevitably drop traces you wish you'd kept.
Tail-Based Sampling
Tail-based sampling flips the model: collect all spans first, buffer them until the trace is complete (or a timeout fires), and then evaluate the entire trace against a set of policies. This means you can keep every trace that contains an error, every trace that exceeds a latency threshold, and probabilistically sample the rest. You get the best of both worlds — low storage costs and complete visibility into problems.
The trade-off is operational complexity. The tail sampler needs to hold spans in memory while waiting for a trace to finish, which means higher resource usage. More critically, all spans belonging to a single trace must arrive at the same collector instance — otherwise the sampler sees a partial trace and makes a bad decision. This is solved by load-balancing upstream by trace_id.
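The routing requirement boils down to a deterministic hash over the trace ID. A simplified sketch (the Collector's loadbalancing exporter uses consistent hashing so that adding or removing an instance reshuffles as few traces as possible; the collector names here are illustrative):

```python
import hashlib

COLLECTORS = ["collector-0:4317", "collector-1:4317", "collector-2:4317"]  # illustrative

def route(trace_id: str) -> str:
    """Send every span of a given trace to the same collector instance."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(COLLECTORS)
    return COLLECTORS[index]

# All spans of one trace land on one instance, so the tail sampler
# always evaluates the complete trace
tid = "4bf92f3577b34da6a3ce929d0e0e4736"
print(route(tid) == route(tid))  # True
```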
The OTel Collector's tail_sampling Processor
The OpenTelemetry Collector (contrib distribution) ships with a tail_sampling processor that supports composable policies. Here's a production-realistic configuration:
processors:
tail_sampling:
decision_wait: 10s # how long to buffer spans
num_traces: 100000 # max traces held in memory
policies:
# Always keep traces with errors
- name: errors-policy
type: status_code
status_code:
status_codes: [ERROR]
# Always keep slow traces (> 2 seconds)
- name: latency-policy
type: latency
latency:
threshold_ms: 2000
# Keep 5% of everything else
- name: probabilistic-policy
type: probabilistic
probabilistic:
sampling_percentage: 5
      # Hard cap: never exceed 500 spans per second
      - name: rate-limit-policy
        type: rate_limiting
        rate_limiting:
          spans_per_second: 500
The key policy types are:
| Policy Type | What It Does | Typical Use |
|---|---|---|
| `status_code` | Matches traces containing spans with a specific status (ERROR, OK) | Keep all error traces |
| `latency` | Matches traces whose end-to-end duration exceeds a threshold | Keep slow traces for debugging |
| `probabilistic` | Randomly keeps a percentage of traces | Baseline sampling for "normal" traffic |
| `rate_limiting` | Caps the number of spans per second | Cost control / burst protection |
| `composite` | Combines multiple sub-policies with AND/OR logic and rate allocation | Complex multi-criteria decisions |
Load Balancing by Trace ID
When you run multiple collector instances (and you should for availability), you need a load-balancing layer in front that routes by trace_id. The OTel Collector's loadbalancing exporter handles this — it hashes the trace ID and consistently routes to the same downstream collector.
# Gateway collector — routes spans to sampling tier
exporters:
loadbalancing:
protocol:
otlp:
tls:
insecure: true
resolver:
dns:
hostname: otel-sampling-collectors.svc.cluster.local
port: 4317
If you run tail-based sampling without trace-ID-based routing, the sampler will see incomplete traces. It might drop the spans that contain the error while keeping the boring parent — exactly the opposite of what you want. Always pair tail_sampling with loadbalancing in a two-tier collector architecture.
Beyond Head and Tail: Advanced Strategies
The head-vs-tail dichotomy covers the basics, but several vendor-specific and community-driven approaches push sampling further. These systems recognize that static rules don't adapt well to changing traffic patterns.
Priority Sampling (Datadog)
Datadog's tracing libraries assign each trace a priority value at creation time: USER_REJECT (-1), AUTO_REJECT (0), AUTO_KEEP (1), or USER_KEEP (2). The Datadog Agent then uses these priorities to make downstream decisions. Any trace manually marked as USER_KEEP by application code (for example, in a critical transaction handler) is always retained, while the agent dynamically adjusts the rate for AUTO_KEEP traces based on throughput. This gives developers an escape hatch: flag the traces you know matter, and let the system handle the rest.
Dynamic Sampling (Honeycomb / Refinery)
Honeycomb's Refinery is an open-source tail-sampling proxy that goes beyond static policies. It groups traces into "key spaces" — combinations of attributes like service.name, http.status_code, and http.route — and dynamically adjusts sample rates per group to maintain a target throughput. If your /healthz endpoint suddenly spikes to 10× its normal volume, Refinery automatically samples it more aggressively, while keeping the sample rate for rare error paths low (capturing more of them).
The key insight behind dynamic sampling is that sample rate should vary inversely with how interesting or rare a traffic pattern is. High-volume, uniform traffic gets aggressively sampled down. Low-volume, unusual traffic gets kept at higher rates. This is fundamentally better than a flat probabilistic rate across all traffic.
Whichever strategy you choose, always record the sample_rate as a span attribute. Without it, you can't accurately reconstruct metrics like request counts or error rates from sampled trace data. A trace sampled at 10% represents 10 similar traces — your analysis tooling needs that multiplier.
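Reconstructing approximate request counts from sampled data is then just a weighted sum over the kept traces (the numbers below are illustrative):

```python
# Each kept trace carries the sample_rate it was kept at.
# A rate of 10 means this trace stands in for ~10 similar ones.
kept_traces = [
    {"route": "/checkout", "sample_rate": 10},
    {"route": "/checkout", "sample_rate": 10},
    {"route": "/admin",    "sample_rate": 1},   # rare path, kept at 100%
]

estimated_requests = sum(t["sample_rate"] for t in kept_traces)
print(estimated_requests)  # 21
```

Without the recorded rate, the same three traces would look like three requests, badly undercounting the true volume.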
Choosing a Strategy
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Head-based (ratio) | Zero overhead, simple to configure | Blind to trace outcomes; drops errors | Low-traffic services, getting started |
| Tail-based (Collector) | Keeps interesting traces, drops boring ones | Memory-intensive; requires trace-ID routing | Medium-to-high traffic with SLO monitoring |
| Dynamic (Refinery-style) | Adapts to traffic patterns automatically | Additional infrastructure; tuning required | High-traffic, diverse workloads |
| Priority (Datadog-style) | Developer control over critical paths | Vendor-specific; requires code changes | Teams needing guaranteed capture of key flows |
In practice, most production systems combine strategies. A common pattern is head-based ParentBased(TraceIdRatio(0.1)) in the SDK to cut volume by 90% before it hits the network, followed by tail-based sampling in the Collector to ensure the remaining 10% is biased toward the traces that matter most.
Hands-On: Instrumenting a Python Microservice
This walkthrough instruments a FastAPI-based payment service with OpenTelemetry from scratch. By the end you'll have distributed traces, request metrics, and structured logs flowing into a local Jaeger instance through the OTel Collector.
The full setup runs in Docker Compose so you can experiment without installing anything beyond Docker on your host machine.
Project Structure
Here's what the finished project looks like on disk. Every file is covered in the steps that follow.
payment-service/
├── app/
│ ├── main.py # FastAPI app + OTel bootstrap
│ ├── telemetry.py # All OTel configuration
│ └── routes/
│ └── payments.py # Business logic with manual spans
├── requirements.txt
├── Dockerfile
├── otel-collector-config.yaml
└── docker-compose.yaml
Step 1 — Install the Packages
OpenTelemetry's Python ecosystem is modular: you install the core SDK plus individual instrumentation libraries for each framework you use. This keeps your dependency tree tight — you only pull in what you need.
# Core
fastapi==0.111.0
uvicorn==0.30.1
requests==2.32.3
sqlalchemy==2.0.31
# OpenTelemetry SDK + API
opentelemetry-api==1.25.0
opentelemetry-sdk==1.25.0
# OTLP exporter (sends data to the Collector)
opentelemetry-exporter-otlp==1.25.0
# Auto-instrumentation libraries
opentelemetry-instrumentation-fastapi==0.46b0
opentelemetry-instrumentation-requests==0.46b0
opentelemetry-instrumentation-sqlalchemy==0.46b0
Install locally with pip install -r requirements.txt, or let the Dockerfile handle it (shown later).
Step 2 — Configure the Telemetry Module
Centralizing all OTel setup in a single telemetry.py keeps your application code clean. This module configures three providers — one each for traces, metrics, and logs — and wires them to the OTLP exporter.
The service.name, service.version, and deployment.environment attributes are attached to every signal (trace, metric, log) your service emits. They're the primary keys backends use to filter and group telemetry, so set them accurately.
import logging
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry._logs import set_logger_provider
def init_telemetry(
service_name: str = "payment-service",
service_version: str = "1.0.0",
environment: str = "development",
otlp_endpoint: str = "http://otel-collector:4317",
) -> None:
"""Bootstrap tracing, metrics, and logging providers."""
# --- Shared resource (attached to all signals) ---
resource = Resource.create({
SERVICE_NAME: service_name,
SERVICE_VERSION: service_version,
"deployment.environment": environment,
})
# --- Tracing ---
tracer_provider = TracerProvider(resource=resource)
span_exporter = OTLPSpanExporter(endpoint=otlp_endpoint, insecure=True)
tracer_provider.add_span_processor(BatchSpanProcessor(span_exporter))
trace.set_tracer_provider(tracer_provider)
# --- Metrics ---
metric_reader = PeriodicExportingMetricReader(
OTLPMetricExporter(endpoint=otlp_endpoint, insecure=True),
export_interval_millis=10_000, # flush every 10 s
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)
# --- Logging (bridge Python stdlib logging → OTel) ---
logger_provider = LoggerProvider(resource=resource)
logger_provider.add_log_record_processor(
BatchLogRecordProcessor(
OTLPLogExporter(endpoint=otlp_endpoint, insecure=True)
)
)
set_logger_provider(logger_provider)
handler = LoggingHandler(level=logging.INFO, logger_provider=logger_provider)
logging.getLogger().addHandler(handler)
What each provider does
| Provider | Exporter | Batching Strategy |
|---|---|---|
| `TracerProvider` | `OTLPSpanExporter` | `BatchSpanProcessor` — buffers spans and flushes in bulk (default: 512 spans or 5 s) |
| `MeterProvider` | `OTLPMetricExporter` | `PeriodicExportingMetricReader` — pushes aggregated metrics every 10 s |
| `LoggerProvider` | `OTLPLogExporter` | `BatchLogRecordProcessor` — same batching semantics as traces |
Step 3 — Wire Up Auto-Instrumentation in the FastAPI App
Auto-instrumentation libraries monkey-patch framework internals to create spans and propagate context automatically. Two lines of code give you full request-level tracing for FastAPI and outbound HTTP calls.
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from app.telemetry import init_telemetry
from app.routes.payments import router as payments_router
# 1. Initialize all three OTel providers BEFORE the app starts
init_telemetry()
# 2. Create the FastAPI app
app = FastAPI(title="Payment Service")
app.include_router(payments_router, prefix="/payments")
# 3. Auto-instrument FastAPI (creates server spans for every request)
FastAPIInstrumentor.instrument_app(app)
# 4. Auto-instrument outbound HTTP via `requests` library
RequestsInstrumentor().instrument()
After these four lines, every inbound request generates a server span with attributes like http.method, http.route, and http.status_code. Outbound requests.get() / requests.post() calls produce client spans with context propagation headers injected automatically.
Step 4 — Add Custom Metrics and Manual Spans
Auto-instrumentation captures the HTTP layer, but your business logic lives deeper. Manual spans let you trace domain operations like "process a payment" and attach semantically meaningful attributes. Custom metrics give you counters and histograms tailored to your KPIs.
import time
import logging
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel
from opentelemetry import trace, metrics
logger = logging.getLogger(__name__)
router = APIRouter()
# --- Tracer & Meter (lazy-resolved via global providers) ---
tracer = trace.get_tracer("payment-service.payments")
meter = metrics.get_meter("payment-service.payments")
# --- Custom metrics ---
payment_counter = meter.create_counter(
name="payments.processed.total",
description="Total number of payments processed",
unit="1",
)
payment_latency = meter.create_histogram(
name="payments.processing.duration",
description="Time spent processing a payment",
unit="ms",
)
class PaymentRequest(BaseModel):
order_id: str
amount: float
currency: str = "USD"
method: str = "credit_card"
@router.post("/process")
async def process_payment(req: PaymentRequest):
start = time.perf_counter()
# --- Manual span for business logic ---
with tracer.start_as_current_span("process_payment") as span:
span.set_attribute("payment.order_id", req.order_id)
span.set_attribute("payment.amount", req.amount)
span.set_attribute("payment.currency", req.currency)
span.set_attribute("payment.method", req.method)
logger.info(
"Processing payment for order %s, amount %.2f %s",
req.order_id, req.amount, req.currency,
)
# Simulate validation
with tracer.start_as_current_span("validate_payment"):
if req.amount <= 0:
span.set_status(trace.StatusCode.ERROR, "Invalid amount")
raise HTTPException(status_code=400, detail="Amount must be positive")
# Simulate gateway call
with tracer.start_as_current_span("call_payment_gateway") as gw_span:
gw_span.set_attribute("gateway.provider", "stripe")
time.sleep(0.05) # simulate latency
transaction_id = f"txn_{req.order_id}_001"
gw_span.set_attribute("gateway.transaction_id", transaction_id)
span.set_attribute("payment.transaction_id", transaction_id)
span.set_status(trace.StatusCode.OK)
# --- Record metrics ---
elapsed_ms = (time.perf_counter() - start) * 1000
payment_counter.add(1, {"payment.method": req.method, "payment.currency": req.currency})
payment_latency.record(elapsed_ms, {"payment.method": req.method})
logger.info("Payment %s completed in %.1fms", transaction_id, elapsed_ms)
return {"status": "success", "transaction_id": transaction_id}
Notice how the manual spans nest inside the auto-instrumented FastAPI server span. When you view this in Jaeger, you'll see a clean hierarchy: POST /payments/process → process_payment → validate_payment → call_payment_gateway.
Step 5 — Configure the OTel Collector
The Collector sits between your application and your backends. It receives OTLP data, can transform it (add attributes, sample, filter), and exports to one or more destinations. This config is minimal on purpose — it receives OTLP on gRPC :4317 and forwards traces to Jaeger and logs/metrics to the debug exporter (stdout).
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
processors:
batch:
timeout: 5s
send_batch_size: 512
exporters:
otlp/jaeger:
endpoint: jaeger:4317
tls:
insecure: true
debug:
verbosity: detailed
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch]
exporters: [otlp/jaeger, debug]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [debug]
logs:
receivers: [otlp]
processors: [batch]
exporters: [debug]
Step 6 — Docker Compose for Local Testing
This Compose file brings up three services: your payment app, the OTel Collector, and Jaeger (which natively accepts OTLP). After docker compose up, you can hit the API on port 8000 and view traces at http://localhost:16686.
FROM python:3.12-slim
WORKDIR /srv
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ app/
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
version: "3.9"

services:
  payment-service:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
    depends_on:
      - otel-collector

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.104.0
    command: ["--config", "/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol/config.yaml:ro
    ports:
      - "4317:4317"    # OTLP gRPC
    depends_on:
      - jaeger

  jaeger:
    image: jaegertracing/all-in-one:1.58
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317"         # OTLP gRPC (internal)
Step 7 — Run It and Send a Request
Bring everything up, fire a test payment, and verify the trace in Jaeger.
# Start all services
docker compose up -d --build
# Wait a few seconds for startup, then send a test payment
curl -s -X POST http://localhost:8000/payments/process \
-H "Content-Type: application/json" \
-d '{"order_id": "ord_42", "amount": 99.95, "currency": "USD", "method": "credit_card"}'
# Expected response:
# {"status":"success","transaction_id":"txn_ord_42_001"}
# Open Jaeger UI
open http://localhost:16686
In the Jaeger UI, select payment-service from the service dropdown and click Find Traces. You should see a trace with four spans in a parent-child hierarchy.
Click on the process_payment span in Jaeger and expand its tags. You'll see every custom attribute you set — payment.order_id, payment.amount, gateway.transaction_id, etc. These attributes are what make traces actually useful for debugging production issues.
Key Takeaways
| Concept | What You Did | Why It Matters |
|---|---|---|
| Resource attributes | Set service.name, service.version, deployment.environment | Every signal is tagged, making filtering trivial in backends |
| Auto-instrumentation | Two calls: FastAPIInstrumentor + RequestsInstrumentor | Full request-level tracing with zero business code changes |
| Manual spans | Wrapped process_payment and sub-steps in start_as_current_span | Visibility into domain logic, not just HTTP plumbing |
| Custom metrics | Created a counter and histogram for payments | Business KPIs (throughput, latency) available for dashboards and alerts |
| Log bridging | Attached LoggingHandler to Python's root logger | Logs are correlated with trace IDs automatically |
| Collector as gateway | Routed all signals through the OTel Collector | Decouples your app from backend choice — swap Jaeger for Tempo without code changes |
Hands-On: Instrumenting Node.js and Go Services
A single-language trace is useful, but real-world systems are polyglot. In this section you'll add a Node.js service and a Go service to the existing Python microservice, wire them all up with OpenTelemetry, and watch a single distributed trace flow across three runtimes. The call chain is Python → Node.js → Go, with trace context propagating automatically via HTTP headers.
Project Structure
After this section your project will look like this:
microservices-demo/
├── python-service/ # Existing Flask service (entry point)
├── node-service/ # New — Express service
│ ├── tracing.js
│ ├── app.js
│ ├── package.json
│ └── Dockerfile
├── go-service/ # New — net/http service
│ ├── main.go
│ ├── go.mod
│ └── Dockerfile
└── docker-compose.yaml # Updated with all three services
Node.js Service
The Node.js service receives requests from Python, performs some business logic, then calls the Go service downstream. OpenTelemetry's @opentelemetry/sdk-node package provides a single NodeSDK class that wires up tracing, and the auto-instrumentation package patches Express, HTTP, and other libraries automatically.
Install Dependencies
cd node-service
npm init -y
npm install express axios
npm install @opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http \
@opentelemetry/api
Configure the SDK — tracing.js
This file must be loaded before your application code so that auto-instrumentation can monkey-patch modules at require time. The NodeSDK handles TracerProvider setup, context propagation (W3C TraceContext by default), and batching.
// tracing.js — load with: node --require ./tracing.js app.js
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");

const sdk = new NodeSDK({
  serviceName: "node-service",
  traceExporter: new OTLPTraceExporter({
    // OTEL_EXPORTER_OTLP_ENDPOINT is the base URL; the traces exporter
    // needs the /v1/traces signal path appended.
    url: (process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://otel-collector:4318") + "/v1/traces",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

process.on("SIGTERM", () => {
  sdk.shutdown().then(() => process.exit(0));
});
Application Code with Custom Spans — app.js
Auto-instrumentation gives you HTTP and Express spans for free. For business-specific operations you create manual spans using the @opentelemetry/api tracer. The key insight: because you loaded tracing.js first, the SDK is already running and axios calls automatically inject the traceparent header into outgoing requests.
// app.js
const express = require("express");
const axios = require("axios");
const { trace, SpanStatusCode } = require("@opentelemetry/api");

const app = express();
const tracer = trace.getTracer("node-service");
const GO_SERVICE_URL = process.env.GO_SERVICE_URL || "http://go-service:8082";

app.get("/process", async (req, res) => {
  // Custom span for business logic
  const result = await tracer.startActiveSpan("validate-order", async (span) => {
    try {
      span.setAttribute("order.source", "web");
      span.setAttribute("order.priority", "high");

      // Simulate validation work
      const isValid = Math.random() > 0.1;
      span.setAttribute("order.valid", isValid);
      if (!isValid) {
        span.setStatus({ code: SpanStatusCode.ERROR, message: "Validation failed" });
        return { valid: false };
      }

      // Call downstream Go service — trace context propagates automatically
      const goResponse = await axios.get(`${GO_SERVICE_URL}/finalize`);
      return { valid: true, downstream: goResponse.data };
    } finally {
      span.end(); // always end the span, even if the downstream call throws
    }
  });
  res.json({ service: "node-service", result });
});

app.listen(3000, () => console.log("Node service listening on :3000"));
Dockerfile
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --production
COPY . .
CMD ["node", "--require", "./tracing.js", "app.js"]
Why --require ./tracing.js? Auto-instrumentation works by wrapping module imports at load time. If you import Express before the SDK initializes, those HTTP handlers won't be patched. The --require flag guarantees tracing.js runs first, before any application module loads.
Go Service
The Go service sits at the end of the call chain. It receives requests from Node.js, extracts trace context from the incoming traceparent header, runs some business logic, and exports the resulting spans. Go's OpenTelemetry SDK is explicit — you set up each component yourself, which gives you precise control over batching, exporters, and resource attributes.
Initialize the Module and Install Packages
cd go-service
go mod init go-service
go get go.opentelemetry.io/otel \
go.opentelemetry.io/otel/sdk \
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc \
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp \
go.opentelemetry.io/otel/trace \
go.opentelemetry.io/otel/attribute
Full Service Code — main.go
Unlike Node.js where a single NodeSDK call configures everything, Go requires you to create the exporter, build a TracerProvider, and register it globally. The otelhttp middleware wraps your HTTP handler to automatically extract incoming trace context and create server spans.
package main

import (
	"context"
	"encoding/json"
	"log"
	"math/rand"
	"net/http"
	"os"
	"time"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
	"go.opentelemetry.io/otel/trace"
)

var tracer trace.Tracer

func initTracer() func() {
	endpoint := os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
	if endpoint == "" {
		endpoint = "otel-collector:4317"
	}
	ctx := context.Background()

	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(endpoint),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatalf("failed to create exporter: %v", err)
	}

	res := resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceName("go-service"),
		semconv.ServiceVersion("1.0.0"),
	)

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)

	// Register the W3C TraceContext propagator. Without this, Go's global
	// propagator is a no-op and incoming traceparent headers are ignored.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, propagation.Baggage{},
	))

	tracer = tp.Tracer("go-service")
	return func() {
		_ = tp.Shutdown(ctx)
	}
}

func finalizeHandler(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()

	// Manual span for business logic
	_, span := tracer.Start(ctx, "compute-final-price",
		trace.WithAttributes(
			attribute.String("pricing.currency", "USD"),
			attribute.Float64("pricing.base_amount", 99.95),
		),
	)

	// Simulate computation
	time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond)
	discount := rand.Float64() * 20
	finalPrice := 99.95 - discount
	span.SetAttributes(
		attribute.Float64("pricing.discount", discount),
		attribute.Float64("pricing.final_price", finalPrice),
	)
	span.End()

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]interface{}{
		"service":     "go-service",
		"final_price": finalPrice,
	})
}

func main() {
	shutdown := initTracer()
	defer shutdown()

	mux := http.NewServeMux()
	// otelhttp.NewHandler wraps the handler with automatic span creation
	mux.Handle("/finalize", otelhttp.NewHandler(
		http.HandlerFunc(finalizeHandler), "GET /finalize",
	))

	log.Println("Go service listening on :8082")
	log.Fatal(http.ListenAndServe(":8082", mux))
}
Dockerfile
FROM golang:1.22-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /go-service .
FROM alpine:3.19
COPY --from=builder /go-service /go-service
EXPOSE 8082
CMD ["/go-service"]
Updated Python Service
The existing Python service acts as the entry point. It needs one change: after doing its own work, it calls the Node.js service. With OpenTelemetry's requests auto-instrumentation already in place, the outgoing HTTP call automatically carries the traceparent header.
# python-service/app.py (updated route)
import os, requests
from flask import Flask, jsonify
from opentelemetry import trace

app = Flask(__name__)
tracer = trace.get_tracer("python-service")
NODE_SERVICE_URL = os.environ.get("NODE_SERVICE_URL", "http://node-service:3000")

@app.route("/order")
def create_order():
    with tracer.start_as_current_span("receive-order") as span:
        span.set_attribute("order.id", "ORD-2024-001")
        span.set_attribute("order.items_count", 3)

        # Call Node.js service — traceparent header injected automatically
        response = requests.get(f"{NODE_SERVICE_URL}/process")
        node_result = response.json()

    return jsonify({
        "service": "python-service",
        "order_id": "ORD-2024-001",
        "downstream": node_result,
    })
Updated docker-compose.yaml
This Compose file brings up all five components: the three application services, the OpenTelemetry Collector, and Jaeger for visualization. The depends_on entries ensure the Collector starts before any application service tries to export spans.
version: "3.9"

services:
  # ── Observability Infrastructure ──
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"    # OTLP gRPC
      - "4318:4318"    # OTLP HTTP

  jaeger:
    image: jaegertracing/all-in-one:1.54
    ports:
      - "16686:16686"  # Jaeger UI
      - "14250:14250"  # gRPC from Collector
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  # ── Application Services ──
  python-service:
    build: ./python-service
    ports:
      - "8080:8080"
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
      - NODE_SERVICE_URL=http://node-service:3000
    depends_on:
      - otel-collector
      - node-service

  node-service:
    build: ./node-service
    ports:
      - "3000:3000"
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
      - GO_SERVICE_URL=http://go-service:8082
    depends_on:
      - otel-collector
      - go-service

  go-service:
    build: ./go-service
    ports:
      - "8082:8082"
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=otel-collector:4317
    depends_on:
      - otel-collector
Notice the Go service uses otel-collector:4317 (gRPC, no scheme prefix) while the Node.js and Python services use http://otel-collector:4318 (HTTP with the full URL). The Go OTLP gRPC exporter expects a bare host:port, not a URL. Mixing these up is the most common "spans aren't showing up" bug.
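The rule is mechanical enough to sketch in a few lines. This illustrative helper (hypothetical, not part of any OTel SDK) normalizes an endpoint for each exporter flavor:

```python
# Hypothetical helper: normalize an OTLP endpoint per exporter protocol.
# gRPC exporters (Go) want bare host:port; HTTP exporters want a full URL.
def normalize_otlp_endpoint(endpoint: str, protocol: str) -> str:
    if protocol == "grpc":
        # Strip any scheme prefix: gRPC exporters expect host:port only
        for scheme in ("http://", "https://"):
            if endpoint.startswith(scheme):
                return endpoint[len(scheme):]
        return endpoint
    if protocol == "http":
        # HTTP exporters expect a full URL; default to plain http
        if not endpoint.startswith(("http://", "https://")):
            return f"http://{endpoint}"
        return endpoint
    raise ValueError(f"unknown protocol: {protocol}")

print(normalize_otlp_endpoint("http://otel-collector:4317", "grpc"))  # otel-collector:4317
print(normalize_otlp_endpoint("otel-collector:4318", "http"))         # http://otel-collector:4318
```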
How Trace Context Propagates
The magic that connects spans across three different runtimes is W3C Trace Context propagation. Here's exactly what happens on each hop:
| Hop | What Happens | Header Sent |
|---|---|---|
| Client → Python | Python creates a root span. No incoming traceparent, so a new trace ID is generated. | — |
| Python → Node.js | requests instrumentation injects the current span's context into outgoing headers. | traceparent: 00-{traceId}-{spanId}-01 |
| Node.js → Go | axios instrumentation does the same — injects context from the active Node span. | traceparent: 00-{traceId}-{spanId}-01 |
| Go receives | otelhttp middleware extracts traceparent, creates a child span under the same trace ID. | — |
The critical point: the trace ID stays the same across all three services. Each service generates its own span IDs, but they all reference the same parent trace. This is how Jaeger can stitch them into a single waterfall view.
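To see exactly what travels on the wire, you can pull a traceparent header apart by hand. This standalone sketch parses the header into its four fields; the example values are the ones used in the W3C Trace Context specification:

```python
# Parse a W3C traceparent header: version-traceid-spanid-flags
def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(span_id) == 16, "malformed traceparent"
    return {
        "version": version,
        "trace_id": trace_id,        # identical on every hop of the trace
        "parent_span_id": span_id,   # changes on every hop
        "sampled": flags == "01",    # 01 = sampled, 00 = not sampled
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
print(ctx["sampled"])   # True
```

Each service copies the trace ID forward unchanged and substitutes its own current span ID as the parent for the next hop.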
Run and Verify
1. Start all services:

docker compose up --build -d

2. Send a request through the full chain:

curl http://localhost:8080/order | jq .

You should see a nested JSON response with data from all three services:

{
  "service": "python-service",
  "order_id": "ORD-2024-001",
  "downstream": {
    "service": "node-service",
    "result": {
      "valid": true,
      "downstream": {
        "service": "go-service",
        "final_price": 87.32
      }
    }
  }
}

3. View the distributed trace in Jaeger
Open http://localhost:16686 in your browser. Select python-service from the Service dropdown and click Find Traces. You'll see a single trace with spans from all three services.
Reading the Jaeger Trace
The trace you see in Jaeger will contain approximately six spans arranged in a parent-child waterfall. Here's what each span represents and which service produced it:
Trace                                                          total: ~120ms
└─ python-service: GET /order             ████████████████████████████████ 120ms
   └─ python-service: receive-order         ██████████████████████████████ 115ms
      └─ node-service: GET /process             ████████████████████████ 100ms
         └─ node-service: validate-order           █████████████████████ 85ms
            └─ go-service: GET /finalize                  ██████████████ 60ms
               └─ go-service: compute-final-price               ████████ 40ms
Each indentation level represents a parent-child relationship. The python-service root span encompasses the entire request lifecycle. Inside it, you can see the hand-off from Python to Node.js to Go, with both auto-generated spans (like GET /order) and your custom business spans (like validate-order and compute-final-price). The custom spans carry the attributes you set — click any span in Jaeger to inspect order.valid, pricing.final_price, and other attributes.
If you only see spans from one or two services, check two things: (1) the Collector endpoint format matches the protocol (gRPC on 4317, HTTP on 4318), and (2) the services can resolve each other's hostnames on the Docker network. Run docker compose logs otel-collector to see if spans are arriving at the Collector.
The OTel Collector: Architecture and Deployment Models
The OpenTelemetry Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data. It sits between your instrumented applications and your observability backends, decoupling the two so you can change destinations without touching application code. Think of it as a telemetry router with built-in transformation capabilities.
The Collector is a single binary that you configure entirely through YAML. Every configuration defines one or more pipelines, and each pipeline is composed of three types of components: Receivers, Processors, and Exporters.
graph LR
  subgraph Receivers
    R1["OTLP\n(gRPC/HTTP)"]
    R2["Prometheus\n(scrape)"]
    R3["Filelog\n(log files)"]
  end
  subgraph Processors
    P1["memory_limiter"]
    P2["batch"]
    P3["attributes"]
  end
  subgraph Exporters
    E1["OTLP → Tempo"]
    E2["Prometheus\nRemote Write"]
    E3["Loki\n(logs)"]
  end
  R1 --> P1
  R2 --> P1
  R3 --> P1
  P1 --> P2
  P2 --> P3
  P3 --> E1
  P3 --> E2
  P3 --> E3
  subgraph DT["Deployment Topology"]
    direction TB
    APP["App + SDK"] -->|"OTLP"| AGENT["Agent\n(DaemonSet/Sidecar)"]
    AGENT -->|"OTLP"| GW["Gateway\n(centralized)"]
    GW --> BACKEND["Backend\n(Tempo, Prom, Loki)"]
  end
Pipeline Architecture: Receivers → Processors → Exporters
A Collector pipeline has a strict data flow: Receivers ingest data, Processors transform it in order, and Exporters send it out. Each pipeline handles one signal type — traces, metrics, or logs. A single Collector instance typically runs multiple pipelines simultaneously.
Receivers — Data Ingestion
Receivers are how data gets into the Collector. They can be either push-based (the Collector listens for incoming data) or pull-based (the Collector actively scrapes a target). You can configure multiple receivers per pipeline, and the same receiver can feed into multiple pipelines.
| Receiver | Type | Signal | Description |
|---|---|---|---|
| otlp | Push | Traces, Metrics, Logs | Native OTel protocol over gRPC (4317) and HTTP (4318) |
| prometheus | Pull | Metrics | Scrapes Prometheus-format endpoints |
| jaeger | Push | Traces | Accepts Jaeger Thrift and gRPC formats |
| zipkin | Push | Traces | Accepts Zipkin JSON v2 spans |
| filelog | Pull | Logs | Tails log files with configurable parsing |
| hostmetrics | Pull | Metrics | Collects CPU, memory, disk, and network metrics from the host |
Processors — Transformation Pipeline
Processors run sequentially in the order you define them in the configuration. The order matters — data flows through processor A, then B, then C. This is where you shape, filter, sample, and enrich your telemetry before it leaves the Collector.
| Processor | Purpose | Why You Need It |
|---|---|---|
| memory_limiter | Backpressure | Prevents the Collector from running out of memory under load |
| batch | Batching | Groups data into batches for more efficient export (reduces network calls) |
| attributes | Enrichment | Adds, updates, or deletes resource/span/metric attributes |
| filter | Filtering | Drops telemetry that matches conditions (e.g., health-check spans) |
| tail_sampling | Sampling | Makes sampling decisions after seeing the full trace (requires Gateway) |
| transform | OTTL transforms | Applies arbitrary transformations using the OpenTelemetry Transformation Language |
Always place memory_limiter first in the processor chain — it needs to be able to reject data before other processors allocate memory for it. A typical order is: memory_limiter → filter → attributes / transform → batch (batch last, so it batches the final transformed data).
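As a sketch, a traces pipeline following that ordering might look like this fragment (the filter condition and attribute values are illustrative):

```yaml
processors:
  memory_limiter:        # 1. reject data before anything else allocates for it
    check_interval: 1s
    limit_mib: 512
  filter/drop-health:    # 2. drop health-check spans early
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'
  attributes/env:        # 3. enrich only what survives filtering
    actions:
      - key: environment
        value: staging
        action: upsert
  batch: {}              # 4. batch last, on the final transformed data

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop-health, attributes/env, batch]
      exporters: [otlp]
```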
Exporters — Data Output
Exporters send data to one or more backends. Like receivers, you can attach multiple exporters to a single pipeline — the Collector fans out copies of the data to each one. This means you can send traces to Tempo and Jaeger simultaneously from the same pipeline.
Common exporters include otlp (for OTLP-native backends like Tempo or Grafana Cloud), prometheusremotewrite (for Prometheus/Mimir/Thanos), loki (for log aggregation), and vendor-specific exporters like datadog, splunk_hec, and awsxray.
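A minimal fragment showing that fan-out, assuming two OTLP-capable trace backends (the hostnames are placeholders):

```yaml
exporters:
  otlp/tempo:                 # same exporter type...
    endpoint: tempo.monitoring:4317
  otlp/jaeger:                # ...second instance under a different name
    endpoint: jaeger.monitoring:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo, otlp/jaeger]  # every span is copied to both
```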
Connectors — Joining Pipelines
Connectors are a newer component type that acts as both an exporter for one pipeline and a receiver for another. They bridge signal types — for example, the spanmetrics connector reads trace data and produces metrics like request rate, error rate, and duration histograms (RED metrics) without you needing a separate tool.
Here is a complete Collector configuration that demonstrates receivers, processors, exporters, connectors, and pipeline wiring:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  filelog:
    include: [/var/log/app/*.log]
    operators:
      - type: json_parser

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    send_batch_size: 1024
    timeout: 5s
  attributes:
    actions:
      - key: environment
        value: production
        action: upsert

exporters:
  otlp/tempo:
    endpoint: tempo.monitoring:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://mimir.monitoring:9009/api/v1/push
  loki:
    endpoint: http://loki.monitoring:3100/loki/api/v1/push

connectors:
  spanmetrics:
    dimensions:
      - name: http.method
      - name: http.status_code

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo, spanmetrics]
    metrics:
      receivers: [otlp, spanmetrics]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter, attributes, batch]
      exporters: [loki]
Deployment Models
How you deploy the Collector has a significant impact on reliability, latency, and what processing you can do. There are four common patterns, ranging from "no Collector at all" to a multi-tier architecture.
1. No Collector (Direct SDK Export)
Your application's OTel SDK exports telemetry directly to the backend. This is the simplest setup — no extra infrastructure to manage. However, it tightly couples your app to the backend, offers no local buffering, and means every SDK change requires an application redeploy. Suitable for local development and small prototypes only.
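With most SDKs, direct export needs no Collector-specific code at all: you point the standard OTLP environment variables straight at the backend. The endpoint and token below are placeholders for whatever your backend provides:

```shell
# Direct SDK-to-backend export, no Collector in between.
# Endpoint and token are placeholders for your backend's values.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp.example-backend.com:4317"
export OTEL_EXPORTER_OTLP_HEADERS="authorization=Bearer <token>"
export OTEL_SERVICE_NAME="payment-service"
export OTEL_TRACES_SAMPLER="parentbased_traceidratio"
export OTEL_TRACES_SAMPLER_ARG="0.25"   # sample 25% of root traces
```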
2. Agent Mode (Sidecar / DaemonSet)
A Collector instance runs alongside your application — either as a Kubernetes sidecar container in the same pod, or as a DaemonSet with one Collector per node. The SDK exports to localhost, which is fast and reliable. The Agent handles retry logic, batching, and basic enrichment, keeping the SDK configuration minimal.
3. Gateway Mode (Centralized)
A pool of Collector instances runs as a standalone, horizontally-scaled deployment behind a load balancer. All telemetry routes through this central tier. The Gateway is the only place you can run tail_sampling (which needs to see all spans of a trace) and cross-service aggregation. It is also the single point to manage export credentials for your backends.
4. Agent + Gateway (Recommended)
The production-grade pattern combines both tiers. Agents run locally for fast buffering and basic processing. They forward to a Gateway tier for advanced operations like tail sampling, attribute enrichment with external data, and routing to multiple backends. This gives you local resilience and centralized control.
| Model | Buffering | Tail Sampling | Complexity | Best For |
|---|---|---|---|---|
| No Collector | None | No | Minimal | Local dev, prototypes |
| Agent only | Local | No | Low | Small clusters, simple backends |
| Gateway only | Central | Yes | Medium | Centralized control without local agents |
| Agent + Gateway | Both | Yes | Higher | Production workloads at scale |
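As a sketch of the agent side of the Agent + Gateway pattern, the config below receives from local apps, applies minimal processing, and forwards everything to a central gateway over OTLP (the gateway hostname is an assumption):

```yaml
# Agent config (DaemonSet/sidecar): local buffering and minimal processing,
# then forward all signals to the central gateway tier.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 256
  batch: {}

exporters:
  otlp/gateway:
    endpoint: otel-gateway.observability.svc:4317  # assumed gateway address
    tls:
      insecure: true
    retry_on_failure:
      enabled: true   # buffer and retry if the gateway is briefly unreachable

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/gateway]
```

The heavy lifting (tail sampling, enrichment, multi-backend routing) stays in the gateway config; the agent's job is only to get data off the node reliably.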
Distributions: Core vs Contrib
The Collector ships in two official distributions, and the difference matters when you plan your component usage.
otelcol (Core) includes only the most stable, well-tested components: the OTLP receiver/exporter, the batch and memory_limiter processors, and a handful of others. It is small, secure, and suitable when you only need OTLP-to-OTLP forwarding.
otelcol-contrib (Contrib) bundles hundreds of community-contributed receivers, processors, and exporters — including filelog, hostmetrics, prometheusremotewrite, loki, tail_sampling, and vendor-specific exporters. Most production deployments use Contrib because they need at least one component that is not in Core.
The Contrib binary includes every contributed component, resulting in a large binary (~200+ MB) with a broad attack surface. If you only need 5 components, you are shipping hundreds of unused ones. For production, build a custom Collector with only the components you actually use.
Building a Custom Collector with OCB
The OpenTelemetry Collector Builder (OCB) lets you create a custom Collector binary that includes exactly the receivers, processors, exporters, and connectors you need — nothing more. You define a manifest YAML file listing your components and their versions, then OCB generates the Go source code and compiles it into a single binary.
# otel-builder-manifest.yaml
dist:
  name: my-otelcol
  description: Custom Collector for our platform
  output_path: ./build
  otelcol_version: 0.104.0

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.104.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/filelogreceiver v0.104.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/hostmetricsreceiver v0.104.0

processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.104.0
  - gomod: go.opentelemetry.io/collector/processor/memorylimiterprocessor v0.104.0

exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.104.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter v0.104.0

connectors:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/connector/spanmetricsconnector v0.104.0
Then build and run your custom Collector:
# Install OCB
go install go.opentelemetry.io/collector/cmd/builder@latest
# Build the custom Collector from the manifest
builder --config=otel-builder-manifest.yaml
# Run it with your Collector config
./build/my-otelcol --config=otel-config.yaml
All component versions in your manifest must match the otelcol_version. Version mismatches between core and contrib modules are the number one cause of OCB build failures. Pin everything to the same release tag and update them together.
Collector Configuration: Pipelines, Processors, and Recipes
The OpenTelemetry Collector is configured via a single YAML file that declares what data comes in, how it gets transformed, and where it goes out. Every Collector deployment — sidecar, agent, or gateway — shares the same configuration model. Mastering this file is the single most important skill for running OTel in production.
Top-Level Configuration Structure
A Collector config has six top-level keys. Each one defines a pool of named components that you wire together in the service.pipelines section.
| Key | Purpose | Example Components |
|---|---|---|
| receivers | Ingest data from external sources (push or pull) | otlp, prometheus, filelog, hostmetrics |
| processors | Transform, filter, batch, or sample data in-flight | batch, attributes, tail_sampling, filter |
| exporters | Send data to backends or downstream collectors | otlp, otlphttp, prometheusremotewrite, loki |
| connectors | Bridge one pipeline's output to another pipeline's input | spanmetrics, routing, forward |
| extensions | Auxiliary services (health checks, auth, storage) | health_check, pprof, zpages |
| service | Declares active pipelines and enabled extensions | — |
Here is the skeleton structure that every config follows. Defining a component under receivers: alone does nothing — you must reference it in a pipeline under service: for it to run.
receivers:
  # Named receiver instances go here

processors:
  # Named processor instances go here

exporters:
  # Named exporter instances go here

connectors:
  # Named connector instances (optional)

extensions:
  # Named extension instances (optional)

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/backend]
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [filelog]
      processors: [attributes, batch]
      exporters: [otlphttp]
You can create multiple instances of the same component type using a type/name suffix — for example, otlp/frontend and otlp/backend. The part after the slash is an arbitrary label you choose. This is how you target different endpoints with the same exporter type.
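For example, two exporters of the same otlp type can coexist under different names, and pipelines can be named the same way (the endpoints here are placeholders):

```yaml
exporters:
  otlp/frontend:                # same exporter type...
    endpoint: frontend-traces.example.com:4317
  otlp/backend:                 # ...different name, different destination
    endpoint: backend-traces.example.com:4317

service:
  pipelines:
    traces/frontend:            # pipeline names take the same type/name suffix
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/frontend]
    traces/backend:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/backend]
```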
Environment Variable Substitution
Never hard-code secrets or environment-specific values. The Collector natively supports ${ENV_VAR} syntax anywhere in the YAML, with optional defaults via ${ENV_VAR:-fallback}. This makes the same config portable across dev, staging, and production.
exporters:
  otlp/backend:
    endpoint: ${OTEL_BACKEND_ENDPOINT:-localhost:4317}
    headers:
      authorization: "Bearer ${API_TOKEN}"
    tls:
      insecure: ${TLS_INSECURE:-false}

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ${OTLP_GRPC_LISTEN:-0.0.0.0:4317}
      http:
        endpoint: ${OTLP_HTTP_LISTEN:-0.0.0.0:4318}
Recipe 1: Basic Tracing Pipeline — OTLP to Jaeger/Tempo
This is the starting point for most tracing setups. Applications send spans via OTLP (gRPC or HTTP), the Collector batches them for efficiency, and forwards them to a Jaeger or Grafana Tempo backend. The batch processor is critical — without it, each span triggers a separate network request to your backend.
# recipe-1-basic-tracing.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 1024      # Flush after 1024 spans...
    timeout: 5s                # ...or after 5 seconds, whichever comes first
    send_batch_max_size: 2048  # Hard cap per batch

exporters:
  otlp/tempo:
    endpoint: ${TEMPO_ENDPOINT:-tempo.observability.svc:4317}
    tls:
      insecure: true           # Set false in production with real certs

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
The batch processor works on two triggers: a size threshold (send_batch_size) and a time deadline (timeout). In low-traffic environments, the timeout ensures spans aren't held indefinitely. In high-traffic bursts, the size threshold caps memory usage.
Recipe 2: Prometheus Scraping and Remote Write
You can use the Collector as a drop-in replacement for Prometheus' scrape loop. The prometheus receiver accepts standard Prometheus scrape_configs syntax, meaning you can port your existing prometheus.yml jobs directly. The Collector then converts scraped metrics into OTLP format internally and exports them via prometheusremotewrite to any compatible backend (Cortex, Mimir, Thanos, etc.).
# recipe-2-prometheus-scrape.yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "kubernetes-pods"
          scrape_interval: 30s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
              action: replace
              target_label: __address__
              regex: (.+)
              replacement: $$1
        - job_name: "node-exporter"
          scrape_interval: 15s
          static_configs:
            - targets: ["node-exporter:9100"]

processors:
  batch:
    timeout: 10s

exporters:
  prometheusremotewrite:
    endpoint: ${MIMIR_ENDPOINT:-http://mimir.observability.svc:9009/api/v1/push}
    headers:
      X-Scope-OrgID: ${TENANT_ID:-default}
    resource_to_telemetry_conversion:
      enabled: true  # Promotes OTel resource attributes to metric labels

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]
The resource_to_telemetry_conversion setting is important. Without it, OTel resource attributes (like service.name) are dropped during the conversion to Prometheus format. Enabling it promotes them to metric labels so your Grafana dashboards can filter by service.
Recipe 3: Log Collection with Multiline Parsing
The filelog receiver tails log files from disk — useful for collecting application logs in Kubernetes (from /var/log/pods) or VMs. The real power is in its operator pipeline: you can parse multiline stack traces, extract structured fields, and enrich logs before they leave the node. This recipe collects Java-style logs with multiline exceptions and ships them to Loki.
# recipe-3-log-collection.yaml
receivers:
filelog:
include:
- /var/log/pods/*/*/*.log
exclude:
- /var/log/pods/*/otel-collector/*.log # Don't collect our own logs
start_at: end # Don't re-read historical logs
include_file_path: true # Add log.file.path attribute
multiline:
# Combine multiline Java stack traces into a single log entry:
# a new record starts only at a leading date
line_start_pattern: '^\d{4}-\d{2}-\d{2}'
operators:
# Parse timestamp, severity, thread, logger, and message from the log line
- type: regex_parser
regex: '^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+(?P<severity>\w+)\s+\[(?P<thread>[^\]]+)\]\s+(?P<logger>\S+)\s+-\s+(?P<message>.*)'
timestamp:
parse_from: attributes.timestamp
layout: "%Y-%m-%d %H:%M:%S.%L"
severity:
parse_from: attributes.severity
processors:
attributes/logs:
actions:
- key: environment
value: ${DEPLOY_ENV:-production}
action: upsert
- key: cluster
value: ${CLUSTER_NAME}
action: upsert
- key: thread
action: delete # Clean up parsed temp attributes
batch:
timeout: 5s
send_batch_size: 512
exporters:
loki:
endpoint: ${LOKI_ENDPOINT:-http://loki.observability.svc:3100/loki/api/v1/push}
default_labels_enabled:
exporter: false
job: true
service:
pipelines:
logs:
receivers: [filelog]
processors: [attributes/logs, batch]
exporters: [loki]
Multiline handling is the key to sane log collection. Without it, each line of a Java stack trace becomes a separate log entry, making debugging impossible. The line_start_pattern regex identifies where a new log record begins — all subsequent lines until the next match are folded into the same entry.
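The folding behavior is easy to model in a few lines of Python. This is a conceptual sketch of what the receiver does with the pattern, not its implementation:

```python
import re

# A new log record starts with an ISO-style date (same pattern as the recipe).
LINE_START = re.compile(r"^\d{4}-\d{2}-\d{2}")

def fold_multiline(lines):
    """Fold continuation lines (stack frames) into the preceding record."""
    records = []
    for line in lines:
        if LINE_START.match(line) or not records:
            records.append(line)           # a new log record begins
        else:
            records[-1] += "\n" + line     # continuation: fold into previous
    return records

raw = [
    "2024-05-01 12:00:00.123 ERROR [main] c.e.App - boom",
    "java.lang.NullPointerException",
    "    at com.example.App.run(App.java:42)",
    "2024-05-01 12:00:01.000 INFO [main] c.e.App - recovered",
]
folded = fold_multiline(raw)
# folded now holds 2 records; the first contains the full stack trace
```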
Recipe 4: Tail Sampling Gateway
Head-based sampling (deciding at trace start) is simple but wasteful — you might drop the one trace that shows a critical error. Tail sampling waits until a trace is complete, then decides whether to keep it based on what actually happened. This requires a gateway Collector that receives all spans, groups them into complete traces, and applies policies.
All spans for a given trace must arrive at the same gateway instance. If you run multiple gateway replicas, you need a load balancer that routes by trace_id — use the loadbalancing exporter on your agent Collectors to achieve this. Without this, the gateway sees incomplete traces and makes bad sampling decisions.
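A conceptual sketch of trace-ID-keyed routing follows. The real loadbalancing exporter uses a consistent-hash ring with endpoint discovery; this only shows the invariant it guarantees:

```python
import hashlib

def pick_gateway(trace_id: str, gateways: list) -> str:
    """Deterministically map a trace ID to one gateway replica."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    return gateways[int.from_bytes(digest[:8], "big") % len(gateways)]

gateways = ["gateway-0:4317", "gateway-1:4317", "gateway-2:4317"]
# Every span of a trace carries the same trace_id, so all of them
# land on the same replica and the gateway sees the complete trace.
target = pick_gateway("4bf92f3577b34da6a3ce929d0e0e4736", gateways)
assert target == pick_gateway("4bf92f3577b34da6a3ce929d0e0e4736", gateways)
```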
# recipe-4-tail-sampling-gateway.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
processors:
# Step 1: Buffer spans and group them into complete traces
groupbytrace:
wait_duration: 30s # Wait up to 30s for all spans of a trace to arrive
num_traces: 100000 # Max traces held in memory simultaneously
# Step 2: Apply sampling policies to complete traces
tail_sampling:
decision_wait: 10s # Additional wait after groupbytrace
num_traces: 100000
policies:
# Policy 1: Always keep traces containing errors
- name: errors-policy
type: status_code
status_code:
status_codes:
- ERROR
# Policy 2: Keep traces with high latency (> 2 seconds)
- name: latency-policy
type: latency
latency:
threshold_ms: 2000
# Policy 3: Always keep traces from critical services
- name: critical-services
type: string_attribute
string_attribute:
key: service.name
values:
- payment-service
- auth-service
# Policy 4: Probabilistic sampling for everything else (10%)
- name: catch-all
type: probabilistic
probabilistic:
sampling_percentage: 10
batch:
send_batch_size: 1024
timeout: 5s
exporters:
otlp/tempo:
endpoint: ${TEMPO_ENDPOINT:-tempo.observability.svc:4317}
tls:
insecure: true
service:
pipelines:
traces:
receivers: [otlp]
processors: [groupbytrace, tail_sampling, batch]
exporters: [otlp/tempo]
The processor order matters. groupbytrace must come before tail_sampling so that the sampler sees complete traces, not individual spans arriving out of order. The batch processor comes last because you want to batch the already-sampled output. Policies are evaluated with OR logic — if any policy matches, the trace is kept.
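The OR semantics of the four policies can be sketched as a plain predicate. This is a conceptual model, not the processor's implementation:

```python
import random

CRITICAL_SERVICES = {"payment-service", "auth-service"}

def keep_trace(has_error, duration_ms, services, rng):
    if has_error:                            # Policy 1: errors-policy
        return True
    if duration_ms > 2000:                   # Policy 2: latency-policy
        return True
    if services & CRITICAL_SERVICES:         # Policy 3: critical-services
        return True
    return rng.random() < 0.10               # Policy 4: catch-all (10%)

rng = random.Random(0)
assert keep_trace(True, 50, {"cart"}, rng)           # error: kept
assert keep_trace(False, 3500, {"cart"}, rng)        # slow: kept
assert keep_trace(False, 50, {"auth-service"}, rng)  # critical: kept
```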
Recipe 5: Multi-Tenant Routing
In a shared platform, different teams or tenants need their telemetry routed to separate backends. The routing connector reads a resource attribute (like tenant.id) and directs data to different sub-pipelines, each with its own exporter and destination.
# recipe-5-multi-tenant-routing.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
connectors:
routing:
table:
- statement: route() where resource.attributes["tenant.id"] == "team-alpha"
pipelines: [traces/alpha]
- statement: route() where resource.attributes["tenant.id"] == "team-beta"
pipelines: [traces/beta]
default_pipelines: [traces/default]
processors:
batch:
timeout: 5s
exporters:
otlp/alpha:
endpoint: ${ALPHA_ENDPOINT:-tempo-alpha.svc:4317}
tls:
insecure: true
otlp/beta:
endpoint: ${BETA_ENDPOINT:-tempo-beta.svc:4317}
tls:
insecure: true
otlp/default:
endpoint: ${DEFAULT_ENDPOINT:-tempo-shared.svc:4317}
tls:
insecure: true
service:
pipelines:
# Ingestion pipeline: receives data and routes it
traces:
receivers: [otlp]
processors: [batch]
exporters: [routing] # Connector used as an exporter here
# Tenant-specific pipelines: connector feeds data in as a receiver
traces/alpha:
receivers: [routing] # Connector used as a receiver here
exporters: [otlp/alpha]
traces/beta:
receivers: [routing]
exporters: [otlp/beta]
traces/default:
receivers: [routing]
exporters: [otlp/default]
Notice how the routing connector appears as an exporter in the ingestion pipeline and a receiver in the tenant pipelines. This is the defining trait of connectors: they bridge pipelines. The OTTL route() statement gives you full access to resource attributes, span attributes, and metric data points for routing decisions.
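The routing decision itself is simple. A conceptual sketch, with the same tenant IDs and pipeline names as above:

```python
ROUTES = {
    "team-alpha": "traces/alpha",
    "team-beta": "traces/beta",
}

def route(resource_attributes: dict) -> str:
    """Pick a sub-pipeline from the tenant.id resource attribute."""
    return ROUTES.get(resource_attributes.get("tenant.id"), "traces/default")

assert route({"tenant.id": "team-alpha"}) == "traces/alpha"
assert route({"tenant.id": "team-gamma"}) == "traces/default"  # unknown tenant
assert route({}) == "traces/default"                           # attribute missing
```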
Essential Extensions for Debugging
Three extensions belong in every Collector deployment. They cost almost nothing to run and save hours when something goes wrong.
health_check
Exposes an HTTP endpoint (default :13133) that returns 200 OK when the Collector is running. Use it as a Kubernetes liveness probe and readiness probe.
pprof
Exposes Go's pprof profiling endpoints (default :1777). When the Collector is using too much CPU or memory, connect with go tool pprof http://localhost:1777/debug/pprof/heap to diagnose exactly where resources are going.
zpages
Provides an in-process web UI (default :55679) showing live pipeline status, recent traces processed by the Collector itself, and component-level stats. Visit /debug/tracez to see sampled internal traces and /debug/pipelinez for pipeline health.
extensions:
health_check:
endpoint: 0.0.0.0:13133
pprof:
endpoint: 0.0.0.0:1777
zpages:
endpoint: 0.0.0.0:55679
service:
extensions: [health_check, pprof, zpages]
# ... pipelines below
Run otelcol validate --config=config.yaml to catch syntax errors and invalid references without starting the Collector. Pair this with otelcol components to list all receivers, processors, and exporters compiled into your Collector binary — a common source of "unknown component" errors when using the wrong distribution.
Backends and Visualization: The Observability Stack
OpenTelemetry intentionally stops at data collection and export — it does not provide storage or querying. That boundary is where backends take over. Choosing the right backend for traces, metrics, and logs is one of the highest-impact architectural decisions you'll make, because it directly shapes query capabilities, operational cost, and long-term scalability.
The ecosystem breaks down cleanly along the three signal types, plus a growing category of all-in-one platforms that handle everything under a single umbrella.
mindmap
root((OTel Backends))
Traces
Jaeger
Grafana Tempo
Zipkin
Metrics
Prometheus
Mimir / Cortex / Thanos
VictoriaMetrics
Logs
Grafana Loki
Elasticsearch / OpenSearch
All-in-One
Grafana LGTM Stack
Datadog
Honeycomb
New Relic
Dynatrace
Splunk
Trace Backends
Traces are arguably the most transformative signal OTel produces — they show you request flow across service boundaries. Three backends dominate this space, each with a distinct philosophy.
Jaeger
Jaeger is a CNCF graduated project originally created at Uber. It offers native OTLP ingestion, a polished web UI for trace exploration, and support for multiple storage backends (Cassandra, Elasticsearch, Kafka, and an in-memory option for development). For dev environments or small-to-medium production workloads, Jaeger's all-in-one binary gets you running in seconds.
# Run Jaeger all-in-one with OTLP enabled (gRPC on 4317, HTTP on 4318)
docker run -d --name jaeger \
-p 16686:16686 \
-p 4317:4317 \
-p 4318:4318 \
jaegertracing/all-in-one:latest
Jaeger v2 (currently in development) is being rebuilt on top of the OpenTelemetry Collector, meaning Jaeger itself becomes an OTel Collector distribution with built-in storage. This is a strong signal of how deeply the ecosystem is converging around OTel.
Grafana Tempo
Tempo takes a radically different approach: it stores traces in object storage (S3, GCS, Azure Blob) with no indexing. This makes it extremely cost-efficient at scale — you pay object storage prices instead of database prices. The trade-off is that you need a trace ID to look up a trace directly, or you use TraceQL, Tempo's query language, to search by span attributes.
# Minimal Tempo config receiving OTLP
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
grpc:
endpoint: "0.0.0.0:4317"
storage:
trace:
backend: s3
s3:
bucket: my-tempo-traces
endpoint: s3.amazonaws.com
Tempo shines when paired with Grafana — you get trace visualization, TraceQL queries, and seamless links from metrics and logs to the exact traces that matter. For teams already on the Grafana ecosystem, Tempo is the natural choice.
Zipkin
Zipkin predates both Jaeger and OTel. It's still widely deployed and OTel Collectors can export to it natively. However, it lacks the modern query capabilities and ecosystem integration of Jaeger or Tempo. If you're starting fresh, Jaeger or Tempo are stronger choices — but if you already run Zipkin, OTel fits in without requiring a migration.
Metrics Backends
Metrics are the most mature observability signal, and the Prometheus ecosystem is the gravitational center of this space.
Prometheus
Prometheus is the de facto standard for Kubernetes metrics. It uses a pull-based model — the Prometheus server scrapes HTTP endpoints at regular intervals. Its query language, PromQL, is expressive and widely supported across dashboarding tools. OTel integrates with Prometheus in two ways: the Collector can expose a Prometheus scrape endpoint, or it can remote-write metrics directly to Prometheus-compatible backends.
# OTel Collector exporter: Prometheus scrape endpoint
exporters:
prometheus:
endpoint: "0.0.0.0:8889"
namespace: myapp
send_timestamps: true
resource_to_telemetry_conversion:
enabled: true
Standalone Prometheus works well for single-cluster deployments. Its main limitation is horizontal scalability — a single Prometheus server has a ceiling on the number of time series it can handle.
Thanos, Cortex, and Mimir
When Prometheus hits its scaling limits, you move to a horizontally scalable layer. Thanos adds long-term storage and multi-cluster federation on top of existing Prometheus instances. Cortex was the first fully distributed Prometheus-compatible backend. Grafana Mimir is Cortex's successor — it ingests Prometheus metrics via remote write, stores them in object storage, and scales to billions of active series. All three are PromQL-compatible, so your dashboards and alerts don't change.
VictoriaMetrics
VictoriaMetrics is a Prometheus-compatible time-series database focused on performance and storage efficiency. It typically uses 5-10x less disk and RAM than Prometheus for the same dataset. It supports PromQL (with extensions via MetricsQL), remote write ingestion from OTel Collectors, and both single-node and clustered deployments. It's a strong option if cost-efficiency is your primary concern.
Log Backends
Logs are the highest-volume signal, which makes storage cost and query performance the defining factors in backend choice.
Grafana Loki
Loki indexes only metadata labels (like service.name, severity), not the log content itself. This makes it dramatically cheaper to run than full-text search engines. You query logs with LogQL, which filters by labels first and then applies regex or pattern matching on log lines. OTel Collectors export to Loki natively via the loki or otlphttp exporter.
# OTel Collector exporter: sending logs to Loki via OTLP
exporters:
otlphttp:
endpoint: "http://loki:3100/otlp"
service:
pipelines:
logs:
receivers: [otlp]
processors: [batch]
exporters: [otlphttp]
Elasticsearch and OpenSearch
Elasticsearch (and its open-source fork OpenSearch) provides full-text search with inverted indexes across all log content. This gives you the most powerful querying — you can search any substring, aggregate on any field, and build complex boolean queries. The cost is significantly higher resource consumption: CPU, memory, and disk all scale with ingest volume. For teams that need deep ad-hoc log analysis or already run the ELK stack, Elasticsearch remains a solid choice.
All-in-One Platforms
Rather than assembling separate backends for each signal, many teams opt for a unified platform. These come in two flavors: open-source stacks you self-host, and commercial SaaS products.
The Grafana LGTM Stack
The "LGTM" stack — Loki (logs), Grafana (visualization), Tempo (traces), Mimir (metrics) — is the leading open-source all-in-one approach. Every component uses object storage, speaks OTLP natively, and is designed to work together. Grafana ties them all into a single pane of glass with cross-signal correlation.
| Platform | Strengths | Best For |
|---|---|---|
| Grafana LGTM | Open-source, cost-efficient, full OTel support, cross-signal correlation | Teams wanting control and cost savings |
| Datadog | Full-featured SaaS, auto-instrumentation, unified UI, 700+ integrations | Teams prioritizing ease of use over cost |
| Honeycomb | High-cardinality query engine, BubbleUp analysis, trace-first approach | Debugging complex distributed systems |
| New Relic | Generous free tier, full-stack observability, AIOps features | Startups and teams needing broad coverage |
| Dynatrace | AI-driven root cause analysis (Davis AI), automatic topology mapping | Enterprise environments with complex infrastructure |
| Splunk | Powerful log analytics (SPL), strong security/compliance features | Organizations with security and compliance requirements |
Every major commercial platform now accepts OTLP natively. This means you can instrument with OTel once and switch backends later without touching application code. Vendor lock-in shifts from the instrumentation layer to the query and dashboard layer — a much easier migration.
The Grafana Visualization Layer
Regardless of which backends you choose, Grafana has become the default visualization layer for OTel-based observability. It supports every backend mentioned above as a data source, and its real power lies in cross-signal correlation — the ability to jump between traces, metrics, and logs in a single investigation.
Dashboards and Explore
Grafana dashboards are built from panels, each querying a specific data source with PromQL, LogQL, TraceQL, or other query languages. The Explore view is where ad-hoc investigation happens — it gives you a split-pane interface where you can run queries against different data sources side by side, follow trace waterfalls, and drill into log lines.
Cross-Signal Navigation
The most powerful feature of the Grafana stack is seamless navigation between signals. Grafana supports three key cross-signal patterns:
- Trace-to-Logs: Click a span in a trace and jump directly to the logs emitted during that span's execution, filtered by trace ID and time range.
- Trace-to-Metrics: From a trace, navigate to the relevant service's RED metrics (rate, errors, duration) for the time window around the request.
- Exemplars: Prometheus metrics can carry exemplar data — a sample trace ID attached to a specific metric data point. When you see a latency spike on a dashboard, you click the exemplar dot and land on the exact trace that caused it.
Exemplars are the single most impactful cross-signal feature to set up. A latency percentile that says "p99 is 2.3s" is useful — but a clickable link from that data point to the exact slow trace is transformative for debugging. Enable exemplar storage in Prometheus or Mimir, and ensure your OTel SDK is attaching trace IDs to metric exports.
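Conceptually, an exemplar is just a measurement that carries the trace ID which produced it. A minimal sketch (not the actual OTel data model; the trace IDs here are made up):

```python
from dataclasses import dataclass

@dataclass
class Exemplar:
    value: float      # observed latency in seconds
    trace_id: str     # links this data point to the exact trace
    timestamp: float  # unix seconds

def worst_exemplar(exemplars):
    """Pick the exemplar a dashboard would surface for a latency spike."""
    return max(exemplars, key=lambda e: e.value)

spike = worst_exemplar([
    Exemplar(0.12, "a1b2c3d4e5f60718", 1700000000.0),
    Exemplar(2.30, "b2c3d4e5f6a7b8c9", 1700000005.0),
])
# spike.trace_id identifies the slow trace to open in the trace backend
```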
This cross-signal workflow — from a dashboard metric anomaly, to exemplar traces, to correlated logs — collapses investigation time from hours of manual correlation to minutes of guided navigation. It's the reason the "LGTM" stack has gained such rapid adoption: the backends are good individually, but the connected experience through Grafana is what makes the whole greater than the sum of its parts.
Semantic Conventions, Resource Detection, and Contrib Packages
OpenTelemetry generates telemetry data — spans, metrics, logs — but that data is only useful if different teams, services, and tools can agree on what the attribute names and values mean. Without a shared vocabulary, one service might call the HTTP method http.method, another might call it request_method, and a third might use httpMethod. Dashboards, alerts, and correlation queries break down immediately.
This is the problem semantic conventions solve. They are a standardized, versioned set of attribute names, types, and allowed values maintained by the OTel specification. When every instrumentation library uses the same attributes, your observability backend can provide meaningful out-of-the-box dashboards without custom mappings.
Key Convention Groups
Semantic conventions are organized by signal domain. Each group defines the attributes that describe a particular kind of operation. Here are the most important ones you'll encounter daily.
HTTP Conventions
HTTP is the most widely instrumented protocol. The conventions cover both client and server spans, capturing the request method, URL, status code, and more.
| Attribute | Type | Example Value | Description |
|---|---|---|---|
| http.request.method | string | GET | HTTP request method (uppercase) |
| url.full | string | https://api.example.com/users?page=2 | Full request URL including scheme, host, path, query |
| http.response.status_code | int | 200 | HTTP response status code |
| url.path | string | /users | The path component of the URL |
| http.route | string | /users/:id | The matched route template (low cardinality) |
| server.address | string | api.example.com | Server domain name or IP |
A typical instrumented HTTP server span in Python looks like this:
# These attributes are set automatically by OTel HTTP instrumentation.
# You rarely set them manually — this shows what the span contains.
span.set_attribute("http.request.method", "GET")
span.set_attribute("url.full", "https://api.example.com/users?page=2")
span.set_attribute("url.path", "/users")
span.set_attribute("http.route", "/users")
span.set_attribute("http.response.status_code", 200)
span.set_attribute("server.address", "api.example.com")
Database Conventions
Database spans capture what system you're talking to, the operation being performed, and (optionally) the raw statement. These conventions let you correlate slow queries across services without knowing which ORM or driver each team uses.
| Attribute | Type | Example Value | Description |
|---|---|---|---|
| db.system | string | postgresql | Database management system identifier |
| db.statement | string | SELECT * FROM users WHERE id = $1 | The database statement (sanitized) |
| db.operation | string | SELECT | The name of the operation (e.g., SQL verb) |
| db.name | string | myapp_production | Name of the database being accessed |
| server.address | string | db-primary.internal | Database server hostname |
| server.port | int | 5432 | Database server port |
Messaging Conventions
For systems like Kafka, RabbitMQ, or SQS, messaging conventions capture the broker, the operation type, and the destination. This is essential for tracing asynchronous workflows across publish/subscribe boundaries.
| Attribute | Type | Example Value | Description |
|---|---|---|---|
| messaging.system | string | kafka | Messaging system identifier |
| messaging.operation | string | publish | Type of operation: publish, receive, process |
| messaging.destination.name | string | order-events | Topic or queue name |
| messaging.message.id | string | msg-abc-123 | Unique message identifier |
RPC Conventions
RPC conventions cover gRPC, Thrift, and other RPC frameworks. They capture the system, service, and method being invoked, making it straightforward to see which remote procedure calls are slow or failing.
| Attribute | Type | Example Value | Description |
|---|---|---|---|
| rpc.system | string | grpc | RPC system (grpc, thrift, etc.) |
| rpc.service | string | UserService | Full name of the RPC service |
| rpc.method | string | GetUser | Name of the RPC method |
| rpc.grpc.status_code | int | 0 | gRPC numeric status code (0 = OK) |
Resource Conventions
While span attributes describe individual operations, resource attributes describe the entity producing the telemetry — your service, the host it runs on, the container, the Kubernetes pod, the cloud account. Resources are attached once at SDK initialization and applied to every span, metric, and log emitted by that SDK instance.
| Attribute | Category | Example Value | Description |
|---|---|---|---|
| service.name | Service | payment-api | Logical name of the service (the most important resource attribute) |
| service.version | Service | 2.4.1 | Version of the deployed service |
| telemetry.sdk.language | SDK | python | Language of the OTel SDK |
| telemetry.sdk.version | SDK | 1.24.0 | Version of the OTel SDK |
| cloud.provider | Cloud | aws | Cloud provider (aws, gcp, azure) |
| cloud.region | Cloud | us-east-1 | Cloud region |
| k8s.pod.name | Kubernetes | payment-api-7b9f5d-x2k4q | Name of the Kubernetes pod |
| k8s.namespace.name | Kubernetes | production | Kubernetes namespace |
| host.name | Host | ip-10-0-1-42 | Hostname of the machine |
| container.id | Container | a3bf... | Container ID (from cgroup or runtime) |
If you set only one resource attribute, make it service.name. Without it, most backends default to unknown_service, making your telemetry nearly impossible to filter or route. Set it via the OTEL_SERVICE_NAME environment variable or in your SDK configuration.
Convention Migration: Old → New
Semantic conventions are versioned and evolve over time. A significant migration happened between the "pre-stable" and "stable" HTTP conventions. If you've worked with OTel before 2023, you may have seen the old attribute names. Libraries are actively migrating, and many SDKs now emit both old and new attributes during the transition period.
| Old Attribute (deprecated) | New Attribute (stable) | Notes |
|---|---|---|
| http.method | http.request.method | Namespaced under request/response |
| http.url | url.full | URL attributes moved to a standalone url.* namespace |
| http.status_code | http.response.status_code | Clarifies it's a response attribute |
| http.target | url.path + url.query | Split into separate attributes for clarity |
| net.peer.name | server.address | Simplified and moved to server.* namespace |
| net.peer.port | server.port | Simplified and moved to server.* namespace |
Many instrumentation libraries currently emit both old and new attribute names, so your backend may receive duplicated data under different keys. Update your dashboards and alert queries to use the new names, and use the OTEL_SEMCONV_STABILITY_OPT_IN environment variable to control what the SDK emits: leave it unset for the old names only, set it to http for the stable names only, or http/dup to emit both during the transition.
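A tiny sketch of the rename mapping, as a backend or Collector transform might apply it during the transition. These are the one-to-one entries from the table above; http.target is omitted because it splits into two keys:

```python
OLD_TO_NEW = {
    "http.method": "http.request.method",
    "http.url": "url.full",
    "http.status_code": "http.response.status_code",
    "net.peer.name": "server.address",
    "net.peer.port": "server.port",
}

def migrate_attributes(attrs: dict) -> dict:
    """Rewrite deprecated attribute keys to their stable replacements."""
    return {OLD_TO_NEW.get(key, key): value for key, value in attrs.items()}

migrated = migrate_attributes({"http.method": "GET", "http.status_code": 200})
assert migrated == {"http.request.method": "GET",
                    "http.response.status_code": 200}
```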
Resource Detection
Manually setting every resource attribute — host name, container ID, cloud region, Kubernetes pod — would be tedious and fragile. Resource detectors solve this by automatically discovering the runtime environment at startup and populating resource attributes for you.
Each OTel SDK ships with built-in detectors, and the contrib ecosystem provides additional ones. A detector queries its environment (file system, metadata endpoints, environment variables) and returns a set of resource attributes.
from opentelemetry.sdk.resources import Resource, get_aggregated_resources
from opentelemetry.resource.detector.azure import AzureVMResourceDetector
from opentelemetry.resource.detector.container import ContainerResourceDetector
# Detectors run at startup and merge their results
resource = get_aggregated_resources(
detectors=[
ContainerResourceDetector(),
AzureVMResourceDetector(),
],
initial_resource=Resource.create({
"service.name": "payment-api",
"service.version": "2.4.1",
}),
)
# resource now contains service.name, container.id,
# cloud.provider, cloud.region, host.id, etc.
Common resource detectors available across language SDKs include:
| Detector | Attributes Populated | How It Works |
|---|---|---|
| Process | process.pid, process.executable.name, process.runtime.name | Reads process info from the OS |
| Host | host.name, host.id, host.arch | Reads hostname and machine ID |
| OS | os.type, os.description | Reads OS release info |
| Container | container.id | Parses /proc/self/cgroup or CRI-O files |
| Kubernetes | k8s.pod.name, k8s.namespace.name, k8s.node.name | Reads the Downward API env vars or metadata |
| AWS (EC2/ECS/EKS) | cloud.provider, cloud.region, aws.ecs.task.arn | Queries IMDS or ECS metadata endpoint |
| GCP | cloud.provider, cloud.region, gcp.project.id | Queries GCP metadata server |
In many setups you don't configure detectors manually at all; SDK distributions expose an environment variable for selecting them. Python's auto-instrumentation reads OTEL_EXPERIMENTAL_RESOURCE_DETECTORS, and Node.js reads OTEL_NODE_RESOURCE_DETECTORS. For example, OTEL_NODE_RESOURCE_DETECTORS=env,host,os,process,container enables a standard set for containerized workloads.
The Contrib Ecosystem
The core OTel SDK is deliberately minimal — it provides the API, SDK, and OTLP exporter, but does not instrument specific libraries. That's the job of the contrib ecosystem: a collection of community-maintained packages that provide automatic instrumentation for popular libraries and frameworks.
Each contrib package follows a naming convention: opentelemetry-instrumentation-{library}. For example:
# Python examples
pip install opentelemetry-instrumentation-flask
pip install opentelemetry-instrumentation-django
pip install opentelemetry-instrumentation-requests
pip install opentelemetry-instrumentation-sqlalchemy
pip install opentelemetry-instrumentation-celery
pip install opentelemetry-instrumentation-redis
# Node.js examples
npm install @opentelemetry/instrumentation-http
npm install @opentelemetry/instrumentation-express
npm install @opentelemetry/instrumentation-pg
npm install @opentelemetry/instrumentation-redis
These packages monkey-patch or wrap library internals to create spans, record metrics, and propagate context automatically. Once installed and activated, a Flask or Express instrumentation creates server spans for every incoming HTTP request — with all the correct semantic convention attributes — without you writing any tracing code.
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
app = Flask(__name__)
# One line: every route now produces spans with
# http.request.method, url.path, http.route, http.response.status_code
FlaskInstrumentor().instrument_app(app)
@app.route("/users/<int:user_id>")
def get_user(user_id):
return {"id": user_id, "name": "Alice"}
The OTel Registry
With hundreds of contrib packages across multiple languages, finding the right one can be overwhelming. The OpenTelemetry Registry is the official catalog. It's a searchable directory of instrumentation packages, exporters, resource detectors, and other components — across all supported languages.
When evaluating a contrib package from the registry, consider these factors:
| Factor | What to Check | Why It Matters |
|---|---|---|
| Maturity | Look for stable vs alpha vs beta labels | Stable packages have locked APIs; alpha packages may have breaking changes |
| Supported versions | Check which versions of the target library are supported | A Django instrumentor that only supports Django 3.x won't help on 5.x |
| Semantic convention compliance | Does it use the latest stable attribute names? | Old conventions lead to fragmented dashboards |
| Signal coverage | Does it produce traces only, or also metrics and logs? | Full signal coverage means richer observability |
| Maintenance activity | Check the GitHub repo for recent commits and issue response | Stale packages accumulate bugs and security vulnerabilities |
Most languages offer a meta-package that bundles all stable instrumentations. In Python, opentelemetry-bootstrap -a install detects your installed libraries and installs matching instrumentors automatically. In Node.js, @opentelemetry/auto-instrumentations-node bundles the common ones. Use these to get started quickly, then trim down to only what you need for production.
SLIs, SLOs, and SLAs: Observability Meets Reliability
Collecting telemetry data is only valuable if it helps you answer a single question: is the service healthy for its users? SLIs, SLOs, and SLAs form a hierarchy that translates raw observability signals into reliability contracts your team (and your customers) can reason about.
The Hierarchy: SLI → SLO → SLA
These three concepts build on each other in a strict chain. Each layer adds more context and more consequences to the one below it.
| Concept | Definition | Example | Owned By |
|---|---|---|---|
| SLI (Service Level Indicator) | A quantitative measure of one dimension of service behavior | Proportion of HTTP requests completing in < 200ms | Engineering |
| SLO (Service Level Objective) | A target value (or range) for an SLI, measured over a time window | 99.9% of requests under 200ms over a rolling 30-day window | Engineering + Product |
| SLA (Service Level Agreement) | An SLO with contractual, business consequences for missing it | "99.9% availability or customer receives service credits" | Business / Legal |
If your SLA promises 99.9% availability, set your internal SLO at 99.95%. This gives your team a buffer to detect and fix problems before breaching the contractual agreement. An SLO violation should trigger engineering action; an SLA violation means your business is already paying the price.
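The arithmetic behind that buffer, as a quick check:

```python
def allowed_downtime_minutes(target: float, days: int = 30) -> float:
    """Error budget expressed as minutes of full downtime per window."""
    return (1 - target) * days * 24 * 60

sla_budget = allowed_downtime_minutes(0.999)   # contractual SLA
slo_budget = allowed_downtime_minutes(0.9995)  # stricter internal SLO
# The gap between 43.2 and 21.6 minutes is your reaction buffer:
# the internal SLO trips well before the contractual SLA is breached.
assert round(sla_budget, 1) == 43.2
assert round(slo_budget, 1) == 21.6
```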
How OTel Data Feeds SLI Calculation
OpenTelemetry is the collection layer — it produces the raw metrics and traces that SLIs are computed from. The OTel SDK instruments your application, emitting histograms (for latency), counters (for requests and errors), and traces (for per-request context). These signals flow through the OTel Collector and land in a metrics backend like Prometheus, where you write queries to derive SLI values.
graph LR
A["OTel SDK
(latency histogram,
error counter)"] --> B["OTel Collector"]
B --> C["Prometheus
(metrics storage)"]
C --> D["PromQL Queries
(derive SLIs)"]
D --> E["SLO Recording Rules
(evaluate SLI vs target)"]
E --> F{"Burn Rate
Exceeds
Threshold?"}
F -->|Yes| G["Alert Fires 🔔"]
G --> H["Incident Response"]
F -->|No| I["Budget OK ✅"]
Common SLI Types
Not every metric is an SLI. Good SLIs measure something the user directly experiences. Here are the five most common types, along with how you'd express each one from OTel-generated metrics.
Availability
The ratio of successful requests to total requests. This is the most fundamental SLI — if the service is down, nothing else matters. A request is "successful" if it returns a non-5xx response.
# Availability SLI: ratio of non-5xx requests, as a 5-minute rate
sum(rate(http_server_request_duration_seconds_count{http_response_status_code!~"5.."}[5m]))
/
sum(rate(http_server_request_duration_seconds_count[5m]))
Latency
The proportion of requests faster than a given threshold. OTel exports latency as a histogram (http_server_request_duration_seconds), so you use histogram_quantile for percentiles or bucket ratios for SLI-style "good/total" fractions.
# Latency SLI: proportion of requests completing in < 200ms
sum(rate(http_server_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_server_request_duration_seconds_bucket{le="+Inf"}[5m]))
Error Rate, Throughput, and Correctness
| SLI Type | Formula | Typical Source |
|---|---|---|
| Error Rate | errors / total requests | http_server_request_duration_seconds_count filtered by status code |
| Throughput | requests per second above a minimum baseline | rate(http_server_request_duration_seconds_count[5m]) |
| Correctness | valid responses / total responses (requires app-level validation) | Custom OTel counter, e.g., app_response_valid_total |
Correctness is the hardest to instrument because it requires domain knowledge — the service might return a 200 OK with wrong data. You typically add a custom OTel counter that increments when a response passes validation checks.
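To make the bookkeeping concrete, here is a minimal sketch of a correctness SLI. Plain Python counters stand in for OTel counter instruments, and passes_validation is a hypothetical domain check (here: a response must carry a positive price):

```python
# Plain dict counters stand in for OTel counter instruments;
# passes_validation is a hypothetical domain-specific check.
counters = {"app_response_valid_total": 0, "app_response_total": 0}

def passes_validation(response):
    # Example rule: a quote response must carry a positive price
    return response.get("price", 0) > 0

def record_response(response):
    counters["app_response_total"] += 1
    if passes_validation(response):
        counters["app_response_valid_total"] += 1

record_response({"price": 19.99})   # valid
record_response({"price": 0})       # 200 OK, but wrong data
correctness = counters["app_response_valid_total"] / counters["app_response_total"]
print(correctness)  # 0.5
```

In a real service the two increments would go to OTel counters, and the ratio would be computed in your metrics backend rather than in process.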
Error Budgets
An error budget flips the reliability conversation from "minimize all failure" to "how much failure can we tolerate?" If your SLO is 99.9% over 30 days, your error budget is the remaining 0.1% — roughly 43 minutes of downtime or the equivalent number of failed requests.
Error Budget = 1 - SLO target
SLO = 99.9% → Error Budget = 0.1%
Over 30 days → 0.001 × 30 × 24 × 60 = 43.2 minutes of allowed downtime
If you've already consumed 30 minutes this month, you have 13.2 minutes left.
Spending it faster than expected → burn rate alert fires.
When the budget is healthy, teams ship features aggressively. When the budget is nearly exhausted, the team shifts to reliability work. This creates a natural, data-driven balance between velocity and stability.
Multi-Window, Multi-Burn-Rate Alerting
Naive SLO alerting ("fire if SLI drops below target") creates either too many false positives or alerts too late. Google's SRE book introduced multi-window, multi-burn-rate alerting to solve this. The idea: measure how fast you're consuming your error budget (the "burn rate") over multiple time windows, and alert only when both a short window and a long window agree the burn is real.
What Is Burn Rate?
A burn rate of 1x means you'll exactly exhaust your 30-day error budget in 30 days. A burn rate of 14.4x means you'll burn through the entire budget in about 2 days. Higher burn rates demand faster response.
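The relationship is simple division over the SLO window. A small sketch (hypothetical days_to_exhaustion helper, 30-day window):

```python
def days_to_exhaustion(burn_rate, slo_window_days=30):
    # A burn rate of 1x spends the budget in exactly one SLO window.
    return slo_window_days / burn_rate

print(days_to_exhaustion(1.0))    # 30.0
print(days_to_exhaustion(14.4))   # ~2.08, the critical paging tier
```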
| Severity | Burn Rate | Long Window | Short Window | Budget Consumed | Response |
|---|---|---|---|---|---|
| Page (critical) | 14.4x | 1 hour | 5 minutes | 2% in 1 hour | Immediate |
| Page (high) | 6x | 6 hours | 30 minutes | 5% in 6 hours | Within 30 min |
| Ticket (medium) | 3x | 1 day | 2 hours | 10% in 1 day | Next business day |
| Ticket (low) | 1x | 3 days | 6 hours | 10% in 3 days | Sprint planning |
Both the long window and the short window must exceed their threshold for the alert to fire. The short window acts as a reset: if a brief spike is already over, the short window drops below threshold and suppresses the alert.
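The two-window AND condition can be sketched as a tiny predicate (hypothetical should_page helper, assuming a 99.9% SLO so the budget is 0.001):

```python
def should_page(error_ratio_long, error_ratio_short, burn_rate, budget=0.001):
    # Fire only when BOTH windows exceed burn_rate * budget.
    threshold = burn_rate * budget
    return error_ratio_long > threshold and error_ratio_short > threshold

# Sustained 2% error ratio trips the 14.4x tier in both windows:
print(should_page(0.02, 0.02, 14.4))    # True
# A spike that already ended: short window recovered, alert suppressed:
print(should_page(0.02, 0.001, 14.4))   # False
```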
Prometheus Recording and Alerting Rules
You define the SLI as a Prometheus recording rule, then write alerting rules that compare burn rates across windows. Here's a practical example for a latency SLO (99.9% of requests under 200ms over 30 days):
# prometheus-rules.yaml
groups:
  - name: slo-latency-rules
    rules:
      # --- Recording rules: pre-compute the SLI ratio ---
      - record: sli:latency:good_rate5m
        expr: |
          sum(rate(http_server_request_duration_seconds_bucket{le="0.2"}[5m]))
      - record: sli:latency:total_rate5m
        expr: |
          sum(rate(http_server_request_duration_seconds_bucket{le="+Inf"}[5m]))
      # --- Error ratios for each alerting window ---
      # Short windows (5m, 30m) act as the reset check; long windows
      # (1h, 6h) confirm the burn is sustained.
      - record: sli:latency:error_ratio_5m
        expr: |
          1 - (
            sum(increase(http_server_request_duration_seconds_bucket{le="0.2"}[5m]))
            /
            sum(increase(http_server_request_duration_seconds_bucket{le="+Inf"}[5m]))
          )
      - record: sli:latency:error_ratio_30m
        expr: |
          1 - (
            sum(increase(http_server_request_duration_seconds_bucket{le="0.2"}[30m]))
            /
            sum(increase(http_server_request_duration_seconds_bucket{le="+Inf"}[30m]))
          )
      - record: sli:latency:error_ratio_1h
        expr: |
          1 - (
            sum(increase(http_server_request_duration_seconds_bucket{le="0.2"}[1h]))
            /
            sum(increase(http_server_request_duration_seconds_bucket{le="+Inf"}[1h]))
          )
      - record: sli:latency:error_ratio_6h
        expr: |
          1 - (
            sum(increase(http_server_request_duration_seconds_bucket{le="0.2"}[6h]))
            /
            sum(increase(http_server_request_duration_seconds_bucket{le="+Inf"}[6h]))
          )
  - name: slo-latency-alerts
    rules:
      # --- Critical: 14.4x burn rate over 1h AND 5m ---
      - alert: LatencySLOCriticalBurnRate
        expr: |
          sli:latency:error_ratio_1h > (14.4 * 0.001)
          and
          sli:latency:error_ratio_5m > (14.4 * 0.001)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Latency SLO burn rate critical (14.4x)"
          description: "Consuming error budget at 14.4x. Will exhaust 30-day budget in ~2 days."
      # --- High: 6x burn rate over 6h AND 30m ---
      - alert: LatencySLOHighBurnRate
        expr: |
          sli:latency:error_ratio_6h > (6 * 0.001)
          and
          sli:latency:error_ratio_30m > (6 * 0.001)
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Latency SLO burn rate high (6x)"
          description: "Consuming error budget at 6x. Will exhaust 30-day budget in ~5 days."
Each burn-rate window requires a separate PromQL query. Without recording rules, an alert evaluation with 4 severity tiers × 2 windows = 8 expensive histogram queries every evaluation cycle. Pre-compute the error ratios as recording rules so your alerting rules are cheap comparisons.
Grafana Dashboard for SLO Tracking
A well-designed SLO dashboard tells the story at a glance: Are we meeting our target? How much budget remains? Is the current burn rate dangerous? Below are the PromQL queries for the key panels; wire them into a Grafana dashboard or use them as a starting template.
Panel 1: SLI Over Time (Time Series)
# Query A: Current SLI (plot as time series)
sli:latency:good_rate5m / sli:latency:total_rate5m
# Query B: SLO target (plot as constant threshold line)
vector(0.999)
Plot Query A as a line and Query B as a dashed red threshold line. When the SLI dips below the SLO line, you can visually see the moments that eat into your budget.
Panel 2: Error Budget Remaining (Gauge + Time Series)
# Error budget remaining (as a percentage of total budget)
# Uses 30-day window; adjust range to match your SLO period
1 - (
  (
    1 - (
      sum(increase(http_server_request_duration_seconds_bucket{le="0.2"}[30d]))
      /
      sum(increase(http_server_request_duration_seconds_bucket{le="+Inf"}[30d]))
    )
  ) / 0.001
)
Display this as a Gauge panel with thresholds: green above 50%, yellow between 20–50%, red below 20%. Add a companion time series panel showing how the remaining budget has trended over the past 30 days — a downward slope reveals a sustained burn.
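The same arithmetic as the panel query, as a sketch (hypothetical budget_remaining helper, 0.001 budget for a 99.9% SLO):

```python
def budget_remaining(error_ratio_window, budget=0.001):
    # Fraction of the error budget still unspent; negative means breached.
    return 1 - error_ratio_window / budget

print(budget_remaining(0.0005))   # 0.5, half the budget left
```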
Panel 3: Burn Rate (Stat Panel)
# Current burn rate (1h window)
# A value of 1.0 = consuming at exactly the sustainable rate
# A value of 14.4 = will exhaust 30-day budget in ~2 days
sli:latency:error_ratio_1h / 0.001
Use a Stat panel with color thresholds: green below 1x, yellow at 3x, orange at 6x, red at 14.4x. This single number tells an on-call engineer whether they need to act immediately.
A common mistake is setting a threshold alert like "fire if availability drops below 99.9%." This triggers on brief blips that consume negligible budget and misses slow, sustained degradations that quietly drain it. Burn-rate alerting catches both fast incidents and slow leaks while ignoring noise.
Putting It All Together
The full workflow forms a closed loop. Your OTel-instrumented services emit metrics. Prometheus stores them and evaluates recording rules that pre-compute SLI error ratios at multiple time windows. Alerting rules compare those ratios against burn-rate thresholds. When an alert fires, the on-call engineer opens the Grafana SLO dashboard to see the remaining error budget, the burn trajectory, and which SLI is degraded — then decides whether to roll back, scale up, or investigate further.
This system replaces gut-feel reliability with a quantitative framework. Your SLIs define what "healthy" means. Your SLOs set the bar. Your error budget tells you how much room you have. And your burn-rate alerts tell you when that room is shrinking too fast. Everything starts with the data OpenTelemetry collects.
Observability-Driven Development and Debugging Workflows
Most teams treat observability as an afterthought — they add log lines when a production incident is already underway, scrambling to understand behavior they never instrumented. Observability-Driven Development (ODD) flips this entirely: you instrument your code before you ship it, treating telemetry as a first-class artifact alongside your tests and documentation.
The core principle is simple. If you can't observe a behavior in production, you can't understand it, debug it, or improve it. ODD makes instrumentation part of the development loop, not a post-mortem reaction.
The Mindset Shift
Traditional development treats logging as a debugging tool — you sprinkle console.log or logger.info calls when something goes wrong, deploy a patched build, wait for the issue to recur, and hope you captured enough context. This is reactive and slow. ODD asks a different question: when this code runs in production, what do I need to see to understand its behavior?
| Reactive Approach | Observability-Driven Development |
|---|---|
| Add logging after an incident | Instrument during feature development |
| Telemetry covers known failure modes | Telemetry covers all meaningful behaviors |
| "We'll add metrics if this becomes a problem" | "We can't ship without metrics for this SLI" |
| Debugging requires new deploys to add context | Debugging uses existing traces, metrics, and span events |
| Tests verify correctness | Tests verify correctness and instrumentation |
In practice, this means every pull request that introduces a new endpoint, background job, or external call should also introduce the spans, metrics, and attributes needed to observe that code path. You review instrumentation in code review the same way you review error handling.
Test-Driven Development tells you what your code should do. Observability-Driven Development tells you what your code is actually doing in production. TDD guards against regressions in logic; ODD guards against regressions in performance, reliability, and user experience. Treat them as two sides of the same coin.
What "Instrument First" Looks Like
When you start building a new feature — say, a payment processing endpoint — ODD means you define your spans and metrics before writing the business logic. You think about what questions you'll ask in production: How long does payment authorization take? What's the failure rate by payment provider? Which step in the flow is slowest?
from opentelemetry import trace, metrics

tracer = trace.get_tracer("payments")
meter = metrics.get_meter("payments")

payment_duration = meter.create_histogram(
    "payment.process.duration",
    unit="ms",
    description="Time to process a payment end-to-end",
)
payment_counter = meter.create_counter(
    "payment.process.total",
    description="Total payment attempts by provider and outcome",
)

def process_payment(order):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.provider", order.provider)
        span.set_attribute("payment.amount", order.amount_cents)
        span.set_attribute("order.id", order.id)
        # Business logic goes here — instrumentation came first
        result = authorize_and_capture(order)
        span.set_attribute("payment.outcome", result.status)
        payment_counter.add(1, {"provider": order.provider, "outcome": result.status})
        payment_duration.record(result.duration_ms, {"provider": order.provider})
Notice the pattern: the span wraps the operation, attributes capture the dimensions you'll query by, and the metric records the SLI you'll alert on. The business logic (authorize_and_capture) is almost secondary — the observable scaffolding was there first.
The Debugging Loop
ODD pays off most visibly during incident response. When your system is well-instrumented, debugging follows a consistent, repeatable loop — from broad signal to precise root cause. Instead of guessing and grepping through log files, you systematically narrow scope using the three pillars of observability.
flowchart TD
A["🔔 Alert Fires\n(SLO burn rate exceeded)"] --> B["📊 Query Metrics\nIdentify affected SLI"]
B --> C["🔍 Pivot to Traces\nFind slow/errored requests"]
C --> D["🏷️ Filter by Attributes\nEndpoint, region, user segment"]
D --> E["📋 Drill into Single Trace\nInspect span waterfall"]
E --> F["📝 Read Span Events & Logs\nExceptions, retries, timeouts"]
F --> G["🎯 Identify Root Cause"]
G --> H["🔧 Fix & Deploy"]
H --> I["✅ Verify with Telemetry\nConfirm SLI recovery"]
I -->|"New alert?"| A
style A fill:#ff6b6b,stroke:#c92a2a,color:#fff
style G fill:#51cf66,stroke:#2b8a3e,color:#fff
style I fill:#339af0,stroke:#1864ab,color:#fff
Each step in this loop narrows the blast radius. You start with "something is wrong" and end with "this specific line of code, in this specific service, under these specific conditions, is the cause." The key enabler is rich, structured telemetry — without attributes on your spans, you can't filter; without span events, you can't see exceptions; without correlated trace IDs, you can't pivot from metrics to traces.
Workflow 1: Alert-Driven Investigation
This is the most common debugging workflow. An SLO burn-rate alert fires — say, your checkout service's error rate has spiked beyond the budget. Here's how you walk through it with proper instrumentation in place.
1. Scope the problem with metrics. Open your metrics dashboard and look at the SLI that triggered the alert. Query your error rate broken down by endpoint, status code, and region. You discover that POST /checkout in us-east-1 has a 15% error rate, up from the baseline 0.3%:

sum(rate(http_server_request_duration_seconds_count{
  http_response_status_code=~"5..", http_route="/checkout"
}[5m])) by (deployment_environment, cloud_region)
/
sum(rate(http_server_request_duration_seconds_count{
  http_route="/checkout"
}[5m])) by (deployment_environment, cloud_region)

2. Pivot to traces. Switch to your trace backend and search for traces from POST /checkout in us-east-1 that have error status. Use exemplars on your metric chart if your tooling supports them — clicking a data point takes you directly to a representative trace.

3. Find common attributes. Look at the errored traces as a group. Use your query tool to group by attributes: are they all hitting the same downstream service? Same payment provider? Same database shard? You find that 100% of errors have payment.provider=stripe and the error message is connection timeout.

4. Drill into a single trace. Pick one representative trace and open the span waterfall. You see the process_payment span takes 30 seconds (your timeout) and its child span HTTP POST stripe.com/v1/charges shows a network timeout. The span event attached to the exception includes the full stack trace.

5. Identify root cause and verify the fix. Stripe's status page confirms a regional outage in us-east-1. You enable your fallback payment provider, deploy, and watch the SLI recover on the same metrics dashboard that alerted you. The loop is closed.
Workflow 2: Customer-Reported Issue
A customer writes in: "My order from 2 hours ago still shows as processing." There's no alert because the overall SLI is healthy — this is a single-user issue. Without ODD, you'd be searching through log files with grep. With proper instrumentation, you have a direct path.
Search your trace backend for traces with user.id = "cust_8293" in the last 3 hours. You find their checkout trace, open it, and see the send_order_confirmation span failed with a serialization error. The span event shows the customer's address contained a Unicode character that broke your email template renderer. The fix is a one-line encoding change, and you can verify it by asking the customer to retry and watching the new trace complete cleanly.
# Trace search query — varies by backend, but attributes are universal
# Jaeger:    tag=user.id:cust_8293
# Tempo:     { span.user.id = "cust_8293" }
# Honeycomb: WHERE user.id = "cust_8293" LAST 3 HOURS

# What you see in the trace:
span: POST /checkout                 duration: 2.4s   status: OK
  span: validate_cart                duration: 45ms   status: OK
  span: process_payment              duration: 1.8s   status: OK
  span: send_order_confirmation      duration: 580ms  status: ERROR
    event: exception
      type: UnicodeEncodeError
      message: "'ascii' codec can't encode character '\xf1' in position 42"
This workflow is only possible if user.id is set as a span attribute. That's the ODD mindset — you add user.id during development because you know you'll need to search by it later, not because a support ticket demanded it.
Workflow 3: Canary Deployment Comparison
You've deployed a canary that serves 5% of traffic. Before promoting it, you need to compare its behavior against the baseline. This is where metrics and traces work in concert.
Query your latency histogram filtered by deployment.canary=true versus deployment.canary=false. The canary's p99 latency is 40% higher. Now pivot to traces: pull a sample of slow traces from the canary and compare them to baseline traces for the same endpoint. Using Jaeger's trace comparison feature or Honeycomb's BubbleUp, you see that canary traces spend extra time in a new validate_inventory span that makes a synchronous database call you didn't intend. The N+1 query pattern is obvious in the trace waterfall — each cart item triggers a separate database span.
Add deployment.version, deployment.canary, and deployment.id as resource attributes in your OTel SDK configuration. These propagate to every span and metric automatically, letting you slice all telemetry by deployment without changing application code.
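One way to do this is via the standard OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES environment variables, which every OTel SDK reads at startup. A sketch with hypothetical values:

```python
import os

# Hypothetical values; in practice these come from your deployment
# manifest or CI pipeline. The SDK reads them at startup and stamps
# the resulting resource attributes onto every span, metric, and log.
os.environ["OTEL_SERVICE_NAME"] = "checkout-service"
os.environ["OTEL_RESOURCE_ATTRIBUTES"] = (
    "deployment.version=2.4.1,deployment.canary=true,deployment.id=dep-7f3a"
)
```

In Kubernetes you would set the same variables in the pod spec's env section rather than in application code.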
Tooling for These Workflows
The workflows above are tool-agnostic, but the experience varies significantly depending on which observability platform you use. The common requirement is the ability to move fluidly between metrics, traces, and logs — and to filter by arbitrary attributes at each step.
| Tool | Strength | Best For |
|---|---|---|
| Grafana Explore | Unified query interface across Prometheus, Tempo, and Loki. "Trace to logs" and "trace to metrics" links let you pivot between signals in a single UI. | Teams already in the Grafana ecosystem who want metrics→traces→logs correlation without vendor lock-in. |
| Honeycomb | Query builder with BubbleUp automatically surfaces attributes that differ between slow and fast requests. Designed around high-cardinality trace analysis. | Investigating unknown-unknowns — when you don't know which attribute is the culprit and need the tool to surface it. |
| Jaeger | Trace comparison feature lets you diff two traces side-by-side. Lightweight, open-source, and OTel-native. | Canary vs. baseline comparisons, open-source-first teams, and environments where you need to self-host. |
No observability platform can surface insights from telemetry that doesn't exist. If your spans lack attributes, your metrics lack dimensions, or your logs lack correlation IDs, even the best query builder will return nothing useful. The investment in ODD is in your instrumentation — the tool just makes querying it comfortable.
Best Practices, Anti-Patterns, and Cost Management
Instrumenting your applications is only half the battle. The difference between an observability setup that scales gracefully and one that collapses under its own weight comes down to operational discipline. This section distills hard-won lessons from production deployments into concrete do's, don'ts, and cost-control strategies.
Best Practices
1. Always Set service.name and deployment.environment
These two resource attributes are the most important metadata you can attach to your telemetry. Without service.name, your traces and metrics land in your backend as anonymous noise. Without deployment.environment, you can't distinguish staging traffic from production incidents.
# otel-collector-config.yaml — resource processor
processors:
  resource:
    attributes:
      - key: service.name
        value: "checkout-service"
        action: upsert
      - key: deployment.environment
        value: "production"
        action: upsert
      - key: service.version
        value: "2.4.1"
        action: upsert
Set these via environment variables (OTEL_RESOURCE_ATTRIBUTES) or in the SDK resource configuration so they're present on every span, metric, and log record from the start.
2. Use Semantic Conventions Consistently
OpenTelemetry defines semantic conventions for common attribute names — http.request.method, db.system, rpc.service, and hundreds more. When every team invents their own names (httpMethod, method, request_type), you lose the ability to write cross-service queries and reuse dashboards. Stick to the conventions, and your tooling will reward you.
3. Batch Exports — Never Block the Request Path
Span and metric export should happen asynchronously in the background. The BatchSpanProcessor queues completed spans and flushes them in batches, keeping per-request overhead minimal. The SimpleSpanProcessor exports each span synchronously and exists only for debugging.
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Good: batched, async export
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317"),
    max_queue_size=2048,
    max_export_batch_size=512,
    schedule_delay_millis=5000,
)
provider.add_span_processor(processor)
4. Set Memory Limits on the Collector
An unbounded Collector will consume all available memory under traffic spikes and get OOM-killed. The memory_limiter processor is non-negotiable in any production pipeline. Place it as the first processor in your pipeline so it can apply backpressure before the queue grows unbounded.
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500       # Hard cap
    spike_limit_mib: 512  # Reserve for bursts
service:
  pipelines:
    traces:
      processors: [memory_limiter, batch]  # memory_limiter FIRST
5. Use Tail Sampling for Cost Control
Head sampling (deciding at trace creation) is simple but blind — it randomly discards slow requests and errors. Tail sampling waits until a trace is complete, then applies policies: keep all errors, keep traces over 2 seconds, sample the rest at 10%. This dramatically reduces storage costs while preserving every trace you'd actually investigate.
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline-sample
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
6. Correlate All Three Signals
The real power of OpenTelemetry is connecting traces, metrics, and logs into one coherent picture. Inject trace_id and span_id into every log line so you can jump from an error log straight to the trace. Use exemplars on metrics so a latency spike on a histogram links directly to an example slow trace.
import logging

from opentelemetry import trace

logger = logging.getLogger(__name__)

# Inject trace context into structured logs
span = trace.get_current_span()
ctx = span.get_span_context()
logger.info(
    "Payment processed",
    extra={
        "trace_id": format(ctx.trace_id, "032x"),
        "span_id": format(ctx.span_id, "016x"),
        "amount": 49.99,
    },
)
7. Start with Auto-Instrumentation, Add Manual Spans Incrementally
Auto-instrumentation libraries cover HTTP servers, database clients, gRPC, and message queues out of the box — often with zero code changes. Get this baseline deployed first. Then add manual spans only where you need business-level visibility: payment processing, recommendation ranking, or complex workflows that auto-instrumentation can't see inside.
A good rule of thumb: if you can answer "why was this request slow?" with your current spans, you have enough. Add manual spans only when the auto-instrumented trace leaves a gap you've actually hit during an investigation.
Anti-Patterns to Avoid
These are the patterns that seem reasonable at first but cause real operational pain at scale. Most of them are expensive to fix after the fact, so catching them early matters.
1. Cardinality Explosion in Metric Labels
This is the single most common way teams blow up their observability costs. Every unique combination of label values creates a separate time series. If you add user_id as a metric label and you have 1 million users, a single counter becomes 1 million time series. Your Prometheus instance runs out of memory, your vendor bill skyrockets, and queries grind to a halt.
| Label | Cardinality | Verdict |
|---|---|---|
http.request.method | ~7 (GET, POST, etc.) | ✅ Safe |
http.response.status_code | ~50 | ✅ Safe |
http.route | ~100 (bounded by routes) | ✅ Safe |
user.id | Millions | ❌ Never use as a metric label |
request.id | Unbounded | ❌ Never use as a metric label |
db.statement | Unbounded (raw SQL) | ❌ Use as a span attribute, not a label |
Put high-cardinality identifiers on span attributes (where they're invaluable for search) and keep metric labels to bounded, low-cardinality dimensions.
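The multiplication is worth internalizing: the total series count is the product of each label's cardinality. A quick sketch (hypothetical series_count helper, cardinalities taken from the table above):

```python
from math import prod

def series_count(label_cardinalities):
    # Every unique label-value combination is its own time series,
    # so cardinalities multiply.
    return prod(label_cardinalities)

# method (~7) x status code (~50) x route (~100): manageable
print(series_count([7, 50, 100]))             # 35000
# Adding user.id (1M users) multiplies everything by a million
print(series_count([7, 50, 100, 1_000_000]))  # 35000000000
```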
2. Over-Instrumentation — Spans for Every Function Call
Creating a span for every internal function turns a single API call into a trace with hundreds of spans. This inflates export volume, slows down the trace viewer, and buries the signal in noise. Spans should represent units of work with clear boundaries — an HTTP request, a database query, a queue publish — not internal method calls like validateInput() or formatResponse().
3. Ignoring Context Propagation
If you see traces that start and stop at service boundaries instead of flowing end-to-end, context propagation is broken. This usually means one of: the HTTP client library isn't injecting traceparent headers, a reverse proxy is stripping them, or a service is using the wrong propagator format. Always verify propagation works across your entire call chain during setup — not during an outage.
4. Dashboard Sprawl Without SLOs
Teams often respond to a new observability platform by building dozens of dashboards for every metric they can find. Within a month, nobody looks at them. Dashboards without a purpose are just noise. Start from Service Level Objectives — "99.5% of checkout requests complete in under 500ms" — and build the dashboards, alerts, and burn-rate calculations that directly serve those SLOs.
5. Sending Unsampled Traces at High Volume
A service handling 10,000 requests per second generates roughly 100,000+ spans per second when you include downstream calls. At 1 KB per span, that's ~100 MB/s of raw telemetry. Without sampling, you're paying to store data nobody will ever query. Apply head sampling at the SDK level for baseline reduction, and tail sampling at the Collector for intelligent retention.
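The arithmetic behind that estimate, with the per-request span fan-out as an assumption:

```python
rps = 10_000                 # requests per second
spans_per_request = 10       # assumed fan-out including downstream calls
bytes_per_span = 1_000       # ~1 KB of OTLP per span
mb_per_sec = rps * spans_per_request * bytes_per_span / 1_000_000
print(mb_per_sec)            # 100.0 MB/s before any sampling
```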
6. Not Setting Up Resource Detection
When you skip resource detectors, your telemetry lacks host, container, and cloud metadata. You can't filter by Kubernetes namespace, can't group by EC2 instance type, and can't correlate with infrastructure metrics. OTel SDKs and the Collector both offer resource detectors for AWS, GCP, Azure, and Kubernetes — enable them.
Anti-patterns 1 and 5 are the top two drivers of runaway observability costs. A single cardinality explosion or unsampled high-volume service can consume more budget than the rest of your infrastructure combined.
Cost Management Strategies
Observability costs scale with data volume. In a vendor-hosted model, you pay per GB ingested or per million spans; in a self-hosted model, you pay in storage, compute, and engineering time. Either way, controlling the volume of data you generate, transmit, and retain is the primary lever.
Estimate Data Volumes Before You Ship
Do the math before enabling tracing on a high-traffic service. A rough formula:
# Back-of-envelope data volume estimation
requests_per_sec=5000
avg_spans_per_trace=8
bytes_per_span=1000   # ~1 KB is typical for OTLP
sample_rate=0.10      # 10% sampling
daily_gib=$(echo "scale=1; $requests_per_sec * $avg_spans_per_trace * $bytes_per_span * $sample_rate * 86400 / 1073741824" | bc)
echo "Estimated daily volume: ${daily_gib} GiB"
# At 5K rps, 8 spans/trace, 10% sample → ~320 GiB/day
Use Collector Processors to Filter and Drop
The Collector pipeline is the ideal place to shed unnecessary data before it reaches your backend. Use the filter processor to drop health-check spans, the attributes processor to strip verbose attributes, and the transform processor for more complex logic.
processors:
  filter/drop-health:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
        - 'attributes["http.route"] == "/readyz"'
  attributes/strip-verbose:
    actions:
      - key: db.statement
        action: delete  # Remove raw SQL — use db.operation instead
      - key: http.request.header.authorization
        action: delete  # Never store auth headers
Aggregate Metrics Before Export
Instead of exporting every raw histogram observation, pre-aggregate at the SDK or Collector level. The metricstransform processor can combine and rename metrics, while the cumulativetodelta processor converts cumulative streams to delta temporality where your backend prefers it. If you're self-hosting Prometheus, the remote_write path with recording rules reduces the cardinality that hits long-term storage.
Use Tiered Retention
Not all telemetry needs the same retention window. A practical tiered approach:
| Data Type | Retention | Rationale |
|---|---|---|
| Raw spans (sampled) | 7–14 days | Enough for active incident investigation |
| Error/slow traces | 30–90 days | Supports post-mortems and trend analysis |
| Aggregated metrics | 13 months | Year-over-year comparison for capacity planning |
| Raw logs | 7–30 days | Most logs are only useful during active debugging |
| SLO burn-rate metrics | 13 months | Tracks reliability trends over full error budget windows |
Cost management is iterative. Start by measuring your current ingest volume per service, identify the top 3 contributors, and apply targeted sampling or filtering there first. A 10% reduction on your noisiest service often saves more than optimizing everything else combined.
Production Deployment Patterns and Scaling OTel
Getting OpenTelemetry running locally is straightforward. Getting it running reliably in production — handling millions of spans per second, surviving node failures, and securing every hop — is a different challenge entirely. This section covers the deployment topologies, scaling strategies, and security configurations you need for real-world OTel infrastructure.
Kubernetes Deployment Architecture
The standard Kubernetes deployment uses a two-tier Collector architecture: lightweight agent Collectors on every node that forward to centralized gateway Collectors that handle aggregation, tail sampling, and export. The OTel Operator automates SDK injection into application pods via annotations.
flowchart LR
subgraph Node1["K8s Node 1"]
App1["App Pod\n(auto-injected OTel SDK)"]
App2["App Pod\n(auto-injected OTel SDK)"]
DS1["DaemonSet Collector\n(Agent)"]
App1 -->|OTLP| DS1
App2 -->|OTLP| DS1
end
subgraph Node2["K8s Node 2"]
App3["App Pod\n(auto-injected OTel SDK)"]
App4["App Pod\n(auto-injected OTel SDK)"]
DS2["DaemonSet Collector\n(Agent)"]
App3 -->|OTLP| DS2
App4 -->|OTLP| DS2
end
subgraph Gateway["Gateway Tier"]
GW1["Gateway Collector\n(Deployment replica 1)"]
GW2["Gateway Collector\n(Deployment replica 2)"]
end
DS1 -->|OTLP/gRPC| GW1
DS1 -->|OTLP/gRPC| GW2
DS2 -->|OTLP/gRPC| GW1
DS2 -->|OTLP/gRPC| GW2
GW1 --> Tempo["Tempo\n(Traces)"]
GW1 --> Mimir["Mimir\n(Metrics)"]
GW1 --> Loki["Loki\n(Logs)"]
GW2 --> Tempo
GW2 --> Mimir
GW2 --> Loki
Operator["OTel Operator"] -.->|inject sidecar/init-container| App1
Operator -.->|inject sidecar/init-container| App3
Agent Collectors run as a DaemonSet so every node has exactly one. They receive telemetry from local pods over OTLP, perform lightweight processing (batching, resource attribution), and forward to the gateway tier. Gateway Collectors run as a Deployment (or StatefulSet for stateful sampling) and handle the heavy work: tail sampling, span-to-metrics generation, and fan-out to multiple backends.
DaemonSet Collector (Agent Mode)
The agent Collector runs on every node. Its job is to receive telemetry cheaply, add node-level metadata, and forward everything to the gateway. Keep the agent pipeline lean — no tail sampling, no complex transformations.
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-agent
  template:
    metadata:
      labels:
        app: otel-agent
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.102.0
          args: ["--config=/conf/agent-config.yaml"]
          ports:
            - containerPort: 4317 # OTLP gRPC
              hostPort: 4317
            - containerPort: 4318 # OTLP HTTP
              hostPort: 4318
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          volumeMounts:
            - name: config
              mountPath: /conf
      volumes:
        - name: config
          configMap:
            name: otel-agent-config
```
The corresponding agent configuration keeps things minimal — receive, batch, and forward:
```yaml
# agent-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.node.name

exporters:
  otlp:
    endpoint: otel-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: false
      ca_file: /etc/tls/ca.crt

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp]
```
Gateway Collector (Deployment Mode)
Gateway Collectors handle aggregation, tail sampling, and export. They run as a Deployment so you can scale replicas independently of node count. For tail sampling, use a StatefulSet with the loadbalancing exporter on agents to route by trace_id — this ensures all spans for a given trace land on the same gateway instance.
```yaml
# gateway-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 512
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-policy
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
  batch:
    timeout: 10s
    send_batch_size: 2048

exporters:
  otlphttp/tempo:
    endpoint: https://tempo.observability.svc.cluster.local:4318
  prometheusremotewrite/mimir:
    endpoint: https://mimir.observability.svc.cluster.local/api/v1/push
  otlphttp/loki:
    endpoint: https://loki.observability.svc.cluster.local:3100/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlphttp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite/mimir]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/loki]
```
The tail_sampling processor needs all spans for a trace on the same Collector instance. Use the loadbalancing exporter on agents with routing_key: traceID so spans are consistently hashed to the same gateway. Without this, tail sampling decisions will be based on incomplete traces.
OTel Operator and Auto-Instrumentation
The OpenTelemetry Operator is a Kubernetes operator that manages Collector instances and injects auto-instrumentation into application pods. Instead of manually adding SDK dependencies and init code, you annotate your pods and the Operator handles injection via an init container.
```yaml
# Instrumentation CRD — tells the Operator how to inject
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: otel-instrumentation
  namespace: my-app
spec:
  exporter:
    endpoint: http://otel-agent.observability.svc.cluster.local:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.4.0
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.46b0
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.51.0
```
Then annotate your application pods to trigger injection:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
      annotations:
        # Pick ONE based on your runtime:
        instrumentation.opentelemetry.io/inject-java: "true"
        # instrumentation.opentelemetry.io/inject-python: "true"
        # instrumentation.opentelemetry.io/inject-nodejs: "true"
    spec:
      containers:
        - name: payment-service
          image: myregistry/payment-service:v2.1.0
```
Deploying with Helm
The official Helm chart supports both agent (DaemonSet) and gateway (Deployment) modes. You can deploy the full two-tier architecture with a single chart by overriding the right values:
```yaml
# values-agent.yaml (for DaemonSet agents)
mode: daemonset
image:
  repository: otel/opentelemetry-collector-contrib
  tag: 0.102.0
resources:
  limits:
    cpu: 500m
    memory: 512Mi
ports:
  otlp:
    enabled: true
    hostPort: 4317
config:
  exporters:
    otlp:
      endpoint: otel-gateway.observability:4317
  service:
    pipelines:
      traces:
        exporters: [otlp]
      metrics:
        exporters: [otlp]
      logs:
        exporters: [otlp]
```
```bash
# Install both tiers
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-agent open-telemetry/opentelemetry-collector \
  -f values-agent.yaml -n observability
helm install otel-gateway open-telemetry/opentelemetry-collector \
  -f values-gateway.yaml -n observability

# Install the Operator for auto-instrumentation
helm install otel-operator open-telemetry/opentelemetry-operator \
  -n observability --set admissionWebhooks.certManager.enabled=true
```
Non-Kubernetes Deployments
Not every workload runs on Kubernetes. For VMs and bare-metal servers, the Collector runs as a systemd service. For containerized but non-orchestrated workloads, a Docker sidecar pattern works well.
```ini
# /etc/systemd/system/otel-collector.service
[Unit]
Description=OpenTelemetry Collector
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=otel
Group=otel
ExecStart=/usr/local/bin/otelcol-contrib --config /etc/otel/config.yaml
Restart=always
RestartSec=5
MemoryMax=512M
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
```
Install the binary, drop the config in /etc/otel/config.yaml, then systemctl enable --now otel-collector. The MemoryMax directive acts as a system-level safety net alongside the memory_limiter processor.
```yaml
# docker-compose.yml
services:
  my-app:
    image: myregistry/my-app:latest
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
    depends_on:
      - otel-collector

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.102.0
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./collector-config.yaml:/etc/otel/config.yaml:ro
    ports:
      - "4317:4317"
      - "4318:4318"
    deploy:
      resources:
        limits:
          memory: 512M
```
The sidecar pattern co-locates the Collector with your application. The app sends to otel-collector:4317 over Docker's internal network — no host port exposure needed for the data path.
Scaling Gateway Collectors
Horizontal scaling of gateway Collectors requires careful thought, especially when tail sampling is involved. The key constraint: all spans for a single trace must reach the same gateway instance, or the sampler makes decisions on incomplete data.
Load Balancing by Trace ID
Configure the loadbalancing exporter on your agent Collectors. It uses consistent hashing on traceID to route spans to specific gateway backends, discovered via DNS or Kubernetes services:
```yaml
# Agent exporter config for trace-aware load balancing
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: false
          ca_file: /etc/tls/ca.crt
    resolver:
      dns:
        hostname: otel-gateway-headless.observability.svc.cluster.local
        port: 4317
```
Use a headless Service (no ClusterIP) for the gateway so DNS returns individual pod IPs. The loadbalancing exporter resolves them and hashes traceID to pick a target. When gateway pods scale up or down, the resolver detects the change and redistributes.
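The headless Service is a one-line change from a normal Service: set `clusterIP: None` so cluster DNS returns the individual pod IPs instead of a single virtual IP. A minimal sketch, assuming the gateway pods carry the label `app: otel-gateway` (adjust names and labels to match your deployment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: otel-gateway-headless
  namespace: observability
spec:
  clusterIP: None # headless: DNS returns per-pod IPs for the resolver to hash against
  selector:
    app: otel-gateway
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
```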
Kafka as a Buffer
At very high throughput (100k+ spans/second), place Kafka or Pulsar between agents and gateways. This decouples producers from consumers, absorbs traffic spikes, and provides replay capability if a gateway goes down.
```yaml
# Agent side: export to Kafka
exporters:
  kafka:
    brokers: ["kafka-0:9092", "kafka-1:9092", "kafka-2:9092"]
    topic: otel-traces
    encoding: otlp_proto
    producer:
      max_message_bytes: 10000000
      compression: zstd
---
# Gateway side: consume from Kafka
receivers:
  kafka:
    brokers: ["kafka-0:9092", "kafka-1:9092", "kafka-2:9092"]
    topic: otel-traces
    encoding: otlp_proto
    group_id: otel-gateway
    initial_offset: latest
```
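Note that Kafka replaces the loadbalancing exporter as the trace-routing mechanism: if the gateways behind Kafka still run tail sampling, all spans for a trace must land in one partition so a single consumer sees the complete trace. The contrib kafka exporter has a setting for this; a sketch (verify the option against your Collector version):

```yaml
exporters:
  kafka:
    brokers: ["kafka-0:9092", "kafka-1:9092", "kafka-2:9092"]
    topic: otel-traces
    encoding: otlp_proto
    partition_traces_by_id: true # hash trace_id to a partition so traces stay together
```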
Backpressure Handling
The memory_limiter processor is your first line of defense against OOM kills. It should be the first processor in every pipeline. When memory exceeds the soft limit, it starts refusing data; when it drops below, it resumes. Combine this with retry policies on exporters:
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500       # Hard limit
    spike_limit_mib: 512  # Soft limit = limit_mib - spike_limit_mib; refusal starts here

exporters:
  otlphttp/tempo:
    endpoint: https://tempo:4318
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
```
Always place memory_limiter first in the processor list. If it comes after a processor that buffers data (like batch), memory can spike past the limit before the limiter gets a chance to act. The correct order is: memory_limiter → your processors → batch.
Security Configuration
Telemetry data often contains sensitive information — HTTP headers, database query parameters, user IDs. Every link in the telemetry pipeline needs encryption and authentication.
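Encryption protects data in transit, but sensitive values should ideally never leave the Collector at all. One option is scrubbing them in the pipeline with the contrib `attributes` processor; a minimal sketch (the attribute keys shown are illustrative, so adjust them to what your instrumentation actually emits, and add the processor to the relevant pipelines before `batch`):

```yaml
processors:
  attributes/scrub:
    actions:
      - key: http.request.header.authorization
        action: delete # drop auth headers outright
      - key: db.statement
        action: hash # keep a correlatable value, hide the raw query
```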
TLS and mTLS
Configure TLS on both the receiver (server) and exporter (client) side. For mTLS between services and the Collector, each party presents a certificate and verifies the other's:
```yaml
# Receiver with mTLS (server side)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/tls/server.crt
          key_file: /etc/tls/server.key
          client_ca_file: /etc/tls/ca.crt # Verify client certs

# Exporter with mTLS (client side)
exporters:
  otlp:
    endpoint: gateway.observability:4317
    tls:
      cert_file: /etc/tls/client.crt
      key_file: /etc/tls/client.key
      ca_file: /etc/tls/ca.crt
```
Authentication and RBAC
Use the bearertokenauth or oidc extensions to authenticate exporters against backends. For multi-tenant setups, the headers_setter extension can inject tenant-specific headers:
```yaml
extensions:
  bearertokenauth:
    token: "${env:OTEL_AUTH_TOKEN}"
  basicauth/server:
    htpasswd:
      inline: |
        agent-user:$2y$10$hashed_password_here

# Protect the receiver with basic auth
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        auth:
          authenticator: basicauth/server

# Authenticate to the backend with bearer token
exporters:
  otlphttp/tempo:
    endpoint: https://tempo.example.com:4318
    auth:
      authenticator: bearertokenauth

service:
  extensions: [bearertokenauth, basicauth/server]
```
On the Kubernetes side, apply RBAC to the Collector's ServiceAccount. The agent needs read access to pod metadata (for k8sattributes), but nothing more:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-agent
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces", "nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-agent
subjects:
  - kind: ServiceAccount
    name: otel-agent
    namespace: observability
roleRef:
  kind: ClusterRole
  name: otel-agent
  apiGroup: rbac.authorization.k8s.io
```
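The binding references a ServiceAccount that must exist, and the agent pods must actually run under it. A minimal sketch:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-agent
  namespace: observability
```

Then set `serviceAccountName: otel-agent` in the agent DaemonSet's pod spec; without it, the pods run as the namespace default account and the k8sattributes processor cannot look up pod metadata.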
Migration Strategy: Running OTel Alongside Existing Tools
Most teams don't greenfield their observability stack. You already have Prometheus scraping metrics, Jaeger agents collecting traces, and Fluentd shipping logs. The migration to OTel should be incremental — not a flag day.
Phase 1: Dual-Write with OTel Collector
Deploy the OTel Collector as a sidecar or gateway that receives from your existing agents and from new OTel-instrumented services. Use the Collector's multi-exporter capability to write to both old and new backends simultaneously:
```yaml
# Dual-write config: accept Jaeger + OTLP, export to both backends
receivers:
  jaeger:
    protocols:
      thrift_http:
        endpoint: 0.0.0.0:14268
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  prometheus:
    config:
      scrape_configs:
        - job_name: 'existing-services'
          kubernetes_sd_configs:
            - role: pod

processors:
  batch:

exporters:
  # Old backends (keep during transition). Note: the dedicated jaeger exporter
  # was removed from collector-contrib (v0.86+); Jaeger v1.35+ ingests OTLP
  # natively, so send OTLP to the Jaeger collector instead.
  otlp/jaeger:
    endpoint: jaeger-collector.tracing:4317
    tls:
      insecure: true # internal cluster traffic; enable TLS where required
  prometheusremotewrite/old:
    endpoint: http://old-prometheus:9090/api/v1/write
  # New backends
  otlphttp/tempo:
    endpoint: http://tempo:4318
  prometheusremotewrite/mimir:
    endpoint: http://mimir:9009/api/v1/push

service:
  pipelines:
    traces:
      receivers: [jaeger, otlp]
      processors: [batch]
      exporters: [otlp/jaeger, otlphttp/tempo] # dual-write
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite/old, prometheusremotewrite/mimir]
```
Phased Rollout Plan
| Phase | What Changes | Duration | Rollback Plan |
|---|---|---|---|
| 1 — Collector deploy | Deploy OTel Collector alongside existing agents. No app changes. | 1–2 weeks | Remove Collector — zero app impact |
| 2 — Dual-write | Route existing agent output through OTel Collector. Write to both old and new backends. | 2–4 weeks | Revert agent configs to point directly at old backends |
| 3 — Instrument new services | New services use OTel SDK / auto-instrumentation. Existing services unchanged. | Ongoing | New services fall back to legacy agents |
| 4 — Migrate existing services | Replace Jaeger/Prometheus client libs with OTel SDK, service by service. | 4–12 weeks | Per-service rollback via feature flags or revert |
| 5 — Decommission legacy | Remove old agents, stop dual-writing, decommission old backends. | 1–2 weeks | Re-enable dual-write if gaps are found |
During phases 2–4, build dashboards that compare data from old and new backends side by side. Look for discrepancies in trace counts, metric values, and log volumes. Only move to the next phase when the numbers converge. This is your safety net — don't skip it.
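One way to check convergence is a small script that pulls the same aggregate (say, request counts per service) from both backends and flags series that are missing on one side or whose relative difference exceeds a tolerance. A sketch of the comparison logic only; fetching the values from the Prometheus and Mimir HTTP APIs is left out, and the function name and sample data are illustrative:

```python
def find_discrepancies(old: dict, new: dict, tolerance: float = 0.05) -> dict:
    """Compare per-series values from the old and new backends.

    Returns {series: (old_value, new_value)} for series missing on one
    side or whose relative difference exceeds `tolerance`.
    """
    bad = {}
    for series in old.keys() | new.keys():
        a, b = old.get(series), new.get(series)
        if a is None or b is None:
            bad[series] = (a, b)  # present on only one side
            continue
        denom = max(abs(a), abs(b), 1e-9)  # guard against division by zero
        if abs(a - b) / denom > tolerance:
            bad[series] = (a, b)
    return bad


# Example: request counts scraped from both backends for three services.
# "payment" diverges by 14% and "cart" is missing from the new backend,
# so both are flagged; "checkout" is within 5% and passes.
old_counts = {"checkout": 10000, "payment": 5000, "cart": 800}
new_counts = {"checkout": 10050, "payment": 4300}
print(find_discrepancies(old_counts, new_counts))
```

Run it per phase gate; an empty result for every aggregate you care about is the signal that it is safe to move on.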