Observability — OpenTelemetry (OTel) and Related Technologies

Prerequisites: Familiarity with distributed systems concepts (microservices, HTTP/gRPC APIs, containers). Basic experience with at least one of Python, Node.js, or Go. Understanding of what a web server, database call, and message queue are. No prior observability tooling experience required.

What Is Observability and Why It Matters

Observability is the ability to understand a system's internal state by examining its external outputs — logs, metrics, and traces. The concept originates from control theory, formalized by Hungarian-American engineer Rudolf Kálmán in 1960: a system is observable if you can fully determine its internal state from its outputs alone. Applied to software, this means you shouldn't need to deploy new code or attach a debugger to understand what's happening inside your services.

This is a fundamentally different posture than traditional debugging. Debugging is reactive — something breaks, you form a hypothesis, reproduce the issue, and poke around until you find the cause. Observability is proactive. It lets you ask arbitrary, novel questions about your system's behavior without having anticipated them in advance. As Charity Majors, co-founder of Honeycomb, puts it:

"Observability is about being able to ask arbitrary questions about your production environment without having to know ahead of time what you wanted to ask."

— Charity Majors

The Mindmap of Observability

Before diving deeper, here's a high-level view of observability — what it is, why it matters, how it works, and the benefits it delivers.

mindmap
  root((Observability))
    Why
      Unknown unknowns
      Distributed complexity
      Ephemeral infrastructure
      Polyglot systems
    What
      External outputs
        Logs
        Metrics
        Traces
      Reveal internal state
    How
      Three pillars
      High-cardinality data
      High-dimensionality data
      Correlation across signals
    Benefits
      Faster MTTR
      Proactive debugging
      SLO-driven operations
      Ask novel questions
    

Why Observability Became Essential

A decade ago, monitoring was often enough. You had a handful of monolithic applications running on known servers. You could SSH into a box, tail a log file, and reason about what went wrong. The set of failure modes was relatively small, and you could write an alert for each one.

That world no longer exists. Modern systems are built from dozens or hundreds of microservices, running in ephemeral containers on platforms like Kubernetes. Services are written in different languages. A single user request may fan out across ten services, three databases, a message queue, and an external API — each with its own failure modes, latencies, and retry logic. The Google SRE book captures this shift well:

"Monitoring a complex distributed system is fundamentally different from monitoring a single machine or a small collection of machines."

Site Reliability Engineering, Google (O'Reilly, 2016)

The shift to microservices, containers, and cloud-native architectures created an explosion of distributed state that traditional monitoring simply cannot handle. Here's why:

  • Microservices: A request's fate is determined by the interaction of many independent services, not a single process.
  • Ephemeral containers: The server you want to debug may have already been replaced. There's no box to SSH into.
  • Polyglot systems: Teams use different languages, frameworks, and data stores — each emitting different signals in different formats.
  • Dynamic orchestration: Kubernetes reschedules pods, autoscalers add or remove instances, and load balancers shift traffic constantly.

The "Unknown Unknowns" Problem

Traditional monitoring is built around known failure modes. You know the database might run out of connections, so you alert on the connection pool. You know the disk might fill up, so you alert at 90% capacity. These are known unknowns — you don't know when they'll happen, but you know what to watch for.

The real danger in distributed systems is the unknown unknowns — failures you never anticipated. A subtle interaction between a retry storm in Service A, a slow garbage collection pause in Service B, and a connection pool exhaustion in Service C produces a cascading failure that no single alert would catch. You can't write dashboards for problems you've never imagined.

Monitoring vs. Observability

Monitoring tells you when something is broken. Observability helps you understand why — even for failure modes you never predicted. Monitoring is a subset of observability, not a replacement for it.

Ben Sigelman, co-creator of Dapper (Google's distributed tracing system) and co-founder of LightStep, frames the distinction sharply:

"Monitoring is for known-unknowns. Observability is for unknown-unknowns."

— Ben Sigelman

Observability vs. Debugging: A Comparison

| Dimension | Traditional Debugging | Observability |
| --- | --- | --- |
| Posture | Reactive — triggered by an incident | Proactive — continuous understanding |
| Prerequisite | Reproduce the issue locally or in staging | Interrogate production data directly |
| Failure types | Known failure modes with predefined alerts | Novel, emergent failures (unknown unknowns) |
| Data model | Aggregated metrics, static dashboards | High-cardinality, high-dimensionality events |
| Iteration speed | Minutes to hours (deploy, reproduce, inspect) | Seconds (slice, dice, correlate live data) |

High-Cardinality, High-Dimensionality Data

Observability depends on a specific kind of data. It's not enough to know that your p99 latency spiked. You need to know which users, hitting which endpoints, from which regions, using which build version, experienced that latency spike. This is where cardinality and dimensionality come in.

What these terms mean

  • High cardinality refers to the number of unique values a field can take. A user_id field with millions of unique values is high-cardinality. A status_code field with five possible values (200, 301, 400, 404, 500) is low-cardinality. Observability requires you to slice data by high-cardinality fields — individual user IDs, request IDs, container IDs, trace IDs — to isolate specific behaviors.
  • High dimensionality refers to the number of fields (dimensions) attached to each event. A single request event might carry 50+ attributes: user ID, endpoint, HTTP method, status code, latency, region, deployment version, feature flags, database query count, and more. The more dimensions you capture, the more "angles" from which you can interrogate your system.

Traditional monitoring tools struggle with high-cardinality data because they rely on pre-aggregation. They compute averages and percentiles ahead of time, which destroys the ability to drill down into individual events. Observability tools store raw, high-cardinality, high-dimensionality events and let you query them on the fly.

Think in Events, Not Averages

When instrumenting your services, capture wide, structured events rather than narrow, pre-aggregated metrics. Every request event should carry as much context as possible — user ID, feature flags, build SHA, region. You can always aggregate later; you can't disaggregate data that was averaged at write time.
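As a sketch of what "capture wide events" can mean in practice, the helper below emits one JSON event per request with many high-cardinality fields attached. The function name and every field value here are illustrative, not from any particular library:

```python
import json
import time
import uuid

def emit_request_event(**fields):
    """Emit one wide, structured event per request (hypothetical helper).

    Everything is captured at write time; aggregation happens at query time.
    """
    event = {
        "timestamp": time.time(),
        "request_id": str(uuid.uuid4()),  # high-cardinality by design
        **fields,
    }
    print(json.dumps(event))
    return event

event = emit_request_event(
    endpoint="/checkout",
    user_id="u-1029384",          # high-cardinality: millions of values
    region="sa-east-1",
    build_sha="9f8c2d1",          # illustrative deployment version
    feature_flags=["new_cart"],
    duration_ms=412,
)
```

Because each event keeps its raw fields, any aggregate — p99 by region, error rate by build — can still be computed later at query time.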

The Bottom Line

Observability isn't a product you buy or a dashboard you build — it's a property of your system. A system is observable when its telemetry data is rich enough and queryable enough that engineers can answer novel questions about production behavior without shipping new code. In the chapters ahead, you'll learn about the specific signals (traces, metrics, logs) and the tooling (OpenTelemetry) that make this possible.

Observability vs Traditional Monitoring: A Paradigm Shift

Monitoring and observability are often used interchangeably, but they represent fundamentally different philosophies about understanding production systems. The distinction matters because it changes how you design instrumentation, how you respond to incidents, and ultimately how well you can reason about complex distributed systems.

Traditional monitoring asks: "Is the system working?" Observability asks: "Why is the system not working for this specific user, this specific request, at this specific moment?" That difference — from aggregate health checks to fine-grained exploratory analysis — is the paradigm shift.

Known-Knowns vs Unknown-Unknowns

Traditional monitoring is built around known-knowns. You anticipate failure modes in advance, set up dashboards for them, and configure threshold-based alerts. CPU above 90%? Alert. Disk at 95%? Alert. Error rate above 1%? Page someone. This works when you can predict what will go wrong.

The problem is that modern distributed systems fail in ways you cannot predict. A single request might traverse dozens of microservices, hit multiple caches and databases, and pass through load balancers, service meshes, and API gateways. When a user in São Paulo reports that checkout is slow on Tuesdays at 3pm, no pre-built dashboard will show you why.

Observability addresses unknown-unknowns — the failures you didn't anticipate. Instead of pre-defining what questions you can ask, you instrument your systems to emit rich, high-cardinality telemetry data. Then you explore that data interactively, slicing and dicing by any dimension: user ID, request ID, geographic region, feature flag, tenant, cart size — whatever the investigation demands.

High Cardinality Is the Key

Cardinality refers to the number of unique values a field can have. A status_code field (200, 404, 500) is low-cardinality. A user_id field with millions of unique values is high-cardinality. Traditional monitoring tools choke on high-cardinality data because they pre-aggregate metrics. Observability platforms are designed to handle it — and that's precisely what enables you to debug individual user experiences.

The Key Differences

The following table captures the core distinctions between the two approaches. These aren't just aesthetic differences — they shape tooling choices, team workflows, and incident response times.

| Dimension | Traditional Monitoring | Observability |
| --- | --- | --- |
| Core question | "Is the system up?" | "Why is it broken for this request?" |
| Failure model | Known-knowns (anticipated failures) | Unknown-unknowns (novel failures) |
| Data model | Pre-aggregated metrics, low-cardinality tags | High-cardinality events, traces, structured logs |
| Query style | Pre-built dashboards, static queries | Ad-hoc exploratory queries, drill-down |
| Alert philosophy | Threshold-based (CPU > 90%, errors > 1%) | SLO-based (error budget burn rate) |
| Debugging workflow | Check dashboard → scan logs → guess | Start from symptom → slice by dimensions → find root cause |
| Instrumentation | Agent-based, black-box, per-component | Code-level, white-box, request-scoped |
| Cost driver | Number of hosts/services monitored | Volume and cardinality of telemetry data |
| Scalability | Works well for monoliths and small clusters | Designed for distributed microservice architectures |
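The "error budget burn rate" alert philosophy can be made concrete with a short sketch: a burn rate compares the current error rate against the error budget implied by the SLO target. The function and the numbers below are illustrative, not from this text:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    A burn rate of 1.0 spends the budget exactly over the SLO window;
    values well above 1.0 are what SLO-based alerts page on.
    """
    budget = 1.0 - slo_target   # e.g. a 99.9% SLO leaves a 0.1% budget
    return error_rate / budget

# 99.9% availability SLO, currently serving 1% errors:
print(round(burn_rate(0.01, 0.999), 6))  # -> 10.0: budget burning 10x too fast
```

Multi-window burn-rate alerting — paging only when both a short and a long window exceed a threshold — is the common refinement of this idea.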

The Evolution: SNMP to Modern Observability

The shift didn't happen overnight. Monitoring tooling has evolved through distinct generations, each responding to the limitations of the previous era and the growing complexity of infrastructure.

Generation 1: SNMP and Network Monitoring (1990s)

Simple Network Management Protocol (SNMP) gave operators a way to poll network devices for health data — interface traffic, error counts, uptime. Tools like MRTG generated graphs from SNMP data. This was sufficient when "infrastructure" meant routers, switches, and a handful of servers. The data model was entirely device-centric: you monitored boxes, not applications or user experiences.

Generation 2: Nagios and Check-Based Monitoring (2000s)

Nagios introduced the concept of service checks — small scripts that test whether something is working and return OK, WARNING, or CRITICAL. This was a leap forward because it let you monitor application-level concerns (is the web server responding? is the database accepting connections?). But Nagios was fundamentally binary: things were either green or red. It had no concept of trends, no way to answer "is this getting worse?", and scaling it to hundreds of services required heroic configuration management.

Generation 3: Graphite, StatsD, and Metrics Aggregation (2010s)

The rise of StatsD (born at Etsy in 2011) and Graphite gave engineers the ability to emit custom metrics from application code. For the first time, you could track business-level signals — orders per minute, payment processing latency, cache hit ratios — alongside infrastructure metrics. This era also brought time-series databases (InfluxDB, Prometheus) and sophisticated dashboarding (Grafana). The limitation? Everything was still pre-aggregated. You could see average request latency, but not the latency of a specific request. You could see error rates by endpoint, but not by individual user.

Generation 4: Modern Observability Platforms (2018–present)

Modern observability platforms (Honeycomb, Jaeger, Tempo, the OpenTelemetry ecosystem) store individual events and traces rather than pre-aggregated metrics. This preserves the full context of each request, enabling you to GROUP BY any field, filter to any dimension, and follow a single request across your entire distributed system. The three pillars — traces, metrics, and logs — are correlated, so you can jump from a latency spike on a dashboard directly into the specific traces that caused it.

The Dashboard-Driven Trap

If your incident response starts with "let me check the dashboards," you're likely operating in a monitoring mindset rather than an observability mindset. Dashboards aren't inherently bad, but over-reliance on them creates three distinct failure modes:

1. Alert Fatigue

Threshold-based alerts on pre-aggregated metrics produce noise. A brief CPU spike at 3am that self-resolves, a momentary uptick in 5xx errors during a deploy — these generate pages that train on-call engineers to ignore alerts. Once a team fields more than a handful of pages per on-call shift, engineers start treating every alert as low-priority. The signal drowns in noise.

2. Dashboard Rot

Dashboards proliferate over time. Someone creates one during an incident, another team clones it with modifications, a third version gets built for a quarterly review. Within a year, you have dozens of dashboards — many showing stale metrics for services that have been renamed or decommissioned. Nobody knows which dashboard to trust during an incident, so engineers waste critical minutes hunting for the "right" one.

3. The Inability to Ask New Questions

This is the most fundamental limitation. A dashboard can only answer the questions it was designed to answer. If your checkout-latency dashboard breaks down by endpoint and region, you cannot suddenly ask "what's the latency for users with more than 50 items in their cart?" without modifying instrumentation, adding a new metric, deploying it, and waiting for data to accumulate. In an observability-first system, that query is immediate — because the raw event data already contains the cart-size attribute.

Dashboards Still Have a Place

Don't throw away your dashboards. They're valuable for situational awareness — a quick glance at system health. The mistake is treating them as your primary debugging tool. Use dashboards to detect that something is wrong; use observability tooling to figure out what and why.

From Reactive Firefighting to Proactive Understanding

The most impactful change observability enables isn't technical — it's cultural. With traditional monitoring, the incident lifecycle looks like this: alert fires → engineer opens dashboards → engineer scans logs → engineer forms a hypothesis → engineer deploys a fix → hope it works. Each step is reactive, and the debugging process relies heavily on tribal knowledge ("oh, when that dashboard looks like that, it usually means the database connection pool is exhausted").

With proper observability, the workflow shifts. You start from a symptom — a spike in error budget burn rate, a slow trace, a user report — and explore the telemetry data interactively. You don't need to know in advance which service is at fault. You slice the data by attributes, compare baselines, and follow the evidence. This is closer to scientific investigation than pattern matching.

More importantly, observability enables proactive work. Because you can query your telemetry data freely, you can ask questions before incidents happen: "Are there any endpoints where p99 latency has been creeping up over the past week?" or "Which tenants are seeing the highest error rates, even if overall error rates look healthy?" These questions surface problems before they become pages — and that's the real paradigm shift.

Start With One Service

You don't need to rip out all your monitoring and replace it overnight. Start by adding rich, high-cardinality instrumentation (using OpenTelemetry) to one critical service. Practice exploratory debugging with that data. As the team builds muscle memory for the observability workflow, expand to other services incrementally.

The Three Pillars: Logs, Metrics, and Traces

Observability rests on three complementary signal types — logs, metrics, and traces. Each one captures a different facet of what your system is doing at any given moment. Understanding what each pillar does well (and what it doesn't) is the key to building systems you can actually debug under pressure.

graph LR
    U["👤 User Request"] --> GW["API Gateway"]
    GW --> SVC["Application Service"]

    SVC --> L["📝 Logs
Discrete event records"]
    SVC --> M["📊 Metrics
Numeric aggregates"]
    SVC --> T["🔗 Traces
Distributed spans"]
    L --> OB["Observability Backend
(Correlation & Analysis)"]
    M --> OB
    T --> OB
    OB --> D["Dashboards & Alerts"]
    OB --> I["Investigation & Root Cause"]
    style L fill:#2d6a4f,stroke:#40916c,color:#fff
    style M fill:#e76f51,stroke:#f4a261,color:#fff
    style T fill:#457b9d,stroke:#a8dadc,color:#fff
    style OB fill:#6c567b,stroke:#c9b1d0,color:#fff

Logs: The Narrative Record

A log is a discrete, timestamped record of something that happened — a request arrived, a database query ran, an error was thrown. Logs are the oldest observability signal and the one most developers reach for first. They give you the why behind a problem: the full error message, the malformed input, the stack trace.

Logs come in two flavors: unstructured (free-form text) and structured (key-value pairs, typically JSON). Structured logs are dramatically easier to search and aggregate, which is why every modern logging library defaults to them.

text
2024-03-15 14:32:07 ERROR PaymentService - Failed to charge card ending 4242 for order #8812: gateway timeout after 30s
json
{
  "timestamp": "2024-03-15T14:32:07.341Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "span-789",
  "order_id": "8812",
  "card_last4": "4242",
  "error": "gateway_timeout",
  "duration_ms": 30004,
  "message": "Failed to charge card: gateway timeout"
}

Notice the structured version includes a trace_id and span_id. These fields are what let you correlate a log entry with a specific distributed trace — a connection that becomes critical during incident investigation.
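A structured-logging setup that carries trace context might look like the following stdlib-only sketch. In a real service the trace_id and span_id would come from the active span (e.g. via OpenTelemetry's trace.get_current_span()) rather than being passed by hand as they are here:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object carrying trace context."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "payment-service",
            # In a real service these come from the active span; here they
            # are supplied explicitly via the `extra` argument below.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
        })

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.warning(
    "Connection pool exhausted",
    extra={"trace_id": "abc123def456", "span_id": "span-789"},
)
```

With the trace_id present on every line, jumping from a slow trace to the matching log entries is a single search.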

Standard Severity Levels

Most logging frameworks follow a common severity hierarchy. The OpenTelemetry specification defines severity numbers that map onto these familiar levels:

| Level | When to Use | Example |
| --- | --- | --- |
| TRACE / DEBUG | Development-time detail; disabled in production | SQL query text, request/response bodies |
| INFO | Normal operational events worth recording | Server started, order placed, user logged in |
| WARN | Unexpected but recoverable conditions | Retry succeeded on second attempt, cache miss |
| ERROR | Failures that affect a single operation | Payment declined, upstream 5xx response |
| FATAL | Unrecoverable failures; process is shutting down | Out of memory, configuration missing |

Metrics: The Quantitative Pulse

Metrics are numeric measurements aggregated over time. Unlike logs (one record per event), metrics compress thousands of events into a single number: "there were 1,247 requests in the last minute" or "p99 latency is 340ms." This compression makes metrics extremely cheap to store and fast to query — perfect for dashboards and alerting.

There are three fundamental metric types you'll encounter everywhere:

| Type | What It Measures | Examples |
| --- | --- | --- |
| Counter | Monotonically increasing total | Total requests served, total errors, bytes transferred |
| Gauge | Point-in-time value that goes up or down | Current CPU usage, active connections, queue depth |
| Histogram | Distribution of values across buckets | Request latency distribution, response size distribution |

Here's what emitting these looks like in practice with the OpenTelemetry SDK:

python
from opentelemetry import metrics

meter = metrics.get_meter("checkout-service")

# Counter — total number of checkout attempts
checkout_counter = meter.create_counter(
    name="checkout.attempts",
    description="Total checkout attempts",
)

# Histogram — latency distribution of checkout operations
checkout_duration = meter.create_histogram(
    name="checkout.duration",
    unit="ms",
    description="Time taken to complete checkout",
)

# Gauge-like — items currently in the processing queue
# (modeled as an UpDownCounter; OTel's true Gauge is the
# observable, callback-based variant)
queue_depth = meter.create_up_down_counter(
    name="checkout.queue_depth",
    description="Items waiting in the checkout queue",
)
Metrics are for the "what" — not the "why"

A spike in your checkout.duration histogram tells you something is slow, but it won't tell you which service or which database query is the bottleneck. That's where traces come in.

Traces: The Distributed Story

A distributed trace follows a single request as it moves across service boundaries — from the frontend to the API gateway, through the order service, into the payment provider, and back. Each segment of work is called a span, and spans are linked together by a shared trace ID to form a tree that represents causality and timing.

Traces answer the question "where did the time go?" With a single trace, you can see that a checkout request took 3.2 seconds total, and 2.8 seconds of that was spent waiting for the payment gateway.

python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def process_checkout(order):
    with tracer.start_as_current_span("process_checkout") as span:
        span.set_attribute("order.id", order.id)
        span.set_attribute("order.total", order.total)

        # Child span for inventory check
        with tracer.start_as_current_span("check_inventory"):
            inventory_ok = inventory_service.check(order.items)

        # Child span for payment processing
        with tracer.start_as_current_span("charge_payment") as pay_span:
            pay_span.set_attribute("payment.method", order.payment_method)
            result = payment_service.charge(order)

        return result

This code produces a trace with three spans: a parent process_checkout span and two children (check_inventory and charge_payment). In a trace visualization tool like Jaeger or Grafana Tempo, you'd see these as a waterfall, instantly revealing which operation consumed the most time.

How the Pillars Complement Each Other

No single signal type is sufficient on its own. Each pillar answers a different question during an incident, and the real power comes from using them together:

| Signal | Question Answered | Strength | Limitation |
| --- | --- | --- | --- |
| Metrics | "Is something wrong?" | Cheap, fast, great for alerting | No detail about individual requests |
| Traces | "Where is the problem?" | Shows causality across services | Typically sampled; doesn't capture every request |
| Logs | "Why did this happen?" | Rich detail, full context | Expensive at scale, hard to aggregate |

The investigation flow is almost always the same: metrics fire an alert → you use traces to pinpoint the slow or failing service → you read the logs from that service to understand the root cause. This is sometimes summarized as: metrics tell you something is wrong, traces show you where, logs tell you why.

Practical Example: Diagnosing a Slow Checkout

Let's walk through a realistic incident to see how all three pillars work together. Imagine you run an e-commerce platform, and customers start complaining that checkout is slow.

  1. Metrics alert fires

    Your dashboard shows that checkout.duration p99 has jumped from 400ms to 3,200ms. The checkout.attempts counter is steady, so traffic hasn't spiked — the slowness is internal. You also notice payment_service.error_rate has climbed from 0.1% to 12%.

  2. Traces reveal the bottleneck

    You open your tracing UI and filter for slow checkout traces (duration > 2s). The waterfall view makes it obvious: the charge_payment span is taking 2,800ms instead of the usual 150ms. Upstream spans like check_inventory are fine.

  3. Logs explain the root cause

    You copy the trace_id from a slow trace and search your log backend. The payment service logs tell the full story:

    json
    {
      "timestamp": "2024-03-15T14:32:07.341Z",
      "level": "WARN",
      "service": "payment-service",
      "trace_id": "abc123def456",
      "message": "Connection pool exhausted, waiting for available connection",
      "pool_size": 10,
      "active_connections": 10,
      "wait_time_ms": 2650
    }
    {
      "timestamp": "2024-03-15T14:32:09.991Z",
      "level": "ERROR",
      "service": "payment-service",
      "trace_id": "abc123def456",
      "message": "Payment gateway request failed: TLS handshake timeout",
      "gateway_host": "api.payments.example.com",
      "retry_attempt": 2
    }

    Root cause: the payment gateway is experiencing TLS handshake slowdowns, which exhausts the connection pool and cascades into timeouts for all checkout requests.

Without metrics, you wouldn't know there was a problem until customers complained. Without traces, you'd be guessing which service was slow. Without logs, you'd know the payment service was the bottleneck but not that it was a connection pool issue triggered by TLS handshake failures.

The Emerging Fourth Signal: Profiles

Continuous profiling is increasingly recognized as the fourth pillar of observability. While traces show you that a service is slow, profiles show you exactly which function or line of code is consuming CPU, allocating memory, or blocking on I/O. Tools like Pyroscope, Parca, and Grafana's profiling integration let you attach flame graphs directly to trace spans.

OpenTelemetry added profiling as an experimental signal type in 2024, positioning it alongside logs, metrics, and traces. The investigation chain becomes: metrics → traces → profiles → logs — narrowing from "something is wrong" all the way down to a specific hot code path.

Start with metrics and traces

If you're building observability from scratch, prioritize metrics (for alerting) and traces (for debugging) first. Add structured logging with trace context correlation next. Continuous profiling is valuable but addresses a narrower set of problems — add it once the first three pillars are solid.

OpenTelemetry: Origin Story and Project Structure

OpenTelemetry didn't appear from nowhere. It's the product of a hard-won merger between two competing open-source observability projects — OpenTracing and OpenCensus — each of which solved part of the instrumentation puzzle but neither of which could win the ecosystem alone. Understanding that history explains many of OTel's design decisions today.

The Two Predecessors

OpenTracing (2016)

OpenTracing launched in 2016 as a vendor-neutral API specification for distributed tracing. It defined a standard set of interfaces — spans, contexts, and propagation — that library authors could instrument against without coupling their code to a specific tracing backend like Jaeger, Zipkin, or Datadog. The key idea: instrument once, choose your backend later.

OpenTracing was API-only. It didn't ship an SDK, a collector, or any wire protocol. Vendors and open-source projects provided the actual implementations. This made it lightweight but also meant there was no "batteries included" experience — you always needed a third-party library to actually do anything with the traces.

OpenCensus (2018)

OpenCensus came out of Google in 2018 and took a different approach. Rather than being API-only, it provided a complete package: API, SDK, and built-in exporters for both tracing and metrics. You could drop OpenCensus into your Go or Java service and immediately start shipping data to Prometheus, Stackdriver, Jaeger, or Zipkin without stitching together multiple libraries.

The trade-off was tighter coupling. OpenCensus was more opinionated — the exporters lived in-tree, and the project's scope was broader. It also originated from Google's internal Census library, which meant it carried certain design assumptions from that world.

Why They Merged

By 2018, the observability community had a real problem: two credible, CNCF-adjacent projects solving overlapping problems with incompatible APIs. Library maintainers had to choose which standard to instrument for — or worse, instrument for both. Vendors had to support two integration paths. End users were confused about which project to adopt.

The fragmentation created three specific pain points:

  • Library authors couldn't pick a winner. If you maintained a popular HTTP framework, instrumenting for OpenTracing meant OpenCensus users got nothing (and vice versa).
  • Duplicate engineering effort. Both projects were building context propagation, span management, and exporter pipelines — the same fundamental problems solved twice.
  • Vendor fatigue. Backend vendors (Jaeger, Datadog, Lightstep, etc.) had to maintain parallel integrations for two standards that were supposed to reduce fragmentation in the first place.

In May 2019, the two projects announced their merger into OpenTelemetry. The new project entered the CNCF as a Sandbox project, aiming to combine OpenTracing's clean API design with OpenCensus's batteries-included approach. The goal: a single, unified standard for all telemetry signals — traces, metrics, and eventually logs.

Note

OpenTelemetry explicitly provides bridge packages for both OpenTracing and OpenCensus. If you have existing instrumentation using either predecessor, you can migrate incrementally — you don't have to rip and replace everything at once.

Timeline of Major Milestones

| Date | Milestone | Significance |
| --- | --- | --- |
| Nov 2016 | OpenTracing joins CNCF | First vendor-neutral tracing API standard |
| Jan 2018 | OpenCensus launched by Google | Combined tracing + metrics with built-in exporters |
| May 2019 | OpenTelemetry merger announced | Two projects unite under one umbrella |
| May 2019 | CNCF Sandbox admission | Official CNCF home for the merged project |
| Aug 2021 | CNCF Incubation | Recognized project maturity and adoption |
| Feb 2022 | Tracing specification reaches GA | Stable APIs/SDKs for traces across major languages |
| May 2023 | Metrics specification reaches GA | Stable APIs/SDKs for metrics (Go, Java, .NET first) |
| Apr 2024 | Logs specification reaches Stable | All three signals now stable — the "triple play" |
| 2024 | 2nd most active CNCF project | Behind only Kubernetes in contributor activity |

Project Structure

OpenTelemetry is not a single repository or a single binary. It's an ecosystem of coordinated components, all governed by a central specification. Understanding how these pieces fit together is essential before you start instrumenting anything.

graph TD
    SPEC["📜 OTel Specification
(Language-agnostic rules)"]
    SEMCONV["📖 Semantic Conventions
(Standard attribute names)"]
    OTLP["📡 OTLP Protocol
(Wire format for telemetry)"]
    SDK_GO["Go SDK"]
    SDK_JAVA["Java SDK"]
    SDK_PY["Python SDK"]
    SDK_JS["JS/TS SDK"]
    SDK_DOTNET[".NET SDK"]
    SDK_OTHER["Ruby, Rust, C++, …"]
    COLLECTOR["🔧 OTel Collector
(Receive, process, export)"]
    CONTRIB["📦 Contrib Packages
(Community extensions)"]
    SPEC --> SEMCONV
    SPEC --> OTLP
    SPEC --> SDK_GO
    SPEC --> SDK_JAVA
    SPEC --> SDK_PY
    SPEC --> SDK_JS
    SPEC --> SDK_DOTNET
    SPEC --> SDK_OTHER
    SPEC --> COLLECTOR
    OTLP --> COLLECTOR
    OTLP --> SDK_GO
    OTLP --> SDK_JAVA
    SEMCONV --> SDK_GO
    SEMCONV --> SDK_JAVA
    SEMCONV --> SDK_PY
    SEMCONV --> SDK_JS
    COLLECTOR --> CONTRIB
    SDK_GO --> CONTRIB
    SDK_JAVA --> CONTRIB

The Specification

Everything in OpenTelemetry starts with the specification — a language-agnostic document that defines how telemetry data should be created, processed, and exported. It specifies API interfaces (what instrumentation authors call), SDK behavior (how telemetry data is processed and batched), and data model semantics (what a span, metric point, or log record looks like).

Language implementations must conform to the specification. This is what makes OpenTelemetry truly portable: a span created in a Go service and a span created in a Python service share the same structural contract, enabling seamless end-to-end distributed traces.

Language-Specific APIs and SDKs

Each supported language has its own repository containing two layers:

  • API — A minimal, zero-dependency interface that library authors instrument against. The API is safe to call even when no SDK is installed (it becomes a no-op).
  • SDK — The full implementation that application owners install. It handles sampling, batching, resource detection, and export. You configure the SDK at application startup.

The major language implementations (Go, Java, Python, JavaScript/TypeScript, .NET) have reached GA stability for traces and metrics. Others like Ruby, Rust, C++, Swift, and Erlang/Elixir are in varying stages of maturity.

OTLP — OpenTelemetry Protocol

OTLP is OpenTelemetry's native wire protocol for transmitting telemetry data. It supports all three signals (traces, metrics, logs) over both gRPC and HTTP/protobuf transports. OTLP is now widely supported — most observability backends (Grafana, Datadog, Honeycomb, New Relic, Dynatrace, etc.) accept OTLP natively, which means you can often skip vendor-specific exporters entirely.

Semantic Conventions

Semantic conventions define standard names and values for telemetry attributes. For example, an HTTP server span should use http.request.method for the HTTP method, url.path for the path, and server.port for the port number. Without these conventions, every team would invent their own attribute names, making cross-service queries and dashboards a nightmare.

The Collector

The OTel Collector is a standalone binary that acts as a telemetry pipeline. It receives data (from SDKs or other sources), processes it (filtering, batching, enriching, sampling), and exports it to one or more backends. You can deploy it as a sidecar, a DaemonSet, or a standalone gateway. The Collector decouples your application's instrumentation from your backend choice — your app always sends OTLP to the Collector, and the Collector handles the rest.

Contrib Repositories

Each language SDK and the Collector have a corresponding -contrib repository. These contain community-maintained extensions: auto-instrumentation libraries for popular frameworks (Express, Flask, Spring), additional exporters (Prometheus, Zipkin), and Collector processors/receivers contributed by vendors and the community. The contrib repos keep the core lean while enabling a broad ecosystem.

Tip

When evaluating language SDK maturity, check the OTel status page. Each signal (traces, metrics, logs) has its own stability level per language — "stable" in one language doesn't automatically mean stable in all.

Governance and the Path to CNCF Graduation

OpenTelemetry's governance is designed for a project of its scale — over 1,000 contributors across dozens of repositories. Three layers keep things organized:

  • Technical Committee (TC) — A small elected group that owns the specification and resolves cross-cutting technical decisions. They ensure consistency across language implementations.
  • Governance Committee (GC) — Handles non-technical project governance: community health, contributor experience, CNCF relationship, and project-wide policies.
  • Special Interest Groups (SIGs) — Each language implementation, the Collector, and major feature areas (like semantic conventions or logging) have their own SIG with dedicated maintainers and regular meetings. SIGs operate with significant autonomy within the bounds of the spec.

As of 2024, OpenTelemetry is the 2nd most active CNCF project (behind Kubernetes) by contributor count and commit volume. With all three telemetry signals now stable, the project is on the path toward CNCF Graduation — the highest maturity level, signaling production readiness and sustainable governance. The project has completed or is finalizing due diligence for this milestone.

Note

CNCF Graduation requires an independent security audit, a documented governance model, and demonstrated adoption by multiple organizations. OpenTelemetry's adoption — spanning cloud providers, SaaS vendors, and enterprises — positions it well for this milestone.

The API/SDK Separation: OTel's Key Design Decision

OpenTelemetry's most important architectural choice isn't about how it collects data — it's about how it separates concerns. The project splits itself into two distinct layers: a lightweight API for instrumenting code, and a heavier SDK for processing and exporting telemetry. This separation is what makes OTel viable as a universal standard rather than just another vendor library.

The API is a thin, stable interface that defines how you create spans, record metrics, and emit logs. By itself, it does nothing — every call is a no-op. The SDK is the engine that gives those calls meaning: it collects spans, batches metrics, and ships everything to your backend of choice. Understanding why these are separate packages is the key to understanding how OTel works in practice.

The Architecture at a Glance

The layered design ensures that instrumentation code (in libraries and applications) only depends on the API, while the SDK — with its heavier dependencies on exporters, processors, and configuration — is wired up once at the application's entry point.

graph TD
    subgraph Your Code
        APP["Application Code"]
        LIB["Library Code<br/>(e.g., HTTP client, ORM)"]
    end
    subgraph OTel API Layer
        TAPI["Tracer API"]
        MAPI["Meter API"]
        LAPI["Logger API"]
    end
    subgraph OTel SDK Layer
        TP["TracerProvider"]
        MP["MeterProvider"]
        LP["LoggerProvider"]
        SP["SpanProcessors"]
        MR["MetricReaders"]
    end
    subgraph Exporters
        OTLP["OTLP Exporter"]
        JAEG["Jaeger Exporter"]
        PROM["Prometheus Exporter"]
        CONS["Console Exporter"]
    end
    APP -->|"instruments with"| TAPI
    APP -->|"instruments with"| MAPI
    LIB -->|"instruments with"| TAPI
    LIB -->|"instruments with"| MAPI
    LIB -->|"instruments with"| LAPI
    TAPI -->|"delegates to"| TP
    MAPI -->|"delegates to"| MP
    LAPI -->|"delegates to"| LP
    TP --> SP
    MP --> MR
    SP --> OTLP
    SP --> JAEG
    MR --> PROM
    SP --> CONS

Why Separate API from SDK?

The separation solves a real dependency management problem. Imagine you maintain a popular HTTP client library and want to add tracing. If you depend on a full SDK, you've just forced every consumer of your library to pull in OTLP exporters, gRPC dependencies, and configuration machinery — even if they don't want telemetry at all. That's a non-starter for library authors.

With the API/SDK split, your library depends only on the API package — a handful of interfaces with zero transitive dependencies. Application owners then choose whether to install the SDK. If they do, your library's instrumentation lights up. If they don't, every tracing call silently does nothing with near-zero overhead.

| Concern | API Package | SDK Package |
| --- | --- | --- |
| Who uses it | Library authors & app developers | Application owners (at the entry point) |
| Dependencies | Near-zero (interfaces only) | Heavier (exporters, processors, gRPC/HTTP) |
| Stability | Very stable — rarely changes | Evolves more frequently |
| Default behavior | No-op (does nothing) | Configurable pipelines |
| Vendor lock-in | None | None (but you choose exporters here) |

The Provider Bridge: TracerProvider, MeterProvider, LoggerProvider

The API and SDK are connected through providers. Each telemetry signal has one: TracerProvider for traces, MeterProvider for metrics, and LoggerProvider for logs. The API defines provider interfaces; the SDK supplies concrete implementations.

When your code calls tracer.start_span("process-order"), the API looks up the globally registered TracerProvider. If an SDK has been configured, that provider creates a real span with timing data, attributes, and context propagation. If no SDK is installed, the global provider is a no-op implementation that returns a dummy span — no allocations, no side effects, no overhead.

The No-Op Default Is the Whole Point

The no-op default isn't a fallback — it's a deliberate design goal. It means any library in the ecosystem can add OTel instrumentation without imposing runtime cost on users who haven't opted into observability. The API package typically adds <100KB to your dependency tree and the no-op code paths are optimized to be effectively free.

In Practice: Library Code vs. Application Code

The split creates two distinct roles in any codebase. Library code instruments with the API only — it creates spans and records metrics but never configures where that data goes. Application code (your main.py or index.ts) sets up the SDK: it chooses providers, attaches exporters, and defines resource attributes that identify the service.

Library Code — API Only

A library author depends solely on the API package. The code creates spans and records metrics without any knowledge of how (or whether) they'll be exported.

python
# my_http_library/client.py
# Depends ONLY on: opentelemetry-api
from opentelemetry import trace

# Get a tracer scoped to this library's name and version
tracer = trace.get_tracer("my-http-library", "1.2.0")

def fetch(url: str, method: str = "GET") -> Response:
    with tracer.start_as_current_span(
        f"{method} {url}",
        kind=trace.SpanKind.CLIENT,
        attributes={"http.request.method": method, "url.full": url},
    ) as span:
        response = _do_request(url, method)
        span.set_attribute("http.response.status_code", response.status_code)
        if response.status_code >= 400:
            span.set_status(trace.Status(trace.StatusCode.ERROR))
        return response
typescript
// my-http-library/src/client.ts
// Depends ONLY on: @opentelemetry/api
import { trace, SpanKind, SpanStatusCode } from "@opentelemetry/api";

// Get a tracer scoped to this library's name and version
const tracer = trace.getTracer("my-http-library", "1.2.0");

export async function fetch(url: string, method = "GET"): Promise<Response> {
  return tracer.startActiveSpan(
    `${method} ${url}`,
    { kind: SpanKind.CLIENT, attributes: { "http.request.method": method, "url.full": url } },
    async (span) => {
      const response = await doRequest(url, method);
      span.setAttribute("http.response.status_code", response.status);
      if (response.status >= 400) {
        span.setStatus({ code: SpanStatusCode.ERROR });
      }
      span.end();
      return response;
    }
  );
}

Notice that neither example imports anything from an SDK package. There's no mention of exporters, processors, or backends. If a user installs this library without an OTel SDK, tracer.start_as_current_span() returns a no-op span and the with block (or callback) runs with negligible overhead.

Application Code — SDK Setup

The application owner installs the SDK and configures the full pipeline. This is typically done once, at startup, before any instrumented code runs. Here you choose your exporters, define resource attributes (like service name), and register providers globally.

python
# app/main.py
# Depends on: opentelemetry-sdk, opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# 1. Define resource attributes that identify this service
resource = Resource.create({
    "service.name": "order-service",
    "service.version": "2.4.1",
    "deployment.environment": "production",
})

# 2. Create a TracerProvider with the resource
provider = TracerProvider(resource=resource)

# 3. Attach a BatchSpanProcessor with an OTLP exporter
otlp_exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))

# 4. Register the provider globally — this "activates" all API instrumentation
trace.set_tracer_provider(provider)

# Now, any library using the OTel API (like my-http-library) will
# produce real spans that get exported to your collector.
typescript
// app/tracing.ts — import this before anything else
// Depends on: @opentelemetry/sdk-trace-node, @opentelemetry/exporter-trace-otlp-grpc
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { Resource } from "@opentelemetry/resources";
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from "@opentelemetry/semantic-conventions";
import { trace } from "@opentelemetry/api";

// 1. Define resource attributes that identify this service
const resource = new Resource({
  [ATTR_SERVICE_NAME]: "order-service",
  [ATTR_SERVICE_VERSION]: "2.4.1",
  "deployment.environment": "production",
});

// 2. Create a TracerProvider with the resource
const provider = new NodeTracerProvider({ resource });

// 3. Attach a BatchSpanProcessor with an OTLP exporter
const exporter = new OTLPTraceExporter({ url: "http://otel-collector:4317" });
provider.addSpanProcessor(new BatchSpanProcessor(exporter));

// 4. Register the provider globally — this "activates" all API instrumentation
provider.register();

// Now, any library using @opentelemetry/api will produce real spans
// that get batched and exported to your collector via gRPC.
Run SDK Setup First

Always initialize the SDK before importing any instrumented code. In Node.js, use the --require or --import flag to load your tracing setup file first: node --require ./tracing.js app.js. In Python, configure at the top of your entry point before other imports. If the SDK registers after instrumented code runs, those early spans are lost to the no-op provider.

How the Pieces Connect at Runtime

The global registration step (trace.set_tracer_provider(provider) in Python, provider.register() in TypeScript) is where the magic happens. Before that call, the API's global state holds a no-op provider. After it, every get_tracer() call returns a real tracer backed by your configured BatchSpanProcessor and exporter.

This means library code doesn't need to be "aware" of the SDK at all — not at compile time, not at import time. The API uses a service-locator pattern: it looks up the current global provider at the moment you request a tracer. Libraries written months before your application was configured will still produce telemetry, as long as the SDK is initialized before their code paths execute.
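The lookup mechanics can be sketched in a few lines of plain Python. These classes are illustrative stand-ins, not the real opentelemetry-api implementations, but the flow — no-op by default, globally swapped once at startup — is the same:

```python
# Minimal sketch of the service-locator pattern: a single global provider
# slot that defaults to a no-op implementation until an "SDK" registers.

class NoOpSpan:
    def set_attribute(self, key, value):
        pass  # deliberately does nothing

class NoOpTracer:
    def start_span(self, name):
        return NoOpSpan()  # near-zero cost when no SDK is installed

class RecordingTracer:
    """Stands in for an SDK tracer that actually processes spans."""
    def __init__(self):
        self.spans = []
    def start_span(self, name):
        self.spans.append(name)
        return NoOpSpan()

class TracerProvider:
    """Stands in for the SDK's concrete provider."""
    def __init__(self):
        self.tracer = RecordingTracer()
    def get_tracer(self, name):
        return self.tracer

_global_provider = None  # the API's global slot

def get_tracer(name):
    # Resolved at call time, not import time: instrumentation written long
    # before SDK setup still lights up once a provider is registered.
    if _global_provider is None:
        return NoOpTracer()
    return _global_provider.get_tracer(name)

def set_tracer_provider(provider):
    global _global_provider
    _global_provider = provider

get_tracer("my-http-library").start_span("early-call")  # no-op: lost

sdk = TracerProvider()
set_tracer_provider(sdk)  # the "activation" step
get_tracer("my-http-library").start_span("process-order")  # recorded

print(sdk.tracer.spans)  # ['process-order'] — the early span never existed
```

This is also why initialization order matters: the "early-call" span above vanishes, exactly like spans created before the real SDK registers.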

This Pattern Applies to All Three Signals

The API/SDK split works identically for metrics (MeterProvider) and logs (LoggerProvider). Each signal has its own provider, processor, and exporter chain. You can enable just traces, just metrics, or all three — the pattern is the same. Libraries that record metrics via the API will see the same no-op-by-default behavior until you register a MeterProvider with the SDK.

OTLP: The OpenTelemetry Protocol and Unified Data Model

OTLP (OpenTelemetry Protocol) is the native wire protocol for transmitting traces, metrics, and logs from instrumented applications to backends. Unlike vendor-specific formats that lock you into a particular ecosystem, OTLP provides a single, well-defined protocol that every OpenTelemetry SDK speaks natively — no translation layer required.

Understanding OTLP matters because it dictates how your telemetry data is structured, serialized, and delivered. The protocol defines both the transport mechanism (how bytes move over the network) and the data model (what those bytes represent). Let's break down both.

Transport Options

OTLP supports three transport variants, each suited to different deployment scenarios. The choice of transport affects performance, compatibility, and debuggability.

| Transport | Content Type | Best For | Trade-offs |
| --- | --- | --- | --- |
| gRPC | application/grpc | Production workloads, high throughput | Bidirectional streaming, HTTP/2 multiplexing; may be blocked by some proxies/firewalls |
| HTTP/protobuf | application/x-protobuf | Firewall-restricted environments | Same binary efficiency as gRPC; works through HTTP/1.1 proxies and load balancers |
| HTTP/JSON | application/json | Debugging, manual inspection | Human-readable; 5–10× larger payload than protobuf; not recommended for production |

gRPC is the default and recommended transport. It leverages HTTP/2 for multiplexed streams — a single TCP connection can carry traces, metrics, and logs concurrently without head-of-line blocking. The bidirectional streaming capability also enables efficient flow control between the client and the collector.

When gRPC isn't an option (corporate proxies, load balancers that don't pass gRPC traffic through cleanly, browser environments), HTTP/protobuf gives you the same binary encoding efficiency over plain HTTP/1.1. HTTP/JSON is the last resort — useful when you need to curl an endpoint or inspect payloads during development.

Endpoint Paths

gRPC uses service definitions (e.g., opentelemetry.proto.collector.trace.v1.TraceService/Export). HTTP transports use fixed paths: /v1/traces, /v1/metrics, and /v1/logs. If your collector isn't receiving data, check that you're hitting the right port and path.

The OTLP Data Model: Three Nesting Layers

Every OTLP payload — whether it carries traces, metrics, or logs — follows the same three-level nesting pattern. This isn't arbitrary; it's designed to avoid redundant data on the wire. Instead of stamping every single span with the service name and SDK version, that metadata is declared once at the top level and inherited by everything below it.

Resource

A Resource identifies the entity producing the telemetry: a service, a process, a container. It carries key-value attributes like service.name, service.version, host.name, and cloud.region. Every span, metric, and log record within a given export request shares the same Resource, so it's sent exactly once per batch.

InstrumentationScope

The InstrumentationScope identifies the library or module that generated the telemetry. If your application uses an HTTP client library that auto-instruments requests and a database library that instruments queries, each gets its own scope with a name and version. This lets backends filter and attribute telemetry to specific instrumentation libraries.

Signal-Specific Data

Below the scope layer, you find the actual telemetry records — Spans for traces, Metrics (with their data points) for metrics, and LogRecords for logs. Each signal type has its own protobuf message structure, but they all slot into the same Resource → Scope → Data hierarchy.
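Here's a pure-Python sketch of that hierarchy — plain dataclasses standing in for the generated protobuf classes — showing how three spans from two instrumentation libraries travel with exactly one copy of the service metadata:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str

@dataclass
class ScopeSpans:
    scope_name: str        # instrumentation library name
    scope_version: str
    spans: list = field(default_factory=list)

@dataclass
class ResourceSpans:
    resource: dict         # identifies the producing service — sent once per batch
    scope_spans: list = field(default_factory=list)

batch = ResourceSpans(
    resource={"service.name": "order-service", "service.version": "2.4.1"},
    scope_spans=[
        ScopeSpans("my-http-library", "1.2.0", spans=[
            Span("GET /users/42", "abc123", "span01"),
            Span("GET /users/43", "abc123", "span02"),
        ]),
        ScopeSpans("my-db-library", "0.9.0", spans=[
            Span("SELECT users", "abc123", "span03"),
        ]),
    ],
)

# Three spans, two scopes, one copy of the Resource attributes.
total_spans = sum(len(ss.spans) for ss in batch.scope_spans)
```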

Traces: The Protobuf Structure

A trace export request wraps spans in ResourceSpansScopeSpansSpan. Here's the simplified protobuf schema showing the key fields on a Span:

protobuf
message ExportTraceServiceRequest {
  repeated ResourceSpans resource_spans = 1;
}

message ResourceSpans {
  Resource resource = 1;              // service.name, host, etc.
  repeated ScopeSpans scope_spans = 2;
}

message ScopeSpans {
  InstrumentationScope scope = 1;     // library name + version
  repeated Span spans = 2;
}

message Span {
  bytes  trace_id       = 1;   // 16-byte globally unique trace ID
  bytes  span_id        = 2;   // 8-byte unique span ID
  bytes  parent_span_id = 4;   // empty for root spans
  string name           = 5;   // e.g., "GET /api/users"
  SpanKind kind         = 6;   // CLIENT, SERVER, PRODUCER, CONSUMER, INTERNAL

  fixed64 start_time_unix_nano = 7;
  fixed64 end_time_unix_nano   = 8;

  repeated KeyValue             attributes = 9;
  repeated Event                events     = 11;  // timestamped annotations
  repeated Link                 links      = 13;  // cross-trace references
  Status                        status     = 15;  // OK, ERROR, UNSET
}

The trace_id ties all spans in a distributed transaction together. Each span has its own span_id, and the parent_span_id establishes the causal tree. Events are timestamped annotations attached to a span (e.g., an exception event with a stack trace), while Links connect spans across trace boundaries (useful for batch processing where one span triggers many traces).

Metrics: Data Points and Temporality

Metrics follow the same nesting: ResourceMetricsScopeMetricsMetric. But a Metric isn't a single value — it wraps typed data points that carry the actual measurements.

protobuf
message Metric {
  string name        = 1;   // e.g., "http.server.request.duration"
  string description = 2;
  string unit        = 3;   // e.g., "ms", "By", "{request}"

  oneof data {
    Gauge            gauge             = 5;
    Sum              sum               = 7;   // counter or up-down counter
    Histogram        histogram         = 9;
    ExponentialHistogram exp_histogram = 10;
    Summary          summary           = 11;  // legacy, avoid
  }
}

message NumberDataPoint {
  repeated KeyValue attributes       = 7;
  fixed64           time_unix_nano   = 3;
  oneof value {
    double as_double = 4;
    sfixed64 as_int  = 6;
  }
}

The oneof data field means each Metric message carries exactly one type of measurement. A Sum with is_monotonic = true represents a counter; a Histogram carries bucket boundaries and counts. Each data point includes its own set of attributes (dimensions/labels) and a timestamp.
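How an explicit-bounds Histogram data point aggregates raw measurements can be sketched in pure Python (illustrative, not SDK code). Per the OTLP convention, N boundaries define N+1 buckets, each upper-inclusive:

```python
from bisect import bisect_left

def histogram_counts(values, boundaries):
    """Count values into explicit-bounds histogram buckets.

    With N sorted boundaries there are N+1 buckets:
    (-inf, b0], (b0, b1], ..., (bN-1, +inf) — bucket i counts values
    in (boundaries[i-1], boundaries[i]], matching the OTLP layout.
    """
    counts = [0] * (len(boundaries) + 1)
    for v in values:
        counts[bisect_left(boundaries, v)] += 1
    return counts

# Request latencies in ms, bucketed at 5/10/25/50/100 ms boundaries.
latencies_ms = [3, 7, 7, 12, 48, 103]
buckets = histogram_counts(latencies_ms, [5, 10, 25, 50, 100])
# buckets == [1, 2, 1, 1, 0, 1]; the data point also carries
# count=len(values) and sum=sum(values) alongside the bucket counts.
```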

Logs: Bridging Structured and Unstructured

Logs complete the three-signal data model with ResourceLogsScopeLogsLogRecord. The LogRecord is designed to accommodate both structured observability logs and legacy unstructured text.

protobuf
message LogRecord {
  fixed64           time_unix_nano          = 1;
  SeverityNumber    severity_number         = 2;  // 1–24 (TRACE to FATAL)
  string            severity_text           = 3;  // "INFO", "ERROR", etc.
  AnyValue          body                    = 5;  // the log message itself
  repeated KeyValue attributes              = 6;
  bytes             trace_id                = 9;  // correlation with traces
  bytes             span_id                 = 10; // correlation with spans
}

The trace_id and span_id fields are what make OTel logs fundamentally more useful than traditional log aggregation. When a log record carries these IDs, your backend can jump directly from a log line to the exact trace and span that produced it — no regex parsing of correlation IDs required.

Wire Efficiency: Compression and Batching

OTLP is designed for efficiency on the wire. The three-layer nesting deduplicates Resource and Scope metadata across potentially thousands of spans or data points in a single request. Beyond structural efficiency, OTLP supports gzip compression on all transports via the Content-Encoding: gzip header (or gRPC's built-in compression). In practice, gzip reduces payload size by 70–90% for typical telemetry data.
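You can get a feel for those compression numbers with the standard library alone. This uses a JSON stand-in for a batch of spans, not real OTLP bytes, but the repetitive structure is representative:

```python
import gzip
import json

# A batch of 500 look-alike spans: repeated keys and shared IDs are
# exactly what makes telemetry payloads compress so well.
spans = [
    {
        "name": f"GET /users/{i}",
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "attributes": {"http.request.method": "GET", "http.status_code": 200},
    }
    for i in range(500)
]
raw = json.dumps({"resource_spans": spans}).encode("utf-8")
compressed = gzip.compress(raw)

print(f"raw: {len(raw)} B, gzipped: {len(compressed)} B "
      f"({len(compressed) / len(raw):.0%} of original)")
```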

On the SDK side, the OTLP exporter batches telemetry before sending. The default batch size is 512 spans (configurable via OTEL_BSP_MAX_EXPORT_BATCH_SIZE), and the exporter flushes every 5 seconds or when the batch is full — whichever comes first.

Retries and Partial Success

OTLP defines explicit retry semantics. When the collector returns a retryable status code (HTTP 429, 502, 503, 504, or gRPC UNAVAILABLE), the exporter retries with exponential backoff. The response may include a Retry-After header or a retry_info in the gRPC status details that the exporter should respect.
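That retry loop can be sketched in plain Python. The send callable and delay constants here are illustrative — real exporters implement this internally and also honor Retry-After hints:

```python
import random

RETRYABLE = {429, 502, 503, 504}  # HTTP codes; gRPC treats UNAVAILABLE similarly

def export_with_retry(send, max_attempts=5, base_delay=1.0):
    """Retry a batch export with exponential backoff and jitter.

    `send` returns a status code. Returns the outcome plus the backoff
    delays that were computed (a real exporter would sleep for each one).
    """
    delays = []
    for attempt in range(max_attempts):
        status = send()
        if status == 200:
            return "ok", delays
        if status not in RETRYABLE:
            return "dropped", delays  # non-retryable: fail fast
        # Double the delay each attempt; jitter avoids thundering herds.
        delays.append(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
    return "dropped", delays

# Collector overloaded twice (503), then accepts the batch.
responses = iter([503, 503, 200])
result, delays = export_with_retry(lambda: next(responses))
```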

protobuf
// The collector can accept some items and reject others
message ExportTracePartialSuccess {
  int64  rejected_spans = 1;   // number of spans rejected
  string error_message  = 2;   // human-readable reason
}

Partial success handling sets OTLP apart from most telemetry protocols. A collector can accept 900 out of 1000 spans and tell the exporter exactly how many were rejected and why. This prevents the all-or-nothing failure mode where a single bad span causes an entire batch to be dropped. The exporter can then decide whether to retry just the rejected items or log the loss.

Debugging OTLP Traffic

To inspect what your application actually sends, temporarily switch the exporter to HTTP/JSON and point it at a local listener: OTEL_EXPORTER_OTLP_PROTOCOL=http/json OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318. Then run nc -l 4318 or use a tool like otel-desktop-viewer to see the raw JSON payload.

Configuring the OTLP Exporter

All OpenTelemetry SDKs support a standard set of environment variables to configure OTLP export without code changes. Here are the most important ones:

bash
# Protocol: grpc (default), http/protobuf, or http/json
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc

# Endpoint — gRPC uses port 4317, HTTP uses port 4318
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317

# Enable gzip compression (highly recommended)
export OTEL_EXPORTER_OTLP_COMPRESSION=gzip

# Timeout for each export request (default: 10s)
export OTEL_EXPORTER_OTLP_TIMEOUT=10000

# Custom headers (e.g., for authentication)
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer tok123,X-Tenant=team-a"

You can also set per-signal overrides. For example, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT overrides the endpoint for traces only, while OTEL_EXPORTER_OTLP_METRICS_PROTOCOL lets you use HTTP/protobuf for metrics even if traces use gRPC. This flexibility is useful when traces go to one backend and metrics to another.

Distributed Tracing: Spans, Traces, and Trace Context

A trace represents the complete journey of a single request as it moves through your distributed system. Structurally, a trace is a directed acyclic graph (DAG) of spans, where each span captures one unit of work — an HTTP handler, a database query, a message publish. The edges in this graph encode causal relationships: span B was triggered by span A.

Every span in the same trace shares a common trace_id. Parent-child relationships between spans are established via parent_span_id, which lets backend systems reconstruct the full call graph and render it as a waterfall timeline.

sequenceDiagram
    participant C as Client
    participant GW as API Gateway
    participant US as User Service
    participant DB as Database

    Note over C: trace_id: abc123 generated
    C->>GW: GET /users/42<br/>traceparent: 00-abc123-span01-01
    activate GW
    Note over GW: Creates span02<br/>parent: span01
    GW->>US: GET /internal/users/42<br/>traceparent: 00-abc123-span02-01
    activate US
    Note over US: Creates span03<br/>parent: span02
    US->>DB: SELECT * FROM users WHERE id=42<br/>traceparent: 00-abc123-span03-01
    activate DB
    Note over DB: Creates span04<br/>parent: span03
    DB-->>US: Row result
    deactivate DB
    US-->>GW: 200 OK { user data }
    deactivate US
    GW-->>C: 200 OK { user data }
    deactivate GW
    Note over C,DB: All 4 spans share trace_id abc123

In this flow, the client generates a trace_id and the first span_id. Each downstream service receives the trace context via the traceparent HTTP header, creates its own span, and sets the incoming span_id as its parent_span_id. The result is a four-span trace that captures the full request lifecycle.
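Reconstructing that call graph from flat span records is mechanical — a sketch in plain Python using the four spans from the flow above:

```python
# Four spans of one trace, flattened the way a backend stores them.
spans = [
    {"span_id": "span01", "parent_span_id": None,     "name": "GET /users/42 (client)"},
    {"span_id": "span02", "parent_span_id": "span01", "name": "GET /users/42 (gateway)"},
    {"span_id": "span03", "parent_span_id": "span02", "name": "GET /internal/users/42"},
    {"span_id": "span04", "parent_span_id": "span03", "name": "SELECT users"},
]

def waterfall(spans):
    """Render the parent/child tree as indented waterfall lines."""
    by_parent = {}
    for s in spans:
        by_parent.setdefault(s["parent_span_id"], []).append(s)
    lines = []
    def walk(parent_id, depth):
        for s in by_parent.get(parent_id, []):
            lines.append("  " * depth + s["name"])
            walk(s["span_id"], depth + 1)
    walk(None, 0)  # root spans have no parent_span_id
    return lines

for line in waterfall(spans):
    print(line)
```

This depth-first walk is essentially what a tracing UI does before laying the spans out on a timeline.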

Anatomy of a Span

A span is the fundamental building block of a trace. Each span captures a discrete operation with precise timing, identity, and contextual metadata. Here are the core fields every span carries:

| Field | Size / Type | Description |
| --- | --- | --- |
| trace_id | 128-bit (32 hex chars) | Globally unique identifier shared by all spans in the same trace |
| span_id | 64-bit (16 hex chars) | Unique identifier for this specific span |
| parent_span_id | 64-bit (16 hex chars) | The span_id of the parent; empty for root spans |
| name | string | Operation name (e.g., GET /users/{id}, SELECT users) |
| kind | enum | Role of the span: CLIENT, SERVER, PRODUCER, CONSUMER, or INTERNAL |
| start_time | nanosecond timestamp | When the operation began |
| end_time | nanosecond timestamp | When the operation completed |
| attributes | key-value pairs | Structured metadata (e.g., http.method, db.system) |
| events | timestamped list | Annotations at specific moments within the span's lifetime |
| links | list of span contexts | References to spans in other traces (e.g., batch triggers) |
| status | enum | OK, ERROR, or UNSET |

Here's what a span looks like as structured data when exported to a tracing backend:

json
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "parent_span_id": "b3e4a2f8cd91d5a0",
  "name": "GET /users/{id}",
  "kind": "SERVER",
  "start_time": "2024-11-15T09:30:00.000000000Z",
  "end_time": "2024-11-15T09:30:00.047000000Z",
  "status": { "code": "OK" },
  "attributes": {
    "http.method": "GET",
    "http.url": "https://api.example.com/users/42",
    "http.status_code": 200,
    "http.route": "/users/{id}"
  }
}

Span Kind

The kind field describes the role a span plays in the overall trace topology. Getting this right matters — tracing backends use span kind to correctly pair client and server spans, calculate service-to-service latency, and build dependency graphs.

| Span Kind | Description | Example |
| --- | --- | --- |
| CLIENT | The span initiates an outbound request to a remote service | An HTTP client call, a gRPC stub invocation |
| SERVER | The span handles an inbound request from a remote client | An HTTP handler, a gRPC service method |
| PRODUCER | The span creates a message for later async processing | Publishing to Kafka, enqueuing to RabbitMQ |
| CONSUMER | The span processes a message produced by a PRODUCER | Kafka consumer handler, SQS message processor |
| INTERNAL | An internal operation that doesn't cross a process boundary | Business logic computation, in-memory cache lookup |
CLIENT + SERVER spans form pairs

When Service A calls Service B, you get two spans for that single hop: a CLIENT span in Service A and a SERVER span in Service B. Both share the same trace_id, and the SERVER span's parent_span_id points to the CLIENT span. The time difference between them reveals network latency.
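With both spans in hand, the latency arithmetic is simple. The timestamps below are illustrative unix-nano values, and in practice the result is an approximation subject to clock skew between hosts:

```python
# A CLIENT span brackets its paired SERVER span in time; the gaps at
# each end approximate time spent on the network (plus queuing).
client = {"kind": "CLIENT", "start": 1_000_000_000, "end": 1_060_000_000}
server = {"kind": "SERVER", "start": 1_012_000_000, "end": 1_050_000_000}

request_latency_ns = server["start"] - client["start"]   # outbound hop
response_latency_ns = client["end"] - server["end"]      # return hop
network_overhead_ms = (request_latency_ns + response_latency_ns) / 1e6

print(f"time spent on the wire: {network_overhead_ms:.1f} ms")  # 22.0 ms
```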

Events, Links, and Status

Events are timestamped annotations attached to a span. They mark specific moments within the span's lifetime — things like "connection acquired from pool" or "retry attempted." The most common event type is the exception event, which OTel libraries record automatically when an error is caught.

json
{
  "name": "exception",
  "timestamp": "2024-11-15T09:30:00.035000000Z",
  "attributes": {
    "exception.type": "ConnectionTimeoutError",
    "exception.message": "Connection to db-primary:5432 timed out after 5000ms",
    "exception.stacktrace": "at DbPool.acquire (pool.js:142)\n  at UserRepo.findById ..."
  }
}

Links connect spans across separate traces. Unlike the parent-child relationship, a link is a lateral reference — it says "this span is related to that other span, but wasn't caused by it." This is essential for batch processing: a single consumer span that processes 50 messages can link back to each of the 50 producer spans, even though they belong to different traces.

Status has three possible values. UNSET is the default and means the span completed without the instrumentation explicitly setting a status. OK means the operation was explicitly validated as successful. ERROR means the operation failed. Server frameworks typically set ERROR for 5xx responses and leave status UNSET for 4xx, since a 404 is not a server error.

W3C Trace Context

For traces to survive across service boundaries, there must be a standardized way to encode and transmit trace identity. The W3C Trace Context specification defines two HTTP headers that solve this: traceparent and tracestate.

The traceparent Header

This header carries the essential trace identity in a fixed format with four fields separated by hyphens:

text
traceparent: {version}-{trace_id}-{parent_id}-{trace_flags}

Example:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
              ^^                                                    ^^
           version=00                                        flags: 01 = sampled

| Field | Length | Description |
|---|---|---|
| version | 2 hex chars | Format version; currently always 00 |
| trace_id | 32 hex chars (128-bit) | The trace this span belongs to |
| parent_id | 16 hex chars (64-bit) | The span_id of the calling span |
| trace_flags | 2 hex chars | Bit field; 01 = sampled, 00 = not sampled |
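To make the field layout concrete, here is a minimal parser for the header. This is an illustrative sketch only; in real services the SDK's trace context propagator does this for you:

```python
def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four hyphen-separated fields."""
    version, trace_id, parent_id, flags = header.split("-")
    if len(version) != 2 or len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError(f"malformed traceparent: {header!r}")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        # Bit 0 of trace_flags is the sampled flag
        "sampled": bool(int(flags, 16) & 0x01),
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"])   # 4bf92f3577b34da6a3ce929d0e0e4736
print(ctx["sampled"])    # True
```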

The tracestate Header

The tracestate header carries vendor-specific trace data as a list of key-value pairs. Each tracing vendor can store its own context without clobbering others. This allows systems using Datadog, Dynatrace, and OpenTelemetry to coexist on the same request path.

text
tracestate: dd=s:1;t.dm:-4,ot=th:256

# Format: vendor1=value1,vendor2=value2
# Each vendor manages its own key; unknown keys are forwarded unchanged
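Parsing the list side of this is equally simple. A sketch (illustrative; each vendor's value is opaque and may itself contain `:` and `;`, so only the first `=` per entry is significant):

```python
def parse_tracestate(header: str) -> dict:
    """Split a tracestate header into per-vendor entries."""
    entries = {}
    for item in header.split(","):
        # Only the first '=' separates key from value; the value is opaque
        key, _, value = item.strip().partition("=")
        entries[key] = value
    return entries

state = parse_tracestate("dd=s:1;t.dm:-4,ot=th:256")
print(state)  # {'dd': 's:1;t.dm:-4', 'ot': 'th:256'}
```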

Context Propagation in Practice

Trace context must propagate across every boundary in your system — not just HTTP. The mechanism changes depending on the transport, but the principle is always the same: inject context on the sending side, extract it on the receiving side.

HTTP — Headers

HTTP propagation is the most straightforward case. OTel SDKs automatically inject traceparent and tracestate into outgoing request headers and extract them from incoming requests. Here's what manual propagation looks like in Python:

python
from opentelemetry import context, trace
from opentelemetry.propagate import inject, extract

# --- Receiving side (server/consumer) ---
# Extract trace context from incoming HTTP headers
ctx = extract(carrier=request.headers)

with trace.get_tracer(__name__).start_as_current_span(
    "handle_request", context=ctx, kind=trace.SpanKind.SERVER
):
    # --- Sending side (client/producer) ---
    # Inject current trace context into outgoing headers
    headers = {}
    inject(carrier=headers)
    response = httpx.get("http://user-service/users/42", headers=headers)

Message Queues — Message Attributes

For asynchronous messaging (Kafka, RabbitMQ, SQS), trace context travels as message attributes or headers rather than HTTP headers. The producer injects context when publishing; the consumer extracts it when processing. This creates a PRODUCER→CONSUMER span relationship that can span minutes or hours.

python
# Producer: inject context into Kafka message headers
with tracer.start_as_current_span("publish_order", kind=trace.SpanKind.PRODUCER):
    kafka_headers = {}
    inject(carrier=kafka_headers)
    producer.send("orders", value=order_data, headers=kafka_headers)

# Consumer: extract context from Kafka message headers
def handle_message(message):
    ctx = extract(carrier=dict(message.headers))
    with tracer.start_as_current_span(
        "process_order", context=ctx, kind=trace.SpanKind.CONSUMER
    ):
        process(message.value)

Async Boundaries — In-Process Propagation

Within a single process, trace context is stored in a context object tied to the current execution (thread-local in Java, contextvars in Python, AsyncLocalStorage in Node.js). When you spawn threads, goroutines, or async tasks, you must explicitly pass the context — otherwise the new execution unit starts with an empty trace context and creates a disconnected trace.

javascript
const { context, trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-service');

tracer.startActiveSpan('parent-operation', (parentSpan) => {
  // Capture the current context before entering async boundary
  const currentCtx = context.active();

  setTimeout(() => {
    // Explicitly restore context inside the async callback
    context.with(currentCtx, () => {
      tracer.startActiveSpan('delayed-child', (childSpan) => {
        // This child correctly appears under parent-operation
        childSpan.end();
      });
    });
  }, 1000);

  parentSpan.end();
});
Lost context = broken traces

The number one cause of disconnected traces in production is lost context propagation. If you use thread pools, goroutine dispatchers, or Promise.all() patterns, verify that trace context is being carried across those boundaries. Auto-instrumentation handles most HTTP cases, but custom async patterns almost always need manual context passing.
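In Python, the failure mode is easy to demonstrate with the standard library alone: contextvars values do not cross into worker threads unless you capture and restore the context explicitly. This is a sketch of what the OTel context mechanism does under the hood; the variable and helper names are illustrative:

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stands in for the SDK's trace context storage
current_trace = contextvars.ContextVar("current_trace", default=None)

def with_current_context(fn):
    """Capture the caller's context and run fn inside it."""
    ctx = contextvars.copy_context()
    return lambda: ctx.run(fn)

def worker():
    return current_trace.get()  # reads the "trace context"

current_trace.set("trace-abc123")

with ThreadPoolExecutor(max_workers=1) as pool:
    lost = pool.submit(worker).result()                           # not carried
    carried = pool.submit(with_current_context(worker)).result()  # carried

print(lost)     # None  -> a disconnected trace
print(carried)  # trace-abc123
```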

Putting It All Together: Trace Waterfall View

When a tracing backend (Jaeger, Tempo, Honeycomb) receives all spans for a trace, it reconstructs the DAG and renders it as a waterfall — a timeline where each span is a horizontal bar, indented under its parent. Here's how the four-span trace from the sequence diagram above would appear:

text
Trace: 4bf92f3577b34da6a3ce929d0e0e4736    Duration: 47ms

Service          Operation              Timeline (0ms ────────────────── 47ms)
─────────────────────────────────────────────────────────────────────────────
api-gateway      GET /users/42          ██████████████████████████████████████  0–47ms
  user-service   GET /internal/users/42    ██████████████████████████████████  3–45ms
    user-service SELECT users                  ██████████████████████████      8–40ms
      postgres   db.query                          ████████████████████       12–38ms

Reading a waterfall, you can immediately see that the database query (26ms) dominates the total latency. The 3ms gap between the gateway span and the user-service span is network latency. Each indent level represents a parent-child relationship — the same structure encoded by parent_span_id in the raw span data.

Name your spans well

Use low-cardinality operation names like GET /users/{id} rather than GET /users/42. High-cardinality names (with unique IDs baked in) make it impossible to aggregate spans into meaningful groups. Put the specific values in span attributes instead: http.route = "/users/{id}" and user.id = 42.

Metrics Instruments, Aggregation, and Temporality

OpenTelemetry defines a set of metric instruments — typed objects you use to record measurements. Each instrument type carries specific semantics about what kind of data it records and how that data should be aggregated. Choosing the right instrument is not a stylistic preference; it determines what your backend can compute and display.

There are two categories: synchronous instruments (you call them inline in your code) and asynchronous/observable instruments (you register a callback that OTel invokes at collection time). Let's walk through each one.

Counter

A Counter records monotonically increasing values — things that only go up. Total HTTP requests served, bytes sent, errors encountered. You call add() with a non-negative value, and the SDK accumulates the sum.

python
from opentelemetry import metrics

meter = metrics.get_meter("my.service", "1.0.0")

request_counter = meter.create_counter(
    name="http.server.request.count",
    description="Total number of HTTP requests received",
    unit="1",
)

# In your request handler:
request_counter.add(1, {"http.method": "GET", "http.route": "/users"})

The aggregated value only goes up. If you need to track something that can decrease — like active connections or queue depth — use an UpDownCounter instead.

UpDownCounter

An UpDownCounter tracks a value that can both increase and decrease. Think active connections, items in a queue, or allocated memory pools. You call add() with positive or negative deltas.

python
active_connections = meter.create_up_down_counter(
    name="http.server.active_connections",
    description="Number of currently active HTTP connections",
    unit="1",
)

# Connection opened
active_connections.add(1, {"server.address": "0.0.0.0"})

# Connection closed
active_connections.add(-1, {"server.address": "0.0.0.0"})

Histogram

A Histogram records the distribution of values. This is the instrument you reach for when you care about percentiles, not just averages — request latency, response payload sizes, query durations. The SDK sorts each recorded value into configured buckets and also tracks the sum, count, min, and max.

python
import time

request_duration = meter.create_histogram(
    name="http.server.request.duration",
    description="Duration of HTTP requests in seconds",
    unit="s",
)

start = time.perf_counter()
# ... handle request ...
elapsed = time.perf_counter() - start

request_duration.record(elapsed, {
    "http.method": "POST",
    "http.route": "/orders",
    "http.status_code": 201,
})

The default bucket boundaries ([0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000]) are often too coarse for sub-second latencies. You'll learn how to customize them with Views later in this section.
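Bucket assignment itself is straightforward: a value lands in the first bucket whose upper boundary is at least the value, and anything beyond the last boundary goes into an overflow bucket. A pure-Python sketch with sub-second boundaries (illustrative, not SDK internals):

```python
import bisect

# Upper boundaries in seconds, tuned for sub-second latencies
boundaries = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]

def bucket_index(value: float) -> int:
    """Index of the bucket a value falls into; len(boundaries) is overflow."""
    return bisect.bisect_left(boundaries, value)

counts = [0] * (len(boundaries) + 1)
for latency in [0.003, 0.04, 0.04, 0.3, 7.2]:
    counts[bucket_index(latency)] += 1

print(counts)  # [1, 0, 0, 2, 0, 0, 1, 0, 0, 0, 1]
```

The 0.04s samples fall in the bucket bounded by 0.05, and 7.2s lands in the overflow bucket, which is exactly the shape a backend uses to estimate percentiles.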

Gauge

A Gauge records a point-in-time snapshot of a value that doesn't aggregate meaningfully over time. CPU usage percentage, current temperature, memory utilization. In the OTel Python SDK, you set the value directly.

python
cpu_gauge = meter.create_gauge(
    name="system.cpu.utilization",
    description="Current CPU utilization as a fraction",
    unit="1",
)

# Periodically record the current value
import psutil
cpu_gauge.set(psutil.cpu_percent() / 100.0)

Observable (Async) Instruments

Observable instruments flip the control model. Instead of calling add() or record() from your code, you register a callback function that the SDK invokes on each collection cycle. This is ideal for metrics you poll from an external source — system stats, connection pool sizes, or values from a shared data structure you don't want to instrument inline.

There are observable variants for Counter, UpDownCounter, and Gauge: ObservableCounter, ObservableUpDownCounter, and ObservableGauge.

python
import psutil
from opentelemetry.metrics import Observation

def cpu_usage_callback(options):
    """Called by the SDK on each collection interval."""
    usage = psutil.cpu_percent(percpu=True)
    for idx, pct in enumerate(usage):
        yield Observation(
            value=pct / 100.0,
            attributes={"cpu.id": str(idx)},
        )

meter.create_observable_gauge(
    name="system.cpu.utilization",
    callbacks=[cpu_usage_callback],
    description="Per-core CPU utilization",
    unit="1",
)
Sync vs Async — When to Choose

Use synchronous instruments when the measurement happens at a known point in your code (e.g., inside a request handler). Use observable instruments when the value exists independently of your code flow and you just need to sample it periodically (e.g., system metrics, connection pool stats).

Quick Reference: All Instrument Types

| Instrument | Sync/Async | Monotonic | Example Use Case | Default Aggregation |
|---|---|---|---|---|
| Counter | Sync | Yes | Total requests, bytes sent | Sum |
| UpDownCounter | Sync | No | Active connections, queue depth | Sum |
| Histogram | Sync | N/A | Request latency, payload size | Explicit bucket histogram |
| Gauge | Sync | N/A | CPU usage, temperature | Last value |
| ObservableCounter | Async | Yes | System CPU time, disk I/O totals | Sum |
| ObservableUpDownCounter | Async | No | Process thread count | Sum |
| ObservableGauge | Async | N/A | Per-core CPU utilization | Last value |

Aggregation Temporality: Cumulative vs Delta

When the SDK exports metric data, it must decide what time range each data point covers. This is aggregation temporality, and it has two modes:

  • Cumulative — each data point represents the running total since the process started (or since the metric stream began). The value at time T includes everything from time 0 to T.
  • Delta — each data point represents only the change since the last successful export. The value covers the interval between the previous export and now.
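The relationship between the two modes can be sketched in a few lines. This mirrors what a cumulative-to-delta conversion has to do, including counter-reset handling; it is illustrative only (the Collector's cumulativetodelta processor is the production tool):

```python
def cumulative_to_delta(points: list) -> list:
    """Convert cumulative data points to per-interval deltas.

    A drop in value signals a counter reset (e.g., a process restart);
    the post-reset value is taken as the delta for that interval.
    """
    deltas = []
    previous = None
    for value in points:
        if previous is None or value < previous:
            deltas.append(value)            # first point, or reset detected
        else:
            deltas.append(value - previous)
        previous = value
    return deltas

# The process restarts between the 4th and 5th export (12 -> 3)
print(cumulative_to_delta([0, 5, 12, 12, 3]))  # [0, 5, 7, 0, 3]
```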

This is not an abstract concern. Your backend dictates which temporality it expects, and sending the wrong one causes incorrect results.

| Backend | Expected Temporality | Why |
|---|---|---|
| Prometheus | Cumulative | Prometheus computes rates from cumulative counters using rate() and increase(). It expects monotonically increasing values and detects resets. |
| Datadog | Delta | Datadog's intake API expects pre-computed deltas. Sending cumulative values results in double-counting. |
| StatsD | Delta | StatsD aggregates deltas on the server side. Each flush is treated as an increment. |
| OTLP (generic) | Either | OTLP supports both. The OTel Collector can convert between them using the cumulativetodelta or deltatocumulative processors. |

You configure temporality on the exporter, not on individual instruments. Here's how to configure an OTLP exporter with delta temporality and wire it into a PeriodicExportingMetricReader:

python
from opentelemetry.sdk.metrics import (
    Counter,
    Histogram,
    MeterProvider,
    ObservableCounter,
    ObservableGauge,
    ObservableUpDownCounter,
    UpDownCounter,
)
from opentelemetry.sdk.metrics.export import (
    PeriodicExportingMetricReader,
    AggregationTemporality,
)
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import (
    OTLPMetricExporter,
)

# Delta temporality for all instrument types (e.g., for Datadog)
delta_temporality = {
    Counter: AggregationTemporality.DELTA,
    UpDownCounter: AggregationTemporality.CUMULATIVE,
    Histogram: AggregationTemporality.DELTA,
    ObservableCounter: AggregationTemporality.DELTA,
    ObservableUpDownCounter: AggregationTemporality.CUMULATIVE,
    ObservableGauge: AggregationTemporality.CUMULATIVE,
}

exporter = OTLPMetricExporter(
    preferred_temporality=delta_temporality,
)
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=10000)
provider = MeterProvider(metric_readers=[reader])
UpDownCounter stays cumulative

Even when targeting a delta-preferring backend, UpDownCounter and ObservableUpDownCounter typically remain cumulative. A delta UpDownCounter can produce negative values that many backends misinterpret. Keep these cumulative unless your backend explicitly requires delta for all types.

Views: Customizing Aggregation

Views let you override how an instrument's data is aggregated without changing the instrumentation code. This is powerful: a library author records a Histogram with default buckets, and you — the application operator — reconfigure the bucket boundaries, drop unwanted attributes, or even change the aggregation type entirely.

The most common use case is adjusting histogram bucket boundaries for sub-second latencies:

python
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.view import View
from opentelemetry.sdk.metrics.export import (
    ExplicitBucketHistogramAggregation,
)

# Custom buckets for request latency (in seconds)
latency_view = View(
    instrument_name="http.server.request.duration",
    aggregation=ExplicitBucketHistogramAggregation(
        boundaries=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
    ),
)

# Drop a high-cardinality attribute to reduce series count
drop_user_id_view = View(
    instrument_name="http.server.request.duration",
    attribute_keys=["http.method", "http.route", "http.status_code"],
    # Only these attributes are kept; "user.id" and others are dropped
)

provider = MeterProvider(
    metric_readers=[reader],
    views=[latency_view, drop_user_id_view],
)

Views match instruments by name (exact or wildcard), instrument type, meter name, or meter schema URL. You can register multiple views, and they apply independently — a single instrument can produce multiple metric streams if matched by more than one view.

Exemplars: Bridging Metrics and Traces

Metrics tell you what is happening (p99 latency spiked to 2.3s). Traces tell you why (a specific database query took 1.8s). Exemplars are the bridge — they attach sampled trace context (trace ID, span ID) directly to individual metric data points.

When your Histogram records a latency value that falls into a particular bucket, an exemplar can capture the trace ID of the request that produced that value. In your metrics backend, you click from the spike in your p99 chart directly into the offending trace.

python
from opentelemetry.sdk.metrics import MeterProvider

# Note: in current Python SDK releases the exemplar filters live in an
# internal module, so this import path may change in a future version
from opentelemetry.sdk.metrics._internal.exemplar import (
    AlwaysOnExemplarFilter,
    TraceBasedExemplarFilter,
)

# TraceBasedExemplarFilter: attach exemplars only when a sampled
# trace context is active (recommended for production)
provider = MeterProvider(
    metric_readers=[reader],
    exemplar_filter=TraceBasedExemplarFilter(),
)

# AlwaysOnExemplarFilter: attach exemplars for every measurement
# (useful for debugging, high overhead in production)

Exemplars work automatically when both tracing and metrics are configured in the same process. The SDK captures the active span's trace ID and span ID at the moment a measurement is recorded. No changes to your instrumentation code are needed — just configure the exemplar filter on the MeterProvider.

Exemplar support varies by backend

Grafana (with Tempo + Mimir/Prometheus) and Datadog both support exemplar-based metric-to-trace navigation. Verify your backend supports exemplars before enabling them — if it doesn't, the exemplar data is exported but silently ignored, adding overhead for no benefit.

Structured Logging and Log-Trace Correlation

Most applications start with plain-text log lines like INFO 2024-03-15 Order placed for user 42. This works when you're tailing a single log file on a single server. It falls apart the moment you have dozens of services, thousands of requests per second, and a need to answer questions like "show me every log line related to this failed checkout."

Unstructured logs are expensive to parse at query time, impossible to filter reliably with simple string matching, and completely disconnected from the traces and metrics that give them context. Structured logging and log-trace correlation solve all three problems.

Why Unstructured Logs Break at Scale

Consider a traditional log line:

text
2024-03-15 09:22:17 ERROR [PaymentService] Failed to charge card for order #8812 — timeout after 30s

A human can read this. A machine cannot — at least not without a fragile regex that breaks the next time someone changes the log format. You can't reliably extract the order ID, you can't filter by severity without parsing the word "ERROR" out of a free-text blob, and you have zero connection to the request trace that triggered this payment attempt.

Structured logging solves the parsing problem by emitting log records as key-value pairs in a machine-readable format. The two dominant formats are JSON and logfmt.

Structured Formats: JSON vs logfmt

| Format | Example | Strengths |
|---|---|---|
| JSON | {"level":"error","msg":"charge failed","order_id":8812} | Universal parser support, nested values, wide ecosystem |
| logfmt | level=error msg="charge failed" order_id=8812 | Human-readable, compact, easy to scan in a terminal |

JSON is the safer default for production systems — every log aggregator (Elasticsearch, Loki, Datadog) natively parses it. logfmt shines in local development and CLI tools where you read logs with your eyes. Both formats let you query on specific fields (order_id=8812) without regex gymnastics.
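As a sketch of how little machinery logfmt needs, here is a minimal encoder. It is illustrative only; real logfmt libraries also escape embedded quotes and handle nested values:

```python
def to_logfmt(record: dict) -> str:
    """Render a flat dict as a logfmt line, quoting values with spaces."""
    parts = []
    for key, value in record.items():
        text = str(value)
        if " " in text or "=" in text:
            text = f'"{text}"'
        parts.append(f"{key}={text}")
    return " ".join(parts)

line = to_logfmt({"level": "error", "msg": "charge failed", "order_id": 8812})
print(line)  # level=error msg="charge failed" order_id=8812
```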

The OTel Logs Data Model

OpenTelemetry defines a standard LogRecord structure that goes beyond simple key-value logging. Every log record carries a consistent set of fields that connect it to the broader observability picture — traces, resources, and instrumentation scopes.

The key fields in a LogRecord are:

| Field | Type | Purpose |
|---|---|---|
| timestamp | uint64 (nanoseconds) | When the log was emitted |
| observed_timestamp | uint64 (nanoseconds) | When the collector received the log |
| severity_number | int (1–24) | Numeric severity (maps to TRACE, DEBUG, INFO, WARN, ERROR, FATAL) |
| severity_text | string | Human-readable severity like "ERROR" |
| body | any | The actual log message (string, map, or array) |
| attributes | key-value map | Structured metadata such as order_id, user_id, http.status_code |
| resource | key-value map | The entity producing the log: service.name, host.name, k8s.pod.name |
| trace_id | bytes (16) | Links this log to a distributed trace |
| span_id | bytes (8) | Links this log to a specific span within that trace |
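The severity numbers map to named bands in blocks of four (1–4 is TRACE, 5–8 is DEBUG, and so on up to 21–24 for FATAL). A helper makes the mapping explicit; the function itself is my own illustration of the data model's rule:

```python
def severity_name(severity_number: int) -> str:
    """Map an OTel severity_number (1-24) to its named band."""
    if not 1 <= severity_number <= 24:
        raise ValueError("severity_number must be in 1..24")
    bands = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"]
    return bands[(severity_number - 1) // 4]

print(severity_name(9))   # INFO
print(severity_name(17))  # ERROR
```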
Note

The trace_id and span_id fields are what make log-trace correlation possible. Without them, logs and traces live in completely separate silos. With them, you can click a log line in Grafana and jump directly to the exact trace — and vice versa.

The Log Bridge API — Integrate, Don't Replace

OTel does not ask you to throw away your existing logging library. Instead, it provides a Log Bridge API that sits between your application's logger and the OTel SDK. Your code continues to call logging.error() in Python, logger.error() in Log4j, or Log.Error() in Serilog. The bridge intercepts those calls, converts them into OTel LogRecord objects, enriches them with trace context and resource attributes, and exports them via OTLP.

This design has two major advantages. First, you get zero-disruption adoption — no rewriting of application code. Second, the bridge automatically injects the active trace_id and span_id from the current context, which is something you'd otherwise have to wire up manually in every log call.

flowchart LR
    A["Application Code\ncalls logger.error()"] --> B["Standard Logger\n(Python logging, Log4j, Serilog)"]
    B --> C["OTel Log Bridge\nLoggingHandler / Appender"]
    C --> D["Inject trace_id\n& span_id from\nactive context"]
    D --> E["Enrich with\nResource Attributes\n(service.name, host, etc.)"]
    E --> F["OTLP Exporter"]
    F --> G["Backend\n(Grafana, Jaeger, Datadog)"]
    G --> H["Query logs\nby trace_id"]
    

Code Example: Python Logging with OTel Log Bridge

The following example sets up the OTel log bridge for Python's built-in logging module. After this setup, every log call automatically includes trace context — no changes needed in your application code.

python
import logging
from opentelemetry import trace
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.sdk.resources import Resource

# 1. Define the service resource
resource = Resource.create({"service.name": "payment-service"})

# 2. Set up the OTel LoggerProvider with an OTLP exporter
logger_provider = LoggerProvider(resource=resource)
logger_provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint="http://localhost:4317"))
)

# 3. Attach OTel's LoggingHandler to Python's root logger
handler = LoggingHandler(level=logging.DEBUG, logger_provider=logger_provider)
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.DEBUG)

# 4. Application code — unchanged, uses standard logging
logger = logging.getLogger("payment")
logger.info("Payment processing started", extra={"order_id": 8812})

The LoggingHandler is the bridge. It converts every Python LogRecord into an OTel LogRecord, automatically pulling trace_id and span_id from the active span context. The extra dict fields become OTel log attributes.

Auto-Injection of Trace Context into Logs

Once the bridge is wired up, trace correlation happens automatically within any active span. Here's what it looks like in practice — you create a span, and every log call inside that span carries the trace context.

python
tracer = trace.get_tracer("payment")
logger = logging.getLogger("payment")

def process_payment(order_id: int, amount: float):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", order_id)
        logger.info("Charging card", extra={"order_id": order_id, "amount": amount})

        try:
            result = charge_card(order_id, amount)
            logger.info("Payment succeeded", extra={"order_id": order_id})
        except TimeoutError:
            logger.error("Payment gateway timeout", extra={"order_id": order_id})
            span.set_status(trace.StatusCode.ERROR, "gateway timeout")
            raise

Every logger.info() and logger.error() call inside the with block automatically gets the trace_id and span_id of the process_payment span. The exported log records look like this:

json
{
  "timestamp": "2024-03-15T09:22:17.384Z",
  "severity_text": "ERROR",
  "severity_number": 17,
  "body": "Payment gateway timeout",
  "attributes": {
    "order_id": 8812
  },
  "resource": {
    "service.name": "payment-service"
  },
  "trace_id": "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4",
  "span_id": "f6e5d4c3b2a1f6e5"
}

With this record in your backend, you can query all logs for trace a1b2c3d4... and immediately see every log emitted during that request — across every service in the chain. In Grafana, this is a single click from a log panel to the trace timeline view.

Tip

If you also want trace_id in your console output during development, use the LoggingInstrumentor from the opentelemetry-instrumentation-logging package and reference the injected fields in your formatter: logging.Formatter('%(asctime)s %(levelname)s [trace=%(otelTraceID)s] %(message)s'). The instrumentor injects otelTraceID and otelSpanID as attributes on every Python LogRecord.

How Log-Trace Correlation Works End-to-End

The correlation chain has three links. First, the OTel tracing SDK creates a span with a unique trace_id and span_id, stored in the current context. Second, when your code emits a log, the Log Bridge reads these IDs from the context and stamps them onto the LogRecord. Third, the backend indexes logs and traces by trace_id, enabling bidirectional navigation.

This means your debugging workflow changes fundamentally. Instead of grep-searching through gigabytes of text, you start from an error log, click through to the trace, see exactly which service and span failed, inspect the span's attributes and events, and find every other log line emitted during that same request. The trace_id becomes the universal join key across all three signals — logs, traces, and metrics.

Context Propagation and Baggage Across Services

Distributed tracing only works if every service in a request chain knows which trace it belongs to. The mechanism that makes this possible is context propagation — the automatic passing of trace identity and metadata across process boundaries via HTTP headers, message headers, or other transport mechanisms.

Without context propagation, each service would create isolated, unrelated traces. With it, you get a single connected trace that follows a request from ingress to the deepest downstream dependency.

The OTel Context Object

At the heart of propagation is the Context object — an immutable key-value store that carries two critical pieces of data through your application:

  • Span Context — the trace ID, span ID, trace flags, and trace state that identify the current position in a distributed trace
  • Baggage — arbitrary key-value pairs you attach for cross-service communication (covered in detail below)

Context is immutable by design. Every time you add or modify a value, a new Context is returned. This prevents race conditions in concurrent code and ensures that each operation sees a consistent snapshot of the propagation state.
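A toy version of an immutable context makes the copy-on-write behavior concrete. This is illustrative only, not the real opentelemetry Context class:

```python
class ImmutableContext:
    """Minimal copy-on-write context: set_value returns a NEW context."""

    def __init__(self, values=None):
        self._values = dict(values or {})

    def set_value(self, key, value):
        updated = dict(self._values)
        updated[key] = value
        return ImmutableContext(updated)  # the original is untouched

    def get_value(self, key):
        return self._values.get(key)

base = ImmutableContext()
with_tenant = base.set_value("tenant-id", "acme")

print(base.get_value("tenant-id"))         # None -- unchanged
print(with_tenant.get_value("tenant-id"))  # acme
```

Because no context is ever mutated in place, two concurrent operations holding the same context can never observe each other's half-applied changes.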

Context is not the same as a Span

Context is a carrier, not a trace element. A Span is something you create and end. Context is the invisible thread that links spans together across goroutines, threads, and network boundaries. Think of it as the envelope; the span context and baggage are the letters inside.

How Propagation Works: Injection and Extraction

Propagation involves two complementary operations that happen at service boundaries:

  • Injection — On the outgoing side, the propagator serializes the current Context into a carrier (e.g., HTTP request headers). This happens automatically when you use OTel-instrumented HTTP clients or messaging libraries.
  • Extraction — On the incoming side, the propagator deserializes the carrier back into a Context object. Your server middleware or framework integration does this before your handler code runs.
sequenceDiagram
    participant A as Service A<br/>(Order API)
    participant B as Service B<br/>(Billing API)
    participant C as Service C<br/>(Notification API)
    Note over A: Creates root span<br/>Sets baggage: tenant-id=acme
    A->>A: inject(context, headers)
    A->>B: POST /charge<br/>traceparent: 00-abc123-span01-01<br/>baggage: tenant-id=acme
    Note over B: extract(headers) → context
    B->>B: Read baggage: tenant-id=acme<br/>Create child span (parent=span01)
    B->>B: inject(context, headers)
    B->>C: POST /notify<br/>traceparent: 00-abc123-span02-01<br/>baggage: tenant-id=acme
    Note over C: extract(headers) → context
    C->>C: Read baggage: tenant-id=acme<br/>Create child span (parent=span02)

Notice how the traceparent header carries the same trace ID (abc123) across all three services, but the parent span ID updates at each hop. The baggage header travels unchanged, making the tenant-id available everywhere without each service needing to look it up independently.

Propagators: The Serialization Formats

A propagator defines how context is serialized into and deserialized from a carrier. OTel supports multiple propagator formats to interoperate with different tracing ecosystems.

| Propagator | Headers Used | When to Use |
|---|---|---|
| W3CTraceContextPropagator | traceparent, tracestate | Default. Use unless you have a reason not to. W3C standard supported by all major vendors. |
| W3CBaggagePropagator | baggage | Carries baggage key-value pairs. Must be explicitly added alongside the trace context propagator. |
| B3Propagator | b3 or X-B3-TraceId, etc. | Zipkin compatibility. Use when migrating from Zipkin or communicating with Zipkin-instrumented services. |
| CompositePropagator | Combines multiple | Use when your system needs multiple formats simultaneously (e.g., W3C + B3 during migration). |

In most setups, you want a composite propagator that includes both the trace context and baggage propagators. Here's how to configure that:

javascript
const { W3CTraceContextPropagator } = require('@opentelemetry/core');
const { W3CBaggagePropagator } = require('@opentelemetry/core');
const { CompositePropagator } = require('@opentelemetry/core');
const { propagation } = require('@opentelemetry/api');

// Register both propagators so baggage travels with traces
propagation.setGlobalPropagator(
  new CompositePropagator({
    propagators: [
      new W3CTraceContextPropagator(),
      new W3CBaggagePropagator(),
    ],
  })
);
python
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import (
    TraceContextTextMapPropagator,
)
from opentelemetry.baggage.propagation import W3CBaggagePropagator

set_global_textmap(
    CompositePropagator([
        TraceContextTextMapPropagator(),
        W3CBaggagePropagator(),
    ])
)
go
import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

otel.SetTextMapPropagator(
    propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ),
)

Baggage: Cross-Service Key-Value Pairs

Baggage is the OTel mechanism for passing arbitrary key-value metadata alongside trace context. Unlike span attributes (which are local to a single span), baggage entries propagate across every service boundary in the request chain. This makes baggage ideal for values that multiple services need without making independent lookups.

Common Use Cases

  • Tenant ID — Multi-tenant systems can route, filter, and label telemetry per tenant without each service querying a database
  • Feature flags — Propagate which feature variant is active so downstream services can branch behavior consistently
  • A/B test group — Ensure the entire request chain knows which experiment cohort the user belongs to
  • Request priority — Let downstream services adjust timeout or queue priority based on the origin's classification
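On the wire, baggage is just a comma-separated header of percent-encoded key-value pairs. A round-trip sketch of the encoding, simplified (the real W3C format also allows per-entry properties after a semicolon, which this ignores):

```python
from urllib.parse import quote, unquote

def encode_baggage(entries: dict) -> str:
    """Serialize entries into a baggage header value."""
    return ",".join(f"{key}={quote(value)}" for key, value in entries.items())

def decode_baggage(header: str) -> dict:
    """Parse a baggage header value back into a dict."""
    result = {}
    for item in header.split(","):
        key, _, value = item.strip().partition("=")
        result[key] = unquote(value)
    return result

header = encode_baggage({"tenant-id": "acme", "ab-group": "experiment-42"})
print(header)  # tenant-id=acme,ab-group=experiment-42
print(decode_baggage(header) == {"tenant-id": "acme", "ab-group": "experiment-42"})  # True
```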

Setting and Reading Baggage

javascript
const { propagation, context } = require('@opentelemetry/api');

// --- Service A: Setting baggage at the edge ---
const baggage = propagation.createBaggage({
  'tenant-id': { value: 'acme' },
  'ab-group': { value: 'experiment-42' },
});
const ctxWithBaggage = propagation.setBaggage(context.active(), baggage);

// Run downstream calls within this context
context.with(ctxWithBaggage, () => {
  // Any HTTP call made here will include the baggage header
  fetch('http://billing-api/charge', { method: 'POST', body });
});

// --- Service B: Reading baggage downstream ---
app.post('/charge', (req, res) => {
  const currentBaggage = propagation.getBaggage(context.active());
  const tenantId = currentBaggage?.getEntry('tenant-id')?.value;
  console.log(`Processing charge for tenant: ${tenantId}`);
  // tenantId === 'acme'
});
python
import requests

from opentelemetry import baggage, context

# --- Service A: Setting baggage at the edge ---
ctx = baggage.set_baggage("tenant-id", "acme")
ctx = baggage.set_baggage("ab-group", "experiment-42", context=ctx)

# Attach context so outgoing calls propagate it
token = context.attach(ctx)
try:
    requests.post("http://billing-api/charge", json=payload)
finally:
    context.detach(token)

# --- Service B: Reading baggage downstream ---
tenant_id = baggage.get_baggage("tenant-id")
print(f"Processing charge for tenant: {tenant_id}")
# tenant_id == "acme"
go
import (
    "fmt"
    "net/http"

    "go.opentelemetry.io/otel/baggage"
)

// --- Service A: Setting baggage at the edge ---
tenantMember, _ := baggage.NewMember("tenant-id", "acme")
abMember, _ := baggage.NewMember("ab-group", "experiment-42")
bag, _ := baggage.New(tenantMember, abMember)
ctx = baggage.ContextWithBaggage(ctx, bag)

// Pass ctx to your HTTP client — headers are injected automatically
req, _ := http.NewRequestWithContext(ctx, "POST", "http://billing-api/charge", body)
client.Do(req)

// --- Service B: Reading baggage downstream ---
bag := baggage.FromContext(ctx)
tenantID := bag.Member("tenant-id").Value()
fmt.Printf("Processing charge for tenant: %s\n", tenantID)
// tenantID == "acme"

Baggage Gotchas and Security Concerns

Baggage is powerful but comes with real trade-offs you need to understand before using it in production. Because baggage entries are serialized into HTTP headers on every request, they're subject to both size constraints and security risks.

Baggage is visible in plain text

Baggage values are transmitted as HTTP headers in cleartext. Never put sensitive data — passwords, tokens, PII, or API keys — in baggage. Any proxy, load balancer, or service in the chain can read (and log) these values. Treat baggage like a URL query parameter: assume everyone can see it.
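On the wire, baggage is a single comma-separated HTTP header as defined by the W3C Baggage specification, so every entry is as visible as a URL query parameter:

```text
baggage: tenant-id=acme,ab-group=experiment-42
```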

Key Constraints

| Constraint | Detail |
| --- | --- |
| Header size limits | Most HTTP servers and proxies enforce header size limits (e.g., Nginx defaults to 8 KB total headers). A bloated baggage header can cause 431 Request Header Fields Too Large errors. |
| Propagation overhead | Every baggage entry is serialized and deserialized at every hop. Keep entries small (short keys, short values) and few (under 10 entries is a good rule of thumb). |
| No automatic cleanup | Baggage accumulates. If Service A sets 5 entries and Service B adds 3 more, Service C sees all 8. There is no built-in TTL or scoping mechanism. |
| Cross-trust boundaries | Baggage from external callers flows into your system. Validate and sanitize baggage values at trust boundaries to prevent injection attacks or unexpected behavior. |
Use a SpanProcessor to copy baggage to span attributes

Baggage entries don't automatically appear in your trace backend. If you want to query traces by tenant-id, write a custom SpanProcessor that reads baggage from the current context and copies selected entries into span attributes at span start. This keeps baggage lean for propagation while making the data queryable in your observability platform.

Adding B3 Compatibility During Migration

If you're migrating from Zipkin to OTel (or need to communicate with services that still use B3 headers), you can configure a composite propagator that handles both formats. Incoming requests with either traceparent or b3 headers will be understood, and outgoing requests will include both.

javascript
const { CompositePropagator, W3CTraceContextPropagator, W3CBaggagePropagator } = require('@opentelemetry/core');
const { B3Propagator, B3InjectEncoding } = require('@opentelemetry/propagator-b3');
const { propagation } = require('@opentelemetry/api');

propagation.setGlobalPropagator(
  new CompositePropagator({
    propagators: [
      new W3CTraceContextPropagator(),
      new W3CBaggagePropagator(),
      new B3Propagator({ injectEncoding: B3InjectEncoding.MULTI_HEADER }),
    ],
  })
);
// Outgoing requests now include: traceparent, baggage, X-B3-TraceId, etc.

Once all services are migrated to W3C Trace Context, you can safely remove the B3 propagator from the composite and drop the extra headers.

Auto-Instrumentation vs Manual Instrumentation

OpenTelemetry gives you two distinct paths to instrument your applications: auto-instrumentation, which hooks into common frameworks and libraries without touching your code, and manual instrumentation, where you explicitly create spans, record metrics, and emit logs using the OTel API. In practice, most production systems use both — auto-instrumentation as the foundation, manual instrumentation for business-specific detail.

Auto-Instrumentation: Zero-Code Telemetry

Auto-instrumentation works by intercepting calls to well-known libraries — HTTP clients, database drivers, web frameworks, messaging systems — and automatically creating spans, propagating context, and recording attributes. You get distributed tracing across service boundaries without writing a single line of instrumentation code.

Each language ecosystem provides its own mechanism for this. Here's how you set it up across four major languages:

Install the auto-instrumentation package and run your app with the opentelemetry-instrument wrapper:

bash
# Install the SDK and auto-instrumentation packages
pip install opentelemetry-distro opentelemetry-exporter-otlp

# Install instrumentation libraries for detected packages
opentelemetry-bootstrap -a install

# Run your Flask/Django/FastAPI app — no code changes needed
OTEL_SERVICE_NAME=order-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
opentelemetry-instrument python app.py

The opentelemetry-bootstrap command inspects your installed packages and installs matching instrumentation libraries (e.g., opentelemetry-instrumentation-flask, opentelemetry-instrumentation-psycopg2). The opentelemetry-instrument command monkey-patches these libraries at startup.

Download the Java agent JAR and attach it to your JVM process:

bash
# Download the latest agent JAR
curl -L -o opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

# Attach the agent to your Java application
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=order-service \
  -Dotel.exporter.otlp.endpoint=http://localhost:4318 \
  -jar my-app.jar

The Java agent uses bytecode manipulation to instrument over 100 libraries — Spring, JAX-RS, JDBC, Hibernate, Kafka, gRPC, and more. No recompilation required. It's the most mature auto-instrumentation in the OTel ecosystem.

Install the meta-package and register it before your application loads:

bash
# Install SDK and auto-instrumentation packages
npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http
javascript
// tracing.js — load this BEFORE your app code
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");

const sdk = new NodeSDK({
  serviceName: "order-service",
  traceExporter: new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
bash
# Start your app with the tracing setup loaded first
node --require ./tracing.js app.js

.NET offers a NuGet-based approach with OpenTelemetry.AutoInstrumentation:

bash
# Download and run the official install script
curl -sSfL -O https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/releases/latest/download/otel-dotnet-auto-install.sh
sh otel-dotnet-auto-install.sh

# Set environment variables and run your app
OTEL_SERVICE_NAME=order-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
OTEL_DOTNET_AUTO_HOME=$HOME/.otel-dotnet-auto \
CORECLR_ENABLE_PROFILING=1 \
CORECLR_PROFILER="{918728DD-259F-4A6A-AC2B-B85E1B658571}" \
CORECLR_PROFILER_PATH=${OTEL_DOTNET_AUTO_HOME}/linux-x64/OpenTelemetry.AutoInstrumentation.Native.so \
dotnet run

The .NET agent uses the CLR profiling API to inject instrumentation into ASP.NET Core, HttpClient, SqlClient, Entity Framework, and other common libraries at runtime.

Manual Instrumentation: Fine-Grained Control

Auto-instrumentation covers the infrastructure layer — HTTP requests, database queries, message queue operations. But it knows nothing about your business logic. When you need spans for "process payment," "validate inventory," or "calculate shipping cost," you reach for manual instrumentation.

Manual instrumentation uses the OpenTelemetry API directly. You acquire a tracer, create spans, set attributes, and record events. This gives you full control over span names, hierarchies, and the metadata attached to each operation.

python
from opentelemetry import trace

tracer = trace.get_tracer("order-service", "1.0.0")

def process_order(order):
    with tracer.start_as_current_span("process-order") as span:
        span.set_attribute("order.id", order.id)
        span.set_attribute("order.item_count", len(order.items))
        span.set_attribute("order.total_usd", order.total)

        with tracer.start_as_current_span("validate-inventory"):
            check_stock(order.items)

        with tracer.start_as_current_span("charge-payment") as payment_span:
            result = charge_card(order.payment_method, order.total)
            payment_span.set_attribute("payment.provider", result.provider)
            payment_span.set_attribute("payment.status", result.status)

        span.add_event("order.completed", {"order.id": order.id})
java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

Tracer tracer = GlobalOpenTelemetry.getTracer("order-service", "1.0.0");

public void processOrder(Order order) {
    Span span = tracer.spanBuilder("process-order").startSpan();
    try (Scope scope = span.makeCurrent()) {
        span.setAttribute("order.id", order.getId());
        span.setAttribute("order.item_count", order.getItems().size());

        // Child spans are linked automatically via context
        validateInventory(order.getItems());
        chargePayment(order.getPaymentMethod(), order.getTotal());

        span.addEvent("order.completed");
    } catch (Exception e) {
        span.recordException(e);
        span.setStatus(StatusCode.ERROR, e.getMessage());
        throw e;
    } finally {
        span.end();
    }
}
javascript
const { trace, SpanStatusCode } = require("@opentelemetry/api");

const tracer = trace.getTracer("order-service", "1.0.0");

async function processOrder(order) {
  return tracer.startActiveSpan("process-order", async (span) => {
    try {
      span.setAttribute("order.id", order.id);
      span.setAttribute("order.item_count", order.items.length);

      await tracer.startActiveSpan("validate-inventory", async (child) => {
        await checkStock(order.items);
        child.end();
      });

      await tracer.startActiveSpan("charge-payment", async (child) => {
        const result = await chargeCard(order.paymentMethod, order.total);
        child.setAttribute("payment.status", result.status);
        child.end();
      });

      span.addEvent("order.completed");
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}
Tracer names matter

The first argument to getTracer() is the instrumentation scope name — typically your service or library name. This isn't just a label; backends use it to group and filter telemetry. Always use a meaningful, consistent name rather than an empty string.

The Hybrid Approach: Best of Both

The most effective strategy is layering manual instrumentation on top of auto-instrumentation. Auto-instrumentation gives you the full picture of how requests flow through infrastructure — HTTP handlers, database calls, external API requests. Manual instrumentation fills in the business logic gaps that sit between those infrastructure calls.

Here's what the hybrid looks like in a Python Flask application. Auto-instrumentation handles the Flask route and the database query. You add manual spans for the domain logic in between:

python
from flask import Flask, request, jsonify
from opentelemetry import trace

app = Flask(__name__)
tracer = trace.get_tracer("order-service")

@app.route("/orders", methods=["POST"])
def create_order():
    # AUTO: Flask instrumentation creates a span for "POST /orders"
    data = request.get_json()

    # MANUAL: Custom span for business validation
    with tracer.start_as_current_span("validate-order") as span:
        span.set_attribute("order.customer_id", data["customer_id"])
        errors = validate_business_rules(data)
        if errors:
            span.set_attribute("validation.passed", False)
            return jsonify({"errors": errors}), 400
        span.set_attribute("validation.passed", True)

    # MANUAL: Custom span for pricing calculation
    with tracer.start_as_current_span("calculate-pricing") as span:
        pricing = compute_total(data["items"], data.get("coupon_code"))
        span.set_attribute("pricing.subtotal", pricing.subtotal)
        span.set_attribute("pricing.discount_pct", pricing.discount)

    # AUTO: SQLAlchemy instrumentation captures the INSERT query
    order = save_order_to_db(data, pricing)

    # AUTO: requests instrumentation captures the outgoing HTTP call
    notify_warehouse(order)

    return jsonify({"order_id": order.id}), 201

The resulting trace contains a clean hierarchy: the auto-generated Flask span at the root, your manual business spans nested inside, and auto-generated database and HTTP spans beneath those. You see the full story — infrastructure and domain logic — in one trace.
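Rendered in a trace viewer, that hierarchy might look like this (span names illustrative; the last span is the notify_warehouse HTTP call):

```text
POST /orders                       ← auto (Flask)
├── validate-order                 ← manual
├── calculate-pricing              ← manual
├── INSERT INTO orders             ← auto (SQLAlchemy)
└── POST http://warehouse/notify   ← auto (requests)
```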

Trade-offs at a Glance

| Factor | Auto-Instrumentation | Manual Instrumentation |
| --- | --- | --- |
| Setup effort | Minutes — add an agent or package, set env vars | Hours to days — instrument each operation explicitly |
| Code changes | Zero (or one bootstrap file) | Requires modifying application code |
| Coverage | Common frameworks and libraries only | Anything you choose to instrument |
| Granularity | Infrastructure-level (HTTP routes, SQL queries) | Business-level (domain operations, custom attributes) |
| Span naming | Generic (e.g., GET /api/orders/:id) | Domain-specific (e.g., apply-loyalty-discount) |
| Performance overhead | Slightly higher — instruments everything it can | You control exactly what's traced |
| Maintenance | Updates with agent/library versions | Must update code when logic changes |
| Startup impact | Adds 1–5 s startup time (especially Java agent) | Negligible |
Start with auto, add manual where it hurts

Don't try to manually instrument everything on day one. Deploy auto-instrumentation first, observe your traces in a backend like Jaeger or Grafana Tempo, then identify the gaps — the "black box" spans where you know business logic runs but can't see what happened. Add manual spans there. This incremental approach gives you the fastest time-to-value.

Auto-Instrumentation Library Support

Not all languages have the same breadth of auto-instrumentation support. Here's a snapshot of coverage for common libraries:

| Library / Framework | Python | Java | Node.js | .NET |
| --- | --- | --- | --- | --- |
| HTTP server (Express, Flask, Spring, etc.) | ✅ | ✅ | ✅ | ✅ |
| HTTP client (requests, HttpClient) | ✅ | ✅ | ✅ | ✅ |
| PostgreSQL / MySQL | ✅ | ✅ (JDBC) | ✅ | ✅ |
| Redis | ✅ | ✅ | ✅ | ✅ |
| Kafka | ✅ | ✅ | Partial | ✅ |
| gRPC | ✅ | ✅ | ✅ | ✅ |
| GraphQL | Limited | ✅ | ✅ | Limited |
| MongoDB | ✅ | ✅ | ✅ | ✅ |
Auto-instrumentation is not "set and forget"

Auto-instrumentation agents can conflict with other bytecode manipulation tools (APM agents, security agents), increase memory usage, and occasionally break on major library version upgrades. Test auto-instrumentation in staging before production rollout, and pin your agent versions in CI/CD pipelines to avoid surprises.

Sampling Strategies: Head-Based, Tail-Based, and Beyond

A moderately busy microservices application can generate millions of spans per minute. Storing and indexing every single one of them is expensive — both in network bandwidth leaving your services and in backend storage costs. The uncomfortable truth is that most traces are uninteresting: a successful 50ms GET request looks nearly identical to the one before it and the one after it.

Sampling is how you keep your observability costs under control without losing the traces that actually matter. The core question every sampling strategy answers is: which traces do we keep, and when do we decide?

flowchart LR
    subgraph head ["Head-Based Sampling"]
        direction TB
        A1["Request arrives"] --> A2{"Sampler decides\nimmediately"}
        A2 -->|"Sampled ✓"| A3["Spans collected\n& exported"]
        A2 -->|"Not sampled ✗"| A4["Spans dropped\nat creation"]
    end

    subgraph tail ["Tail-Based Sampling"]
        direction TB
        B1["Request arrives"] --> B2["All spans collected\ninto buffer"]
        B2 --> B3["Trace completes"]
        B3 --> B4{"Tail sampler\nevaluates full trace"}
        B4 -->|"Interesting"| B5["Trace kept"]
        B4 -->|"Boring"| B6["Trace dropped"]
    end
    

Head-Based Sampling

Head-based sampling makes the keep-or-drop decision at the very beginning of a trace — before any spans are even created. The decision propagates through the entire call chain via the sampled flag in the W3C traceparent header, so every service in a distributed transaction agrees on whether to record or not. This is cheap and simple: you never allocate memory for spans you won't keep.
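Concretely, the sampled decision travels in the trace-flags field at the end of the traceparent header:

```text
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
          version           trace-id                 parent-id    trace-flags
                                                        (01 = sampled, 00 = not sampled)
```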

OpenTelemetry SDKs ship with four built-in head-based samplers:

| Sampler | Behavior | Use Case |
| --- | --- | --- |
| AlwaysOn | Keeps 100% of traces | Development, low-traffic services |
| AlwaysOff | Drops 100% of traces | Disabling tracing without removing instrumentation |
| TraceIdRatioBased | Keeps a fixed percentage based on a hash of the trace ID | Steady-state production sampling at a known rate |
| ParentBased | Respects the sampling decision of the upstream (parent) span | Almost always — wraps another sampler to maintain consistency |

TraceIdRatioBased works by hashing the 128-bit trace ID and checking whether the result falls below a threshold. Because the hash is deterministic, every service that sees the same trace ID makes the same decision — no coordination needed. ParentBased wraps any other sampler and delegates to it only for root spans; for child spans, it simply inherits the parent's decision.
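The deterministic check can be sketched in plain Python. This illustrates the idea (compare the low bits of the trace ID against a ratio-derived bound), not the SDK's exact implementation:

```python
# Deterministic ratio sampling sketch: compare the low 64 bits of the
# trace ID against a threshold derived from the sampling ratio. Every
# service computes the same answer for the same trace ID, so no
# coordination is needed.
def should_sample(trace_id: int, ratio: float) -> bool:
    bound = round(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


tid = 0x4BF92F3577B34DA6A3CE929D0E0E4736
print(should_sample(tid, 1.0))   # True  — ratio 1.0 keeps everything
print(should_sample(tid, 0.0))   # False — ratio 0.0 drops everything
```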

python
from opentelemetry.sdk.trace.sampling import (
    ParentBasedTraceIdRatio,
)
from opentelemetry.sdk.trace import TracerProvider

# Keep ~10% of traces; respect parent decisions for child spans
sampler = ParentBasedTraceIdRatio(0.1)

provider = TracerProvider(sampler=sampler)
Note

Head-based sampling's fundamental limitation is that you can't know if a trace is "interesting" at the moment it starts. A request that will eventually fail with a 500 error or take 30 seconds looks identical to a fast, successful one at creation time. You'll inevitably drop traces you wish you'd kept.

Tail-Based Sampling

Tail-based sampling flips the model: collect all spans first, buffer them until the trace is complete (or a timeout fires), and then evaluate the entire trace against a set of policies. This means you can keep every trace that contains an error, every trace that exceeds a latency threshold, and probabilistically sample the rest. You get the best of both worlds — low storage costs and complete visibility into problems.

The trade-off is operational complexity. The tail sampler needs to hold spans in memory while waiting for a trace to finish, which means higher resource usage. More critically, all spans belonging to a single trace must arrive at the same collector instance — otherwise the sampler sees a partial trace and makes a bad decision. This is solved by load-balancing upstream by trace_id.

The OTel Collector's tail_sampling Processor

The OpenTelemetry Collector (contrib distribution) ships with a tail_sampling processor that supports composable policies. Here's a production-realistic configuration:

yaml
processors:
  tail_sampling:
    decision_wait: 10s          # how long to buffer spans
    num_traces: 100000          # max traces held in memory
    policies:
      # Always keep traces with errors
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Always keep slow traces (> 2 seconds)
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 2000

      # Keep 5% of everything else
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

      # Hard cap: never exceed 500 spans/sec
      - name: rate-limit-policy
        type: rate_limiting
        rate_limiting:
          spans_per_second: 500

The key policy types are:

| Policy Type | What It Does | Typical Use |
| --- | --- | --- |
| status_code | Matches traces containing spans with a specific status (ERROR, OK) | Keep all error traces |
| latency | Matches traces whose end-to-end duration exceeds a threshold | Keep slow traces for debugging |
| probabilistic | Randomly keeps a percentage of traces | Baseline sampling for "normal" traffic |
| rate_limiting | Caps the number of spans per second | Cost control / burst protection |
| composite | Combines multiple sub-policies with AND/OR logic and rate allocation | Complex multi-criteria decisions |
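Policies can also be combined. As a sketch based on the contrib processor's and policy (service name and threshold are illustrative), this keeps slow traces only when they pass through a specific service:

```yaml
      - name: slow-checkout-policy
        type: and
        and:
          and_sub_policy:
            - name: service-match
              type: string_attribute
              string_attribute:
                key: service.name
                values: [checkout-service]
            - name: slow
              type: latency
              latency:
                threshold_ms: 500
```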

Load Balancing by Trace ID

When you run multiple collector instances (and you should for availability), you need a load-balancing layer in front that routes by trace_id. The OTel Collector's loadbalancing exporter handles this — it hashes the trace ID and consistently routes to the same downstream collector.

yaml
# Gateway collector — routes spans to sampling tier
exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-sampling-collectors.svc.cluster.local
        port: 4317
Warning

If you run tail-based sampling without trace-ID-based routing, the sampler will see incomplete traces. It might drop the spans that contain the error while keeping the boring parent — exactly the opposite of what you want. Always pair tail_sampling with loadbalancing in a two-tier collector architecture.

Beyond Head and Tail: Advanced Strategies

The head-vs-tail dichotomy covers the basics, but several vendor-specific and community-driven approaches push sampling further. These systems recognize that static rules don't adapt well to changing traffic patterns.

Priority Sampling (Datadog)

Datadog's tracing libraries assign each trace a priority value at creation time: USER_REJECT (-1), AUTO_REJECT (0), AUTO_KEEP (1), or USER_KEEP (2). The Datadog Agent then uses these priorities to make downstream decisions. Any trace manually marked as USER_KEEP by application code (for example, in a critical transaction handler) is always retained, while the agent dynamically adjusts the rate for AUTO_KEEP traces based on throughput. This gives developers an escape hatch: flag the traces you know matter, and let the system handle the rest.

Dynamic Sampling (Honeycomb / Refinery)

Honeycomb's Refinery is an open-source tail-sampling proxy that goes beyond static policies. It groups traces into "key spaces" — combinations of attributes like service.name, http.status_code, and http.route — and dynamically adjusts sample rates per group to maintain a target throughput. If your /healthz endpoint suddenly spikes to 10× its normal volume, Refinery automatically samples it more aggressively, while keeping the sample rate for rare error paths low (capturing more of them).

The key insight behind dynamic sampling is that sample rate should vary inversely with how interesting or rare a traffic pattern is. High-volume, uniform traffic gets aggressively sampled down. Low-volume, unusual traffic gets kept at higher rates. This is fundamentally better than a flat probabilistic rate across all traffic.
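The per-key adjustment can be illustrated with a toy calculation (a sketch of the idea, not Refinery's actual algorithm; routes and counts are made up):

```python
# Toy dynamic sampling: choose a per-key sample rate so each key space
# contributes roughly the same number of kept traces per window.
traffic = {
    "GET /healthz": 10_000,   # high-volume, boring
    "POST /checkout": 500,
    "GET /rare-error": 5,     # rare, interesting
}
target_per_key = 50  # traces to keep per key space per window

rates = {key: min(1.0, target_per_key / count) for key, count in traffic.items()}
print(rates)
# High-volume keys are sampled aggressively; rare keys are kept in full.
```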

Tip

Whichever strategy you choose, always record the sample_rate as a span attribute. Without it, you can't accurately reconstruct metrics like request counts or error rates from sampled trace data. A trace sampled at 10% represents 10 similar traces — your analysis tooling needs that multiplier.
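With the rate recorded, reconstruction is just inverse weighting: each kept trace stands in for 1 / sample_rate real requests (routes and rates below are illustrative):

```python
# Reconstruct request counts from sampled trace data.
kept_traces = [
    {"route": "/checkout", "sample_rate": 0.1},
    {"route": "/checkout", "sample_rate": 0.1},
    {"route": "/healthz", "sample_rate": 0.01},
]
# Each trace counts as (1 / sample_rate) real requests.
estimated_requests = sum(1 / t["sample_rate"] for t in kept_traces)
print(estimated_requests)  # 120.0 (10 + 10 + 100)
```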

Choosing a Strategy

| Strategy | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Head-based (ratio) | Zero overhead, simple to configure | Blind to trace outcomes; drops errors | Low-traffic services, getting started |
| Tail-based (Collector) | Keeps interesting traces, drops boring ones | Memory-intensive; requires trace-ID routing | Medium-to-high traffic with SLO monitoring |
| Dynamic (Refinery-style) | Adapts to traffic patterns automatically | Additional infrastructure; tuning required | High-traffic, diverse workloads |
| Priority (Datadog-style) | Developer control over critical paths | Vendor-specific; requires code changes | Teams needing guaranteed capture of key flows |

In practice, most production systems combine strategies. A common pattern is head-based ParentBased(TraceIdRatio(0.1)) in the SDK to cut volume by 90% before it hits the network, followed by tail-based sampling in the Collector to ensure the remaining 10% is biased toward the traces that matter most.
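The SDK side of that pattern is usually configured without code changes, through the standard environment variables defined by the OTel specification:

```shell
# Head sampling in the SDK: parent-based, keep ~10% of root traces
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1
```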

Hands-On: Instrumenting a Python Microservice

This walkthrough instruments a FastAPI-based payment service with OpenTelemetry from scratch. By the end you'll have distributed traces, request metrics, and structured logs flowing into a local Jaeger instance through the OTel Collector.

The full setup runs in Docker Compose so you can experiment without installing anything beyond Docker on your host machine.

Project Structure

Here's what the finished project looks like on disk. Every file is covered in the steps that follow.

text
payment-service/
├── app/
│   ├── main.py            # FastAPI app + OTel bootstrap
│   ├── telemetry.py       # All OTel configuration
│   └── routes/
│       └── payments.py    # Business logic with manual spans
├── requirements.txt
├── Dockerfile
├── otel-collector-config.yaml
└── docker-compose.yaml

Step 1 — Install the Packages

OpenTelemetry's Python ecosystem is modular: you install the core SDK plus individual instrumentation libraries for each framework you use. This keeps your dependency tree tight — you only pull in what you need.

requirements.txt
# Core
fastapi==0.111.0
uvicorn==0.30.1
requests==2.32.3
sqlalchemy==2.0.31

# OpenTelemetry SDK + API
opentelemetry-api==1.25.0
opentelemetry-sdk==1.25.0

# OTLP exporter (sends data to the Collector)
opentelemetry-exporter-otlp==1.25.0

# Auto-instrumentation libraries
opentelemetry-instrumentation-fastapi==0.46b0
opentelemetry-instrumentation-requests==0.46b0
opentelemetry-instrumentation-sqlalchemy==0.46b0

Install locally with pip install -r requirements.txt, or let the Dockerfile handle it (shown later).

Step 2 — Configure the Telemetry Module

Centralizing all OTel setup in a single telemetry.py keeps your application code clean. This module configures three providers — one each for traces, metrics, and logs — and wires them to the OTLP exporter.

Resource Attributes Matter

The service.name, service.version, and deployment.environment attributes are attached to every signal (trace, metric, log) your service emits. They're the primary keys backends use to filter and group telemetry, so set them accurately.

app/telemetry.py
import logging

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry._logs import set_logger_provider


def init_telemetry(
    service_name: str = "payment-service",
    service_version: str = "1.0.0",
    environment: str = "development",
    otlp_endpoint: str = "http://otel-collector:4317",
) -> None:
    """Bootstrap tracing, metrics, and logging providers."""

    # --- Shared resource (attached to all signals) ---
    resource = Resource.create({
        SERVICE_NAME: service_name,
        SERVICE_VERSION: service_version,
        "deployment.environment": environment,
    })

    # --- Tracing ---
    tracer_provider = TracerProvider(resource=resource)
    span_exporter = OTLPSpanExporter(endpoint=otlp_endpoint, insecure=True)
    tracer_provider.add_span_processor(BatchSpanProcessor(span_exporter))
    trace.set_tracer_provider(tracer_provider)

    # --- Metrics ---
    metric_reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint=otlp_endpoint, insecure=True),
        export_interval_millis=10_000,  # flush every 10 s
    )
    meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
    metrics.set_meter_provider(meter_provider)

    # --- Logging (bridge Python stdlib logging → OTel) ---
    logger_provider = LoggerProvider(resource=resource)
    logger_provider.add_log_record_processor(
        BatchLogRecordProcessor(
            OTLPLogExporter(endpoint=otlp_endpoint, insecure=True)
        )
    )
    set_logger_provider(logger_provider)
    handler = LoggingHandler(level=logging.INFO, logger_provider=logger_provider)
    logging.getLogger().addHandler(handler)

What each provider does

| Provider | Exporter | Batching Strategy |
| --- | --- | --- |
| TracerProvider | OTLPSpanExporter | BatchSpanProcessor — buffers spans and flushes in bulk (default: 512 spans or 5 s) |
| MeterProvider | OTLPMetricExporter | PeriodicExportingMetricReader — pushes aggregated metrics every 10 s |
| LoggerProvider | OTLPLogExporter | BatchLogRecordProcessor — same batching semantics as traces |

Step 3 — Wire Up Auto-Instrumentation in the FastAPI App

Auto-instrumentation libraries monkey-patch framework internals to create spans and propagate context automatically. Two lines of code give you full request-level tracing for FastAPI and outbound HTTP calls.

app/main.py
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

from app.telemetry import init_telemetry
from app.routes.payments import router as payments_router

# 1. Initialize all three OTel providers BEFORE the app starts
init_telemetry()

# 2. Create the FastAPI app
app = FastAPI(title="Payment Service")
app.include_router(payments_router, prefix="/payments")

# 3. Auto-instrument FastAPI (creates server spans for every request)
FastAPIInstrumentor.instrument_app(app)

# 4. Auto-instrument outbound HTTP via `requests` library
RequestsInstrumentor().instrument()

After these four lines, every inbound request generates a server span with attributes like http.method, http.route, and http.status_code. Outbound requests.get() / requests.post() calls produce client spans with context propagation headers injected automatically.
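To see this in action, send a test request once the stack is running (assuming uvicorn's default port 8000; the order ID is made up):

```shell
curl -X POST http://localhost:8000/payments/process \
  -H 'Content-Type: application/json' \
  -d '{"order_id": "ord-1001", "amount": 49.99}'
```

The resulting trace should show the auto-generated server span with the manual payment spans nested inside it.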

Step 4 — Add Custom Metrics and Manual Spans

Auto-instrumentation captures the HTTP layer, but your business logic lives deeper. Manual spans let you trace domain operations like "process a payment" and attach semantically meaningful attributes. Custom metrics give you counters and histograms tailored to your KPIs.

app/routes/payments.py
import time
import logging
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel
from opentelemetry import trace, metrics

logger = logging.getLogger(__name__)
router = APIRouter()

# --- Tracer & Meter (lazy-resolved via global providers) ---
tracer = trace.get_tracer("payment-service.payments")
meter = metrics.get_meter("payment-service.payments")

# --- Custom metrics ---
payment_counter = meter.create_counter(
    name="payments.processed.total",
    description="Total number of payments processed",
    unit="1",
)
payment_latency = meter.create_histogram(
    name="payments.processing.duration",
    description="Time spent processing a payment",
    unit="ms",
)


class PaymentRequest(BaseModel):
    order_id: str
    amount: float
    currency: str = "USD"
    method: str = "credit_card"


@router.post("/process")
async def process_payment(req: PaymentRequest):
    start = time.perf_counter()

    # --- Manual span for business logic ---
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.order_id", req.order_id)
        span.set_attribute("payment.amount", req.amount)
        span.set_attribute("payment.currency", req.currency)
        span.set_attribute("payment.method", req.method)

        logger.info(
            "Processing payment for order %s, amount %.2f %s",
            req.order_id, req.amount, req.currency,
        )

        # Simulate validation
        with tracer.start_as_current_span("validate_payment"):
            if req.amount <= 0:
                span.set_status(trace.StatusCode.ERROR, "Invalid amount")
                raise HTTPException(status_code=400, detail="Amount must be positive")

        # Simulate gateway call
        with tracer.start_as_current_span("call_payment_gateway") as gw_span:
            gw_span.set_attribute("gateway.provider", "stripe")
            time.sleep(0.05)  # simulate latency
            transaction_id = f"txn_{req.order_id}_001"
            gw_span.set_attribute("gateway.transaction_id", transaction_id)

        span.set_attribute("payment.transaction_id", transaction_id)
        span.set_status(trace.StatusCode.OK)

    # --- Record metrics ---
    elapsed_ms = (time.perf_counter() - start) * 1000
    payment_counter.add(1, {"payment.method": req.method, "payment.currency": req.currency})
    payment_latency.record(elapsed_ms, {"payment.method": req.method})

    logger.info("Payment %s completed in %.1fms", transaction_id, elapsed_ms)

    return {"status": "success", "transaction_id": transaction_id}

Notice how the manual spans nest inside the auto-instrumented FastAPI server span. When you view this in Jaeger, you'll see a clean hierarchy: POST /payments/process → process_payment → validate_payment → call_payment_gateway.
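
Why the nesting works: start_as_current_span stores the active span in the current execution context, so any span started inside the with block picks it up as its parent. Here is a toy, stdlib-only model of that mechanism — not the real SDK, just the shape of it:

```python
import contextvars
from contextlib import contextmanager

_current = contextvars.ContextVar("current_span", default=None)
finished = []

@contextmanager
def start_as_current_span(name):
    """Toy version of the OTel API: the active span lives in a context
    variable, so nested `with` blocks become child spans."""
    span = {"name": name, "parent": (_current.get() or {}).get("name")}
    token = _current.set(span)
    try:
        yield span
    finally:
        _current.reset(token)
        finished.append(span)

with start_as_current_span("POST /payments/process"):       # auto span
    with start_as_current_span("process_payment"):          # manual span
        with start_as_current_span("validate_payment"):
            pass

# Spans finish innermost-first; each one recorded who its parent was.
print([(s["name"], s["parent"]) for s in finished])
```

The real SDK uses the same context-variable trick (plus cross-thread and async propagation), which is why manual spans slot under the auto-instrumented server span with no extra plumbing.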

Step 5 — Configure the OTel Collector

The Collector sits between your application and your backends. It receives OTLP data, can transform it (add attributes, sample, filter), and exports to one or more destinations. This config is minimal on purpose — it receives OTLP on gRPC :4317 and forwards traces to Jaeger and logs/metrics to the debug exporter (stdout).

otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
    send_batch_size: 512

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]

Step 6 — Docker Compose for Local Testing

This Compose file brings up three services: your payment app, the OTel Collector, and Jaeger (which natively accepts OTLP). After docker compose up, you can hit the API on port 8000 and view traces at http://localhost:16686.

Dockerfile
FROM python:3.12-slim
WORKDIR /srv
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ app/
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
docker-compose.yaml
version: "3.9"

services:
  payment-service:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
    depends_on:
      - otel-collector

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.104.0
    command: ["--config", "/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol/config.yaml:ro
    ports:
      - "4317:4317"   # OTLP gRPC
    depends_on:
      - jaeger

  jaeger:
    image: jaegertracing/all-in-one:1.58
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686" # Jaeger UI
      - "4317"        # OTLP gRPC (internal)

Step 7 — Run It and Send a Request

Bring everything up, fire a test payment, and verify the trace in Jaeger.

bash
# Start all services
docker compose up -d --build

# Wait a few seconds for startup, then send a test payment
curl -s -X POST http://localhost:8000/payments/process \
  -H "Content-Type: application/json" \
  -d '{"order_id": "ord_42", "amount": 99.95, "currency": "USD", "method": "credit_card"}'

# Expected response:
# {"status":"success","transaction_id":"txn_ord_42_001"}

# Open Jaeger UI
open http://localhost:16686

In the Jaeger UI, select payment-service from the service dropdown and click Find Traces. You should see a trace with four spans in a parent-child hierarchy.

Tip

Click on the process_payment span in Jaeger and expand its tags. You'll see every custom attribute you set — payment.order_id, payment.amount, gateway.transaction_id, etc. These attributes are what make traces actually useful for debugging production issues.

Key Takeaways

| Concept | What You Did | Why It Matters |
| --- | --- | --- |
| Resource attributes | Set service.name, service.version, deployment.environment | Every signal is tagged, making filtering trivial in backends |
| Auto-instrumentation | Two calls: FastAPIInstrumentor + RequestsInstrumentor | Full request-level tracing with zero business code changes |
| Manual spans | Wrapped process_payment and sub-steps in start_as_current_span | Visibility into domain logic, not just HTTP plumbing |
| Custom metrics | Created a counter and histogram for payments | Business KPIs (throughput, latency) available for dashboards and alerts |
| Log bridging | Attached LoggingHandler to Python's root logger | Logs are correlated with trace IDs automatically |
| Collector as gateway | Routed all signals through the OTel Collector | Decouples your app from backend choice — swap Jaeger for Tempo without code changes |

Hands-On: Instrumenting Node.js and Go Services

A single-language trace is useful, but real-world systems are polyglot. In this section you'll add a Node.js service and a Go service to the existing Python microservice, wire them all up with OpenTelemetry, and watch a single distributed trace flow across three runtimes. The call chain is Python → Node.js → Go, with trace context propagating automatically via HTTP headers.

Project Structure

After this section your project will look like this:

bash
microservices-demo/
├── python-service/      # Existing Flask service (entry point)
├── node-service/        # New — Express service
│   ├── tracing.js
│   ├── app.js
│   ├── package.json
│   └── Dockerfile
├── go-service/          # New — net/http service
│   ├── main.go
│   ├── go.mod
│   └── Dockerfile
└── docker-compose.yaml  # Updated with all three services

Node.js Service

The Node.js service receives requests from Python, performs some business logic, then calls the Go service downstream. OpenTelemetry's @opentelemetry/sdk-node package provides a single NodeSDK class that wires up tracing, and the auto-instrumentation package patches Express, HTTP, and other libraries automatically.

Install Dependencies

bash
cd node-service
npm init -y
npm install express axios
npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/api

Configure the SDK — tracing.js

This file must be loaded before your application code so that auto-instrumentation can monkey-patch modules at require time. The NodeSDK handles TracerProvider setup, context propagation (W3C TraceContext by default), and batching.

javascript
// tracing.js — load with: node --require ./tracing.js app.js
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");

const sdk = new NodeSDK({
  serviceName: "node-service",
  traceExporter: new OTLPTraceExporter({
    // OTEL_EXPORTER_OTLP_ENDPOINT holds the base endpoint; the `url`
    // option is used verbatim, so append the /v1/traces signal path.
    url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://otel-collector:4318"}/v1/traces`,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

process.on("SIGTERM", () => {
  sdk.shutdown().then(() => process.exit(0));
});

Application Code with Custom Spans — app.js

Auto-instrumentation gives you HTTP and Express spans for free. For business-specific operations you create manual spans using the @opentelemetry/api tracer. The key insight: because you loaded tracing.js first, the SDK is already running and axios calls automatically inject the traceparent header into outgoing requests.

javascript
// app.js
const express = require("express");
const axios = require("axios");
const { trace } = require("@opentelemetry/api");

const app = express();
const tracer = trace.getTracer("node-service");

const GO_SERVICE_URL = process.env.GO_SERVICE_URL || "http://go-service:8082";

app.get("/process", async (req, res) => {
  // Custom span for business logic
  const result = await tracer.startActiveSpan("validate-order", async (span) => {
    span.setAttribute("order.source", "web");
    span.setAttribute("order.priority", "high");

    // Simulate validation work
    const isValid = Math.random() > 0.1;
    span.setAttribute("order.valid", isValid);

    if (!isValid) {
      // 2 === SpanStatusCode.ERROR from @opentelemetry/api
      span.setStatus({ code: 2, message: "Validation failed" });
      span.end();
      return { valid: false };
    }

    // Call downstream Go service — trace context propagates automatically
    const goResponse = await axios.get(`${GO_SERVICE_URL}/finalize`);
    span.end();
    return { valid: true, downstream: goResponse.data };
  });

  res.json({ service: "node-service", result });
});

app.listen(3000, () => console.log("Node service listening on :3000"));

Dockerfile

docker
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
CMD ["node", "--require", "./tracing.js", "app.js"]
Why --require ./tracing.js?

Auto-instrumentation works by wrapping module imports at load time. If you import Express before the SDK initializes, those HTTP handlers won't be patched. The --require flag guarantees tracing.js runs first, before any application module loads.

Go Service

The Go service sits at the end of the call chain. It receives requests from Node.js, extracts trace context from the incoming traceparent header, runs some business logic, and exports the resulting spans. Go's OpenTelemetry SDK is explicit — you set up each component yourself, which gives you precise control over batching, exporters, and resource attributes.

Initialize the Module and Install Packages

bash
cd go-service
go mod init go-service

go get go.opentelemetry.io/otel \
  go.opentelemetry.io/otel/sdk \
  go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc \
  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp \
  go.opentelemetry.io/otel/trace \
  go.opentelemetry.io/otel/attribute

Full Service Code — main.go

Unlike Node.js, where a single NodeSDK call configures everything, Go requires you to create the exporter, build a TracerProvider, register it globally, and set a text-map propagator so incoming traceparent headers are honored. The otelhttp middleware then wraps your HTTP handler to extract incoming trace context and create server spans automatically.

go
package main

import (
    "context"
    "encoding/json"
    "log"
    "math/rand"
    "net/http"
    "os"
    "time"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
    "go.opentelemetry.io/otel/trace"
)

var tracer trace.Tracer

func initTracer() func() {
    endpoint := os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
    if endpoint == "" {
        endpoint = "otel-collector:4317"
    }

    ctx := context.Background()
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint(endpoint),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        log.Fatalf("failed to create exporter: %v", err)
    }

    res := resource.NewWithAttributes(
        semconv.SchemaURL,
        semconv.ServiceName("go-service"),
        semconv.ServiceVersion("1.0.0"),
    )

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
    )
    otel.SetTracerProvider(tp)
    // Register the W3C TraceContext propagator globally — without this,
    // otelhttp cannot extract the incoming traceparent header and the
    // Go spans would start a new, disconnected trace.
    otel.SetTextMapPropagator(propagation.TraceContext{})
    tracer = tp.Tracer("go-service")

    return func() {
        _ = tp.Shutdown(ctx)
    }
}

func finalizeHandler(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()

    // Manual span for business logic
    _, span := tracer.Start(ctx, "compute-final-price",
        trace.WithAttributes(
            attribute.String("pricing.currency", "USD"),
            attribute.Float64("pricing.base_amount", 99.95),
        ),
    )
    // Simulate computation
    time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond)

    discount := rand.Float64() * 20
    finalPrice := 99.95 - discount
    span.SetAttributes(
        attribute.Float64("pricing.discount", discount),
        attribute.Float64("pricing.final_price", finalPrice),
    )
    span.End()

    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(map[string]interface{}{
        "service":     "go-service",
        "final_price": finalPrice,
    })
}

func main() {
    shutdown := initTracer()
    defer shutdown()

    mux := http.NewServeMux()
    // otelhttp.NewHandler wraps the handler with automatic span creation
    mux.Handle("/finalize", otelhttp.NewHandler(
        http.HandlerFunc(finalizeHandler), "GET /finalize",
    ))

    log.Println("Go service listening on :8082")
    log.Fatal(http.ListenAndServe(":8082", mux))
}

Dockerfile

docker
FROM golang:1.22-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /go-service .

FROM alpine:3.19
COPY --from=builder /go-service /go-service
EXPOSE 8082
CMD ["/go-service"]

Updated Python Service

The existing Python service acts as the entry point. It needs one change: after doing its own work, it calls the Node.js service. With OpenTelemetry's requests auto-instrumentation already in place, the outgoing HTTP call automatically carries the traceparent header.

python
# python-service/app.py (updated route)
import os, requests
from flask import Flask, jsonify
from opentelemetry import trace

app = Flask(__name__)
tracer = trace.get_tracer("python-service")

NODE_SERVICE_URL = os.environ.get("NODE_SERVICE_URL", "http://node-service:3000")

@app.route("/order")
def create_order():
    with tracer.start_as_current_span("receive-order") as span:
        span.set_attribute("order.id", "ORD-2024-001")
        span.set_attribute("order.items_count", 3)

        # Call Node.js service — traceparent header injected automatically
        response = requests.get(f"{NODE_SERVICE_URL}/process")
        node_result = response.json()

        return jsonify({
            "service": "python-service",
            "order_id": "ORD-2024-001",
            "downstream": node_result,
        })

Updated docker-compose.yaml

This Compose file brings up all five components: the three application services, the OpenTelemetry Collector, and Jaeger for visualization. The depends_on entries ensure the Collector starts before any application service tries to export spans.

yaml
version: "3.9"

services:
  # ── Observability Infrastructure ──
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP

  jaeger:
    image: jaegertracing/all-in-one:1.54
    ports:
      - "16686:16686" # Jaeger UI
      - "14250:14250" # gRPC from Collector
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  # ── Application Services ──
  python-service:
    build: ./python-service
    ports:
      - "8080:8080"
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
      - NODE_SERVICE_URL=http://node-service:3000
    depends_on:
      - otel-collector
      - node-service

  node-service:
    build: ./node-service
    ports:
      - "3000:3000"
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
      - GO_SERVICE_URL=http://go-service:8082
    depends_on:
      - otel-collector
      - go-service

  go-service:
    build: ./go-service
    ports:
      - "8082:8082"
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=otel-collector:4317
    depends_on:
      - otel-collector
gRPC vs HTTP Endpoint Formats

Notice the Go service uses otel-collector:4317 (gRPC, no scheme prefix) while the Node.js and Python services use http://otel-collector:4318 (HTTP with the full URL). The Go OTLP gRPC exporter expects a bare host:port, not a URL. Mixing these up is the most common "spans aren't showing up" bug.
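
A small stdlib sketch of that normalization logic — the helper name is made up (neither SDK ships it), but it captures the two endpoint shapes:

```python
from urllib.parse import urlparse

def normalize_otlp_endpoint(value, protocol):
    """Illustrative helper (not part of any SDK): coerce an endpoint
    setting into the shape each exporter family expects."""
    if protocol == "grpc":
        # gRPC exporters want a bare host:port — strip any scheme.
        parsed = urlparse(value if "//" in value else "//" + value)
        return f"{parsed.hostname}:{parsed.port or 4317}"
    if protocol == "http":
        # HTTP exporters want a full URL including the signal path.
        base = value if value.startswith("http") else "http://" + value
        return base.rstrip("/") + "/v1/traces"
    raise ValueError(f"unknown protocol: {protocol}")

print(normalize_otlp_endpoint("http://otel-collector:4317", "grpc"))
# -> otel-collector:4317
print(normalize_otlp_endpoint("otel-collector:4318", "http"))
# -> http://otel-collector:4318/v1/traces
```
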

How Trace Context Propagates

The magic that connects spans across three different runtimes is W3C Trace Context propagation. Here's exactly what happens on each hop:

| Hop | What Happens | Header Sent |
| --- | --- | --- |
| Client → Python | Python creates a root span. No incoming traceparent, so a new trace ID is generated. | — |
| Python → Node.js | requests instrumentation injects the current span's context into outgoing headers. | traceparent: 00-{traceId}-{spanId}-01 |
| Node.js → Go | axios instrumentation does the same — injects context from the active Node span. | traceparent: 00-{traceId}-{spanId}-01 |
| Go receives | otelhttp middleware extracts traceparent, creates a child span under the same trace ID. | — |

The critical point: the trace ID stays the same across all three services. Each service generates its own span IDs, but they all reference the same parent trace. This is how Jaeger can stitch them into a single waterfall view.
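
The traceparent value itself is four dash-separated hex fields: a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and trace flags. A stdlib sketch of a parser for it (the W3C spec treats all-zero IDs as invalid):

```python
import re

TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Parse a W3C traceparent header into its four fields, returning
    None for malformed headers or all-zero (invalid) IDs."""
    m = TRACEPARENT.match(header)
    if not m:
        return None
    fields = m.groupdict()
    if fields["trace_id"] == "0" * 32 or fields["span_id"] == "0" * 16:
        return None
    return fields

hop = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(hop["trace_id"])               # identical on every hop of the trace
print(int(hop["flags"], 16) & 0x01)  # sampled flag -> 1
```

Each service forwards the same trace_id, swaps in its own span_id as the new parent, and the backend reassembles the tree from those parent links.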

Run and Verify

  1. Start all services
    bash
    docker compose up --build -d
  2. Send a request through the full chain
    bash
    curl http://localhost:8080/order | jq .

    You should see a nested JSON response with data from all three services:

    json
    {
      "service": "python-service",
      "order_id": "ORD-2024-001",
      "downstream": {
        "service": "node-service",
        "result": {
          "valid": true,
          "downstream": {
            "service": "go-service",
            "final_price": 87.32
          }
        }
      }
    }
  3. View the distributed trace in Jaeger

    Open http://localhost:16686 in your browser. Select python-service from the Service dropdown and click Find Traces. You'll see a single trace with spans from all three services.

Reading the Jaeger Trace

The trace you see in Jaeger will contain approximately six spans arranged in a parent-child waterfall. Here's what each span represents and which service produced it:

bash
Trace: python-service GET /order                                       total: ~120ms

├─ python-service: GET /order                      ████████████████████████████████  120ms
│  ├─ python-service: receive-order                ██████████████████████████████    115ms
│  │  ├─ node-service: GET /process                ████████████████████████          100ms
│  │  │  ├─ node-service: validate-order           █████████████████████             85ms
│  │  │  │  ├─ go-service: GET /finalize           ██████████████                    60ms
│  │  │  │  │  └─ go-service: compute-final-price  ████████                          40ms

Each indentation level represents a parent-child relationship. The python-service root span encompasses the entire request lifecycle. Inside it, you can see the hand-off from Python to Node.js to Go, with both auto-generated spans (like GET /order) and your custom business spans (like validate-order and compute-final-price). The custom spans carry the attributes you set — click any span in Jaeger to inspect order.valid, pricing.final_price, and other attributes.

Debugging Missing Spans

If you only see spans from one or two services, check two things: (1) the Collector endpoint format matches the protocol (gRPC on 4317, HTTP on 4318), and (2) the services can resolve each other's hostnames on the Docker network. Run docker compose logs otel-collector to see if spans are arriving at the Collector.
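
For the hostname-and-port half of that checklist, a quick stdlib probe confirms the Collector's OTLP listeners are reachable before you start digging into exporter configs. This is an illustrative helper, not an OTel tool:

```python
import socket

def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds — a quick
    way to verify an OTLP listener is reachable at all."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With the Compose stack up, you would expect both to succeed:
# can_connect("localhost", 4317)  # OTLP gRPC listener
# can_connect("localhost", 4318)  # OTLP HTTP listener
```

A refused connection points at networking (wrong hostname, port not published); a successful connection with missing spans points at the protocol/endpoint-format mismatch described above.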

The OTel Collector: Architecture and Deployment Models

The OpenTelemetry Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data. It sits between your instrumented applications and your observability backends, decoupling the two so you can change destinations without touching application code. Think of it as a telemetry router with built-in transformation capabilities.

The Collector is a single binary that you configure entirely through YAML. Every configuration defines one or more pipelines, and each pipeline is composed of three types of components: Receivers, Processors, and Exporters.

graph LR
    subgraph Receivers
        R1["OTLP\n(gRPC/HTTP)"]
        R2["Prometheus\n(scrape)"]
        R3["Filelog\n(log files)"]
    end

    subgraph Processors
        P1["memory_limiter"]
        P2["batch"]
        P3["attributes"]
    end

    subgraph Exporters
        E1["OTLP → Tempo"]
        E2["Prometheus\nRemote Write"]
        E3["Loki\n(logs)"]
    end

    R1 --> P1
    R2 --> P1
    R3 --> P1
    P1 --> P2
    P2 --> P3
    P3 --> E1
    P3 --> E2
    P3 --> E3

    subgraph Deployment Topology
        direction TB
        APP["App + SDK"] -->|"OTLP"| AGENT["Agent\n(DaemonSet/Sidecar)"]
        AGENT -->|"OTLP"| GW["Gateway\n(centralized)"]
        GW --> BACKEND["Backend\n(Tempo, Prom, Loki)"]
    end
    

Pipeline Architecture: Receivers → Processors → Exporters

A Collector pipeline has a strict data flow: Receivers ingest data, Processors transform it in order, and Exporters send it out. Each pipeline handles one signal type — traces, metrics, or logs. A single Collector instance typically runs multiple pipelines simultaneously.

Receivers — Data Ingestion

Receivers are how data gets into the Collector. They can be either push-based (the Collector listens for incoming data) or pull-based (the Collector actively scrapes a target). You can configure multiple receivers per pipeline, and the same receiver can feed into multiple pipelines.

| Receiver | Type | Signal | Description |
| --- | --- | --- | --- |
| otlp | Push | Traces, Metrics, Logs | Native OTel protocol over gRPC (4317) and HTTP (4318) |
| prometheus | Pull | Metrics | Scrapes Prometheus-format endpoints |
| jaeger | Push | Traces | Accepts Jaeger Thrift and gRPC formats |
| zipkin | Push | Traces | Accepts Zipkin JSON v2 spans |
| filelog | Pull | Logs | Tails log files with configurable parsing |
| hostmetrics | Pull | Metrics | Collects CPU, memory, disk, and network metrics from the host |

Processors — Transformation Pipeline

Processors run sequentially in the order you define them in the configuration. The order matters — data flows through processor A, then B, then C. This is where you shape, filter, sample, and enrich your telemetry before it leaves the Collector.

| Processor | Purpose | Why You Need It |
| --- | --- | --- |
| memory_limiter | Backpressure | Prevents the Collector from running out of memory under load |
| batch | Batching | Groups data into batches for more efficient export (reduces network calls) |
| attributes | Enrichment | Adds, updates, or deletes resource/span/metric attributes |
| filter | Filtering | Drops telemetry that matches conditions (e.g., health-check spans) |
| tail_sampling | Sampling | Makes sampling decisions after seeing the full trace (requires Gateway) |
| transform | OTTL transforms | Applies arbitrary transformations using the OpenTelemetry Transformation Language |
Processor ordering matters

Always place memory_limiter first in the processor chain — it needs to be able to reject data before other processors allocate memory for it. A typical order is: memory_limiter → filter → attributes / transform → batch (batch last, so it batches the final transformed data).
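
The ordering rule can be sketched as a toy pipeline where each processor is just a function applied in list order. This is illustrative only — real processors are Go components — but it shows why the limiter must come first and batch-like steps last:

```python
def memory_limiter(batch, limit=3):
    # Reject the whole batch early if it would exceed the "memory" budget,
    # before downstream processors do any work on it.
    return batch if len(batch) <= limit else []

def filter_health_checks(batch):
    return [s for s in batch if s["route"] != "/healthz"]

def add_env_attribute(batch):
    return [{**s, "environment": "production"} for s in batch]

def run_pipeline(batch, processors):
    """Processors run strictly in list order, like a Collector pipeline."""
    for proc in processors:
        batch = proc(batch)
    return batch

spans = [{"route": "/payments/process"}, {"route": "/healthz"}]
out = run_pipeline(spans, [memory_limiter, filter_health_checks, add_env_attribute])
print([s["route"] for s in out])  # -> ['/payments/process']
```

Swap the order and the guarantees change: a limiter placed after an enrichment step would only reject data that has already been copied and annotated.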

Exporters — Data Output

Exporters send data to one or more backends. Like receivers, you can attach multiple exporters to a single pipeline — the Collector fans out copies of the data to each one. This means you can send traces to Tempo and Jaeger simultaneously from the same pipeline.

Common exporters include otlp (for OTLP-native backends like Tempo or Grafana Cloud), prometheusremotewrite (for Prometheus/Mimir/Thanos), loki (for log aggregation), and vendor-specific exporters like datadog, splunk_hec, and awsxray.

Connectors — Joining Pipelines

Connectors are a newer component type that acts as both an exporter for one pipeline and a receiver for another. They bridge signal types — for example, the spanmetrics connector reads trace data and produces metrics like request rate, error rate, and duration histograms (RED metrics) without you needing a separate tool.
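
What spanmetrics derives can be sketched in a few lines over a batch of finished spans. This is a simplified model — the real connector emits proper OTel metric streams per dimension, not a dict:

```python
def red_metrics(spans, window_seconds):
    """Compute RED (Rate, Errors, Duration) aggregates from finished
    spans — a sketch of what the spanmetrics connector derives."""
    total = len(spans)
    errors = sum(1 for s in spans if s["status"] == "ERROR")
    durations = sorted(s["duration_ms"] for s in spans)
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "p50_ms": durations[total // 2],
    }

spans = [
    {"status": "OK", "duration_ms": 40},
    {"status": "OK", "duration_ms": 60},
    {"status": "ERROR", "duration_ms": 120},
    {"status": "OK", "duration_ms": 80},
]
print(red_metrics(spans, window_seconds=2))
```
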

Here is a complete Collector configuration that demonstrates receivers, processors, exporters, connectors, and pipeline wiring:

yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  filelog:
    include: [/var/log/app/*.log]
    operators:
      - type: json_parser

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    send_batch_size: 1024
    timeout: 5s
  attributes:
    actions:
      - key: environment
        value: production
        action: upsert

exporters:
  otlp/tempo:
    endpoint: tempo.monitoring:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://mimir.monitoring:9009/api/v1/push
  loki:
    endpoint: http://loki.monitoring:3100/loki/api/v1/push

connectors:
  spanmetrics:
    dimensions:
      - name: http.method
      - name: http.status_code

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo, spanmetrics]
    metrics:
      receivers: [otlp, spanmetrics]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter, attributes, batch]
      exporters: [loki]

Deployment Models

How you deploy the Collector has a significant impact on reliability, latency, and what processing you can do. There are four common patterns, ranging from "no Collector at all" to a multi-tier architecture.

1. No Collector (Direct SDK Export)

Your application's OTel SDK exports telemetry directly to the backend. This is the simplest setup — no extra infrastructure to manage. However, it tightly couples your app to the backend, offers no local buffering, and means every SDK change requires an application redeploy. Suitable for local development and small prototypes only.

2. Agent Mode (Sidecar / DaemonSet)

A Collector instance runs alongside your application — either as a Kubernetes sidecar container in the same pod, or as a DaemonSet with one Collector per node. The SDK exports to localhost, which is fast and reliable. The Agent handles retry logic, batching, and basic enrichment, keeping the SDK configuration minimal.

3. Gateway Mode (Centralized)

A pool of Collector instances runs as a standalone, horizontally-scaled deployment behind a load balancer. All telemetry routes through this central tier. The Gateway is the only place you can run tail_sampling (which needs to see all spans of a trace) and cross-service aggregation. It is also the single point to manage export credentials for your backends.
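
A tail-sampling policy only makes sense once every span of a trace has arrived, which is why it lives in the Gateway tier. A toy sketch of such a decision — field names are hypothetical, and real policies are Collector config, not Python:

```python
def keep_trace(spans, latency_threshold_ms=500):
    """Tail-sampling decision over a *complete* trace: keep every
    error trace and every slow trace, drop the unremarkable rest."""
    if any(s["status"] == "ERROR" for s in spans):
        return True
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return total_ms > latency_threshold_ms

fast_ok = [{"status": "OK", "start_ms": 0, "end_ms": 120}]
slow_ok = [{"status": "OK", "start_ms": 0, "end_ms": 900}]
print(keep_trace(fast_ok), keep_trace(slow_ok))  # -> False True
```

An Agent can never make this call: it sees only the spans produced on its own node, so it cannot know whether some other service's span in the same trace errored or blew the latency budget.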

4. Agent + Gateway (Recommended)

The production-grade pattern combines both tiers. Agents run locally for fast buffering and basic processing. They forward to a Gateway tier for advanced operations like tail sampling, attribute enrichment with external data, and routing to multiple backends. This gives you local resilience and centralized control.

ModelBufferingTail SamplingComplexityBest For
No CollectorNoneNoMinimalLocal dev, prototypes
Agent onlyLocalNoLowSmall clusters, simple backends
Gateway onlyCentralYesMediumCentralized control without local agents
Agent + GatewayBothYesHigherProduction workloads at scale

Distributions: Core vs Contrib

The Collector ships in two official distributions, and the difference matters when you plan your component usage.

otelcol (Core) includes only the most stable, well-tested components: the OTLP receiver/exporter, the batch and memory_limiter processors, and a handful of others. It is small, secure, and suitable when you only need OTLP-to-OTLP forwarding.

otelcol-contrib (Contrib) bundles hundreds of community-contributed receivers, processors, and exporters — including filelog, hostmetrics, prometheusremotewrite, loki, tail_sampling, and vendor-specific exporters. Most production deployments use Contrib because they need at least one component that is not in Core.

Contrib is large — consider building a custom Collector

The Contrib binary includes every contributed component, resulting in a large binary (over 200 MB) with a broad attack surface. If you only need five components, you are shipping hundreds of unused ones. For production, build a custom Collector with only the components you actually use.

Building a Custom Collector with OCB

The OpenTelemetry Collector Builder (OCB) lets you create a custom Collector binary that includes exactly the receivers, processors, exporters, and connectors you need — nothing more. You define a manifest YAML file listing your components and their versions, then OCB generates the Go source code and compiles it into a single binary.

yaml
# otel-builder-manifest.yaml
dist:
  name: my-otelcol
  description: Custom Collector for our platform
  output_path: ./build
  otelcol_version: 0.104.0

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.104.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/filelogreceiver v0.104.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/hostmetricsreceiver v0.104.0

processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.104.0
  - gomod: go.opentelemetry.io/collector/processor/memorylimiterprocessor v0.104.0

exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.104.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter v0.104.0

connectors:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/connector/spanmetricsconnector v0.104.0

Then build and run your custom Collector:

bash
# Install OCB
go install go.opentelemetry.io/collector/cmd/builder@latest

# Build the custom Collector from the manifest
builder --config=otel-builder-manifest.yaml

# Run it with your Collector config
./build/my-otelcol --config=otel-config.yaml
Pin your OCB versions

All component versions in your manifest must match the otelcol_version. Version mismatches between core and contrib modules are the number one cause of OCB build failures. Pin everything to the same release tag and update them together.

Collector Configuration: Pipelines, Processors, and Recipes

The OpenTelemetry Collector is configured via a single YAML file that declares what data comes in, how it gets transformed, and where it goes out. Every Collector deployment — sidecar, agent, or gateway — shares the same configuration model. Mastering this file is the single most important skill for running OTel in production.

Top-Level Configuration Structure

A Collector config has six top-level keys. Each one defines a pool of named components that you wire together in the service.pipelines section.

| Key | Purpose | Example Components |
| --- | --- | --- |
| receivers | Ingest data from external sources (push or pull) | otlp, prometheus, filelog, hostmetrics |
| processors | Transform, filter, batch, or sample data in-flight | batch, attributes, tail_sampling, filter |
| exporters | Send data to backends or downstream collectors | otlp, otlphttp, prometheusremotewrite, loki |
| connectors | Bridge one pipeline's output to another pipeline's input | spanmetrics, routing, forward |
| extensions | Auxiliary services (health checks, auth, storage) | health_check, pprof, zpages |
| service | Declares active pipelines and enabled extensions | — |

Here is the skeleton structure that every config follows. Defining a component under receivers: alone does nothing — you must reference it in a pipeline under service: for it to run.

yaml
receivers:
  # Named receiver instances go here

processors:
  # Named processor instances go here

exporters:
  # Named exporter instances go here

connectors:
  # Named connector instances (optional)

extensions:
  # Named extension instances (optional)

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers:  [otlp]
      processors: [batch]
      exporters:  [otlp/backend]
    metrics:
      receivers:  [otlp, prometheus]
      processors: [batch]
      exporters:  [prometheusremotewrite]
    logs:
      receivers:  [filelog]
      processors: [attributes, batch]
      exporters:  [otlphttp]

Naming Convention

You can create multiple instances of the same component type using a type/name suffix — for example, otlp/frontend and otlp/backend. The part after the slash is an arbitrary label you choose. This is how you target different endpoints with the same exporter type.

Environment Variable Substitution

Never hard-code secrets or environment-specific values. The Collector natively supports ${ENV_VAR} syntax anywhere in the YAML, with optional defaults via ${ENV_VAR:-fallback}. This makes the same config portable across dev, staging, and production.

yaml
exporters:
  otlp/backend:
    endpoint: ${OTEL_BACKEND_ENDPOINT:-localhost:4317}
    headers:
      authorization: "Bearer ${API_TOKEN}"
    tls:
      insecure: ${TLS_INSECURE:-false}

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ${OTLP_GRPC_LISTEN:-0.0.0.0:4317}
      http:
        endpoint: ${OTLP_HTTP_LISTEN:-0.0.0.0:4318}
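
The substitution rules can be modeled in a few lines of Python. This is a simplified sketch covering only the two forms shown above — the Collector also supports other syntaxes (such as `${env:VAR}`), which this toy model ignores:

```python
import re

# Simplified model of ${VAR} / ${VAR:-fallback} expansion.
PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def expand(text, env):
    def repl(match):
        name, fallback = match.group(1), match.group(2)
        if name in env:
            return env[name]                 # environment value wins
        return fallback if fallback is not None else ""
    return PATTERN.sub(repl, text)

# Unset variable with a fallback -> fallback is used
print(expand("endpoint: ${OTEL_BACKEND_ENDPOINT:-localhost:4317}", env={}))
# -> endpoint: localhost:4317

# Set variable -> environment value is used
print(expand("authorization: Bearer ${API_TOKEN}", env={"API_TOKEN": "abc123"}))
# -> authorization: Bearer abc123
```

This is why the same config file ships unchanged to every environment: only the surrounding environment variables differ.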

Recipe 1: Basic Tracing Pipeline — OTLP to Jaeger/Tempo

This is the starting point for most tracing setups. Applications send spans via OTLP (gRPC or HTTP), the Collector batches them for efficiency, and forwards them to a Jaeger or Grafana Tempo backend. The batch processor is critical — without it, each span triggers a separate network request to your backend.

yaml
# recipe-1-basic-tracing.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 1024       # Flush after 1024 spans
    timeout: 5s                 # ...or after 5 seconds, whichever comes first
    send_batch_max_size: 2048   # Hard cap per batch

exporters:
  otlp/tempo:
    endpoint: ${TEMPO_ENDPOINT:-tempo.observability.svc:4317}
    tls:
      insecure: true            # Set false in production with real certs

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers:  [otlp]
      processors: [batch]
      exporters:  [otlp/tempo]

The batch processor works on two triggers: a size threshold (send_batch_size) and a time deadline (timeout). In low-traffic environments, the timeout ensures spans aren't held indefinitely. In high-traffic bursts, the size threshold caps memory usage.
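
The interplay of the two triggers can be made concrete with a toy model — a hypothetical class for illustration, not the Collector's actual implementation:

```python
# Toy model of the batch processor's two flush triggers:
# a size threshold (send_batch_size) and a time deadline (timeout).
class BatchSimulator:
    def __init__(self, send_batch_size=1024, timeout_s=5.0):
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.buffer = []
        self.last_flush = 0.0
        self.flushed_batches = []

    def add(self, span, now):
        self.buffer.append(span)
        if len(self.buffer) >= self.send_batch_size:   # trigger 1: size
            self._flush(now)

    def tick(self, now):
        # trigger 2: deadline, checked periodically by the processor's timer
        if self.buffer and now - self.last_flush >= self.timeout_s:
            self._flush(now)

    def _flush(self, now):
        self.flushed_batches.append(self.buffer)
        self.buffer = []
        self.last_flush = now

b = BatchSimulator(send_batch_size=3, timeout_s=5.0)
for span in ["a", "b", "c"]:
    b.add(span, now=0.0)       # size trigger fires on the third span
b.add("d", now=1.0)            # sits in the buffer...
b.tick(now=10.0)               # ...until the deadline flushes it
print([len(batch) for batch in b.flushed_batches])  # [3, 1]
```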

Recipe 2: Prometheus Scraping and Remote Write

You can use the Collector as a drop-in replacement for Prometheus' scrape loop. The prometheus receiver accepts standard Prometheus scrape_configs syntax, meaning you can port your existing prometheus.yml jobs directly. The Collector then converts scraped metrics into OTLP format internally and exports them via prometheusremotewrite to any compatible backend (Cortex, Mimir, Thanos, etc.).

yaml
# recipe-2-prometheus-scrape.yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "kubernetes-pods"
          scrape_interval: 30s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
              action: replace
              target_label: __address__
              regex: (.+)
              replacement: $$1

        - job_name: "node-exporter"
          scrape_interval: 15s
          static_configs:
            - targets: ["node-exporter:9100"]

processors:
  batch:
    timeout: 10s

exporters:
  prometheusremotewrite:
    endpoint: ${MIMIR_ENDPOINT:-http://mimir.observability.svc:9009/api/v1/push}
    headers:
      X-Scope-OrgID: ${TENANT_ID:-default}
    resource_to_telemetry_conversion:
      enabled: true             # Promotes OTel resource attributes to metric labels

service:
  pipelines:
    metrics:
      receivers:  [prometheus]
      processors: [batch]
      exporters:  [prometheusremotewrite]

The resource_to_telemetry_conversion setting is important. Without it, OTel resource attributes (like service.name) are dropped during the conversion to Prometheus format. Enabling it promotes them to metric labels so your Grafana dashboards can filter by service.
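
The effect of the conversion can be sketched as follows — a simplified model in which resource attributes are copied onto each data point as labels, with dots mapped to underscores (the real exporter performs fuller Prometheus name sanitization):

```python
# Simplified sketch of resource_to_telemetry_conversion.
def promote_resource_attributes(resource_attrs, datapoint_labels):
    labels = dict(datapoint_labels)
    for key, value in resource_attrs.items():
        # Resource attribute becomes a metric label; dots are not legal
        # in Prometheus label names, so they become underscores.
        labels.setdefault(key.replace(".", "_"), value)
    return labels

labels = promote_resource_attributes(
    {"service.name": "payment-api", "service.version": "2.4.1"},
    {"method": "GET", "status": "200"},
)
print(labels)
# {'method': 'GET', 'status': '200',
#  'service_name': 'payment-api', 'service_version': '2.4.1'}
```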

Recipe 3: Log Collection with Multiline Parsing

The filelog receiver tails log files from disk — useful for collecting application logs in Kubernetes (from /var/log/pods) or VMs. The real power is in its operator pipeline: you can parse multiline stack traces, extract structured fields, and enrich logs before they leave the node. This recipe collects Java-style logs with multiline exceptions and ships them to Loki.

yaml
# recipe-3-log-collection.yaml
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    exclude:
      - /var/log/pods/*/otel-collector/*.log   # Don't collect our own logs
    start_at: end                               # Don't re-read historical logs
    include_file_path: true                     # Add log.file.path attribute
    multiline:
      # Fold multiline Java stack traces into a single log entry: a new
      # record starts at a timestamp line; other lines are appended
      line_start_pattern: '^\d{4}-\d{2}-\d{2}'
    operators:
      # Parse timestamp, severity, thread, logger, and message from the line
      - type: regex_parser
        regex: '^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+(?P<severity>\w+)\s+\[(?P<thread>[^\]]+)\]\s+(?P<logger>\S+)\s+-\s+(?P<message>.*)'
        timestamp:
          parse_from: attributes.timestamp
          layout: "%Y-%m-%d %H:%M:%S.%L"
        severity:
          parse_from: attributes.severity

processors:
  attributes/logs:
    actions:
      - key: environment
        value: ${DEPLOY_ENV:-production}
        action: upsert
      - key: cluster
        value: ${CLUSTER_NAME}
        action: upsert
      - key: thread
        action: delete                          # Clean up parsed temp attributes

  batch:
    timeout: 5s
    send_batch_size: 512

exporters:
  loki:
    endpoint: ${LOKI_ENDPOINT:-http://loki.observability.svc:3100/loki/api/v1/push}
    default_labels_enabled:
      exporter: false
      job: true

service:
  pipelines:
    logs:
      receivers:  [filelog]
      processors: [attributes/logs, batch]
      exporters:  [loki]

Multiline handling is the key to sane log collection. Without it, each line of a Java stack trace becomes a separate log entry, making debugging impossible. The line_start_pattern regex identifies where a new log record begins — all subsequent lines until the next match are folded into the same entry.
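
The folding rule itself is simple enough to sketch in Python — a toy illustration of the pattern-matching behavior, not the receiver's actual code:

```python
import re

# A line matching the start pattern opens a new record; any other line
# (stack frames, caused-by lines) is appended to the record in progress.
LINE_START = re.compile(r"^\d{4}-\d{2}-\d{2}")

def fold(lines):
    records = []
    for line in lines:
        if LINE_START.match(line) or not records:
            records.append(line)            # new log record
        else:
            records[-1] += "\n" + line      # continuation line
    return records

raw = [
    "2024-05-01 12:00:00.123 ERROR [main] c.e.App - boom",
    "java.lang.IllegalStateException: boom",
    "\tat com.example.App.main(App.java:42)",
    "2024-05-01 12:00:01.000 INFO  [main] c.e.App - recovered",
]
records = fold(raw)
print(len(records))  # 2 — the stack trace stays attached to the ERROR line
```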

Recipe 4: Tail Sampling Gateway

Head-based sampling (deciding at trace start) is simple but wasteful — you might drop the one trace that shows a critical error. Tail sampling waits until a trace is complete, then decides whether to keep it based on what actually happened. This requires a gateway Collector that receives all spans, groups them into complete traces, and applies policies.

Tail Sampling Requires Trace Completeness

All spans for a given trace must arrive at the same gateway instance. If you run multiple gateway replicas, you need a load balancer that routes by trace_id — use the loadbalancing exporter on your agent Collectors to achieve this. Without this, the gateway sees incomplete traces and makes bad sampling decisions.

yaml
# recipe-4-tail-sampling-gateway.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  # Step 1: Buffer spans and group them into complete traces
  groupbytrace:
    wait_duration: 30s          # Wait up to 30s for all spans of a trace to arrive
    num_traces: 100000          # Max traces held in memory simultaneously

  # Step 2: Apply sampling policies to complete traces
  tail_sampling:
    decision_wait: 10s          # Additional wait after groupbytrace
    num_traces: 100000
    policies:
      # Policy 1: Always keep traces containing errors
      - name: errors-policy
        type: status_code
        status_code:
          status_codes:
            - ERROR

      # Policy 2: Keep traces with high latency (> 2 seconds)
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 2000

      # Policy 3: Always keep traces from critical services
      - name: critical-services
        type: string_attribute
        string_attribute:
          key: service.name
          values:
            - payment-service
            - auth-service

      # Policy 4: Probabilistic sampling for everything else (10%)
      - name: catch-all
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

  batch:
    send_batch_size: 1024
    timeout: 5s

exporters:
  otlp/tempo:
    endpoint: ${TEMPO_ENDPOINT:-tempo.observability.svc:4317}
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers:  [otlp]
      processors: [groupbytrace, tail_sampling, batch]
      exporters:  [otlp/tempo]

The processor order matters. groupbytrace must come before tail_sampling so that the sampler sees complete traces, not individual spans arriving out of order. The batch processor comes last because you want to batch the already-sampled output. Policies are evaluated with OR logic — if any policy matches, the trace is kept.
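
The OR evaluation can be sketched with a toy model. The policy functions below are illustrative stand-ins: real policies operate on OTLP spans, and the Collector's latency policy evaluates overall trace duration rather than per-span durations:

```python
import random

# A trace is modeled as a list of span dicts; it is kept if ANY policy matches.
def errors_policy(trace):
    return any(span["status"] == "ERROR" for span in trace)

def latency_policy(trace, threshold_ms=2000):
    return max(span["duration_ms"] for span in trace) > threshold_ms

def critical_services(trace, names=("payment-service", "auth-service")):
    return any(span["service"] in names for span in trace)

def probabilistic(trace, sampling_percentage=10):
    return random.random() * 100 < sampling_percentage

def keep(trace):
    # OR logic: the first matching policy keeps the trace
    return (errors_policy(trace) or latency_policy(trace)
            or critical_services(trace) or probabilistic(trace))

slow = [{"service": "cart", "status": "OK", "duration_ms": 3500}]
errored = [{"service": "cart", "status": "ERROR", "duration_ms": 12}]
print(keep(slow), keep(errored))  # True True
```

A trace that matches no deterministic policy falls through to the probabilistic catch-all, which keeps roughly 10% of the remainder.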

Recipe 5: Multi-Tenant Routing

In a shared platform, different teams or tenants need their telemetry routed to separate backends. The routing connector reads a resource attribute (like tenant.id) and directs data to different sub-pipelines, each with its own exporter and destination.

yaml
# recipe-5-multi-tenant-routing.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

connectors:
  routing:
    table:
      - statement: route() where resource.attributes["tenant.id"] == "team-alpha"
        pipelines: [traces/alpha]
      - statement: route() where resource.attributes["tenant.id"] == "team-beta"
        pipelines: [traces/beta]
    default_pipelines: [traces/default]

processors:
  batch:
    timeout: 5s

exporters:
  otlp/alpha:
    endpoint: ${ALPHA_ENDPOINT:-tempo-alpha.svc:4317}
    tls:
      insecure: true
  otlp/beta:
    endpoint: ${BETA_ENDPOINT:-tempo-beta.svc:4317}
    tls:
      insecure: true
  otlp/default:
    endpoint: ${DEFAULT_ENDPOINT:-tempo-shared.svc:4317}
    tls:
      insecure: true

service:
  pipelines:
    # Ingestion pipeline: receives data and routes it
    traces:
      receivers:  [otlp]
      processors: [batch]
      exporters:  [routing]       # Connector used as an exporter here

    # Tenant-specific pipelines: connector feeds data in as a receiver
    traces/alpha:
      receivers:  [routing]       # Connector used as a receiver here
      exporters:  [otlp/alpha]
    traces/beta:
      receivers:  [routing]
      exporters:  [otlp/beta]
    traces/default:
      receivers:  [routing]
      exporters:  [otlp/default]

Notice how the routing connector appears as an exporter in the ingestion pipeline and a receiver in the tenant pipelines. This is the defining trait of connectors: they bridge pipelines. The OTTL route() statement gives you full access to resource attributes, span attributes, and metric data points for routing decisions.
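
The routing decision reduces to a table lookup. The sketch below is a simplified model — the real connector evaluates OTTL statements, whereas here the table is plain attribute matching:

```python
# Routing table: (attribute condition, target pipelines), first match wins.
ROUTING_TABLE = [
    ({"tenant.id": "team-alpha"}, ["traces/alpha"]),
    ({"tenant.id": "team-beta"}, ["traces/beta"]),
]
DEFAULT_PIPELINES = ["traces/default"]

def route(resource_attributes):
    for condition, pipelines in ROUTING_TABLE:
        if all(resource_attributes.get(k) == v for k, v in condition.items()):
            return pipelines
    return DEFAULT_PIPELINES          # no statement matched

print(route({"tenant.id": "team-beta"}))   # ['traces/beta']
print(route({"tenant.id": "team-gamma"}))  # ['traces/default']
```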

Essential Extensions for Debugging

Three extensions belong in every Collector deployment. They cost almost nothing to run and save hours when something goes wrong.

health_check

Exposes an HTTP endpoint (default :13133) that returns 200 OK when the Collector is running. Use it as a Kubernetes liveness probe and readiness probe.

pprof

Exposes Go's pprof profiling endpoints (default :1777). When the Collector is using too much CPU or memory, connect with go tool pprof http://localhost:1777/debug/pprof/heap to diagnose exactly where resources are going.

zpages

Provides an in-process web UI (default :55679) showing live pipeline status, recent traces processed by the Collector itself, and component-level stats. Visit /debug/tracez to see sampled internal traces and /debug/pipelinez for pipeline health.

yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, pprof, zpages]
  # ... pipelines below

Validate Before You Deploy

Run otelcol validate --config=config.yaml to catch syntax errors and invalid references without starting the Collector. Pair this with otelcol components to list all receivers, processors, and exporters compiled into your Collector binary — a common source of "unknown component" errors when using the wrong distribution.

Backends and Visualization: The Observability Stack

OpenTelemetry intentionally stops at data collection and export — it does not provide storage or querying. That boundary is where backends take over. Choosing the right backend for traces, metrics, and logs is one of the highest-impact architectural decisions you'll make, because it directly shapes query capabilities, operational cost, and long-term scalability.

The ecosystem breaks down cleanly along the three signal types, plus a growing category of all-in-one platforms that handle everything under a single umbrella.

mindmap
  root((OTel Backends))
    Traces
      Jaeger
      Grafana Tempo
      Zipkin
    Metrics
      Prometheus
      Mimir / Cortex / Thanos
      VictoriaMetrics
    Logs
      Grafana Loki
      Elasticsearch / OpenSearch
    All-in-One
      Grafana LGTM Stack
      Datadog
      Honeycomb
      New Relic
      Dynatrace
      Splunk
    

Trace Backends

Traces are arguably the most transformative signal OTel produces — they show you request flow across service boundaries. Three backends dominate this space, each with a distinct philosophy.

Jaeger

Jaeger is a CNCF graduated project originally created at Uber. It offers native OTLP ingestion, a polished web UI for trace exploration, and support for multiple storage backends (Cassandra, Elasticsearch, Kafka, and an in-memory option for development). For dev environments or small-to-medium production workloads, Jaeger's all-in-one binary gets you running in seconds.

bash
# Run Jaeger all-in-one with OTLP enabled (gRPC on 4317, HTTP on 4318)
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

Jaeger v2 (currently in development) is being rebuilt on top of the OpenTelemetry Collector, meaning Jaeger itself becomes an OTel Collector distribution with built-in storage. This is a strong signal of how deeply the ecosystem is converging around OTel.

Grafana Tempo

Tempo takes a radically different approach: it stores traces in object storage (S3, GCS, Azure Blob) with no indexing. This makes it extremely cost-efficient at scale — you pay object storage prices instead of database prices. The trade-off is that you need a trace ID to look up a trace directly, or you use TraceQL, Tempo's query language, to search by span attributes.

yaml
# Minimal Tempo config receiving OTLP
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: "0.0.0.0:4317"

storage:
  trace:
    backend: s3
    s3:
      bucket: my-tempo-traces
      endpoint: s3.amazonaws.com

Tempo shines when paired with Grafana — you get trace visualization, TraceQL queries, and seamless links from metrics and logs to the exact traces that matter. For teams already on the Grafana ecosystem, Tempo is the natural choice.

Zipkin

Zipkin predates both Jaeger and OTel. It's still widely deployed and OTel Collectors can export to it natively. However, it lacks the modern query capabilities and ecosystem integration of Jaeger or Tempo. If you're starting fresh, Jaeger or Tempo are stronger choices — but if you already run Zipkin, OTel fits in without requiring a migration.

Metrics Backends

Metrics are the most mature observability signal, and the Prometheus ecosystem is the gravitational center of this space.

Prometheus

Prometheus is the de facto standard for Kubernetes metrics. It uses a pull-based model — the Prometheus server scrapes HTTP endpoints at regular intervals. Its query language, PromQL, is expressive and widely supported across dashboarding tools. OTel integrates with Prometheus in two ways: the Collector can expose a Prometheus scrape endpoint, or it can remote-write metrics directly to Prometheus-compatible backends.

yaml
# OTel Collector exporter: Prometheus scrape endpoint
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: myapp
    send_timestamps: true
    resource_to_telemetry_conversion:
      enabled: true

Standalone Prometheus works well for single-cluster deployments. Its main limitation is horizontal scalability — a single Prometheus server has a ceiling on the number of time series it can handle.

Thanos, Cortex, and Mimir

When Prometheus hits its scaling limits, you move to a horizontally scalable layer. Thanos adds long-term storage and multi-cluster federation on top of existing Prometheus instances. Cortex was the first fully distributed Prometheus-compatible backend. Grafana Mimir is Cortex's successor — it ingests Prometheus metrics via remote write, stores them in object storage, and scales to billions of active series. All three are PromQL-compatible, so your dashboards and alerts don't change.

VictoriaMetrics

VictoriaMetrics is a Prometheus-compatible time-series database focused on performance and storage efficiency. It typically uses 5-10x less disk and RAM than Prometheus for the same dataset. It supports PromQL (with extensions via MetricsQL), remote write ingestion from OTel Collectors, and both single-node and clustered deployments. It's a strong option if cost-efficiency is your primary concern.

Log Backends

Logs are the highest-volume signal, which makes storage cost and query performance the defining factors in backend choice.

Grafana Loki

Loki indexes only metadata labels (like service.name, severity), not the log content itself. This makes it dramatically cheaper to run than full-text search engines. You query logs with LogQL, which filters by labels first and then applies regex or pattern matching on log lines. OTel Collectors export to Loki natively via the loki or otlphttp exporter.

yaml
# OTel Collector exporter: sending logs to Loki via OTLP
exporters:
  otlphttp:
    endpoint: "http://loki:3100/otlp"

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
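
Loki's label-first query model — the reason it is so cheap to run — can be sketched as a two-phase lookup. The streams and labels below are made up for illustration:

```python
import re

# Phase 1 selects streams via the (small) label index; phase 2 greps the
# unindexed log lines of only those streams.
STREAMS = {
    ("service_name=checkout", "severity=error"): ["timeout calling db", "retrying"],
    ("service_name=checkout", "severity=info"): ["checkout started"],
    ("service_name=cart", "severity=error"): ["timeout calling cache"],
}

def query(label_filters, line_pattern):
    results = []
    for labels, lines in STREAMS.items():
        if all(f in labels for f in label_filters):          # phase 1: labels
            results += [l for l in lines if re.search(line_pattern, l)]  # phase 2: grep
    return results

print(query(["service_name=checkout", "severity=error"], "timeout"))
# -> ['timeout calling db']
```

Because only the labels are indexed, a query that filters on high-cardinality content (like a user ID in the message body) pays the grep cost — which is why keeping label sets small and low-cardinality is central to running Loki well.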

Elasticsearch and OpenSearch

Elasticsearch (and its open-source fork OpenSearch) provides full-text search with inverted indexes across all log content. This gives you the most powerful querying — you can search any substring, aggregate on any field, and build complex boolean queries. The cost is significantly higher resource consumption: CPU, memory, and disk all scale with ingest volume. For teams that need deep ad-hoc log analysis or already run the ELK stack, Elasticsearch remains a solid choice.

All-in-One Platforms

Rather than assembling separate backends for each signal, many teams opt for a unified platform. These come in two flavors: open-source stacks you self-host, and commercial SaaS products.

The Grafana LGTM Stack

The "LGTM" stack — Loki (logs), Grafana (visualization), Tempo (traces), Mimir (metrics) — is the leading open-source all-in-one approach. Every component uses object storage, speaks OTLP natively, and is designed to work together. Grafana ties them all into a single pane of glass with cross-signal correlation.

| Platform | Strengths | Best For |
|---|---|---|
| Grafana LGTM | Open-source, cost-efficient, full OTel support, cross-signal correlation | Teams wanting control and cost savings |
| Datadog | Full-featured SaaS, auto-instrumentation, unified UI, 700+ integrations | Teams prioritizing ease of use over cost |
| Honeycomb | High-cardinality query engine, BubbleUp analysis, trace-first approach | Debugging complex distributed systems |
| New Relic | Generous free tier, full-stack observability, AIOps features | Startups and teams needing broad coverage |
| Dynatrace | AI-driven root cause analysis (Davis AI), automatic topology mapping | Enterprise environments with complex infrastructure |
| Splunk | Powerful log analytics (SPL), strong security/compliance features | Organizations with security and compliance requirements |

OTel as the Common Denominator

Every major commercial platform now accepts OTLP natively. This means you can instrument with OTel once and switch backends later without touching application code. Vendor lock-in shifts from the instrumentation layer to the query and dashboard layer — a much easier migration.

The Grafana Visualization Layer

Regardless of which backends you choose, Grafana has become the default visualization layer for OTel-based observability. It supports every backend mentioned above as a data source, and its real power lies in cross-signal correlation — the ability to jump between traces, metrics, and logs in a single investigation.

Dashboards and Explore

Grafana dashboards are built from panels, each querying a specific data source with PromQL, LogQL, TraceQL, or other query languages. The Explore view is where ad-hoc investigation happens — it gives you a split-pane interface where you can run queries against different data sources side by side, follow trace waterfalls, and drill into log lines.

Cross-Signal Navigation

The most powerful feature of the Grafana stack is seamless navigation between signals. Grafana supports three key cross-signal patterns:

  • Trace-to-Logs: Click a span in a trace and jump directly to the logs emitted during that span's execution, filtered by trace ID and time range.
  • Trace-to-Metrics: From a trace, navigate to the relevant service's RED metrics (rate, errors, duration) for the time window around the request.
  • Exemplars: Prometheus metrics can carry exemplar data — a sample trace ID attached to a specific metric data point. When you see a latency spike on a dashboard, you click the exemplar dot and land on the exact trace that caused it.

Tip

Exemplars are the single most impactful cross-signal feature to set up. A latency percentile that says "p99 is 2.3s" is useful — but a clickable link from that data point to the exact slow trace is transformative for debugging. Enable exemplar storage in Prometheus or Mimir, and ensure your OTel SDK is attaching trace IDs to metric exports.

This cross-signal workflow — from a dashboard metric anomaly, to exemplar traces, to correlated logs — collapses investigation time from hours of manual correlation to minutes of guided navigation. It's the reason the "LGTM" stack has gained such rapid adoption: the backends are good individually, but the connected experience through Grafana is what makes the whole greater than the sum of its parts.

Semantic Conventions, Resource Detection, and Contrib Packages

OpenTelemetry generates telemetry data — spans, metrics, logs — but that data is only useful if different teams, services, and tools can agree on what the attribute names and values mean. Without a shared vocabulary, one service might call the HTTP method http.method, another might call it request_method, and a third might use httpMethod. Dashboards, alerts, and correlation queries break down immediately.

This is the problem semantic conventions solve. They are a standardized, versioned set of attribute names, types, and allowed values maintained by the OTel specification. When every instrumentation library uses the same attributes, your observability backend can provide meaningful out-of-the-box dashboards without custom mappings.

Key Convention Groups

Semantic conventions are organized by signal domain. Each group defines the attributes that describe a particular kind of operation. Here are the most important ones you'll encounter daily.

HTTP Conventions

HTTP is the most widely instrumented protocol. The conventions cover both client and server spans, capturing the request method, URL, status code, and more.

| Attribute | Type | Example Value | Description |
|---|---|---|---|
| http.request.method | string | GET | HTTP request method (uppercase) |
| url.full | string | https://api.example.com/users?page=2 | Full request URL including scheme, host, path, query |
| http.response.status_code | int | 200 | HTTP response status code |
| url.path | string | /users | The path component of the URL |
| http.route | string | /users/:id | The matched route template (low cardinality) |
| server.address | string | api.example.com | Server domain name or IP |

A typical instrumented HTTP server span in Python looks like this:

python
# These attributes are set automatically by OTel HTTP instrumentation.
# You rarely set them manually — this shows what the span contains.
span.set_attribute("http.request.method", "GET")
span.set_attribute("url.full", "https://api.example.com/users?page=2")
span.set_attribute("url.path", "/users")
span.set_attribute("http.route", "/users")
span.set_attribute("http.response.status_code", 200)
span.set_attribute("server.address", "api.example.com")

Database Conventions

Database spans capture what system you're talking to, the operation being performed, and (optionally) the raw statement. These conventions let you correlate slow queries across services without knowing which ORM or driver each team uses.

| Attribute | Type | Example Value | Description |
|---|---|---|---|
| db.system | string | postgresql | Database management system identifier |
| db.statement | string | SELECT * FROM users WHERE id = $1 | The database statement (sanitized) |
| db.operation | string | SELECT | The name of the operation (e.g., SQL verb) |
| db.name | string | myapp_production | Name of the database being accessed |
| server.address | string | db-primary.internal | Database server hostname |
| server.port | int | 5432 | Database server port |

Messaging Conventions

For systems like Kafka, RabbitMQ, or SQS, messaging conventions capture the broker, the operation type, and the destination. This is essential for tracing asynchronous workflows across publish/subscribe boundaries.

| Attribute | Type | Example Value | Description |
|---|---|---|---|
| messaging.system | string | kafka | Messaging system identifier |
| messaging.operation | string | publish | Type of operation: publish, receive, process |
| messaging.destination.name | string | order-events | Topic or queue name |
| messaging.message.id | string | msg-abc-123 | Unique message identifier |

RPC Conventions

RPC conventions cover gRPC, Thrift, and other RPC frameworks. They capture the system, service, and method being invoked, making it straightforward to see which remote procedure calls are slow or failing.

| Attribute | Type | Example Value | Description |
|---|---|---|---|
| rpc.system | string | grpc | RPC system (grpc, thrift, etc.) |
| rpc.service | string | UserService | Full name of the RPC service |
| rpc.method | string | GetUser | Name of the RPC method |
| rpc.grpc.status_code | int | 0 | gRPC numeric status code (0 = OK) |

Resource Conventions

While span attributes describe individual operations, resource attributes describe the entity producing the telemetry — your service, the host it runs on, the container, the Kubernetes pod, the cloud account. Resources are attached once at SDK initialization and applied to every span, metric, and log emitted by that SDK instance.

| Attribute | Category | Example Value | Description |
|---|---|---|---|
| service.name | Service | payment-api | Logical name of the service (the most important resource attribute) |
| service.version | Service | 2.4.1 | Version of the deployed service |
| telemetry.sdk.language | SDK | python | Language of the OTel SDK |
| telemetry.sdk.version | SDK | 1.24.0 | Version of the OTel SDK |
| cloud.provider | Cloud | aws | Cloud provider (aws, gcp, azure) |
| cloud.region | Cloud | us-east-1 | Cloud region |
| k8s.pod.name | Kubernetes | payment-api-7b9f5d-x2k4q | Name of the Kubernetes pod |
| k8s.namespace.name | Kubernetes | production | Kubernetes namespace |
| host.name | Host | ip-10-0-1-42 | Hostname of the machine |
| container.id | Container | a3bf... | Container ID (from cgroup or runtime) |

service.name is critical

If you set only one resource attribute, make it service.name. Without it, most backends default to unknown_service, making your telemetry nearly impossible to filter or route. Set it via the OTEL_SERVICE_NAME environment variable or in your SDK configuration.

Convention Migration: Old → New

Semantic conventions are versioned and evolve over time. A significant migration happened between the "pre-stable" and "stable" HTTP conventions. If you've worked with OTel before 2023, you may have seen the old attribute names. Libraries are actively migrating, and many SDKs now emit both old and new attributes during the transition period.

| Old Attribute (deprecated) | New Attribute (stable) | Notes |
|---|---|---|
| http.method | http.request.method | Namespaced under request/response |
| http.url | url.full | URL attributes moved to a standalone url.* namespace |
| http.status_code | http.response.status_code | Clarifies it's a response attribute |
| http.target | url.path + url.query | Split into separate attributes for clarity |
| net.peer.name | server.address | Simplified and moved to server.* namespace |
| net.peer.port | server.port | Simplified and moved to server.* namespace |

Watch for dual-emit during migration

Many instrumentation libraries currently emit both old and new attribute names. This means your backend might receive duplicated data under different keys. Update your dashboards and alert queries to use the new names, and configure your SDK's OTEL_SEMCONV_STABILITY_OPT_IN environment variable to control the behavior — options typically include old, dup (both), or stable (new only).

Resource Detection

Manually setting every resource attribute — host name, container ID, cloud region, Kubernetes pod — would be tedious and fragile. Resource detectors solve this by automatically discovering the runtime environment at startup and populating resource attributes for you.

Each OTel SDK ships with built-in detectors, and the contrib ecosystem provides additional ones. A detector queries its environment (file system, metadata endpoints, environment variables) and returns a set of resource attributes.

python
from opentelemetry.sdk.resources import Resource, get_aggregated_resources
from opentelemetry.resource.detector.azure import AzureVMResourceDetector
from opentelemetry.resource.detector.container import ContainerResourceDetector

# Detectors run at startup and merge their results
resource = get_aggregated_resources(
    detectors=[
        ContainerResourceDetector(),
        AzureVMResourceDetector(),
    ],
    initial_resource=Resource.create({
        "service.name": "payment-api",
        "service.version": "2.4.1",
    }),
)
# resource now contains service.name, container.id,
# cloud.provider, cloud.region, host.id, etc.

Common resource detectors available across language SDKs include:

| Detector | Attributes Populated | How It Works |
|---|---|---|
| Process | process.pid, process.executable.name, process.runtime.name | Reads process info from the OS |
| Host | host.name, host.id, host.arch | Reads hostname and machine ID |
| OS | os.type, os.description | Reads OS release info |
| Container | container.id | Parses /proc/self/cgroup or CRI-O files |
| Kubernetes | k8s.pod.name, k8s.namespace.name, k8s.node.name | Reads the Downward API env vars or metadata |
| AWS (EC2/ECS/EKS) | cloud.provider, cloud.region, aws.ecs.task.arn | Queries IMDS or ECS metadata endpoint |
| GCP | cloud.provider, cloud.region, gcp.project.id | Queries GCP metadata server |

In many setups, you don't even configure detectors manually. The OTEL_RESOURCE_DETECTORS environment variable lets you specify which detectors to activate. For example, OTEL_RESOURCE_DETECTORS=env,host,os,process,container enables a standard set for containerized workloads.

The Contrib Ecosystem

The core OTel SDK is deliberately minimal — it provides the API, SDK, and OTLP exporter, but does not instrument specific libraries. That's the job of the contrib ecosystem: a collection of community-maintained packages that provide automatic instrumentation for popular libraries and frameworks.

Each contrib package follows a naming convention: opentelemetry-instrumentation-{library}. For example:

bash
# Python examples
pip install opentelemetry-instrumentation-flask
pip install opentelemetry-instrumentation-django
pip install opentelemetry-instrumentation-requests
pip install opentelemetry-instrumentation-sqlalchemy
pip install opentelemetry-instrumentation-celery
pip install opentelemetry-instrumentation-redis

# Node.js examples
npm install @opentelemetry/instrumentation-http
npm install @opentelemetry/instrumentation-express
npm install @opentelemetry/instrumentation-pg
npm install @opentelemetry/instrumentation-redis

These packages monkey-patch or wrap library internals to create spans, record metrics, and propagate context automatically. Once installed and activated, a Flask or Express instrumentation creates server spans for every incoming HTTP request — with all the correct semantic convention attributes — without you writing any tracing code.

python
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)

# One line: every route now produces spans with
# http.request.method, url.path, http.route, http.response.status_code
FlaskInstrumentor().instrument_app(app)

@app.route("/users/<int:user_id>")
def get_user(user_id):
    return {"id": user_id, "name": "Alice"}

The OTel Registry

With hundreds of contrib packages across multiple languages, finding the right one can be overwhelming. The OpenTelemetry Registry is the official catalog. It's a searchable directory of instrumentation packages, exporters, resource detectors, and other components — across all supported languages.

When evaluating a contrib package from the registry, consider these factors:

| Factor | What to Check | Why It Matters |
| --- | --- | --- |
| Maturity | Look for stable vs alpha vs beta labels | Stable packages have locked APIs; alpha packages may have breaking changes |
| Supported versions | Check which versions of the target library are supported | A Django instrumentor that only supports Django 3.x won't help on 5.x |
| Semantic convention compliance | Does it use the latest stable attribute names? | Old conventions lead to fragmented dashboards |
| Signal coverage | Does it produce traces only, or also metrics and logs? | Full signal coverage means richer observability |
| Maintenance activity | Check the GitHub repo for recent commits and issue response | Stale packages accumulate bugs and security vulnerabilities |

Start with the "all-in-one" meta-package

Most languages offer a meta-package that bundles all stable instrumentations. In Python, opentelemetry-bootstrap -a install detects your installed libraries and installs matching instrumentors automatically. In Node.js, @opentelemetry/auto-instrumentations-node bundles the common ones. Use these to get started quickly, then trim down to only what you need for production.

SLIs, SLOs, and SLAs: Observability Meets Reliability

Collecting telemetry data is only valuable if it helps you answer a single question: is the service healthy for its users? SLIs, SLOs, and SLAs form a hierarchy that translates raw observability signals into reliability contracts your team (and your customers) can reason about.

The Hierarchy: SLI → SLO → SLA

These three concepts build on each other in a strict chain. Each layer adds more context and more consequences to the one below it.

| Concept | Definition | Example | Owned By |
| --- | --- | --- | --- |
| SLI (Service Level Indicator) | A quantitative measure of one dimension of service behavior | Proportion of HTTP requests completing in < 200ms | Engineering |
| SLO (Service Level Objective) | A target value (or range) for an SLI, measured over a time window | 99.9% of requests under 200ms over a rolling 30-day window | Engineering + Product |
| SLA (Service Level Agreement) | An SLO with contractual, business consequences for missing it | "99.9% availability or customer receives service credits" | Business / Legal |

SLOs should always be stricter than SLAs

If your SLA promises 99.9% availability, set your internal SLO at 99.95%. This gives your team a buffer to detect and fix problems before breaching the contractual agreement. An SLO violation should trigger engineering action; an SLA violation means your business is already paying the price.

How OTel Data Feeds SLI Calculation

OpenTelemetry is the collection layer — it produces the raw metrics and traces that SLIs are computed from. The OTel SDK instruments your application, emitting histograms (for latency), counters (for requests and errors), and traces (for per-request context). These signals flow through the OTel Collector and land in a metrics backend like Prometheus, where you write queries to derive SLI values.

graph LR
    A["OTel SDK\n(latency histogram,\nerror counter)"] --> B["OTel Collector"]
    B --> C["Prometheus\n(metrics storage)"]
    C --> D["PromQL Queries\n(derive SLIs)"]
    D --> E["SLO Recording Rules\n(evaluate SLI vs target)"]
    E --> F{"Burn Rate\nExceeds\nThreshold?"}
    F -->|Yes| G["Alert Fires 🔔"]
    G --> H["Incident Response"]
    F -->|No| I["Budget OK ✅"]

Common SLI Types

Not every metric is an SLI. Good SLIs measure something the user directly experiences. Here are the five most common types, along with how you'd express each one from OTel-generated metrics.

Availability

The ratio of successful requests to total requests. This is the most fundamental SLI — if the service is down, nothing else matters. A request is "successful" if it returns a non-5xx response.

promql
# Availability SLI: ratio of non-5xx requests over 30 days
sum(rate(http_server_request_duration_seconds_count{http_response_status_code!~"5.."}[5m]))
/
sum(rate(http_server_request_duration_seconds_count[5m]))

Latency

The proportion of requests faster than a given threshold. OTel exports latency as a histogram (http_server_request_duration_seconds), so you use histogram_quantile for percentiles or bucket ratios for SLI-style "good/total" fractions.

promql
# Latency SLI: proportion of requests completing in < 200ms
sum(rate(http_server_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_server_request_duration_seconds_bucket{le="+Inf"}[5m]))
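
The bucket arithmetic behind this query is simple because Prometheus histogram buckets are cumulative: the le="0.2" bucket already counts every faster request. A stdlib-only sketch of the same good/total ratio (the bucket counts below are invented for illustration):

```python
def latency_sli(buckets: dict[str, float], threshold: str = "0.2") -> float:
    """Good/total ratio from cumulative histogram buckets.

    Buckets are cumulative, so buckets[threshold] already includes
    every request at or below the threshold.
    """
    return buckets[threshold] / buckets["+Inf"]

# Hypothetical cumulative bucket counts scraped over a window
buckets = {"0.05": 700.0, "0.2": 950.0, "0.5": 990.0, "+Inf": 1000.0}
print(latency_sli(buckets))  # 950 of 1000 requests under 200ms -> 0.95
```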

Error Rate, Throughput, and Correctness

| SLI Type | Formula | Typical Source |
| --- | --- | --- |
| Error Rate | errors / total requests | http_server_request_duration_seconds_count filtered by status code |
| Throughput | requests per second above a minimum baseline | rate(http_server_request_duration_seconds_count[5m]) |
| Correctness | valid responses / total responses (requires app-level validation) | Custom OTel counter, e.g., app_response_valid_total |

Correctness is the hardest to instrument because it requires domain knowledge — the service might return a 200 OK with wrong data. You typically add a custom OTel counter that increments when a response passes validation checks.

Error Budgets

An error budget flips the reliability conversation from "minimize all failure" to "how much failure can we tolerate?" If your SLO is 99.9% over 30 days, your error budget is the remaining 0.1% — roughly 43 minutes of downtime or the equivalent number of failed requests.

text
Error Budget = 1 - SLO target

SLO = 99.9%  →  Error Budget = 0.1%
Over 30 days  →  0.001 × 30 × 24 × 60 = 43.2 minutes of allowed downtime

If you've already consumed 30 minutes this month, you have 13.2 minutes left.
Spending it faster than expected → burn rate alert fires.

When the budget is healthy, teams ship features aggressively. When the budget is nearly exhausted, the team shifts to reliability work. This creates a natural, data-driven balance between velocity and stability.
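
The budget arithmetic from the example above fits in a few lines. A small helper, using the same 99.9%/30-day numbers (the function names are ours, not from any SLO library):

```python
def error_budget_minutes(slo_target: float, window_days: float = 30) -> float:
    """Total allowed 'bad' minutes for an SLO over its window."""
    return (1 - slo_target) * window_days * 24 * 60

def budget_remaining_minutes(slo_target: float, consumed_minutes: float,
                             window_days: float = 30) -> float:
    """How much of the budget is left after some downtime is spent."""
    return error_budget_minutes(slo_target, window_days) - consumed_minutes

print(round(error_budget_minutes(0.999), 1))          # 43.2
print(round(budget_remaining_minutes(0.999, 30), 1))  # 13.2
```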

Multi-Window, Multi-Burn-Rate Alerting

Naive SLO alerting ("fire if SLI drops below target") creates either too many false positives or alerts too late. Google's SRE book introduced multi-window, multi-burn-rate alerting to solve this. The idea: measure how fast you're consuming your error budget (the "burn rate") over multiple time windows, and alert only when both a short window and a long window agree the burn is real.

What Is Burn Rate?

A burn rate of 1x means you'll exactly exhaust your 30-day error budget in 30 days. A burn rate of 14.4x means you'll burn through the entire budget in about 2 days. Higher burn rates demand faster response.

| Severity | Burn Rate | Long Window | Short Window | Budget Consumed | Response |
| --- | --- | --- | --- | --- | --- |
| Page (critical) | 14.4x | 1 hour | 5 minutes | 2% in 1 hour | Immediate |
| Page (high) | 6x | 6 hours | 30 minutes | 5% in 6 hours | Within 30 min |
| Ticket (medium) | 3x | 1 day | 2 hours | 10% in 1 day | Next business day |
| Ticket (low) | 1x | 3 days | 6 hours | 10% in 3 days | Sprint planning |

Both the long window and the short window must exceed their threshold for the alert to fire. The short window acts as a reset: if a brief spike is already over, the short window drops below threshold and suppresses the alert.
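
The two-window decision logic can be sketched in plain Python. This is an illustrative model of the critical tier from the table (0.999 SLO, 14.4x threshold), not any particular alerting engine's API:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is burning."""
    return error_ratio / (1 - slo_target)

def should_page(long_ratio: float, short_ratio: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    # Both windows must agree that the burn is real AND still ongoing
    return (burn_rate(long_ratio, slo_target) > threshold
            and burn_rate(short_ratio, slo_target) > threshold)

# Sustained incident: both windows hot -> page
print(should_page(long_ratio=0.02, short_ratio=0.03))    # True
# Brief spike already over: short window recovered -> suppressed
print(should_page(long_ratio=0.02, short_ratio=0.0005))  # False
```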

Prometheus Recording and Alerting Rules

You define the SLI as a Prometheus recording rule, then write alerting rules that compare burn rates across windows. Here's a practical example for a latency SLO (99.9% of requests under 200ms over 30 days):

yaml
# prometheus-rules.yaml
groups:
  - name: slo-latency-rules
    rules:
      # --- Recording rules: pre-compute the SLI ratio ---
      - record: sli:latency:good_rate5m
        expr: |
          sum(rate(http_server_request_duration_seconds_bucket{le="0.2"}[5m]))
      - record: sli:latency:total_rate5m
        expr: |
          sum(rate(http_server_request_duration_seconds_bucket{le="+Inf"}[5m]))

      # --- 5m error ratio (short window for the 14.4x alert) ---
      - record: sli:latency:error_ratio_5m
        expr: |
          1 - (sli:latency:good_rate5m / sli:latency:total_rate5m)

      # --- 30m error ratio (short window for the 6x alert) ---
      - record: sli:latency:error_ratio_30m
        expr: |
          1 - (
            sum(increase(http_server_request_duration_seconds_bucket{le="0.2"}[30m]))
            /
            sum(increase(http_server_request_duration_seconds_bucket{le="+Inf"}[30m]))
          )

      # --- 1h error ratio (for 14.4x burn rate window) ---
      - record: sli:latency:error_ratio_1h
        expr: |
          1 - (
            sum(increase(http_server_request_duration_seconds_bucket{le="0.2"}[1h]))
            /
            sum(increase(http_server_request_duration_seconds_bucket{le="+Inf"}[1h]))
          )

      # --- 6h error ratio (for 6x burn rate window) ---
      - record: sli:latency:error_ratio_6h
        expr: |
          1 - (
            sum(increase(http_server_request_duration_seconds_bucket{le="0.2"}[6h]))
            /
            sum(increase(http_server_request_duration_seconds_bucket{le="+Inf"}[6h]))
          )

  - name: slo-latency-alerts
    rules:
      # --- Critical: 14.4x burn rate over 1h AND 5m ---
      - alert: LatencySLOCriticalBurnRate
        expr: |
          sli:latency:error_ratio_1h > (14.4 * 0.001)
            and
          sli:latency:error_ratio_5m > (14.4 * 0.001)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Latency SLO burn rate critical (14.4x)"
          description: "Consuming error budget at 14.4x. Will exhaust 30-day budget in ~2 days."

      # --- High: 6x burn rate over 6h AND 30m ---
      - alert: LatencySLOHighBurnRate
        expr: |
          sli:latency:error_ratio_6h > (6 * 0.001)
            and
          sli:latency:error_ratio_30m > (6 * 0.001)
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Latency SLO burn rate high (6x)"
          description: "Consuming error budget at 6x. Will exhaust 30-day budget in ~5 days."
Use recording rules to avoid query explosion

Each burn-rate window requires a separate PromQL query. Without recording rules, an alert evaluation with 4 severity tiers × 2 windows = 8 expensive histogram queries every evaluation cycle. Pre-compute the error ratios as recording rules so your alerting rules are cheap comparisons.

Grafana Dashboard for SLO Tracking

A well-designed SLO dashboard tells the story at a glance: Are we meeting our target? How much budget remains? Is the current burn rate dangerous? Below is a Grafana dashboard JSON model for the key panels. Import this into Grafana or use it as a starting template.

Panel 1: SLI Over Time (Time Series)

promql
# Query A: Current SLI (plot as time series)
sli:latency:good_rate5m / sli:latency:total_rate5m

# Query B: SLO target (plot as constant threshold line)
vector(0.999)

Plot Query A as a line and Query B as a dashed red threshold line. Whenever the SLI dips below the SLO line, you can see at a glance which moments ate into your budget.

Panel 2: Error Budget Remaining (Gauge + Time Series)

promql
# Error budget remaining (as a percentage of total budget)
# Uses 30-day window; adjust range to match your SLO period
1 - (
  (
    1 - (
      sum(increase(http_server_request_duration_seconds_bucket{le="0.2"}[30d]))
      /
      sum(increase(http_server_request_duration_seconds_bucket{le="+Inf"}[30d]))
    )
  ) / 0.001
)

Display this as a Gauge panel with thresholds: green above 50%, yellow between 20–50%, red below 20%. Add a companion time series panel showing how the remaining budget has trended over the past 30 days — a downward slope reveals a sustained burn.
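
The nested PromQL above reduces to one division. The same computation in plain Python, with an invented 30-day error ratio for illustration:

```python
def budget_remaining_fraction(error_ratio_window: float,
                              slo_target: float = 0.999) -> float:
    """Mirror of the gauge query: share of the error budget still unspent.

    error_ratio_window is the fraction of 'bad' requests over the SLO
    window (here, requests slower than the 200ms threshold over 30 days).
    """
    budget = 1 - slo_target
    return 1 - (error_ratio_window / budget)

# 0.03% of requests were slow, against a 0.1% budget
print(round(budget_remaining_fraction(0.0003), 3))  # 0.7 -> 70% remains
```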

Panel 3: Burn Rate (Stat Panel)

promql
# Current burn rate (1h window)
# A value of 1.0 = consuming at exactly the sustainable rate
# A value of 14.4 = will exhaust 30-day budget in ~2 days
sli:latency:error_ratio_1h / 0.001

Use a Stat panel with color thresholds: green below 1x, yellow at 3x, orange at 6x, red at 14.4x. This single number tells an on-call engineer whether they need to act immediately.

Don't alert on SLI value — alert on burn rate

A common mistake is setting a threshold alert like "fire if availability drops below 99.9%." This triggers on brief blips that consume negligible budget and misses slow, sustained degradations that quietly drain it. Burn-rate alerting catches both fast incidents and slow leaks while ignoring noise.

Putting It All Together

The full workflow forms a closed loop. Your OTel-instrumented services emit metrics. Prometheus stores them and evaluates recording rules that pre-compute SLI error ratios at multiple time windows. Alerting rules compare those ratios against burn-rate thresholds. When an alert fires, the on-call engineer opens the Grafana SLO dashboard to see the remaining error budget, the burn trajectory, and which SLI is degraded — then decides whether to roll back, scale up, or investigate further.

This system replaces gut-feel reliability with a quantitative framework. Your SLIs define what "healthy" means. Your SLOs set the bar. Your error budget tells you how much room you have. And your burn-rate alerts tell you when that room is shrinking too fast. Everything starts with the data OpenTelemetry collects.

Observability-Driven Development and Debugging Workflows

Most teams treat observability as an afterthought — they add log lines when a production incident is already underway, scrambling to understand behavior they never instrumented. Observability-Driven Development (ODD) flips this entirely: you instrument your code before you ship it, treating telemetry as a first-class artifact alongside your tests and documentation.

The core principle is simple. If you can't observe a behavior in production, you can't understand it, debug it, or improve it. ODD makes instrumentation part of the development loop, not a post-mortem reaction.

The Mindset Shift

Traditional development treats logging as a debugging tool — you sprinkle console.log or logger.info calls when something goes wrong, deploy a patched build, wait for the issue to recur, and hope you captured enough context. This is reactive and slow. ODD asks a different question: when this code runs in production, what do I need to see to understand its behavior?

| Reactive Approach | Observability-Driven Development |
| --- | --- |
| Add logging after an incident | Instrument during feature development |
| Telemetry covers known failure modes | Telemetry covers all meaningful behaviors |
| "We'll add metrics if this becomes a problem" | "We can't ship without metrics for this SLI" |
| Debugging requires new deploys to add context | Debugging uses existing traces, metrics, and span events |
| Tests verify correctness | Tests verify correctness and instrumentation |

In practice, this means every pull request that introduces a new endpoint, background job, or external call should also introduce the spans, metrics, and attributes needed to observe that code path. You review instrumentation in code review the same way you review error handling.

ODD and TDD are complementary

Test-Driven Development tells you what your code should do. Observability-Driven Development tells you what your code is actually doing in production. TDD guards against regressions in logic; ODD guards against regressions in performance, reliability, and user experience. Treat them as two sides of the same coin.

What "Instrument First" Looks Like

When you start building a new feature — say, a payment processing endpoint — ODD means you define your spans and metrics before writing the business logic. You think about what questions you'll ask in production: How long does payment authorization take? What's the failure rate by payment provider? Which step in the flow is slowest?

python
from opentelemetry import trace, metrics

tracer = trace.get_tracer("payments")
meter = metrics.get_meter("payments")

payment_duration = meter.create_histogram(
    "payment.process.duration",
    unit="ms",
    description="Time to process a payment end-to-end",
)
payment_counter = meter.create_counter(
    "payment.process.total",
    description="Total payment attempts by provider and outcome",
)

def process_payment(order):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.provider", order.provider)
        span.set_attribute("payment.amount", order.amount_cents)
        span.set_attribute("order.id", order.id)

        # Business logic goes here — instrumentation came first
        result = authorize_and_capture(order)

        span.set_attribute("payment.outcome", result.status)
        payment_counter.add(1, {"provider": order.provider, "outcome": result.status})
        payment_duration.record(result.duration_ms, {"provider": order.provider})

        return result

Notice the pattern: the span wraps the operation, attributes capture the dimensions you'll query by, and the metric records the SLI you'll alert on. The business logic (authorize_and_capture) is almost secondary — the observable scaffolding was there first.

The Debugging Loop

ODD pays off most visibly during incident response. When your system is well-instrumented, debugging follows a consistent, repeatable loop — from broad signal to precise root cause. Instead of guessing and grepping through log files, you systematically narrow scope using the three pillars of observability.

flowchart TD
    A["🔔 Alert Fires\n(SLO burn rate exceeded)"] --> B["📊 Query Metrics\nIdentify affected SLI"]
    B --> C["🔍 Pivot to Traces\nFind slow/errored requests"]
    C --> D["🏷️ Filter by Attributes\nEndpoint, region, user segment"]
    D --> E["📋 Drill into Single Trace\nInspect span waterfall"]
    E --> F["📝 Read Span Events & Logs\nExceptions, retries, timeouts"]
    F --> G["🎯 Identify Root Cause"]
    G --> H["🔧 Fix & Deploy"]
    H --> I["✅ Verify with Telemetry\nConfirm SLI recovery"]
    I -->|"New alert?"| A

    style A fill:#ff6b6b,stroke:#c92a2a,color:#fff
    style G fill:#51cf66,stroke:#2b8a3e,color:#fff
    style I fill:#339af0,stroke:#1864ab,color:#fff
    

Each step in this loop narrows the blast radius. You start with "something is wrong" and end with "this specific line of code, in this specific service, under these specific conditions, is the cause." The key enabler is rich, structured telemetry — without attributes on your spans, you can't filter; without span events, you can't see exceptions; without correlated trace IDs, you can't pivot from metrics to traces.

Workflow 1: Alert-Driven Investigation

This is the most common debugging workflow. An SLO burn-rate alert fires — say, your checkout service's error rate has spiked beyond the budget. Here's how you walk through it with proper instrumentation in place.

  1. Scope the problem with metrics

    Open your metrics dashboard and look at the SLI that triggered the alert. Query your error rate broken down by endpoint, status code, and region. You discover that POST /checkout in us-east-1 has a 15% error rate, up from the baseline 0.3%.

    promql
    sum(rate(http_server_request_duration_seconds_count{
      http_response_status_code=~"5..",
      http_route="/checkout"
    }[5m])) by (deployment_environment, cloud_region)
    /
    sum(rate(http_server_request_duration_seconds_count{
      http_route="/checkout"
    }[5m])) by (deployment_environment, cloud_region)
  2. Pivot to traces

    Now you switch to your trace backend and search for traces from POST /checkout in us-east-1 that have error status. Use exemplars on your metric chart if your tooling supports them — clicking a data point takes you directly to a representative trace.

  3. Find common attributes

    Look at the errored traces as a group. Use your query tool to group by attributes: are they all hitting the same downstream service? Same payment provider? Same database shard? You find that 100% of errors have payment.provider=stripe and the error message is connection timeout.

  4. Drill into a single trace

    Pick one representative trace and open the span waterfall. You see the process_payment span takes 30 seconds (your timeout) and its child span HTTP POST stripe.com/v1/charges shows a network timeout. The span event attached to the exception includes the full stack trace.

  5. Identify root cause and verify the fix

    Stripe's status page confirms a regional outage in us-east-1. You enable your fallback payment provider, deploy, and watch the SLI recover on the same metrics dashboard that alerted you. The loop is closed.

Workflow 2: Customer-Reported Issue

A customer writes in: "My order from 2 hours ago still shows as processing." There's no alert because the overall SLI is healthy — this is a single-user issue. Without ODD, you'd be searching through log files with grep. With proper instrumentation, you have a direct path.

Search your trace backend for traces with user.id = "cust_8293" in the last 3 hours. You find their checkout trace, open it, and see the send_order_confirmation span failed with a serialization error. The span event shows the customer's address contained a Unicode character that broke your email template renderer. The fix is a one-line encoding change, and you can verify it by asking the customer to retry and watching the new trace complete cleanly.

text
# Trace search query — varies by backend, but attributes are universal
# Jaeger:    tag=user.id:cust_8293
# Tempo:     { span.user.id = "cust_8293" }
# Honeycomb: WHERE user.id = "cust_8293" LAST 3 HOURS

# What you see in the trace:
span: POST /checkout          duration: 2.4s   status: OK
  span: validate_cart          duration: 45ms   status: OK
  span: process_payment        duration: 1.8s   status: OK
  span: send_order_confirmation duration: 580ms  status: ERROR
    event: exception
      type: UnicodeEncodeError
      message: "'ascii' codec can't encode character '\\xf1' in position 42"

This workflow is only possible if user.id is set as a span attribute. That's the ODD mindset — you add user.id during development because you know you'll need to search by it later, not because a support ticket demanded it.

Workflow 3: Canary Deployment Comparison

You've deployed a canary that serves 5% of traffic. Before promoting it, you need to compare its behavior against the baseline. This is where metrics and traces work in concert.

Query your latency histogram filtered by deployment.canary=true versus deployment.canary=false. The canary's p99 latency is 40% higher. Now pivot to traces: pull a sample of slow traces from the canary and compare them to baseline traces for the same endpoint. Using Jaeger's trace comparison feature or Honeycomb's BubbleUp, you see that canary traces spend extra time in a new validate_inventory span that makes a synchronous database call you didn't intend. The N+1 query pattern is obvious in the trace waterfall — each cart item triggers a separate database span.

Attributes make canary analysis possible

Add deployment.version, deployment.canary, and deployment.id as resource attributes in your OTel SDK configuration. These propagate to every span and metric automatically, letting you slice all telemetry by deployment without changing application code.

Tooling for These Workflows

The workflows above are tool-agnostic, but the experience varies significantly depending on which observability platform you use. The common requirement is the ability to move fluidly between metrics, traces, and logs — and to filter by arbitrary attributes at each step.

| Tool | Strength | Best For |
| --- | --- | --- |
| Grafana Explore | Unified query interface across Prometheus, Tempo, and Loki. "Trace to logs" and "trace to metrics" links let you pivot between signals in a single UI. | Teams already in the Grafana ecosystem who want metrics→traces→logs correlation without vendor lock-in. |
| Honeycomb | Query builder with BubbleUp automatically surfaces attributes that differ between slow and fast requests. Designed around high-cardinality trace analysis. | Investigating unknown-unknowns — when you don't know which attribute is the culprit and need the tool to surface it. |
| Jaeger | Trace comparison feature lets you diff two traces side-by-side. Lightweight, open-source, and OTel-native. | Canary vs. baseline comparisons, open-source-first teams, and environments where you need to self-host. |

Tools don't replace instrumentation

No observability platform can surface insights from telemetry that doesn't exist. If your spans lack attributes, your metrics lack dimensions, or your logs lack correlation IDs, even the best query builder will return nothing useful. The investment in ODD is in your instrumentation — the tool just makes querying it comfortable.

Best Practices, Anti-Patterns, and Cost Management

Instrumenting your applications is only half the battle. The difference between an observability setup that scales gracefully and one that collapses under its own weight comes down to operational discipline. This section distills hard-won lessons from production deployments into concrete do's, don'ts, and cost-control strategies.

Best Practices

1. Always Set service.name and deployment.environment

These two resource attributes are the most important metadata you can attach to your telemetry. Without service.name, your traces and metrics land in your backend as anonymous noise. Without deployment.environment, you can't distinguish staging traffic from production incidents.

yaml
# otel-collector-config.yaml — resource processor
processors:
  resource:
    attributes:
      - key: service.name
        value: "checkout-service"
        action: upsert
      - key: deployment.environment
        value: "production"
        action: upsert
      - key: service.version
        value: "2.4.1"
        action: upsert

Set these via environment variables (OTEL_RESOURCE_ATTRIBUTES) or in the SDK resource configuration so they're present on every span, metric, and log record from the start.

2. Use Semantic Conventions Consistently

OpenTelemetry defines semantic conventions for common attribute names — http.request.method, db.system, rpc.service, and hundreds more. When every team invents their own names (httpMethod, method, request_type), you lose the ability to write cross-service queries and reuse dashboards. Stick to the conventions, and your tooling will reward you.

3. Batch Exports — Never Block the Request Path

Span and metric export should happen asynchronously in the background. The BatchSpanProcessor queues completed spans and flushes them in batches, adding negligible overhead to the request path. The SimpleSpanProcessor exports each span synchronously and exists only for debugging.

python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()

# Good: batched, async export
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317"),
    max_queue_size=2048,
    max_export_batch_size=512,
    schedule_delay_millis=5000,
)
provider.add_span_processor(processor)

4. Set Memory Limits on the Collector

An unbounded Collector will consume all available memory under traffic spikes and get OOM-killed. The memory_limiter processor is non-negotiable in any production pipeline. Place it as the first processor in your pipeline so it can apply backpressure before the queue grows unbounded.

yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500        # Hard cap
    spike_limit_mib: 512   # Reserve for bursts

service:
  pipelines:
    traces:
      processors: [memory_limiter, batch]  # memory_limiter FIRST

5. Use Tail Sampling for Cost Control

Head sampling (deciding at trace creation) is simple but blind — it randomly discards slow requests and errors. Tail sampling waits until a trace is complete, then applies policies: keep all errors, keep traces over 2 seconds, sample the rest at 10%. This dramatically reduces storage costs while preserving every trace you'd actually investigate.

yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline-sample
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

6. Correlate All Three Signals

The real power of OpenTelemetry is connecting traces, metrics, and logs into one coherent picture. Inject trace_id and span_id into every log line so you can jump from an error log straight to the trace. Use exemplars on metrics so a latency spike on a histogram links directly to an example slow trace.

python
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

# Inject trace context into structured logs
span = trace.get_current_span()
ctx = span.get_span_context()
logger.info(
    "Payment processed",
    extra={
        "trace_id": format(ctx.trace_id, "032x"),
        "span_id": format(ctx.span_id, "016x"),
        "amount": 49.99,
    },
)

7. Start with Auto-Instrumentation, Add Manual Spans Incrementally

Auto-instrumentation libraries cover HTTP servers, database clients, gRPC, and message queues out of the box — often with zero code changes. Get this baseline deployed first. Then add manual spans only where you need business-level visibility: payment processing, recommendation ranking, or complex workflows that auto-instrumentation can't see inside.

Tip

A good rule of thumb: if you can answer "why was this request slow?" with your current spans, you have enough. Add manual spans only when the auto-instrumented trace leaves a gap you've actually hit during an investigation.

Anti-Patterns to Avoid

These are the patterns that seem reasonable at first but cause real operational pain at scale. Most of them are expensive to fix after the fact, so catching them early matters.

1. Cardinality Explosion in Metric Labels

This is the single most common way teams blow up their observability costs. Every unique combination of label values creates a separate time series. If you add user_id as a metric label and you have 1 million users, a single counter becomes 1 million time series. Your Prometheus instance runs out of memory, your vendor bill skyrockets, and queries grind to a halt.

| Label | Cardinality | Verdict |
| --- | --- | --- |
| http.request.method | ~7 (GET, POST, etc.) | ✅ Safe |
| http.response.status_code | ~50 | ✅ Safe |
| http.route | ~100 (bounded by routes) | ✅ Safe |
| user.id | Millions | ❌ Never use as a metric label |
| request.id | Unbounded | ❌ Never use as a metric label |
| db.statement | Unbounded (raw SQL) | ❌ Use as a span attribute, not a label |

Put high-cardinality identifiers on span attributes (where they're invaluable for search) and keep metric labels to bounded, low-cardinality dimensions.
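The blow-up is multiplicative: each label's distinct values multiply together. A quick sketch (label names and counts are illustrative):

```python
# Estimate how many time series a metric produces: one per unique
# combination of label values, so the per-label counts multiply.
from math import prod

def series_count(label_cardinalities: dict) -> int:
    return prod(label_cardinalities.values())

# Bounded labels stay manageable
safe = series_count({"http.request.method": 7,
                     "http.response.status_code": 50,
                     "http.route": 100})
print(safe)   # 35000

# Add one unbounded label and the same counter explodes
risky = series_count({"http.route": 100, "user.id": 1_000_000})
print(risky)  # 100000000
```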

2. Over-Instrumentation — Spans for Every Function Call

Creating a span for every internal function turns a single API call into a trace with hundreds of spans. This inflates export volume, slows down the trace viewer, and buries the signal in noise. Spans should represent units of work with clear boundaries — an HTTP request, a database query, a queue publish — not internal method calls like validateInput() or formatResponse().

3. Ignoring Context Propagation

If you see traces that start and stop at service boundaries instead of flowing end-to-end, context propagation is broken. This usually means one of: the HTTP client library isn't injecting traceparent headers, a reverse proxy is stripping them, or a service is using the wrong propagator format. Always verify propagation works across your entire call chain during setup — not during an outage.
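One quick way to verify propagation during setup is to log the traceparent header at each hop and check that the trace ID stays constant. A minimal parser for the W3C format (a sketch, not a full spec implementation):

```python
# Parse a W3C traceparent header: version-traceid-spanid-flags
def parse_traceparent(header: str) -> dict:
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,         # 32 hex chars, shared across the trace
        "parent_span_id": parent_id,  # 16 hex chars, the caller's span
        "sampled": bool(int(flags, 16) & 0x01),
    }

tp = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(tp["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
print(tp["sampled"])   # True
```

If the trace ID changes between two services, propagation is broken at that hop.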

4. Dashboard Sprawl Without SLOs

Teams often respond to a new observability platform by building dozens of dashboards for every metric they can find. Within a month, nobody looks at them. Dashboards without a purpose are just noise. Start from Service Level Objectives — "99.5% of checkout requests complete in under 500ms" — and build the dashboards, alerts, and burn-rate calculations that directly serve those SLOs.
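The burn-rate math behind SLO alerts is simple: how fast you are consuming the error budget relative to a sustainable pace. A sketch (the 2% error ratio is illustrative):

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    # The error budget is what the SLO allows: 1 - target.
    # Burn rate 1.0 means the budget runs out exactly at window end.
    return error_ratio / (1.0 - slo_target)

# SLO: 99.5% of checkout requests complete in under 500ms
rate = burn_rate(error_ratio=0.02, slo_target=0.995)
print(round(rate, 2))  # 4.0, burning budget 4x faster than sustainable
```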

5. Sending Unsampled Traces at High Volume

A service handling 10,000 requests per second can easily generate 100,000 spans per second once downstream calls are included. At 1 KB per span, that's ~100 MB/s of raw telemetry. Without sampling, you're paying to store data nobody will ever query. Apply head sampling at the SDK level for baseline reduction, and tail sampling at the Collector for intelligent retention.
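The arithmetic behind that figure, as a quick check (span size and fan-out are the assumptions stated above):

```python
requests_per_sec = 10_000
spans_per_request = 10    # assumption: ~10 spans once downstream calls are included
bytes_per_span = 1_000    # ~1 KB per span is typical for OTLP

mb_per_sec = requests_per_sec * spans_per_request * bytes_per_span / 1_000_000
print(mb_per_sec)  # 100.0 MB/s before any sampling
```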

6. Not Setting Up Resource Detection

When you skip resource detectors, your telemetry lacks host, container, and cloud metadata. You can't filter by Kubernetes namespace, can't group by EC2 instance type, and can't correlate with infrastructure metrics. OTel SDKs and the Collector both offer resource detectors for AWS, GCP, Azure, and Kubernetes — enable them.

Warning

Anti-patterns 1 and 5 are the top two drivers of runaway observability costs. A single cardinality explosion or unsampled high-volume service can consume more budget than the rest of your infrastructure combined.

Cost Management Strategies

Observability costs scale with data volume. In a vendor-hosted model, you pay per GB ingested or per million spans; in a self-hosted model, you pay in storage, compute, and engineering time. Either way, controlling the volume of data you generate, transmit, and retain is the primary lever.

Estimate Data Volumes Before You Ship

Do the math before enabling tracing on a high-traffic service. A rough formula:

bash
# Back-of-envelope data volume estimation
requests_per_sec=5000
avg_spans_per_trace=8
bytes_per_span=1000          # ~1 KB is typical for OTLP
sample_rate=0.10             # 10% sampling

daily_gb=$(echo "scale=1; $requests_per_sec * $avg_spans_per_trace * $bytes_per_span * $sample_rate * 86400 / 1073741824" | bc)
echo "Estimated daily volume: ${daily_gb} GB"
# At 5K rps, 8 spans, 10% sample → ~322 GB/day

Use Collector Processors to Filter and Drop

The Collector pipeline is the ideal place to shed unnecessary data before it reaches your backend. Use the filter processor to drop health-check spans, the attributes processor to strip verbose attributes, and the transform processor for more complex logic.

yaml
processors:
  filter/drop-health:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
        - 'attributes["http.route"] == "/readyz"'

  attributes/strip-verbose:
    actions:
      - key: db.statement
        action: delete    # Remove raw SQL — use db.operation instead
      - key: http.request.header.authorization
        action: delete    # Never store auth headers

Aggregate Metrics Before Export

Instead of exporting raw histogram observations, use delta temporality and pre-aggregate at the SDK or Collector level. The metricstransform processor can combine metrics, rename them, and toggle aggregation temporality. If you're self-hosting Prometheus, the remote_write path with recording rules reduces the cardinality that hits long-term storage.
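Delta temporality exports only the change per interval rather than a running total, which is what makes downstream aggregation cheap. The difference in one sketch:

```python
# A counter reported with cumulative temporality: running totals per interval
cumulative = [10, 25, 40, 70]

# The same counter with delta temporality: only the per-interval increments
delta = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
print(delta)       # [10, 15, 15, 30]
print(sum(delta))  # 70, the same final total either way
```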

Use Tiered Retention

Not all telemetry needs the same retention window. A practical tiered approach:

| Data Type | Retention | Rationale |
| --- | --- | --- |
| Raw spans (sampled) | 7–14 days | Enough for active incident investigation |
| Error/slow traces | 30–90 days | Supports post-mortems and trend analysis |
| Aggregated metrics | 13 months | Year-over-year comparison for capacity planning |
| Raw logs | 7–30 days | Most logs are only useful during active debugging |
| SLO burn-rate metrics | 13 months | Tracks reliability trends over full error budget windows |

Note

Cost management is iterative. Start by measuring your current ingest volume per service, identify the top 3 contributors, and apply targeted sampling or filtering there first. A 10% reduction on your noisiest service often saves more than optimizing everything else combined.
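To see why tiering matters, steady-state storage is roughly daily volume times retention per tier. A sketch with made-up volumes (substitute your measured ingest rates):

```python
# All volumes are illustrative assumptions, in GB/day per data type.
daily_gb  = {"raw_spans": 50, "error_traces": 2, "agg_metrics": 1, "raw_logs": 80}
retention = {"raw_spans": 14, "error_traces": 90, "agg_metrics": 395, "raw_logs": 14}

steady_state = sum(daily_gb[k] * retention[k] for k in daily_gb)
print(steady_state)  # 2395 GB held at steady state with tiering

# The same data kept uniformly for 90 days instead:
uniform = sum(v * 90 for v in daily_gb.values())
print(uniform)       # 11970 GB, roughly 5x the tiered footprint
```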

Production Deployment Patterns and Scaling OTel

Getting OpenTelemetry running locally is straightforward. Getting it running reliably in production — handling millions of spans per second, surviving node failures, and securing every hop — is a different challenge entirely. This section covers the deployment topologies, scaling strategies, and security configurations you need for real-world OTel infrastructure.

Kubernetes Deployment Architecture

The standard Kubernetes deployment uses a two-tier Collector architecture: lightweight agent Collectors on every node that forward to centralized gateway Collectors that handle aggregation, tail sampling, and export. The OTel Operator automates SDK injection into application pods via annotations.

flowchart LR
    subgraph Node1["K8s Node 1"]
        App1["App Pod\n(auto-injected OTel SDK)"]
        App2["App Pod\n(auto-injected OTel SDK)"]
        DS1["DaemonSet Collector\n(Agent)"]
        App1 -->|OTLP| DS1
        App2 -->|OTLP| DS1
    end

    subgraph Node2["K8s Node 2"]
        App3["App Pod\n(auto-injected OTel SDK)"]
        App4["App Pod\n(auto-injected OTel SDK)"]
        DS2["DaemonSet Collector\n(Agent)"]
        App3 -->|OTLP| DS2
        App4 -->|OTLP| DS2
    end

    subgraph Gateway["Gateway Tier"]
        GW1["Gateway Collector\n(Deployment replica 1)"]
        GW2["Gateway Collector\n(Deployment replica 2)"]
    end

    DS1 -->|OTLP/gRPC| GW1
    DS1 -->|OTLP/gRPC| GW2
    DS2 -->|OTLP/gRPC| GW1
    DS2 -->|OTLP/gRPC| GW2

    GW1 --> Tempo["Tempo\n(Traces)"]
    GW1 --> Mimir["Mimir\n(Metrics)"]
    GW1 --> Loki["Loki\n(Logs)"]
    GW2 --> Tempo
    GW2 --> Mimir
    GW2 --> Loki

    Operator["OTel Operator"] -.->|inject sidecar/init-container| App1
    Operator -.->|inject sidecar/init-container| App3
    

Agent Collectors run as a DaemonSet so every node has exactly one. They receive telemetry from local pods over OTLP, perform lightweight processing (batching, resource attribution), and forward to the gateway tier. Gateway Collectors run as a Deployment (or StatefulSet for stateful sampling) and handle the heavy work: tail sampling, span-to-metrics generation, and fan-out to multiple backends.

DaemonSet Collector (Agent Mode)

The agent Collector runs on every node. Its job is to receive telemetry cheaply, add node-level metadata, and forward everything to the gateway. Keep the agent pipeline lean — no tail sampling, no complex transformations.

yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-agent
  template:
    metadata:
      labels:
        app: otel-agent
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.102.0
          args: ["--config=/conf/agent-config.yaml"]
          ports:
            - containerPort: 4317  # OTLP gRPC
              hostPort: 4317
            - containerPort: 4318  # OTLP HTTP
              hostPort: 4318
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          volumeMounts:
            - name: config
              mountPath: /conf
      volumes:
        - name: config
          configMap:
            name: otel-agent-config

The corresponding agent configuration keeps things minimal — receive, batch, and forward:

yaml
# agent-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.node.name

exporters:
  otlp:
    endpoint: otel-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: false
      ca_file: /etc/tls/ca.crt

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp]

Gateway Collector (Deployment Mode)

Gateway Collectors handle aggregation, tail sampling, and export. They run as a Deployment so you can scale replicas independently of node count. For tail sampling, use a StatefulSet with the loadbalancing exporter on agents to route by trace_id — this ensures all spans for a given trace land on the same gateway instance.

yaml
# gateway-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 512
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-policy
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
  batch:
    timeout: 10s
    send_batch_size: 2048

exporters:
  otlphttp/tempo:
    endpoint: https://tempo.observability.svc.cluster.local:4318
  prometheusremotewrite/mimir:
    endpoint: https://mimir.observability.svc.cluster.local/api/v1/push
  otlphttp/loki:
    endpoint: https://loki.observability.svc.cluster.local:3100/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlphttp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite/mimir]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/loki]

Tail sampling requires trace affinity

The tail_sampling processor needs all spans for a trace on the same Collector instance. Use the loadbalancing exporter on agents with routing_key: traceID so spans are consistently hashed to the same gateway. Without this, tail sampling decisions will be based on incomplete traces.

OTel Operator and Auto-Instrumentation

The OpenTelemetry Operator is a Kubernetes operator that manages Collector instances and injects auto-instrumentation into application pods. Instead of manually adding SDK dependencies and init code, you annotate your pods and the Operator handles injection via an init container.

yaml
# Instrumentation CRD — tells the Operator how to inject
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: otel-instrumentation
  namespace: my-app
spec:
  exporter:
    endpoint: http://otel-agent.observability.svc.cluster.local:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.4.0
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.46b0
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.51.0

Then annotate your application pods to trigger injection:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  template:
    metadata:
      annotations:
        # Pick ONE based on your runtime:
        instrumentation.opentelemetry.io/inject-java: "true"
        # instrumentation.opentelemetry.io/inject-python: "true"
        # instrumentation.opentelemetry.io/inject-nodejs: "true"
    spec:
      containers:
        - name: payment-service
          image: myregistry/payment-service:v2.1.0

Deploying with Helm

The official Helm chart supports both agent (DaemonSet) and gateway (Deployment) modes. You can deploy the full two-tier architecture with a single chart by overriding the right values:

yaml
# values-agent.yaml (for DaemonSet agents)
mode: daemonset
image:
  repository: otel/opentelemetry-collector-contrib
  tag: 0.102.0
resources:
  limits:
    cpu: 500m
    memory: 512Mi
ports:
  otlp:
    enabled: true
    hostPort: 4317
config:
  exporters:
    otlp:
      endpoint: otel-gateway.observability:4317
  service:
    pipelines:
      traces:
        exporters: [otlp]
      metrics:
        exporters: [otlp]
      logs:
        exporters: [otlp]

bash
# Install both tiers
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts

helm install otel-agent open-telemetry/opentelemetry-collector \
  -f values-agent.yaml -n observability

helm install otel-gateway open-telemetry/opentelemetry-collector \
  -f values-gateway.yaml -n observability

# Install the Operator for auto-instrumentation
helm install otel-operator open-telemetry/opentelemetry-operator \
  -n observability --set admissionWebhooks.certManager.enabled=true

Non-Kubernetes Deployments

Not every workload runs on Kubernetes. For VMs and bare-metal servers, the Collector runs as a systemd service. For containerized but non-orchestrated workloads, a Docker sidecar pattern works well.

ini
# /etc/systemd/system/otel-collector.service
[Unit]
Description=OpenTelemetry Collector
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=otel
Group=otel
ExecStart=/usr/local/bin/otelcol-contrib --config /etc/otel/config.yaml
Restart=always
RestartSec=5
MemoryMax=512M
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Install the binary, drop the config in /etc/otel/config.yaml, then systemctl enable --now otel-collector. The MemoryMax directive acts as a system-level safety net alongside the memory_limiter processor.

yaml
# docker-compose.yml
services:
  my-app:
    image: myregistry/my-app:latest
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
    depends_on:
      - otel-collector

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.102.0
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./collector-config.yaml:/etc/otel/config.yaml:ro
    ports:
      - "4317:4317"
      - "4318:4318"
    deploy:
      resources:
        limits:
          memory: 512M

The sidecar pattern co-locates the Collector with your application. The app sends to otel-collector:4317 over Docker's internal network — no host port exposure needed for the data path.

Scaling Gateway Collectors

Horizontal scaling of gateway Collectors requires careful thought, especially when tail sampling is involved. The key constraint: all spans for a single trace must reach the same gateway instance, or the sampler makes decisions on incomplete data.

Load Balancing by Trace ID

Configure the loadbalancing exporter on your agent Collectors. It uses consistent hashing on traceID to route spans to specific gateway backends, discovered via DNS or Kubernetes services:

yaml
# Agent exporter config for trace-aware load balancing
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: false
          ca_file: /etc/tls/ca.crt
    resolver:
      dns:
        hostname: otel-gateway-headless.observability.svc.cluster.local
        port: 4317

Use a headless Service (no ClusterIP) for the gateway so DNS returns individual pod IPs. The loadbalancing exporter resolves them and hashes traceID to pick a target. When gateway pods scale up or down, the resolver detects the change and redistributes.
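The routing invariant can be sketched in a few lines. Note that the real loadbalancing exporter uses a consistent-hash ring, so scaling events remap only a fraction of traces; the naive modulo hashing below remaps far more, but the affinity property it demonstrates is the same:

```python
import hashlib

def pick_gateway(trace_id: str, gateways: list) -> str:
    # Hash the trace ID, then map it to one gateway. Every span that
    # carries the same trace_id deterministically picks the same target.
    digest = hashlib.sha256(trace_id.encode()).digest()
    return gateways[int.from_bytes(digest[:8], "big") % len(gateways)]

gateways = ["gw-0:4317", "gw-1:4317", "gw-2:4317"]
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

# Spans arriving from different agents still converge on one gateway
assert pick_gateway(trace_id, gateways) == pick_gateway(trace_id, gateways)
```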

Kafka as a Buffer

At very high throughput (100k+ spans/second), place Kafka or Pulsar between agents and gateways. This decouples producers from consumers, absorbs traffic spikes, and provides replay capability if a gateway goes down.

yaml
# Agent side: export to Kafka
exporters:
  kafka:
    brokers: ["kafka-0:9092", "kafka-1:9092", "kafka-2:9092"]
    topic: otel-traces
    encoding: otlp_proto
    producer:
      max_message_bytes: 10000000
      compression: zstd

---
# Gateway side: consume from Kafka
receivers:
  kafka:
    brokers: ["kafka-0:9092", "kafka-1:9092", "kafka-2:9092"]
    topic: otel-traces
    encoding: otlp_proto
    group_id: otel-gateway
    initial_offset: latest

Backpressure Handling

The memory_limiter processor is your first line of defense against OOM kills. It should be the first processor in every pipeline. When memory exceeds the soft limit, it starts refusing data; when it drops below, it resumes. Combine this with retry policies on exporters:

yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500       # Hard limit
    spike_limit_mib: 512  # Soft limit = limit_mib - spike_limit_mib = 988 MiB

exporters:
  otlphttp/tempo:
    endpoint: https://tempo:4318
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000

Processor ordering matters

Always place memory_limiter first in the processor list. If it comes after a processor that buffers data (like batch), memory can spike past the limit before the limiter gets a chance to act. The correct order is: memory_limiter → your processors → batch.

Security Configuration

Telemetry data often contains sensitive information — HTTP headers, database query parameters, user IDs. Every link in the telemetry pipeline needs encryption and authentication.

TLS and mTLS

Configure TLS on both the receiver (server) and exporter (client) side. For mTLS between services and the Collector, each party presents a certificate and verifies the other's:

yaml
# Receiver with mTLS (server side)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/tls/server.crt
          key_file: /etc/tls/server.key
          client_ca_file: /etc/tls/ca.crt  # Verify client certs

# Exporter with mTLS (client side)
exporters:
  otlp:
    endpoint: gateway.observability:4317
    tls:
      cert_file: /etc/tls/client.crt
      key_file: /etc/tls/client.key
      ca_file: /etc/tls/ca.crt

Authentication and RBAC

Use the bearertokenauth or oidcauth extensions to authenticate exporters against backends. For multi-tenant setups, the headerssetter extension can inject tenant-specific headers:

yaml
extensions:
  bearertokenauth:
    token: "${env:OTEL_AUTH_TOKEN}"
  basicauth/server:
    htpasswd:
      inline: |
        agent-user:$2y$10$hashed_password_here

# Protect the receiver with basic auth
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        auth:
          authenticator: basicauth/server

# Authenticate to the backend with bearer token
exporters:
  otlphttp/tempo:
    endpoint: https://tempo.example.com:4318
    auth:
      authenticator: bearertokenauth

service:
  extensions: [bearertokenauth, basicauth/server]

On the Kubernetes side, apply RBAC to the Collector's ServiceAccount. The agent needs read access to pod metadata (for k8sattributes), but nothing more:

yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-agent
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces", "nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-agent
subjects:
  - kind: ServiceAccount
    name: otel-agent
    namespace: observability
roleRef:
  kind: ClusterRole
  name: otel-agent
  apiGroup: rbac.authorization.k8s.io

Migration Strategy: Running OTel Alongside Existing Tools

Most teams don't greenfield their observability stack. You already have Prometheus scraping metrics, Jaeger agents collecting traces, and Fluentd shipping logs. The migration to OTel should be incremental — not a flag day.

Phase 1: Dual-Write with OTel Collector

Deploy the OTel Collector as a sidecar or gateway that receives from your existing agents and from new OTel-instrumented services. Use the Collector's multi-exporter capability to write to both old and new backends simultaneously:

yaml
# Dual-write config: accept Jaeger + OTLP, export to both backends
receivers:
  jaeger:
    protocols:
      thrift_http:
        endpoint: 0.0.0.0:14268
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  prometheus:
    config:
      scrape_configs:
        - job_name: 'existing-services'
          kubernetes_sd_configs:
            - role: pod

exporters:
  # Old backends (keep during transition). Jaeger accepts OTLP natively,
  # and the dedicated jaeger exporter has been removed from recent
  # Collector releases, so export to it over OTLP gRPC.
  otlp/jaeger:
    endpoint: jaeger-collector.tracing:4317
    tls:
      insecure: true
  prometheusremotewrite/old:
    endpoint: http://old-prometheus:9090/api/v1/write
  # New backends
  otlphttp/tempo:
    endpoint: http://tempo:4318
  prometheusremotewrite/mimir:
    endpoint: http://mimir:9009/api/v1/push

service:
  pipelines:
    traces:
      receivers: [jaeger, otlp]
      processors: [batch]
      exporters: [otlp/jaeger, otlphttp/tempo]  # dual-write
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite/old, prometheusremotewrite/mimir]

Phased Rollout Plan

| Phase | What Changes | Duration | Rollback Plan |
| --- | --- | --- | --- |
| 1 — Collector deploy | Deploy OTel Collector alongside existing agents. No app changes. | 1–2 weeks | Remove Collector — zero app impact |
| 2 — Dual-write | Route existing agent output through OTel Collector. Write to both old and new backends. | 2–4 weeks | Revert agent configs to point directly at old backends |
| 3 — Instrument new services | New services use OTel SDK / auto-instrumentation. Existing services unchanged. | Ongoing | New services fall back to legacy agents |
| 4 — Migrate existing services | Replace Jaeger/Prometheus client libs with OTel SDK, service by service. | 4–12 weeks | Per-service rollback via feature flags or revert |
| 5 — Decommission legacy | Remove old agents, stop dual-writing, decommission old backends. | 1–2 weeks | Re-enable dual-write if gaps are found |

Validate with comparison dashboards

During phases 2–4, build dashboards that compare data from old and new backends side by side. Look for discrepancies in trace counts, metric values, and log volumes. Only move to the next phase when the numbers converge. This is your safety net — don't skip it.
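A convergence check like the one those dashboards encode can be expressed directly (the 5% tolerance is an assumption; tune it per signal):

```python
def converged(old_backend: float, new_backend: float, tolerance: float = 0.05) -> bool:
    # True when the two backends agree within a relative tolerance
    if old_backend == 0:
        return new_backend == 0
    return abs(new_backend - old_backend) / old_backend <= tolerance

assert converged(old_backend=10_000, new_backend=9_800)      # 2% gap: fine
assert not converged(old_backend=10_000, new_backend=8_500)  # 15% gap: investigate
```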