Observability signal types

Making sense of OpenTelemetry logs, events, metrics and traces

We will start with a short introduction to observability, go through the signal types, add some notes about how they fit together, and end with a conclusion and suggestions for further reading.

Introduction to OpenTelemetry (OTel) signal types

Observability should make it easy to observe what is going on in a “system” - possibly distributed over multiple services, machines, operators and so on. A “system” can be anything from a single binary to a complex distributed microservice architecture including multiple software vendors and infrastructure providers on multiple machines.

Good observability should allow you to get concrete information about what a system does, how it performs and - if the unspeakable happens - what went wrong.

Debugging should be made easier through good observability instrumentation.

After decades of competing approaches, the industry now has a robust (de facto) standard for

  • a data model for the (signal) types of information and
  • a protocol (OTLP) to transfer that information between source systems and observability software like MLAP.

Anyone who has ever tried to put together data from multiple systems will notice how valuable that is.

No more regex field parsing.
No more metric formats hard-coded into an application.
No more lock-in to a vendor.

Signal types

Let us dive into the different signal types, or: What kinds of observability data are there anyway?

  • Logs as in “print this text telling that this happened right now”
  • Events as in “this well-known named thing happened right now”
  • Metrics as in “measure and count everything”
  • Traces as in “what happened when for how long for this one request”
  • Profiles as in “where did the processing time go on this system - statistically speaking” (still experimental)

Logs

TLDR: OTel logs are backwards-compatible with how logs have always worked: simple timestamped text - but now with a clear path to a more structured representation.

For software developers, logs are the most natural way to let their programs communicate about their internal state. Basically every software runtime has a way to print some text - Hello World being the canonical example of “look - the program is running”.

What is in a log entry? The classics are:

  • Message or body are the names used for the actual text.
  • Timestamps of the wall clock time and date bring an order to logs: What happened after what and when?
  • Severity is often used to indicate how important a message is or who should look at it (developers, operators, users).

After that, each layer of abstraction tends to add its own fields:

  • Application name or Facility to distinguish between different applications on the same machine
  • Host name to distinguish between different machines where logs originated

In large parts of the IT industry, the Syslog protocol (e.g. RFC 5424) is used to communicate logs. In reality, adherence to the protocol varies so wildly that expecting specific fields - let alone the use of structured data - is often not realistic.

One exception to this is the Windows world, where the Event Log serves the same purpose and is well structured. Think of an XML document that already contains fields like the program’s process ID, an event ID from a catalog and so on.

Yet, with systemd’s journald and applications starting to log in JSON formats for example, the direction for the non-Windows parts of the industry is clear, too: Structured is better than unstructured.

Even better when the structure is standardized and semantic meaning of fields is clear.

This is what OpenTelemetry logs are all about: Same use case (basically short text messages) but with a clear structure and semantic meaning of fields.

In OpenTelemetry, logs can have fields for non-standard, application-specific attributes. For example, logging an authorization event might have attributes for the user id, the resource id, and the result of the action. This can make searching for specific events like “show me all authorization events for user X that failed” very easy and efficient.
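To illustrate why attributes make such searches easy, here is a minimal sketch that filters structured log records by attribute values instead of parsing message text. The records are hypothetical and kept in memory; the helper `find_by_attributes` and the value "granted" are assumptions made for this illustration, not part of any OTel API.

```python
# Hypothetical in-memory log records; a real backend would query its storage instead.
records = [
    {"severity_text": "WARN", "body": "Authentication failed",
     "attributes": {"user.id": "alice", "auth.result": "denied"}},
    {"severity_text": "INFO", "body": "Authentication succeeded",
     "attributes": {"user.id": "alice", "auth.result": "granted"}},
    {"severity_text": "WARN", "body": "Authentication failed",
     "attributes": {"user.id": "bob", "auth.result": "denied"}},
]


def find_by_attributes(records, wanted):
    """Return all records whose attributes contain every key/value pair in `wanted`."""
    return [
        record for record in records
        if all(record["attributes"].get(key) == value for key, value in wanted.items())
    ]


# "Show me all authorization events for user alice that failed"
failed_for_alice = find_by_attributes(
    records, {"user.id": "alice", "auth.result": "denied"}
)
for record in failed_for_alice:
    print(record["severity_text"], record["body"])
```

No regular expression is involved: the query is an exact match on named fields, which is also what makes it cheap to index.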

Strengths:

  • Simple
  • Available everywhere
  • Flexible - it’s just text

Weaknesses:

  • Information contained in text is not structured for machines but for humans
  • Extracting information from text is not fun (think regular expressions)

When to use:

  • Information for humans to reason about what happened
  • Unexpected events that have no better way to be communicated to the user or operator

Caveats:

  • Keep your clocks synced to make the order of events useful between different systems
  • Prefer structured output to information buried in plain text

Example data in an OpenTelemetry log:

timestamp: "2026-03-27T15:50:36Z"
severity_text: "WARN"
body: "Authentication failed after 3 auth services returned DENY"
resource:
  service.name: "authentication-service"
  host.name: "backend-fra3-001"
attributes:
  user.id: "alice"
  auth.result: "denied"
  auth.reason: "unknown_username"

(Note: The payload is abbreviated for readability.)

Events

TLDR: OTel events are logs with an event name and (typically) without body/message text.

Events in the OpenTelemetry sense are basically logs without the unstructured part. They still document a named event at a point in time. The event name should identify the type of event and should not change over time. In particular, it should not contain variable information; that belongs in the attributes.

For a given event name, the attributes should also be consistent. This is often natural since the attribute values come from the state of the code that logs the event in the first place.
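One way to keep event names stable and attributes consistent is a small catalog that every emitted event is checked against. The following is a sketch under that assumption; `EVENT_SCHEMAS` and `make_event` are hypothetical names for this illustration and not part of the OTel SDKs.

```python
from datetime import datetime, timezone

# Hypothetical catalog: event name -> attribute keys every such event must carry.
EVENT_SCHEMAS = {
    "authentication.failure": {"user.id", "auth.result", "auth.reason"},
}


def make_event(event_name, attributes):
    """Build an event record, rejecting unknown names and inconsistent attributes."""
    expected = EVENT_SCHEMAS.get(event_name)
    if expected is None:
        raise ValueError(f"unknown event name: {event_name}")
    if set(attributes) != expected:
        raise ValueError(f"{event_name} expects exactly {sorted(expected)}")
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event_name": event_name,  # fixed identifier, no variable text
        "attributes": dict(attributes),
    }


event = make_event(
    "authentication.failure",
    {"user.id": "alice", "auth.result": "denied", "auth.reason": "unknown_username"},
)
print(event["event_name"], event["attributes"])
```

All variable information lives in the attributes, so the event name can be used directly as a search key or for counting.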

Strengths:

  • No message/body means less ambiguity
  • Still flexible using attributes

Weaknesses:

  • Not really backwards compatible with how people used logs in the past
  • Mostly an OTel thing even though Windows’ Event ID is similar in spirit

When to use:

  • Communicate that a specific, well-defined event happened
  • Green field projects where you can avoid dynamic messages/bodies from the start

Example data in an OpenTelemetry event:

timestamp: "2026-03-27T15:50:36Z"
event_name: "authentication.failure"
resource:
  service.name: "authentication-service"
  host.name: "backend-fra3-001"
attributes:
  user.id: "alice"
  auth.result: "denied"
  auth.reason: "unknown_username"

(Note: The payload is abbreviated for readability.)

Metrics

TLDR: A metric is a number that changes over time, associated with a name and optional attributes. Each name/attribute combination forms a timeseries.

Whenever you have a number and might want to look at it later - and especially at how it changed over time - you want a metric. Examples are the number of requests, temperatures, memory usage or the duration of a specific operation.

Typically, a metric is what we call something with a name - regardless of the attributes. For example, “number of incoming requests” is a metric.

A timeseries in comparison is a specific combination of a name and a set of key/value attributes. So there might be a timeseries “number of incoming requests” for each host, each service or even each user (but see below). Each number you put into a timeseries along with a timestamp (when was it measured) is a datapoint.
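The relationship between metric, timeseries and datapoint can be sketched with a plain dictionary. This is a simplified mental model, not how a real timeseries database stores data; the metric name `incoming.requests`, the second host name and the helper `record` are assumptions made for the illustration.

```python
from collections import defaultdict

# (metric name, frozen set of attribute pairs) -> list of (timestamp, value) datapoints
series = defaultdict(list)


def record(name, attributes, timestamp, value):
    """Append one datapoint to the timeseries identified by name + attributes."""
    key = (name, frozenset(attributes.items()))
    series[key].append((timestamp, value))


# One metric, two hosts: this yields two timeseries.
record("incoming.requests", {"host.name": "backend-fra3-001"}, 1000, 5)
record("incoming.requests", {"host.name": "backend-fra3-001"}, 1010, 7)
record("incoming.requests", {"host.name": "backend-fra3-002"}, 1000, 3)

print(len(series))  # number of distinct timeseries
for (name, attributes), datapoints in series.items():
    print(name, dict(attributes), datapoints)
```

Every new attribute value creates a new dictionary key, which is exactly the effect that drives cardinality.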

The number of different timeseries or attribute value combinations is called cardinality. (Too) high cardinality is the prototypical way people find out how many timeseries their database or storage can handle before running out of RAM or CPU. Millions of timeseries are typically fine even on small systems, but add several zeroes and both queries and ingestion get slower while the database uses more RAM.

Cardinality is determined by both the number of attribute names (keys) and the number of values each attribute can take. And they multiply, of course. So while it might seem like a good idea to count (say) requests per browser session ID, a trivial load test might generate millions of those in a short time.
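The multiplication is easy to underestimate, so here is a small worked example. The attribute names and the counts of distinct values are made-up numbers for illustration; the result is an upper bound, since not every combination necessarily occurs in practice.

```python
from math import prod


def max_cardinality(distinct_values_per_attribute):
    """Upper bound on timeseries for one metric: product of distinct values per attribute."""
    return prod(distinct_values_per_attribute.values())


# Hypothetical numbers of distinct values per attribute.
bounded = {"auth.result": 2, "auth.reason": 5, "host.name": 20}
with_session_id = {**bounded, "session.id": 1_000_000}

print(max_cardinality(bounded))          # 2 * 5 * 20 = 200
print(max_cardinality(with_session_id))  # 200 * 1,000,000 = 200,000,000
```

A single unbounded attribute turns a harmless metric into hundreds of millions of potential timeseries.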

Strengths:

  • Lightweight to store lots of datapoints - no problem storing a value every few seconds
  • The only way to watch systems over time at scale

Weaknesses:

  • High cardinality gets expensive quickly, so attributes must be chosen with care
  • No way to attach a message to an individual datapoint
  • No information about why something changed

When to use:

  • Monitoring system health
  • Alerting using thresholds
  • As soon as you want to see anything changing over time
  • Definitely when you feel the urge to put a number into a log message (don’t do that!)

Example data illustrating what’s in an OpenTelemetry metric data submission:

dataPoints:
- timeUnixNano: "1711558236000000000"
  metric_name: "authentication.requests"
  asInt: "124"
  attributes:
  - auth.result: "denied"
  - auth.reason: "unknown_username"
  # Notice we omit `user.id` here to avoid high cardinality!

(Note: The payload is abbreviated for readability.)

Traces

TLDR: A trace answers the question “Where was the time spent for this request?” by gathering “spans” for each part (load balancer, application code, database request etc.) processing the request.

To be able to answer that question, two things must come together:

  1. Each component that the request passes through must report a span to a central component along with some common ID which is the same for all spans of a trace. In OTel, this is the trace ID. MLAP is such a central component that accepts, processes and stores OTel traces in scalable storage.
  2. There must be a way to search for traces and view their details. In MLAP, this is possible using both Jaeger and Grafana’s Tempo.

Traces are especially useful in the context of distributed systems, i.e. systems made of multiple components where observing a request pass through manually (e.g. by attaching a debugger) is not practical.

If each component must report something somewhere - how does that work?

The somewhere part usually means configuring an endpoint where each component sends its span data as OTel telemetry data.

There must also be a way to pass the trace ID along with the request: OTel context propagation. One common example is passing trace IDs in the traceparent HTTP header (or in the equivalent request metadata of other protocols like gRPC). Another is adding trace IDs to messages put into a queue (e.g. Apache Kafka) if the overall process of handling a request touches such a queue.
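To make propagation over HTTP more concrete, the sketch below builds and parses a traceparent header value in the W3C Trace Context layout (version, trace ID, parent span ID, flags). In practice an OTel SDK does this for you; the helper names `build_traceparent` and `parse_traceparent` are made up for this illustration, and the validation is intentionally minimal.

```python
import re
import secrets

# version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id, span_id, sampled=True):
    """Format a version-00 traceparent value for an outgoing request."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Return the parsed fields, or None if the value is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    _version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# The trace ID stays the same for the whole request; each component adds its own span ID.
trace_id = "ee5d91486ebc8d730e564263296c71af"
my_span_id = secrets.token_hex(8)
outgoing_headers = {"traceparent": build_traceparent(trace_id, my_span_id)}

context = parse_traceparent(outgoing_headers["traceparent"])
print(context)
```

The receiving component uses `trace_id` for its own spans and `parent_id` as the parent of the first span it creates.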

To make all components work together well they must agree on all these details. Software that a request passes through needs to be aware of the trace ID and pass it along in subrequests and also emit spans for itself. Adapting software to do this is called instrumentation - which can be a manual software development process (typically using SDKs/libraries) or done automatically (“auto-instrumentation”) using special agents or tools.

For a long time there were both commercial (APM) and open-source solutions for both the instrumentation and protocols. Since a request is often served by multiple technologies (programming languages, infrastructure components) in combination, each tracing solution needed to support all of those - which led to lots of duplicated effort over the years.

OTel is a breath of fresh air here:

  • It standardizes on a single protocol (OTLP),
  • comes with a growing collection of auto-instrumentation solutions and SDKs for most common programming languages and frameworks and also
  • makes available basic tools for collecting and processing telemetry data (OTel Collector).

It allows solutions like our MLAP to offer traces without requiring MLAP-specific instrumentation for all the connected services - OTel instrumentation is all we need.

Strengths:

  • Exposes latencies, retries and failures per component instantly
  • Visual representation of a request’s flow through a distributed system is very powerful for understanding dependencies and where time is spent
  • Spans can reveal useful information like actual database queries

Weaknesses:

  • Instrumentation is needed in every component in the trace
  • Traces are chatty - they produce a lot of data so sampling is often used at scale
  • Not very useful for long-term trend analysis - look into span metrics (metrics derived from spans) for this
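Sampling only works for traces if all components agree on which traces to keep, otherwise spans go missing from the middle of a trace. One simple way to get there is to derive the decision from the trace ID itself. The following is a simplified sketch of that idea; the function `keep_trace` and the choice of 56 bits are assumptions for this illustration, and real OTel samplers differ in detail.

```python
def keep_trace(trace_id_hex, ratio):
    """Decide from the trace ID alone, so every component reaches the same verdict."""
    # Interpret the last 14 hex digits (56 bits) as a number in [0, 2**56).
    value = int(trace_id_hex[-14:], 16)
    return value < ratio * (1 << 56)


trace_id = "ee5d91486ebc8d730e564263296c71af"
print(keep_trace(trace_id, 1.0))   # keep everything
print(keep_trace(trace_id, 0.5))   # keep roughly half of all traces
print(keep_trace(trace_id, 0.0))   # keep nothing
```

Because the function has no state and no randomness of its own, the load balancer, the application and the database client would all make the same choice for a given trace.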

Example data illustrating what’s in an OpenTelemetry trace data submission:

resourceSpans:
  - resource:
      attributes:
        - key: service.name
          value: website-frontend
        - key: ci.project.path
          value: iplus1/website-prod
        - key: ci.commit.sha
          value: 2e596ee2b49d4e354559166ab86f5594c2442320
    scopeSpans:
      - scope: '@opentelemetry/instrumentation-document-load'
        spans:
          - traceId: ee5d91486ebc8d730e564263296c71af
            spanId: cd91d4efea768a52
            traceState: ot=rv:ea482e83a2056d;th:b334
            parentSpanId: d489e51b76759d2c
            flags: 257
            name: documentFetch
            kind: 1
            startTimeUnixNano: '1778861202849000000'
            endTimeUnixNano: '1778861202887000000'
            attributes:
              - key: http.url
                value: https://iplus1.de/
              - key: http.response_content_length_uncompressed
                value: 82554
...
          - traceId: ee5d91486ebc8d730e564263296c71af
            spanId: d489e51b76759d2c
            traceState: ot=rv:ea482e83a2056d;th:b334
            flags: 257
            name: documentLoad
            kind: 1
            startTimeUnixNano: '1778861202848000000'
            endTimeUnixNano: '1778861202982000000'
            attributes:
              - key: http.url
                value: https://iplus1.de/
              - key: http.user_agent
                value: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:150.0) Gecko/20100101 Firefox/150.0
            events:
              - timeUnixNano: '1778861202848000000'
                name: fetchStart
...
              - timeUnixNano: '1778861202931000000'
                name: firstContentfulPaint
            status: {}

(Note: The payload is slightly abbreviated for readability.)
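To show how a trace viewer gets from such a payload to "where was the time spent", the sketch below takes the two spans from the example above (reduced to the fields needed), links them via parentSpanId and computes their durations. The helper names are made up for this illustration.

```python
# The two spans from the example payload, reduced to the relevant fields.
spans = [
    {"name": "documentFetch", "spanId": "cd91d4efea768a52",
     "parentSpanId": "d489e51b76759d2c",
     "start": 1778861202849000000, "end": 1778861202887000000},
    {"name": "documentLoad", "spanId": "d489e51b76759d2c",
     "parentSpanId": None,  # no parent: this is the root span
     "start": 1778861202848000000, "end": 1778861202982000000},
]


def duration_ms(span):
    """Span duration in milliseconds, from nanosecond timestamps."""
    return (span["end"] - span["start"]) / 1_000_000


def render_tree(spans, parent_id=None, depth=0):
    """Return indented lines, children listed below their parent in start order."""
    lines = []
    children = [s for s in spans if s["parentSpanId"] == parent_id]
    for span in sorted(children, key=lambda s: s["start"]):
        lines.append(f"{'  ' * depth}{span['name']}: {duration_ms(span):.0f} ms")
        lines.extend(render_tree(spans, span["spanId"], depth + 1))
    return lines


print("\n".join(render_tree(spans)))
# documentLoad: 134 ms
#   documentFetch: 38 ms
```

Of the 134 ms the page load took, 38 ms were spent fetching the document itself.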

Correlation: Bringing It All Together

Diagnosing system behavior often requires looking at multiple signals together.

One scenario could look like this:

  • A customer reports that “something is slow”
  • Event (or log) data from a login shows the rough time the user was active - maybe even with the requesting IP address
  • Metrics can be queried for that user or IP address or at least that timeframe for unusual delays
  • Traces containing spans related to the metrics can be found by unusual timing or even by user ID
  • Inspecting specific traces shows what was actually happening for a request
  • Logs related to that trace or the span’s component can be used to drill down into the root cause
  • Once the root cause is known, monitoring can be improved to get alerted before customers notice
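The step from traces to logs in this workflow relies on log records carrying the trace ID of the request they belong to. A minimal sketch of that join, using hypothetical in-memory data and made-up helper names, might look like this:

```python
# Hypothetical, already-collected telemetry. Field names and values are illustrative.
recent_spans = [
    {"trace_id": "aaa1", "name": "POST /login", "duration_ms": 2150},
    {"trace_id": "bbb2", "name": "POST /login", "duration_ms": 85},
]
recent_logs = [
    {"trace_id": "aaa1", "severity_text": "WARN", "body": "auth backend timed out, retrying"},
    {"trace_id": "bbb2", "severity_text": "INFO", "body": "login ok"},
    {"trace_id": None, "severity_text": "INFO", "body": "cache warmed"},
]


def slow_trace_ids(spans, threshold_ms):
    """Trace IDs of all spans that took longer than the threshold."""
    return {span["trace_id"] for span in spans if span["duration_ms"] > threshold_ms}


def logs_for_traces(logs, trace_ids):
    """Log records that were emitted while handling one of the given traces."""
    return [entry for entry in logs if entry["trace_id"] in trace_ids]


suspects = slow_trace_ids(recent_spans, threshold_ms=1000)
related_logs = logs_for_traces(recent_logs, suspects)
for entry in related_logs:
    print(entry["trace_id"], entry["severity_text"], entry["body"])
```

Logs without a trace ID, such as background jobs, simply drop out of the result.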

This is a typical workflow for customers using MLAP: Starting with logs, adding application metrics along the way, then adding traces to the mix. Having all the tools available and ready-to-use makes this a smooth process.

Conclusion

Logs/Events, Metrics and Traces each have their own purpose in understanding your systems. Metrics provide the long-term and current-state overview. Traces reveal the exact path a request took and where bottlenecks are. Logs and events allow you to drill down into details of system behavior.

OTel makes this all come together nicely.

A good observability strategy combines all this using OTLP as a protocol and OTel tooling, avoiding vendor lock-in while giving you a comprehensive view of your distributed system. MLAP is a complete solution for ingesting, processing, storing and looking into all this data in a unified way using nothing but open-source software.

Suggested reading: