PromCon 2025
This year’s PromCon, the conference focusing on Prometheus and its vast ecosystem, took place on October 21st and 22nd in Munich. The schedule promised various interesting talks about Prometheus as well as Alertmanager, OpenTelemetry and other components. In this blog post, we share our personal highlights of the conference and give you a brief summary of the latest and greatest in the world of Prometheus.
Why I Recommend Native Prometheus Instrumentation over OpenTelemetry by Julius Volz
The first talk of the conference was one we were very excited about. Julius Volz, one of the co-founders of Prometheus, had already published a blog post earlier this year covering the same topic, and we highly recommend reading it. OpenTelemetry provides SDKs for many popular languages such as Go, Python and Java. These SDKs can then be used to generate one or more types of signals (metrics, logs and traces). OpenTelemetry focuses on the generation/instrumentation as well as the transfer side, but does not itself provide any means of storing or acting upon (querying, alerting etc.) the generated signals. This is where Prometheus comes into play: we can use OpenTelemetry to generate metrics and then send them to Prometheus for storage, querying and alerting. This combination of multiple technologies can, however, lead to some issues:
Target health monitoring
Prometheus provides many different ways to discover monitoring targets. If you are using Prometheus in a Kubernetes environment, you’ll most probably use the Kubernetes service discovery feature to get a list of monitoring targets. For more static environments, it might also be feasible to statically define the targets as part of your Prometheus configuration. No matter which service discovery mechanism you are using, Prometheus leverages the configured mechanisms to build an inventory of targets and uses a pull-based approach to scrape their metrics. If a target is unreachable, Prometheus updates the up metric of the affected target accordingly. In the world of OpenTelemetry, the signals are usually first sent to a collector acting as an intermediary, which allows some degree of pre-processing (for example renaming or dropping metrics). The OpenTelemetry Collector then pushes these metrics to Prometheus using Prometheus Remote Write or OTLP. In this scenario, Prometheus simply acts as a metrics backend and query engine. There is no service discovery involved, and as a result no up metrics are generated by Prometheus. Without precautions, this can easily lead to a situation where you lose a target without even knowing about it.
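Detecting such silent target loss therefore requires explicit precautions on the Prometheus side. A common approach is to alert on the absence of a series that every push-based target is expected to deliver; the sketch below assumes a hypothetical my_app_heartbeat_total metric with a service label, not anything from the talk:

# Fires if no sample of the expected series has arrived within the last 10 minutes.
absent_over_time(my_app_heartbeat_total{service="checkout"}[10m])

The downside is that such rules have to be maintained per expected series or target, which is exactly the kind of bookkeeping the up metric gives you for free in the pull-based model.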
Metric naming
Historically, Prometheus metric names had to conform to the following regular expression: [a-zA-Z_:][a-zA-Z0-9_:]*. In contrast, OpenTelemetry commonly uses dots in metric names. Therefore, to ingest OpenTelemetry-generated metrics into Prometheus, the metric and label names had to be transformed to comply with Prometheus naming conventions. In addition, in Prometheus it’s best practice to add the unit of the measured quantity and/or the type as part of the metric name (e.g. http_request_duration_seconds or node_memory_usage_bytes). Prometheus handled this automatically in the transformation step by using the OpenTelemetry metrics metadata to derive the matching suffixes.
However, this automatic transformation came with its own drawbacks. Because Prometheus rewrote the original metric names, the names defined in the code no longer matched the names that appeared in Prometheus. This often caused confusion during debugging and in discussions between developers and operators: everyone was looking at “the same metric”, but using different names.
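As an illustration, and assuming the default translation behavior, an OpenTelemetry histogram called http.server.request.duration with unit “s” would end up in Prometheus under translated names along these lines (shown here in the classic histogram representation):

http.server.request.duration                  # name as written in the instrumented code
http_server_request_duration_seconds_bucket   # names that appear in Prometheus
http_server_request_duration_seconds_sum
http_server_request_duration_seconds_count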
Even with the more recent addition of UTF-8 support — which removes the need to rename OpenTelemetry metrics — there are still behavioral differences in how queries look depending on the naming style. Series with classic Prometheus-style naming still follow the familiar pattern:
my_metric{my_label="value"}
But a dotted metric coming from OpenTelemetry uses a slightly different syntax:
{"my.metric", "my.label"="value"}
Both forms represent the same data model, but the difference in notation affects how people write, share, and mentally parse queries. The resulting inconsistency is part of the broader challenge discussed in the talk: mixing instrumentation conventions across ecosystems can complicate day-to-day observability workflows, even when full compatibility exists on paper.
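For example, computing a per-second rate over a counter looks slightly different in the two styles (the metric and label names are placeholders):

rate(my_requests_total{my_label="value"}[5m])
rate({"my.requests.total", "my.label"="value"}[5m])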
While, from our point of view, the naming and query differences are the most visible friction points, they are not the only practical considerations. Depending on the workload, OpenTelemetry can be noticeably slower than the native Prometheus client: even simple counter increments have been measured at 5–22× slower in some benchmarks. Resource attributes also require more upfront design: Prometheus target labels are few and stable, whereas OpenTelemetry resource attributes must be mapped into labels manually or exposed through separate _info metrics, which in turn forces joins in PromQL.
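To illustrate the last point: when resource attributes are exposed via an info metric such as target_info, pulling one of them into a query requires a PromQL join. This is only a sketch; k8s_cluster_name stands in for whatever resource attribute is actually needed:

# Enrich a request rate with a resource attribute stored on target_info.
rate(http_requests_total[5m])
  * on (job, instance) group_left (k8s_cluster_name)
    target_info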
Operational overhead plays a role as well. Ingesting OpenTelemetry directly into Prometheus usually requires extra configuration (for example, enabling the OTLP receiver and tuning out-of-order ingestion windows), and these settings must be considered carefully to avoid performance or DoS risks.
None of this makes OpenTelemetry “wrong” — for many teams, its vendor-neutral model and unified telemetry story outweigh the downsides. But when Prometheus is the primary destination for metrics, native Prometheus instrumentation still offers the most straightforward and low-friction path.
Everything you need to know about OpenMetrics 2.0! by György Krajcsovits and Bartłomiej Płotka
The Prometheus exposition format’s signature feature—“what you see is what you query”—has made it wildly successful. Copy a metric name and labels from /metrics, paste into PromQL, and it just works. But five years after OpenMetrics 1.0 standardized this format, György Krajcsovits (Grafana) and Bartłomiej Płotka (Google) revealed how new features are straining the current design. OpenMetrics 2.0 aims to fix this while preserving what made the format great.
The problem: Prometheus’ stateless, line-based format is breaking down. Consider created timestamps—parsers must somehow correlate these separate metrics:
http_requests_total 124.0
http_requests_created 1761033600.123
Histograms are worse, requiring many separate series with “magic suffixes” to be matched together:
http_request_seconds_bucket{le="1.0"} 10
http_request_seconds_bucket{le="+Inf"} 20
http_request_seconds_count 20
http_request_seconds_sum 323.0
Meanwhile, strict naming conventions (like mandatory _total suffixes) clash with OpenTelemetry’s approach, forcing awkward translations that confuse users.
The solution: OpenMetrics 2.0 introduces complex types that collapse multiple series into one self-contained line:
http_request_seconds {count:20,sum:323.0,bucket:[1.0:10,+Inf:20]}
No more magic suffixes. No more special le labels. Created timestamps become inline tags:
http_requests_total 124.0 st@1761033600.123
Early benchmarks suggest promising performance gains: histogram parsing could be up to 10x faster, and the inline st@ notation may significantly reduce CPU and memory consumption. The spec also relaxes naming rules (_total becomes optional) and adds UTF-8 support for international metric names.
OpenMetrics 2.0 is still a work in progress. The working group welcomes community feedback through its weekly calls to help refine the specification and balance backward compatibility with performance improvements.
SAAFE – A prioritized alerting model to troubleshoot your incidents by Jorge Creixell and Manoj Acharya
This talk was presented by Jorge Creixell and focused on the SAAFE framework. SAAFE stands for Saturation, Amend, Anomaly, Failure, and Error — five categories designed to capture everything that matters during the lifecycle of an incident, not just the things that should wake someone up at 3 AM. It proposes that modern observability is not just about detecting problems, but about preserving the context necessary to understand them quickly. In other words: page on actionable symptoms only, but provide all the surrounding information that helps engineers reason about what’s really happening.
The motivation behind SAAFE is straightforward: the ecosystem around Prometheus has gotten very good at alerting. Many organizations have alerting rules that capture everything unusual that might happen in their stack, which leads to the well-known problem of alert fatigue. When engineers receive too many alerts, especially for situations that don’t need immediate action, two bad outcomes follow: they start ignoring alerts, and they lose trust in the monitoring system. SAAFE flips the philosophy. Instead of disabling meaningful signals to reduce noise, it separates signals into two layers: actionable alerts and contextual assertions. Assertions are still emitted as alert primitives, but they are not meant to page. They are used to build a landscape that shows the context of a pageable alert.
What makes the model compelling is how it decomposes system events into the five dimensions of SAAFE:
- Saturation captures resource pressure — CPU, memory, storage, network I/O. High saturation doesn’t always break a system, but once a real incident begins, it’s often one of the first breadcrumbs an SRE looks for. Saturation can be both a cause and a consequence of failures, which is one of the reasons why these signals are so important to understand a failure fully.
- Amend represents changes such as deployments, configuration edits, scaling events, new nodes being added to a cluster, etc.
- Anomaly describes unusual patterns in metrics, for example detected by combining SAAFE with the PromQL Anomaly Detection framework (a rough sketch of such a check follows after this list).
- Failure captures binary states: something is either working or it isn’t. CrashLoopBackOff, insufficient replicas for a deployment, crashes of systemd services — this bucket describes events that represent real degradation. They’re not expressed as proportions or percentages: they’re yes/no conditions.
- Error describes quantitative degradation — failing SLOs, HTTP 5xx ratios, timeout percentages. Errors often trigger actual pages because they represent user impact.
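As a rough illustration of the Anomaly category, and explicitly not the actual rules from the talk or from the PromQL Anomaly Detection framework, a simple z-score-style check over a placeholder metric could look like this:

# Flag request rates that deviate more than 3 standard deviations
# from their average over the past day.
abs(
  rate(http_requests_total[5m]) - avg_over_time(rate(http_requests_total[5m])[1d:5m])
) / stddev_over_time(rate(http_requests_total[5m])[1d:5m]) > 3

In the SAAFE model, an expression like this would typically be emitted as a non-paging assertion that adds context, rather than as a pageable alert.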
Put together, these five views act like the dashboard of a car. A check-engine light means do something now; the tachometer, temperature gauge, and fuel indicator don’t page the driver, but trying to diagnose the problem without them would be unnecessarily complicated. SAAFE brings a similar philosophy to distributed systems by acknowledging that alerts are usually just the beginning, and more context is required to fully diagnose the underlying issue.
When a high-severity alert fires, responders can immediately jump to a global assertion timeline showing all SAAFE events across the system. If latency just spiked on a frontend service, the next step is often not “check the frontend CPU”. With a SAAFE dashboard we can now also see what else changed around that moment: Was there a new deployment? Is one of the backend services failing or responding slowly? Was there a sudden jump in traffic? Did the infrastructure scale? SAAFE places these events onto a shared timeline so responders can visually correlate cause and effect rather than reconstruct state from memory and intuition.
Reverse-Engineering PromQL Usage: A Proxy’s Tale by Nicolas Takashi
“Who is running this expensive PromQL expression and breaking my monitoring system?” Nicolas Takashi from Coralogix presented a practical solution: Prom Analytics Proxy, a lightweight proxy that sits between clients and Prometheus-compatible backends to track query usage patterns.
The problem is universal—we collect metrics but don’t observe how they’re used. Existing solutions have limitations: Prometheus query logs require configuration and centralized aggregation, Grafana inspector only shows individual panels, and distributed traces struggle with sampling and aggregate questions. Takashi and Michael Hoffmann built a proxy with zero backend changes that works across Thanos, Mimir, Cortex, and VictoriaMetrics.
The proxy captures queries, forwards them immediately without blocking, and asynchronously stores analytics (query text, step, type, label matchers, duration, samples) in PostgreSQL or SQLite. It provides three views: System Health (query distribution, latency, time ranges), Query Fingerprints (unique expressions sorted by execution count, duration, or errors), and Execution Details (individual query context for troubleshooting).
Integration with Metrics Usage adds a Metrics Catalog showing where each metric is used (dashboards, alerts, recording rules) and identifying unused metrics. This answers critical questions: “Can I drop this metric?” “Which dashboards break if I change this label?”
The proxy runs in production at 40–50k queries/second with less than 1 GB of memory per instance, and its stateless design allows horizontal scaling. Future plans include query optimization recommendations and potentially dropping unused metrics at ingestion time. The project is available at github.com/nicolastakashi/prom-analytics-proxy.
Conclusion
PromCon 2025 was a great opportunity to learn many new things and catch up with the latest developments in the Prometheus ecosystem. There’s a lot happening around OpenMetrics 2.0 and OpenTelemetry—both trying to bring new features, improve performance, and add flexibility, all while staying true to the original Prometheus philosophy.
Beyond the talks and technical insights, it was also great to meet the community in person and, of course, enjoy some excellent beer in Munich. If you missed the conference or want to revisit the talks, recordings are available on YouTube (day 1 and day 2). A big thanks to the organizers and speakers for making this year’s PromCon both inspiring and enjoyable!