The whole time I was learning/porting to Otel I felt like I was back in the Java world again. Every time I stepped through the code it felt like EnterpriseFizzBuzz. No discoverability. At all. And their own jargon that looks like it was made by people high on something.
And in NodeJS, about four times the CPU usage of StatsD. We ended up doing our own aggregation to tamp this down and to reduce tag proliferation (StatsD is fine having multiple processes reporting the same tags, OTEL clobbers). At peak load we had 1 CPU running at 60-80% utilization. Until something changes we couldn’t vertically scale. Other factors on that project mean that’s now unlikely to happen but it grates.
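A minimal sketch of the kind of in-process aggregation described above, written in Python for brevity rather than NodeJS. It assumes counters are summed per interval and that tags outside an allow-list are dropped before export. `CounterAggregator`, `add`, and `flush` are hypothetical names for illustration, not part of any OTel or StatsD SDK.

```python
from collections import defaultdict


class CounterAggregator:
    """Hypothetical pre-aggregator: sums counter increments per (name, tags) series."""

    def __init__(self, allowed_tags):
        self.allowed_tags = frozenset(allowed_tags)
        self._sums = defaultdict(float)

    def add(self, name, tags, value=1.0):
        # Drop tags outside the allow-list so cardinality stays bounded.
        kept = tuple(sorted((k, v) for k, v in tags.items() if k in self.allowed_tags))
        self._sums[(name, kept)] += value

    def flush(self):
        # Emit one point per series, then reset for the next export interval.
        points = [(name, dict(tags), total) for (name, tags), total in self._sums.items()]
        self._sums.clear()
        return points


agg = CounterAggregator(allowed_tags={"route", "status"})
agg.add("http.requests", {"route": "/a", "status": "200", "pid": "101"})
agg.add("http.requests", {"route": "/a", "status": "200", "pid": "102"})
agg.add("http.requests", {"route": "/b", "status": "500", "pid": "101"})
points = agg.flush()  # two series instead of three raw increments
```

The exporter then sees one point per series per interval, which is presumably where the CPU and tag-count savings would come from.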
OTEL is actively hostile to any language that uses one process per core. What a joke.
Just go with Prometheus. It’s not like there are other contenders out there.
I'm fairly convinced that OTEL is in a form of 'vendor capture', i.e. because the only way to get a standard was to compromise with various bigcorps and sloppy startups to glue-gun it all together.
I tried doing a simple otel setup in .NET and after a few hours of trying to grok the documentation of the vendor my org has chosen, hopped into a discord run by a colleague that has part of their business model around 'pay for the good otel on the OSS product' and immediately stated that whatever it cost, it was worth the money.
I'd rather build another reliable event/pubsub library without prior experience than try to implement OTEL.
This matches my conclusion as well. Just use Prometheus and whatever client library for your language of choice, it's 1000x simpler than the OTEL story.
You probably don't understand what Otel is if you think that Prometheus is an alternative.
You'd do better to point out which distinction you think the parent poster is missing.
My guess is that Prometheus cannot do distributed tracing, while OpenTelemetry can. Is that what you meant?
Why Otel compared to prometheus+syslog+(favorite way to do request tagging, eg: MDC in slf4j)+grep?
Syslog is kinda a pain, but it's an hour of work and log aggregation is set up. Is the difference the pain of doing simple things with elastic compute and kubernetes?
Prometheus is good, but let's be clear...you don't get tracing.
For tracing FOSS: Grafana Tempo.
Tempo's a backend/sink for traces, but if you click through to the Tempo docs and find out how to generate tracing data[1], you learn that you have two options: OpenTelemetry, which they recommend, and Zipkin, which they do not recommend.
[1] https://grafana.com/docs/tempo/latest/getting-started/instru...
Tempo is a traces server. Prometheus is a metrics server.
Grafana, the same company that develops and sells Tempo, created a horizontally scalable version of Prometheus called Mimir.
OpenTelemetry is an ecosystem, not just 1 app. It’s protocols, libraries, specs, a Collector (which acts as a clearinghouse for metrics+traces+logs data). It’s bigger than just Tempo. The intention of OTel seems to be to decouple the protocol from the app by having adapters for all of the pieces.
Can you even achieve this with prometheus? Afaik it operates by exposing metrics that are scraped at some interval. High level stuff, not per-trace stuff.
How would you build the "holy grail" map that shows a trace of every sub component in a transaction broken down by start/stop time etc... for instance show the load balancer see a request, the request get handled by middlewares etc, then go onto some kind of handler/controller, the sub-queries inside of that like database calls or cache calls. I don't think that is possible with prometheus?
> Can you even achieve this with prometheus? Afaik it operates by exposing metrics that are scraped at some interval. High level stuff, not per-trace stuff.
Correct. Prometheus is just metrics.
The main argument for oTel is that instead of one proprietary vendor SDK or importing prometheus and jaeger and whatever you want to use for logging, just import oTel and all that will be done with a common / open data format.
I still believe in that dream but it's clear that the whole project needs some time/resources to mature a bit more.
If anybody remembers the Terraform/ToFu drama, it's been really wild to see how much support everybody pledged for ToFu but all the traditional observability providers have just kinda tolerated oTel :/
Code traces are metrics. Run times per function calls metrics, count of specific function call metrics.
Otel is an attempt to package such arithmetic.
Web apps have added so many layers of syntax sugar and semantic wank, we’ve lost sight of the fact that it’s all just the same old math operations relative to different math objects. Sets are not triangles but both are tested, quantified, and compared with the same old mathematical ops we learn by middle school.
OpenTelemetry's traces are trees of spans. You cannot represent this efficiently without a combinatorial explosion of labels.
You may be thinking of metrics in the sense of counters and gauges, but that's not the data model that OpenTelemetry (and before it, Zipkin, Jaeger, and OpenCensus) uses for traces.
The data model for tracing is to emit events that provide a span ID and an optional parent span ID. The event collector can piece these together into a tree after the fact, which will work as long as the parent structure is maintained.
Prometheus is absolutely not suitable for this.
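The span ID / parent span ID model described above can be sketched roughly as follows. This is an illustration only: the event dictionaries and the `build_trace_tree` and `render` helpers are made up for the example, and treating a span whose parent never arrived as a root is an assumption, not something a particular collector is known to do.

```python
from collections import defaultdict


def build_trace_tree(events):
    """Link span events into a tree using only span_id and parent_id."""
    known = {e["span_id"] for e in events}
    children = defaultdict(list)
    roots = []
    for e in events:
        parent = e.get("parent_id")
        if parent is None or parent not in known:
            # Assumption: spans with no (known) parent are treated as roots.
            roots.append(e["span_id"])
        else:
            children[parent].append(e["span_id"])
    return roots, dict(children)


def render(span_id, names, children, depth=0):
    """Return the subtree under span_id as indented lines."""
    lines = ["  " * depth + names[span_id]]
    for child in children.get(span_id, []):
        lines.extend(render(child, names, children, depth + 1))
    return lines


# Events may arrive in any order; the tree is assembled after the fact.
events = [
    {"span_id": "c", "parent_id": "b", "name": "db.query"},
    {"span_id": "a", "parent_id": None, "name": "GET /orders"},
    {"span_id": "b", "parent_id": "a", "name": "handler"},
    {"span_id": "d", "parent_id": "b", "name": "cache.get"},
]
roots, children = build_trace_tree(events)
names = {e["span_id"]: e["name"] for e in events}
tree_lines = render(roots[0], names, children)
# tree_lines == ["GET /orders", "  handler", "    db.query", "    cache.get"]
```

Note that the shape of the tree differs per request, which is why it does not map onto a fixed set of labelled counters.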
Quibbling about the word "telemetry" doesn't really help here. OpenTelemetry supports three different, completely different subsets of functionality: Metrics (counters, gauges, histograms), traces (span events in a tree structure), and logging (structured log events). They each have completely different client interfaces.
No, code traces are not just metrics; and while you can knit together something approximating traces from metrics, you'll quickly run into the reason why traces are a distinct thing. First, in a distributed system, you'll discover that you can't rely on clocks to get the timing of subsecond events correct. Second, you'll be contextless about code paths. So, you might independently reinvent the idea of passing along a context - and now you're just making your own tracing system but without any of the benefit of building on years of existing discoveries in this field.
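To make "passing along a context" concrete, here is a small sketch that uses the field layout of the W3C `traceparent` header (version, trace ID, span ID, flags). The helper names `new_traceparent` and `child_traceparent` are hypothetical, and real implementations also validate the header and handle `tracestate`, which this omits.

```python
import secrets


def new_traceparent():
    """Start a trace: 16-byte trace ID, 8-byte span ID, sampled flag set."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(incoming):
    """Derive the header a downstream call would carry, plus the parent span ID."""
    version, trace_id, parent_span_id, flags = incoming.split("-")
    span_id = secrets.token_hex(8)
    # Same trace ID, fresh span ID: causality comes from the parent link,
    # not from comparing clocks on different machines.
    return f"{version}-{trace_id}-{span_id}-{flags}", parent_span_id


root = new_traceparent()
child, parent_id = child_traceparent(root)
```

Each service records its own span with `parent_id` attached, so the ordering of work across machines can be reconstructed without trusting their clocks to agree.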
OTel does feel a little bit heavy, unless you're already used to e.g. New Relic, Dynatrace, etc. where you have to run an agent process and instrument your code to some extent; it's never going to be free to audit every function call! This is why (a) you sample down and don't keep every trace, and (b) unless your company is extremely flush with cash you probably don't run tracing in every environment. If you can get away with it just in a staging or perf test env you can reap most of the benefit without the production impact and cost.
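One way to "sample down" is to decide per trace ID, so that every service makes the same keep/drop decision for a given trace. The sketch below is an illustration under that assumption; `keep_trace` is a made-up helper and the SHA-256 bucketing is a choice made for the example, not a description of how any particular SDK's sampler works.

```python
import hashlib


def keep_trace(trace_id: str, ratio: float) -> bool:
    """Deterministically keep roughly `ratio` of traces, keyed on the trace ID."""
    if ratio >= 1.0:
        return True
    if ratio <= 0.0:
        return False
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto [0, 1] and compare to the ratio.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < ratio


# Same trace ID always gets the same answer, on any host.
decision = keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", 0.1)
kept = sum(keep_trace(f"trace-{i}", 0.1) for i in range(10_000))
```

Because the decision depends only on the trace ID, a kept trace is kept end to end rather than ending up with missing spans from services that sampled differently.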
All those things you describe are computable metrics. They have to be or Otel itself would not be able to compute them for consumption. All you described are cherry picked semantic indirections to obfuscate it’s all just a computer computing metrics of its own memory states.
Sorry for knowing how computers actually work (EE grad not a CS grad). I know that can frustrate CS grads who think their preferred OS and favorite programming language is how a computer works. You’re describing how contemporary SWEs view their day job.
Edit: teleMETRY …what’s in a name? Oh right …meaning.
To be a smart-ass, one has to be smart first. Quit this.
As a no grad to EE grad: traces mean a bundle of metrics that varies in structure, hence you can't store and process them as effectively as a list of counters unless you have a distinct bin for each possible trace, combinatorial explosion y'know.
You know the conversation is going well for you when you resort to citing the "meaning" of a name instead of, you know, base reality. Who needs the territory, I've got my map right here.
Speaking of meaning, the best I can make of your point is that you're using a much broader definition of "metrics" than the rest of this conversation, and in particular broader than Prometheus (remember context? very important for "meaning"!) supports. That or you really just don't know what a "trace" is (in this context).
huh? I've always heard and read and experienced that "logs, traces, metrics" are the 3 legs of the observability stool.
Open teleMETRY
Any guesses as to etymology?
By this logic, you can say that logging, metrics and tracing are all fundamentally just different kinds of data and we should be calling it just plain databases and CRUD.
They're related, but people have a very specific idea and concept of what each is. You haven't actually provided a good argument for why we should throw out these distinctions just because they somewhat resemble each other if you ignore a few details.
Yeah part of the problem is it’s called Opentelemetry and half of you are only talking about tracing, not metrics. Telemetry is metrics. It’s been metrics since at least the Mercury Program.
Metrics in OTEL is about three years old and it’s garbage for something that’s been in development for three years.
Simpler near-term, but more painful long term when you want to switch vendors/stacks.
And switching log implementations can be a pain in the butt. Ask me how I know.
But I’d rather do that three more times before I want to see OpenTelemetry again.
Also Prometheus is getting OTEL interop.
Nine times out of ten, I've got more valuable problems to solve than a theoretical future change of our vendor/stack for telemetry. I'll gladly borrow from my future self's time if it means I can focus on something more important right now.
I did our migration from StatsD to OTEL because our third party StatsD service was getting flaky. The first person from Ops to get to me pushed OTEL. The rest were fine with Prometheus and it was late in the process before they realized what had happened. I believe if we had gone straight to Prometheus I would have been done in half the time and solved half the problems I had to solve anyway for OTEL. If someone had to replace it again in the future I fully believe it would have taken cumulatively as much time to go StatsD->Prometheus->OTEL as it took to go StatsD->OTEL, especially when you consider that OTEL is not quite baked.
Meanwhile functionality to retain and recruit new customers sat in the backlog.
Edit to add: also regarding the perf issues I saw: do you really want to pay for an extra server or half a server in your cluster just in case some day comes? These decisions were much fuzzier when you ordered hardware once every two years and just had to live with the capacity you got.
Is this the same scam as "standard SQL"? Switching database products is never straightforward in practice, despite any marketing copy or wishful thinking.
Prometheus ecosystem is very interoperable, by the way.
> It’s not like there are other contenders out there.
Apache Skywalking might be worth a look in some circumstances: it doesn't eat too many resources, is fairly straightforward to set up and run, admittedly somewhat jank (not the most polished UI or docs), but works okay: https://skywalking.apache.org/
Also I quite liked that a minimal setup is indeed pretty minimal: a web UI, a server instance and a DB that you already know https://skywalking.apache.org/docs/main/latest/en/setup/back...
In some ways, it's a lot like Zabbix in the monitoring space - neither will necessarily impress anyone, but both have a nice amount of utility.
Using otel from C++ side... To have cumulative metrics from multiple applications (e.g. not "statsd/delta") I create a relatively low cardinality process.vpid integer (and somehow coordinate this number to be unique as long as the app emitting it is still alive) - you can use some global object to coordinate it.
Then you can have something that sums, and removes the attribute.
With statsd/delta, if you lose a signal, all the data gets skewed; with cumulative, you only lose precision.
edit... forgot to say - my use case is "push based" metrics as these are coming from "batch" tools, not long running processes that can be scraped.
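A rough sketch of the delta-versus-cumulative point above, in Python rather than C++. It assumes each report is a `(vpid, value)` pair, that cumulative counters only ever increase while a `vpid` is alive, and that the "sum and remove the attribute" step keeps the latest value per `vpid`. The function names are hypothetical and the numbers are made up.

```python
def total_from_deltas(received):
    """Delta temporality: the total is the sum of every report that arrived."""
    return sum(value for _vpid, value in received)


def total_from_cumulative(received):
    """Cumulative temporality: keep the highest value seen per vpid, then sum
    across vpids, which drops the vpid attribute from the result."""
    latest = {}
    for vpid, value in received:
        latest[vpid] = max(value, latest.get(vpid, 0))
    return sum(latest.values())


# Two processes; the true total is 20. Process 1's second report is lost.
delta_reports = [(1, 5), (1, 3), (1, 2), (2, 4), (2, 6)]
cumulative_reports = [(1, 5), (1, 8), (1, 10), (2, 4), (2, 10)]

delta_received = [r for i, r in enumerate(delta_reports) if i != 1]
cumulative_received = [r for i, r in enumerate(cumulative_reports) if i != 1]

delta_total = total_from_deltas(delta_received)                 # 17, permanently short
cumulative_total = total_from_cumulative(cumulative_received)   # 20, recovered by the next report
```

With deltas the lost increment never comes back, while with cumulative values the next report from the same `vpid` carries it, so only the timing resolution suffers.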
Same. I implemented Otel once and exactly once. I wouldn't wish it on any company.
Otel is a design by committee garbage pile of half baked ideas.
This matches my experience. Very difficult to understand what I needed to get the effect I wanted.
I wonder what your experience is with Sentry? Not just for error reporting but especially also their support for traces.
Also open-source & self-hostable.
Likely only a handful of people care, but Sentry hasn't been open source in quite a while https://github.com/getsentry/sentry/blob/24.12.1/LICENSE.md (I'd have to do tag-spelunking to find the last Apache 2 version)
Glitchtip is the Sentry compatible open source (MIT) one https://gitlab.com/glitchtip/glitchtip-backend/-/blob/v4.2.2... with the extra advantage that it doesn't require like 12 containers to deploy (e.g. https://github.com/getsentry/self-hosted/blob/24.12.1/docker... )
Sentry is not horizontally scalable, thus ~ not-scalable at all, if your company is big.
Quota/pricing.
There are a lot of Java programmers working on it.
(And some Go tbf.)
Yeah and a blind man can see this, it’s so loud.