Sampling is tracing's weakness
Distributed tracing is AWESOME! I saw its power more than 6 years ago, and I've been talking about it regularly ever since. It's helped me troubleshoot many issues that would have been extremely difficult to crack with just metrics and logs.
However, it's not widely adopted, and I used to think that was because it's difficult to instrument applications end to end. OpenTelemetry has done a decent job of making it easy, and I think it's easy-ish to instrument applications today.
The other big issue is that it's hard to get value from traces. For example, take Mimir, where a typical installation gets 10k write requests for each read request. The write requests typically succeed in about 100ms, while the read requests can run longer than 10s depending on the queries users are running.
Given the sheer number of write requests, it doesn't make sense to do tail-sampling, where every single request produces a trace and then we throw away most of the data to keep a very tiny slice (discard 99.999% of the writes and keep most of the reads, for example). This just produces too much data, wastes a lot of resources, and is hard to do reliably at scale.
The alternative is worse: head sampling with a fixed sampling rate. With a 10k:1 write-to-read ratio, you'll mostly be getting traces for write requests that you don't care about. And at this scale, you can only afford a tiny sampling rate, like 0.001%, which adds almost no value.
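For concreteness, fixed-rate head sampling is typically configured through the standard OpenTelemetry SDK environment variables, something like:

```sh
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.00001   # 0.001% of root spans, applied to every endpoint equally
```

One knob for the whole application, which is precisely the problem.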
Now, Mimir is an extreme case, but the problem exists in general for all applications. Some endpoints are more valuable for tracing than others, and you should be able to control the sampling rate on a per-endpoint basis.
Remote Sampling NEEDS LOVE
Jaeger introduced remote sampling to solve this problem, and it is now present in OpenTelemetry as well. It lets you pick and control the sampling rate on a per-endpoint basis, and even lets you update the rates at runtime. You can now set the read sampling rate to 25% while dropping 99.99999% of the writes. This is what we leverage at Grafana Labs, and it works quite well. And if there is a different endpoint you want to see more traces of when troubleshooting, you can change the config file and the updated rate is propagated within a minute.
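To make that concrete, here's a sketch of what a Jaeger remote-sampling strategies file can look like. The service and operation names are illustrative, not from a real deployment; the params mirror the rates above (0.25 keeps 25% of reads, 0.0000001 keeps 0.00001% of writes):

```json
{
  "service_strategies": [
    {
      "service": "mimir",
      "type": "probabilistic",
      "param": 0.001,
      "operation_strategies": [
        { "operation": "POST /api/v1/push", "type": "probabilistic", "param": 0.0000001 },
        { "operation": "GET /api/v1/query_range", "type": "probabilistic", "param": 0.25 }
      ]
    }
  ],
  "default_strategy": { "type": "probabilistic", "param": 0.001 }
}
```

SDKs poll this configuration from a sampling server, which is what makes the edit-and-propagate-in-a-minute workflow possible.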
Now you might be like, you wrote about this before, right? In the blog post you just referenced 3 times! Well, 4 times, but yes.
But I am writing about this again because I am helping set up observability for a Ruby on Rails application, and I want to add OpenTelemetry tracing to it. It has a ton of endpoints, some called more frequently than others, and some more important than others. And if we cannot set the sampling rate per endpoint, the cost-value equation changes significantly.
To my dismay, opentelemetry-ruby doesn't support remote sampling ☹️ I couldn't Claude-code my way to it, either. And it just hit me that not enough people understand the importance of remote sampling, so I wanted to bang on this drum again!
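In the meantime, a stopgap I'm considering is a hand-rolled per-endpoint head sampler. Below is a minimal sketch against the sampler duck type that opentelemetry-ruby documents (should_sample? and description); the endpoint names and ratios are hypothetical, and a real remote implementation would poll a sampling server instead of hard-coding the table:

```ruby
require 'opentelemetry/sdk'

# Hypothetical per-endpoint head sampler: delegates to a
# trace_id_ratio_based sampler chosen by span name, with a default
# ratio for everything else.
class PerEndpointSampler
  DEFAULT_RATIO = 0.001
  RATIOS = {
    'POST /ingest' => 0.0000001, # illustrative endpoint names
    'GET /search'  => 0.25
  }.freeze

  def initialize
    # Memoize one delegate sampler per distinct ratio.
    @delegates = Hash.new do |cache, ratio|
      cache[ratio] = OpenTelemetry::SDK::Trace::Samplers.trace_id_ratio_based(ratio)
    end
  end

  def should_sample?(trace_id:, parent_context:, links:, name:, kind:, attributes:)
    @delegates[RATIOS.fetch(name, DEFAULT_RATIO)].should_sample?(
      trace_id: trace_id, parent_context: parent_context,
      links: links, name: name, kind: kind, attributes: attributes
    )
  end

  def description
    'PerEndpointSampler'
  end
end
```

This gets you per-endpoint rates, but not runtime updates: changing a rate means redeploying, which is exactly the friction remote sampling removes.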
The one redeeming factor in my case is that this is not a distributed system, but a monolithic Rails app. So we might be able to get away with tail-sampling by running a collector on each node. However, is it worth the performance and resource cost? I don't know yet.
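If we do try it, I'd expect the config to look roughly like the tail_sampling processor from opentelemetry-collector-contrib. A rough sketch, not a tuned config, assuming the usual keep-errors-and-slow-traces policies plus a small probabilistic baseline:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans this long before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 5000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }
```

Every span still has to reach the collector before the decision is made, which is where the performance and resource cost comes from.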