It is no secret that I am not a fan of logs; I’ve baited discussion in our work chat (“rapala” in work lingo; Rapala is a Finnish brand of fishing lure, so it’s come to mean baiting in this context) with things like: If you’re writing log statements, you’re doing it wrong. This is a pretty incendiary statement, and while there has been some good discussion afterwards, I figured it was time to write down why I think logs are bad, why tracing should be used instead, and how we get from one to the other.

TLDR; The author argues that free-form logging is next to useless and expensive to use. They also argue that structured logging is less effective than tracing, mainly b/c of the difficulty of inferring timelines and causality from logs.


I find the arguments very plausible.

In fact I very rarely use logs produced by several services b/c most of the time they just confuse me. The only time that I heavily use logs is when troubleshooting a single service and looking at its stdout (or kubectl logs.)

However I have very little experience w/ tracing (I’ve used it in my hobby projects but, obviously, they never represent the reality of complex distributed systems.)

Have you got real world experience w/ tracing in larger systems? Care to share your take on the topic?

My own thoughts.

  1. Instead of defining the difference between logging and tracing, the author spams the screen with pages’ worth of examples of why logging is bad, then jumps into tracing by immediately referencing code that uses a specific tracing library (OpenTelemetry Tracer) without at any point explaining what that code is actually doing to someone who is not familiar with it already. To me this smacks of preaching to the choir, since if you’re already familiar with this tool, you’re likely a) familiar with what “tracing” is compared to “logging”, and b) probably a tracing advocate to begin with. If you want to persuade an undecided or unfamiliar audience, confusing them and/or making assumptions about what they know or don’t know is … suboptimal.

  2. If you’re going to screen dump your code in your rant, FUCKING COMMENT IT YOU GIT! I don’t want to have to read through 100 lines of code in an unfamiliar language written to an unfamiliar architecture to find the three (!) lines that are actually on the fucking topic!

  3. If you’re going to show changes in your code, put before/after snapshots side by side so I don’t have to go scrolling back to the uncommented hundred-line blob to see what changed. It’s not that hard. Using his own damned example from “Step 1”:

// BEFORE
func PrepareContainer(ctx context.Context, container ContainerContext, locales []string, dryRun bool, allLocalesRequired bool) (*StatusResult, error) {
	// Free-form log line: no request ID, no duration, no link to the caller.
	logger.Info(`Filling home page template`)

// AFTER
// otel.Tracer returns a named tracer from the globally registered provider.
var tr = otel.Tracer("container_api")

func PrepareContainer(ctx context.Context, container ContainerContext, locales []string, dryRun bool, allLocalesRequired bool) (*StatusResult, error) {
	// Start a span as a child of whatever span is already in ctx;
	// defer span.End() closes it (and records its duration) when the function returns.
	ctx, span := tr.Start(ctx, "prepare_container")
	defer span.End()

(And while you’re at it, how 'bout explaining the fucking code you wrote? How hard is it to add a line explaining what that defer span.End() nonsense is? Remember, you’re trying to sell people on the need for tracing. If they already know what you’re talking about you’re preaching to the choir, son.)

Of course in “The Result” he talks about the diff between the two functions … but doesn’t actually provide that diff. Instead he provides another hundred-line blob kept far away from the original so you have to bounce back and forth between them to spot the differences. Side-by-side diffs are a thing and there’s plenty of tools that make supplying them trivial. Maybe the author should think about using them.

  4. The technique this guy is espousing, if I’m reading it right, sounds fine but only in limited realms. This would kill development in my realm (small embedded systems), for example. If you have (effectively, from my domain’s perspective) infinite RAM, CPU, persistent storage, and bandwidth, then yes, this is likely a very good technique. (I can’t be certain, of course, because he hasn’t actually explained anything, just blasted uncommented code while referencing a library he assumes we know about. The only reason I followed any of it is because I’m familiar with Erlang’s tooling for this kind of stuff, which puts what he’s showing off to shame.) But if your RAM is limited (hint: measured in 2-digit KB and shared by your stack(s), heap, and static memory), if your CPU is a blazing-fast 80MHz, and if you think 1MB of persistent storage (which your program binary has to share) is a true bucket of gold in wealth, and, yes, if you’re transmitting over a communications link that would have ’80s-era modem jockeys looking on you with pity, then maybe, just maybe, tracing isn’t so great an idea after all.
bahmanm (creator):

I’ve got to admit that your points about the author’s presentation skills are all correct! Perhaps the reason I was able to relate to the material and overlook those flaws is that it’s a topic I’ve been actively struggling w/ for the past few years 😅

That said, I’m still happy that this wasn’t a YouTube video or we’d be having this conversation in the comments section (if ever!) 😂


To your point and @krnpnk@feddit.de’s RE embedded systems:

That’s absolutely true: such a mindset is probably not going to work in an embedded environment. The author, w/o explicitly mentioning it anywhere, is really talking about distributed systems where you’ve got plenty of resources, stable network connectivity and a log/trace ingestion solution (like Sumo or Datadog) running alongside your setup.
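
To give a concrete idea of what “alongside your setup” means, wiring a Go service up to such a backend usually boils down to registering a tracer provider w/ an exporter at startup. Here’s a minimal sketch using the OpenTelemetry Go SDK and the OTLP/gRPC exporter (the collector endpoint is a made-up placeholder):

package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func initTracing(ctx context.Context) (*sdktrace.TracerProvider, error) {
	// Ship spans over OTLP/gRPC to whatever agent/collector forwards them
	// to the backend (Datadog agent, OpenTelemetry Collector -> Sumo, ...).
	// The endpoint below is a placeholder.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	// Batch spans in memory and export them asynchronously.
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))

	// From here on, otel.Tracer(...) hands out tracers backed by this provider.
	otel.SetTracerProvider(tp)
	return tp, nil
}

func main() {
	ctx := context.Background()
	tp, err := initTracing(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Shutdown(ctx) // flush any buffered spans on exit
}

And on top of the code change, you need to run that collector/agent and pay a backend to ingest and store the spans.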

That’s indeed an expensive setup, esp for embedded software.


The narrow scope and the stylistic problems aside, I believe the author’s view is correct, if a bit radical.
One of the major pain points of troubleshooting distributed systems is sifting through the logs produced by different services and teams, each w/ a different take on what the important bits of information in a log message are.

It gets extremely hairy when you’ve got a non-linear lifeline for a request (ie branches of execution), and even worse when you need to keep your logs free of any information which could potentially identify a customer.

The article and the conversation here got me thinking that maybe a combo of tracing and structured logging can help simplify investigations.

Some thoughts from my side (coming from another domain - more embedded):

  • Whether you use a message string or a named bool does not change anything. It’s still logging.
  • It’s of course nice to just trace everything and filter/search afterwards, but in embedded, for example, your machine may just crash if you try that. That’s what log levels are for: the traditional way to filter before a log is ever written.
  • I don’t get how timestamping/ordering is necessarily worse for logging. Maybe it’s just the framework that is used?
  • You sure can have hierarchical information in log frameworks, too (see the sketch right after this list).
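
As a quick illustration of those last points, here’s a minimal sketch using Go’s standard log/slog package (the request_id value and the “container” group are made-up examples): it filters by level before anything is written, carries request-scoped context, and nests attributes hierarchically.

package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON handler w/ a minimum level: anything below Info is filtered out
	// before it is ever formatted or written.
	logger := slog.New(slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{
		Level: slog.LevelInfo,
	}))

	// Attach request-scoped context once; every record below inherits it.
	reqLogger := logger.With("request_id", "req-42") // made-up ID

	// Grouped (hierarchical) attributes end up nested under "container".
	reqLogger.Info("filling home page template",
		slog.Group("container",
			slog.String("id", "c-17"),
			slog.Int("locales", 3),
		),
	)

	// Dropped entirely: Debug is below the handler's Info threshold.
	reqLogger.Debug("dry run", "dry_run", false)
}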

In my opinion log levels sure make sense, but how they’re used may vary wildly depending on what you’re doing. We run our software in different environments:

  • Development machines / VMs
  • Development boards
  • Production ECUs

And it’s run by different sets of people:

  • Devs
  • Integrators
  • Customers

Depending on the combination of where / who you get different requirements.

I get that logging is hard and often you get messages at the wrong log level, or you’re missing a message at a crucial point, etc. But tracing is not better in every way; they should complement each other.

bahmanm (creator):

Thanks for sharing your insights.


Thinking out loud here…

In my experience with traditional logging and distributed systems, timestamps and request IDs do store the information required to partially reconstruct a timeline:

  • In the case of a linear (single-branch) timeline, you can always “query” by a request ID and order by the timestamps, and that’s pretty much what tracing will do too.
  • Things, however, get complicated when you’ve got a timeline w/ multiple branches.
    For example, consider a request whose execution forks into multiple parallel branches that later join back.
    Reconstructing the causality and join/fork relations between the execution nodes is almost impossible using traditional logs, whereas a tracing solution will turn this into a nice visual w/ all the spans and sub-spans (see the sketch right after this list).
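
To make that concrete, here’s a rough sketch of what instrumenting such a fork/join looks like w/ OpenTelemetry in Go (the function, tracer and span names are all made up, and it assumes a tracer provider has been registered elsewhere):

package main

import (
	"context"
	"sync"

	"go.opentelemetry.io/otel"
)

var tr = otel.Tracer("checkout") // made-up tracer name

func handleRequest(ctx context.Context) {
	// Parent span covering the whole request.
	ctx, span := tr.Start(ctx, "handle_request")
	defer span.End()

	var wg sync.WaitGroup
	for _, name := range []string{"branch_a", "branch_b"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			// Each branch starts its own span from the parent's ctx, so the
			// backend can reconstruct the fork as a tree of child spans.
			_, child := tr.Start(ctx, name)
			defer child.End()
			// ... branch work ...
		}(name)
	}
	wg.Wait() // the join: the parent span only ends after both branches finish
}

func main() {
	// Without a registered SDK provider this runs w/ a no-op tracer,
	// but the shape of the instrumentation is the same.
	handleRequest(context.Background())
}

With traditional logs you’d have to stitch that same picture together from interleaved lines by hand.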

That said, logs do shine when things go wrong: when you start your investigation by using a stacktrace in the logs as a clue. That (the stacktrace) is something I’m not sure a tracing solution will be able to provide.


“they should complement each other”

Yes! You nailed it 💯

Logs are indispensable for troubleshooting (and potentially nothing else) while tracers are great for, well, tracing the data/request throughout the system and analysing the mutations.

bahmanm (creator):

I’m not sure how this got cross-posted! I most certainly didn’t do it 🤷‍♂️
