the case of a leaky goroutine

This commit is contained in:
Wouter Groeneveld 2024-03-20 21:52:47 +01:00
parent 43cd695c2f
commit 51c401246d
2 changed files with 76 additions and 0 deletions

---
title: "The Case Of A Leaky Goroutine"
date: 2024-03-24T09:00:00+01:00
categories:
- programming
tags:
- go
---
In the programming language Go, it's _very_ easy to program something using high-level concurrent patterns thanks to the concept of _Goroutines_ and channels used to signal between them. A Goroutine is essentially a coroutine that maps onto green threads that map onto real native threads on your OS in an `NxM` way. The simple `go func()` prepend-style syntax makes fire-and-forget Goroutines for executing small tasks in parallel trivial.
Or is it? If we are to believe Katherine Cox-Buday, the author of O'Reilly's [Concurrency In Go](https://katherine.cox-buday.com/concurrency-in-go/), it's not:
> Concurrency can be notoriously difficult to get right, but fortunately, the Go programming language was designed with concurrency in mind. In this practical book, you'll learn how Go was written to help introduce and master these concepts, as well as how to use basic concurrency patterns to form large systems that are reliable and remain simple and easy to understand.
That sounds rather optimistic, but the countless _memory leaks in Go and how to avoid them_ articles and [leak detection packages](https://github.com/uber-go/goleak) tell us otherwise. The most common cases of "leaky" Goroutines---routines that live on forever even though we think they've been garbage collected---are neatly laid out by Uber's Georgian-Vlad Saioc in their [LeakProf Goroutine Leak Detection](https://www.uber.com/en-BE/blog/leakprof-featherlight-in-production-goroutine-leak-detection/) system.
We stumbled upon a leaky gut---erm, leaky code?---two weeks ago, when an Out Of Memory kill suddenly restarted Kubernetes pods halfway through workflow runs that, of course, are not quite idempotent. Not knowing where to begin, we fired up Go's profiler `pprof` and got to work. After a day of poking around, we found our own version of a never-ending Goroutine factory. This post summarizes our findings in case they come in handy for others or my future self.
## Identifying the problem (Profiling)
A Grafana dashboard can monitor Goroutine counts and memory usage, and seeing them spike without ever going down is an obvious red flag, but that doesn't give you details just yet. For those, you can stay within the Grafana stack and use [Pyroscope](https://grafana.com/docs/pyroscope/latest/), which charts interactive memory flame graphs based on `pprof` dumps it pulls from your container (provided the whole setup shebang is done right):
![](../pyroscope.jpg "A pyroscope goroutine zoomed in view.")
The chart tells us that `runtime.gopark` is holding onto Goroutines coming from pipeline `func`s we didn't even know existed. Lo and behold, these convert contexts into channels using generic `interface{}`s as part of the pipeline by creating a Goroutine and waiting for the channel to be done---except that it never will be, since the context that's passed in isn't a derived one like `.WithCancel()`. In other words, the context is only cancelled when the whole root request ends---which is never for a background job running on a background context. Whoops. We'll get back to that.
You can also run Pyroscope locally using Docker, by the way: `docker run -it -p 4040:4040 grafana/pyroscope`.
If you don't care about Grafana, no worries: `pprof` comes with an HTTP server, or attaches itself to yours once you add `import _ "net/http/pprof"`. From then on, `/debug/pprof/` is an endpoint from which heap, CPU, and Goroutine profiles can be dumped, for instance with curl. See the [official pprof docs](https://pkg.go.dev/net/http/pprof) and the [Go dev blog entry on pprof](https://go.dev/blog/pprof) for more information.
Once you've managed to get your profile dump, you can analyze it with `go tool pprof [profile_file]`. If you've installed `graphviz`, it'll generate a visual representation of your snapshot, as seen in the aforementioned Go dev blog entry[^gotorch]. The most interesting view is of course a diff between a baseline and a dump taken after lots of leaky Goroutine work---use the `diff_base` flag for that (see the [Go dev blog entry on PGO](https://go.dev/blog/pgo)). Profile percentages are then relative to the first dump.
[^gotorch]: The `pprof` tool recently gained flame graph mode, making Uber's `go-torch` redundant.
Let's get back to that context that's never truly cancelled. This piece of code is the perpetrator. The calling code only had a plain context to work with; instead of deriving from that context with `context.WithCancel(...)`, we used the following func to convert it into a channel:
```go
func ToDoneInterface(done <-chan struct{}) <-chan interface{} {
	interfaceStream := make(chan interface{})
	go func() {
		defer close(interfaceStream)
		select {
		case <-done:
			return
		}
	}()
	return interfaceStream
}
```
The `defer close()` looks like a clean exit, but it only runs once the `select` returns. The `select` blocks until it receives a signal from `done`: either because a value was sent, or because the channel was closed (which yields the zero value). So the Goroutine only closes `interfaceStream` **after** the first receive on the passed `done` channel. If we pass the same channel multiple times---which we do---and that channel outlives every caller---which `ctx.Done()` on a root context does---these Goroutines will leak.
I don't know if that all makes sense if you're not familiar with Go, or even if you are. I know I had to stare at the above code block and its usage context (got it, `Context`? Go joke!) for a good hour before realizing something wasn't as it was supposed to be here.
## Reproducing the problem (Fixing)
There's a neat way to detect memory leaks in tests using the package [goleak](https://github.com/uber-go/goleak):
```go
func TestA(t *testing.T) {
defer goleak.VerifyNone(t)
// test logic here.
}
```
It works by inspecting which Goroutines are still running once the test finishes, after everything should have been cleaned up. We then cooked up a script that spins up a consume/produce cycle using `context.Background()` as the root context, first without cancellation and then with it. The Pyroscope Go API can act as a helpful shortcut to auto-feed profiles straight from your program.
The most systematic way to detect leaky Goroutines early must be Uber's [LeakProf](https://www.uber.com/en-BE/blog/leakprof-featherlight-in-production-goroutine-leak-detection/), a separately deployed system that regularly pulls in `pprof` dumps, enriches them with stack data, closely monitors memory usage, and even files a bug report automatically when the shit hits the fan. I don't think we're there yet!
Conclusion: the problem is often hidden in a small corner... Don't convert channels! Stick with Go's built-in context pattern and derive from the one passed in if needed!
