What is it all about?
Fan-in is a pretty common concurrency pattern in Go, often used to aggregate results from multiple goroutines. For example, when implementing a worker pool to process jobs concurrently, fan-in can be used to collect results from all workers into a single channel. Or imagine you are writing a system that aggregates logs or metrics from different sources. Out of the box, Go provides all the tools needed to build a fan-in implementation easily. All we effectively need to do is write a function that merges several channels into one.
The most standard implementation looks something like this (you can find it in many articles and guides online):
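(A minimal sketch of such a merge function; the name fanIn and the generic signature are illustrative rather than the exact code from the repository.)

```go
import "sync"

// fanIn merges all input channels into a single output channel.
// One goroutine is spawned per input; the output is closed once
// every input has been drained.
func fanIn[T any](inputs ...<-chan T) <-chan T {
	out := make(chan T)

	var wg sync.WaitGroup
	wg.Add(len(inputs))
	for _, in := range inputs {
		go func(in <-chan T) {
			defer wg.Done()
			for v := range in {
				out <- v
			}
		}(in)
	}

	// close the output only after all reader goroutines are done
	go func() {
		wg.Wait()
		close(out)
	}()

	return out
}
```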
Here we create one goroutine per channel to read values from it until it is closed, and we use sync.WaitGroup to close the output channel safely after all goroutines have finished. This code looks clean and idiomatic, and it works quite efficiently. But a person with a background in lower-level languages like c, rust or zig may perceive this implementation as a bit wasteful. Spawning a green thread for each source? That means allocating memory at least to hold the stack of every one of them. So let’s not resist the urge to explore whether there’s a faster or more efficient implementation 🙂.
In search of the most optimal implementation
Let’s start by drafting a simple case with three input channels. In this case we only need one select statement:
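(A sketch of this idea; once a channel is closed it is set to nil, so its case blocks forever and is effectively removed from the select. Names are illustrative.)

```go
// fanIn3 merges exactly three input channels using a single select statement
// inside one goroutine.
func fanIn3[T any](a, b, c <-chan T) <-chan T {
	out := make(chan T)
	go func() {
		defer close(out)
		for a != nil || b != nil || c != nil {
			select {
			case v, ok := <-a:
				if !ok {
					a = nil // closed: disable this case
					continue
				}
				out <- v
			case v, ok := <-b:
				if !ok {
					b = nil
					continue
				}
				out <- v
			case v, ok := <-c:
				if !ok {
					c = nil
					continue
				}
				out <- v
			}
		}
	}()
	return out
}
```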
At first glance this implementation is at least more efficient in the sense that it does not spawn a separate goroutine to read each input channel. But this way we can only handle a fixed number of input channels.
Can we extend this implementation to a dynamic number of input channels not known at compile time? 🤔 Yes, we can! And for that we will use reflection. Yes, I know, I know, using reflection can really slow down your program, but shouldn’t we at least try?
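A sketch of the reflection-based variant, built around reflect.Select (names are illustrative):

```go
import "reflect"

// fanInReflect merges an arbitrary number of input channels using reflect.Select.
// When a channel is closed, its case is removed from the slice; the loop ends
// once no cases remain.
func fanInReflect[T any](inputs ...<-chan T) <-chan T {
	out := make(chan T)
	go func() {
		defer close(out)
		cases := make([]reflect.SelectCase, 0, len(inputs))
		for _, in := range inputs {
			cases = append(cases, reflect.SelectCase{
				Dir:  reflect.SelectRecv,
				Chan: reflect.ValueOf(in),
			})
		}
		for len(cases) > 0 {
			i, v, ok := reflect.Select(cases)
			if !ok {
				// channel closed: drop its case
				cases = append(cases[:i], cases[i+1:]...)
				continue
			}
			out <- v.Interface().(T)
		}
	}()
	return out
}
```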
Now that we have two implementations, we can run some benchmarks to compare them¹. We will run the benchmarks for three use cases:
- “worker pool simulation” - a small number of sources (equal to the number of CPUs);
- “metrics collection simulation” - a moderate number of sources;
- “huge event-based system” - a large number of sources.
For simplicity we will use the int data type for all three use cases².
Let’s start by preparing “testing infrastructure”:
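(A sketch of such a helper; the name runFanIn and its exact shape are assumptions rather than the code behind the published numbers.)

```go
import "testing"

// runFanIn creates numChans input channels, starts one sender goroutine per
// channel (each sending msgsPerChan values), merges them with the given
// fan-in function and drains the merged output.
func runFanIn(b *testing.B, numChans, msgsPerChan int, merge func(...<-chan int) <-chan int) {
	b.Helper()
	for i := 0; i < b.N; i++ {
		inputs := make([]<-chan int, numChans)
		for c := 0; c < numChans; c++ {
			ch := make(chan int)
			go func() {
				for m := 0; m < msgsPerChan; m++ {
					ch <- m
				}
				close(ch)
			}()
			inputs[c] = ch
		}
		// drain everything the fan-in produces
		for range merge(inputs...) {
		}
	}
}
```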
All our benchmarks look the same; the only differences are the number of input channels and the exact fan-in method under test. For the worker pool simulation we use a number of input channels equal to runtime.NumCPU(), for the metrics simulation we use 100 input channels, and to simulate a huge event system we use 1000 input channels. The number of messages to be sent is always ten times the number of input channels, so each channel sends 10 messages.
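Each scenario then becomes a short benchmark per fan-in variant, roughly like this (shown for the canonical fanIn sketch from above; the other variants just swap the merge function):

```go
import (
	"runtime"
	"testing"
)

// Each channel always sends 10 messages; the scenarios differ only in the
// number of input channels.
func BenchmarkWorkerPoolCanonical(b *testing.B) {
	runFanIn(b, runtime.NumCPU(), 10, fanIn[int])
}

func BenchmarkMetricsCanonical(b *testing.B) {
	runFanIn(b, 100, 10, fanIn[int])
}

func BenchmarkHugeSourceCountCanonical(b *testing.B) {
	runFanIn(b, 1000, 10, fanIn[int])
}
```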
So let’s look at our first results:
goos: linux
goarch: amd64
pkg: github.com/x-dvr/go_experiments/fanin
cpu: Intel(R) Core(TM) i7-10870H CPU @ 2.20GHz
BenchmarkWorkerPoolCanonical-16 34435 34947 ns/op 779 B/op 16 allocs/op
BenchmarkWorkerPoolReflect-16 4508 241188 ns/op 174899 B/op 3216 allocs/op
BenchmarkMetricsCanonical-16 4651 261056 ns/op 4885 B/op 100 allocs/op
BenchmarkMetricsReflect-16 122 9791728 ns/op 7368072 B/op 104134 allocs/op
BenchmarkHugeSourceCountCanonical-16 309 3888395 ns/op 50734 B/op 1015 allocs/op
BenchmarkHugeSourceCountReflect-16 1 1221015673 ns/op 694551600 B/op 10042288 allocs/op
That is quite a disappointing result for potential fans of reflection. In the worker pool use case, the canonical implementation outperformed the reflection-based one by almost 7 times; in the metrics use case, by 37 times; and in the huge event system, by a whopping 314 times! What’s also hard to miss is the number of allocations made by the reflection-based implementation. In the canonical implementation the allocations come from each sending goroutine (I couldn’t find a cleaner way to measure message-passing performance). But in the reflection-based version the number of allocations is proportional to the product of the number of input channels and the number of messages sent. If we run the reflection benchmark with the option to collect a memory profile, we’ll see that 99% of the memory is allocated inside the reflect.Select function: on every call it makes an allocation for each select case in the list.
In order to reduce the number of goroutines required to process the input channels, we can try two more approaches:
- split the input channels into groups of 4, spawn one goroutine per group, and use the select operator;
- spawn only one goroutine with a loop over all input channels, using select with a default case to avoid blocking.
The second variant can be much more wasteful due to the constant non-blocking iteration over all inputs. But let’s see the actual benchmarks before drawing any conclusions.
Our “split by four” implementation can look like this:
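(A sketch of the idea; each group is padded to four channels, and a closed channel is set to nil so its case never fires again. Names are illustrative.)

```go
import "sync"

// fanInBatch splits the inputs into groups of up to four channels and spawns
// one goroutine per group. Each goroutine multiplexes its group with a single
// select; nil (missing or closed) channels block forever and are never chosen.
func fanInBatch[T any](inputs ...<-chan T) <-chan T {
	out := make(chan T)
	var wg sync.WaitGroup
	for start := 0; start < len(inputs); start += 4 {
		var chs [4]<-chan T
		copy(chs[:], inputs[start:]) // copies at most 4 channels, the rest stay nil
		wg.Add(1)
		go func(chs [4]<-chan T) {
			defer wg.Done()
			open := 0
			for _, ch := range chs {
				if ch != nil {
					open++
				}
			}
			for open > 0 {
				select {
				case v, ok := <-chs[0]:
					if !ok {
						chs[0], open = nil, open-1
						continue
					}
					out <- v
				case v, ok := <-chs[1]:
					if !ok {
						chs[1], open = nil, open-1
						continue
					}
					out <- v
				case v, ok := <-chs[2]:
					if !ok {
						chs[2], open = nil, open-1
						continue
					}
					out <- v
				case v, ok := <-chs[3]:
					if !ok {
						chs[3], open = nil, open-1
						continue
					}
					out <- v
				}
			}
		}(chs)
	}
	// close the output once every group goroutine is done
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}
```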
And the final “one big loop” implementation is pretty straightforward:
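(A sketch of this approach; names are illustrative.)

```go
// fanInLoop uses a single goroutine that repeatedly iterates over all inputs,
// doing a non-blocking receive on each one (select with a default case).
// Closed channels are set to nil and skipped on subsequent passes.
func fanInLoop[T any](inputs ...<-chan T) <-chan T {
	out := make(chan T)
	chs := make([]<-chan T, len(inputs))
	copy(chs, inputs) // work on a copy so the caller's slice is not modified
	go func() {
		defer close(out)
		open := len(chs)
		for open > 0 {
			for i, ch := range chs {
				if ch == nil {
					continue
				}
				select {
				case v, ok := <-ch:
					if !ok {
						chs[i] = nil // closed: skip it on the next passes
						open--
						continue
					}
					out <- v
				default:
					// nothing ready on this channel, move on
				}
			}
		}
	}()
	return out
}
```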
Time for the second round of benchmarks. I removed the reflection-based implementation due to its extreme inefficiency. The results for all remaining implementations are in the table below:
BenchmarkWorkerPoolCanonical-16 33226 35925 ns/op 780 B/op 16 allocs/op
BenchmarkWorkerPoolBatch-16 24518 48428 ns/op 769 B/op 16 allocs/op
BenchmarkWorkerPoolLoop-16 12849 93811 ns/op 770 B/op 16 allocs/op
BenchmarkMetricsCanonical-16 4347 265471 ns/op 4891 B/op 100 allocs/op
BenchmarkMetricsBatch-16 3796 307656 ns/op 4828 B/op 100 allocs/op
BenchmarkMetricsLoop-16 5588 205936 ns/op 4885 B/op 101 allocs/op
BenchmarkHugeSourceCountCanonical-16 300 3988277 ns/op 50906 B/op 1015 allocs/op
BenchmarkHugeSourceCountBatch-16 433 2691683 ns/op 130797 B/op 2163 allocs/op
BenchmarkHugeSourceCountLoop-16 825 1447144 ns/op 48899 B/op 1009 allocs/op
Instead of a conclusion
So, what can we learn from all this? First of all, there’s more than one way to implement fan-in in Go — but not all approaches are created equal.
The canonical implementation, while simple and idiomatic, performs remarkably well across all scenarios. Despite spawning one goroutine per input, it remains the fastest choice for small to moderately sized workloads, outperforming the batched version by 1.3 times and the loop version by 2.6 times. Even in large setups it held its own, losing significantly only to the loop-based version, proving that sometimes the obvious solution is the right one.
The reflection-based approach, though flexible and elegant at first glance, comes at a significant cost. Its poor performance and massive memory allocation make it unsuitable for any performance-critical application. This experiment reaffirms a common Go wisdom: avoid reflection in the hot path.
Our attempt to reduce goroutines with the batching strategy (using select over fixed groups of inputs) yielded mixed results. While it outperformed the canonical version in the large-scale test by about 1.5 times, it fell short of the canonical implementation for small and medium-sized workloads. It’s a valid tradeoff in extreme situations where goroutine overhead must be minimized, but not a general-purpose win.
Surprisingly, the loop-based approach (using a single goroutine with non-blocking selects) performed extremely well for large and medium input sets. Its simple structure and lower pressure on the Go scheduler make it an interesting choice when you need to avoid spawning many goroutines. In the heavy-load scenario it outperformed the canonical implementation by almost 2.8 times. However, its inefficiency in smaller setups and potential CPU waste³ make it a niche tool.
TL;DR?
- Canonical: Best all-around. Fast, simple, idiomatic. Use it by default.
- Reflection-based: Avoid for performance-sensitive code.
- Batch Select: A good option for high-scale setups, if you’re OK with some extra code complexity.
- Loop-based: Works best for tons of input channels, but can potentially be CPU-hungry³.
At the end of the day, Go gives us several ways to implement fan-in, and it’s up to us to pick the one that best fits our needs. If you’re writing highly concurrent systems in Go, don’t make assumptions: it’s worth benchmarking your own workload.
1. Full source code including benchmarks can be found in my repository inside the fanin directory. ↩︎
2. I did run benchmarks with more complex structured data types, but this did not change the overall picture. ↩︎
3. I ran both worker pool benchmarks for 20 seconds using perf. The loop-based version executed 276 billion instructions, compared to 166 billion for the canonical one. Interestingly, despite doing more work, the loop version only used about 1.61 logical CPUs on average, while the canonical version used 1.86. ↩︎ ↩︎