An Incomplete Guide to Optimizing the Throughput of GNURadio Flowgraphs
by [SAn]

Streamlining the performance of a GNURadio flowgraph, especially with real-time constraints, can be challenging. This guide aims to highlight key considerations and offer insights into potential approaches.
## Context
- The flowgraph uses an SDR as either a source, sink, or both (or elements with real-time constraints).
- The aim is to increase throughput without causing SDR overflows or underflows.
## Things to consider
- Hierarchical blocks can have "real" blocks inside. From now on, when we refer to a block, we mean a non-hierarchical one.
- Each block gets one OS thread where its work function runs.
- The size of the block output buffers and of the SDR buffer. Together they set the latency budget that the flowgraph must meet to avoid underruns (see the short worked example after this list).
- The average throughput of the SDR (the sample rate).
- The number of CPUs. Only one block can run per CPU at a given time, so blocks take turns if there are fewer CPUs than blocks.
- The number of blocks in the flowgraph.
- Increasing the buffer sizes may improve throughput, but it will increase the total system latency (which may or may not matter depending on the use case).
- CPU cache is at least one order of magnitude faster than RAM. Sometimes expanding a work function to do the task of two work functions improves throughput and lowers latency.
- A sweet spot of buffer size may exist where all the processing is done in the CPU cache, so bigger is not always faster.
- If Simultaneous Multithreading is enabled (hyperthreading), the CPU cache is shared between "CPU threads".
- RAM is a shared resource: only one CPU can read or write RAM at a time (although multi-channel memory allows some parallel access).
- Understand how GNURadio buffers work. Read this great article: Behind the Veil: A Peek at GNU Radio’s Buffer Architecture.
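To make the buffer-versus-latency point concrete, here is a small worked example; the sample rate and buffer size are made-up illustrative values, not figures from this article:

```python
# Illustrative numbers only: a 10 Msps SDR and an 8192-item output buffer.
sample_rate = 10e6     # samples per second (assumed)
buffer_items = 8192    # items in one output buffer (assumed)

# One buffer's worth of samples corresponds to this much air time.
# If the flowgraph cannot fill/drain a buffer within roughly this time,
# the SDR will underrun (TX) or overflow (RX).
buffer_latency = buffer_items / sample_rate
print(f"one buffer = {buffer_latency * 1e3:.3f} ms")   # -> 0.819 ms
```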
## Measuring and profiling
### Categorize Blocks
Identify and categorize blocks as fast or slow based on their operational speed.
#### Utilize htop
Run the flowgraph alongside htop to spot the names of the blocks with the highest CPU usage. Use the Time column, and configure htop to show userland threads with their names.
#### Employ GNURadio perfcounters
Run the flowgraph and use the performance counters to get a sorted list of the `work_time_total` per block (hint: run with `GR_CONF_PERFCOUNTERS_ON=1` and call `block.pc_work_time_total()` for each block).
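A minimal sketch of how that could look; the specific blocks (a noise source feeding a multiply-const and a null sink) are placeholders for your own flowgraph, and the name-to-block mapping is maintained by hand:

```python
# Sketch: rank blocks by total time spent in work() using perf counters.
# Run with perf counters enabled, e.g.:
#   GR_CONF_PERFCOUNTERS_ON=1 python3 profile_flowgraph.py
import time
from gnuradio import gr, blocks, analog

tb = gr.top_block()
src = analog.noise_source_c(analog.GR_GAUSSIAN, 1.0)   # placeholder blocks
mul = blocks.multiply_const_cc(0.5)
snk = blocks.null_sink(gr.sizeof_gr_complex)
tb.connect(src, mul, snk)

tb.start()
time.sleep(5)            # let the flowgraph run for a while
tb.stop()
tb.wait()

# Keep our own name -> block mapping and sort by total work() time.
named_blocks = {"noise_source": src, "multiply_const": mul, "null_sink": snk}
for name, blk in sorted(named_blocks.items(),
                        key=lambda kv: kv[1].pc_work_time_total(),
                        reverse=True):
    print(f"{name:20s} {blk.pc_work_time_total():.0f}")
```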
### Benchmarking the flowgraph peak throughput (without real time considerations)
Understanding both peak and average throughput can unveil invaluable insights into the performance and efficiency of the system. The peak throughput is the maximum rate at which data can be processed. It provides insights into the system’s capacity under optimal conditions. The average throughput is the mean rate at which data is typically processed, often determined by the SDR's sample rate.
Comparing these two metrics gives the throughput efficiency ratio. A higher ratio indicates that the system has ample capacity but may not be optimally configured or efficiently utilized. Consider a scenario where the peak throughput is 100Msps (mega samples per second), while the SDR’s sample rate is 10Msps. This 10:1 ratio indicates a system with substantial untapped potential.
To measure the peak throughput, run the flowgraph (or subparts of it) with a null sink instead of the SDR and a fast source such as a constant or noise source. To time the execution, make every run process the same quantity of items, for example by setting `max_noutput_items` as in `tb.start(max_noutput_items=100_000_000)`.
Another way is to use a file source reading a file placed in tmpfs, such as under /tmp/, but this has a bigger impact on the rest of the system.
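A minimal benchmarking sketch along those lines; the block under test is a placeholder, and a `blocks.head` block (an assumption, used here instead of relying on `max_noutput_items`) bounds every run to the same number of items:

```python
# Sketch: measure peak throughput of a (sub-)flowgraph without the SDR.
import time
from gnuradio import gr, blocks, analog

N_ITEMS = 100_000_000

tb = gr.top_block()
src = analog.noise_source_c(analog.GR_GAUSSIAN, 1.0)   # fast source
dut = blocks.multiply_const_cc(0.5)                    # placeholder for the blocks under test
head = blocks.head(gr.sizeof_gr_complex, N_ITEMS)      # stop after N_ITEMS samples
snk = blocks.null_sink(gr.sizeof_gr_complex)           # no SDR, no real-time constraint
tb.connect(src, dut, head, snk)

t0 = time.monotonic()
tb.run()                                               # start and wait until head finishes
elapsed = time.monotonic() - t0
print(f"peak throughput: {N_ITEMS / elapsed / 1e6:.1f} Msps")
```

Dividing the measured peak rate by the SDR sample rate gives the throughput efficiency ratio discussed above.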
## Ideas to increase throughput
- Optimize the slow blocks, starting with the slowest. Some strategies:
- Switch to faster algorithms
- Reduce the number of taps of filters
- Optimize the code (a big topic for another post)
- Split the block into two or more blocks. This can improve or worsen performance.
- Combine multiple fast blocks into one. This reduces the number of OS threads and therefore context switches. To do it, write a new block that does the work of a couple of blocks that are contiguous in the flowgraph (see the first sketch after this list).
- Increase the buffer size of the blocks with `block.set_min_output_buffer(size)`.
- Set thread affinities and priorities (use `block.set_thread_priority` and `block.set_processor_affinity`; see the second sketch after this list). I don't know a general rule, but here are some ideas that may improve things (or make them worse, depending on the use case):
- Give higher priority to the SDR and the slower blocks.
- Pin slower blocks to specific CPUs (for example the slowest to CPU7, the second slowest to CPU6) and pin the faster blocks to the remaining CPUs (CPU0 to CPU5). The idea is to reduce the chance that the OS scheduler moves a slow block and destroys its cache.
- Pin contiguous blocks to the same core. Contiguous blocks in the flowgraph share a buffer (the output of one block is the input of the next), so sharing the core also shares its cache. Take into consideration that with hyperthreading enabled there are two "CPU threads" per core.
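To illustrate the "combine multiple fast blocks into one" idea, here is a minimal sketch of an embedded Python block that fuses two placeholder operations (multiply by a constant, then add a constant) into a single work call; the class name and parameters are made up for the example, and a real implementation would usually be written in C++ for speed:

```python
# Sketch: one block doing the work of two contiguous blocks.
import numpy as np
from gnuradio import gr

class multiply_add_cc(gr.sync_block):
    """Fuses 'multiply by a constant' and 'add a constant' into one work(),
    so the data is touched once and only one OS thread is used."""
    def __init__(self, mult=0.5, offset=1 + 0j):
        gr.sync_block.__init__(self,
                               name="multiply_add_cc",
                               in_sig=[np.complex64],
                               out_sig=[np.complex64])
        self.mult = mult
        self.offset = offset

    def work(self, input_items, output_items):
        # Both operations run on the same chunk while it is (hopefully) still in cache.
        np.multiply(input_items[0], self.mult, out=output_items[0])
        output_items[0] += self.offset
        return len(output_items[0])
```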
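And a sketch of the buffer, priority, and affinity knobs from the last items above; the blocks, CPU numbers, buffer size, and priority value are arbitrary examples, not recommendations:

```python
# Sketch: applying set_min_output_buffer, set_thread_priority and
# set_processor_affinity. Blocks, CPU numbers and priority are examples only.
import time
from gnuradio import gr, blocks, analog

tb = gr.top_block()
src = analog.noise_source_c(analog.GR_GAUSSIAN, 1.0)
slow = blocks.multiply_const_cc(0.5)       # stand-in for a slow block
fast = blocks.add_const_cc(1 + 0j)         # stand-in for a fast block
snk = blocks.null_sink(gr.sizeof_gr_complex)

# Buffer sizes must be set before the flowgraph is started
# (buffers are allocated at start time).
slow.set_min_output_buffer(65536)

tb.connect(src, slow, fast, snk)
tb.start()

slow.set_thread_priority(50)               # may need privileges (e.g. RLIMIT_RTPRIO)
slow.set_processor_affinity([7])           # pin the slowest block to CPU 7
fast.set_processor_affinity([0, 1, 2, 3])  # fast blocks share the remaining CPUs

time.sleep(10)                             # let it run for a while
tb.stop()
tb.wait()
```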