Separate UDP GSO and GRO controls for Linux (--gso and --gro options) #2037

@roaoliver

[Enhancement] Add independent control for UDP GSO and GRO (--gso, --gro)

Problem Statement

Currently, iperf3 supports UDP Generic Segmentation Offload (GSO) and Generic Receive Offload (GRO) via a single combined flag: --gsro. While this is convenient, it forces both offloads to be active simultaneously.

Testing shows that GSO (sender-side) is almost universally beneficial, whereas GRO (receiver-side) can saturate the receiver CPU depending on the hardware and kernel version. There is currently no way to benefit from GSO without also incurring the receiver-side overhead of GRO.

Proposed Solution

I propose splitting the offload controls into two independent flags while maintaining backward compatibility:

  1. --gso: Enables only UDP_SEGMENT on the sender.
  2. --gro: Enables only UDP_GRO on the receiver.
  3. --gsro: Remains as a legacy/convenience flag that enables both.
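
The intended flag semantics can be sketched as follows. This is an illustration in Python rather than iperf3's C option parser, and the function name parse_offload_flags is hypothetical; it only shows how the legacy --gsro flag would reduce to the two independent settings.

```python
import argparse

def parse_offload_flags(argv):
    """Return (gso_enabled, gro_enabled) for a list of command-line arguments."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--gso", action="store_true", help="sender-side UDP_SEGMENT only")
    parser.add_argument("--gro", action="store_true", help="receiver-side UDP_GRO only")
    parser.add_argument("--gsro", action="store_true", help="legacy: enable both")
    args = parser.parse_args(argv)
    # --gsro is treated as shorthand for --gso plus --gro.
    return (args.gso or args.gsro, args.gro or args.gsro)
```

With this mapping, existing scripts that pass --gsro keep their current behavior, while --gso alone leaves the receiver untouched.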

Technical Motivation & Test Evidence

Testing across different Network Interface Cards (NICs) reveals that a "one-size-fits-all" approach to offloading is suboptimal.

Test Environments

  • Env 1: Intel X552 (10GbE) | Ubuntu 22.04 (Kernel 6.8)
  • Env 2: Broadcom BCM57800 (10GbE) | Ubuntu 22.04 (Kernel 5.15)
  • Env 3: Mellanox ConnectX-5 (25GbE+) | Ubuntu 24.04 (Kernel 6.17)

Comparative Data

Setup    | Mode     | Sender CPU | Receiver CPU | Notes
Intel    | Baseline | 78.3%      | 43.1%        |
Intel    | --gsro   | 32.1%      | 100.0%       | Receiver saturated by GRO
Intel    | --gso    | 32.0%      | 41.4%        | Optimal efficiency
Broadcom | Baseline | 99.9%      | 100.0%       | 9.92 Gbps
Broadcom | --gsro   | 70.6%      | 99.5%        | 9.92 Gbps
Broadcom | --gso    | 72.1%      | 98.6%        | 9.92 Gbps (Stable)
Mellanox | Baseline | 99.8%      | 76.8%        | Bitrate stuck at 18.7G
Mellanox | --gsro   | 66.0%      | 99.8%        | Bitrate reached 25G
Mellanox | --gso    | 65.7%      | 96.5%        | Bitrate reached 25G

Observations & Hardware Variance

  • The Intel Bottleneck: On the X552, GRO is extremely "expensive," saturating the receiver CPU. Splitting the flags allows us to keep the roughly 46-percentage-point reduction in sender CPU (78.3% down to 32.0% via GSO) without saturating the receiver.
  • The Broadcom Stability: Unlike the Intel card, the Broadcom BCM57800 shows similar behavior whether using --gsro or --gso. While it doesn't suffer as much from GRO overhead, providing independent flags ensures consistent behavior across different testing toolsets.
  • The Mellanox Throughput: On high-speed ConnectX-5 cards, GSO is the difference between hitting line rate (25G) and being CPU-bound at 18.7G. As in the Intel tests, using only GSO lowers receiver CPU (96.5% vs. 99.8% with --gsro), although the margin is much smaller here.
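
To make the sender-side savings concrete, the short calculation below derives both the percentage-point drop and the relative drop from the table figures. It is only a worked-arithmetic sketch; the dictionary layout and the cpu_saving helper are illustrative names, and the numbers are copied from the Comparative Data table.

```python
# Sender CPU utilisation (%) copied from the Comparative Data table.
SENDER_CPU = {
    "Intel": {"baseline": 78.3, "gso": 32.0},
    "Broadcom": {"baseline": 99.9, "gso": 72.1},
    "Mellanox": {"baseline": 99.8, "gso": 65.7},
}

def cpu_saving(baseline, with_gso):
    """Return (percentage-point drop, relative drop in percent)."""
    points = baseline - with_gso
    relative = 100.0 * points / baseline
    return points, relative

for nic, cpu in SENDER_CPU.items():
    points, relative = cpu_saving(cpu["baseline"], cpu["gso"])
    print(f"{nic}: -{points:.1f} points ({relative:.1f}% relative)")
```

For the Intel card this gives a drop of about 46 points, or about 59% relative to the baseline. The Mellanox figures should be read with care, since the baseline and --gso runs were measured at different bitrates (18.7G vs. 25G).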

Conclusion: Because different NIC models (Intel vs. Broadcom vs. Mellanox) handle offloads differently, users need granular control to avoid artificial bottlenecks during performance validation.

Observations

  • GSO Efficiency: In all environments, GSO substantially reduced sender CPU load or allowed higher bitrates, since segmentation is deferred to a late stage of the kernel stack or to the NIC where the hardware supports it.
  • GRO Bottleneck: In Environments 1 and 3, enabling GRO drove the receiver CPU to saturation (100.0% and 99.8%). Being able to disable GRO while keeping GSO active allows higher-performance tests without a receiver-side bottleneck.

Implementation Details

The changes involve:

  • Updating iperf_api.h/c to include the new boolean flags in the test settings.
  • Modifying the logic in the UDP stream handlers to check for --gso or --gro specifically.
  • Updating documentation in iperf3.1 (man page).
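
As a rough illustration of the per-socket logic, the sketch below applies each offload independently. It is written in Python rather than iperf3's C, and apply_udp_offloads and its parameters are hypothetical names, not part of iperf3. The option numbers 103 (UDP_SEGMENT) and 104 (UDP_GRO) come from Linux's linux/udp.h; the default segment size of 1472 bytes assumes IPv4 over a 1500-byte MTU.

```python
import socket

# Option numbers from Linux's <linux/udp.h>; Python's socket module may not
# export them by name, so they are spelled out here.
UDP_SEGMENT = 103  # sender side: GSO segment size in bytes
UDP_GRO = 104      # receiver side: enable GRO coalescing

def apply_udp_offloads(sock, gso=False, gro=False, segment_size=1472):
    """Try to enable the requested offloads; return which ones took effect."""
    applied = {"gso": False, "gro": False}
    if gso:
        try:
            sock.setsockopt(socket.IPPROTO_UDP, UDP_SEGMENT, segment_size)
            applied["gso"] = True
        except OSError:
            pass  # kernel or platform without UDP_SEGMENT support
    if gro:
        try:
            sock.setsockopt(socket.IPPROTO_UDP, UDP_GRO, 1)
            applied["gro"] = True
        except OSError:
            pass  # kernel or platform without UDP_GRO support
    return applied
```

Keeping the two setsockopt calls separate is the point of the proposal: a sender would call this with gso=True only, and a receiver would enable gro only when asked to.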

Use Cases

  • Precise performance tuning for high-speed (10G/40G/100G) networks.
  • Isolating kernel stack vs. hardware offload issues during debugging.
  • Supporting environments where only the sender hardware supports offloading.
