This change adds a struct mctp_dst, representing the result of a routing
lookup. This decouples the struct mctp_route from the actual
implementation of a routing operation.
This will allow for future routing changes which may require more
involved lookup logic, such as gateway routing - which may require
multiple traversals of the routing table.
Since we only use the struct mctp_route at lookup time, we no longer
hold routes over a routing operation, as we only need it to populate the
dst. However, we do hold the dev while the dst is active.
This requires some changes to the route test infrastructure, as we no
longer have a mock route to handle the route output operation, and
transient dsts are created by the routing code, so we can't override
them as easily.
Instead, we use kunit->priv to stash a packet queue, and a custom
dst_output function queues into that packet queue, which we can use for
later expectations.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-3-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In our input_cloned_frag test, we currently allocate our test buffers
arbitrarily-sized at 100 bytes.
We only expect to receive a max of 15 bytes from the socket, so reduce
to a more appropriate size. There are some upcoming changes to the
routing code which hit a frame-size limit on s390, so reduce the usage
before that lands.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-2-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Stacking technology is a type of technology used to expand ports on
Ethernet switches. It is widely used as a common access method in
large-scale Internet data center architectures. Years of practice
have proved that stacking technology has advantages and disadvantages
in high-reliability network architecture scenarios. For instance,
in stacking networking arch, conventional switch system upgrades
require multiple stacked devices to restart at the same time.
Therefore, it is inevitable that the business will be interrupted
for a while. It is for this reason that "no-stacking" in data centers
has become a trend. Additionally, when the stacking link connecting
the switches fails or is abnormal, the stack will split. Although it is
not common, it still happens in actual operation. The problem is that
after the split, it is equivalent to two switches with the same
configuration appearing in the network, causing network configuration
conflicts and ultimately interrupting the services carried by the
stacking system.
To improve network stability, "non-stacking" solutions have been
increasingly adopted, particularly by public cloud providers and
tech companies like Alibaba, Tencent, and Didi. "non-stacking" is
a method of mimicing switch stacking that convinces a LACP peer,
bonding in this case, connected to a set of "non-stacked" switches
that all of its ports are connected to a single switch
(i.e., LACP aggregator), as if those switches were stacked. This
enables the LACP peer's ports to aggregate together, and requires
(a) special switch configuration, described in the linked article,
and (b) modifications to the bonding 802.3ad (LACP) mode to send
all ARP/ND packets across all ports of the active aggregator.
Note that, with multiple aggregators, the current broadcast mode
logic will send only packets to the selected aggregator(s).
+-----------+ +-----------+
| switch1 | | switch2 |
+-----------+ +-----------+
^ ^
| |
+-----------------+
| bond4 lacp |
+-----------------+
| |
| NIC1 | NIC2
+-----------------+
| server |
+-----------------+
- https://www.ruijie.com/fr-fr/support/tech-gallery/de-stack-data-center-network-architecture/
Cc: Jay Vosburgh <jv@jvosburgh.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com>
Link: https://patch.msgid.link/84d0a044514157bb856a10b6d03a1028c4883561.1751031306.git.tonghao@bamaicloud.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Mark Bloch says:
====================
net/mlx5: HWS, Optimize matchers ICM usage
This series optimizes ICM usage for unidirectional rules and
empty matchers and with the last patch we make hardware steering
the default FDB steering provider for NICs that don't support software
steering.
Hardware steering (HWS) uses a type of rule table container (RTC) that
is unidirectional, so matchers consist of two RTCs to accommodate
bidirectional rules.
This small series enables resizing the two RTCs independently by
tracking the number of rules separately. For extreme cases where all
rules are unidirectional, this results in saving close to half the
memory footprint.
Results for inserting 1M unidirectional rules using a simple module:
Pages Memory
Before this patch: 300k 1.5GiB
After this patch: 160k 900MiB
The 'Pages' column measures the number of 4KiB pages the device requests
for itself (the ICM).
The 'Memory' column is the difference between peak usage and baseline
usage (before starting the test) as reported by `free -h`.
In addition, second to last patch of the series handles a case where all
the matcher's rules were deleted: the large RTCs of the matcher are no
longer required, and we can save some more ICM by shrinking the matcher
to its initial size.
Finally the last patch makes hardware steering the default mode
when in swichdev for NICs that don't have software steering support.
====================
Link: https://patch.msgid.link/20250703185431.445571-1-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matcher size is dynamic: it starts at initial size, and then it grows
through rehash as more and more rules are added to this matcher.
When rules are deleted, matcher's size is not decreased. Rehash
approach is greedy. The idea is: if the matcher got to a certain size
at some point, chances are - it will get to this size again, so it is
better to avoid costly rehash operations whenever possible.
However, when all the rules of the matcher are deleted, this should
be viewed as special case. If the matcher actually got to the point
where it has zero rules, it might be an indication that some usecase
from the past is no longer happening. This is where some ICM can be
freed.
This patch handles this case: when a number of rules in a matcher
goes down to zero, the matcher's tables are shrunk to the initial
size.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250703185431.445571-10-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Track and grow matcher sizes individually for RX and TX RTCs. This
allows RX-only or TX-only use cases to effectively halve the device
resources they use.
For testing we used a simple module that inserts 1M RX-only rules and
measured the number of pages the device requests, and memory usage as
reported by `free -h`.
Pages Memory
Before this patch: 300k 1.5GiB
After this patch: 160k 900MiB
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250703185431.445571-8-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matchers were using the pool abstraction solely as a convenience
to allocate two STE ranges. The pool's core functionality, that
of allocating individual items from the range, was unused.
Matchers rely either on the hardware to hash rules into a table,
or on a user-provided index.
Remove the STE pool from the matcher and allocate the STE ranges
manually instead.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250703185431.445571-6-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2025-07-03
Vladimir Oltean converts Intel drivers (ice, igc, igb, ixgbe, i40e) to
utilize new timestamping API (ndo_hwtstamp_get() and ndo_hwtstamp_set()).
For ixgbe:
Paul, Don, Slawomir, and Radoslaw add Malicious Driver Detection (MDD)
support for X550 and E610 devices to detect, report, and handle
potentially malicious VFs.
Simon Horman corrects spelling mistakes.
For igbvf:
Kohei Enju removes a couple of unreported counters and adds reporting
of Tx timeouts.
* '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
igbvf: add tx_timeout_count to ethtool statistics
igbvf: remove unused interrupt counter fields from struct igbvf_adapter
ixgbe: spelling corrections
ixgbe: turn off MDD while modifying SRRCTL
ixgbe: add Tx hang detection unhandled MDD
ixgbe: check for MDD events
ixgbe: add MDD support
i40e: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
ixgbe: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
igb: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
igc: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
ice: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
====================
Link: https://patch.msgid.link/20250703174242.3829277-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King says:
====================
net: phylink: support !autoneg configuration for SFPs
This series comes from discussion during a patch series that was posted
at the beginning of April, but these patches were never posted (I was
too busy!)
We restrict ->sfp_interfaces to those that the host system supports,
and ensure that ->sfp_interfaces is cleared when a SFP is removed. We
then add phylink_sfp_select_interface_speed() which will select an
appropriate interface from ->sfp_interfaces for the speed, and use that
in our phylink_ethtool_ksettings_set() when a SFP bus is present on a
directly connected host (not with a PHY.)
====================
Link: https://patch.msgid.link/aGT_hoBELDysGbrp@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vikas Gupta says:
====================
Introducing Broadcom BNGE Ethernet Driver
This patch series introduces the Ethernet driver for Broadcom’s
BCM5770X chip family, which supports 50/100/200/400/800 Gbps
link speeds. The driver is built as the bng_en.ko kernel module.
To keep the series within a reviewable size (~5K lines of code), this initial
submission focuses on the core infrastructure and initialization, including:
1) PCIe support (device IDs, probe/remove)
2) Devlink support
3) Firmware communication mechanism
4) Creation of network device
5) PF Resource management (rings, IRQs, etc. for netdev & aux dev)
Support for Tx/Rx datapaths, link management, ethtool/devlink operations
and additional features will be introduced in the subsequent patch series.
The bng_en driver shares the bnxt_hsi.h file with the bnxt_en driver,
as the bng_en driver leverages the hardware communication protocol
used by the bnxt_en driver.
====================
Link: https://patch.msgid.link/20250701143511.280702-1-vikas.gupta@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Query resources from the firmware and, based on the
availability of resources, initialize the default
settings. The default settings include:
1. Rings and other resource reservations with the
firmware. This ensures that resources are reserved
before network and auxiliary devices are created.
2. Mapping the BAR, which helps access doorbells since
its size is known after querying the firmware.
3. Retrieving the TCs and hardware CoS queue mappings.
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Bhargava Chenna Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250701143511.280702-10-vikas.gupta@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Get the resources and capabilities from the firmware.
Add functions to manage the resources with the firmware.
These functions will help netdev reserve the resources
with the firmware before registering the device in future
patches. The resources and their information, such as
the maximum available and reserved, are part of the members
present in the bnge_hw_resc struct.
The bnge_reserve_rings() function also populates
the RSS table entries once the RX rings are reserved with
the firmware.
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Bhargava Chenna Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250701143511.280702-8-vikas.gupta@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support to communicate with the firmware.
Future patches will use these functions to send the
messages to the firmware.
Functions support allocating request/response buffers
to send a particular command. Each command has certain
timeout value to which the driver waits for response from
the firmware. In error case, commands may be either timed
out waiting on response from the firmware or may return
a specific error code.
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Bhargava Chenna Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250701143511.280702-4-vikas.gupta@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Breno Leitao says:
====================
netpoll: Factor out functions from netpoll_send_udp() and add ipv6 selftest
Refactors the netpoll UDP transmit path to improve code clarity,
maintainability, and protocol-layer encapsulation.
Function netpoll_send_udp() has more than 100 LoC, which is hard to
understand and review. After this patchset, it has only 32 LoC, which is
more manageable.
The series systematically moves the construction of protocol headers
(UDP, IPv4, IPv6, Ethernet) out of the core `netpoll_send_udp()`
function into dedicated static helpers:
- `push_udp()` for UDP header setup
- `push_ipv4()` and `push_ipv6()` for IP header setup
- `push_eth()` for Ethernet header setup
This results in a clean, layered abstraction that mirrors the protocol
stack, reduces code duplication, and improves readability.
Also, to make sure this is not breaking anything, add IPv6 selftest to
netconsole tests, which will exercise this code. This test would also pick
problems similiar to the one fixed by f599020702 ("net: netpoll:
Initialize UDP checksum field before checksumming"), which was
embarrassin we didn't have a selftest catch it.
Anyway, there are **no functional changes** intended in this patchset.
v1: https://lore.kernel.org/20250627-netpoll_untagle_ip-v1-0-61a21692f84a@debian.org
====================
Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-0-13cf3db24e2b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add IPv6 support to the netconsole basic functionality tests by:
- Introducing separate IPv4 and IPv6 address variables (SRCIP4/SRCIP6,
DSTIP4/DSTIP6) to replace the single SRCIP/DSTIP variables
- Adding select_ipv4_or_ipv6() function to choose protocol version
- Updating socat configuration to use UDP6-LISTEN for IPv6 tests
- Adding wait_for_port() wrapper to handle protocol-specific port waiting
- Expanding test matrix to run both basic and extended formats against
both IPv4 and IPv6 protocols
- Improving cleanup to kill any remaining socat processes
- Adding sleep delays for better IPv6 packet handling reliability
The test now validates netconsole functionality across both IP versions,
improving test coverage for dual-stack network environments.
This test would avoid the regression fixed by commit f599020702 ("net:
netpoll: Initialize UDP checksum field before checksumming")
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-7-13cf3db24e2b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move UDP header construction from netpoll_send_udp() into a new
static helper function push_udp(). This completes the protocol
layer refactoring by:
1. Creating a dedicated helper for UDP header assembly
2. Removing UDP-specific logic from the main send function
3. Establishing a consistent pattern with existing IPv4/IPv6 helpers:
- push_udp()
- push_ipv4()
- push_ipv6()
The change improves code organization and maintains the encapsulation
pattern established in previous refactorings.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-5-13cf3db24e2b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move IPv4 header construction from netpoll_send_udp() into a new
static helper function push_ipv4(). This completes the refactoring
started with IPv6 header handling, creating symmetric helper functions
for both IP versions.
Changes include:
1. Extracting IPv4 header setup logic into push_ipv4()
2. Replacing inline IPv4 code with helper call
3. Moving eth assignment after helper calls for consistency
The refactoring reduces code duplication and improves maintainability
by isolating IP version-specific logic.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-4-13cf3db24e2b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace pointer-dereference sizeof() operations with explicit struct names
for improved readability and maintainability. This change:
1. Replaces `sizeof(*udph)` with `sizeof(struct udphdr)`
2. Replaces `sizeof(*ip6h)` with `sizeof(struct ipv6hdr)`
3. Replaces `sizeof(*iph)` with `sizeof(struct iphdr)`
This will make it easy to move code in the upcoming patches.
No functional changes are introduced by this patch.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-1-13cf3db24e2b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Golle says:
====================
net: ethernet: mtk_eth_soc: improve device tree handling
This series further improves the mtk_eth_soc driver in preparation to
complete upstream support for the MediaTek MT7988 SoC family.
Frank Wunderlich's previous attempt to have the ethernet node included
in mt7988a.dtsi and cover support for MT7988 in the device tree bindings
was criticized for the way mtk_eth_soc references SRAM in device tree[1].
Having a 2nd 'reg' property, like introduced by commit ebb1e4f9cf
("net: ethernet: mtk_eth_soc: add support for in-SoC SRAM") isn't
acceptable and a dedicated "mmio-sram" node should be used instead.
In order to make the code more clean and readable, the existing
hardcoded offsets for the scratch ring, RX and TX rings are dropped in
favor of using the generic allocator. However, support for the hardcoded
offset of the SRAM itself being included as part of the Ethernet's "reg"
MMIO space is kept as it will still be required in order to support
existing legacy device trees of the MT7986 SoC family.
While at it also replace confusing error messages when using legacy
device trees without "interrupt-names" with a warning informing users
that they are using a legacy device tree.
[1]: https://patchwork.ozlabs.org/comment/3533543/
====================
Link: https://patch.msgid.link/cover.1751461762.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>