Add a way to share an ifq from a src ring that is real (i.e. bound to a
HW RX queue) with other rings. This is done by passing a new flag
IORING_ZCRX_IFQ_REG_IMPORT in the registration struct
io_uring_zcrx_ifq_reg, alongside the fd of an exported zcrx ifq.
Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper io_fill_zcrx_offsets() that sets the constant offsets in
struct io_uring_zcrx_offsets returned to userspace.
Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add an option to wrap a zcrx instance into a file and expose it to the
user space. Currently, users can't do anything meaningful with the file,
but it'll be used in a next patch to import it into another io_uring
instance. It's implemented as a new op called ZCRX_CTRL_EXPORT for the
IORING_REGISTER_ZCRX_CTRL registration opcode.
Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for adding zcrx ifq exporting and importing, move
io_zcrx_scrub() and its dependencies up the file to be closer to
io_close_queue().
Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
zcrx tries to detach ifq / terminate page pools when the io_uring ctx
owning it is being destroyed. There will be multiple io_uring instances
attached to it in the future, so add a separate counter to track the
users. Note, refs can't be reused for this purpose as it only used to
prevent zcrx and rings destruction, and also used by page pools to keep
it alive.
Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add an zcrx interface via IORING_REGISTER_ZCRX_CTRL that forces the
kernel to flush / consume entries from the refill queue. Just as with
the IORING_REGISTER_ZCRX_REFILL attempt, the motivation is to address
cases where the refill queue becomes full, and the user can't return
buffers and needs to stash them. It's still a slow path, and the user
should size refill queue appropriately, but it should be helpful for
handling temporary traffic spikes and other unpredictable conditions.
The interface is simpler comparing to ZCRX_REFILL as it doesn't need
temporary refill entry arrays and gives natural batching, whereas
ZCRX_REFILL requires even more user logic to be somewhat efficient.
Also, add a structure for the operation. It's not currently used but
can serve for future improvements like limiting the number of buffers to
process, etc.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It'll be annoying and take enough of boilerplate code to implement
new zcrx features as separate io_uring register opcode. Introduce
IORING_REGISTER_ZCRX_CTRL that will multiplex such calls to zcrx.
Note, there are no real users of the opcode in this patch.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
folio_nr_pages() is a faster helper function to get the number of pages when
NR_PAGES_IN_LARGE_FOLIO is enabled.
Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert zcrx to struct netmem_desc, and use struct net_iov::desc to
access its fields instead of struct net_iov inner union alises.
zcrx only directly reads niov->pp, so with this patch it doesn't depend
on the union anymore.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Byungchul Park <byungchul@sk.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge 6.18-rc io_uring fixes, as certain coming changes depend on some
of these.
* io_uring-6.18:
io_uring/rsrc: don't use blk_rq_nr_phys_segments() as number of bvecs
io_uring/query: return number of available queries
io_uring/rw: ensure allocated iovec gets cleared for early failure
io_uring: fix regbuf vector size truncation
io_uring: fix types for region size calulation
io_uring/zcrx: remove sync refill uapi
io_uring: fix buffer auto-commit for multishot uring_cmd
io_uring: correct __must_hold annotation in io_install_fixed_file
io_uring zcrx: add MAINTAINERS entry
io_uring: Fix code indentation error
io_uring/sqpoll: be smarter on when to update the stime usage
io_uring/sqpoll: switch away from getrusage() for CPU accounting
io_uring: fix incorrect unlikely() usage in io_waitid_prep()
Signed-off-by: Jens Axboe <axboe@kernel.dk>
netdev ops must be called under instance lock or rtnl_lock, but
io_register_zcrx_ifq() isn't doing this for netdev_queue_get_dma_dev().
Fix this by taking the instance lock using netdev_get_by_index_lock().
Extended the instance lock section to include attaching a memory
provider. Could not move io_zcrx_create_area() outside, since the dmabuf
codepath IORING_ZCRX_AREA_DMABUF requires ifq->dev.
Fixes: 59b8b32ac8 ("io_uring/zcrx: add support for custom DMA devices")
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a refcount to struct io_zcrx_ifq to reverse the refcounting
relationship i.e. rings now reference ifqs instead. As a result of this,
remove ctx->refs that an ifq holds on a ring via the page pool memory
provider.
This ref ifq->refs is held by internal users of an ifq, namely rings and
the page pool memory provider associated with an ifq. This is needed to
keep the ifq around until the page pool is destroyed.
Since ifqs now no longer hold refs to ring ctx, there isn't a need to
split the cleanup of ifqs into two: io_shutdown_zcrx_ifqs() in
io_ring_exit_work() while waiting for ctx->refs to drop to 0, and
io_unregister_zcrx_ifqs() after. Remove io_shutdown_zcrx_ifqs().
Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for removing the ref on ctx->refs held by an ifq and
removing io_shutdown_zcrx_ifqs(), move io_unregister_zcrx_ifqs() down
such that it can call io_zcrx_scrub().
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for removing ifq->ctx and making ifq lifetime independent
of ring ctx, add user_struct and mm_struct to io_zcrx_ifq.
In the ifq cleanup path, these are the only fields used from the main
ring ctx to do accounting. Taking a copy in the ifq allows ifq->ctx to
be removed later, including the ctx->refs held by the ifq.
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add io_zcrx_ifq arg to io_zcrx_free_area(). A QOL change to reduce line
widths.
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Refactor io_{un}account_mem() to take user_struct and mm_struct
directly, instead of accessing it from the ring ctx.
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Refactor io_free_region() to take user_struct directly, instead of
accessing it from the ring ctx.
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a better way to handle the problem IORING_REGISTER_ZCRX_REFILL
solves. The uapi can also be slightly adjusted to accommodate future
extensions. Remove the feature for now, it'll be reworked for the next
release.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't forget to adjust the source offset in io_copy_page(), otherwise
it'll be copying into the same location in some cases for highmem
setups.
Fixes: e67645bb7f ("io_uring/zcrx: prepare fallback for larger pages")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's reported that sometimes a zcrx request can receive more than was
requested. It's caused by io_zcrx_recv_skb() adjusting desc->count for
all received buffers including frag lists, but then doing recursive
calls to process frag list skbs, which leads to desc->count double
accounting and underflow.
Reported-and-tested-by: Matthias Jasny <matthiasjasny@gmail.com>
Fixes: 6699ec9a23 ("io_uring/zcrx: add a read limit to recvzc requests")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull MM updates from Andrew Morton:
- "mm, swap: improve cluster scan strategy" from Kairui Song improves
performance and reduces the failure rate of swap cluster allocation
- "support large align and nid in Rust allocators" from Vitaly Wool
permits Rust allocators to set NUMA node and large alignment when
perforning slub and vmalloc reallocs
- "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
DAMOS_STAT's handling of the DAMON operations sets for virtual
address spaces for ops-level DAMOS filters
- "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
Baghdasaryan reduces mmap_lock contention during reads of
/proc/pid/maps
- "mm/mincore: minor clean up for swap cache checking" from Kairui Song
performs some cleanup in the swap code
- "mm: vm_normal_page*() improvements" from David Hildenbrand provides
code cleanup in the pagemap code
- "add persistent huge zero folio support" from Pankaj Raghav provides
a block layer speedup by optionalls making the
huge_zero_pagepersistent, instead of releasing it when its refcount
falls to zero
- "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
the recently added Kexec Handover feature
- "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
Stoakes turns mm_struct.flags into a bitmap. To end the constant
struggle with space shortage on 32-bit conflicting with 64-bit's
needs
- "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
code
- "selftests/mm: Fix false positives and skip unsupported tests" from
Donet Tom fixes a few things in our selftests code
- "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
from David Hildenbrand "allows individual processes to opt-out of
THP=always into THP=madvise, without affecting other workloads on the
system".
It's a long story - the [1/N] changelog spells out the considerations
- "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
the memdesc project. Please see
https://kernelnewbies.org/MatthewWilcox/Memdescs and
https://blogs.oracle.com/linux/post/introducing-memdesc
- "Tiny optimization for large read operations" from Chi Zhiling
improves the efficiency of the pagecache read path
- "Better split_huge_page_test result check" from Zi Yan improves our
folio splitting selftest code
- "test that rmap behaves as expected" from Wei Yang adds some rmap
selftests
- "remove write_cache_pages()" from Christoph Hellwig removes that
function and converts its two remaining callers
- "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
selftests issues
- "introduce kernel file mapped folios" from Boris Burkov introduces
the concept of "kernel file pages". Using these permits btrfs to
account its metadata pages to the root cgroup, rather than to the
cgroups of random inappropriate tasks
- "mm/pageblock: improve readability of some pageblock handling" from
Wei Yang provides some readability improvements to the page allocator
code
- "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
to understand arm32 highmem
- "tools: testing: Use existing atomic.h for vma/maple tests" from
Brendan Jackman performs some code cleanups and deduplication under
tools/testing/
- "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
a couple of 32-bit issues in tools/testing/radix-tree.c
- "kasan: unify kasan_enabled() and remove arch-specific
implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
initialization code into a common arch-neutral implementation
- "mm: remove zpool" from Johannes Weiner removes zspool - an
indirection layer which now only redirects to a single thing
(zsmalloc)
- "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
couple of cleanups in the fork code
- "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
adjustments at various nth_page() callsites, eventually permitting
the removal of that undesirable helper function
- "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
creates a KASAN read-only mode for ARM, using that architecture's
memory tagging feature. It is felt that a read-only mode KASAN is
suitable for use in production systems rather than debug-only
- "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
some tidying in the hugetlb folio allocation code
- "mm: establish const-correctness for pointer parameters" from Max
Kellermann makes quite a number of the MM API functions more accurate
about the constness of their arguments. This was getting in the way
of subsystems (in this case CEPH) when they attempt to improving
their own const/non-const accuracy
- "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
code sites which were confused over when to use free_pages() vs
__free_pages()
- "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
mapletree code accessible to Rust. Required by nouveau and by its
forthcoming successor: the new Rust Nova driver
- "selftests/mm: split_huge_page_test: split_pte_mapped_thp
improvements" from David Hildenbrand adds a fix and some cleanups to
the thp selftesting code
- "mm, swap: introduce swap table as swap cache (phase I)" from Chris
Li and Kairui Song is the first step along the path to implementing
"swap tables" - a new approach to swap allocation and state tracking
which is expected to yield speed and space improvements. This
patchset itself yields a 5-20% performance benefit in some situations
- "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
layer to clean up the ptdesc code a little
- "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
issues in our 5-level pagetable selftesting code
- "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
addresses a couple of minor issues in relatively new memory
allocation profiling feature
- "Small cleanups" from Matthew Wilcox has a few cleanups in
preparation for more memdesc work
- "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
Quanmin Yan makes some changes to DAMON in furtherance of supporting
arm highmem
- "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
Anjum adds that compiler check to selftests code and fixes the
fallout, by removing dead code
- "Improvements to Victim Process Thawing and OOM Reaper Traversal
Order" from zhongjinji makes a number of improvements in the OOM
killer: mainly thawing a more appropriate group of victim threads so
they can release resources
- "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
is a bunch of small and unrelated fixups for DAMON
- "mm/damon: define and use DAMON initialization check function" from
SeongJae Park implement reliability and maintainability improvements
to a recently-added bug fix
- "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
SeongJae Park provides additional transparency to userspace clients
of the DAMON_STAT information
- "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
some constraints on khubepaged's collapsing of anon VMAs. It also
increases the success rate of MADV_COLLAPSE against an anon vma
- "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
from Lorenzo Stoakes moves us further towards removal of
file_operations.mmap(). This patchset concentrates upon clearing up
the treatment of stacked filesystems
- "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
provides some fixes and improvements to mlock's tracking of large
folios. /proc/meminfo's "Mlocked" field became more accurate
- "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
Donet Tom fixes several user-visible KSM stats inaccuracies across
forks and adds selftest code to verify these counters
- "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
some potential but presently benign issues in KSM's mm_slot handling
* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
mm: swap: check for stable address space before operating on the VMA
mm: convert folio_page() back to a macro
mm/khugepaged: use start_addr/addr for improved readability
hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
alloc_tag: fix boot failure due to NULL pointer dereference
mm: silence data-race in update_hiwater_rss
mm/memory-failure: don't select MEMORY_ISOLATION
mm/khugepaged: remove definition of struct khugepaged_mm_slot
mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
hugetlb: increase number of reserving hugepages via cmdline
selftests/mm: add fork inheritance test for ksm_merging_pages counter
mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
drivers/base/node: fix double free in register_one_node()
mm: remove PMD alignment constraint in execmem_vmalloc()
mm/memory_hotplug: fix typo 'esecially' -> 'especially'
mm/rmap: improve mlock tracking for large folios
mm/filemap: map entire large folio faultaround
mm/fault: try to map the entire file folio in finish_fault()
mm/rmap: mlock large folios in try_to_unmap_one()
mm/rmap: fix a mlock race condition in folio_referenced_one()
...
Pull networking updates from Paolo Abeni:
"Core & protocols:
- Improve drop account scalability on NUMA hosts for RAW and UDP
sockets and the backlog, almost doubling the Pps capacity under DoS
- Optimize the UDP RX performance under stress, reducing contention,
revisiting the binary layout of the involved data structs and
implementing NUMA-aware locking. This improves UDP RX performance
by an additional 50%, even more under extreme conditions
- Add support for PSP encryption of TCP connections; this mechanism
has some similarities with IPsec and TLS, but offers superior HW
offloads capabilities
- Ongoing work to support Accurate ECN for TCP. AccECN allows more
than one congestion notification signal per RTT and is a building
block for Low Latency, Low Loss, and Scalable Throughput (L4S)
- Reorganize the TCP socket binary layout for data locality, reducing
the number of touched cachelines in the fastpath
- Refactor skb deferral free to better scale on large multi-NUMA
hosts, this improves TCP and UDP RX performances significantly on
such HW
- Increase the default socket memory buffer limits from 256K to 4M to
better fit modern link speeds
- Improve handling of setups with a large number of nexthop, making
dump operating scaling linearly and avoiding unneeded
synchronize_rcu() on delete
- Improve bridge handling of VLAN FDB, storing a single entry per
bridge instead of one entry per port; this makes the dump order of
magnitude faster on large switches
- Restore IP ID correctly for encapsulated packets at GSO
segmentation time, allowing GRO to merge packets in more scenarios
- Improve netfilter matching performance on large sets
- Improve MPTCP receive path performance by leveraging recently
introduced core infrastructure (skb deferral free) and adopting
recent TCP autotuning changes
- Allow bridges to redirect to a backup port when the bridge port is
administratively down
- Introduce MPTCP 'laminar' endpoint that con be used only once per
connection and simplify common MPTCP setups
- Add RCU safety to dst->dev, closing a lot of possible races
- A significant crypto library API for SCTP, MPTCP and IPv6 SR,
reducing code duplication
- Supports pulling data from an skb frag into the linear area of an
XDP buffer
Things we sprinkled into general kernel code:
- Generate netlink documentation from YAML using an integrated YAML
parser
Driver API:
- Support using IPv6 Flow Label in Rx hash computation and RSS queue
selection
- Introduce API for fetching the DMA device for a given queue,
allowing TCP zerocopy RX on more H/W setups
- Make XDP helpers compatible with unreadable memory, allowing more
easily building DevMem-enabled drivers with a unified XDP/skbs
datapath
- Add a new dedicated ethtool callback enabling drivers to provide
the number of RX rings directly, improving efficiency and clarity
in RX ring queries and RSS configuration
- Introduce a burst period for the health reporter, allowing better
handling of multiple errors due to the same root cause
- Support for DPLL phase offset exponential moving average,
controlling the average smoothing factor
Device drivers:
- Add a new Huawei driver for 3rd gen NIC (hinic3)
- Add a new SpacemiT driver for K1 ethernet MAC
- Add a generic abstraction for shared memory communication
devices (dibps)
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- Use multiple per-queue doorbell, to avoid MMIO contention
issues
- support adjacent functions, allowing them to delegate their
SR-IOV VFs to sibling PFs
- support RSS for IPSec offload
- support exposing raw cycle counters in PTP and mlx5
- support for disabling host PFs.
- Intel (100G, ice, idpf):
- ice: support for SRIOV VFs over an Active-Active link
aggregate
- ice: support for firmware logging via debugfs
- ice: support for Earliest TxTime First (ETF) hardware offload
- idpf: support basic XDP functionalities and XSk
- Broadcom (bnxt):
- support Hyper-V VF ID
- dynamic SRIOV resource allocations for RoCE
- Meta (fbnic):
- support queue API, zero-copy Rx and Tx
- support basic XDP functionalities
- devlink health support for FW crashes and OTP mem corruptions
- expand hardware stats coverage to FEC, PHY, and Pause
- Wangxun:
- support ethtool coalesce options
- support for multiple RSS contexts
- Ethernet virtual:
- Macsec:
- replace custom netlink attribute checks with policy-level
checks
- Bonding:
- support aggregator selection based on port priority
- Microsoft vNIC:
- use page pool fragments for RX buffers instead of full pages
to improve memory efficiency
- Ethernet NICs consumer, and embedded:
- Qualcomm: support Ethernet function for IPQ9574 SoC
- Airoha: implement wlan offloading via NPU
- Freescale
- enetc: add NETC timer PTP driver and add PTP support
- fec: enable the Jumbo frame support for i.MX8QM
- Renesas (R-Car S4):
- support HW offloading for layer 2 switching
- support for RZ/{T2H, N2H} SoCs
- Cadence (macb): support TAPRIO traffic scheduling
- TI:
- support for Gigabit ICSS ethernet SoC (icssm-prueth)
- Synopsys (stmmac): a lot of cleanups
- Ethernet PHYs:
- Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS
driver
- Support bcm63268 GPHY power control
- Support for Micrel lan8842 PHY and PTP
- Support for Aquantia AQR412 and AQR115
- CAN:
- a large CAN-XL preparation work
- reorganize raw_sock and uniqframe struct to minimize memory
usage
- rcar_canfd: update the CAN-FD handling
- WiFi:
- extended Neighbor Awareness Networking (NAN) support
- S1G channel representation cleanup
- improve S1G support
- WiFi drivers:
- Intel (iwlwifi):
- major refactor and cleanup
- Broadcom (brcm80211):
- support for AP isolation
- RealTek (rtw88/89) rtw88/89:
- preparation work for RTL8922DE support
- MediaTek (mt76):
- HW restart improvements
- MLO support
- Qualcomm/Atheros (ath10k):
- GTK rekey fixes
- Bluetooth drivers:
- btusb: support for several new IDs for MT7925
- btintel: support for BlazarIW core
- btintel_pcie: support for _suspend() / _resume()
- btintel_pcie: support for Scorpious, Panther Lake-H484 IDs"
* tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1536 commits)
net: stmmac: Add support for Allwinner A523 GMAC200
dt-bindings: net: sun8i-emac: Add A523 GMAC200 compatible
Revert "Documentation: net: add flow control guide and document ethtool API"
octeontx2-pf: fix bitmap leak
octeontx2-vf: fix bitmap leak
net/mlx5e: Use extack in set rxfh callback
net/mlx5e: Introduce mlx5e_rss_params for RSS configuration
net/mlx5e: Introduce mlx5e_rss_init_params
net/mlx5e: Remove unused mdev param from RSS indir init
net/mlx5: Improve QoS error messages with actual depth values
net/mlx5e: Prevent entering switchdev mode with inconsistent netns
net/mlx5: HWS, Generalize complex matchers
net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
selftests/net: add tcp_port_share to .gitignore
Revert "net/mlx5e: Update and set Xon/Xoff upon MTU set"
net: add NUMA awareness to skb_attempt_defer_free()
net: use llist for sd->defer_list
net: make softnet_data.defer_count an atomic
selftests: drv-net: psp: add tests for destroying devices
selftests: drv-net: psp: add test for auto-adjusting TCP MSS
...
Within a folio/compound page, nth_page() is no longer required. Given
that we call folio_test_partial_kmap()+kmap_local_page(), the code would
already be problematic if the pages would span multiple folios.
So let's just assume that all src pages belong to a single folio/compound
page and can be iterated ordinarily. The dst page is currently always a
single page, so we're not actually iterating anything.
Link: https://lkml.kernel.org/r/20250901150359.867252-21-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
net_iov / freelist / etc. arrays can be quite long, make sure they're
accounted.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Returning buffers via a ring is performant and convenient, but it
becomes a problem when/if the user misconfigured the ring size and it
becomes full. Add a synchronous way to return buffers back to the page
pool via a new register opcode. It's supposed to be a reliable slow
path for refilling.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper for verifying a rqe and extracting a niov out of it. It'll
be reused in following patches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The cache should be empty when io_pp_zc_alloc_netmems() is called,
that's promised by page pool and further checked, so there is no need to
recalculate the available space in io_zcrx_ring_refill().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use guards for rq_lock in io_zcrx_ring_refill(), makes it a tad simpler.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reduce the scope of a local var netmem in io_zcrx_ring_refill.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove ifq->lock and reuse pp_lock to protect the netdev pointer.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for reusing the lock for other purposes, rename it to
"pp_lock". As before, it can be taken deeper inside the networking stack
by page pool, and so the syscall io_uring must avoid holding it while
doing queue reconfiguration or anything that can result in immediate pp
init/destruction.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of using PAGE_SIZE for the niov size add a niov_shift field to
ifq, and patch up all important places. Copy fallback still assumes
PAGE_SIZE, so it'll be wasting some memory for now.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set struct io_zcrx_mem::sgt for umem areas as well to simplify looking
up the current sg table.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It was removed from uapi, so now it's always 0 and can be removed
together with offset handling in io_populate_area_dma().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With a common type for storing dma addresses and io_populate_area_dma(),
type-specific area mapping helpers are trivial, so open code them and
deduplicate the call to io_populate_area_dma().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_zcrx_copy_chunk() doesn't and shouldn't care from which area the
buffer is allocated, don't try to resolve the area in it but pass the
ifq to io_zcrx_alloc_fallback() and let it handle it. Also rename it for
more clarity.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a warning if io_populate_area_dma() can't fill in all net_iovs, it
should never happen.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_import_area() is responsible for importing memory and parsing
io_uring_zcrx_area_reg, so move all area reg structure checks into the
function.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't pass a pointer to a pointer where an area should be stored to
io_zcrx_create_area(), and let it handle finding the right place for a
new area. It's more straightforward and will be needed to support
multiple areas.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
page_pool_unref_and_test() tries to better follow usuall refcount
semantics, use it instead of page_pool_unref_netmem().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
memchr_inv() is more ambiguous than mem_is_zero(), so use the latter
for zero checks.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Refill queue entries are 16B structures, but because of the ring header
placement, they're 8B aligned but not naturally / 16B aligned, which
means some of them span across 2 cache lines. Push rqes to a new cache
line.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
zcrx currently requires the ring to be set up with fixed 32b CQEs,
allow it to use IORING_SETUP_CQE_MIXED as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull io_uring updates from Jens Axboe:
- Optimization to avoid reference counts on non-cloned registered
buffers. This is how these buffers were handled prior to having
cloning support, and we can still use that approach as long as the
buffers haven't been cloned to another ring.
- Cleanup and improvement for uring_cmd, where btrfs was the only user
of storing allocated data for the lifetime of the uring_cmd. Clean
that up so we can get rid of the need to do that.
- Avoid unnecessary memory copies in uring_cmd usage. This is
particularly important as a lot of uring_cmd usage necessitates the
use of 128b SQEs.
- A few updates for recv multishot, where it's now possible to add
fairness limits for limiting how much is transferred for each retry
loop. Additionally, recv multishot now supports an overall cap as
well, where once reached the multishot recv will terminate. The
latter is useful for buffer management and juggling many recv streams
at the same time.
- Add support for returning the TX timestamps via a new socket command.
This feature can work in either singleshot or multishot mode, where
the latter triggers a completion whenever new timestamps are
available. This is an alternative to using the existing error queue.
- Add support for an io_uring "mock" file, which is the start of being
able to do 100% targeted testing in terms of exercising io_uring
request handling. The idea is to have a file type that can be
anything the tester would like, and behave exactly how you want it to
behave in terms of hitting the code paths you want.
- Improve zcrx by using sgtables to de-duplicate and improve dma
address handling.
- Prep work for supporting larger pages for zcrx.
- Various little improvements and fixes.
* tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux: (42 commits)
io_uring/zcrx: fix leaking pages on sg init fail
io_uring/zcrx: don't leak pages on account failure
io_uring/zcrx: fix null ifq on area destruction
io_uring: fix breakage in EXPERT menu
io_uring/cmd: remove struct io_uring_cmd_data
btrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmd
io_uring/cmd: introduce IORING_URING_CMD_REISSUE flag
io_uring/zcrx: account area memory
io_uring: export io_[un]account_mem
io_uring/net: Support multishot receive len cap
io_uring: deduplicate wakeup handling
io_uring/net: cast min_not_zero() type
io_uring/poll: cleanup apoll freeing
io_uring/net: allow multishot receive per-invocation cap
io_uring/net: move io_sr_msg->retry_flags to io_sr_msg->flags
io_uring/net: use passed in 'len' in io_recv_buf_select()
io_uring/zcrx: prepare fallback for larger pages
io_uring/zcrx: assert area type in io_zcrx_iov_page
io_uring/zcrx: allocate sgtable for umem areas
io_uring/zcrx: introduce io_populate_area_dma
...