CONFIG_MSHV_ROOT allows kernels built to run as a normal Hyper-V guest
to exclude the root partition code, which is expected to grow
significantly over time.
This option is a tristate so future driver code can be built as a
(m)odule, allowing faster development iteration cycles.
If CONFIG_MSHV_ROOT is disabled, don't compile hv_proc.c, and stub
hv_root_partition() to return false unconditionally. This allows the
compiler to optimize away root partition code blocks since they will
be disabled at compile time.
In the case of booting as root partition *without* CONFIG_MSHV_ROOT
enabled, print a critical error (the kernel will likely crash).
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1740167795-13296-4-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1740167795-13296-4-git-send-email-nunodasneves@linux.microsoft.com>
Linus observed that the symbol_request(utf8_data_table) call fails when
CONFIG_UNICODE=y and CONFIG_TRIM_UNUSED_KSYMS=y.
symbol_get() relies on the symbol data being present in the ksymtab for
symbol lookups. However, EXPORT_SYMBOL_GPL(utf8_data_table) is dropped
due to CONFIG_TRIM_UNUSED_KSYMS, as no module references it in this case.
Probably, this has been broken since commit dbacb0ef67 ("kconfig option
for TRIM_UNUSED_KSYMS").
This commit addresses the issue by leveraging modpost. Symbol names
passed to symbol_get() are recorded in the special .no_trim_symbol
section, which is then parsed by modpost to forcibly keep such symbols.
The .no_trim_symbol section is discarded by the linker scripts, so there
is no impact on the size of the final vmlinux or modules.
This commit cannot resolve the issue for direct calls to __symbol_get()
because the symbol name is not known at compile-time.
Although symbol_get() may eventually be deprecated, this workaround
should be good enough meanwhile.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Pull MM updates from Andrew Morton:
"The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes
the page allocator so we end up with the ability to allocate and
free zero-refcount pages. So that callers (ie, slab) can avoid a
refcount inc & dec
- "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
use large folios other than PMD-sized ones
- "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
and fixes for this small built-in kernel selftest
- "mas_anode_descend() related cleanup" from Wei Yang tidies up part
of the mapletree code
- "mm: fix format issues and param types" from Keren Sun implements a
few minor code cleanups
- "simplify split calculation" from Wei Yang provides a few fixes and
a test for the mapletree code
- "mm/vma: make more mmap logic userland testable" from Lorenzo
Stoakes continues the work of moving vma-related code into the
(relatively) new mm/vma.c
- "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
Hildenbrand cleans up and rationalizes handling of gfp flags in the
page allocator
- "readahead: Reintroduce fix for improper RA window sizing" from Jan
Kara is a second attempt at fixing a readahead window sizing issue.
It should reduce the amount of unnecessary reading
- "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
addresses an issue where "huge" amounts of pte pagetables are
accumulated:
https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
Qi's series addresses this windup by synchronously freeing PTE
memory within the context of madvise(MADV_DONTNEED)
- "selftest/mm: Remove warnings found by adding compiler flags" from
Muhammad Usama Anjum fixes some build warnings in the selftests
code when optional compiler warnings are enabled
- "mm: don't use __GFP_HARDWALL when migrating remote pages" from
David Hildenbrand tightens the allocator's observance of
__GFP_HARDWALL
- "pkeys kselftests improvements" from Kevin Brodsky implements
various fixes and cleanups in the MM selftests code, mainly
pertaining to the pkeys tests
- "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
estimate application working set size
- "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
provides some cleanups to memcg's hugetlb charging logic
- "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
removes the global swap cgroup lock. A speedup of 10% for a
tmpfs-based kernel build was demonstrated
- "zram: split page type read/write handling" from Sergey Senozhatsky
has several fixes and cleaups for zram in the area of
zram_write_page(). A watchdog softlockup warning was eliminated
- "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
Brodsky cleans up the pagetable destructor implementations. A rare
use-after-free race is fixed
- "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
simplifies and cleans up the debugging code in the VMA merging
logic
- "Account page tables at all levels" from Kevin Brodsky cleans up
and regularizes the pagetable ctor/dtor handling. This results in
improvements in accounting accuracy
- "mm/damon: replace most damon_callback usages in sysfs with new
core functions" from SeongJae Park cleans up and generalizes
DAMON's sysfs file interface logic
- "mm/damon: enable page level properties based monitoring" from
SeongJae Park increases the amount of information which is
presented in response to DAMOS actions
- "mm/damon: remove DAMON debugfs interface" from SeongJae Park
removes DAMON's long-deprecated debugfs interfaces. Thus the
migration to sysfs is completed
- "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
Peter Xu cleans up and generalizes the hugetlb reservation
accounting
- "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
removes a never-used feature of the alloc_pages_bulk() interface
- "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
extends DAMOS filters to support not only exclusion (rejecting),
but also inclusion (allowing) behavior
- "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
introduces a new memory descriptor for zswap.zpool that currently
overlaps with struct page for now. This is part of the effort to
reduce the size of struct page and to enable dynamic allocation of
memory descriptors
- "mm, swap: rework of swap allocator locks" from Kairui Song redoes
and simplifies the swap allocator locking. A speedup of 400% was
demonstrated for one workload. As was a 35% reduction for kernel
build time with swap-on-zram
- "mm: update mips to use do_mmap(), make mmap_region() internal"
from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
mmap_region() can be made MM-internal
- "mm/mglru: performance optimizations" from Yu Zhao fixes a few
MGLRU regressions and otherwise improves MGLRU performance
- "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
Park updates DAMON documentation
- "Cleanup for memfd_create()" from Isaac Manjarres does that thing
- "mm: hugetlb+THP folio and migration cleanups" from David
Hildenbrand provides various cleanups in the areas of hugetlb
folios, THP folios and migration
- "Uncached buffered IO" from Jens Axboe implements the new
RWF_DONTCACHE flag which provides synchronous dropbehind for
pagecache reading and writing. To permite userspace to address
issues with massive buildup of useless pagecache when
reading/writing fast devices
- "selftests/mm: virtual_address_range: Reduce memory" from Thomas
Weißschuh fixes and optimizes some of the MM selftests"
* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm/compaction: fix UBSAN shift-out-of-bounds warning
s390/mm: add missing ctor/dtor on page table upgrade
kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
tools: add VM_WARN_ON_VMG definition
mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
seqlock: add missing parameter documentation for raw_seqcount_try_begin()
mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
mm/page_alloc: remove the incorrect and misleading comment
zram: remove zcomp_stream_put() from write_incompressible_page()
mm: separate move/undo parts from migrate_pages_batch()
mm/kfence: use str_write_read() helper in get_access_type()
selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
selftests/mm: vm_util: split up /proc/self/smaps parsing
selftests/mm: virtual_address_range: unmap chunks after validation
selftests/mm: virtual_address_range: mmap() without PROT_WRITE
selftests/memfd/memfd_test: fix possible NULL pointer dereference
mm: add FGP_DONTCACHE folio creation flag
mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
...
Pull non-MM updates from Andrew Morton:
"Mainly individually changelogged singleton patches. The patch series
in this pull are:
- "lib min_heap: Improve min_heap safety, testing, and documentation"
from Kuan-Wei Chiu provides various tightenings to the min_heap
library code
- "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms
some cleanup and Rust preparation in the xarray library code
- "Update reference to include/asm-<arch>" from Geert Uytterhoeven
fixes pathnames in some code comments
- "Converge on using secs_to_jiffies()" from Easwar Hariharan uses
the new secs_to_jiffies() in various places where that is
appropriate
- "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
switches two filesystems to the new mount API
- "Convert ocfs2 to use folios" from Matthew Wilcox does that
- "Remove get_task_comm() and print task comm directly" from Yafang
Shao removes now-unneeded calls to get_task_comm() in various
places
- "squashfs: reduce memory usage and update docs" from Phillip
Lougher implements some memory savings in squashfs and performs
some maintainability work
- "lib: clarify comparison function requirements" from Kuan-Wei Chiu
tightens the sort code's behaviour and adds some maintenance work
- "nilfs2: protect busy buffer heads from being force-cleared" from
Ryusuke Konishi fixes an issues in nlifs when the fs is presented
with a corrupted image
- "nilfs2: fix kernel-doc comments for function return values" from
Ryusuke Konishi fixes some nilfs kerneldoc
- "nilfs2: fix issues with rename operations" from Ryusuke Konishi
addresses some nilfs BUG_ONs which syzbot was able to trigger
- "minmax.h: Cleanups and minor optimisations" from David Laight does
some maintenance work on the min/max library code
- "Fixes and cleanups to xarray" from Kemeng Shi does maintenance
work on the xarray library code"
* tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits)
ocfs2: use str_yes_no() and str_no_yes() helper functions
include/linux/lz4.h: add some missing macros
Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent
Xarray: remove repeat check in xas_squash_marks()
Xarray: distinguish large entries correctly in xas_split_alloc()
Xarray: move forward index correctly in xas_pause()
Xarray: do not return sibling entries from xas_find_marked()
ipc/util.c: complete the kernel-doc function descriptions
gcov: clang: use correct function param names
latencytop: use correct kernel-doc format for func params
minmax.h: remove some #defines that are only expanded once
minmax.h: simplify the variants of clamp()
minmax.h: move all the clamp() definitions after the min/max() ones
minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp()
minmax.h: reduce the #define expansion of min(), max() and clamp()
minmax.h: update some comments
minmax.h: add whitespace around operators and after commas
nilfs2: do not update mtime of renamed directory that is not moved
nilfs2: handle errors that nilfs_prepare_chunk() may return
CREDITS: fix spelling mistake
...
We already have a generic implementation of alloc/free up to P4D level, as
well as pgd_free(). Let's finish the work and add a generic PGD-level
alloc helper as well.
Unlike at lower levels, almost all architectures need some specific magic
at PGD level (typically initialising PGD entries), so introducing a
generic pgd_alloc() isn't worth it. Instead we introduce two new helpers,
__pgd_alloc() and __pgd_free(), and make use of them in the arch-specific
pgd_alloc() and pgd_free() wherever possible. To accommodate as many arch
as possible, __pgd_alloc() takes a page allocation order.
Because pagetable_alloc() allocates zeroed pages, explicit zeroing in
pgd_alloc() becomes redundant and we can get rid of it. Some trivial
implementations of pgd_free() also become unnecessary once __pgd_alloc()
is used; remove them.
Another small improvement is consistent accounting of PGD pages by using
GFP_PGTABLE_{USER,KERNEL} as appropriate.
Not all PGD allocations can be handled by the generic helpers. In
particular, multiple architectures allocate PGDs from a kmem_cache, and
those PGDs may not be page-sized.
Link: https://lkml.kernel.org/r/20250103184415.2744423-6-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull hyperv updates from Wei Liu:
- Introduce a new set of Hyper-V headers in include/hyperv and replace
the old hyperv-tlfs.h with the new headers (Nuno Das Neves)
- Fixes for the Hyper-V VTL mode (Roman Kisel)
- Fixes for cpu mask usage in Hyper-V code (Michael Kelley)
- Document the guest VM hibernation behaviour (Michael Kelley)
- Miscellaneous fixes and cleanups (Jacob Pan, John Starks, Naman Jain)
* tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
Documentation: hyperv: Add overview of guest VM hibernation
hyperv: Do not overlap the hvcall IO areas in hv_vtl_apicid_to_vp_id()
hyperv: Do not overlap the hvcall IO areas in get_vtl()
hyperv: Enable the hypercall output page for the VTL mode
hv_balloon: Fallback to generic_online_page() for non-HV hot added mem
Drivers: hv: vmbus: Log on missing offers if any
Drivers: hv: vmbus: Wait for boot-time offers during boot and resume
uio_hv_generic: Add a check for HV_NIC for send, receive buffers setup
iommu/hyper-v: Don't assume cpu_possible_mask is dense
Drivers: hv: Don't assume cpu_possible_mask is dense
x86/hyperv: Don't assume cpu_possible_mask is dense
hyperv: Remove the now unused hyperv-tlfs.h files
hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h
hyperv: Add new Hyper-V headers in include/hyperv
hyperv: Clean up unnecessary #includes
hyperv: Move hv_connection_id to hyperv-tlfs.h
Pull timer fixes from Borislav Petkov:
- Fix a case where posix timers with a thread-group-wide target would
miss signals if some of the group's threads are exiting
- Fix a hang caused by ndelay() calling the wrong delay function
__udelay()
- Fix a wrong offset calculation in adjtimex(2) when using ADJ_MICRO
(microsecond resolution) and a negative offset
* tag 'timers_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
posix-timers: Target group sigqueue to current task only if not exiting
delay: Fix ndelay() spuriously treated as udelay()
ntp: Remove invalid cast in time offset math
Pull Kbuild updates from Masahiro Yamada:
- Add generic support for built-in boot DTB files
- Enable TAB cycling for dialog buttons in nconfig
- Fix issues in streamline_config.pl
- Refactor Kconfig
- Add support for Clang's AutoFDO (Automatic Feedback-Directed
Optimization)
- Add support for Clang's Propeller, a profile-guided optimization.
- Change the working directory to the external module directory for M=
builds
- Support building external modules in a separate output directory
- Enable objtool for *.mod.o and additional kernel objects
- Use lz4 instead of deprecated lz4c
- Work around a performance issue with "git describe"
- Refactor modpost
* tag 'kbuild-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (85 commits)
kbuild: rename .tmp_vmlinux.kallsyms0.syms to .tmp_vmlinux0.syms
gitignore: Don't ignore 'tags' directory
kbuild: add dependency from vmlinux to resolve_btfids
modpost: replace tdb_hash() with hash_str()
kbuild: deb-pkg: add python3:native to build dependency
genksyms: reduce indentation in export_symbol()
modpost: improve error messages in device_id_check()
modpost: rename alias symbol for MODULE_DEVICE_TABLE()
modpost: rename variables in handle_moddevtable()
modpost: move strstarts() to modpost.h
modpost: convert do_usb_table() to a generic handler
modpost: convert do_of_table() to a generic handler
modpost: convert do_pnp_device_entry() to a generic handler
modpost: convert do_pnp_card_entries() to a generic handler
modpost: call module_alias_printf() from all do_*_entry() functions
modpost: pass (struct module *) to do_*_entry() functions
modpost: remove DEF_FIELD_ADDR_VAR() macro
modpost: deduplicate MODULE_ALIAS() for all drivers
modpost: introduce module_alias_printf() helper
modpost: remove unnecessary check in do_acpi_entry()
...
Pull RISC-v updates from Palmer Dabbelt:
- Support for pointer masking in userspace
- Support for probing vector misaligned access performance
- Support for qspinlock on systems with Zacas and Zabha
* tag 'riscv-for-linus-6.13-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (38 commits)
RISC-V: Remove unnecessary include from compat.h
riscv: Fix default misaligned access trap
riscv: Add qspinlock support
dt-bindings: riscv: Add Ziccrse ISA extension description
riscv: Add ISA extension parsing for Ziccrse
asm-generic: ticket-lock: Add separate ticket-lock.h
asm-generic: ticket-lock: Reuse arch_spinlock_t of qspinlock
riscv: Implement xchg8/16() using Zabha
riscv: Implement arch_cmpxchg128() using Zacas
riscv: Improve zacas fully-ordered cmpxchg()
riscv: Implement cmpxchg8/16() using Zabha
dt-bindings: riscv: Add Zabha ISA extension description
riscv: Implement cmpxchg32/64() using Zacas
riscv: Do not fail to build on byte/halfword operations with Zawrs
riscv: Move cpufeature.h macros into their own header
KVM: riscv: selftests: Add Smnpm and Ssnpm to get-reg-list test
RISC-V: KVM: Allow Smnpm and Ssnpm extensions for guests
riscv: hwprobe: Export the Supm ISA extension
riscv: selftests: Add a pointer masking test
riscv: Allow ptrace control of the tagged address ABI
...
Commit b1fca27d38 ("kernel debug: support resetting WARN*_ONCE")
added support for clearing the state of once warnings. However,
it is not functional when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION or
CONFIG_LTO_CLANG is enabled, because .data.once matches the
.data.[0-9a-zA-Z_]* pattern in the DATA_MAIN macro.
Commit cb87481ee8 ("kbuild: linker script do not match C names unless
LD_DEAD_CODE_DATA_ELIMINATION is configured") was introduced to suppress
the issue for the default CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=n case,
providing a minimal fix for stable backporting. We were aware this did
not address the issue for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y. The
plan was to apply correct fixes and then revert cb87481ee8. [1]
Seven years have passed since then, yet the #ifdef workaround remains in
place. Meanwhile, commit b1fca27d38 introduced the .data.once section,
and commit dc5723b02e ("kbuild: add support for Clang LTO") extended
the #ifdef.
Using a ".." separator in the section name fixes the issue for
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION and CONFIG_LTO_CLANG.
[1]: https://lore.kernel.org/linux-kbuild/CAK7LNASck6BfdLnESxXUeECYL26yUDm0cwRZuM4gmaWUkxjL5g@mail.gmail.com/
Fixes: b1fca27d38 ("kernel debug: support resetting WARN*_ONCE")
Fixes: dc5723b02e ("kbuild: add support for Clang LTO")
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Commit 7ccaba5314 ("consolidate WARN_...ONCE() static variables")
was intended to collect all .data.unlikely sections into one chunk.
However, this has not worked when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
or CONFIG_LTO_CLANG is enabled, because .data.unlikely matches the
.data.[0-9a-zA-Z_]* pattern in the DATA_MAIN macro.
Commit cb87481ee8 ("kbuild: linker script do not match C names unless
LD_DEAD_CODE_DATA_ELIMINATION is configured") was introduced to suppress
the issue for the default CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=n case,
providing a minimal fix for stable backporting. We were aware this did
not address the issue for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y. The
plan was to apply correct fixes and then revert cb87481ee8. [1]
Seven years have passed since then, yet the #ifdef workaround remains in
place.
Using a ".." separator in the section name fixes the issue for
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION and CONFIG_LTO_CLANG.
[1]: https://lore.kernel.org/linux-kbuild/CAK7LNASck6BfdLnESxXUeECYL26yUDm0cwRZuM4gmaWUkxjL5g@mail.gmail.com/
Fixes: cb87481ee8 ("kbuild: linker script do not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured")
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Add the build support for using Clang's Propeller optimizer. Like
AutoFDO, Propeller uses hardware sampling to gather information
about the frequency of execution of different code paths within a
binary. This information is then used to guide the compiler's
optimization decisions, resulting in a more efficient binary.
The support requires a Clang compiler LLVM 19 or later, and the
create_llvm_prof tool
(https://github.com/google/autofdo/releases/tag/v0.30.1). This
commit is limited to x86 platforms that support PMU features
like LBR on Intel machines and AMD Zen3 BRS.
Here is an example workflow for building an AutoFDO+Propeller
optimized kernel:
1) Build the kernel on the host machine, with AutoFDO and Propeller
build config
CONFIG_AUTOFDO_CLANG=y
CONFIG_PROPELLER_CLANG=y
then
$ make LLVM=1 CLANG_AUTOFDO_PROFILE=<autofdo_profile>
“<autofdo_profile>” is the profile collected when doing a non-Propeller
AutoFDO build. This step builds a kernel that has the same optimization
level as AutoFDO, plus a metadata section that records basic block
information. This kernel image runs as fast as an AutoFDO optimized
kernel.
2) Install the kernel on test/production machines.
3) Run the load tests. The '-c' option in perf specifies the sample
event period. We suggest using a suitable prime number,
like 500009, for this purpose.
For Intel platforms:
$ perf record -e BR_INST_RETIRED.NEAR_TAKEN:k -a -N -b -c <count> \
-o <perf_file> -- <loadtest>
For AMD platforms:
The supported system are: Zen3 with BRS, or Zen4 with amd_lbr_v2
# To see if Zen3 support LBR:
$ cat proc/cpuinfo | grep " brs"
# To see if Zen4 support LBR:
$ cat proc/cpuinfo | grep amd_lbr_v2
# If the result is yes, then collect the profile using:
$ perf record --pfm-events RETIRED_TAKEN_BRANCH_INSTRUCTIONS:k -a \
-N -b -c <count> -o <perf_file> -- <loadtest>
4) (Optional) Download the raw perf file to the host machine.
5) Generate Propeller profile:
$ create_llvm_prof --binary=<vmlinux> --profile=<perf_file> \
--format=propeller --propeller_output_module_name \
--out=<propeller_profile_prefix>_cc_profile.txt \
--propeller_symorder=<propeller_profile_prefix>_ld_profile.txt
“create_llvm_prof” is the profile conversion tool, and a prebuilt
binary for linux can be found on
https://github.com/google/autofdo/releases/tag/v0.30.1 (can also build
from source).
"<propeller_profile_prefix>" can be something like
"/home/user/dir/any_string".
This command generates a pair of Propeller profiles:
"<propeller_profile_prefix>_cc_profile.txt" and
"<propeller_profile_prefix>_ld_profile.txt".
6) Rebuild the kernel using the AutoFDO and Propeller profile files.
CONFIG_AUTOFDO_CLANG=y
CONFIG_PROPELLER_CLANG=y
and
$ make LLVM=1 CLANG_AUTOFDO_PROFILE=<autofdo_profile> \
CLANG_PROPELLER_PROFILE_PREFIX=<propeller_profile_prefix>
Co-developed-by: Han Shen <shenhan@google.com>
Signed-off-by: Han Shen <shenhan@google.com>
Signed-off-by: Rong Xu <xur@google.com>
Suggested-by: Sriraman Tallam <tmsriram@google.com>
Suggested-by: Krzysztof Pszeniczny <kpszeniczny@google.com>
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Suggested-by: Stephane Eranian <eranian@google.com>
Tested-by: Yonghong Song <yonghong.song@linux.dev>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Enable the machine function split optimization for AutoFDO in Clang.
Machine function split (MFS) is a pass in the Clang compiler that
splits a function into hot and cold parts. The linker groups all
cold blocks across functions together. This decreases hot code
fragmentation and improves iCache and iTLB utilization.
MFS requires a profile so this is enabled only for the AutoFDO builds.
Co-developed-by: Han Shen <shenhan@google.com>
Signed-off-by: Han Shen <shenhan@google.com>
Signed-off-by: Rong Xu <xur@google.com>
Suggested-by: Sriraman Tallam <tmsriram@google.com>
Suggested-by: Krzysztof Pszeniczny <kpszeniczny@google.com>
Tested-by: Yonghong Song <yonghong.song@linux.dev>
Tested-by: Yabin Cui <yabinc@google.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Enable -ffunction-sections by default for the AutoFDO build.
With -ffunction-sections, the compiler places each function in its own
section named .text.function_name instead of placing all functions in
the .text section. In the AutoFDO build, this allows the linker to
utilize profile information to reorganize functions for improved
utilization of iCache and iTLB.
Co-developed-by: Han Shen <shenhan@google.com>
Signed-off-by: Han Shen <shenhan@google.com>
Signed-off-by: Rong Xu <xur@google.com>
Suggested-by: Sriraman Tallam <tmsriram@google.com>
Tested-by: Yonghong Song <yonghong.song@linux.dev>
Tested-by: Yabin Cui <yabinc@google.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Add markers like __hot_text_start, __hot_text_end, __unlikely_text_start,
and __unlikely_text_end which will be included in System.map. These markers
indicate how the compiler groups functions, providing valuable information
to developers about the layout and optimization of the code.
Co-developed-by: Han Shen <shenhan@google.com>
Signed-off-by: Han Shen <shenhan@google.com>
Signed-off-by: Rong Xu <xur@google.com>
Suggested-by: Sriraman Tallam <tmsriram@google.com>
Tested-by: Yonghong Song <yonghong.song@linux.dev>
Tested-by: Yabin Cui <yabinc@google.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
When the -ffunction-sections compiler option is enabled, each function
is placed in a separate section named .text.function_name rather than
putting all functions in a single .text section.
However, using -function-sections can cause problems with the
linker script. The comments included in include/asm-generic/vmlinux.lds.h
note these issues.:
“TEXT_MAIN here will match .text.fixup and .text.unlikely if dead
code elimination is enabled, so these sections should be converted
to use ".." first.”
It is unclear whether there is a straightforward method for converting
a suffix to "..".
This patch modifies the order of subsections within the text output
section. Specifically, it changes current order:
.text.hot, .text, .text_unlikely, .text.unknown, .text.asan
to the new order:
.text.asan, .text.unknown, .text_unlikely, .text.hot, .text
Here is the rationale behind the new layout:
The majority of the code resides in three sections: .text.hot, .text,
and .text.unlikely, with .text.unknown containing a negligible amount.
.text.asan is only generated in ASAN builds.
The primary goal is to group code segments based on their execution
frequency (hotness).
First, we want to place .text.hot adjacent to .text. Since we cannot put
.text.hot after .text (Due to constraints with -ffunction-sections,
placing .text.hot after .text is problematic), we need to put
.text.hot before .text.
Then it comes to .text.unlikely, we cannot put it after .text (same
-ffunction-sections issue) . Therefore, we position .text.unlikely
before .text.hot.
.text.unknown and .tex.asan follow the same logic.
This revised ordering effectively reverses the original arrangement (for
.text.unlikely, .text.unknown, and .tex.asan), maintaining a similar level
of affinity between sections.
It also places .text.hot section at the beginning of a page to better
utilize the TLB entry.
Note that the limitation arises because the linker script employs glob
patterns instead of regular expressions for string matching. While there
is a method to maintain the current order using complex patterns, this
significantly complicates the pattern and increases the likelihood of
errors.
This patch also changes vmlinux.lds.S for the sparc64 architecture to
accommodate specific symbol placement requirements.
Co-developed-by: Han Shen <shenhan@google.com>
Signed-off-by: Han Shen <shenhan@google.com>
Signed-off-by: Rong Xu <xur@google.com>
Suggested-by: Sriraman Tallam <tmsriram@google.com>
Suggested-by: Krzysztof Pszeniczny <kpszeniczny@google.com>
Tested-by: Yonghong Song <yonghong.song@linux.dev>
Tested-by: Yabin Cui <yabinc@google.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Pull MM updates from Andrew Morton:
- The series "zram: optimal post-processing target selection" from
Sergey Senozhatsky improves zram's post-processing selection
algorithm. This leads to improved memory savings.
- Wei Yang has gone to town on the mapletree code, contributing several
series which clean up the implementation:
- "refine mas_mab_cp()"
- "Reduce the space to be cleared for maple_big_node"
- "maple_tree: simplify mas_push_node()"
- "Following cleanup after introduce mas_wr_store_type()"
- "refine storing null"
- The series "selftests/mm: hugetlb_fault_after_madv improvements" from
David Hildenbrand fixes this selftest for s390.
- The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
implements some rationaizations and cleanups in the page mapping
code.
- The series "mm: optimize shadow entries removal" from Shakeel Butt
optimizes the file truncation code by speeding up the handling of
shadow entries.
- The series "Remove PageKsm()" from Matthew Wilcox completes the
migration of this flag over to being a folio-based flag.
- The series "Unify hugetlb into arch_get_unmapped_area functions" from
Oscar Salvador implements a bunch of consolidations and cleanups in
the hugetlb code.
- The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
takes away the wp-fault time practice of turning a huge zero page
into small pages. Instead we replace the whole thing with a THP. More
consistent cleaner and potentiall saves a large number of pagefaults.
- The series "percpu: Add a test case and fix for clang" from Andy
Shevchenko enhances and fixes the kernel's built in percpu test code.
- The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
optimizes mremap() by avoiding doing things which we didn't need to
do.
- The series "Improve the tmpfs large folio read performance" from
Baolin Wang teaches tmpfs to copy data into userspace at the folio
size rather than as individual pages. A 20% speedup was observed.
- The series "mm/damon/vaddr: Fix issue in
damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON
splitting.
- The series "memcg-v1: fully deprecate charge moving" from Shakeel
Butt removes the long-deprecated memcgv2 charge moving feature.
- The series "fix error handling in mmap_region() and refactor" from
Lorenzo Stoakes cleanup up some of the mmap() error handling and
addresses some potential performance issues.
- The series "x86/module: use large ROX pages for text allocations"
from Mike Rapoport teaches x86 to use large pages for
read-only-execute module text.
- The series "page allocation tag compression" from Suren Baghdasaryan
is followon maintenance work for the new page allocation profiling
feature.
- The series "page->index removals in mm" from Matthew Wilcox remove
most references to page->index in mm/. A slow march towards shrinking
struct page.
- The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
interface tests" from Andrew Paniakin performs maintenance work for
DAMON's self testing code.
- The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
improves zswap's batching of compression and decompression. It is a
step along the way towards using Intel IAA hardware acceleration for
this zswap operation.
- The series "kasan: migrate the last module test to kunit" from
Sabyrzhan Tasbolatov completes the migration of the KASAN built-in
tests over to the KUnit framework.
- The series "implement lightweight guard pages" from Lorenzo Stoakes
permits userapace to place fault-generating guard pages within a
single VMA, rather than requiring that multiple VMAs be created for
this. Improved efficiencies for userspace memory allocators are
expected.
- The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
tracepoints to provide increased visibility into memcg stats flushing
activity.
- The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
fixes a zram buglet which potentially affected performance.
- The series "mm: add more kernel parameters to control mTHP" from
Maíra Canal enhances our ability to control/configuremultisize THP
from the kernel boot command line.
- The series "kasan: few improvements on kunit tests" from Sabyrzhan
Tasbolatov has a couple of fixups for the KASAN KUnit tests.
- The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
from Kairui Song optimizes list_lru memory utilization when lockdep
is enabled.
* tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits)
cma: enforce non-zero pageblock_order during cma_init_reserved_mem()
mm/kfence: add a new kunit test test_use_after_free_read_nofault()
zram: fix NULL pointer in comp_algorithm_show()
memcg/hugetlb: add hugeTLB counters to memcg
vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount
zram: ZRAM_DEF_COMP should depend on ZRAM
MAINTAINERS/MEMORY MANAGEMENT: add document files for mm
Docs/mm/damon: recommend academic papers to read and/or cite
mm: define general function pXd_init()
kmemleak: iommu/iova: fix transient kmemleak false positive
mm/list_lru: simplify the list_lru walk callback function
mm/list_lru: split the lock to per-cgroup scope
mm/list_lru: simplify reparenting and initial allocation
mm/list_lru: code clean up for reparenting
mm/list_lru: don't export list_lru_add
mm/list_lru: don't pass unnecessary key parameters
kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller
kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW
kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols
...
Pull asm-generic updates from Arnd Bergmann:
"These are a number of unrelated cleanups, generally simplifying the
architecture specific header files:
- A series from Al Viro simplifies asm/vga.h, after it turns out that
most of it can be generalized.
- A series from Julian Vetter adds a common version of
memcpy_{to,from}io() and memset_io() and changes most architectures
to use that instead of their own implementation
- A series from Niklas Schnelle concludes his work to make PC style
inb()/outb() optional
- Nicolas Pitre contributes improvements for the generic do_div()
helper
- Christoph Hellwig adds a generic version of page_to_phys() and
phys_to_page(), replacing the slightly different architecture
specific definitions.
- Uwe Kleine-Koenig has a minor cleanup for ioctl definitions"
* tag 'asm-generic-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: (24 commits)
empty include/asm-generic/vga.h
sparc: get rid of asm/vga.h
asm/vga.h: don't bother with scr_mem{cpy,move}v() unless we need to
vt_buffer.h: get rid of dead code in default scr_...() instances
tty: serial: export serial_8250_warn_need_ioport
lib/iomem_copy: fix kerneldoc format style
hexagon: simplify asm/io.h for !HAS_IOPORT
loongarch: Use new fallback IO memcpy/memset
csky: Use new fallback IO memcpy/memset
arm64: Use new fallback IO memcpy/memset
New implementation for IO memcpy and IO memset
watchdog: Add HAS_IOPORT dependency for SBC8360 and SBC7240
__arch_xprod64(): make __always_inline when optimizing for performance
ARM: div64: improve __arch_xprod_64()
asm-generic/div64: optimize/simplify __div64_const32()
lib/math/test_div64: add some edge cases relevant to __div64_const32()
asm-generic: add an optional pfn_valid check to page_to_phys
asm-generic: provide generic page_to_phys and phys_to_page implementations
asm-generic/io.h: Remove I/O port accessors for HAS_IOPORT=n
tty: serial: handle HAS_IOPORT dependencies
...
Pull timer updates from Thomas Gleixner:
"A rather large update for timekeeping and timers:
- The final step to get rid of auto-rearming posix-timers
posix-timers are currently auto-rearmed by the kernel when the
signal of the timer is ignored so that the timer signal can be
delivered once the corresponding signal is unignored.
This requires to throttle the timer to prevent a DoS by small
intervals and keeps the system pointlessly out of low power states
for no value. This is a long standing non-trivial problem due to
the lock order of posix-timer lock and the sighand lock along with
life time issues as the timer and the sigqueue have different life
time rules.
Cure this by:
- Embedding the sigqueue into the timer struct to have the same
life time rules. Aside of that this also avoids the lookup of
the timer in the signal delivery and rearm path as it's just a
always valid container_of() now.
- Queuing ignored timer signals onto a seperate ignored list.
- Moving queued timer signals onto the ignored list when the
signal is switched to SIG_IGN before it could be delivered.
- Walking the ignored list when SIG_IGN is lifted and requeue the
signals to the actual signal lists. This allows the signal
delivery code to rearm the timer.
This also required to consolidate the signal delivery rules so they
are consistent across all situations. With that all self test
scenarios finally succeed.
- Core infrastructure for VFS multigrain timestamping
This is required to allow the kernel to use coarse grained time
stamps by default and switch to fine grained time stamps when inode
attributes are actively observed via getattr().
These changes have been provided to the VFS tree as well, so that
the VFS specific infrastructure could be built on top.
- Cleanup and consolidation of the sleep() infrastructure
- Move all sleep and timeout functions into one file
- Rework udelay() and ndelay() into proper documented inline
functions and replace the hardcoded magic numbers by proper
defines.
- Rework the fsleep() implementation to take the reality of the
timer wheel granularity on different HZ values into account.
Right now the boundaries are hard coded time ranges which fail
to provide the requested accuracy on different HZ settings.
- Update documentation for all sleep/timeout related functions
and fix up stale documentation links all over the place
- Fixup a few usage sites
- Rework of timekeeping and adjtimex(2) to prepare for multiple PTP
clocks
A system can have multiple PTP clocks which are participating in
seperate and independent PTP clock domains. So far the kernel only
considers the PTP clock which is based on CLOCK TAI relevant as
that's the clock which drives the timekeeping adjustments via the
various user space daemons through adjtimex(2).
The non TAI based clock domains are accessible via the file
descriptor based posix clocks, but their usability is very limited.
They can't be accessed fast as they always go all the way out to
the hardware and they cannot be utilized in the kernel itself.
As Time Sensitive Networking (TSN) gains traction it is required to
provide fast user and kernel space access to these clocks.
The approach taken is to utilize the timekeeping and adjtimex(2)
infrastructure to provide this access in a similar way how the
kernel provides access to clock MONOTONIC, REALTIME etc.
Instead of creating a duplicated infrastructure this rework
converts timekeeping and adjtimex(2) into generic functionality
which operates on pointers to data structures instead of using
static variables.
This allows to provide time accessors and adjtimex(2) functionality
for the independent PTP clocks in a subsequent step.
- Consolidate hrtimer initialization
hrtimers are set up by initializing the data structure and then
seperately setting the callback function for historical reasons.
That's an extra unnecessary step and makes Rust support less
straight forward than it should be.
Provide a new set of hrtimer_setup*() functions and convert the
core code and a few usage sites of the less frequently used
interfaces over.
The bulk of the htimer_init() to hrtimer_setup() conversion is
already prepared and scheduled for the next merge window.
- Drivers:
- Ensure that the global timekeeping clocksource is utilizing the
cluster 0 timer on MIPS multi-cluster systems.
Otherwise CPUs on different clusters use their cluster specific
clocksource which is not guaranteed to be synchronized with
other clusters.
- Mostly boring cleanups, fixes, improvements and code movement"
* tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits)
posix-timers: Fix spurious warning on double enqueue versus do_exit()
clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties
clocksource/drivers/gpx: Remove redundant casts
clocksource/drivers/timer-ti-dm: Fix child node refcount handling
dt-bindings: timer: actions,owl-timer: convert to YAML
clocksource/drivers/ralink: Add Ralink System Tick Counter driver
clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource
clocksource/drivers/timer-ti-dm: Don't fail probe if int not found
clocksource/drivers:sp804: Make user selectable
clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions
hrtimers: Delete hrtimer_init_on_stack()
alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack()
io_uring: Switch to use hrtimer_setup_on_stack()
sched/idle: Switch to use hrtimer_setup_on_stack()
hrtimers: Delete hrtimer_init_sleeper_on_stack()
wait: Switch to use hrtimer_setup_sleeper_on_stack()
timers: Switch to use hrtimer_setup_sleeper_on_stack()
net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack()
futex: Switch to use hrtimer_setup_sleeper_on_stack()
fs/aio: Switch to use hrtimer_setup_sleeper_on_stack()
...
Pull vdso data page handling updates from Thomas Gleixner:
"First steps of consolidating the VDSO data page handling.
The VDSO data page handling is architecture specific for historical
reasons, but there is no real technical reason to do so.
Aside of that VDSO data has become a dump ground for various
mechanisms and fail to provide a clear separation of the
functionalities.
Clean this up by:
- consolidating the VDSO page data by getting rid of architecture
specific warts especially in x86 and PowerPC.
- removing the last includes of header files which are pulling in
other headers outside of the VDSO namespace.
- seperating timekeeping and other VDSO data accordingly.
Further consolidation of the VDSO page handling is done in subsequent
changes scheduled for the next merge window.
This also lays the ground for expanding the VDSO time getters for
independent PTP clocks in a generic way without making every
architecture add support seperately"
* tag 'timers-vdso-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
x86/vdso: Add missing brackets in switch case
vdso: Rename struct arch_vdso_data to arch_vdso_time_data
powerpc: Split systemcfg struct definitions out from vdso
powerpc: Split systemcfg data out of vdso data page
powerpc: Add kconfig option for the systemcfg page
powerpc/pseries/lparcfg: Use num_possible_cpus() for potential processors
powerpc/pseries/lparcfg: Fix printing of system_active_processors
powerpc/procfs: Propagate error of remap_pfn_range()
powerpc/vdso: Remove offset comment from 32bit vdso_arch_data
x86/vdso: Split virtual clock pages into dedicated mapping
x86/vdso: Delete vvar.h
x86/vdso: Access vdso data without vvar.h
x86/vdso: Move the rng offset to vsyscall.h
x86/vdso: Access rng vdso data without vvar.h
x86/vdso: Access timens vdso data without vvar.h
x86/vdso: Allocate vvar page from C code
x86/vdso: Access rng data from kernel without vvar
x86/vdso: Place vdso_data at beginning of vvar page
x86/vdso: Use __arch_get_vdso_data() to access vdso data
x86/mm/mmap: Remove arch_vma_name()
...
Pull xattr updates from Al Viro:
"Sanitize xattr and io_uring interactions with it, add *xattrat()
syscalls, sanitize struct filename handling in there"
* tag 'pull-xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
xattr: remove redundant check on variable err
fs/xattr: add *at family syscalls
new helpers: file_removexattr(), filename_removexattr()
new helpers: file_listxattr(), filename_listxattr()
replace do_getxattr() with saner helpers.
replace do_setxattr() with saner helpers.
new helper: import_xattr_name()
fs: rename struct xattr_ctx to kernel_xattr_ctx
xattr: switch to CLASS(fd)
io_[gs]etxattr_prep(): just use getname()
io_uring: IORING_OP_F[GS]ETXATTR is fine with REQ_F_FIXED_FILE
getname_maybe_null() - the third variant of pathname copy-in
teach filename_lookup() to treat NULL filename as ""
all places that use anything defined in it (vgacon, mdacon and
vga16fb) are built only on architectures that have all that
stuff in their native asm/vga.h
allows to kill stub asm/vga.h on sh, while we are at it...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>