diff --git a/.clippy.toml b/.clippy.toml index e4c4eef10b28c1..815c94732ed785 100644 --- a/.clippy.toml +++ b/.clippy.toml @@ -1,5 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 +msrv = "1.78.0" + check-private-items = true disallowed-macros = [ diff --git a/.gitignore b/.gitignore index 6839cf84acda0d..5937c74d3dc1e4 100644 --- a/.gitignore +++ b/.gitignore @@ -22,6 +22,7 @@ *.dtb.S *.dtbo.S *.dwo +*.dylib *.elf *.gcno *.gcda diff --git a/.mailmap b/.mailmap index 399322897938d6..ae0adc499f4acb 100644 --- a/.mailmap +++ b/.mailmap @@ -83,6 +83,13 @@ Anirudh Ghayal Antoine Tenart Antoine Tenart Antonio Ospite +Antonio Quartulli +Antonio Quartulli +Antonio Quartulli +Antonio Quartulli +Antonio Quartulli +Antonio Quartulli +Antonio Quartulli Anup Patel Archit Taneja Ard Biesheuvel @@ -135,13 +142,17 @@ Boris Brezillon Boris Brezillon Brendan Higgins Brian Avery +Brian Cain +Brian Cain Brian King Brian Silverman Bryan Tan Cai Huoqing Can Guo Carl Huang -Carlos Bilbao +Carlos Bilbao +Carlos Bilbao +Carlos Bilbao Changbin Du Changbin Du Chao Yu @@ -158,6 +169,7 @@ Christian Brauner Christian Brauner Christian Marangi Christophe Ricard +Christopher Obbard Christoph Hellwig Chuck Lever Chuck Lever @@ -254,6 +266,7 @@ Guo Ren Guru Das Srinagesh Gustavo Padovan Gustavo Padovan +Hamza Mahfooz Hanjun Guo Hans Verkuil Hans Verkuil @@ -408,6 +421,7 @@ Liam Mark Linas Vepstas Linus Lüssing Linus Lüssing +Linus Lüssing Li Yang Li Yang @@ -430,6 +444,8 @@ Marcin Nowakowski Marc Zyngier Marek Behún Marek Behún Marek Behun +Marek Lindner +Marek Lindner Mark Brown Mark Starovoytov Markus Schneider-Pargmann @@ -532,6 +548,8 @@ Oleksij Rempel Oleksij Rempel Oleksij Rempel Oleksij Rempel +Oliver Hartkopp +Oliver Hartkopp Oliver Upton Ondřej Jirman Oza Pawandeep @@ -643,6 +661,11 @@ Simona Vetter Simon Horman Simon Horman Simon Kelley +Simon Wunderlich +Simon Wunderlich +Simon Wunderlich +Simon Wunderlich +Simon Wunderlich Sricharan Ramabadhran Srinivas Ramana Sriram R @@ -663,6 +686,11 @@ Sudarshan Rajagopalan Sudeep Holla Sudeep KarkadaNagesha Sumit Semwal Surabhi Vishnoi +Sven Eckelmann +Sven Eckelmann +Sven Eckelmann +Sven Eckelmann +Sven Eckelmann Takashi YOSHII Tamizh Chelvam Raja Taniya Das @@ -739,6 +767,7 @@ Wolfram Sang Yakir Yang Yanteng Si Ying Huang +Yosry Ahmed Yusuke Goda Zack Rusin Zhu Yanjun diff --git a/CREDITS b/CREDITS index cda68f04d5f11a..53d11a46fd698b 100644 --- a/CREDITS +++ b/CREDITS @@ -2515,11 +2515,9 @@ D: SLS distribution D: Initial implementation of VC's, pty's and select() N: Pavel Machek -E: pavel@ucw.cz +E: pavel@kernel.org P: 4096R/92DFCE96 4FA7 9EEF FCD4 C44F C585 B8C7 C060 2241 92DF CE96 -D: Softcursor for vga, hypertech cdrom support, vcsa bugfix, nbd, -D: sun4/330 port, capabilities for elf, speedup for rm on ext2, USB, -D: work on suspend-to-ram/disk, killing duplicates from ioctl32, +D: NBD, Sun4/330 port, USB, work on suspend-to-ram/disk, D: Altera SoCFPGA and Nokia N900 support. S: Czech Republic @@ -4339,7 +4337,7 @@ D: Freescale Highspeed USB device driver D: Freescale QE SoC support and Ethernet driver S: B-1206 Jingmao Guojigongyu S: 16 Baliqiao Nanjie, Beijing 101100 -S: People's Repulic of China +S: People's Republic of China N: Vlad Yasevich E: vyasevich@gmail.com diff --git a/Documentation/ABI/obsolete/sysfs-class-cxl b/Documentation/ABI/obsolete/sysfs-class-cxl new file mode 100644 index 00000000000000..8cba1b626985ea --- /dev/null +++ b/Documentation/ABI/obsolete/sysfs-class-cxl @@ -0,0 +1,273 @@ +The cxl driver is no longer maintained, and will be removed from the kernel in +the near future. + +Please note that attributes that are shared between devices are stored in +the directory pointed to by the symlink device/. +For example, the real path of the attribute /sys/class/cxl/afu0.0s/irqs_max is +/sys/class/cxl/afu0.0s/device/irqs_max, i.e. /sys/class/cxl/afu0.0/irqs_max. + + +Slave contexts (eg. /sys/class/cxl/afu0.0s): + +What: /sys/class/cxl//afu_err_buf +Date: September 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + AFU Error Buffer contents. The contents of this file are + application specific and depends on the AFU being used. + Applications interacting with the AFU can use this attribute + to know about the current error condition and take appropriate + action like logging the event etc. + + +What: /sys/class/cxl//irqs_max +Date: September 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read/write + Decimal value of maximum number of interrupts that can be + requested by userspace. The default on probe is the maximum + that hardware can support (eg. 2037). Write values will limit + userspace applications to that many userspace interrupts. Must + be >= irqs_min. +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//irqs_min +Date: September 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Decimal value of the minimum number of interrupts that + userspace must request on a CXL_START_WORK ioctl. Userspace may + omit the num_interrupts field in the START_WORK IOCTL to get + this minimum automatically. +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//mmio_size +Date: September 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Decimal value of the size of the MMIO space that may be mmapped + by userspace. +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//modes_supported +Date: September 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + List of the modes this AFU supports. One per line. + Valid entries are: "dedicated_process" and "afu_directed" +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//mode +Date: September 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read/write + The current mode the AFU is using. Will be one of the modes + given in modes_supported. Writing will change the mode + provided that no user contexts are attached. +Users: https://github.com/ibm-capi/libcxl + + +What: /sys/class/cxl//prefault_mode +Date: September 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read/write + Set the mode for prefaulting in segments into the segment table + when performing the START_WORK ioctl. Only applicable when + running under hashed page table mmu. + Possible values: + + ======================= ====================================== + none No prefaulting (default) + work_element_descriptor Treat the work element + descriptor as an effective address and + prefault what it points to. + all all segments process calling + START_WORK maps. + ======================= ====================================== + +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//reset +Date: September 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: write only + Writing 1 here will reset the AFU provided there are not + contexts active on the AFU. +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//api_version +Date: September 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Decimal value of the current version of the kernel/user API. +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//api_version_compatible +Date: September 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Decimal value of the lowest version of the userspace API + this kernel supports. +Users: https://github.com/ibm-capi/libcxl + + +AFU configuration records (eg. /sys/class/cxl/afu0.0/cr0): + +An AFU may optionally export one or more PCIe like configuration records, known +as AFU configuration records, which will show up here (if present). + +What: /sys/class/cxl//cr/vendor +Date: February 2015 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Hexadecimal value of the vendor ID found in this AFU + configuration record. +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//cr/device +Date: February 2015 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Hexadecimal value of the device ID found in this AFU + configuration record. +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//cr/class +Date: February 2015 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Hexadecimal value of the class code found in this AFU + configuration record. +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//cr/config +Date: February 2015 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + This binary file provides raw access to the AFU configuration + record. The format is expected to match the either the standard + or extended configuration space defined by the PCIe + specification. +Users: https://github.com/ibm-capi/libcxl + + + +Master contexts (eg. /sys/class/cxl/afu0.0m) + +What: /sys/class/cxl/m/mmio_size +Date: September 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Decimal value of the size of the MMIO space that may be mmapped + by userspace. This includes all slave contexts space also. +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl/m/pp_mmio_len +Date: September 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Decimal value of the Per Process MMIO space length. +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl/m/pp_mmio_off +Date: September 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + (not in a guest) + Decimal value of the Per Process MMIO space offset. +Users: https://github.com/ibm-capi/libcxl + + +Card info (eg. /sys/class/cxl/card0) + +What: /sys/class/cxl//caia_version +Date: September 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Identifies the CAIA Version the card implements. +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//psl_revision +Date: September 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Identifies the revision level of the PSL. +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//base_image +Date: September 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + (not in a guest) + Identifies the revision level of the base image for devices + that support loadable PSLs. For FPGAs this field identifies + the image contained in the on-adapter flash which is loaded + during the initial program load. +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//image_loaded +Date: September 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + (not in a guest) + Will return "user" or "factory" depending on the image loaded + onto the card. +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//load_image_on_perst +Date: December 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read/write + (not in a guest) + Valid entries are "none", "user", and "factory". + "none" means PERST will not cause image to be loaded to the + card. A power cycle is required to load the image. + "none" could be useful for debugging because the trace arrays + are preserved. + + "user" and "factory" means PERST will cause either the user or + user or factory image to be loaded. + Default is to reload on PERST whichever image the card has + loaded. +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//reset +Date: October 2014 +Contact: linuxppc-dev@lists.ozlabs.org +Description: write only + Writing 1 will issue a PERST to card provided there are no + contexts active on any one of the card AFUs. This may cause + the card to reload the FPGA depending on load_image_on_perst. + Writing -1 will do a force PERST irrespective of any active + contexts on the card AFUs. +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//perst_reloads_same_image +Date: July 2015 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read/write + (not in a guest) + Trust that when an image is reloaded via PERST, it will not + have changed. + + == ================================================= + 0 don't trust, the image may be different (default) + 1 trust that the image will not change. + == ================================================= +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//psl_timebase_synced +Date: March 2016 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Returns 1 if the psl timebase register is synchronized + with the core timebase register, 0 otherwise. +Users: https://github.com/ibm-capi/libcxl + +What: /sys/class/cxl//tunneled_ops_supported +Date: May 2018 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Returns 1 if tunneled operations are supported in capi mode, + 0 otherwise. +Users: https://github.com/ibm-capi/libcxl diff --git a/Documentation/ABI/stable/sysfs-class-bluetooth b/Documentation/ABI/stable/sysfs-class-bluetooth new file mode 100644 index 00000000000000..36be0247117471 --- /dev/null +++ b/Documentation/ABI/stable/sysfs-class-bluetooth @@ -0,0 +1,9 @@ +What: /sys/class/bluetooth/hci/reset +Date: 14-Jan-2025 +KernelVersion: 6.13 +Contact: linux-bluetooth@vger.kernel.org +Description: This write-only attribute allows users to trigger the vendor reset + method on the Bluetooth device when arbitrary data is written. + The reset may or may not be done through the device transport + (e.g., UART/USB), and can also be done through an out-of-band + approach such as GPIO. diff --git a/Documentation/ABI/testing/sysfs-bus-coresight-devices-dummy-source b/Documentation/ABI/testing/sysfs-bus-coresight-devices-dummy-source new file mode 100644 index 00000000000000..0830661ef65684 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-coresight-devices-dummy-source @@ -0,0 +1,15 @@ +What: /sys/bus/coresight/devices/dummy_source/enable_source +Date: Dec 2024 +KernelVersion: 6.14 +Contact: Mao Jinlong +Description: (RW) Enable/disable tracing of dummy source. A sink should be activated + before enabling the source. The path of coresight components linking + the source to the sink is configured and managed automatically by the + coresight framework. + +What: /sys/bus/coresight/devices/dummy_source/traceid +Date: Dec 2024 +KernelVersion: 6.14 +Contact: Mao Jinlong +Description: (R) Show the trace ID that will appear in the trace stream + coming from this trace entity. diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices b/Documentation/ABI/testing/sysfs-bus-event_source-devices new file mode 100644 index 00000000000000..79b268319df183 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices @@ -0,0 +1,24 @@ +What: /sys/bus/event_source/devices/ +Date: 2014/02/24 +Contact: Linux kernel mailing list +Description: Performance Monitoring Unit () + + Each directory, for a PMU device, is a name + optionally followed by an underscore and then either a + decimal or hexadecimal number. For example, cpu is a + PMU name without a suffix as is intel_bts, + uncore_imc_0 is a PMU name with a 0 numeric suffix, + ddr_pmu_87e1b0000000 is a PMU name with a hex + suffix. The hex suffix must be more than two + characters long to avoid ambiguity with PMUs like the + S390 cpum_cf. + + Tools can treat PMUs with the same name that differ by + suffix as instances of the same PMU for the sake of, + for example, opening an event. For example, the PMUs + uncore_imc_free_running_0 and + uncore_imc_free_running_1 have an event data_read; + opening the data_read event on a PMU specified as + uncore_imc_free_running should be treated as opening + the data_read event on PMU uncore_imc_free_running_0 + and PMU uncore_imc_free_running_1. diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-events b/Documentation/ABI/testing/sysfs-bus-event_source-devices-events index e7efeab2ee83ce..0fe1b948720258 100644 --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-events +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-events @@ -37,11 +37,13 @@ Description: Per-pmu performance monitoring events specific to the running syste performance monitoring event supported by the . The name of the file is the name of the event. - As performance monitoring event names are case - insensitive in the perf tool, the perf tool only looks - for lower or upper case event names in sysfs to avoid + As performance monitoring event names are case insensitive + in the perf tool, the perf tool only looks for all lower + case or all upper case event names in sysfs to avoid scanning the directory. It is therefore required the - name of the event here is either lower or upper case. + name of the event here is either completely lower or upper + case, with no mixed-case characters. Numbers, '.', '_', and + '-' are also allowed. File contents: diff --git a/Documentation/ABI/testing/sysfs-bus-iio b/Documentation/ABI/testing/sysfs-bus-iio index f83bd6829285cd..25d366d452a552 100644 --- a/Documentation/ABI/testing/sysfs-bus-iio +++ b/Documentation/ABI/testing/sysfs-bus-iio @@ -168,18 +168,6 @@ Description: is required is a consistent labeling. Units after application of scale and offset are millivolts. -What: /sys/bus/iio/devices/iio:deviceX/in_currentY_raw -What: /sys/bus/iio/devices/iio:deviceX/in_currentY_supply_raw -KernelVersion: 3.17 -Contact: linux-iio@vger.kernel.org -Description: - Raw (unscaled no bias removal etc.) current measurement from - channel Y. In special cases where the channel does not - correspond to externally available input one of the named - versions may be used. The number must always be specified and - unique to allow association with event codes. Units after - application of scale and offset are milliamps. - What: /sys/bus/iio/devices/iio:deviceX/in_powerY_raw KernelVersion: 4.5 Contact: linux-iio@vger.kernel.org @@ -227,7 +215,7 @@ Description: same scaling as _raw. What: /sys/bus/iio/devices/iio:deviceX/in_temp_raw -What: /sys/bus/iio/devices/iio:deviceX/in_tempX_raw +What: /sys/bus/iio/devices/iio:deviceX/in_tempY_raw What: /sys/bus/iio/devices/iio:deviceX/in_temp_x_raw What: /sys/bus/iio/devices/iio:deviceX/in_temp_y_raw What: /sys/bus/iio/devices/iio:deviceX/in_temp_ambient_raw @@ -416,11 +404,11 @@ Contact: linux-iio@vger.kernel.org Description: Scaled humidity measurement in milli percent. -What: /sys/bus/iio/devices/iio:deviceX/in_X_mean_raw +What: /sys/bus/iio/devices/iio:deviceX/in_Y_mean_raw KernelVersion: 3.5 Contact: linux-iio@vger.kernel.org Description: - Averaged raw measurement from channel X. The number of values + Averaged raw measurement from channel Y. The number of values used for averaging is device specific. The converting rules for normal raw values also applies to the averaged raw values. @@ -448,7 +436,7 @@ What: /sys/bus/iio/devices/iio:deviceX/in_humidityrelative_offset What: /sys/bus/iio/devices/iio:deviceX/in_magn_offset What: /sys/bus/iio/devices/iio:deviceX/in_rot_offset What: /sys/bus/iio/devices/iio:deviceX/in_angl_offset -What: /sys/bus/iio/devices/iio:deviceX/in_capacitanceX_offset +What: /sys/bus/iio/devices/iio:deviceX/in_capacitanceY_offset KernelVersion: 2.6.35 Contact: linux-iio@vger.kernel.org Description: @@ -508,6 +496,9 @@ What: /sys/bus/iio/devices/iio:deviceX/in_angl_scale What: /sys/bus/iio/devices/iio:deviceX/in_intensity_x_scale What: /sys/bus/iio/devices/iio:deviceX/in_intensity_y_scale What: /sys/bus/iio/devices/iio:deviceX/in_intensity_z_scale +What: /sys/bus/iio/devices/iio:deviceX/in_intensity_red_scale +What: /sys/bus/iio/devices/iio:deviceX/in_intensity_green_scale +What: /sys/bus/iio/devices/iio:deviceX/in_intensity_blue_scale What: /sys/bus/iio/devices/iio:deviceX/in_concentration_co2_scale KernelVersion: 2.6.35 Contact: linux-iio@vger.kernel.org @@ -660,10 +651,10 @@ What: /sys/.../iio:deviceX/in_magn_scale_available What: /sys/.../iio:deviceX/in_illuminance_scale_available What: /sys/.../iio:deviceX/in_intensity_scale_available What: /sys/.../iio:deviceX/in_proximity_scale_available -What: /sys/.../iio:deviceX/in_voltageX_scale_available +What: /sys/.../iio:deviceX/in_voltageY_scale_available What: /sys/.../iio:deviceX/in_voltage-voltage_scale_available -What: /sys/.../iio:deviceX/out_voltageX_scale_available -What: /sys/.../iio:deviceX/out_altvoltageX_scale_available +What: /sys/.../iio:deviceX/out_voltageY_scale_available +What: /sys/.../iio:deviceX/out_altvoltageY_scale_available What: /sys/.../iio:deviceX/in_capacitance_scale_available What: /sys/.../iio:deviceX/in_pressure_scale_available What: /sys/.../iio:deviceX/in_pressureY_scale_available @@ -681,6 +672,7 @@ What: /sys/bus/iio/devices/iio:deviceX/in_intensity_red_hardwaregain What: /sys/bus/iio/devices/iio:deviceX/in_intensity_green_hardwaregain What: /sys/bus/iio/devices/iio:deviceX/in_intensity_blue_hardwaregain What: /sys/bus/iio/devices/iio:deviceX/in_intensity_clear_hardwaregain +What: /sys/bus/iio/devices/iio:deviceX/in_illuminance_hardwaregain KernelVersion: 2.6.35 Contact: linux-iio@vger.kernel.org Description: @@ -1562,7 +1554,7 @@ Description: This attribute is used to read the amount of quadrature error present in the device at a given time. -What: /sys/.../iio:deviceX/in_accelX_power_mode +What: /sys/.../iio:deviceX/in_accelY_power_mode KernelVersion: 3.11 Contact: linux-iio@vger.kernel.org Description: @@ -1633,6 +1625,10 @@ What: /sys/.../iio:deviceX/in_intensityY_uv_raw What: /sys/.../iio:deviceX/in_intensityY_uva_raw What: /sys/.../iio:deviceX/in_intensityY_uvb_raw What: /sys/.../iio:deviceX/in_intensityY_duv_raw +What: /sys/.../iio:deviceX/in_intensity_red_raw +What: /sys/.../iio:deviceX/in_intensity_green_raw +What: /sys/.../iio:deviceX/in_intensity_blue_raw +What: /sys/.../iio:deviceX/in_intensity_clear_raw KernelVersion: 3.4 Contact: linux-iio@vger.kernel.org Description: @@ -1691,16 +1687,19 @@ Description: Raw value of rotation from true/magnetic north measured with or without compensation from tilt sensors. -What: /sys/bus/iio/devices/iio:deviceX/in_currentX_raw -What: /sys/bus/iio/devices/iio:deviceX/in_currentX_i_raw -What: /sys/bus/iio/devices/iio:deviceX/in_currentX_q_raw -KernelVersion: 3.18 +What: /sys/bus/iio/devices/iio:deviceX/in_currentY_raw +What: /sys/bus/iio/devices/iio:deviceX/in_currentY_supply_raw +What: /sys/bus/iio/devices/iio:deviceX/in_currentY_i_raw +What: /sys/bus/iio/devices/iio:deviceX/in_currentY_q_raw +KernelVersion: 3.17 Contact: linux-iio@vger.kernel.org Description: - Raw current measurement from channel X. Units are in milliamps + Raw current measurement from channel Y. Units are in milliamps after application of scale and offset. If no offset or scale is present, output should be considered as processed with the - unit in milliamps. + unit in milliamps. In special cases where the channel does not + correspond to externally available input one of the named + versions may be used. Channels with 'i' and 'q' modifiers always exist in pairs and both channels refer to the same signal. The 'i' channel contains the in-phase @@ -1864,9 +1863,9 @@ Description: hardware fifo watermark level. What: /sys/bus/iio/devices/iio:deviceX/in_temp_calibemissivity -What: /sys/bus/iio/devices/iio:deviceX/in_tempX_calibemissivity +What: /sys/bus/iio/devices/iio:deviceX/in_tempY_calibemissivity What: /sys/bus/iio/devices/iio:deviceX/in_temp_object_calibemissivity -What: /sys/bus/iio/devices/iio:deviceX/in_tempX_object_calibemissivity +What: /sys/bus/iio/devices/iio:deviceX/in_tempY_object_calibemissivity KernelVersion: 4.1 Contact: linux-iio@vger.kernel.org Description: @@ -1887,17 +1886,17 @@ Description: is considered as one sample for [_name]_sampling_frequency. What: /sys/bus/iio/devices/iio:deviceX/in_concentration_raw -What: /sys/bus/iio/devices/iio:deviceX/in_concentrationX_raw +What: /sys/bus/iio/devices/iio:deviceX/in_concentrationY_raw What: /sys/bus/iio/devices/iio:deviceX/in_concentration_co2_raw -What: /sys/bus/iio/devices/iio:deviceX/in_concentrationX_co2_raw +What: /sys/bus/iio/devices/iio:deviceX/in_concentrationY_co2_raw What: /sys/bus/iio/devices/iio:deviceX/in_concentration_ethanol_raw -What: /sys/bus/iio/devices/iio:deviceX/in_concentrationX_ethanol_raw +What: /sys/bus/iio/devices/iio:deviceX/in_concentrationY_ethanol_raw What: /sys/bus/iio/devices/iio:deviceX/in_concentration_h2_raw -What: /sys/bus/iio/devices/iio:deviceX/in_concentrationX_h2_raw +What: /sys/bus/iio/devices/iio:deviceX/in_concentrationY_h2_raw What: /sys/bus/iio/devices/iio:deviceX/in_concentration_o2_raw -What: /sys/bus/iio/devices/iio:deviceX/in_concentrationX_o2_raw +What: /sys/bus/iio/devices/iio:deviceX/in_concentrationY_o2_raw What: /sys/bus/iio/devices/iio:deviceX/in_concentration_voc_raw -What: /sys/bus/iio/devices/iio:deviceX/in_concentrationX_voc_raw +What: /sys/bus/iio/devices/iio:deviceX/in_concentrationY_voc_raw KernelVersion: 4.3 Contact: linux-iio@vger.kernel.org Description: @@ -1905,9 +1904,9 @@ Description: after application of scale and offset are percents. What: /sys/bus/iio/devices/iio:deviceX/in_resistance_raw -What: /sys/bus/iio/devices/iio:deviceX/in_resistanceX_raw +What: /sys/bus/iio/devices/iio:deviceX/in_resistanceY_raw What: /sys/bus/iio/devices/iio:deviceX/out_resistance_raw -What: /sys/bus/iio/devices/iio:deviceX/out_resistanceX_raw +What: /sys/bus/iio/devices/iio:deviceX/out_resistanceY_raw KernelVersion: 4.3 Contact: linux-iio@vger.kernel.org Description: @@ -2096,7 +2095,7 @@ Description: One of the following thermocouple types: B, E, J, K, N, R, S, T. What: /sys/bus/iio/devices/iio:deviceX/in_temp_object_calibambient -What: /sys/bus/iio/devices/iio:deviceX/in_tempX_object_calibambient +What: /sys/bus/iio/devices/iio:deviceX/in_tempY_object_calibambient KernelVersion: 5.10 Contact: linux-iio@vger.kernel.org Description: @@ -2172,9 +2171,9 @@ Description: - a range specified as "[min step max]" -What: /sys/bus/iio/devices/iio:deviceX/in_voltageX_sampling_frequency +What: /sys/bus/iio/devices/iio:deviceX/in_voltageY_sampling_frequency What: /sys/bus/iio/devices/iio:deviceX/in_powerY_sampling_frequency -What: /sys/bus/iio/devices/iio:deviceX/in_currentZ_sampling_frequency +What: /sys/bus/iio/devices/iio:deviceX/in_currentY_sampling_frequency KernelVersion: 5.20 Contact: linux-iio@vger.kernel.org Description: diff --git a/Documentation/ABI/testing/sysfs-bus-iio-adc-ad-sigma-delta b/Documentation/ABI/testing/sysfs-bus-iio-adc-ad-sigma-delta new file mode 100644 index 00000000000000..a5a8a579f4f34d --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-iio-adc-ad-sigma-delta @@ -0,0 +1,23 @@ +What: /sys/bus/iio/devices/iio:deviceX/in_voltageY_sys_calibration +KernelVersion: 5.5 +Contact: linux-iio@vger.kernel.org +Description: + This attribute, if available, initiates the system calibration procedure. This is done on a + single channel at a time. Write '1' to start the calibration. + +What: /sys/bus/iio/devices/iio:deviceX/in_voltageY_sys_calibration_mode_available +KernelVersion: 5.5 +Contact: linux-iio@vger.kernel.org +Description: + This attribute, if available, returns a list with the possible calibration modes. + There are two available options: + "zero_scale" - calibrate to zero scale + "full_scale" - calibrate to full scale + +What: /sys/bus/iio/devices/iio:deviceX/in_voltageY_sys_calibration_mode +KernelVersion: 5.5 +Contact: linux-iio@vger.kernel.org +Description: + This attribute, if available, sets up the calibration mode used in the system calibration + procedure. Reading returns the current calibration mode. + Writing sets the system calibration mode. diff --git a/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7192 b/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7192 index f8315202c8f0df..28be1cabf1124a 100644 --- a/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7192 +++ b/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7192 @@ -19,33 +19,9 @@ Description: the bridge can be disconnected (when it is not being used using the bridge_switch_en attribute. -What: /sys/bus/iio/devices/iio:deviceX/in_voltagex_sys_calibration -KernelVersion: -Contact: linux-iio@vger.kernel.org -Description: - Initiates the system calibration procedure. This is done on a - single channel at a time. Write '1' to start the calibration. - What: /sys/bus/iio/devices/iio:deviceX/in_voltage2-voltage2_shorted_raw KernelVersion: Contact: linux-iio@vger.kernel.org Description: Measure voltage from AIN2 pin connected to AIN(+) and AIN(-) shorted. - -What: /sys/bus/iio/devices/iio:deviceX/in_voltagex_sys_calibration_mode_available -KernelVersion: -Contact: linux-iio@vger.kernel.org -Description: - Reading returns a list with the possible calibration modes. - There are two available options: - "zero_scale" - calibrate to zero scale - "full_scale" - calibrate to full scale - -What: /sys/bus/iio/devices/iio:deviceX/in_voltagex_sys_calibration_mode -KernelVersion: -Contact: linux-iio@vger.kernel.org -Description: - Sets up the calibration mode used in the system calibration - procedure. Reading returns the current calibration mode. - Writing sets the system calibration mode. diff --git a/Documentation/ABI/testing/sysfs-class-cxl b/Documentation/ABI/testing/sysfs-class-cxl deleted file mode 100644 index cfc48a87706ba2..00000000000000 --- a/Documentation/ABI/testing/sysfs-class-cxl +++ /dev/null @@ -1,270 +0,0 @@ -Please note that attributes that are shared between devices are stored in -the directory pointed to by the symlink device/. -For example, the real path of the attribute /sys/class/cxl/afu0.0s/irqs_max is -/sys/class/cxl/afu0.0s/device/irqs_max, i.e. /sys/class/cxl/afu0.0/irqs_max. - - -Slave contexts (eg. /sys/class/cxl/afu0.0s): - -What: /sys/class/cxl//afu_err_buf -Date: September 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - AFU Error Buffer contents. The contents of this file are - application specific and depends on the AFU being used. - Applications interacting with the AFU can use this attribute - to know about the current error condition and take appropriate - action like logging the event etc. - - -What: /sys/class/cxl//irqs_max -Date: September 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read/write - Decimal value of maximum number of interrupts that can be - requested by userspace. The default on probe is the maximum - that hardware can support (eg. 2037). Write values will limit - userspace applications to that many userspace interrupts. Must - be >= irqs_min. -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//irqs_min -Date: September 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - Decimal value of the minimum number of interrupts that - userspace must request on a CXL_START_WORK ioctl. Userspace may - omit the num_interrupts field in the START_WORK IOCTL to get - this minimum automatically. -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//mmio_size -Date: September 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - Decimal value of the size of the MMIO space that may be mmapped - by userspace. -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//modes_supported -Date: September 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - List of the modes this AFU supports. One per line. - Valid entries are: "dedicated_process" and "afu_directed" -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//mode -Date: September 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read/write - The current mode the AFU is using. Will be one of the modes - given in modes_supported. Writing will change the mode - provided that no user contexts are attached. -Users: https://github.com/ibm-capi/libcxl - - -What: /sys/class/cxl//prefault_mode -Date: September 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read/write - Set the mode for prefaulting in segments into the segment table - when performing the START_WORK ioctl. Only applicable when - running under hashed page table mmu. - Possible values: - - ======================= ====================================== - none No prefaulting (default) - work_element_descriptor Treat the work element - descriptor as an effective address and - prefault what it points to. - all all segments process calling - START_WORK maps. - ======================= ====================================== - -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//reset -Date: September 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: write only - Writing 1 here will reset the AFU provided there are not - contexts active on the AFU. -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//api_version -Date: September 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - Decimal value of the current version of the kernel/user API. -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//api_version_compatible -Date: September 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - Decimal value of the lowest version of the userspace API - this kernel supports. -Users: https://github.com/ibm-capi/libcxl - - -AFU configuration records (eg. /sys/class/cxl/afu0.0/cr0): - -An AFU may optionally export one or more PCIe like configuration records, known -as AFU configuration records, which will show up here (if present). - -What: /sys/class/cxl//cr/vendor -Date: February 2015 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - Hexadecimal value of the vendor ID found in this AFU - configuration record. -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//cr/device -Date: February 2015 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - Hexadecimal value of the device ID found in this AFU - configuration record. -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//cr/class -Date: February 2015 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - Hexadecimal value of the class code found in this AFU - configuration record. -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//cr/config -Date: February 2015 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - This binary file provides raw access to the AFU configuration - record. The format is expected to match the either the standard - or extended configuration space defined by the PCIe - specification. -Users: https://github.com/ibm-capi/libcxl - - - -Master contexts (eg. /sys/class/cxl/afu0.0m) - -What: /sys/class/cxl/m/mmio_size -Date: September 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - Decimal value of the size of the MMIO space that may be mmapped - by userspace. This includes all slave contexts space also. -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl/m/pp_mmio_len -Date: September 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - Decimal value of the Per Process MMIO space length. -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl/m/pp_mmio_off -Date: September 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - (not in a guest) - Decimal value of the Per Process MMIO space offset. -Users: https://github.com/ibm-capi/libcxl - - -Card info (eg. /sys/class/cxl/card0) - -What: /sys/class/cxl//caia_version -Date: September 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - Identifies the CAIA Version the card implements. -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//psl_revision -Date: September 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - Identifies the revision level of the PSL. -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//base_image -Date: September 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - (not in a guest) - Identifies the revision level of the base image for devices - that support loadable PSLs. For FPGAs this field identifies - the image contained in the on-adapter flash which is loaded - during the initial program load. -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//image_loaded -Date: September 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - (not in a guest) - Will return "user" or "factory" depending on the image loaded - onto the card. -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//load_image_on_perst -Date: December 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read/write - (not in a guest) - Valid entries are "none", "user", and "factory". - "none" means PERST will not cause image to be loaded to the - card. A power cycle is required to load the image. - "none" could be useful for debugging because the trace arrays - are preserved. - - "user" and "factory" means PERST will cause either the user or - user or factory image to be loaded. - Default is to reload on PERST whichever image the card has - loaded. -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//reset -Date: October 2014 -Contact: linuxppc-dev@lists.ozlabs.org -Description: write only - Writing 1 will issue a PERST to card provided there are no - contexts active on any one of the card AFUs. This may cause - the card to reload the FPGA depending on load_image_on_perst. - Writing -1 will do a force PERST irrespective of any active - contexts on the card AFUs. -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//perst_reloads_same_image -Date: July 2015 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read/write - (not in a guest) - Trust that when an image is reloaded via PERST, it will not - have changed. - - == ================================================= - 0 don't trust, the image may be different (default) - 1 trust that the image will not change. - == ================================================= -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//psl_timebase_synced -Date: March 2016 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - Returns 1 if the psl timebase register is synchronized - with the core timebase register, 0 otherwise. -Users: https://github.com/ibm-capi/libcxl - -What: /sys/class/cxl//tunneled_ops_supported -Date: May 2018 -Contact: linuxppc-dev@lists.ozlabs.org -Description: read only - Returns 1 if tunneled operations are supported in capi mode, - 0 otherwise. -Users: https://github.com/ibm-capi/libcxl diff --git a/Documentation/ABI/testing/sysfs-class-platform-profile b/Documentation/ABI/testing/sysfs-class-platform-profile new file mode 100644 index 00000000000000..dc72adfb830a4f --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-platform-profile @@ -0,0 +1,48 @@ +What: /sys/class/platform-profile/platform-profile-X/name +Date: March 2025 +KernelVersion: 6.14 +Description: Name of the class device given by the driver. + + RO + +What: /sys/class/platform-profile/platform-profile-X/choices +Date: March 2025 +KernelVersion: 6.14 +Description: This file contains a space-separated list of profiles supported + for this device. + + Drivers must use the following standard profile-names: + + ==================== ======================================== + low-power Low power consumption + cool Cooler operation + quiet Quieter operation + balanced Balance between low power consumption + and performance + balanced-performance Balance between performance and low + power consumption with a slight bias + towards performance + performance High performance operation + custom Driver defined custom profile + ==================== ======================================== + + RO + +What: /sys/class/platform-profile/platform-profile-X/profile +Date: March 2025 +KernelVersion: 6.14 +Description: Reading this file gives the current selected profile for this + device. Writing this file with one of the strings from + platform_profile_choices changes the profile to the new value. + + This file can be monitored for changes by polling for POLLPRI, + POLLPRI will be signaled on any changes, independent of those + changes coming from a userspace write; or coming from another + source such as e.g. a hotkey triggered profile change handled + either directly by the embedded-controller or fully handled + inside the kernel. + + This file may also emit the string 'custom' to indicate + that the driver is using a driver defined custom profile. + + RW diff --git a/Documentation/ABI/testing/sysfs-class-power b/Documentation/ABI/testing/sysfs-class-power index 45180b62d42686..2a5c1a09a28f91 100644 --- a/Documentation/ABI/testing/sysfs-class-power +++ b/Documentation/ABI/testing/sysfs-class-power @@ -407,10 +407,30 @@ Description: Access: Read, Write + Reading this returns the current active value, e.g. 'Standard'. + Check charge_types to get the values supported by the battery. + Valid values: "Unknown", "N/A", "Trickle", "Fast", "Standard", "Adaptive", "Custom", "Long Life", "Bypass" +What: /sys/class/power_supply//charge_types +Date: December 2024 +Contact: linux-pm@vger.kernel.org +Description: + Identical to charge_type but reading returns a list of supported + charge-types with the currently active type surrounded by square + brackets, e.g.: "Fast [Standard] Long_Life". + + power_supply class devices may support both charge_type and + charge_types for backward compatibility. In this case both will + always have the same active value and the active value can be + changed by writing either property. + + Note charge-types which contain a space such as "Long Life" will + have the space replaced by a '_' resulting in e.g. "Long_Life". + When writing charge-types both variants are accepted. + What: /sys/class/power_supply//charge_term_current Date: July 2014 Contact: linux-pm@vger.kernel.org @@ -433,7 +453,7 @@ Description: Valid values: "Unknown", "Good", "Overheat", "Dead", - "Over voltage", "Unspecified failure", "Cold", + "Over voltage", "Under voltage", "Unspecified failure", "Cold", "Watchdog timer expire", "Safety timer expire", "Over current", "Calibration required", "Warm", "Cool", "Hot", "No battery" @@ -793,3 +813,12 @@ Description: Access: Read Valid values: 1-31 + +What: /sys/class/power_supply//extensions/ +Date: March 2025 +Contact: linux-pm@vger.kernel.org +Description: + Reports the extensions registered to the power supply. + Each entry is a link to the device which registered the extension. + + Access: Read diff --git a/Documentation/ABI/testing/sysfs-class-power-max1720x b/Documentation/ABI/testing/sysfs-class-power-max1720x new file mode 100644 index 00000000000000..7d895bfda9ced2 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-power-max1720x @@ -0,0 +1,32 @@ +What: /sys/class/power_supply/max1720x/temp_ain1 +Date: January 2025 +KernelVersion: 6.14 +Contact: Dimitri Fedrau +Description: + Reports the current temperature reading from AIN1 thermistor. + + Access: Read + + Valid values: Represented in 1/10 Degrees Celsius + +What: /sys/class/power_supply/max1720x/temp_ain2 +Date: January 2025 +KernelVersion: 6.14 +Contact: Dimitri Fedrau +Description: + Reports the current temperature reading from AIN2 thermistor. + + Access: Read + + Valid values: Represented in 1/10 Degrees Celsius + +What: /sys/class/power_supply/max1720x/temp_int +Date: January 2025 +KernelVersion: 6.14 +Contact: Dimitri Fedrau +Description: + Reports the current temperature reading from internal die. + + Access: Read + + Valid values: Represented in 1/10 Degrees Celsius diff --git a/Documentation/ABI/testing/sysfs-kernel-livepatch b/Documentation/ABI/testing/sysfs-kernel-livepatch index 3735d868013d01..3c3f36b32b579e 100644 --- a/Documentation/ABI/testing/sysfs-kernel-livepatch +++ b/Documentation/ABI/testing/sysfs-kernel-livepatch @@ -55,6 +55,15 @@ Description: An attribute which indicates whether the patch supports atomic-replace. +What: /sys/kernel/livepatch//stack_order +Date: Jan 2025 +KernelVersion: 6.14.0 +Description: + This attribute specifies the sequence in which live patch modules + are applied to the system. If multiple live patches modify the same + function, the implementation with the biggest 'stack_order' number + is used, unless a transition is currently in progress. + What: /sys/kernel/livepatch// Date: Nov 2014 KernelVersion: 3.19.0 diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon index f1b90cf1249b5f..b057eddefbfc9a 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-damon +++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon @@ -355,10 +355,15 @@ Description: If 'target' is written to the 'type' file, writing to or What: /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//filters//matching Date: Dec 2022 Contact: SeongJae Park -Description: Writing 'Y' or 'N' to this file sets whether to filter out - pages that do or do not match to the 'type' and 'memcg_path', - respectively. Filter out means the action of the scheme will - not be applied to. +Description: Writing 'Y' or 'N' to this file sets whether the filter is for + the memory of the 'type', or all except the 'type'. + +What: /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//filters//allow +Date: Jan 2025 +Contact: SeongJae Park +Description: Writing 'Y' or 'N' to this file sets whether to allow or reject + applying the scheme's action to the memory that satisfies the + 'type' and the 'matching' of the directory. What: /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//stats/nr_tried Date: Mar 2022 @@ -384,6 +389,12 @@ Contact: SeongJae Park Description: Reading this file returns the total size of regions that the action of the scheme has successfully applied in bytes. +What: /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//stats/sz_ops_filter_passed +Date: Dec 2024 +Contact: SeongJae Park +Description: Reading this file returns the total size of memory that passed + DAMON operations layer-handled filters of the scheme in bytes. + What: /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//stats/qt_exceeds Date: Mar 2022 Contact: SeongJae Park @@ -424,3 +435,10 @@ Contact: SeongJae Park Description: Reading this file returns the 'age' of a memory region that corresponding DAMON-based Operation Scheme's action has tried to be applied. + +What: /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//tried_regions//sz_filter_passed +Date: Dec 2024 +Contact: SeongJae Park +Description: Reading this file returns the size of the memory in the region + that passed DAMON operations layer-handled filters of the + scheme in bytes. diff --git a/Documentation/ABI/testing/sysfs-platform-mellanox-pmc b/Documentation/ABI/testing/sysfs-platform-mellanox-pmc new file mode 100644 index 00000000000000..29b3f9c58e0060 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-platform-mellanox-pmc @@ -0,0 +1,64 @@ +HID Driver Description +MLNXBFD0 mlxbf-pmc Performance counters (BlueField-1) +MLNXBFD1 mlxbf-pmc Performance counters (BlueField-2) +MLNXBFD2 mlxbf-pmc Performance counters (BlueField-3) + +What: /sys/bus/platform/devices//hwmon/hwmonX//event_list +Date: Dec 2020 +KernelVersion: 5.10 +Contact: "Shravan Kumar Ramani " +Description: + List of events supported by the counters in the specific block. + It is used to extract the event number or ID associated with + each event. + +What: /sys/bus/platform/devices//hwmon/hwmonX//event +Date: Dec 2020 +KernelVersion: 5.10 +Contact: "Shravan Kumar Ramani " +Description: + Event monitored by corresponding counter. This is used to + program or read back the event that should be or is currently + being monitored by counter. + +What: /sys/bus/platform/devices//hwmon/hwmonX//counter +Date: Dec 2020 +KernelVersion: 5.10 +Contact: "Shravan Kumar Ramani " +Description: + Counter value of the event being monitored. This is used to + read the counter value of the event which was programmed using + event. This is also used to clear or reset the counter value + by writing 0 to the counter sysfs. + +What: /sys/bus/platform/devices//hwmon/hwmonX//enable +Date: Dec 2020 +KernelVersion: 5.10 +Contact: "Shravan Kumar Ramani " +Description: + Start or stop counters. This is used to start the counters + for monitoring the programmed events and also to stop the + counters after the desired duration. Writing value 1 will + start all the counters in the block, and writing 0 will + stop all the counters together. + +What: /sys/bus/platform/devices//hwmon/hwmonX// +Date: Dec 2020 +KernelVersion: 5.10 +Contact: "Shravan Kumar Ramani " +Description: + Value of register. This is used to read or reset the registers + where various performance statistics are counted for each block. + Writing 0 to the sysfs will clear the counter, writing any other + value is not allowed. + +What: /sys/bus/platform/devices//hwmon/hwmonX//count_clock +Date: Mar 2025 +KernelVersion: 6.14 +Contact: "Shravan Kumar Ramani " +Description: + Use a counter for counting cycles. This is used to repurpose/dedicate + any of the counters in the block to counting cycles. Each counter is + represented by a bit (bit 0 for counter0, bit1 for counter1 and so on) + and setting the corresponding bit will reserve that specific counter + for counting cycles and override the event setting. diff --git a/Documentation/ABI/testing/sysfs-platform_profile b/Documentation/ABI/testing/sysfs-platform_profile index baf1d125f9f83a..125324ab53a966 100644 --- a/Documentation/ABI/testing/sysfs-platform_profile +++ b/Documentation/ABI/testing/sysfs-platform_profile @@ -33,3 +33,8 @@ Description: Reading this file gives the current selected profile for this source such as e.g. a hotkey triggered profile change handled either directly by the embedded-controller or fully handled inside the kernel. + + This file may also emit the string 'custom' to indicate + that multiple platform profiles drivers are in use but + have different values. This string can not be written to + this interface and is solely for informational purposes. diff --git a/Documentation/ABI/testing/sysfs-pps-gen b/Documentation/ABI/testing/sysfs-pps-gen new file mode 100644 index 00000000000000..2519207b88fdff --- /dev/null +++ b/Documentation/ABI/testing/sysfs-pps-gen @@ -0,0 +1,43 @@ +What: /sys/class/pps-gen/ +Date: February 2025 +KernelVersion: 6.13 +Contact: Rodolfo Giometti +Description: + The /sys/class/pps-gen/ directory contains files and + directories that provide a unified interface to the PPS + generators. + +What: /sys/class/pps-gen/pps-genX/ +Date: February 2025 +KernelVersion: 6.13 +Contact: Rodolfo Giometti +Description: + The /sys/class/pps-gen/pps-genX/ directory is related to X-th + PPS generator in the system. Each directory contain files to + manage and control its PPS generator. + +What: /sys/class/pps-gen/pps-genX/enable +Date: February 2025 +KernelVersion: 6.13 +Contact: Rodolfo Giometti +Description: + This write-only file enables or disables generation of the + PPS signal. + +What: /sys/class/pps-gen/pps-genX/system +Date: February 2025 +KernelVersion: 6.13 +Contact: Rodolfo Giometti +Description: + This read-only file returns "1" if the generator takes the + timing from the system clock, while it returns "0" if not + (i.e. from a peripheral device clock). + +What: /sys/class/pps-gen/pps-genX/time +Date: February 2025 +KernelVersion: 6.13 +Contact: Rodolfo Giometti +Description: + This read-only file contains the current time stored into the + generator clock as two integers representing the current time + seconds and nanoseconds. diff --git a/Documentation/Makefile b/Documentation/Makefile index fa71602ec96162..52c6c5a3efa99d 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -104,7 +104,7 @@ quiet_cmd_sphinx = SPHINX $@ --> file://$(abspath $(BUILDDIR)/$3/$4) YNL_INDEX:=$(srctree)/Documentation/networking/netlink_spec/index.rst YNL_RST_DIR:=$(srctree)/Documentation/networking/netlink_spec YNL_YAML_DIR:=$(srctree)/Documentation/netlink/specs -YNL_TOOL:=$(srctree)/tools/net/ynl/ynl-gen-rst.py +YNL_TOOL:=$(srctree)/tools/net/ynl/pyynl/ynl_gen_rst.py YNL_RST_FILES_TMP := $(patsubst %.yaml,%.rst,$(wildcard $(YNL_YAML_DIR)/*.yaml)) YNL_RST_FILES := $(patsubst $(YNL_YAML_DIR)%,$(YNL_RST_DIR)%, $(YNL_RST_FILES_TMP)) diff --git a/Documentation/PCI/endpoint/index.rst b/Documentation/PCI/endpoint/index.rst index 4d2333e7ae0671..dd1f62e731c9c3 100644 --- a/Documentation/PCI/endpoint/index.rst +++ b/Documentation/PCI/endpoint/index.rst @@ -15,6 +15,7 @@ PCI Endpoint Framework pci-ntb-howto pci-vntb-function pci-vntb-howto + pci-nvme-function function/binding/pci-test function/binding/pci-ntb diff --git a/Documentation/PCI/endpoint/pci-nvme-function.rst b/Documentation/PCI/endpoint/pci-nvme-function.rst new file mode 100644 index 00000000000000..df57b8e7d066be --- /dev/null +++ b/Documentation/PCI/endpoint/pci-nvme-function.rst @@ -0,0 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +PCI NVMe Function +================= + +:Author: Damien Le Moal + +The PCI NVMe endpoint function implements a PCI NVMe controller using the NVMe +subsystem target core code. The driver for this function resides with the NVMe +subsystem as drivers/nvme/target/nvmet-pciep.c. + +See Documentation/nvme/nvme-pci-endpoint-target.rst for more details. diff --git a/Documentation/PCI/endpoint/pci-test-howto.rst b/Documentation/PCI/endpoint/pci-test-howto.rst index 909f770a07d654..aafc17ef3fd31f 100644 --- a/Documentation/PCI/endpoint/pci-test-howto.rst +++ b/Documentation/PCI/endpoint/pci-test-howto.rst @@ -81,8 +81,8 @@ device, the following commands can be used:: # echo 0x104c > functions/pci_epf_test/func1/vendorid # echo 0xb500 > functions/pci_epf_test/func1/deviceid - # echo 16 > functions/pci_epf_test/func1/msi_interrupts - # echo 8 > functions/pci_epf_test/func1/msix_interrupts + # echo 32 > functions/pci_epf_test/func1/msi_interrupts + # echo 2048 > functions/pci_epf_test/func1/msix_interrupts Binding pci-epf-test Device to EP Controller @@ -123,113 +123,83 @@ above:: Using Endpoint Test function Device ----------------------------------- -pcitest.sh added in tools/pci/ can be used to run all the default PCI endpoint -tests. To compile this tool the following commands should be used:: +Kselftest added in tools/testing/selftests/pci_endpoint can be used to run all +the default PCI endpoint tests. To build the Kselftest for PCI endpoint +subsystem, the following commands should be used:: # cd - # make -C tools/pci + # make -C tools/testing/selftests/pci_endpoint or if you desire to compile and install in your system:: # cd - # make -C tools/pci install + # make -C tools/testing/selftests/pci_endpoint INSTALL_PATH=/usr/bin install -The tool and script will be located in /usr/bin/ +The test will be located in /usr/bin/ - -pcitest.sh Output -~~~~~~~~~~~~~~~~~ +Kselftest Output +~~~~~~~~~~~~~~~~ :: - # pcitest.sh - BAR tests - - BAR0: OKAY - BAR1: OKAY - BAR2: OKAY - BAR3: OKAY - BAR4: NOT OKAY - BAR5: NOT OKAY - - Interrupt tests - - SET IRQ TYPE TO LEGACY: OKAY - LEGACY IRQ: NOT OKAY - SET IRQ TYPE TO MSI: OKAY - MSI1: OKAY - MSI2: OKAY - MSI3: OKAY - MSI4: OKAY - MSI5: OKAY - MSI6: OKAY - MSI7: OKAY - MSI8: OKAY - MSI9: OKAY - MSI10: OKAY - MSI11: OKAY - MSI12: OKAY - MSI13: OKAY - MSI14: OKAY - MSI15: OKAY - MSI16: OKAY - MSI17: NOT OKAY - MSI18: NOT OKAY - MSI19: NOT OKAY - MSI20: NOT OKAY - MSI21: NOT OKAY - MSI22: NOT OKAY - MSI23: NOT OKAY - MSI24: NOT OKAY - MSI25: NOT OKAY - MSI26: NOT OKAY - MSI27: NOT OKAY - MSI28: NOT OKAY - MSI29: NOT OKAY - MSI30: NOT OKAY - MSI31: NOT OKAY - MSI32: NOT OKAY - SET IRQ TYPE TO MSI-X: OKAY - MSI-X1: OKAY - MSI-X2: OKAY - MSI-X3: OKAY - MSI-X4: OKAY - MSI-X5: OKAY - MSI-X6: OKAY - MSI-X7: OKAY - MSI-X8: OKAY - MSI-X9: NOT OKAY - MSI-X10: NOT OKAY - MSI-X11: NOT OKAY - MSI-X12: NOT OKAY - MSI-X13: NOT OKAY - MSI-X14: NOT OKAY - MSI-X15: NOT OKAY - MSI-X16: NOT OKAY - [...] - MSI-X2047: NOT OKAY - MSI-X2048: NOT OKAY - - Read Tests - - SET IRQ TYPE TO MSI: OKAY - READ ( 1 bytes): OKAY - READ ( 1024 bytes): OKAY - READ ( 1025 bytes): OKAY - READ (1024000 bytes): OKAY - READ (1024001 bytes): OKAY - - Write Tests - - WRITE ( 1 bytes): OKAY - WRITE ( 1024 bytes): OKAY - WRITE ( 1025 bytes): OKAY - WRITE (1024000 bytes): OKAY - WRITE (1024001 bytes): OKAY - - Copy Tests - - COPY ( 1 bytes): OKAY - COPY ( 1024 bytes): OKAY - COPY ( 1025 bytes): OKAY - COPY (1024000 bytes): OKAY - COPY (1024001 bytes): OKAY + # pci_endpoint_test + TAP version 13 + 1..16 + # Starting 16 tests from 9 test cases. + # RUN pci_ep_bar.BAR0.BAR_TEST ... + # OK pci_ep_bar.BAR0.BAR_TEST + ok 1 pci_ep_bar.BAR0.BAR_TEST + # RUN pci_ep_bar.BAR1.BAR_TEST ... + # OK pci_ep_bar.BAR1.BAR_TEST + ok 2 pci_ep_bar.BAR1.BAR_TEST + # RUN pci_ep_bar.BAR2.BAR_TEST ... + # OK pci_ep_bar.BAR2.BAR_TEST + ok 3 pci_ep_bar.BAR2.BAR_TEST + # RUN pci_ep_bar.BAR3.BAR_TEST ... + # OK pci_ep_bar.BAR3.BAR_TEST + ok 4 pci_ep_bar.BAR3.BAR_TEST + # RUN pci_ep_bar.BAR4.BAR_TEST ... + # OK pci_ep_bar.BAR4.BAR_TEST + ok 5 pci_ep_bar.BAR4.BAR_TEST + # RUN pci_ep_bar.BAR5.BAR_TEST ... + # OK pci_ep_bar.BAR5.BAR_TEST + ok 6 pci_ep_bar.BAR5.BAR_TEST + # RUN pci_ep_basic.CONSECUTIVE_BAR_TEST ... + # OK pci_ep_basic.CONSECUTIVE_BAR_TEST + ok 7 pci_ep_basic.CONSECUTIVE_BAR_TEST + # RUN pci_ep_basic.LEGACY_IRQ_TEST ... + # OK pci_ep_basic.LEGACY_IRQ_TEST + ok 8 pci_ep_basic.LEGACY_IRQ_TEST + # RUN pci_ep_basic.MSI_TEST ... + # OK pci_ep_basic.MSI_TEST + ok 9 pci_ep_basic.MSI_TEST + # RUN pci_ep_basic.MSIX_TEST ... + # OK pci_ep_basic.MSIX_TEST + ok 10 pci_ep_basic.MSIX_TEST + # RUN pci_ep_data_transfer.memcpy.READ_TEST ... + # OK pci_ep_data_transfer.memcpy.READ_TEST + ok 11 pci_ep_data_transfer.memcpy.READ_TEST + # RUN pci_ep_data_transfer.memcpy.WRITE_TEST ... + # OK pci_ep_data_transfer.memcpy.WRITE_TEST + ok 12 pci_ep_data_transfer.memcpy.WRITE_TEST + # RUN pci_ep_data_transfer.memcpy.COPY_TEST ... + # OK pci_ep_data_transfer.memcpy.COPY_TEST + ok 13 pci_ep_data_transfer.memcpy.COPY_TEST + # RUN pci_ep_data_transfer.dma.READ_TEST ... + # OK pci_ep_data_transfer.dma.READ_TEST + ok 14 pci_ep_data_transfer.dma.READ_TEST + # RUN pci_ep_data_transfer.dma.WRITE_TEST ... + # OK pci_ep_data_transfer.dma.WRITE_TEST + ok 15 pci_ep_data_transfer.dma.WRITE_TEST + # RUN pci_ep_data_transfer.dma.COPY_TEST ... + # OK pci_ep_data_transfer.dma.COPY_TEST + ok 16 pci_ep_data_transfer.dma.COPY_TEST + # PASSED: 16 / 16 tests passed. + # Totals: pass:16 fail:0 xfail:0 xpass:0 skip:0 error:0 + + +Testcase 16 (pci_ep_data_transfer.dma.COPY_TEST) will fail for most of the DMA +capable endpoint controllers due to the absence of the MEMCPY over DMA. For such +controllers, it is advisable to skip this testcase using this +command:: + + # pci_endpoint_test -f pci_ep_bar -f pci_ep_basic -v memcpy -T COPY_TEST -v dma diff --git a/Documentation/accel/amdxdna/amdnpu.rst b/Documentation/accel/amdxdna/amdnpu.rst new file mode 100644 index 00000000000000..fbe0a75853452f --- /dev/null +++ b/Documentation/accel/amdxdna/amdnpu.rst @@ -0,0 +1,281 @@ +.. SPDX-License-Identifier: GPL-2.0-only + +.. include:: + +========= + AMD NPU +========= + +:Copyright: |copy| 2024 Advanced Micro Devices, Inc. +:Author: Sonal Santan + +Overview +======== + +AMD NPU (Neural Processing Unit) is a multi-user AI inference accelerator +integrated into AMD client APU. NPU enables efficient execution of Machine +Learning applications like CNN, LLM, etc. NPU is based on +`AMD XDNA Architecture`_. NPU is managed by **amdxdna** driver. + + +Hardware Description +==================== + +AMD NPU consists of the following hardware components: + +AMD XDNA Array +-------------- + +AMD XDNA Array comprises of 2D array of compute and memory tiles built with +`AMD AI Engine Technology`_. Each column has 4 rows of compute tiles and 1 +row of memory tile. Each compute tile contains a VLIW processor with its own +dedicated program and data memory. The memory tile acts as L2 memory. The 2D +array can be partitioned at a column boundary creating a spatially isolated +partition which can be bound to a workload context. + +Each column also has dedicated DMA engines to move data between host DDR and +memory tile. + +AMD Phoenix and AMD Hawk Point client NPU have a 4x5 topology, i.e., 4 rows of +compute tiles arranged into 5 columns. AMD Strix Point client APU have 4x8 +topology, i.e., 4 rows of compute tiles arranged into 8 columns. + +Shared L2 Memory +---------------- + +The single row of memory tiles create a pool of software managed on chip L2 +memory. DMA engines are used to move data between host DDR and memory tiles. +AMD Phoenix and AMD Hawk Point NPUs have a total of 2560 KB of L2 memory. +AMD Strix Point NPU has a total of 4096 KB of L2 memory. + +Microcontroller +--------------- + +A microcontroller runs NPU Firmware which is responsible for command processing, +XDNA Array partition setup, XDNA Array configuration, workload context +management and workload orchestration. + +NPU Firmware uses a dedicated instance of an isolated non-privileged context +called ERT to service each workload context. ERT is also used to execute user +provided ``ctrlcode`` associated with the workload context. + +NPU Firmware uses a single isolated privileged context called MERT to service +management commands from the amdxdna driver. + +Mailboxes +--------- + +The microcontroller and amdxdna driver use a privileged channel for management +tasks like setting up of contexts, telemetry, query, error handling, setting up +user channel, etc. As mentioned before, privileged channel requests are +serviced by MERT. The privileged channel is bound to a single mailbox. + +The microcontroller and amdxdna driver use a dedicated user channel per +workload context. The user channel is primarily used for submitting work to +the NPU. As mentioned before, a user channel requests are serviced by an +instance of ERT. Each user channel is bound to its own dedicated mailbox. + +PCIe EP +------- + +NPU is visible to the x86 host CPU as a PCIe device with multiple BARs and some +MSI-X interrupt vectors. NPU uses a dedicated high bandwidth SoC level fabric +for reading or writing into host memory. Each instance of ERT gets its own +dedicated MSI-X interrupt. MERT gets a single instance of MSI-X interrupt. + +The number of PCIe BARs varies depending on the specific device. Based on their +functions, PCIe BARs can generally be categorized into the following types. + +* PSP BAR: Expose the AMD PSP (Platform Security Processor) function +* SMU BAR: Expose the AMD SMU (System Management Unit) function +* SRAM BAR: Expose ring buffers for the mailbox +* Mailbox BAR: Expose the mailbox control registers (head, tail and ISR + registers etc.) +* Public Register BAR: Expose public registers + +On specific devices, the above-mentioned BAR type might be combined into a +single physical PCIe BAR. Or a module might require two physical PCIe BARs to +be fully functional. For example, + +* On AMD Phoenix device, PSP, SMU, Public Register BARs are on PCIe BAR index 0. +* On AMD Strix Point device, Mailbox and Public Register BARs are on PCIe BAR + index 0. The PSP has some registers in PCIe BAR index 0 (Public Register BAR) + and PCIe BAR index 4 (PSP BAR). + +Process Isolation Hardware +-------------------------- + +As explained before, XDNA Array can be dynamically divided into isolated +spatial partitions, each of which may have one or more columns. The spatial +partition is setup by programming the column isolation registers by the +microcontroller. Each spatial partition is associated with a PASID which is +also programmed by the microcontroller. Hence multiple spatial partitions in +the NPU can make concurrent host access protected by PASID. + +The NPU FW itself uses microcontroller MMU enforced isolated contexts for +servicing user and privileged channel requests. + + +Mixed Spatial and Temporal Scheduling +===================================== + +AMD XDNA architecture supports mixed spatial and temporal (time sharing) +scheduling of 2D array. This means that spatial partitions may be setup and +torn down dynamically to accommodate various workloads. A *spatial* partition +may be *exclusively* bound to one workload context while another partition may +be *temporarily* bound to more than one workload contexts. The microcontroller +updates the PASID for a temporarily shared partition to match the context that +has been bound to the partition at any moment. + +Resource Solver +--------------- + +The Resource Solver component of the amdxdna driver manages the allocation +of 2D array among various workloads. Every workload describes the number +of columns required to run the NPU binary in its metadata. The Resource Solver +component uses hints passed by the workload and its own heuristics to +decide 2D array (re)partition strategy and mapping of workloads for spatial and +temporal sharing of columns. The FW enforces the context-to-column(s) resource +binding decisions made by the Resource Solver. + +AMD Phoenix and AMD Hawk Point client NPU can support 6 concurrent workload +contexts. AMD Strix Point can support 16 concurrent workload contexts. + + +Application Binaries +==================== + +A NPU application workload is comprised of two separate binaries which are +generated by the NPU compiler. + +1. AMD XDNA Array overlay, which is used to configure a NPU spatial partition. + The overlay contains instructions for setting up the stream switch + configuration and ELF for the compute tiles. The overlay is loaded on the + spatial partition bound to the workload by the associated ERT instance. + Refer to the + `Versal Adaptive SoC AIE-ML Architecture Manual (AM020)`_ for more details. + +2. ``ctrlcode``, used for orchestrating the overlay loaded on the spatial + partition. ``ctrlcode`` is executed by the ERT running in protected mode on + the microcontroller in the context of the workload. ``ctrlcode`` is made up + of a sequence of opcodes named ``XAie_TxnOpcode``. Refer to the + `AI Engine Run Time`_ for more details. + + +Special Host Buffers +==================== + +Per-context Instruction Buffer +------------------------------ + +Every workload context uses a host resident 64 MB buffer which is memory +mapped into the ERT instance created to service the workload. The ``ctrlcode`` +used by the workload is copied into this special memory. This buffer is +protected by PASID like all other input/output buffers used by that workload. +Instruction buffer is also mapped into the user space of the workload. + +Global Privileged Buffer +------------------------ + +In addition, the driver also allocates a single buffer for maintenance tasks +like recording errors from MERT. This global buffer uses the global IOMMU +domain and is only accessible by MERT. + + +High-level Use Flow +=================== + +Here are the steps to run a workload on AMD NPU: + +1. Compile the workload into an overlay and a ``ctrlcode`` binary. +2. Userspace opens a context in the driver and provides the overlay. +3. The driver checks with the Resource Solver for provisioning a set of columns + for the workload. +4. The driver then asks MERT to create a context on the device with the desired + columns. +5. MERT then creates an instance of ERT. MERT also maps the Instruction Buffer + into ERT memory. +6. The userspace then copies the ``ctrlcode`` to the Instruction Buffer. +7. Userspace then creates a command buffer with pointers to input, output, and + instruction buffer; it then submits command buffer with the driver and goes + to sleep waiting for completion. +8. The driver sends the command over the Mailbox to ERT. +9. ERT *executes* the ``ctrlcode`` in the instruction buffer. +10. Execution of the ``ctrlcode`` kicks off DMAs to and from the host DDR while + AMD XDNA Array is running. +11. When ERT reaches end of ``ctrlcode``, it raises an MSI-X to send completion + signal to the driver which then wakes up the waiting workload. + + +Boot Flow +========= + +amdxdna driver uses PSP to securely load signed NPU FW and kick off the boot +of the NPU microcontroller. amdxdna driver then waits for the alive signal in +a special location on BAR 0. The NPU is switched off during SoC suspend and +turned on after resume where the NPU FW is reloaded, and the handshake is +performed again. + + +Userspace components +==================== + +Compiler +-------- + +Peano is an LLVM based open-source compiler for AMD XDNA Array compute tile +available at: +https://github.com/Xilinx/llvm-aie + +The open-source IREE compiler supports graph compilation of ML models for AMD +NPU and uses Peano underneath. It is available at: +https://github.com/nod-ai/iree-amd-aie + +Usermode Driver (UMD) +--------------------- + +The open-source XRT runtime stack interfaces with amdxdna kernel driver. XRT +can be found at: +https://github.com/Xilinx/XRT + +The open-source XRT shim for NPU is can be found at: +https://github.com/amd/xdna-driver + + +DMA Operation +============= + +DMA operation instructions are encoded in the ``ctrlcode`` as +``XAIE_IO_BLOCKWRITE`` opcode. When ERT executes ``XAIE_IO_BLOCKWRITE``, DMA +operations between host DDR and L2 memory are effected. + + +Error Handling +============== + +When MERT detects an error in AMD XDNA Array, it pauses execution for that +workload context and sends an asynchronous message to the driver over the +privileged channel. The driver then sends a buffer pointer to MERT to capture +the register states for the partition bound to faulting workload context. The +driver then decodes the error by reading the contents of the buffer pointer. + + +Telemetry +========= + +MERT can report various kinds of telemetry information like the following: + +* L1 interrupt counter +* DMA counter +* Deep Sleep counter +* etc. + + +References +========== + +- `AMD XDNA Architecture `_ +- `AMD AI Engine Technology `_ +- `Peano `_ +- `Versal Adaptive SoC AIE-ML Architecture Manual (AM020) `_ +- `AI Engine Run Time `_ diff --git a/Documentation/accel/amdxdna/index.rst b/Documentation/accel/amdxdna/index.rst new file mode 100644 index 00000000000000..38c16939f1fce2 --- /dev/null +++ b/Documentation/accel/amdxdna/index.rst @@ -0,0 +1,11 @@ +.. SPDX-License-Identifier: GPL-2.0-only + +===================================== + accel/amdxdna NPU driver +===================================== + +The accel/amdxdna driver supports the AMD NPU (Neural Processing Unit). + +.. toctree:: + + amdnpu diff --git a/Documentation/accel/index.rst b/Documentation/accel/index.rst index e94a0160b6a04b..bc85f26533d888 100644 --- a/Documentation/accel/index.rst +++ b/Documentation/accel/index.rst @@ -8,6 +8,7 @@ Compute Accelerators :maxdepth: 1 introduction + amdxdna/index qaic/index .. only:: subproject and html diff --git a/Documentation/accounting/delay-accounting.rst b/Documentation/accounting/delay-accounting.rst index f61c01fc376e76..210c194d4a7b53 100644 --- a/Documentation/accounting/delay-accounting.rst +++ b/Documentation/accounting/delay-accounting.rst @@ -100,29 +100,29 @@ Get delays, since system boot, for pid 10:: # ./getdelays -d -p 10 (output similar to next case) -Get sum of delays, since system boot, for all pids with tgid 5:: +Get sum and peak of delays, since system boot, for all pids with tgid 242:: - # ./getdelays -d -t 5 + bash-4.4# ./getdelays -d -t 242 print delayacct stats ON - TGID 5 - - - CPU count real total virtual total delay total delay average - 8 7000000 6872122 3382277 0.423ms - IO count delay total delay average - 0 0 0.000ms - SWAP count delay total delay average - 0 0 0.000ms - RECLAIM count delay total delay average - 0 0 0.000ms - THRASHING count delay total delay average - 0 0 0.000ms - COMPACT count delay total delay average - 0 0 0.000ms - WPCOPY count delay total delay average - 0 0 0.000ms - IRQ count delay total delay average - 0 0 0.000ms + TGID 242 + + + CPU count real total virtual total delay total delay average delay max delay min + 39 156000000 156576579 2111069 0.054ms 0.212296ms 0.031307ms + IO count delay total delay average delay max delay min + 0 0 0.000ms 0.000000ms 0.000000ms + SWAP count delay total delay average delay max delay min + 0 0 0.000ms 0.000000ms 0.000000ms + RECLAIM count delay total delay average delay max delay min + 0 0 0.000ms 0.000000ms 0.000000ms + THRASHING count delay total delay average delay max delay min + 0 0 0.000ms 0.000000ms 0.000000ms + COMPACT count delay total delay average delay max delay min + 0 0 0.000ms 0.000000ms 0.000000ms + WPCOPY count delay total delay average delay max delay min + 156 11215873 0.072ms 0.207403ms 0.033913ms + IRQ count delay total delay average delay max delay min + 0 0 0.000ms 0.000000ms 0.000000ms Get IO accounting for pid 1, it works only with -p:: diff --git a/Documentation/accounting/taskstats-struct.rst b/Documentation/accounting/taskstats-struct.rst index ca90fd489c9a3c..acca51c34157b1 100644 --- a/Documentation/accounting/taskstats-struct.rst +++ b/Documentation/accounting/taskstats-struct.rst @@ -47,7 +47,7 @@ should not change the relative position of each field within the struct. 1) Common and basic accounting fields:: /* The version number of this struct. This field is always set to - * TAKSTATS_VERSION, which is defined in . + * TASKSTATS_VERSION, which is defined in . * Each time the struct is changed, the value should be incremented. */ __u16 version; diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst index f2bebff6a733f6..eb945266890913 100644 --- a/Documentation/admin-guide/README.rst +++ b/Documentation/admin-guide/README.rst @@ -356,5 +356,5 @@ instructions at 'Documentation/admin-guide/reporting-issues.rst'. Hints on understanding kernel bug reports are in 'Documentation/admin-guide/bug-hunting.rst'. More on debugging the kernel -with gdb is in 'Documentation/dev-tools/gdb-kernel-debugging.rst' and -'Documentation/dev-tools/kgdb.rst'. +with gdb is in 'Documentation/process/debugging/gdb-kernel-debugging.rst' and +'Documentation/process/debugging/kgdb.rst'. diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 714a5171bfc0b8..1576fb93f06c4a 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -121,14 +121,14 @@ compression algorithm to use external pre-trained dictionary, pass full path to the `dict` along with other parameters:: #pass path to pre-trained zstd dictionary - echo "algo=zstd dict=/etc/dictioary" > /sys/block/zram0/algorithm_params + echo "algo=zstd dict=/etc/dictionary" > /sys/block/zram0/algorithm_params #same, but using algorithm priority - echo "priority=1 dict=/etc/dictioary" > \ + echo "priority=1 dict=/etc/dictionary" > \ /sys/block/zram0/algorithm_params #pass path to pre-trained zstd dictionary and compression level - echo "algo=zstd level=8 dict=/etc/dictioary" > \ + echo "algo=zstd level=8 dict=/etc/dictionary" > \ /sys/block/zram0/algorithm_params Parameters are algorithm specific: not all algorithms support pre-trained diff --git a/Documentation/admin-guide/braille-console.rst b/Documentation/admin-guide/braille-console.rst index 18e79337dcfd78..153472e93cae27 100644 --- a/Documentation/admin-guide/braille-console.rst +++ b/Documentation/admin-guide/braille-console.rst @@ -21,8 +21,8 @@ override the baud rate to 115200, etc. By default, the braille device will just show the last kernel message (console mode). To review previous messages, press the Insert key to switch to the VT review mode. In review mode, the arrow keys permit to browse in the VT content, -:kbd:`PAGE-UP`/:kbd:`PAGE-DOWN` keys go at the top/bottom of the screen, and -the :kbd:`HOME` key goes back +`PAGE-UP`/`PAGE-DOWN` keys go at the top/bottom of the screen, and +the `HOME` key goes back to the cursor, hence providing very basic screen reviewing facility. Sound feedback can be obtained by adding the ``braille_console.sound=1`` kernel diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst index 1d0f8ceb30754e..ce6f4e8ca48761 100644 --- a/Documentation/admin-guide/bug-hunting.rst +++ b/Documentation/admin-guide/bug-hunting.rst @@ -368,12 +368,3 @@ processed by ``klogd``:: Aug 29 09:51:01 blizard kernel: Call Trace: [oops:_oops_ioctl+48/80] [_sys_ioctl+254/272] [_system_call+82/128] Aug 29 09:51:01 blizard kernel: Code: c7 00 05 00 00 00 eb 08 90 90 90 90 90 90 90 90 89 ec 5d c3 ---------------------------------------------------------------------------- - -:: - - Dr. G.W. Wettstein Oncology Research Div. Computing Facility - Roger Maris Cancer Center INTERNET: greg@wind.rmcc.com - 820 4th St. N. - Fargo, ND 58122 - Phone: 701-234-7556 diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 315ede811c9d0d..cb1b4e759b7e26 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -64,13 +64,14 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst . There are also arch-specific kernel-parameters not documented here. -See for example . Note that ALL kernel parameters listed below are CASE SENSITIVE, and that a trailing = on the name of any parameter states that that parameter will diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 3872bc6ec49d63..fb8752b42ec858 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -21,6 +21,10 @@ strictly ACPI specification compliant. rsdt -- prefer RSDT over (default) XSDT copy_dsdt -- copy DSDT to memory + nocmcff -- Disable firmware first mode for corrected + errors. This disables parsing the HEST CMC error + source to check if firmware has set the FF flag. This + may result in duplicate corrected error reports. nospcr -- disable console in ACPI SPCR table as default _serial_ console on ARM64 For ARM64, ONLY "acpi=off", "acpi=on", "acpi=force" or @@ -405,6 +409,8 @@ not play well with APC CPU idle - disable it if you have APC and your system crashes randomly. + apic [APIC,X86-64] Use IO-APIC. Default. + apic= [APIC,X86,EARLY] Advanced Programmable Interrupt Controller Change the output verbosity while booting Format: { quiet (default) | verbose | debug } @@ -424,6 +430,10 @@ useful so that a dump capture kernel won't be shot down by NMI + apicpmtimer Do APIC timer calibration using the pmtimer. Implies + apicmaintimer. Useful when your PIT timer is totally + broken. + autoconf= [IPV6] See Documentation/networking/ipv6.rst. @@ -1726,6 +1736,8 @@ off: Disable GDS mitigation. + gbpages [X86] Use GB pages for kernel direct mappings. + gcov_persist= [GCOV] When non-zero (default), profiling data for kernel modules is saved and remains accessible via debugfs, even when the module is unloaded/reloaded. @@ -2008,12 +2020,21 @@ idle= [X86,EARLY] Format: idle=poll, idle=halt, idle=nomwait - Poll forces a polling idle loop that can slightly - improve the performance of waking up a idle CPU, but - will use a lot of power and make the system run hot. - Not recommended. + + idle=poll: Don't do power saving in the idle loop + using HLT, but poll for rescheduling event. This will + make the CPUs eat a lot more power, but may be useful + to get slightly better performance in multiprocessor + benchmarks. It also makes some profiling using + performance counters more accurate. Please note that + on systems with MONITOR/MWAIT support (like Intel + EM64T CPUs) this option has no performance advantage + over the normal idle loop. It may also interact badly + with hyperthreading. + idle=halt: Halt is forced to be used for CPU idle. In such case C2/C3 won't be used again. + idle=nomwait: Disable mwait for CPU C-states idxd.sva= [HW] @@ -2311,20 +2332,73 @@ relaxed iommu= [X86,EARLY] + off + Don't initialize and use any kind of IOMMU. + force + Force the use of the hardware IOMMU even when + it is not actually needed (e.g. because < 3 GB + memory). + noforce + Don't force hardware IOMMU usage when it is not + needed. (default). + biomerge panic nopanic merge nomerge + soft - pt [X86] - nopt [X86] - nobypass [PPC/POWERNV] + Use software bounce buffering (SWIOTLB) (default for + Intel machines). This can be used to prevent the usage + of an available hardware IOMMU. + + [X86] + pt + [X86] + nopt + [PPC/POWERNV] + nobypass Disable IOMMU bypass, using IOMMU for PCI devices. + [X86] + AMD Gart HW IOMMU-specific options: + + + Set the size of the remapping area in bytes. + + allowed + Overwrite iommu off workarounds for specific chipsets + + fullflush + Flush IOMMU on each allocation (default). + + nofullflush + Don't use IOMMU fullflush. + + memaper[=] + Allocate an own aperture over RAM with size + 32MB<1 and 1=>0 transitions of the number of VMs. - Enabling virtualization at module lode avoids potential + Enabling virtualization at module load avoids potential latency for creation of the 0=>1 VM, as KVM serializes virtualization enabling across all online CPUs. The "cost" of enabling virtualization when KVM is loaded, @@ -2748,17 +2824,21 @@ nvhe: Standard nVHE-based mode, without support for protected guests. - protected: nVHE-based mode with support for guests whose - state is kept private from the host. + protected: Mode with support for guests whose state is + kept private from the host, using VHE or + nVHE depending on HW support. nested: VHE-based mode with support for nested - virtualization. Requires at least ARMv8.3 - hardware. + virtualization. Requires at least ARMv8.4 + hardware (with FEAT_NV2). Defaults to VHE/nVHE based on hardware support. Setting mode to "protected" will disable kexec and hibernation - for the host. "nested" is experimental and should be - used with extreme caution. + for the host. To force nVHE on VHE hardware, add + "arm64_sw.hvhe=0 id_aa64mmfr1.vh=0" to the + command-line. + "nested" is experimental and should be used with + extreme caution. kvm-arm.vgic_v3_group0_trap= [KVM,ARM,EARLY] Trap guest accesses to GICv3 group-0 @@ -3259,9 +3339,77 @@ devices can be requested on-demand with the /dev/loop-control interface. - mce [X86-32] Machine Check Exception + mce= [X86-{32,64}] + + Please see Documentation/arch/x86/x86_64/machinecheck.rst for sysfs runtime tunables. + + off + disable machine check + + no_cmci + disable CMCI(Corrected Machine Check Interrupt) that + Intel processor supports. Usually this disablement is + not recommended, but it might be handy if your + hardware is misbehaving. + + Note that you'll get more problems without CMCI than + with due to the shared banks, i.e. you might get + duplicated error logs. + + dont_log_ce + don't make logs for corrected errors. All events + reported as corrected are silently cleared by OS. This + option will be useful if you have no interest in any + of corrected errors. + + ignore_ce + disable features for corrected errors, e.g. + polling timer and CMCI. All events reported as + corrected are not cleared by OS and remained in its + error banks. + + Usually this disablement is not recommended, however + if there is an agent checking/clearing corrected + errors (e.g. BIOS or hardware monitoring + applications), conflicting with OS's error handling, + and you cannot deactivate the agent, then this option + will be a help. + + no_lmce + do not opt-in to Local MCE delivery. Use legacy method + to broadcast MCEs. + + bootlog + enable logging of machine checks left over from + booting. Disabled by default on AMD Fam10h and older + because some BIOS leave bogus ones. + + If your BIOS doesn't do that it's a good idea to + enable though to make sure you log even machine check + events that result in a reboot. On Intel systems it is + enabled by default. + + nobootlog + disable boot machine check logging. + + monarchtimeout (number) + sets the time in us to wait for other CPUs on machine + checks. 0 to disable. + + bios_cmci_threshold + don't overwrite the bios-set CMCI threshold. This boot + option prevents Linux from overwriting the CMCI + threshold set by the bios. Without this option, Linux + always sets the CMCI threshold to 1. Enabling this may + make memory predictive failure analysis less effective + if the bios sets thresholds for memory errors since we + will not see details for all errors. + + recovery + force-enable recoverable machine check code paths + + Everything else is in sysfs now. - mce=option [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst md= [HW] RAID subsystems devices and level See Documentation/admin-guide/md.rst. @@ -3351,8 +3499,8 @@ [KNL] Set the initial state for the memory hotplug onlining policy. If not specified, the default value is set according to the - CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config - option. + CONFIG_MHP_DEFAULT_ONLINE_TYPE kernel config + options. See Documentation/admin-guide/mm/memory-hotplug.rst. memmap=exactmap [KNL,X86,EARLY] Enable setting of an exact @@ -3887,6 +4035,8 @@ noapic [SMP,APIC,EARLY] Tells the kernel to not make use of any IOAPICs that may be present in the system. + noapictimer [APIC,X86] Don't set up the APIC timer + noautogroup Disable scheduler automatic task group creation. nocache [ARM,EARLY] @@ -3934,6 +4084,8 @@ register save and restore. The kernel will only save legacy floating-point registers on task switch. + nogbpages [X86] Do not use GB pages for kernel direct mappings. + no_hash_pointers [KNL,EARLY] Force pointers printed to the console or buffers to be @@ -3960,6 +4112,8 @@ the impact of the sleep instructions. This is also useful when using JTAG debugger. + nohpet [X86] Don't use the HPET timer. + nohugeiomap [KNL,X86,PPC,ARM64,EARLY] Disable kernel huge I/O mappings. nohugevmalloc [KNL,X86,PPC,ARM64,EARLY] Disable kernel huge vmalloc mappings. @@ -4111,8 +4265,10 @@ nosync [HW,M68K] Disables sync negotiation for all devices. - no_timer_check [X86,APIC] Disables the code which tests for - broken timer IRQ sources. + no_timer_check [X86,APIC] Disables the code which tests for broken + timer IRQ sources, i.e., the IO-APIC timer. This can + work around problems with incorrect timer + initialization on some boards. no_uaccess_flush [PPC,EARLY] Don't flush the L1-D cache after accessing user data. @@ -4192,6 +4348,11 @@ If given as an integer followed by 'U', it will divide each physical node into N emulated nodes. + numa=noacpi [X86] Don't parse the SRAT table for NUMA setup + + numa=nohmat [X86] Don't parse the HMAT table for NUMA setup, or + soft-reserved memory partitioning. + numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic NUMA balancing. Allowed values are enable and disable @@ -4673,7 +4834,7 @@ '1' – force enabled 'x' – unchanged For example, - pci=config_acs=10x + pci=config_acs=10x@pci:0:0 would configure all devices that support ACS to enable P2P Request Redirect, disable Translation Blocking, and leave Source @@ -5367,7 +5528,42 @@ rcutorture.gp_cond= [KNL] Use conditional/asynchronous update-side - primitives, if available. + normal-grace-period primitives, if available. + + rcutorture.gp_cond_exp= [KNL] + Use conditional/asynchronous update-side + expedited-grace-period primitives, if available. + + rcutorture.gp_cond_full= [KNL] + Use conditional/asynchronous update-side + normal-grace-period primitives that also take + concurrent expedited grace periods into account, + if available. + + rcutorture.gp_cond_exp_full= [KNL] + Use conditional/asynchronous update-side + expedited-grace-period primitives that also take + concurrent normal grace periods into account, + if available. + + rcutorture.gp_cond_wi= [KNL] + Nominal wait interval for normal conditional + grace periods (specified by rcutorture's + gp_cond and gp_cond_full module parameters), + in microseconds. The actual wait interval will + be randomly selected to nanosecond granularity up + to this wait interval. Defaults to 16 jiffies, + for example, 16,000 microseconds on a system + with HZ=1000. + + rcutorture.gp_cond_wi_exp= [KNL] + Nominal wait interval for expedited conditional + grace periods (specified by rcutorture's + gp_cond_exp and gp_cond_exp_full module + parameters), in microseconds. The actual wait + interval will be randomly selected to nanosecond + granularity up to this wait interval. Defaults to + 128 microseconds. rcutorture.gp_exp= [KNL] Use expedited update-side primitives, if available. @@ -5376,6 +5572,43 @@ Use normal (non-expedited) asynchronous update-side primitives, if available. + rcutorture.gp_poll= [KNL] + Use polled update-side normal-grace-period + primitives, if available. + + rcutorture.gp_poll_exp= [KNL] + Use polled update-side expedited-grace-period + primitives, if available. + + rcutorture.gp_poll_full= [KNL] + Use polled update-side normal-grace-period + primitives that also take concurrent expedited + grace periods into account, if available. + + rcutorture.gp_poll_exp_full= [KNL] + Use polled update-side expedited-grace-period + primitives that also take concurrent normal + grace periods into account, if available. + + rcutorture.gp_poll_wi= [KNL] + Nominal wait interval for normal conditional + grace periods (specified by rcutorture's + gp_poll and gp_poll_full module parameters), + in microseconds. The actual wait interval will + be randomly selected to nanosecond granularity up + to this wait interval. Defaults to 16 jiffies, + for example, 16,000 microseconds on a system + with HZ=1000. + + rcutorture.gp_poll_wi_exp= [KNL] + Nominal wait interval for expedited conditional + grace periods (specified by rcutorture's + gp_poll_exp and gp_poll_exp_full module + parameters), in microseconds. The actual wait + interval will be randomly selected to nanosecond + granularity up to this wait interval. Defaults to + 128 microseconds. + rcutorture.gp_sync= [KNL] Use normal (non-expedited) synchronous update-side primitives, if available. If all @@ -5429,6 +5662,22 @@ Set time (jiffies) between CPU-hotplug operations, or zero to disable CPU-hotplug testing. + rcutorture.preempt_duration= [KNL] + Set duration (in milliseconds) of preemptions + by a high-priority FIFO real-time task. Set to + zero (the default) to disable. The CPUs to + preempt are selected randomly from the set that + are online at a given point in time. Races with + CPUs going offline are ignored, with that attempt + at preemption skipped. + + rcutorture.preempt_interval= [KNL] + Set interval (in milliseconds, defaulting to one + second) between preemptions by a high-priority + FIFO real-time task. This delay is mediated + by an hrtimer and is further fuzzed to avoid + inadvertent synchronizations. + rcutorture.read_exit_burst= [KNL] The number of times in a given read-then-exit episode that a set of read-then-exit kthreads @@ -5715,6 +5964,55 @@ reboot_cpu is s[mp]#### with #### being the processor to be used for rebooting. + acpi + Use the ACPI RESET_REG in the FADT. If ACPI is not + configured or the ACPI reset does not work, the reboot + path attempts the reset using the keyboard controller. + + bios + Use the CPU reboot vector for warm reset + + cold + Set the cold reboot flag + + default + There are some built-in platform specific "quirks" + - you may see: "reboot: series board detected. + Selecting for reboots." In the case where you + think the quirk is in error (e.g. you have newer BIOS, + or newer board) using this option will ignore the + built-in quirk table, and use the generic default + reboot actions. + + efi + Use efi reset_system runtime service. If EFI is not + configured or the EFI reset does not work, the reboot + path attempts the reset using the keyboard controller. + + force + Don't stop other CPUs on reboot. This can make reboot + more reliable in some cases. + + kbd + Use the keyboard controller. cold reset (default) + + pci + Use a write to the PCI config space register 0xcf9 to + trigger reboot. + + triple + Force a triple fault (init) + + warm + Don't set the cold reboot flag + + Using warm reset will be much faster especially on big + memory systems because the BIOS will not go through + the memory check. Disadvantage is that not all + hardware will be completely reinitialized on reboot so + there may be boot problems on some systems. + + refscale.holdoff= [KNL] Set test-start holdoff period. The purpose of this parameter is to delay the start of the @@ -6106,7 +6404,16 @@ serialnumber [BUGS=X86-32] - sev=option[,option...] [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst + sev=option[,option...] [X86-64] + + debug + Enable debug messages. + + nosnp + Do not enable SEV-SNP (applies to host/hypervisor + only). Setting 'nosnp' avoids the RMP check overhead + in memory accesses when users do not want to run + SEV-SNP guests. shapers= [NET] Maximal number of shapers. @@ -6858,6 +7165,14 @@ comma-separated list of trace events to enable. See also Documentation/trace/events.rst + To enable modules, use :mod: keyword: + + trace_event=:mod: + + The value before :mod: will only enable specific events + that are part of the module. See the above mentioned + document for more information. + trace_instance=[instance-info] [FTRACE] Create a ring buffer instance early in boot up. This will be listed in: @@ -6992,6 +7307,13 @@ See Documentation/admin-guide/mm/transhuge.rst for more details. + transparent_hugepage_tmpfs= [KNL] + Format: [always|within_size|advise|never] + Can be used to control the default hugepage allocation policy + for the tmpfs mount. + See Documentation/admin-guide/mm/transhuge.rst + for more details. + trusted.source= [KEYS] Format: This parameter identifies the trust source as a backend @@ -7474,7 +7796,7 @@ vt.cur_default= [VT] Default cursor shape. Format: 0xCCBBAA, where AA, BB, and CC are the same as the parameters of the [?A;B;Cc escape sequence; - see VGA-softcursor.txt. Default: 2 = underline. + see vga-softcursor.rst. Default: 2 = underline. vt.default_blu= [VT] Format: ,,,..., diff --git a/Documentation/admin-guide/media/ipu3.rst b/Documentation/admin-guide/media/ipu3.rst index 83b3cd03b35c36..9c190942932e35 100644 --- a/Documentation/admin-guide/media/ipu3.rst +++ b/Documentation/admin-guide/media/ipu3.rst @@ -98,7 +98,7 @@ frames in packed raw Bayer format to IPU3 CSI2 receiver. # and that ov5670 sensor is connected to i2c bus 10 with address 0x36 export SDEV=$(media-ctl -d $MDEV -e "ov5670 10-0036") - # Establish the link for the media devices using media-ctl [#f3]_ + # Establish the link for the media devices using media-ctl media-ctl -d $MDEV -l "ov5670:0 -> ipu3-csi2 0:0[1]" # Set the format for the media devices @@ -589,12 +589,8 @@ preserved. References ========== -.. [#f5] drivers/staging/media/ipu3/include/uapi/intel-ipu3.h - .. [#f1] https://github.com/intel/nvt .. [#f2] http://git.ideasonboard.org/yavta.git -.. [#f3] http://git.ideasonboard.org/?p=media-ctl.git;a=summary - .. [#f4] ImgU limitation requires an additional 16x16 for all input resolutions diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst index c4dddf6733cda5..ede14b679d02ff 100644 --- a/Documentation/admin-guide/mm/damon/start.rst +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -42,32 +42,45 @@ the execution. :: $ git clone https://github.com/sjp38/masim; cd masim; make $ sudo damo start "./masim ./configs/stairs.cfg --quiet" - $ sudo ./damo show - 0 addr [85.541 TiB , 85.541 TiB ) (57.707 MiB ) access 0 % age 10.400 s - 1 addr [85.541 TiB , 85.542 TiB ) (413.285 MiB) access 0 % age 11.400 s - 2 addr [127.649 TiB , 127.649 TiB) (57.500 MiB ) access 0 % age 1.600 s - 3 addr [127.649 TiB , 127.649 TiB) (32.500 MiB ) access 0 % age 500 ms - 4 addr [127.649 TiB , 127.649 TiB) (9.535 MiB ) access 100 % age 300 ms - 5 addr [127.649 TiB , 127.649 TiB) (8.000 KiB ) access 60 % age 0 ns - 6 addr [127.649 TiB , 127.649 TiB) (6.926 MiB ) access 0 % age 1 s - 7 addr [127.998 TiB , 127.998 TiB) (120.000 KiB) access 0 % age 11.100 s - 8 addr [127.998 TiB , 127.998 TiB) (8.000 KiB ) access 40 % age 100 ms - 9 addr [127.998 TiB , 127.998 TiB) (4.000 KiB ) access 0 % age 11 s - total size: 577.590 MiB - $ sudo ./damo stop + $ sudo damo report access + heatmap: 641111111000000000000000000000000000000000000000000000[...]33333333333333335557984444[...]7 + # min/max temperatures: -1,840,000,000, 370,010,000, column size: 3.925 MiB + 0 addr 86.182 TiB size 8.000 KiB access 0 % age 14.900 s + 1 addr 86.182 TiB size 8.000 KiB access 60 % age 0 ns + 2 addr 86.182 TiB size 3.422 MiB access 0 % age 4.100 s + 3 addr 86.182 TiB size 2.004 MiB access 95 % age 2.200 s + 4 addr 86.182 TiB size 29.688 MiB access 0 % age 14.100 s + 5 addr 86.182 TiB size 29.516 MiB access 0 % age 16.700 s + 6 addr 86.182 TiB size 29.633 MiB access 0 % age 17.900 s + 7 addr 86.182 TiB size 117.652 MiB access 0 % age 18.400 s + 8 addr 126.990 TiB size 62.332 MiB access 0 % age 9.500 s + 9 addr 126.990 TiB size 13.980 MiB access 0 % age 5.200 s + 10 addr 126.990 TiB size 9.539 MiB access 100 % age 3.700 s + 11 addr 126.990 TiB size 16.098 MiB access 0 % age 6.400 s + 12 addr 127.987 TiB size 132.000 KiB access 0 % age 2.900 s + total size: 314.008 MiB + $ sudo damo stop The first command of the above example downloads and builds an artificial memory access generator program called ``masim``. The second command asks DAMO -to execute the artificial generator process start via the given command and -make DAMON monitors the generator process. The third command retrieves the -current snapshot of the monitored access pattern of the process from DAMON and -shows the pattern in a human readable format. - -Each line of the output shows which virtual address range (``addr [XX, XX)``) -of the process is how frequently (``access XX %``) accessed for how long time -(``age XX``). For example, the fifth region of ~9 MiB size is being most -frequently accessed for last 300 milliseconds. Finally, the fourth command -stops DAMON. +to start the program via the given command and make DAMON monitors the newly +started process. The third command retrieves the current snapshot of the +monitored access pattern of the process from DAMON and shows the pattern in a +human readable format. + +The first line of the output shows the relative access temperature (hotness) of +the regions in a single row hetmap format. Each column on the heatmap +represents regions of same size on the monitored virtual address space. The +position of the colun on the row and the number on the column represents the +relative location and access temperature of the region. ``[...]`` means +unmapped huge regions on the virtual address spaces. The second line shows +additional information for better understanding the heatmap. + +Each line of the output from the third line shows which virtual address range +(``addr XX size XX``) of the process is how frequently (``access XX %``) +accessed for how long time (``age XX``). For example, the evelenth region of +~9.5 MiB size is being most frequently accessed for last 3.7 seconds. Finally, +the fourth command stops DAMON. Note that DAMON can monitor not only virtual address spaces but multiple types of address spaces including the physical address space. @@ -95,7 +108,7 @@ Visualizing Recorded Patterns You can visualize the pattern in a heatmap, showing which memory region (x-axis) got accessed when (y-axis) and how frequently (number).:: - $ sudo damo report heats --heatmap stdout + $ sudo damo report heatmap 22222222222222222222222222222222222222211111111111111111111111111111111111111100 44444444444444444444444444444444444444434444444444444444444444444444444444443200 44444444444444444444444444444444444444433444444444444444444444444444444444444200 @@ -160,6 +173,6 @@ Data Access Pattern Aware Memory Management Below command makes every memory region of size >=4K that has not accessed for >=60 seconds in your workload to be swapped out. :: - $ sudo damo schemes --damos_access_rate 0 0 --damos_sz_region 4K max \ - --damos_age 60s max --damos_action pageout \ - + $ sudo damo start --damos_access_rate 0 0 --damos_sz_region 4K max \ + --damos_age 60s max --damos_action pageout \ + diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index d9be9f7caa7db1..47a44bd348abe9 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -26,12 +26,6 @@ DAMON provides below interfaces for different users. writing kernel space DAMON application programs for you. You can even extend DAMON for various address spaces. For detail, please refer to the interface :doc:`document `. -- *debugfs interface. (DEPRECATED!)* - :ref:`This ` is almost identical to :ref:`sysfs interface - `. This is deprecated, so users should move to the - :ref:`sysfs interface `. If you depend on this and cannot - move, please report your usecase to damon@lists.linux.dev and - linux-mm@kvack.org. .. _sysfs_interface: @@ -89,10 +83,10 @@ comma (","). │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value │ │ │ │ │ │ │ :ref:`watermarks `/metric,interval_us,high,mid,low │ │ │ │ │ │ │ :ref:`filters `/nr_filters - │ │ │ │ │ │ │ │ 0/type,matching,memcg_id - │ │ │ │ │ │ │ :ref:`stats `/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds + │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx + │ │ │ │ │ │ │ :ref:`stats `/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds │ │ │ │ │ │ │ :ref:`tried_regions `/total_bytes - │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age + │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed │ │ │ │ │ │ │ │ ... │ │ │ │ │ │ ... │ │ │ │ ... @@ -412,59 +406,62 @@ number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each filter. The filters are evaluated in the numeric order. -Each filter directory contains six files, namely ``type``, ``matcing``, -``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. To ``type`` -file, you can write one of five special keywords: ``anon`` for anonymous pages, -``memcg`` for specific memory cgroup, ``young`` for young pages, ``addr`` for -specific address range (an open-ended interval), or ``target`` for specific -DAMON monitoring target filtering. In case of the memory cgroup filtering, you -can specify the memory cgroup of the interest by writing the path of the memory -cgroup from the cgroups mount point to ``memcg_path`` file. In case of the -address range filtering, you can specify the start and end address of the range -to ``addr_start`` and ``addr_end`` files, respectively. For the DAMON -monitoring target filtering, you can specify the index of the target between -the list of the DAMON context's monitoring targets list to ``target_idx`` file. -You can write ``Y`` or ``N`` to ``matching`` file to filter out pages that does -or does not match to the type, respectively. Then, the scheme's action will -not be applied to the pages that specified to be filtered out. +Each filter directory contains seven files, namely ``type``, ``matching``, +``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. +To ``type`` file, you can write one of five special keywords: ``anon`` for +anonymous pages, ``memcg`` for specific memory cgroup, ``young`` for young +pages, ``addr`` for specific address range (an open-ended interval), or +``target`` for specific DAMON monitoring target filtering. Meaning of the +types are same to the description on the :ref:`design doc +`. + +In case of the memory cgroup filtering, you can specify the memory cgroup of +the interest by writing the path of the memory cgroup from the cgroups mount +point to ``memcg_path`` file. In case of the address range filtering, you can +specify the start and end address of the range to ``addr_start`` and +``addr_end`` files, respectively. For the DAMON monitoring target filtering, +you can specify the index of the target between the list of the DAMON context's +monitoring targets list to ``target_idx`` file. + +You can write ``Y`` or ``N`` to ``matching`` file to specify whether the filter +is for memory that matches the ``type``. You can write ``Y`` or ``N`` to +``allow`` file to specify if applying the action to the memory that satisfies +the ``type`` and ``matching`` should be allowed or not. For example, below restricts a DAMOS action to be applied to only non-anonymous pages of all memory cgroups except ``/having_care_already``.:: # echo 2 > nr_filters - # # filter out anonymous pages + # # disallow anonymous pages echo anon > 0/type echo Y > 0/matching + echo N > 0/allow # # further filter out all cgroups except one at '/having_care_already' echo memcg > 1/type echo /having_care_already > 1/memcg_path echo Y > 1/matching + echo N > 1/allow -Note that ``anon`` and ``memcg`` filters are currently supported only when -``paddr`` :ref:`implementation ` is being used. - -Also, memory regions that are filtered out by ``addr`` or ``target`` filters -are not counted as the scheme has tried to those, while regions that filtered -out by other type filters are counted as the scheme has tried to. The -difference is applied to :ref:`stats ` and -:ref:`tried regions `. +Refer to the :ref:`DAMOS filters design documentation +` for more details including how multiple filters +of different ``allow`` works, when each of the filters are supported, and +differences on stats. .. _sysfs_schemes_stats: schemes//stats/ ------------------ -DAMON counts the total number and bytes of regions that each scheme is tried to -be applied, the two numbers for the regions that each scheme is successfully -applied, and the total number of the quota limit exceeds. This statistics can -be used for online analysis or tuning of the schemes. +DAMON counts statistics for each scheme. This statistics can be used for +online analysis or tuning of the schemes. Refer to :ref:`design doc +` for more details about the stats. The statistics can be retrieved by reading the files under ``stats`` directory -(``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``, and -``qt_exceeds``), respectively. The files are not updated in real time, so you -should ask DAMON sysfs interface to update the content of the files for the -stats by writing a special keyword, ``update_schemes_stats`` to the relevant -``kdamonds//state`` file. +(``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``, +``sz_ops_filter_passed``, and ``qt_exceeds``), respectively. The files are not +updated in real time, so you should ask DAMON sysfs interface to update the +content of the files for the stats by writing a special keyword, +``update_schemes_stats`` to the relevant ``kdamonds//state`` file. .. _sysfs_schemes_tried_regions: @@ -501,10 +498,10 @@ set the ``access pattern`` as their interested pattern that they want to query. tried_regions// ------------------ -In each region directory, you will find four files (``start``, ``end``, -``nr_accesses``, and ``age``). Reading the files will show the start and end -addresses, ``nr_accesses``, and ``age`` of the region that corresponding -DAMON-based operation scheme ``action`` has tried to be applied. +In each region directory, you will find five files (``start``, ``end``, +``nr_accesses``, ``age``, and ``sz_filter_passed``). Reading the files will +show the properties of the region that corresponding DAMON-based operation +scheme ``action`` has tried to be applied. Example ~~~~~~~ @@ -600,306 +597,3 @@ fields are as usual. It shows the index of the DAMON context (``ctx_idx=X``) of the scheme in the list of the contexts of the context's kdamond, the index of the scheme (``scheme_idx=X``) in the list of the schemes of the context, in addition to the output of ``damon_aggregated`` tracepoint. - - -.. _debugfs_interface: - -debugfs Interface (DEPRECATED!) -=============================== - -.. note:: - - THIS IS DEPRECATED! - - DAMON debugfs interface is deprecated, so users should move to the - :ref:`sysfs interface `. If you depend on this and cannot - move, please report your usecase to damon@lists.linux.dev and - linux-mm@kvack.org. - -DAMON exports nine files, ``DEPRECATED``, ``attrs``, ``target_ids``, -``init_regions``, ``schemes``, ``monitor_on_DEPRECATED``, ``kdamond_pid``, -``mk_contexts`` and ``rm_contexts`` under its debugfs directory, -``/damon/``. - - -``DEPRECATED`` is a read-only file for the DAMON debugfs interface deprecation -notice. Reading it returns the deprecation notice, as below:: - - # cat DEPRECATED - DAMON debugfs interface is deprecated, so users should move to DAMON_SYSFS. If you cannot, please report your usecase to damon@lists.linux.dev and linux-mm@kvack.org. - - -Attributes ----------- - -Users can get and set the ``sampling interval``, ``aggregation interval``, -``update interval``, and min/max number of monitoring target regions by -reading from and writing to the ``attrs`` file. To know about the monitoring -attributes in detail, please refer to the :doc:`/mm/damon/design`. For -example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10 and -1000, and then check it again:: - - # cd /damon - # echo 5000 100000 1000000 10 1000 > attrs - # cat attrs - 5000 100000 1000000 10 1000 - - -Target IDs ----------- - -Some types of address spaces supports multiple monitoring target. For example, -the virtual memory address spaces monitoring can have multiple processes as the -monitoring targets. Users can set the targets by writing relevant id values of -the targets to, and get the ids of the current targets by reading from the -``target_ids`` file. In case of the virtual address spaces monitoring, the -values should be pids of the monitoring target processes. For example, below -commands set processes having pids 42 and 4242 as the monitoring targets and -check it again:: - - # cd /damon - # echo 42 4242 > target_ids - # cat target_ids - 42 4242 - -Users can also monitor the physical memory address space of the system by -writing a special keyword, "``paddr\n``" to the file. Because physical address -space monitoring doesn't support multiple targets, reading the file will show a -fake value, ``42``, as below:: - - # cd /damon - # echo paddr > target_ids - # cat target_ids - 42 - -Note that setting the target ids doesn't start the monitoring. - - -Initial Monitoring Target Regions ---------------------------------- - -In case of the virtual address space monitoring, DAMON automatically sets and -updates the monitoring target regions so that entire memory mappings of target -processes can be covered. However, users can want to limit the monitoring -region to specific address ranges, such as the heap, the stack, or specific -file-mapped area. Or, some users can know the initial access pattern of their -workloads and therefore want to set optimal initial regions for the 'adaptive -regions adjustment'. - -In contrast, DAMON do not automatically sets and updates the monitoring target -regions in case of physical memory monitoring. Therefore, users should set the -monitoring target regions by themselves. - -In such cases, users can explicitly set the initial monitoring target regions -as they want, by writing proper values to the ``init_regions`` file. The input -should be a sequence of three integers separated by white spaces that represent -one region in below form.:: - - - -The ``target idx`` should be the index of the target in ``target_ids`` file, -starting from ``0``, and the regions should be passed in address order. For -example, below commands will set a couple of address ranges, ``1-100`` and -``100-200`` as the initial monitoring target region of pid 42, which is the -first one (index ``0``) in ``target_ids``, and another couple of address -ranges, ``20-40`` and ``50-100`` as that of pid 4242, which is the second one -(index ``1``) in ``target_ids``.:: - - # cd /damon - # cat target_ids - 42 4242 - # echo "0 1 100 \ - 0 100 200 \ - 1 20 40 \ - 1 50 100" > init_regions - -Note that this sets the initial monitoring target regions only. In case of -virtual memory monitoring, DAMON will automatically updates the boundary of the -regions after one ``update interval``. Therefore, users should set the -``update interval`` large enough in this case, if they don't want the -update. - - -Schemes -------- - -Users can get and set the DAMON-based operation :ref:`schemes -` by reading from and writing to ``schemes`` debugfs file. -Reading the file also shows the statistics of each scheme. To the file, each -of the schemes should be represented in each line in below form:: - - - -You can disable schemes by simply writing an empty string to the file. - -Target Access Pattern -~~~~~~~~~~~~~~~~~~~~~ - -The target access :ref:`pattern ` of the -scheme. The ```` is constructed with three ranges in -below form:: - - min-size max-size min-acc max-acc min-age max-age - -Specifically, bytes for the size of regions (``min-size`` and ``max-size``), -number of monitored accesses per aggregate interval for access frequency -(``min-acc`` and ``max-acc``), number of aggregate intervals for the age of -regions (``min-age`` and ``max-age``) are specified. Note that the ranges are -closed interval. - -Action -~~~~~~ - -The ```` is a predefined integer for memory management :ref:`actions -`. The mapping between the ```` values and -the memory management actions is as below. For the detailed meaning of the -action and DAMON operations set supporting each action, please refer to the -list on :ref:`design doc `. - - - 0: ``willneed`` - - 1: ``cold`` - - 2: ``pageout`` - - 3: ``hugepage`` - - 4: ``nohugepage`` - - 5: ``stat`` - -Quota -~~~~~ - -Users can set the :ref:`quotas ` of the given scheme -via the ```` in below form:: - - - -This makes DAMON to try to use only up to ```` milliseconds for applying -the action to memory regions of the ``target access pattern`` within the -```` milliseconds, and to apply the action to only up to -```` bytes of memory regions within the ````. Setting both -```` and ```` zero disables the quota limits. - -For the :ref:`prioritization `, users -can set the weights for the three properties in ```` in below -form:: - - - -Watermarks -~~~~~~~~~~ - -Users can specify :ref:`watermarks ` of the -given scheme via ```` in below form:: - - - -```` is a predefined integer for the metric to be checked. The -supported numbers and their meanings are as below. - - - 0: Ignore the watermarks - - 1: System's free memory rate (per thousand) - -The value of the metric is checked every ```` microseconds. - -If the value is higher than ```` or lower than ````, the -scheme is deactivated. If the value is lower than ````, the scheme -is activated. - -.. _damos_stats: - -Statistics -~~~~~~~~~~ - -It also counts the total number and bytes of regions that each scheme is tried -to be applied, the two numbers for the regions that each scheme is successfully -applied, and the total number of the quota limit exceeds. This statistics can -be used for online analysis or tuning of the schemes. - -The statistics can be shown by reading the ``schemes`` file. Reading the file -will show each scheme you entered in each line, and the five numbers for the -statistics will be added at the end of each line. - -Example -~~~~~~~ - -Below commands applies a scheme saying "If a memory region of size in [4KiB, -8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate -interval in [10, 20], page out the region. For the paging out, use only up to -10ms per second, and also don't page out more than 1GiB per second. Under the -limitation, page out memory regions having longer age first. Also, check the -free memory rate of the system every 5 seconds, start the monitoring and paging -out when the free memory rate becomes lower than 50%, but stop it if the free -memory rate becomes larger than 60%, or lower than 30%".:: - - # cd /damon - # scheme="4096 8192 0 5 10 20 2" # target access pattern and action - # scheme+=" 10 $((1024*1024*1024)) 1000" # quotas - # scheme+=" 0 0 100" # prioritization weights - # scheme+=" 1 5000000 600 500 300" # watermarks - # echo "$scheme" > schemes - - -Turning On/Off --------------- - -Setting the files as described above doesn't incur effect unless you explicitly -start the monitoring. You can start, stop, and check the current status of the -monitoring by writing to and reading from the ``monitor_on_DEPRECATED`` file. -Writing ``on`` to the file starts the monitoring of the targets with the -attributes. Writing ``off`` to the file stops those. DAMON also stops if -every target process is terminated. Below example commands turn on, off, and -check the status of DAMON:: - - # cd /damon - # echo on > monitor_on_DEPRECATED - # echo off > monitor_on_DEPRECATED - # cat monitor_on_DEPRECATED - off - -Please note that you cannot write to the above-mentioned debugfs files while -the monitoring is turned on. If you write to the files while DAMON is running, -an error code such as ``-EBUSY`` will be returned. - - -Monitoring Thread PID ---------------------- - -DAMON does requested monitoring with a kernel thread called ``kdamond``. You -can get the pid of the thread by reading the ``kdamond_pid`` file. When the -monitoring is turned off, reading the file returns ``none``. :: - - # cd /damon - # cat monitor_on_DEPRECATED - off - # cat kdamond_pid - none - # echo on > monitor_on_DEPRECATED - # cat kdamond_pid - 18594 - - -Using Multiple Monitoring Threads ---------------------------------- - -One ``kdamond`` thread is created for each monitoring context. You can create -and remove monitoring contexts for multiple ``kdamond`` required use case using -the ``mk_contexts`` and ``rm_contexts`` files. - -Writing the name of the new context to the ``mk_contexts`` file creates a -directory of the name on the DAMON debugfs directory. The directory will have -DAMON debugfs files for the context. :: - - # cd /damon - # ls foo - # ls: cannot access 'foo': No such file or directory - # echo foo > mk_contexts - # ls foo - # attrs init_regions kdamond_pid schemes target_ids - -If the context is not needed anymore, you can remove it and the corresponding -directory by putting the name of the context to the ``rm_contexts`` file. :: - - # echo foo > rm_contexts - # ls foo - # ls: cannot access 'foo': No such file or directory - -Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on_DEPRECATED`` files -are in the root directory only. diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst index cb2c080f400ce4..33c886f3d19839 100644 --- a/Documentation/admin-guide/mm/memory-hotplug.rst +++ b/Documentation/admin-guide/mm/memory-hotplug.rst @@ -280,8 +280,8 @@ The following files are currently defined: blocks; configure auto-onlining. The default value depends on the - CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration - option. + CONFIG_MHP_DEFAULT_ONLINE_TYPE kernel configuration + options. See the ``state`` property of memory blocks for details. ``block_size_bytes`` read-only: the size in bytes of a memory block. diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 8872203df0880a..dff8d5985f0f2e 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -332,6 +332,12 @@ allocation policy for the internal shmem mount by using the kernel parameter seven valid policies for shmem (``always``, ``within_size``, ``advise``, ``never``, ``deny``, and ``force``). +Similarly to ``transparent_hugepage_shmem``, you can control the default +hugepage allocation policy for the tmpfs mount by using the kernel parameter +``transparent_hugepage_tmpfs=``, where ```` is one of the +four valid policies for tmpfs (``always``, ``within_size``, ``advise``, +``never``). The tmpfs mount default policy is ``never``. + In the same manner as ``thp_anon`` controls each supported anonymous THP size, ``thp_shmem`` controls each supported shmem THP size. ``thp_shmem`` has the same format as ``thp_anon``, but also supports the policy @@ -352,8 +358,21 @@ default to ``never``. Hugepages in tmpfs/shmem ======================== -You can control hugepage allocation policy in tmpfs with mount option -``huge=``. It can have following values: +Traditionally, tmpfs only supported a single huge page size ("PMD"). Today, +it also supports smaller sizes just like anonymous memory, often referred +to as "multi-size THP" (mTHP). Huge pages of any size are commonly +represented in the kernel as "large folios". + +While there is fine control over the huge page sizes to use for the internal +shmem mount (see below), ordinary tmpfs mounts will make use of all available +huge page sizes without any control over the exact sizes, behaving more like +other file systems. + +tmpfs mounts +------------ + +The THP allocation policy for tmpfs mounts can be adjusted using the mount +option: ``huge=``. It can have following values: always Attempt to allocate huge pages every time we need a new page; @@ -363,24 +382,24 @@ never within_size Only allocate huge page if it will be fully within i_size. - Also respect fadvise()/madvise() hints; + Also respect madvise() hints; advise - Only allocate huge pages if requested with fadvise()/madvise(); + Only allocate huge pages if requested with madvise(); + +Remember, that the kernel may use huge pages of all available sizes, and +that no fine control as for the internal tmpfs mount is available. -The default policy is ``never``. +The default policy in the past was ``never``, but it can now be adjusted +using the kernel parameter ``transparent_hugepage_tmpfs=``. ``mount -o remount,huge= /mountpoint`` works fine after mount: remounting ``huge=never`` will not attempt to break up huge pages at all, just stop more from being allocated. -There's also sysfs knob to control hugepage allocation policy for internal -shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount -is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or -MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem. - -In addition to policies listed above, shmem_enabled allows two further -values: +In addition to policies listed above, the sysfs knob +/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the +allocation policy of tmpfs mounts, when set to the following values: deny For use in emergencies, to force the huge option off from @@ -388,13 +407,24 @@ deny force Force the huge option on for all - very useful for testing; -Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to -control mTHP allocation: -'/sys/kernel/mm/transparent_hugepage/hugepages-kB/shmem_enabled', -and its value for each mTHP is essentially consistent with the global -setting. An 'inherit' option is added to ensure compatibility with these -global settings. Conversely, the options 'force' and 'deny' are dropped, -which are rather testing artifacts from the old ages. +shmem / internal tmpfs +---------------------- +The mount internal tmpfs mount is used for SysV SHM, memfds, shared anonymous +mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem. + +To control the THP allocation policy for this internal tmpfs mount, the +sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs +per THP size in +'/sys/kernel/mm/transparent_hugepage/hugepages-kB/shmem_enabled' +can be used. + +The global knob has the same semantics as the ``huge=`` mount options +for tmpfs mounts, except that the different huge page sizes can be controlled +individually, and will only use the setting of the global knob when the +per-size knob is set to 'inherit'. + +The options 'force' and 'deny' are dropped for the individual sizes, which +are rather testing artifacts from the old ages. always Attempt to allocate huge pages every time we need a new page; @@ -408,10 +438,10 @@ never within_size Only allocate huge page if it will be fully within i_size. - Also respect fadvise()/madvise() hints; + Also respect madvise() hints; advise - Only allocate huge pages if requested with fadvise()/madvise(); + Only allocate huge pages if requested with madvise(); Need of application restart =========================== @@ -561,6 +591,16 @@ swpin is incremented every time a huge page is swapped in from a non-zswap swap device in one piece. +swpin_fallback + is incremented if swapin fails to allocate or charge a huge page + and instead falls back to using huge pages with lower orders or + small pages. + +swpin_fallback_charge + is incremented if swapin fails to charge a huge page and instead + falls back to using huge pages with lower orders or small pages + even though the allocation was successful. + swpout is incremented every time a huge page is swapped out to a non-zswap swap device in one piece without splitting. diff --git a/Documentation/admin-guide/nvme-multipath.rst b/Documentation/admin-guide/nvme-multipath.rst new file mode 100644 index 00000000000000..97ca1ccef459b9 --- /dev/null +++ b/Documentation/admin-guide/nvme-multipath.rst @@ -0,0 +1,72 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +Linux NVMe multipath +==================== + +This document describes NVMe multipath and its path selection policies supported +by the Linux NVMe host driver. + + +Introduction +============ + +The NVMe multipath feature in Linux integrates namespaces with the same +identifier into a single block device. Using multipath enhances the reliability +and stability of I/O access while improving bandwidth performance. When a user +sends I/O to this merged block device, the multipath mechanism selects one of +the underlying block devices (paths) according to the configured policy. +Different policies result in different path selections. + + +Policies +======== + +All policies follow the ANA (Asymmetric Namespace Access) mechanism, meaning +that when an optimized path is available, it will be chosen over a non-optimized +one. Current the NVMe multipath policies include numa(default), round-robin and +queue-depth. + +To set the desired policy (e.g., round-robin), use one of the following methods: + 1. echo -n "round-robin" > /sys/module/nvme_core/parameters/iopolicy + 2. or add the "nvme_core.iopolicy=round-robin" to cmdline. + + +NUMA +---- + +The NUMA policy selects the path closest to the NUMA node of the current CPU for +I/O distribution. This policy maintains the nearest paths to each NUMA node +based on network interface connections. + +When to use the NUMA policy: + 1. Multi-core Systems: Optimizes memory access in multi-core and + multi-processor systems, especially under NUMA architecture. + 2. High Affinity Workloads: Binds I/O processing to the CPU to reduce + communication and data transfer delays across nodes. + + +Round-Robin +----------- + +The round-robin policy distributes I/O requests evenly across all paths to +enhance throughput and resource utilization. Each I/O operation is sent to the +next path in sequence. + +When to use the round-robin policy: + 1. Balanced Workloads: Effective for balanced and predictable workloads with + similar I/O size and type. + 2. Homogeneous Path Performance: Utilizes all paths efficiently when + performance characteristics (e.g., latency, bandwidth) are similar. + + +Queue-Depth +----------- + +The queue-depth policy manages I/O requests based on the current queue depth +of each path, selecting the path with the least number of in-flight I/Os. + +When to use the queue-depth policy: + 1. High load with small I/Os: Effectively balances load across paths when + the load is high, and I/O operations consist of small, relatively + fixed-sized requests. diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst index 39b8e1fdd0cd07..cb376f335f402c 100644 --- a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst +++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst @@ -60,7 +60,7 @@ description of available events and configuration options in sysfs, see The "format" directory describes format of the config fields of the perf_event_attr structure. The "events" directory provides configuration templates for all documented events. For example, -"Rx_PCIe_TLP_Data_Payload" is an equivalent of "eventid=0x22,type=0x1". +"rx_pcie_tlp_data_payload" is an equivalent of "eventid=0x21,type=0x0". The "perf list" command shall list the available events from sysfs, e.g.:: @@ -79,8 +79,8 @@ Example usage of counting PCIe RX TLP data payload (Units of bytes):: The average RX/TX bandwidth can be calculated using the following formula: - PCIe RX Bandwidth = Rx_PCIe_TLP_Data_Payload / Measure_Time_Window - PCIe TX Bandwidth = Tx_PCIe_TLP_Data_Payload / Measure_Time_Window + PCIe RX Bandwidth = rx_pcie_tlp_data_payload / Measure_Time_Window + PCIe TX Bandwidth = tx_pcie_tlp_data_payload / Measure_Time_Window Lane Event Usage ------------------------------- diff --git a/Documentation/admin-guide/perf/hisi-pmu.rst b/Documentation/admin-guide/perf/hisi-pmu.rst index 5cc248d18c63a7..48992a0b8e94f7 100644 --- a/Documentation/admin-guide/perf/hisi-pmu.rst +++ b/Documentation/admin-guide/perf/hisi-pmu.rst @@ -35,7 +35,10 @@ e.g. hisi_sccl1_hha0/rx_operations is RX_OPERATIONS event of HHA index #0 in SCCL ID #1. The driver also provides a "cpumask" sysfs attribute, which shows the CPU core -ID used to count the uncore PMU event. +ID used to count the uncore PMU event. An "associated_cpus" sysfs attribute is +also provided to show the CPUs associated with this PMU. The "cpumask" indicates +the CPUs to open the events, usually as a hint for userspaces tools like perf. +It only contains one associated CPU from the "associated_cpus". Example usage of perf:: diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst index a58bd3f7e19079..072b510385c417 100644 --- a/Documentation/admin-guide/perf/index.rst +++ b/Documentation/admin-guide/perf/index.rst @@ -14,6 +14,8 @@ Performance monitor support qcom_l2_pmu qcom_l3_pmu starfive_starlink_pmu + mrvl-odyssey-ddr-pmu + mrvl-odyssey-tad-pmu arm-ccn arm-cmn arm-ni diff --git a/Documentation/admin-guide/perf/mrvl-odyssey-ddr-pmu.rst b/Documentation/admin-guide/perf/mrvl-odyssey-ddr-pmu.rst new file mode 100644 index 00000000000000..2e817593a4d9ee --- /dev/null +++ b/Documentation/admin-guide/perf/mrvl-odyssey-ddr-pmu.rst @@ -0,0 +1,80 @@ +=================================================================== +Marvell Odyssey DDR PMU Performance Monitoring Unit (PMU UNCORE) +=================================================================== + +Odyssey DRAM Subsystem supports eight counters for monitoring performance +and software can program those counters to monitor any of the defined +performance events. Supported performance events include those counted +at the interface between the DDR controller and the PHY, interface between +the DDR Controller and the CHI interconnect, or within the DDR Controller. + +Additionally DSS also supports two fixed performance event counters, one +for ddr reads and the other for ddr writes. + +The counter will be operating in either manual or auto mode. + +The PMU driver exposes the available events and format options under sysfs:: + + /sys/bus/event_source/devices/mrvl_ddr_pmu_<>/events/ + /sys/bus/event_source/devices/mrvl_ddr_pmu_<>/format/ + +Examples:: + + $ perf list | grep ddr + mrvl_ddr_pmu_<>/ddr_act_bypass_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_bsm_alloc/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_bsm_starvation/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_active_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_mwr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_rd_active_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_rd_or_wr_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_read/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_wr_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_write/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_capar_error/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_crit_ref/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_ddr_reads/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_ddr_writes/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_cmd_is_retry/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_cycles/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_parity_poison/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_rd_data_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_wr_data_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dqsosc_mpc/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dqsosc_mrr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_enter_mpsm/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_enter_powerdown/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_enter_selfref/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_pri_rdaccess/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_rd_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_rd_or_wr_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_rmw_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_wr_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hpri_sched_rd_crit_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_load_mode/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_lpri_sched_rd_crit_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_precharge/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_precharge_for_other/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_precharge_for_rdwr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_raw_hazard/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_rd_bypass_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_rd_crc_error/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_rd_uc_ecc_error/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_rdwr_transitions/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_refresh/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_retry_fifo_full/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_spec_ref/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_tcr_mrr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_war_hazard/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_waw_hazard/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_win_limit_reached_rd/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_win_limit_reached_wr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_wr_crc_error/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_wr_trxn_crit_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_write_combine/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_zqcl/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_zqlatch/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_zqstart/ [Kernel PMU event] + + $ perf stat -e ddr_cam_read,ddr_cam_write,ddr_cam_active_access,ddr_cam + rd_or_wr_access,ddr_cam_rd_active_access,ddr_cam_mwr diff --git a/Documentation/admin-guide/perf/mrvl-odyssey-tad-pmu.rst b/Documentation/admin-guide/perf/mrvl-odyssey-tad-pmu.rst new file mode 100644 index 00000000000000..ad1975b14087bc --- /dev/null +++ b/Documentation/admin-guide/perf/mrvl-odyssey-tad-pmu.rst @@ -0,0 +1,37 @@ +==================================================================== +Marvell Odyssey LLC-TAD Performance Monitoring Unit (PMU UNCORE) +==================================================================== + +Each TAD provides eight 64-bit counters for monitoring +cache behavior.The driver always configures the same counter for +all the TADs. The user would end up effectively reserving one of +eight counters in every TAD to look across all TADs. +The occurrences of events are aggregated and presented to the user +at the end of running the workload. The driver does not provide a +way for the user to partition TADs so that different TADs are used for +different applications. + +The performance events reflect various internal or interface activities. +By combining the values from multiple performance counters, cache +performance can be measured in terms such as: cache miss rate, cache +allocations, interface retry rate, internal resource occupancy, etc. + +The PMU driver exposes the available events and format options under sysfs:: + + /sys/bus/event_source/devices/tad/events/ + /sys/bus/event_source/devices/tad/format/ + +Examples:: + + $ perf list | grep tad + tad/tad_alloc_any/ [Kernel PMU event] + tad/tad_alloc_dtg/ [Kernel PMU event] + tad/tad_alloc_ltg/ [Kernel PMU event] + tad/tad_hit_any/ [Kernel PMU event] + tad/tad_hit_dtg/ [Kernel PMU event] + tad/tad_hit_ltg/ [Kernel PMU event] + tad/tad_req_msh_in_exlmn/ [Kernel PMU event] + tad/tad_tag_rd/ [Kernel PMU event] + tad/tad_tot_cycle/ [Kernel PMU event] + + $ perf stat -e tad_alloc_dtg,tad_alloc_ltg,tad_alloc_any,tad_hit_dtg,tad_hit_ltg,tad_hit_any,tad_tag_rd diff --git a/Documentation/admin-guide/perf/nvidia-pmu.rst b/Documentation/admin-guide/perf/nvidia-pmu.rst index 2e0d47cfe7ea76..f538ef67e0e8f0 100644 --- a/Documentation/admin-guide/perf/nvidia-pmu.rst +++ b/Documentation/admin-guide/perf/nvidia-pmu.rst @@ -34,7 +34,7 @@ strongly-ordered (SO) PCIE write traffic to local/remote memory. Please see traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_scf_pmu_. +see /sys/bus/event_source/devices/nvidia_scf_pmu_. Example usage: @@ -66,7 +66,7 @@ Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about the PMU traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_nvlink_c2c0_pmu_. +see /sys/bus/event_source/devices/nvidia_nvlink_c2c0_pmu_. Example usage: @@ -86,6 +86,22 @@ Example usage: perf stat -a -e nvidia_nvlink_c2c0_pmu_3/event=0x0/ +The NVLink-C2C has two ports that can be connected to one GPU (occupying both +ports) or to two GPUs (one GPU per port). The user can use "port" bitmap +parameter to select the port(s) to monitor. Each bit represents the port number, +e.g. "port=0x1" corresponds to port 0 and "port=0x3" is for port 0 and 1. The +PMU will monitor both ports by default if not specified. + +Example for port filtering: + +* Count event id 0x0 from the GPU connected with socket 0 on port 0:: + + perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0,port=0x1/ + +* Count event id 0x0 from the GPUs connected with socket 0 on port 0 and port 1:: + + perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0,port=0x3/ + NVLink-C2C1 PMU ------------------- @@ -96,7 +112,7 @@ Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about the PMU traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_nvlink_c2c1_pmu_. +see /sys/bus/event_source/devices/nvidia_nvlink_c2c1_pmu_. Example usage: @@ -116,6 +132,22 @@ Example usage: perf stat -a -e nvidia_nvlink_c2c1_pmu_3/event=0x0/ +The NVLink-C2C has two ports that can be connected to one GPU (occupying both +ports) or to two GPUs (one GPU per port). The user can use "port" bitmap +parameter to select the port(s) to monitor. Each bit represents the port number, +e.g. "port=0x1" corresponds to port 0 and "port=0x3" is for port 0 and 1. The +PMU will monitor both ports by default if not specified. + +Example for port filtering: + +* Count event id 0x0 from the GPU connected with socket 0 on port 0:: + + perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0,port=0x1/ + +* Count event id 0x0 from the GPUs connected with socket 0 on port 0 and port 1:: + + perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0,port=0x3/ + CNVLink PMU --------------- @@ -125,13 +157,14 @@ to local memory. For PCIE traffic, this PMU captures read and relaxed ordered for more info about the PMU traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_cnvlink_pmu_. +see /sys/bus/event_source/devices/nvidia_cnvlink_pmu_. Each SoC socket can be connected to one or more sockets via CNVLink. The user can use "rem_socket" bitmap parameter to select the remote socket(s) to monitor. Each bit represents the socket number, e.g. "rem_socket=0xE" corresponds to -socket 1 to 3. -/sys/bus/event_sources/devices/nvidia_cnvlink_pmu_/format/rem_socket +socket 1 to 3. The PMU will monitor all remote sockets by default if not +specified. +/sys/bus/event_source/devices/nvidia_cnvlink_pmu_/format/rem_socket shows the valid bits that can be set in the "rem_socket" parameter. The PMU can not distinguish the remote traffic initiator, therefore it does not @@ -165,12 +198,13 @@ local/remote memory. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section for more info about the PMU traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_pcie_pmu_. +see /sys/bus/event_source/devices/nvidia_pcie_pmu_. Each SoC socket can support multiple root ports. The user can use "root_port" bitmap parameter to select the port(s) to monitor, i.e. -"root_port=0xF" corresponds to root port 0 to 3. -/sys/bus/event_sources/devices/nvidia_pcie_pmu_/format/root_port +"root_port=0xF" corresponds to root port 0 to 3. The PMU will monitor all root +ports by default if not specified. +/sys/bus/event_source/devices/nvidia_pcie_pmu_/format/root_port shows the valid bits that can be set in the "root_port" parameter. Example usage: diff --git a/Documentation/admin-guide/quickly-build-trimmed-linux.rst b/Documentation/admin-guide/quickly-build-trimmed-linux.rst index f08149bc53f84d..07cfd8863b46ee 100644 --- a/Documentation/admin-guide/quickly-build-trimmed-linux.rst +++ b/Documentation/admin-guide/quickly-build-trimmed-linux.rst @@ -733,7 +733,7 @@ can easily happen that your self-built kernel will lack modules for tasks you did not perform before utilizing this make target. That's because those tasks require kernel modules that are normally autoloaded when you perform that task for the first time; if you didn't perform that task at least once before using -localmodonfig, the latter will thus assume these modules are superfluous and +localmodconfig, the latter will thus assume these modules are superfluous and disable them. You can try to avoid this by performing typical tasks that often will autoload diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst index f5ec6c9312e1da..08e89e03171467 100644 --- a/Documentation/admin-guide/sysctl/fs.rst +++ b/Documentation/admin-guide/sysctl/fs.rst @@ -41,7 +41,7 @@ pre-allocation or re-sizing of any kernel data structures. dentry-negative ---------------------------- -Policy for negative dentries. Set to 1 to to always delete the dentry when a +Policy for negative dentries. Set to 1 to always delete the dentry when a file is removed, and 0 to disable it. By default, this behavior is disabled. dentry-state diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index b2b36d0c3094d7..a43b78b4b64649 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -1544,6 +1544,13 @@ constant ``FUTEX_TID_MASK`` (0x3fffffff). If a value outside of this range is written to ``threads-max`` an ``EINVAL`` error occurs. +timer_migration +=============== + +When set to a non-zero value, attempt to migrate timers away from idle cpus to +allow them to remain in low power states longer. + +Default is set (1). traceoff_on_warning =================== diff --git a/Documentation/admin-guide/sysrq.rst b/Documentation/admin-guide/sysrq.rst index a85b3384d1e7a9..9c7aa817adc72d 100644 --- a/Documentation/admin-guide/sysrq.rst +++ b/Documentation/admin-guide/sysrq.rst @@ -49,26 +49,26 @@ How do I use the magic SysRq key? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ On x86 - You press the key combo :kbd:`ALT-SysRq-`. + You press the key combo `ALT-SysRq-`. .. note:: Some keyboards may not have a key labeled 'SysRq'. The 'SysRq' key is also known as the 'Print Screen' key. Also some keyboards cannot handle so many keys being pressed at the same time, so you might - have better luck with press :kbd:`Alt`, press :kbd:`SysRq`, - release :kbd:`SysRq`, press :kbd:``, release everything. + have better luck with press `Alt`, press `SysRq`, + release `SysRq`, press ``, release everything. On SPARC - You press :kbd:`ALT-STOP-`, I believe. + You press `ALT-STOP-`, I believe. On the serial console (PC style standard serial ports only) You send a ``BREAK``, then within 5 seconds a command key. Sending ``BREAK`` twice is interpreted as a normal BREAK. On PowerPC - Press :kbd:`ALT - Print Screen` (or :kbd:`F13`) - :kbd:``. - :kbd:`Print Screen` (or :kbd:`F13`) - :kbd:`` may suffice. + Press `ALT - Print Screen` (or `F13`) - ``. + `Print Screen` (or `F13`) - `` may suffice. On other If you know of the key combos for other architectures, please @@ -88,7 +88,7 @@ On all echo _reisub > /proc/sysrq-trigger -The :kbd:`` is case sensitive. +The `` is case sensitive. What are the 'command' keys? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -225,9 +225,9 @@ Sometimes SysRq seems to get 'stuck' after using it, what can I do? When this happens, try tapping shift, alt and control on both sides of the keyboard, and hitting an invalid sysrq sequence again. (i.e., something like -:kbd:`alt-sysrq-z`). +`alt-sysrq-z`). -Switching to another virtual console (:kbd:`ALT+Fn`) and then back again +Switching to another virtual console (`ALT+Fn`) and then back again should also help. I hit SysRq, but nothing seems to happen, what's wrong? @@ -290,7 +290,7 @@ exception the header line from the sysrq command is passed to all console consumers as if the current loglevel was maximum. If only the header is emitted it is almost certain that the kernel loglevel is too low. Should you require the output on the console channel then you will need -to temporarily up the console loglevel using :kbd:`alt-sysrq-8` or:: +to temporarily up the console loglevel using `alt-sysrq-8` or:: echo 8 > /proc/sysrq-trigger diff --git a/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst b/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst index 6281eae9e6bccf..03c55151346c33 100644 --- a/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst +++ b/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst @@ -1431,7 +1431,7 @@ can easily happen that your self-built kernels will lack modules for tasks you did not perform at least once before utilizing this make target. That happens when a task requires kernel modules which are only autoloaded when you execute it for the first time. So when you never performed that task since starting your -kernel the modules will not have been loaded -- and from localmodonfig's point +kernel the modules will not have been loaded -- and from localmodconfig's point of view look superfluous, which thus disables them to reduce the amount of code to be compiled. diff --git a/Documentation/admin-guide/workload-tracing.rst b/Documentation/admin-guide/workload-tracing.rst index b2e254ec8ee846..6be38c1b9c5bb4 100644 --- a/Documentation/admin-guide/workload-tracing.rst +++ b/Documentation/admin-guide/workload-tracing.rst @@ -83,7 +83,7 @@ scripts/ver_linux is a good way to check if your system already has the necessary tools:: sudo apt-get build-essentials flex bison yacc - sudo apt install libelf-dev systemtap-sdt-dev libaudit-dev libslang2-dev libperl-dev libdw-dev + sudo apt install libelf-dev systemtap-sdt-dev libslang2-dev libperl-dev libdw-dev cscope is a good tool to browse kernel sources. Let's install it now:: diff --git a/Documentation/arch/arm64/asymmetric-32bit.rst b/Documentation/arch/arm64/asymmetric-32bit.rst index 64a0b505da7d80..1ca2b359a907f0 100644 --- a/Documentation/arch/arm64/asymmetric-32bit.rst +++ b/Documentation/arch/arm64/asymmetric-32bit.rst @@ -153,3 +153,11 @@ asymmetric system, a broken guest at EL1 could still attempt to execute mode will return to host userspace with an ``exit_reason`` of ``KVM_EXIT_FAIL_ENTRY`` and will remain non-runnable until successfully re-initialised by a subsequent ``KVM_ARM_VCPU_INIT`` operation. + +NOHZ FULL +--------- + +To avoid perturbing an adaptive-ticks CPU (specified using +``nohz_full=``) when a 32-bit task is forcefully migrated, these CPUs +are treated as 64-bit-only when support for asymmetric 32-bit systems +is enabled. diff --git a/Documentation/arch/arm64/booting.rst b/Documentation/arch/arm64/booting.rst index 3278fb4bf219da..cad6fdc96b98bf 100644 --- a/Documentation/arch/arm64/booting.rst +++ b/Documentation/arch/arm64/booting.rst @@ -449,6 +449,18 @@ Before jumping into the kernel, the following conditions must be met: - HFGWTR_EL2.nGCS_EL0 (bit 52) must be initialised to 0b1. + - For CPUs with debug architecture i.e FEAT_Debugv8pN (all versions): + + - If EL3 is present: + + - MDCR_EL3.TDA (bit 9) must be initialized to 0b0 + + - For CPUs with FEAT_PMUv3: + + - If EL3 is present: + + - MDCR_EL3.TPM (bit 6) must be initialized to 0b0 + The requirements described above for CPU mode, caches, MMUs, architected timers, coherency and system registers apply to all CPUs. All CPUs must enter the kernel in the same exception level. Where the values documented diff --git a/Documentation/arch/arm64/elf_hwcaps.rst b/Documentation/arch/arm64/elf_hwcaps.rst index 2ff922a406ad83..69d7afe5685306 100644 --- a/Documentation/arch/arm64/elf_hwcaps.rst +++ b/Documentation/arch/arm64/elf_hwcaps.rst @@ -174,26 +174,82 @@ HWCAP_GCS Functionality implied by ID_AA64PFR1_EL1.GCS == 0b1, as described by Documentation/arch/arm64/gcs.rst. +HWCAP_CMPBR + Functionality implied by ID_AA64ISAR2_EL1.CSSC == 0b0010. + +HWCAP_FPRCVT + Functionality implied by ID_AA64ISAR3_EL1.FPRCVT == 0b0001. + +HWCAP_F8MM8 + Functionality implied by ID_AA64FPFR0_EL1.F8MM8 == 0b0001. + +HWCAP_F8MM4 + Functionality implied by ID_AA64FPFR0_EL1.F8MM4 == 0b0001. + +HWCAP_SVE_F16MM + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.F16MM == 0b0001. + +HWCAP_SVE_ELTPERM + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.ELTPERM == 0b0001. + +HWCAP_SVE_AES2 + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.AES == 0b0011. + +HWCAP_SVE_BFSCALE + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.B16B16 == 0b0010. + +HWCAP_SVE2P2 + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.SVEver == 0b0011. + +HWCAP_SME2P2 + Functionality implied by ID_AA64SMFR0_EL1.SMEver == 0b0011. + +HWCAP_SME_SBITPERM + Functionality implied by ID_AA64SMFR0_EL1.SBitPerm == 0b1. + +HWCAP_SME_AES + Functionality implied by ID_AA64SMFR0_EL1.AES == 0b1. + +HWCAP_SME_SFEXPA + Functionality implied by ID_AA64SMFR0_EL1.SFEXPA == 0b1. + +HWCAP_SME_STMOP + Functionality implied by ID_AA64SMFR0_EL1.STMOP == 0b1. + +HWCAP_SME_SMOP4 + Functionality implied by ID_AA64SMFR0_EL1.SMOP4 == 0b1. + HWCAP2_DCPODP Functionality implied by ID_AA64ISAR1_EL1.DPB == 0b0010. HWCAP2_SVE2 - Functionality implied by ID_AA64ZFR0_EL1.SVEver == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.SVEver == 0b0001. HWCAP2_SVEAES - Functionality implied by ID_AA64ZFR0_EL1.AES == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.AES == 0b0001. HWCAP2_SVEPMULL - Functionality implied by ID_AA64ZFR0_EL1.AES == 0b0010. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.AES == 0b0010. HWCAP2_SVEBITPERM - Functionality implied by ID_AA64ZFR0_EL1.BitPerm == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.BitPerm == 0b0001. HWCAP2_SVESHA3 - Functionality implied by ID_AA64ZFR0_EL1.SHA3 == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.SHA3 == 0b0001. HWCAP2_SVESM4 - Functionality implied by ID_AA64ZFR0_EL1.SM4 == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.SM4 == 0b0001. HWCAP2_FLAGM2 Functionality implied by ID_AA64ISAR0_EL1.TS == 0b0010. @@ -202,16 +258,20 @@ HWCAP2_FRINT Functionality implied by ID_AA64ISAR1_EL1.FRINTTS == 0b0001. HWCAP2_SVEI8MM - Functionality implied by ID_AA64ZFR0_EL1.I8MM == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.I8MM == 0b0001. HWCAP2_SVEF32MM - Functionality implied by ID_AA64ZFR0_EL1.F32MM == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.F32MM == 0b0001. HWCAP2_SVEF64MM - Functionality implied by ID_AA64ZFR0_EL1.F64MM == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.F64MM == 0b0001. HWCAP2_SVEBF16 - Functionality implied by ID_AA64ZFR0_EL1.BF16 == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.BF16 == 0b0001. HWCAP2_I8MM Functionality implied by ID_AA64ISAR1_EL1.I8MM == 0b0001. @@ -277,7 +337,8 @@ HWCAP2_EBF16 Functionality implied by ID_AA64ISAR1_EL1.BF16 == 0b0010. HWCAP2_SVE_EBF16 - Functionality implied by ID_AA64ZFR0_EL1.BF16 == 0b0010. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.BF16 == 0b0010. HWCAP2_CSSC Functionality implied by ID_AA64ISAR2_EL1.CSSC == 0b0001. @@ -286,7 +347,8 @@ HWCAP2_RPRFM Functionality implied by ID_AA64ISAR2_EL1.RPRFM == 0b0001. HWCAP2_SVE2P1 - Functionality implied by ID_AA64ZFR0_EL1.SVEver == 0b0010. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.SVEver == 0b0010. HWCAP2_SME2 Functionality implied by ID_AA64SMFR0_EL1.SMEver == 0b0001. @@ -313,7 +375,8 @@ HWCAP2_HBC Functionality implied by ID_AA64ISAR2_EL1.BC == 0b0001. HWCAP2_SVE_B16B16 - Functionality implied by ID_AA64ZFR0_EL1.B16B16 == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.B16B16 == 0b0001. HWCAP2_LRCPC3 Functionality implied by ID_AA64ISAR1_EL1.LRCPC == 0b0011. diff --git a/Documentation/arch/arm64/memory.rst b/Documentation/arch/arm64/memory.rst index 8a658984b8bb67..678fbb418c3ad5 100644 --- a/Documentation/arch/arm64/memory.rst +++ b/Documentation/arch/arm64/memory.rst @@ -23,71 +23,6 @@ swapper_pg_dir contains only kernel (global) mappings while the user pgd contains only user (non-global) mappings. The swapper_pg_dir address is written to TTBR1 and never written to TTBR0. - -AArch64 Linux memory layout with 4KB pages + 4 levels (48-bit):: - - Start End Size Use - ----------------------------------------------------------------------- - 0000000000000000 0000ffffffffffff 256TB user - ffff000000000000 ffff7fffffffffff 128TB kernel logical memory map - [ffff600000000000 ffff7fffffffffff] 32TB [kasan shadow region] - ffff800000000000 ffff80007fffffff 2GB modules - ffff800080000000 fffffbffefffffff 124TB vmalloc - fffffbfff0000000 fffffbfffdffffff 224MB fixed mappings (top down) - fffffbfffe000000 fffffbfffe7fffff 8MB [guard region] - fffffbfffe800000 fffffbffff7fffff 16MB PCI I/O space - fffffbffff800000 fffffbffffffffff 8MB [guard region] - fffffc0000000000 fffffdffffffffff 2TB vmemmap - fffffe0000000000 ffffffffffffffff 2TB [guard region] - - -AArch64 Linux memory layout with 64KB pages + 3 levels (52-bit with HW support):: - - Start End Size Use - ----------------------------------------------------------------------- - 0000000000000000 000fffffffffffff 4PB user - fff0000000000000 ffff7fffffffffff ~4PB kernel logical memory map - [fffd800000000000 ffff7fffffffffff] 512TB [kasan shadow region] - ffff800000000000 ffff80007fffffff 2GB modules - ffff800080000000 fffffbffefffffff 124TB vmalloc - fffffbfff0000000 fffffbfffdffffff 224MB fixed mappings (top down) - fffffbfffe000000 fffffbfffe7fffff 8MB [guard region] - fffffbfffe800000 fffffbffff7fffff 16MB PCI I/O space - fffffbffff800000 fffffbffffffffff 8MB [guard region] - fffffc0000000000 ffffffdfffffffff ~4TB vmemmap - ffffffe000000000 ffffffffffffffff 128GB [guard region] - - -Translation table lookup with 4KB pages:: - - +--------+--------+--------+--------+--------+--------+--------+--------+ - |63 56|55 48|47 40|39 32|31 24|23 16|15 8|7 0| - +--------+--------+--------+--------+--------+--------+--------+--------+ - | | | | | | - | | | | | v - | | | | | [11:0] in-page offset - | | | | +-> [20:12] L3 index - | | | +-----------> [29:21] L2 index - | | +---------------------> [38:30] L1 index - | +-------------------------------> [47:39] L0 index - +----------------------------------------> [55] TTBR0/1 - - -Translation table lookup with 64KB pages:: - - +--------+--------+--------+--------+--------+--------+--------+--------+ - |63 56|55 48|47 40|39 32|31 24|23 16|15 8|7 0| - +--------+--------+--------+--------+--------+--------+--------+--------+ - | | | | | - | | | | v - | | | | [15:0] in-page offset - | | | +----------> [28:16] L3 index - | | +--------------------------> [41:29] L2 index - | +-------------------------------> [47:42] L1 index (48-bit) - | [51:42] L1 index (52-bit) - +----------------------------------------> [55] TTBR0/1 - - When using KVM without the Virtualization Host Extensions, the hypervisor maps kernel pages in EL2 at a fixed (and potentially random) offset from the linear mapping. See the kern_hyp_va macro and diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst index b42fea07c5cec8..f074f6219f5c33 100644 --- a/Documentation/arch/arm64/silicon-errata.rst +++ b/Documentation/arch/arm64/silicon-errata.rst @@ -198,7 +198,8 @@ stable kernels. +----------------+-----------------+-----------------+-----------------------------+ | ARM | Neoverse-V3 | #3312417 | ARM64_ERRATUM_3194386 | +----------------+-----------------+-----------------+-----------------------------+ -| ARM | MMU-500 | #841119,826419 | N/A | +| ARM | MMU-500 | #841119,826419 | ARM_SMMU_MMU_500_CPRE_ERRATA| +| | | #562869,1047329 | | +----------------+-----------------+-----------------+-----------------------------+ | ARM | MMU-600 | #1076982,1209401| N/A | +----------------+-----------------+-----------------+-----------------------------+ diff --git a/Documentation/arch/riscv/hwprobe.rst b/Documentation/arch/riscv/hwprobe.rst index 955fbcd19ce90f..f273ea15a8e836 100644 --- a/Documentation/arch/riscv/hwprobe.rst +++ b/Documentation/arch/riscv/hwprobe.rst @@ -293,3 +293,13 @@ The following keys are defined: * :c:macro:`RISCV_HWPROBE_MISALIGNED_VECTOR_UNSUPPORTED`: Misaligned vector accesses are not supported at all and will generate a misaligned address fault. + +* :c:macro:`RISCV_HWPROBE_KEY_VENDOR_EXT_THEAD_0`: A bitmask containing the + thead vendor extensions that are compatible with the + :c:macro:`RISCV_HWPROBE_BASE_BEHAVIOR_IMA`: base system behavior. + + * T-HEAD + + * :c:macro:`RISCV_HWPROBE_VENDOR_EXT_XTHEADVECTOR`: The xtheadvector vendor + extension is supported in the T-Head ISA extensions spec starting from + commit a18c801634 ("Add T-Head VECTOR vendor extension. "). diff --git a/Documentation/arch/x86/amd-memory-encryption.rst b/Documentation/arch/x86/amd-memory-encryption.rst index 6df3264f23b97d..bd840df708ea4d 100644 --- a/Documentation/arch/x86/amd-memory-encryption.rst +++ b/Documentation/arch/x86/amd-memory-encryption.rst @@ -130,8 +130,126 @@ SNP feature support. More details in AMD64 APM[1] Vol 2: 15.34.10 SEV_STATUS MSR +Reverse Map Table (RMP) +======================= + +The RMP is a structure in system memory that is used to ensure a one-to-one +mapping between system physical addresses and guest physical addresses. Each +page of memory that is potentially assignable to guests has one entry within +the RMP. + +The RMP table can be either contiguous in memory or a collection of segments +in memory. + +Contiguous RMP +-------------- + +Support for this form of the RMP is present when support for SEV-SNP is +present, which can be determined using the CPUID instruction:: + + 0x8000001f[eax]: + Bit[4] indicates support for SEV-SNP + +The location of the RMP is identified to the hardware through two MSRs:: + + 0xc0010132 (RMP_BASE): + System physical address of the first byte of the RMP + + 0xc0010133 (RMP_END): + System physical address of the last byte of the RMP + +Hardware requires that RMP_BASE and (RPM_END + 1) be 8KB aligned, but SEV +firmware increases the alignment requirement to require a 1MB alignment. + +The RMP consists of a 16KB region used for processor bookkeeping followed +by the RMP entries, which are 16 bytes in size. The size of the RMP +determines the range of physical memory that the hypervisor can assign to +SEV-SNP guests. The RMP covers the system physical address from:: + + 0 to ((RMP_END + 1 - RMP_BASE - 16KB) / 16B) x 4KB. + +The current Linux support relies on BIOS to allocate/reserve the memory for +the RMP and to set RMP_BASE and RMP_END appropriately. Linux uses the MSR +values to locate the RMP and determine the size of the RMP. The RMP must +cover all of system memory in order for Linux to enable SEV-SNP. + +Segmented RMP +------------- + +Segmented RMP support is a new way of representing the layout of an RMP. +Initial RMP support required the RMP table to be contiguous in memory. +RMP accesses from a NUMA node on which the RMP doesn't reside +can take longer than accesses from a NUMA node on which the RMP resides. +Segmented RMP support allows the RMP entries to be located on the same +node as the memory the RMP is covering, potentially reducing latency +associated with accessing an RMP entry associated with the memory. Each +RMP segment covers a specific range of system physical addresses. + +Support for this form of the RMP can be determined using the CPUID +instruction:: + + 0x8000001f[eax]: + Bit[23] indicates support for segmented RMP + +If supported, segmented RMP attributes can be found using the CPUID +instruction:: + + 0x80000025[eax]: + Bits[5:0] minimum supported RMP segment size + Bits[11:6] maximum supported RMP segment size + + 0x80000025[ebx]: + Bits[9:0] number of cacheable RMP segment definitions + Bit[10] indicates if the number of cacheable RMP segments + is a hard limit + +To enable a segmented RMP, a new MSR is available:: + + 0xc0010136 (RMP_CFG): + Bit[0] indicates if segmented RMP is enabled + Bits[13:8] contains the size of memory covered by an RMP + segment (expressed as a power of 2) + +The RMP segment size defined in the RMP_CFG MSR applies to all segments +of the RMP. Therefore each RMP segment covers a specific range of system +physical addresses. For example, if the RMP_CFG MSR value is 0x2401, then +the RMP segment coverage value is 0x24 => 36, meaning the size of memory +covered by an RMP segment is 64GB (1 << 36). So the first RMP segment +covers physical addresses from 0 to 0xF_FFFF_FFFF, the second RMP segment +covers physical addresses from 0x10_0000_0000 to 0x1F_FFFF_FFFF, etc. + +When a segmented RMP is enabled, RMP_BASE points to the RMP bookkeeping +area as it does today (16K in size). However, instead of RMP entries +beginning immediately after the bookkeeping area, there is a 4K RMP +segment table (RST). Each entry in the RST is 8-bytes in size and represents +an RMP segment:: + + Bits[19:0] mapped size (in GB) + The mapped size can be less than the defined segment size. + A value of zero, indicates that no RMP exists for the range + of system physical addresses associated with this segment. + Bits[51:20] segment physical address + This address is left shift 20-bits (or just masked when + read) to form the physical address of the segment (1MB + alignment). + +The RST can hold 512 segment entries but can be limited in size to the number +of cacheable RMP segments (CPUID 0x80000025_EBX[9:0]) if the number of cacheable +RMP segments is a hard limit (CPUID 0x80000025_EBX[10]). + +The current Linux support relies on BIOS to allocate/reserve the memory for +the segmented RMP (the bookkeeping area, RST, and all segments), build the RST +and to set RMP_BASE, RMP_END, and RMP_CFG appropriately. Linux uses the MSR +values to locate the RMP and determine the size and location of the RMP +segments. The RMP must cover all of system memory in order for Linux to enable +SEV-SNP. + +More details in the AMD64 APM Vol 2, section "15.36.3 Reverse Map Table", +docID: 24593. + Secure VM Service Module (SVSM) =============================== + SNP provides a feature called Virtual Machine Privilege Levels (VMPL) which defines four privilege levels at which guest software can run. The most privileged level is 0 and numerically higher numbers have lesser privileges. diff --git a/Documentation/arch/x86/boot.rst b/Documentation/arch/x86/boot.rst index ad2d8ddad27fe4..76f53d3450e739 100644 --- a/Documentation/arch/x86/boot.rst +++ b/Documentation/arch/x86/boot.rst @@ -77,7 +77,7 @@ Protocol 2.14 BURNT BY INCORRECT COMMIT Protocol 2.15 (Kernel 5.5) Added the kernel_info and kernel_info.setup_type_max. ============= ============================================================ - .. note:: +.. note:: The protocol version number should be changed only if the setup header is changed. There is no need to update the version number if boot_params or kernel_info are changed. Additionally, it is recommended to use @@ -95,27 +95,27 @@ Memory Layout The traditional memory map for the kernel loader, used for Image or zImage kernels, typically looks like:: - | | - 0A0000 +------------------------+ - | Reserved for BIOS | Do not use. Reserved for BIOS EBDA. - 09A000 +------------------------+ - | Command line | - | Stack/heap | For use by the kernel real-mode code. - 098000 +------------------------+ - | Kernel setup | The kernel real-mode code. - 090200 +------------------------+ - | Kernel boot sector | The kernel legacy boot sector. - 090000 +------------------------+ - | Protected-mode kernel | The bulk of the kernel image. - 010000 +------------------------+ - | Boot loader | <- Boot sector entry point 0000:7C00 - 001000 +------------------------+ - | Reserved for MBR/BIOS | - 000800 +------------------------+ - | Typically used by MBR | - 000600 +------------------------+ - | BIOS use only | - 000000 +------------------------+ + | | + 0A0000 +------------------------+ + | Reserved for BIOS | Do not use. Reserved for BIOS EBDA. + 09A000 +------------------------+ + | Command line | + | Stack/heap | For use by the kernel real-mode code. + 098000 +------------------------+ + | Kernel setup | The kernel real-mode code. + 090200 +------------------------+ + | Kernel boot sector | The kernel legacy boot sector. + 090000 +------------------------+ + | Protected-mode kernel | The bulk of the kernel image. + 010000 +------------------------+ + | Boot loader | <- Boot sector entry point 0000:7C00 + 001000 +------------------------+ + | Reserved for MBR/BIOS | + 000800 +------------------------+ + | Typically used by MBR | + 000600 +------------------------+ + | BIOS use only | + 000000 +------------------------+ When using bzImage, the protected-mode kernel was relocated to 0x100000 ("high memory"), and the kernel real-mode block (boot sector, @@ -142,28 +142,28 @@ above the 0x9A000 point; too many BIOSes will break above that point. For a modern bzImage kernel with boot protocol version >= 2.02, a memory layout like the following is suggested:: - ~ ~ - | Protected-mode kernel | - 100000 +------------------------+ - | I/O memory hole | - 0A0000 +------------------------+ - | Reserved for BIOS | Leave as much as possible unused - ~ ~ - | Command line | (Can also be below the X+10000 mark) - X+10000 +------------------------+ - | Stack/heap | For use by the kernel real-mode code. - X+08000 +------------------------+ - | Kernel setup | The kernel real-mode code. - | Kernel boot sector | The kernel legacy boot sector. - X +------------------------+ - | Boot loader | <- Boot sector entry point 0000:7C00 - 001000 +------------------------+ - | Reserved for MBR/BIOS | - 000800 +------------------------+ - | Typically used by MBR | - 000600 +------------------------+ - | BIOS use only | - 000000 +------------------------+ + ~ ~ + | Protected-mode kernel | + 100000 +------------------------+ + | I/O memory hole | + 0A0000 +------------------------+ + | Reserved for BIOS | Leave as much as possible unused + ~ ~ + | Command line | (Can also be below the X+10000 mark) + X+10000 +------------------------+ + | Stack/heap | For use by the kernel real-mode code. + X+08000 +------------------------+ + | Kernel setup | The kernel real-mode code. + | Kernel boot sector | The kernel legacy boot sector. + X +------------------------+ + | Boot loader | <- Boot sector entry point 0000:7C00 + 001000 +------------------------+ + | Reserved for MBR/BIOS | + 000800 +------------------------+ + | Typically used by MBR | + 000600 +------------------------+ + | BIOS use only | + 000000 +------------------------+ ... where the address X is as low as the design of the boot loader permits. @@ -229,22 +229,22 @@ Offset/Size Proto Name Meaning =========== ======== ===================== ============================================ .. note:: - (1) For backwards compatibility, if the setup_sects field contains 0, the - real value is 4. + (1) For backwards compatibility, if the setup_sects field contains 0, + the real value is 4. - (2) For boot protocol prior to 2.04, the upper two bytes of the syssize - field are unusable, which means the size of a bzImage kernel - cannot be determined. + (2) For boot protocol prior to 2.04, the upper two bytes of the syssize + field are unusable, which means the size of a bzImage kernel + cannot be determined. - (3) Ignored, but safe to set, for boot protocols 2.02-2.09. + (3) Ignored, but safe to set, for boot protocols 2.02-2.09. If the "HdrS" (0x53726448) magic number is not found at offset 0x202, the boot protocol version is "old". Loading an old kernel, the following parameters should be assumed:: - Image type = zImage - initrd not supported - Real-mode kernel must be located at 0x90000. + Image type = zImage + initrd not supported + Real-mode kernel must be located at 0x90000. Otherwise, the "version" field contains the protocol version, e.g. protocol version 2.01 will contain 0x0201 in this field. When @@ -265,7 +265,7 @@ All general purpose boot loaders should write the fields marked nonstandard address should fill in the fields marked (reloc); other boot loaders can ignore those fields. -The byte order of all fields is littleendian (this is x86, after all.) +The byte order of all fields is little endian (this is x86, after all.) ============ =========== Field name: setup_sects @@ -365,7 +365,7 @@ Offset/size: 0x206/2 Protocol: 2.00+ ============ ======= - Contains the boot protocol version, in (major << 8)+minor format, + Contains the boot protocol version, in (major << 8) + minor format, e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version 10.17. @@ -397,17 +397,17 @@ Protocol: 2.00+ If set to a nonzero value, contains a pointer to a NUL-terminated human-readable kernel version number string, less 0x200. This can be used to display the kernel version to the user. This value - should be less than (0x200*setup_sects). + should be less than (0x200 * setup_sects). For example, if this value is set to 0x1c00, the kernel version number string can be found at offset 0x1e00 in the kernel file. This is a valid value if and only if the "setup_sects" field contains the value 15 or higher, as:: - 0x1c00 < 15*0x200 (= 0x1e00) but - 0x1c00 >= 14*0x200 (= 0x1c00) + 0x1c00 < 15 * 0x200 (= 0x1e00) but + 0x1c00 >= 14 * 0x200 (= 0x1c00) - 0x1c00 >> 9 = 14, So the minimum value for setup_secs is 15. + 0x1c00 >> 9 = 14, So the minimum value for setup_secs is 15. ============ ================== Field name: type_of_loader @@ -427,9 +427,9 @@ Protocol: 2.00+ For example, for T = 0x15, V = 0x234, write:: - type_of_loader <- 0xE4 - ext_loader_type <- 0x05 - ext_loader_ver <- 0x23 + type_of_loader <- 0xE4 + ext_loader_type <- 0x05 + ext_loader_ver <- 0x23 Assigned boot loader ids (hexadecimal): @@ -686,7 +686,7 @@ Protocol: 2.10+ If a boot loader makes use of this field, it should update the kernel_alignment field with the alignment unit desired; typically:: - kernel_alignment = 1 << min_alignment + kernel_alignment = 1 << min_alignment; There may be a considerable performance cost with an excessively misaligned kernel. Therefore, a loader should typically try each @@ -754,7 +754,7 @@ Protocol: 2.07+ 0x00000000 The default x86/PC environment 0x00000001 lguest 0x00000002 Xen - 0x00000003 Moorestown MID + 0x00000003 Intel MID (Moorestown, CloverTrail, Merrifield, Moorefield) 0x00000004 CE4100 TV Platform ========== ============================== @@ -808,13 +808,13 @@ Protocol: 2.09+ parameters passing mechanism. The definition of struct setup_data is as follow:: - struct setup_data { - u64 next; - u32 type; - u32 len; - u8 data[0]; - }; - + struct setup_data { + __u64 next; + __u32 type; + __u32 len; + __u8 data[]; + } + Where, the next is a 64-bit physical pointer to the next node of linked list, the next field of the last node is 0; the type is used to identify the contents of data; the len is the length of data @@ -834,12 +834,12 @@ Protocol: 2.09+ Thus setup_indirect struct and SETUP_INDIRECT type were introduced in protocol 2.15:: - struct setup_indirect { - __u32 type; - __u32 reserved; /* Reserved, must be set to zero. */ - __u64 len; - __u64 addr; - }; + struct setup_indirect { + __u32 type; + __u32 reserved; /* Reserved, must be set to zero. */ + __u64 len; + __u64 addr; + }; The type member is a SETUP_INDIRECT | SETUP_* type. However, it cannot be SETUP_INDIRECT itself since making the setup_indirect a tree structure @@ -849,17 +849,17 @@ Protocol: 2.09+ Let's give an example how to point to SETUP_E820_EXT data using setup_indirect. In this case setup_data and setup_indirect will look like this:: - struct setup_data { - __u64 next = 0 or ; - __u32 type = SETUP_INDIRECT; - __u32 len = sizeof(setup_indirect); - __u8 data[sizeof(setup_indirect)] = struct setup_indirect { - __u32 type = SETUP_INDIRECT | SETUP_E820_EXT; - __u32 reserved = 0; - __u64 len = ; - __u64 addr = ; - } - } + struct setup_data { + .next = 0, /* or */ + .type = SETUP_INDIRECT, + .len = sizeof(setup_indirect), + .data[sizeof(setup_indirect)] = (struct setup_indirect) { + .type = SETUP_INDIRECT | SETUP_E820_EXT, + .reserved = 0, + .len = , + .addr = , + }, + } .. note:: SETUP_INDIRECT | SETUP_NONE objects cannot be properly distinguished @@ -896,19 +896,19 @@ Offset/size: 0x260/4 The kernel runtime start address is determined by the following algorithm:: - if (relocatable_kernel) { - if (load_address < pref_address) - load_address = pref_address; - runtime_start = align_up(load_address, kernel_alignment); - } else { - runtime_start = pref_address; - } + if (relocatable_kernel) { + if (load_address < pref_address) + load_address = pref_address; + runtime_start = align_up(load_address, kernel_alignment); + } else { + runtime_start = pref_address; + } Hence the necessary memory window location and size can be estimated by a boot loader as:: - memory_window_start = runtime_start; - memory_window_size = init_size; + memory_window_start = runtime_start; + memory_window_size = init_size; ============ =============== Field name: handover_offset @@ -938,12 +938,12 @@ The kernel_info =============== The relationships between the headers are analogous to the various data -sections: +sections:: setup_header = .data boot_params/setup_data = .bss -What is missing from the above list? That's right: +What is missing from the above list? That's right:: kernel_info = .rodata @@ -975,22 +975,22 @@ after kernel_info_var_len_data label. Each chunk of variable size data has to be prefixed with header/magic and its size, e.g.:: kernel_info: - .ascii "LToP" /* Header, Linux top (structure). */ - .long kernel_info_var_len_data - kernel_info - .long kernel_info_end - kernel_info - .long 0x01234567 /* Some fixed size data for the bootloaders. */ + .ascii "LToP" /* Header, Linux top (structure). */ + .long kernel_info_var_len_data - kernel_info + .long kernel_info_end - kernel_info + .long 0x01234567 /* Some fixed size data for the bootloaders. */ kernel_info_var_len_data: - example_struct: /* Some variable size data for the bootloaders. */ - .ascii "0123" /* Header/Magic. */ - .long example_struct_end - example_struct - .ascii "Struct" - .long 0x89012345 + example_struct: /* Some variable size data for the bootloaders. */ + .ascii "0123" /* Header/Magic. */ + .long example_struct_end - example_struct + .ascii "Struct" + .long 0x89012345 example_struct_end: - example_strings: /* Some variable size data for the bootloaders. */ - .ascii "ABCD" /* Header/Magic. */ - .long example_strings_end - example_strings - .asciz "String_0" - .asciz "String_1" + example_strings: /* Some variable size data for the bootloaders. */ + .ascii "ABCD" /* Header/Magic. */ + .long example_strings_end - example_strings + .asciz "String_0" + .asciz "String_1" example_strings_end: kernel_info_end: @@ -1139,67 +1139,63 @@ mode segment. Such a boot loader should enter the following fields in the header:: - unsigned long base_ptr; /* base address for real-mode segment */ - - if ( setup_sects == 0 ) { - setup_sects = 4; - } + unsigned long base_ptr; /* base address for real-mode segment */ - if ( protocol >= 0x0200 ) { - type_of_loader = ; - if ( loading_initrd ) { - ramdisk_image = ; - ramdisk_size = ; - } + if (setup_sects == 0) + setup_sects = 4; - if ( protocol >= 0x0202 && loadflags & 0x01 ) - heap_end = 0xe000; - else - heap_end = 0x9800; + if (protocol >= 0x0200) { + type_of_loader = ; + if (loading_initrd) { + ramdisk_image = ; + ramdisk_size = ; + } - if ( protocol >= 0x0201 ) { - heap_end_ptr = heap_end - 0x200; - loadflags |= 0x80; /* CAN_USE_HEAP */ - } + if (protocol >= 0x0202 && loadflags & 0x01) + heap_end = 0xe000; + else + heap_end = 0x9800; - if ( protocol >= 0x0202 ) { - cmd_line_ptr = base_ptr + heap_end; - strcpy(cmd_line_ptr, cmdline); - } else { - cmd_line_magic = 0xA33F; - cmd_line_offset = heap_end; - setup_move_size = heap_end + strlen(cmdline)+1; - strcpy(base_ptr+cmd_line_offset, cmdline); - } - } else { - /* Very old kernel */ + if (protocol >= 0x0201) { + heap_end_ptr = heap_end - 0x200; + loadflags |= 0x80; /* CAN_USE_HEAP */ + } - heap_end = 0x9800; + if (protocol >= 0x0202) { + cmd_line_ptr = base_ptr + heap_end; + strcpy(cmd_line_ptr, cmdline); + } else { + cmd_line_magic = 0xA33F; + cmd_line_offset = heap_end; + setup_move_size = heap_end + strlen(cmdline) + 1; + strcpy(base_ptr + cmd_line_offset, cmdline); + } + } else { + /* Very old kernel */ - cmd_line_magic = 0xA33F; - cmd_line_offset = heap_end; + heap_end = 0x9800; - /* A very old kernel MUST have its real-mode code - loaded at 0x90000 */ + cmd_line_magic = 0xA33F; + cmd_line_offset = heap_end; - if ( base_ptr != 0x90000 ) { - /* Copy the real-mode kernel */ - memcpy(0x90000, base_ptr, (setup_sects+1)*512); - base_ptr = 0x90000; /* Relocated */ - } + /* A very old kernel MUST have its real-mode code loaded at 0x90000 */ + if (base_ptr != 0x90000) { + /* Copy the real-mode kernel */ + memcpy(0x90000, base_ptr, (setup_sects + 1) * 512); + base_ptr = 0x90000; /* Relocated */ + } - strcpy(0x90000+cmd_line_offset, cmdline); + strcpy(0x90000 + cmd_line_offset, cmdline); - /* It is recommended to clear memory up to the 32K mark */ - memset(0x90000 + (setup_sects+1)*512, 0, - (64-(setup_sects+1))*512); - } + /* It is recommended to clear memory up to the 32K mark */ + memset(0x90000 + (setup_sects + 1) * 512, 0, (64 - (setup_sects + 1)) * 512); + } Loading The Rest of The Kernel ============================== -The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512 +The 32-bit (non-real-mode) kernel starts at offset (setup_sects + 1) * 512 in the kernel file (again, if setup_sects == 0 the real value is 4.) It should be loaded at address 0x10000 for Image/zImage kernels and 0x100000 for bzImage kernels. @@ -1207,13 +1203,14 @@ It should be loaded at address 0x10000 for Image/zImage kernels and The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01 bit (LOAD_HIGH) in the loadflags field is set:: - is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01); - load_address = is_bzImage ? 0x100000 : 0x10000; + is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01); + load_address = is_bzImage ? 0x100000 : 0x10000; -Note that Image/zImage kernels can be up to 512K in size, and thus use -the entire 0x10000-0x90000 range of memory. This means it is pretty -much a requirement for these kernels to load the real-mode part at -0x90000. bzImage kernels allow much more flexibility. +.. note:: + Image/zImage kernels can be up to 512K in size, and thus use the entire + 0x10000-0x90000 range of memory. This means it is pretty much a + requirement for these kernels to load the real-mode part at 0x90000. + bzImage kernels allow much more flexibility. Special Command Line Options ============================ @@ -1282,19 +1279,20 @@ es = ss. In our example from above, we would do:: - /* Note: in the case of the "old" kernel protocol, base_ptr must - be == 0x90000 at this point; see the previous sample code */ - - seg = base_ptr >> 4; + /* + * Note: in the case of the "old" kernel protocol, base_ptr must + * be == 0x90000 at this point; see the previous sample code. + */ + seg = base_ptr >> 4; - cli(); /* Enter with interrupts disabled! */ + cli(); /* Enter with interrupts disabled! */ - /* Set up the real-mode kernel stack */ - _SS = seg; - _SP = heap_end; + /* Set up the real-mode kernel stack */ + _SS = seg; + _SP = heap_end; - _DS = _ES = _FS = _GS = seg; - jmp_far(seg+0x20, 0); /* Run the kernel */ + _DS = _ES = _FS = _GS = seg; + jmp_far(seg + 0x20, 0); /* Run the kernel */ If your boot sector accesses a floppy drive, it is recommended to switch off the floppy motor before running the kernel, since the @@ -1349,7 +1347,7 @@ from offset 0x01f1 of kernel image on should be loaded into struct boot_params and examined. The end of setup header can be calculated as follow:: - 0x0202 + byte value at offset 0x0201 + 0x0202 + byte value at offset 0x0201 In addition to read/modify/write the setup header of the struct boot_params as that of 16-bit boot protocol, the boot loader should @@ -1385,7 +1383,7 @@ Then, the setup header at offset 0x01f1 of kernel image on should be loaded into struct boot_params and examined. The end of setup header can be calculated as follows:: - 0x0202 + byte value at offset 0x0201 + 0x0202 + byte value at offset 0x0201 In addition to read/modify/write the setup header of the struct boot_params as that of 16-bit boot protocol, the boot loader should @@ -1427,7 +1425,7 @@ execution context provided by the EFI firmware. The function prototype for the handover entry point looks like this:: - efi_stub_entry(void *handle, efi_system_table_t *table, struct boot_params *bp) + void efi_stub_entry(void *handle, efi_system_table_t *table, struct boot_params *bp); 'handle' is the EFI image handle passed to the boot loader by the EFI firmware, 'table' is the EFI system table - these are the first two @@ -1442,12 +1440,13 @@ The boot loader *must* fill out the following fields in bp:: All other fields should be zero. -NOTE: The EFI Handover Protocol is deprecated in favour of the ordinary PE/COFF - entry point, combined with the LINUX_EFI_INITRD_MEDIA_GUID based initrd - loading protocol (refer to [0] for an example of the bootloader side of - this), which removes the need for any knowledge on the part of the EFI - bootloader regarding the internal representation of boot_params or any - requirements/limitations regarding the placement of the command line - and ramdisk in memory, or the placement of the kernel image itself. +.. note:: + The EFI Handover Protocol is deprecated in favour of the ordinary PE/COFF + entry point, combined with the LINUX_EFI_INITRD_MEDIA_GUID based initrd + loading protocol (refer to [0] for an example of the bootloader side of + this), which removes the need for any knowledge on the part of the EFI + bootloader regarding the internal representation of boot_params or any + requirements/limitations regarding the placement of the command line + and ramdisk in memory, or the placement of the kernel image itself. [0] https://github.com/u-boot/u-boot/commit/ec80b4735a593961fe701cc3a5d717d4739b0fd0 diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst index a824affd741d1e..6768fc1fad166f 100644 --- a/Documentation/arch/x86/resctrl.rst +++ b/Documentation/arch/x86/resctrl.rst @@ -384,6 +384,16 @@ When monitoring is enabled all MON groups will also contain: Available only with debug option. The identifier used by hardware for the monitor group. On x86 this is the RMID. +When the "mba_MBps" mount option is used all CTRL_MON groups will also contain: + +"mba_MBps_event": + Reading this file shows which memory bandwidth event is used + as input to the software feedback loop that keeps memory bandwidth + below the value specified in the schemata file. Writing the + name of one of the supported memory bandwidth events found in + /sys/fs/resctrl/info/L3_MON/mon_features changes the input + event. + Resource allocation rules ------------------------- diff --git a/Documentation/arch/x86/topology.rst b/Documentation/arch/x86/topology.rst index 7352ab89a55ae4..c12837e61bda53 100644 --- a/Documentation/arch/x86/topology.rst +++ b/Documentation/arch/x86/topology.rst @@ -135,6 +135,10 @@ Thread-related topology information in the kernel: The ID of the core to which a thread belongs. It is also printed in /proc/cpuinfo "core_id." + - topology_logical_core_id(); + + The logical core ID to which a thread belongs. + System topology examples diff --git a/Documentation/arch/x86/x86_64/boot-options.rst b/Documentation/arch/x86/x86_64/boot-options.rst deleted file mode 100644 index d69e3cfbdba5a5..00000000000000 --- a/Documentation/arch/x86/x86_64/boot-options.rst +++ /dev/null @@ -1,312 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -=========================== -AMD64 Specific Boot Options -=========================== - -There are many others (usually documented in driver documentation), but -only the AMD64 specific ones are listed here. - -Machine check -============= -Please see Documentation/arch/x86/x86_64/machinecheck.rst for sysfs runtime tunables. - - mce=off - Disable machine check - mce=no_cmci - Disable CMCI(Corrected Machine Check Interrupt) that - Intel processor supports. Usually this disablement is - not recommended, but it might be handy if your hardware - is misbehaving. - Note that you'll get more problems without CMCI than with - due to the shared banks, i.e. you might get duplicated - error logs. - mce=dont_log_ce - Don't make logs for corrected errors. All events reported - as corrected are silently cleared by OS. - This option will be useful if you have no interest in any - of corrected errors. - mce=ignore_ce - Disable features for corrected errors, e.g. polling timer - and CMCI. All events reported as corrected are not cleared - by OS and remained in its error banks. - Usually this disablement is not recommended, however if - there is an agent checking/clearing corrected errors - (e.g. BIOS or hardware monitoring applications), conflicting - with OS's error handling, and you cannot deactivate the agent, - then this option will be a help. - mce=no_lmce - Do not opt-in to Local MCE delivery. Use legacy method - to broadcast MCEs. - mce=bootlog - Enable logging of machine checks left over from booting. - Disabled by default on AMD Fam10h and older because some BIOS - leave bogus ones. - If your BIOS doesn't do that it's a good idea to enable though - to make sure you log even machine check events that result - in a reboot. On Intel systems it is enabled by default. - mce=nobootlog - Disable boot machine check logging. - mce=monarchtimeout (number) - monarchtimeout: - Sets the time in us to wait for other CPUs on machine checks. 0 - to disable. - mce=bios_cmci_threshold - Don't overwrite the bios-set CMCI threshold. This boot option - prevents Linux from overwriting the CMCI threshold set by the - bios. Without this option, Linux always sets the CMCI - threshold to 1. Enabling this may make memory predictive failure - analysis less effective if the bios sets thresholds for memory - errors since we will not see details for all errors. - mce=recovery - Force-enable recoverable machine check code paths - - nomce (for compatibility with i386) - same as mce=off - - Everything else is in sysfs now. - -APICs -===== - - apic - Use IO-APIC. Default - - noapic - Don't use the IO-APIC. - - disableapic - Don't use the local APIC - - nolapic - Don't use the local APIC (alias for i386 compatibility) - - pirq=... - See Documentation/arch/x86/i386/IO-APIC.rst - - noapictimer - Don't set up the APIC timer - - no_timer_check - Don't check the IO-APIC timer. This can work around - problems with incorrect timer initialization on some boards. - - apicpmtimer - Do APIC timer calibration using the pmtimer. Implies - apicmaintimer. Useful when your PIT timer is totally broken. - -Timing -====== - - notsc - Deprecated, use tsc=unstable instead. - - nohpet - Don't use the HPET timer. - -Idle loop -========= - - idle=poll - Don't do power saving in the idle loop using HLT, but poll for rescheduling - event. This will make the CPUs eat a lot more power, but may be useful - to get slightly better performance in multiprocessor benchmarks. It also - makes some profiling using performance counters more accurate. - Please note that on systems with MONITOR/MWAIT support (like Intel EM64T - CPUs) this option has no performance advantage over the normal idle loop. - It may also interact badly with hyperthreading. - -Rebooting -========= - - reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] | p[ci] [, [w]arm | [c]old] - bios - Use the CPU reboot vector for warm reset - warm - Don't set the cold reboot flag - cold - Set the cold reboot flag - triple - Force a triple fault (init) - kbd - Use the keyboard controller. cold reset (default) - acpi - Use the ACPI RESET_REG in the FADT. If ACPI is not configured or - the ACPI reset does not work, the reboot path attempts the reset - using the keyboard controller. - efi - Use efi reset_system runtime service. If EFI is not configured or - the EFI reset does not work, the reboot path attempts the reset using - the keyboard controller. - pci - Use a write to the PCI config space register 0xcf9 to trigger reboot. - - Using warm reset will be much faster especially on big memory - systems because the BIOS will not go through the memory check. - Disadvantage is that not all hardware will be completely reinitialized - on reboot so there may be boot problems on some systems. - - reboot=force - Don't stop other CPUs on reboot. This can make reboot more reliable - in some cases. - - reboot=default - There are some built-in platform specific "quirks" - you may see: - "reboot: series board detected. Selecting for reboots." - In the case where you think the quirk is in error (e.g. you have - newer BIOS, or newer board) using this option will ignore the built-in - quirk table, and use the generic default reboot actions. - -NUMA -==== - - numa=off - Only set up a single NUMA node spanning all memory. - - numa=noacpi - Don't parse the SRAT table for NUMA setup - - numa=nohmat - Don't parse the HMAT table for NUMA setup, or soft-reserved memory - partitioning. - -ACPI -==== - - acpi=off - Don't enable ACPI - acpi=ht - Use ACPI boot table parsing, but don't enable ACPI interpreter - acpi=force - Force ACPI on (currently not needed) - acpi=strict - Disable out of spec ACPI workarounds. - acpi_sci={edge,level,high,low} - Set up ACPI SCI interrupt. - acpi=noirq - Don't route interrupts - acpi=nocmcff - Disable firmware first mode for corrected errors. This - disables parsing the HEST CMC error source to check if - firmware has set the FF flag. This may result in - duplicate corrected error reports. - -PCI -=== - - pci=off - Don't use PCI - pci=conf1 - Use conf1 access. - pci=conf2 - Use conf2 access. - pci=rom - Assign ROMs. - pci=assign-busses - Assign busses - pci=irqmask=MASK - Set PCI interrupt mask to MASK - pci=lastbus=NUMBER - Scan up to NUMBER busses, no matter what the mptable says. - pci=noacpi - Don't use ACPI to set up PCI interrupt routing. - -IOMMU (input/output memory management unit) -=========================================== -Multiple x86-64 PCI-DMA mapping implementations exist, for example: - - 1. : use no hardware/software IOMMU at all - (e.g. because you have < 3 GB memory). - Kernel boot message: "PCI-DMA: Disabling IOMMU" - - 2. : AMD GART based hardware IOMMU. - Kernel boot message: "PCI-DMA: using GART IOMMU" - - 3. : Software IOMMU implementation. Used - e.g. if there is no hardware IOMMU in the system and it is need because - you have >3GB memory or told the kernel to us it (iommu=soft)) - Kernel boot message: "PCI-DMA: Using software bounce buffering - for IO (SWIOTLB)" - -:: - - iommu=[][,noagp][,off][,force][,noforce] - [,memaper[=]][,merge][,fullflush][,nomerge] - [,noaperture] - -General iommu options: - - off - Don't initialize and use any kind of IOMMU. - noforce - Don't force hardware IOMMU usage when it is not needed. (default). - force - Force the use of the hardware IOMMU even when it is - not actually needed (e.g. because < 3 GB memory). - soft - Use software bounce buffering (SWIOTLB) (default for - Intel machines). This can be used to prevent the usage - of an available hardware IOMMU. - -iommu options only relevant to the AMD GART hardware IOMMU: - - - Set the size of the remapping area in bytes. - allowed - Overwrite iommu off workarounds for specific chipsets. - fullflush - Flush IOMMU on each allocation (default). - nofullflush - Don't use IOMMU fullflush. - memaper[=] - Allocate an own aperture over RAM with size 32MB<[,force,noforce] - - Prereserve that many 2K slots for the software IO bounce buffering. - force - Force all IO through the software TLB. - noforce - Do not initialize the software TLB. - - -Miscellaneous -============= - - nogbpages - Do not use GB pages for kernel direct mappings. - gbpages - Use GB pages for kernel direct mappings. - - -AMD SEV (Secure Encrypted Virtualization) -========================================= -Options relating to AMD SEV, specified via the following format: - -:: - - sev=option1[,option2] - -The available options are: - - debug - Enable debug messages. - - nosnp - Do not enable SEV-SNP (applies to host/hypervisor only). Setting - 'nosnp' avoids the RMP check overhead in memory accesses when - users do not want to run SEV-SNP guests. diff --git a/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst b/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst index ba74617d4999ac..970ee94eb55161 100644 --- a/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst +++ b/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst @@ -18,7 +18,7 @@ For more information on the features of cpusets, see Documentation/admin-guide/cgroup-v1/cpusets.rst. There are a number of different configurations you can use for your needs. For more information on the numa=fake command line option and its various ways of -configuring fake nodes, see Documentation/arch/x86/x86_64/boot-options.rst. +configuring fake nodes, see Documentation/admin-guide/kernel-parameters.txt For the purposes of this introduction, we'll assume a very primitive NUMA emulation setup of "numa=fake=4*512,". This will split our system memory into diff --git a/Documentation/arch/x86/x86_64/index.rst b/Documentation/arch/x86/x86_64/index.rst index ad15e9bd623f68..a0261957a08af9 100644 --- a/Documentation/arch/x86/x86_64/index.rst +++ b/Documentation/arch/x86/x86_64/index.rst @@ -7,7 +7,6 @@ x86_64 Support .. toctree:: :maxdepth: 2 - boot-options uefi mm 5level-paging diff --git a/Documentation/arch/x86/x86_64/uefi.rst b/Documentation/arch/x86/x86_64/uefi.rst index fbc30c9a071d3b..e84592dbd6c117 100644 --- a/Documentation/arch/x86/x86_64/uefi.rst +++ b/Documentation/arch/x86/x86_64/uefi.rst @@ -12,14 +12,20 @@ with EFI firmware and specifications are listed below. 1. UEFI specification: http://www.uefi.org -2. Booting Linux kernel on UEFI x86_64 platform requires bootloader - support. Elilo with x86_64 support can be used. +2. Booting Linux kernel on UEFI x86_64 platform can either be + done using the or using a + separate bootloader. 3. x86_64 platform with EFI/UEFI firmware. Mechanics --------- +Refer to to learn how to use the EFI stub. + +Below are general EFI setup guidelines on the x86_64 platform, +regardless of whether you use the EFI stub or a separate bootloader. + - Build the kernel with the following configuration:: CONFIG_FB_EFI=y @@ -31,16 +37,27 @@ Mechanics CONFIG_EFI=y CONFIG_EFIVAR_FS=y or m # optional -- Create a VFAT partition on the disk -- Copy the following to the VFAT partition: +- Create a VFAT partition on the disk with the EFI System flag + You can do this with fdisk with the following commands: + + 1. g - initialize a GPT partition table + 2. n - create a new partition + 3. t - change the partition type to "EFI System" (number 1) + 4. w - write and save the changes + + Afterwards, initialize the VFAT filesystem by running mkfs:: + + mkfs.fat /dev/ + +- Copy the boot files to the VFAT partition: + If you use the EFI stub method, the kernel acts also as an EFI executable. + + You can just copy the bzImage to the EFI/boot/bootx64.efi path on the partition + so that it will automatically get booted, see the page + for additional instructions regarding passage of kernel parameters and initramfs. - elilo bootloader with x86_64 support, elilo configuration file, - kernel image built in first step and corresponding - initrd. Instructions on building elilo and its dependencies - can be found in the elilo sourceforge project. + If you use a custom bootloader, refer to the relevant documentation for help on this part. -- Boot to EFI shell and invoke elilo choosing the kernel image built - in first step. - If some or all EFI runtime services don't work, you can try following kernel command line parameters to turn off some or all EFI runtime services. diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst index 51665a3e6a50cd..1e0e7358e14a96 100644 --- a/Documentation/block/ublk.rst +++ b/Documentation/block/ublk.rst @@ -333,6 +333,4 @@ References .. [#userspace_readme] https://github.com/ming1/ubdsrv/blob/master/README -.. [#stefan] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/ - .. [#xiaoguang] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/ diff --git a/Documentation/core-api/cgroup.rst b/Documentation/core-api/cgroup.rst new file mode 100644 index 00000000000000..734ea21e1e17a3 --- /dev/null +++ b/Documentation/core-api/cgroup.rst @@ -0,0 +1,9 @@ +================== +Cgroup Kernel APIs +================== + +Device Memory Cgroup API (dmemcg) +================================= +.. kernel-doc:: kernel/cgroup/dmem.c + :export: + diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index 563b8fc0002f73..e9789bd381d804 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -53,6 +53,7 @@ Library functionality that is used throughout the kernel. floating-point union_find min_heap + parser Low level entry and exit ======================== @@ -109,6 +110,7 @@ more memory-management documentation in Documentation/mm/index.rst. dma-isa-lpc swiotlb mm-api + cgroup genalloc pin_user_pages boot-time-mm diff --git a/Documentation/core-api/kref.rst b/Documentation/core-api/kref.rst index c61eea6f1bf2bd..8db9ff03d952e6 100644 --- a/Documentation/core-api/kref.rst +++ b/Documentation/core-api/kref.rst @@ -3,7 +3,7 @@ Adding reference counters (krefs) to kernel objects =================================================== :Author: Corey Minyard -:Author: Thomas Hellstrom +:Author: Thomas Hellström A lot of this was lifted from Greg Kroah-Hartman's 2004 OLS paper and presentation on krefs, which can be found at: @@ -321,3 +321,8 @@ rcu grace period after release_entry_rcu was called. That can be accomplished by using kfree_rcu(entry, rhead) as done above, or by calling synchronize_rcu() before using kfree, but note that synchronize_rcu() may sleep for a substantial amount of time. + +Functions and structures +======================== + +.. kernel-doc:: include/linux/kref.h diff --git a/Documentation/core-api/min_heap.rst b/Documentation/core-api/min_heap.rst index 0c636c8b7aa581..683bc6d09f00df 100644 --- a/Documentation/core-api/min_heap.rst +++ b/Documentation/core-api/min_heap.rst @@ -4,6 +4,8 @@ Min Heap API ============ +:Author: Kuan-Wei Chiu + Introduction ============ diff --git a/Documentation/core-api/packing.rst b/Documentation/core-api/packing.rst index 821691f23c541c..0ce2078c8e13dd 100644 --- a/Documentation/core-api/packing.rst +++ b/Documentation/core-api/packing.rst @@ -227,11 +227,119 @@ Intended use Drivers that opt to use this API first need to identify which of the above 3 quirk combinations (for a total of 8) match what the hardware documentation -describes. Then they should wrap the packing() function, creating a new -xxx_packing() that calls it using the proper QUIRK_* one-hot bits set. +describes. + +There are 3 supported usage patterns, detailed below. + +packing() +^^^^^^^^^ + +This API function is deprecated. The packing() function returns an int-encoded error code, which protects the programmer against incorrect API use. The errors are not expected to occur -during runtime, therefore it is reasonable for xxx_packing() to return void -and simply swallow those errors. Optionally it can dump stack or print the -error description. +during runtime, therefore it is reasonable to wrap packing() into a custom +function which returns void and swallows those errors. Optionally it can +dump stack or print the error description. + +.. code-block:: c + + void my_packing(void *buf, u64 *val, int startbit, int endbit, + size_t len, enum packing_op op) + { + int err; + + /* Adjust quirks accordingly */ + err = packing(buf, val, startbit, endbit, len, op, QUIRK_LSW32_IS_FIRST); + if (likely(!err)) + return; + + if (err == -EINVAL) { + pr_err("Start bit (%d) expected to be larger than end (%d)\n", + startbit, endbit); + } else if (err == -ERANGE) { + if ((startbit - endbit + 1) > 64) + pr_err("Field %d-%d too large for 64 bits!\n", + startbit, endbit); + else + pr_err("Cannot store %llx inside bits %d-%d (would truncate)\n", + *val, startbit, endbit); + } + dump_stack(); + } + +pack() and unpack() +^^^^^^^^^^^^^^^^^^^ + +These are const-correct variants of packing(), and eliminate the last "enum +packing_op op" argument. + +Calling pack(...) is equivalent, and preferred, to calling packing(..., PACK). + +Calling unpack(...) is equivalent, and preferred, to calling packing(..., UNPACK). + +pack_fields() and unpack_fields() +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The library exposes optimized functions for the scenario where there are many +fields represented in a buffer, and it encourages consumer drivers to avoid +repetitive calls to pack() and unpack() for each field, but instead use +pack_fields() and unpack_fields(), which reduces the code footprint. + +These APIs use field definitions in arrays of ``struct packed_field_u8`` or +``struct packed_field_u16``, allowing consumer drivers to minimize the size +of these arrays according to their custom requirements. + +The pack_fields() and unpack_fields() API functions are actually macros which +automatically select the appropriate function at compile time, based on the +type of the fields array passed in. + +An additional benefit over pack() and unpack() is that sanity checks on the +field definitions are handled at compile time with ``BUILD_BUG_ON`` rather +than only when the offending code is executed. These functions return void and +wrapping them to handle unexpected errors is not necessary. + +It is recommended, but not required, that you wrap your packed buffer into a +structured type with a fixed size. This generally makes it easier for the +compiler to enforce that the correct size buffer is used. + +Here is an example of how to use the fields APIs: + +.. code-block:: c + + /* Ordering inside the unpacked structure is flexible and can be different + * from the packed buffer. Here, it is optimized to reduce padding. + */ + struct data { + u64 field3; + u32 field4; + u16 field1; + u8 field2; + }; + + #define SIZE 13 + + typdef struct __packed { u8 buf[SIZE]; } packed_buf_t; + + static const struct packed_field_u8 fields[] = { + PACKED_FIELD(100, 90, struct data, field1), + PACKED_FIELD(90, 87, struct data, field2), + PACKED_FIELD(86, 30, struct data, field3), + PACKED_FIELD(29, 0, struct data, field4), + }; + + void unpack_your_data(const packed_buf_t *buf, struct data *unpacked) + { + BUILD_BUG_ON(sizeof(*buf) != SIZE; + + unpack_fields(buf, sizeof(*buf), unpacked, fields, + QUIRK_LITTLE_ENDIAN); + } + + void pack_your_data(const struct data *unpacked, packed_buf_t *buf) + { + BUILD_BUG_ON(sizeof(*buf) != SIZE; + + pack_fields(buf, sizeof(*buf), unpacked, fields, + QUIRK_LITTLE_ENDIAN); + } diff --git a/Documentation/core-api/parser.rst b/Documentation/core-api/parser.rst new file mode 100644 index 00000000000000..45750d04b89517 --- /dev/null +++ b/Documentation/core-api/parser.rst @@ -0,0 +1,17 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +============== +Generic parser +============== + +Overview +======== + +The generic parser is a simple parser for parsing mount options, +filesystem options, driver options, subsystem options, etc. + +Parser API +========== + +.. kernel-doc:: lib/parser.c + :export: diff --git a/Documentation/core-api/symbol-namespaces.rst b/Documentation/core-api/symbol-namespaces.rst index 27a9cccc792c1e..06f766a6aab244 100644 --- a/Documentation/core-api/symbol-namespaces.rst +++ b/Documentation/core-api/symbol-namespaces.rst @@ -41,9 +41,9 @@ entries. In addition to the macros EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL(), that allow exporting of kernel symbols to the kernel symbol table, variants of these are available to export symbols into a certain namespace: EXPORT_SYMBOL_NS() and -EXPORT_SYMBOL_NS_GPL(). They take one additional argument: the namespace. -Please note that due to macro expansion that argument needs to be a -preprocessor symbol. E.g. to export the symbol ``usb_stor_suspend`` into the +EXPORT_SYMBOL_NS_GPL(). They take one additional argument: the namespace as a +string constant. Note that this string must not contain whitespaces. +E.g. to export the symbol ``usb_stor_suspend`` into the namespace ``USB_STORAGE``, use:: EXPORT_SYMBOL_NS(usb_stor_suspend, "USB_STORAGE"); @@ -78,11 +78,10 @@ as this argument has preference over a default symbol namespace. A second option to define the default namespace is directly in the compilation unit as preprocessor statement. The above example would then read:: - #undef DEFAULT_SYMBOL_NAMESPACE #define DEFAULT_SYMBOL_NAMESPACE "USB_COMMON" -within the corresponding compilation unit before any EXPORT_SYMBOL macro is -used. +within the corresponding compilation unit before the #include for +. Typically it's placed before the first #include statement. 3. How to use Symbols exported in Namespaces ============================================ diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst index 77e0ece2b1d6f8..f6a3eef4fe7f0a 100644 --- a/Documentation/core-api/xarray.rst +++ b/Documentation/core-api/xarray.rst @@ -42,8 +42,8 @@ call xa_tag_pointer() to create an entry with a tag, xa_untag_pointer() to turn a tagged entry back into an untagged pointer and xa_pointer_tag() to retrieve the tag of an entry. Tagged pointers use the same bits that are used to distinguish value entries from normal pointers, so you must -decide whether they want to store value entries or tagged pointers in -any particular XArray. +decide whether you want to store value entries or tagged pointers in any +particular XArray. The XArray does not support storing IS_ERR() pointers as some conflict with value entries or internal entries. @@ -52,8 +52,9 @@ An unusual feature of the XArray is the ability to create entries which occupy a range of indices. Once stored to, looking up any index in the range will return the same entry as looking up any other index in the range. Storing to any index will store to all of them. Multi-index -entries can be explicitly split into smaller entries, or storing ``NULL`` -into any entry will cause the XArray to forget about the range. +entries can be explicitly split into smaller entries. Unsetting (using +xa_erase() or xa_store() with ``NULL``) any entry will cause the XArray +to forget about the range. Normal API ========== @@ -63,13 +64,14 @@ for statically allocated XArrays or xa_init() for dynamically allocated ones. A freshly-initialised XArray contains a ``NULL`` pointer at every index. -You can then set entries using xa_store() and get entries -using xa_load(). xa_store will overwrite any entry with the -new entry and return the previous entry stored at that index. You can -use xa_erase() instead of calling xa_store() with a -``NULL`` entry. There is no difference between an entry that has never -been stored to, one that has been erased and one that has most recently -had ``NULL`` stored to it. +You can then set entries using xa_store() and get entries using +xa_load(). xa_store() will overwrite any entry with the new entry and +return the previous entry stored at that index. You can unset entries +using xa_erase() or by setting the entry to ``NULL`` using xa_store(). +There is no difference between an entry that has never been stored to +and one that has been erased with xa_erase(); an entry that has most +recently had ``NULL`` stored to it is also equivalent except if the +XArray was initialized with ``XA_FLAGS_ALLOC``. You can conditionally replace an entry at an index by using xa_cmpxchg(). Like cmpxchg(), it will only succeed if diff --git a/Documentation/dev-tools/gdb-kernel-debugging.rst b/Documentation/dev-tools/gdb-kernel-debugging.rst deleted file mode 100644 index 895285c037c727..00000000000000 --- a/Documentation/dev-tools/gdb-kernel-debugging.rst +++ /dev/null @@ -1,179 +0,0 @@ -.. highlight:: none - -Debugging kernel and modules via gdb -==================================== - -The kernel debugger kgdb, hypervisors like QEMU or JTAG-based hardware -interfaces allow to debug the Linux kernel and its modules during runtime -using gdb. Gdb comes with a powerful scripting interface for python. The -kernel provides a collection of helper scripts that can simplify typical -kernel debugging steps. This is a short tutorial about how to enable and use -them. It focuses on QEMU/KVM virtual machines as target, but the examples can -be transferred to the other gdb stubs as well. - - -Requirements ------------- - -- gdb 7.2+ (recommended: 7.4+) with python support enabled (typically true - for distributions) - - -Setup ------ - -- Create a virtual Linux machine for QEMU/KVM (see www.linux-kvm.org and - www.qemu.org for more details). For cross-development, - https://landley.net/aboriginal/bin keeps a pool of machine images and - toolchains that can be helpful to start from. - -- Build the kernel with CONFIG_GDB_SCRIPTS enabled, but leave - CONFIG_DEBUG_INFO_REDUCED off. If your architecture supports - CONFIG_FRAME_POINTER, keep it enabled. - -- Install that kernel on the guest, turn off KASLR if necessary by adding - "nokaslr" to the kernel command line. - Alternatively, QEMU allows to boot the kernel directly using -kernel, - -append, -initrd command line switches. This is generally only useful if - you do not depend on modules. See QEMU documentation for more details on - this mode. In this case, you should build the kernel with - CONFIG_RANDOMIZE_BASE disabled if the architecture supports KASLR. - -- Build the gdb scripts (required on kernels v5.1 and above):: - - make scripts_gdb - -- Enable the gdb stub of QEMU/KVM, either - - - at VM startup time by appending "-s" to the QEMU command line - - or - - - during runtime by issuing "gdbserver" from the QEMU monitor - console - -- cd /path/to/linux-build - -- Start gdb: gdb vmlinux - - Note: Some distros may restrict auto-loading of gdb scripts to known safe - directories. In case gdb reports to refuse loading vmlinux-gdb.py, add:: - - add-auto-load-safe-path /path/to/linux-build - - to ~/.gdbinit. See gdb help for more details. - -- Attach to the booted guest:: - - (gdb) target remote :1234 - - -Examples of using the Linux-provided gdb helpers ------------------------------------------------- - -- Load module (and main kernel) symbols:: - - (gdb) lx-symbols - loading vmlinux - scanning for modules in /home/user/linux/build - loading @0xffffffffa0020000: /home/user/linux/build/net/netfilter/xt_tcpudp.ko - loading @0xffffffffa0016000: /home/user/linux/build/net/netfilter/xt_pkttype.ko - loading @0xffffffffa0002000: /home/user/linux/build/net/netfilter/xt_limit.ko - loading @0xffffffffa00ca000: /home/user/linux/build/net/packet/af_packet.ko - loading @0xffffffffa003c000: /home/user/linux/build/fs/fuse/fuse.ko - ... - loading @0xffffffffa0000000: /home/user/linux/build/drivers/ata/ata_generic.ko - -- Set a breakpoint on some not yet loaded module function, e.g.:: - - (gdb) b btrfs_init_sysfs - Function "btrfs_init_sysfs" not defined. - Make breakpoint pending on future shared library load? (y or [n]) y - Breakpoint 1 (btrfs_init_sysfs) pending. - -- Continue the target:: - - (gdb) c - -- Load the module on the target and watch the symbols being loaded as well as - the breakpoint hit:: - - loading @0xffffffffa0034000: /home/user/linux/build/lib/libcrc32c.ko - loading @0xffffffffa0050000: /home/user/linux/build/lib/lzo/lzo_compress.ko - loading @0xffffffffa006e000: /home/user/linux/build/lib/zlib_deflate/zlib_deflate.ko - loading @0xffffffffa01b1000: /home/user/linux/build/fs/btrfs/btrfs.ko - - Breakpoint 1, btrfs_init_sysfs () at /home/user/linux/fs/btrfs/sysfs.c:36 - 36 btrfs_kset = kset_create_and_add("btrfs", NULL, fs_kobj); - -- Dump the log buffer of the target kernel:: - - (gdb) lx-dmesg - [ 0.000000] Initializing cgroup subsys cpuset - [ 0.000000] Initializing cgroup subsys cpu - [ 0.000000] Linux version 3.8.0-rc4-dbg+ (... - [ 0.000000] Command line: root=/dev/sda2 resume=/dev/sda1 vga=0x314 - [ 0.000000] e820: BIOS-provided physical RAM map: - [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable - [ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved - .... - -- Examine fields of the current task struct(supported by x86 and arm64 only):: - - (gdb) p $lx_current().pid - $1 = 4998 - (gdb) p $lx_current().comm - $2 = "modprobe\000\000\000\000\000\000\000" - -- Make use of the per-cpu function for the current or a specified CPU:: - - (gdb) p $lx_per_cpu("runqueues").nr_running - $3 = 1 - (gdb) p $lx_per_cpu("runqueues", 2).nr_running - $4 = 0 - -- Dig into hrtimers using the container_of helper:: - - (gdb) set $next = $lx_per_cpu("hrtimer_bases").clock_base[0].active.next - (gdb) p *$container_of($next, "struct hrtimer", "node") - $5 = { - node = { - node = { - __rb_parent_color = 18446612133355256072, - rb_right = 0x0 , - rb_left = 0x0 - }, - expires = { - tv64 = 1835268000000 - } - }, - _softexpires = { - tv64 = 1835268000000 - }, - function = 0xffffffff81078232 , - base = 0xffff88003fd0d6f0, - state = 1, - start_pid = 0, - start_site = 0xffffffff81055c1f , - start_comm = "swapper/2\000\000\000\000\000\000" - } - - -List of commands and functions ------------------------------- - -The number of commands and convenience functions may evolve over the time, -this is just a snapshot of the initial version:: - - (gdb) apropos lx - function lx_current -- Return current task - function lx_module -- Find module by name and return the module variable - function lx_per_cpu -- Return per-cpu variable - function lx_task_by_pid -- Find Linux task by PID and return the task_struct variable - function lx_thread_info -- Calculate Linux thread_info from task variable - lx-dmesg -- Print Linux kernel log buffer - lx-lsmod -- List currently loaded modules - lx-symbols -- (Re-)load symbols of Linux kernel and currently loaded modules - -Detailed help can be obtained via "help " for commands and "help -function " for convenience functions. diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst index 3c0ac08b270911..65c54b27a60b8d 100644 --- a/Documentation/dev-tools/index.rst +++ b/Documentation/dev-tools/index.rst @@ -10,6 +10,9 @@ whole; patches welcome! A brief overview of testing-specific tools can be found in Documentation/dev-tools/testing-overview.rst +Tools that are specific to debugging can be found in +Documentation/process/debugging/index.rst + .. toctree:: :caption: Table of contents :maxdepth: 2 @@ -27,8 +30,6 @@ Documentation/dev-tools/testing-overview.rst kmemleak kcsan kfence - gdb-kernel-debugging - kgdb kselftest kunit/index ktap diff --git a/Documentation/dev-tools/kgdb.rst b/Documentation/dev-tools/kgdb.rst deleted file mode 100644 index cb626a7a000c41..00000000000000 --- a/Documentation/dev-tools/kgdb.rst +++ /dev/null @@ -1,937 +0,0 @@ -================================================= -Using kgdb, kdb and the kernel debugger internals -================================================= - -:Author: Jason Wessel - -Introduction -============ - -The kernel has two different debugger front ends (kdb and kgdb) which -interface to the debug core. It is possible to use either of the -debugger front ends and dynamically transition between them if you -configure the kernel properly at compile and runtime. - -Kdb is simplistic shell-style interface which you can use on a system -console with a keyboard or serial console. You can use it to inspect -memory, registers, process lists, dmesg, and even set breakpoints to -stop in a certain location. Kdb is not a source level debugger, although -you can set breakpoints and execute some basic kernel run control. Kdb -is mainly aimed at doing some analysis to aid in development or -diagnosing kernel problems. You can access some symbols by name in -kernel built-ins or in kernel modules if the code was built with -``CONFIG_KALLSYMS``. - -Kgdb is intended to be used as a source level debugger for the Linux -kernel. It is used along with gdb to debug a Linux kernel. The -expectation is that gdb can be used to "break in" to the kernel to -inspect memory, variables and look through call stack information -similar to the way an application developer would use gdb to debug an -application. It is possible to place breakpoints in kernel code and -perform some limited execution stepping. - -Two machines are required for using kgdb. One of these machines is a -development machine and the other is the target machine. The kernel to -be debugged runs on the target machine. The development machine runs an -instance of gdb against the vmlinux file which contains the symbols (not -a boot image such as bzImage, zImage, uImage...). In gdb the developer -specifies the connection parameters and connects to kgdb. The type of -connection a developer makes with gdb depends on the availability of -kgdb I/O modules compiled as built-ins or loadable kernel modules in the -test machine's kernel. - -Compiling a kernel -================== - -- In order to enable compilation of kdb, you must first enable kgdb. - -- The kgdb test compile options are described in the kgdb test suite - chapter. - -Kernel config options for kgdb ------------------------------- - -To enable ``CONFIG_KGDB`` you should look under -:menuselection:`Kernel hacking --> Kernel debugging` and select -:menuselection:`KGDB: kernel debugger`. - -While it is not a hard requirement that you have symbols in your vmlinux -file, gdb tends not to be very useful without the symbolic data, so you -will want to turn on ``CONFIG_DEBUG_INFO`` which is called -:menuselection:`Compile the kernel with debug info` in the config menu. - -It is advised, but not required, that you turn on the -``CONFIG_FRAME_POINTER`` kernel option which is called :menuselection:`Compile -the kernel with frame pointers` in the config menu. This option inserts code -into the compiled executable which saves the frame information in registers -or on the stack at different points which allows a debugger such as gdb to -more accurately construct stack back traces while debugging the kernel. - -If the architecture that you are using supports the kernel option -``CONFIG_STRICT_KERNEL_RWX``, you should consider turning it off. This -option will prevent the use of software breakpoints because it marks -certain regions of the kernel's memory space as read-only. If kgdb -supports it for the architecture you are using, you can use hardware -breakpoints if you desire to run with the ``CONFIG_STRICT_KERNEL_RWX`` -option turned on, else you need to turn off this option. - -Next you should choose one or more I/O drivers to interconnect the debugging -host and debugged target. Early boot debugging requires a KGDB I/O -driver that supports early debugging and the driver must be built into -the kernel directly. Kgdb I/O driver configuration takes place via -kernel or module parameters which you can learn more about in the -section that describes the parameter kgdboc. - -Here is an example set of ``.config`` symbols to enable or disable for kgdb:: - - # CONFIG_STRICT_KERNEL_RWX is not set - CONFIG_FRAME_POINTER=y - CONFIG_KGDB=y - CONFIG_KGDB_SERIAL_CONSOLE=y - -Kernel config options for kdb ------------------------------ - -Kdb is quite a bit more complex than the simple gdbstub sitting on top -of the kernel's debug core. Kdb must implement a shell, and also adds -some helper functions in other parts of the kernel, responsible for -printing out interesting data such as what you would see if you ran -``lsmod``, or ``ps``. In order to build kdb into the kernel you follow the -same steps as you would for kgdb. - -The main config option for kdb is ``CONFIG_KGDB_KDB`` which is called -:menuselection:`KGDB_KDB: include kdb frontend for kgdb` in the config menu. -In theory you would have already also selected an I/O driver such as the -``CONFIG_KGDB_SERIAL_CONSOLE`` interface if you plan on using kdb on a -serial port, when you were configuring kgdb. - -If you want to use a PS/2-style keyboard with kdb, you would select -``CONFIG_KDB_KEYBOARD`` which is called :menuselection:`KGDB_KDB: keyboard as -input device` in the config menu. The ``CONFIG_KDB_KEYBOARD`` option is not -used for anything in the gdb interface to kgdb. The ``CONFIG_KDB_KEYBOARD`` -option only works with kdb. - -Here is an example set of ``.config`` symbols to enable/disable kdb:: - - # CONFIG_STRICT_KERNEL_RWX is not set - CONFIG_FRAME_POINTER=y - CONFIG_KGDB=y - CONFIG_KGDB_SERIAL_CONSOLE=y - CONFIG_KGDB_KDB=y - CONFIG_KDB_KEYBOARD=y - -Kernel Debugger Boot Arguments -============================== - -This section describes the various runtime kernel parameters that affect -the configuration of the kernel debugger. The following chapter covers -using kdb and kgdb as well as providing some examples of the -configuration parameters. - -Kernel parameter: kgdboc ------------------------- - -The kgdboc driver was originally an abbreviation meant to stand for -"kgdb over console". Today it is the primary mechanism to configure how -to communicate from gdb to kgdb as well as the devices you want to use -to interact with the kdb shell. - -For kgdb/gdb, kgdboc is designed to work with a single serial port. It -is intended to cover the circumstance where you want to use a serial -console as your primary console as well as using it to perform kernel -debugging. It is also possible to use kgdb on a serial port which is not -designated as a system console. Kgdboc may be configured as a kernel -built-in or a kernel loadable module. You can only make use of -``kgdbwait`` and early debugging if you build kgdboc into the kernel as -a built-in. - -Optionally you can elect to activate kms (Kernel Mode Setting) -integration. When you use kms with kgdboc and you have a video driver -that has atomic mode setting hooks, it is possible to enter the debugger -on the graphics console. When the kernel execution is resumed, the -previous graphics mode will be restored. This integration can serve as a -useful tool to aid in diagnosing crashes or doing analysis of memory -with kdb while allowing the full graphics console applications to run. - -kgdboc arguments -~~~~~~~~~~~~~~~~ - -Usage:: - - kgdboc=[kms][[,]kbd][[,]serial_device][,baud] - -The order listed above must be observed if you use any of the optional -configurations together. - -Abbreviations: - -- kms = Kernel Mode Setting - -- kbd = Keyboard - -You can configure kgdboc to use the keyboard, and/or a serial device -depending on if you are using kdb and/or kgdb, in one of the following -scenarios. The order listed above must be observed if you use any of the -optional configurations together. Using kms + only gdb is generally not -a useful combination. - -Using loadable module or built-in -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -1. As a kernel built-in: - - Use the kernel boot argument:: - - kgdboc=,[baud] - -2. As a kernel loadable module: - - Use the command:: - - modprobe kgdboc kgdboc=,[baud] - - Here are two examples of how you might format the kgdboc string. The - first is for an x86 target using the first serial port. The second - example is for the ARM Versatile AB using the second serial port. - - 1. ``kgdboc=ttyS0,115200`` - - 2. ``kgdboc=ttyAMA1,115200`` - -Configure kgdboc at runtime with sysfs -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -At run time you can enable or disable kgdboc by writing parameters -into sysfs. Here are two examples: - -1. Enable kgdboc on ttyS0:: - - echo ttyS0 > /sys/module/kgdboc/parameters/kgdboc - -2. Disable kgdboc:: - - echo "" > /sys/module/kgdboc/parameters/kgdboc - -.. note:: - - You do not need to specify the baud if you are configuring the - console on tty which is already configured or open. - -More examples -^^^^^^^^^^^^^ - -You can configure kgdboc to use the keyboard, and/or a serial device -depending on if you are using kdb and/or kgdb, in one of the following -scenarios. - -1. kdb and kgdb over only a serial port:: - - kgdboc=[,baud] - - Example:: - - kgdboc=ttyS0,115200 - -2. kdb and kgdb with keyboard and a serial port:: - - kgdboc=kbd,[,baud] - - Example:: - - kgdboc=kbd,ttyS0,115200 - -3. kdb with a keyboard:: - - kgdboc=kbd - -4. kdb with kernel mode setting:: - - kgdboc=kms,kbd - -5. kdb with kernel mode setting and kgdb over a serial port:: - - kgdboc=kms,kbd,ttyS0,115200 - -.. note:: - - Kgdboc does not support interrupting the target via the gdb remote - protocol. You must manually send a :kbd:`SysRq-G` unless you have a proxy - that splits console output to a terminal program. A console proxy has a - separate TCP port for the debugger and a separate TCP port for the - "human" console. The proxy can take care of sending the :kbd:`SysRq-G` - for you. - -When using kgdboc with no debugger proxy, you can end up connecting the -debugger at one of two entry points. If an exception occurs after you -have loaded kgdboc, a message should print on the console stating it is -waiting for the debugger. In this case you disconnect your terminal -program and then connect the debugger in its place. If you want to -interrupt the target system and forcibly enter a debug session you have -to issue a :kbd:`Sysrq` sequence and then type the letter :kbd:`g`. Then you -disconnect the terminal session and connect gdb. Your options if you -don't like this are to hack gdb to send the :kbd:`SysRq-G` for you as well as -on the initial connect, or to use a debugger proxy that allows an -unmodified gdb to do the debugging. - -Kernel parameter: ``kgdboc_earlycon`` -------------------------------------- - -If you specify the kernel parameter ``kgdboc_earlycon`` and your serial -driver registers a boot console that supports polling (doesn't need -interrupts and implements a nonblocking read() function) kgdb will attempt -to work using the boot console until it can transition to the regular -tty driver specified by the ``kgdboc`` parameter. - -Normally there is only one boot console (especially that implements the -read() function) so just adding ``kgdboc_earlycon`` on its own is -sufficient to make this work. If you have more than one boot console you -can add the boot console's name to differentiate. Note that names that -are registered through the boot console layer and the tty layer are not -the same for the same port. - -For instance, on one board to be explicit you might do:: - - kgdboc_earlycon=qcom_geni kgdboc=ttyMSM0 - -If the only boot console on the device was "qcom_geni", you could simplify:: - - kgdboc_earlycon kgdboc=ttyMSM0 - -Kernel parameter: ``kgdbwait`` ------------------------------- - -The Kernel command line option ``kgdbwait`` makes kgdb wait for a -debugger connection during booting of a kernel. You can only use this -option if you compiled a kgdb I/O driver into the kernel and you -specified the I/O driver configuration as a kernel command line option. -The kgdbwait parameter should always follow the configuration parameter -for the kgdb I/O driver in the kernel command line else the I/O driver -will not be configured prior to asking the kernel to use it to wait. - -The kernel will stop and wait as early as the I/O driver and -architecture allows when you use this option. If you build the kgdb I/O -driver as a loadable kernel module kgdbwait will not do anything. - -Kernel parameter: ``kgdbcon`` ------------------------------ - -The ``kgdbcon`` feature allows you to see printk() messages inside gdb -while gdb is connected to the kernel. Kdb does not make use of the kgdbcon -feature. - -Kgdb supports using the gdb serial protocol to send console messages to -the debugger when the debugger is connected and running. There are two -ways to activate this feature. - -1. Activate with the kernel command line option:: - - kgdbcon - -2. Use sysfs before configuring an I/O driver:: - - echo 1 > /sys/module/debug_core/parameters/kgdb_use_con - -.. note:: - - If you do this after you configure the kgdb I/O driver, the - setting will not take effect until the next point the I/O is - reconfigured. - -.. important:: - - You cannot use kgdboc + kgdbcon on a tty that is an - active system console. An example of incorrect usage is:: - - console=ttyS0,115200 kgdboc=ttyS0 kgdbcon - -It is possible to use this option with kgdboc on a tty that is not a -system console. - -Run time parameter: ``kgdbreboot`` ----------------------------------- - -The kgdbreboot feature allows you to change how the debugger deals with -the reboot notification. You have 3 choices for the behavior. The -default behavior is always set to 0. - -.. tabularcolumns:: |p{0.4cm}|p{11.5cm}|p{5.6cm}| - -.. flat-table:: - :widths: 1 10 8 - - * - 1 - - ``echo -1 > /sys/module/debug_core/parameters/kgdbreboot`` - - Ignore the reboot notification entirely. - - * - 2 - - ``echo 0 > /sys/module/debug_core/parameters/kgdbreboot`` - - Send the detach message to any attached debugger client. - - * - 3 - - ``echo 1 > /sys/module/debug_core/parameters/kgdbreboot`` - - Enter the debugger on reboot notify. - -Kernel parameter: ``nokaslr`` ------------------------------ - -If the architecture that you are using enables KASLR by default, -you should consider turning it off. KASLR randomizes the -virtual address where the kernel image is mapped and confuses -gdb which resolves addresses of kernel symbols from the symbol table -of vmlinux. - -Using kdb -========= - -Quick start for kdb on a serial port ------------------------------------- - -This is a quick example of how to use kdb. - -1. Configure kgdboc at boot using kernel parameters:: - - console=ttyS0,115200 kgdboc=ttyS0,115200 nokaslr - - OR - - Configure kgdboc after the kernel has booted; assuming you are using - a serial port console:: - - echo ttyS0 > /sys/module/kgdboc/parameters/kgdboc - -2. Enter the kernel debugger manually or by waiting for an oops or - fault. There are several ways you can enter the kernel debugger - manually; all involve using the :kbd:`SysRq-G`, which means you must have - enabled ``CONFIG_MAGIC_SYSRQ=y`` in your kernel config. - - - When logged in as root or with a super user session you can run:: - - echo g > /proc/sysrq-trigger - - - Example using minicom 2.2 - - Press: :kbd:`CTRL-A` :kbd:`f` :kbd:`g` - - - When you have telneted to a terminal server that supports sending - a remote break - - Press: :kbd:`CTRL-]` - - Type in: ``send break`` - - Press: :kbd:`Enter` :kbd:`g` - -3. From the kdb prompt you can run the ``help`` command to see a complete - list of the commands that are available. - - Some useful commands in kdb include: - - =========== ================================================================= - ``lsmod`` Shows where kernel modules are loaded - ``ps`` Displays only the active processes - ``ps A`` Shows all the processes - ``summary`` Shows kernel version info and memory usage - ``bt`` Get a backtrace of the current process using dump_stack() - ``dmesg`` View the kernel syslog buffer - ``go`` Continue the system - =========== ================================================================= - -4. When you are done using kdb you need to consider rebooting the system - or using the ``go`` command to resuming normal kernel execution. If you - have paused the kernel for a lengthy period of time, applications - that rely on timely networking or anything to do with real wall clock - time could be adversely affected, so you should take this into - consideration when using the kernel debugger. - -Quick start for kdb using a keyboard connected console ------------------------------------------------------- - -This is a quick example of how to use kdb with a keyboard. - -1. Configure kgdboc at boot using kernel parameters:: - - kgdboc=kbd - - OR - - Configure kgdboc after the kernel has booted:: - - echo kbd > /sys/module/kgdboc/parameters/kgdboc - -2. Enter the kernel debugger manually or by waiting for an oops or - fault. There are several ways you can enter the kernel debugger - manually; all involve using the :kbd:`SysRq-G`, which means you must have - enabled ``CONFIG_MAGIC_SYSRQ=y`` in your kernel config. - - - When logged in as root or with a super user session you can run:: - - echo g > /proc/sysrq-trigger - - - Example using a laptop keyboard: - - Press and hold down: :kbd:`Alt` - - Press and hold down: :kbd:`Fn` - - Press and release the key with the label: :kbd:`SysRq` - - Release: :kbd:`Fn` - - Press and release: :kbd:`g` - - Release: :kbd:`Alt` - - - Example using a PS/2 101-key keyboard - - Press and hold down: :kbd:`Alt` - - Press and release the key with the label: :kbd:`SysRq` - - Press and release: :kbd:`g` - - Release: :kbd:`Alt` - -3. Now type in a kdb command such as ``help``, ``dmesg``, ``bt`` or ``go`` to - continue kernel execution. - -Using kgdb / gdb -================ - -In order to use kgdb you must activate it by passing configuration -information to one of the kgdb I/O drivers. If you do not pass any -configuration information kgdb will not do anything at all. Kgdb will -only actively hook up to the kernel trap hooks if a kgdb I/O driver is -loaded and configured. If you unconfigure a kgdb I/O driver, kgdb will -unregister all the kernel hook points. - -All kgdb I/O drivers can be reconfigured at run time, if -``CONFIG_SYSFS`` and ``CONFIG_MODULES`` are enabled, by echo'ing a new -config string to ``/sys/module//parameter/