ACPI: processor_idle: Skip dummy wait for processors based on the Zen microarchitecture

See: https://lkml.org/lkml/2022/9/21/74 See: https://www.phoronix.com/news/Linux-AMD-Old-Chipset-WA Signed-off-by: Mike Pagano <mpagano@gentoo.org>
author: Mike Pagano <mpagano@gentoo.org> 2022-09-27 08:01:09 -0400
committer: Mike Pagano <mpagano@gentoo.org> 2022-09-27 08:01:09 -0400
commit: 50ea097c7e3b862632af3b0c99cb2467857c7fa7 (patch)
tree: 31d53b565e27927f5d39ce8f0a5bf58481985846
parent: Remove duplicate patch (diff)
download: linux-patches-50ea097c7e3b862632af3b0c99cb2467857c7fa7.tar.gz
linux-patches-50ea097c7e3b862632af3b0c99cb2467857c7fa7.tar.bz2
linux-patches-50ea097c7e3b862632af3b0c99cb2467857c7fa7.zip
2 files changed, 146 insertions, 0 deletions
diff --git a/0000_README b/0000_README
index c5e283b0..034f4583 100644
--- a/0000_README
+++ b/0000_README
@@ -134,3 +134,7 @@ Desc:   BMQ(BitMap Queue) Scheduler. A new CPU scheduler developed from PDS(incl
 Patch:  5021_BMQ-and-PDS-gentoo-defaults.patch
 From:   https://gitweb.gentoo.org/proj/linux-patches.git/
 Desc:   Set defaults for BMQ. Add archs as people test, default to N
+
+Patch:  5030_ACPI-Skip-dummy-wait-for-Zen-processors.patch
+From:   https://lkml.org/lkml/2022/9/21/74
+Desc:   ACPI: processor_idle: Skip dummy wait for processors based on the Zen microarchitecture
diff --git a/5030_ACPI-Skip-dummy-wait-for-Zen-processors.patch b/5030_ACPI-Skip-dummy-wait-for-Zen-processors.patch
new file mode 100644
index 00000000..e0f6ebe8
--- /dev/null
+++ b/5030_ACPI-Skip-dummy-wait-for-Zen-processors.patch
@@ -0,0 +1,142 @@
+linux-kernel.vger.kernel.org archive mirror
+ help / color / mirror / Atom feed
+From: K Prateek Nayak <kprateek.nayak@amd.com>
+To: <linux-kernel@vger.kernel.org>
+Cc: <rafael@kernel.org>, <lenb@kernel.org>,
+	<linux-acpi@vger.kernel.org>, <linux-pm@vger.kernel.org>,
+	<dave.hansen@linux.intel.com>, <bp@alien8.de>,
+	<tglx@linutronix.de>, <andi@lisas.de>, <puwen@hygon.cn>,
+	<mario.limonciello@amd.com>, <peterz@infradead.org>,
+	<rui.zhang@intel.com>, <gpiccoli@igalia.com>,
+	<daniel.lezcano@linaro.org>, <ananth.narayan@amd.com>,
+	<gautham.shenoy@amd.com>,
+	K Prateek Nayak <kprateek.nayak@amd.com>,
+	"Calvin Ong" <calvin.ong@amd.com>, <stable@vger.kernel.org>,
+	<regressions@lists.linux.dev>
+Subject: [PATCH] ACPI: processor_idle: Skip dummy wait for processors based on the Zen microarchitecture
+Date: Wed, 21 Sep 2022 12:06:38 +0530	[thread overview]
+Message-ID: <20220921063638.2489-1-kprateek.nayak@amd.com> (raw)
+
+Processors based on the Zen microarchitecture support IOPORT based deeper
+C-states. The idle driver reads the acpi_gbl_FADT.xpm_timer_block.address
+in the IOPORT based C-state exit path which is claimed to be a
+"Dummy wait op" and has been around since ACPI introduction to Linux
+dating back to Andy Grover's Mar 14, 2002 posting [1].
+The comment above the dummy operation was elaborated by Andreas Mohr back
+in 2006 in commit b488f02156d3d ("ACPI: restore comment justifying 'extra'
+P_LVLx access") [2] where the commit log claims:
+"this dummy read was about: STPCLK# doesn't get asserted in time on
+(some) chipsets, which is why we need to have a dummy I/O read to delaarchitecture-relatedy
+further instruction processing until the CPU is fully stopped."
+
+However, sampling certain workloads with IBS on AMD Zen3 system shows
+that a significant amount of time is spent in the dummy op, which
+incorrectly gets accounted as C-State residency. A large C-State
+residency value can prime the cpuidle governor to recommend a deeper
+C-State during the subsequent idle instances, starting a vicious cycle,
+leading to performance degradation on workloads that rapidly switch
+between busy and idle phases.
+
+One such workload is tbench where a massive performance degradation can
+be observed during certain runs. Following are some statistics gathered
+by running tbench with 128 clients, on a dual socket (2 x 64C/128T) Zen3
+system with the baseline kernel, baseline kernel keeping C2 disabled,
+and baseline kernel with this patch applied keeping C2 enabled:
+
+baseline kernel was tip:sched/core at
+commit f3dd3f674555 ("sched: Remove the limitation of WF_ON_CPU on
+wakelist if wakee cpu is idle")
+
+Kernel        : baseline      baseline + C2 disabled   baseline + patch
+
+Min (MB/s)    : 2215.06       33072.10 (+1393.05%)     33016.10 (+1390.52%)
+Max (MB/s)    : 32938.80      34399.10                 34774.50
+Median (MB/s) : 32191.80      33476.60                 33805.70
+AMean (MB/s)  : 22448.55      33649.27 (+49.89%)       33865.43 (+50.85%)
+AMean Stddev  : 17526.70      680.14                   880.72
+AMean CoefVar : 78.07%        2.02%                    2.60%
+
+The data shows there are edge cases that can cause massive regressions
+in case of tbench. Profiling the bad runs with IBS shows a significant
+amount of time being spent in acpi_idle_do_entry method:
+
+Overhead  Command          Shared Object             Symbolarchitecture-related
+  74.76%  swapper          [kernel.kallsyms]         [k] acpi_idle_do_entry
+   0.71%  tbench           [kernel.kallsyms]         [k] update_sd_lb_stats.constprop.0
+   0.69%  tbench_srv       [kernel.kallsyms]         [k] update_sd_lb_stats.constprop.0
+   0.49%  swapper          [kernel.kallsyms]         [k] psi_group_change
+   ...
+
+Annotation of acpi_idle_do_entry method reveals almost all the time in
+acpi_idle_do_entry is spent on the port I/O in wait_for_freeze():
+
+  0.14 │      in     (%dx),%al       # <------ First "in" corresponding to inb(cx->address)
+  0.51 │      mov    0x144d64d(%rip),%rax
+  0.00 │      test   $0x80000000,%eax
+       │    ↓ jne    62 	     # <------ Skip if running in guest
+  0.00 │      mov    0x19800c3(%rip),%rdx
+ 99.33 │      in     (%dx),%eax      # <------ Second "in" corresponding to inl(acpi_gbl_FADT.xpm_timer_block.address)
+  0.00 │62:   mov    -0x8(%rbp),%r12
+  0.00 │      leavearchitecture-related
+  0.00 │    ← ret
+
+This overhead is reflected in the C2 residency on the test system where
+C2 is an IOPORT based C-State. The total C-state residency reported by
+"cpupower idle-info" on CPU0 for good and bad case over the 80s tbench
+run is as follows (all numbers are in microseconds):
+
+			    Good Run 		Bad Run
+			   (Baseline)
+
+POLL: 			       43338		   6231  (-85.62%)
+C1 (MWAIT Based): 	    23576156 		 363861  (-98.45%)
+C2 (IOPORT Based): 	    10781218 	       77027280  (+614.45%)
+
+The larger residency value in bad case leads to the system recommending
+C2 state again for subsequent idle instances. The pattern lasts till the
+end of the tbench run. Following is the breakdown of "entry_method"
+passed to acpi_idle_do_entry during good run and bad run:
+
+                                        			Good Run    Bad Run
+							       (Baseline)
+
+Number of times acpi_idle_do_entry was called:             	6149573     6149050  (-0.01%)
+ |-> Number of times entry_method was "ACPI_CSTATE_FFH":        6141494       88144  (-98.56%)
+ |-> Number of times entry_method was "ACPI_CSTATE_HALT":             0           0  (+0.00%)
+ |-> Number of times entry_method was "ACPI_CSTATE_SYSTEMIO":      8079     6060906  (+74920.49%)
+
+For processors based on the Zen microarchitecture, this dummy wait op is
+unnecessary and can be skipped when choosing IOPORT based C-States to
+avoid polluting the C-state residency information.
+
+Link: https://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux-fullhistory.git/commit/?id=972c16130d9dc182cedcdd408408d9eacc7d6a2d [1]
+Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b488f02156d3deb08f5ad7816d565c370a8cc6f1 [2]
+
+Suggested-by: Calvin Ong <calvin.ong@amd.com>
+Cc: stable@vger.kernel.org
+Cc: regressions@lists.linux.dev
+Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
+---
+ drivers/acpi/processor_idle.c | 7 +++++--
+ 1 file changed, 5 insertions(+), 2 deletions(-)
+
+diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
+index 16a1663d02d4..18850aa2b79b 100644
+--- a/drivers/acpi/processor_idle.c
++++ b/drivers/acpi/processor_idle.c
+@@ -528,8 +528,11 @@ static int acpi_idle_bm_check(void)
+ static void wait_for_freeze(void)
+ {
+ #ifdef	CONFIG_X86
+-	/* No delay is needed if we are in guest */
+-	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
++	/*
++	 * No delay is needed if we are in guest or on a processor
++	 * based on the Zen microarchitecture.
++	 */
++	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) || boot_cpu_has(X86_FEATURE_ZEN))
+ 		return;
+ #endif
+ 	/* Dummy wait op - must do something useless after P_LVL2 read
+-- 
+2.25.1
author	Mike Pagano <mpagano@gentoo.org>	2022-09-27 08:01:09 -0400
committer	Mike Pagano <mpagano@gentoo.org>	2022-09-27 08:01:09 -0400
commit	50ea097c7e3b862632af3b0c99cb2467857c7fa7 (patch)
tree	31d53b565e27927f5d39ce8f0a5bf58481985846
parent	Remove duplicate patch (diff)
download	linux-patches-50ea097c7e3b862632af3b0c99cb2467857c7fa7.tar.gz linux-patches-50ea097c7e3b862632af3b0c99cb2467857c7fa7.tar.bz2 linux-patches-50ea097c7e3b862632af3b0c99cb2467857c7fa7.zip