Age | Commit message | Author
2022-04-26 | xen/arm: p2m don't fall over on FEAT_LPA enabled hw [fix/arm-feat-lpa] | Alex Bennée
When we introduced FEAT_LPA to QEMU's -cpu max we discovered older kernels had a bug where the physical address was copied directly from ID_AA64MMFR0_EL1.PARange field. The early cpu_init code of Xen commits the same error by blindly copying across the max supported range. Unsurprisingly when the page tables aren't set up for these greater ranges hilarity ensues and the hypervisor crashes fairly early on in the boot-up sequence. This happens when we write to the control register in enable_mmu(). Attempt to fix this the same way as the Linux kernel does by gating PARange to the maximum the hypervisor can handle. I also had to fix up code in p2m which panics when it sees an "invalid" entry in PARange. Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Julien Grall <julien@xen.org> Cc: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com> Cc: Bertrand Marquis <bertrand.marquis@arm.com>
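The gating described above can be illustrated with a minimal sketch. This is not the code from the patch: the names `MAX_SUPPORTED_PARANGE`, `clamp_parange()` and `parange_to_bits()` are hypothetical, and the limit of 48 bits is an assumption made for the example. The only architectural facts relied on are that PARange lives in bits [3:0] of ID_AA64MMFR0_EL1 and that encoding 0x6 (52 bits) is the one FEAT_LPA adds.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: page tables are only set up for 48-bit PAs. */
#define MAX_SUPPORTED_PARANGE 0x5u

/* Physical address width, in bits, for each PARange encoding 0x0..0x6. */
static const unsigned int pa_bits[] = { 32, 36, 40, 42, 44, 48, 52 };

/* Extract PARange (bits [3:0]) and cap it at what the sketch supports. */
static unsigned int clamp_parange(uint64_t mmfr0)
{
    unsigned int pa = (unsigned int)(mmfr0 & 0xf);

    return pa > MAX_SUPPORTED_PARANGE ? MAX_SUPPORTED_PARANGE : pa;
}

/* Translate an (already clamped) encoding into an address width. */
static unsigned int parange_to_bits(unsigned int pa)
{
    assert(pa < sizeof(pa_bits) / sizeof(pa_bits[0]));
    return pa_bits[pa];
}
```

With the cap in place, hardware reporting the 52-bit encoding is treated as 48-bit, while smaller encodings pass through unchanged.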
2022-02-16 | x86emul: fix VPBLENDMW with mask and memory operand | Jan Beulich
Element size for this opcode depends on EVEX.W, not the low opcode bit. Make use of AVX512BW being a prereq to AVX512_BITALG and move the case label there, adding an AVX512BW feature check. Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> master commit: eddf13b5e9401f6871dcce1ce61c80cff62079ed master date: 2022-02-14 10:08:38 +0100
2022-02-16 | tools/libs: Fix build dependencies | Anthony PERARD
Some libs' Makefiles aren't loading the dependency files *.d2. We can load them from "libs.mk": none of the Makefiles here change $(DEPS) or $(DEPS_INCLUDE), so it is fine to move the "include" to "libs.mk". As a small improvement, don't load the dependency files (and thus avoid regenerating the *.d2 files) during `make clean`. Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> master commit: e62cc29f9b6c42b67182a1362e2ea18bad75b5ff master date: 2022-02-08 11:15:53 +0000
2022-02-16 | build: fix exported variable name CFLAGS_stack_boundary | Anthony PERARD
Exporting a variable with a dash in its name doesn't work reliably: such variables may be stripped from the environment when calling a sub-make or sub-shell. CFLAGS-stack-boundary started being removed from the environment in patch "build: set ALL_OBJS in main Makefile; move prelink.o to main Makefile" when running `make "ALL_OBJS=.."`, due to the addition of the quotes, at least in my empirical tests. Fixes: 2740d96efd ("xen/build: have the root Makefile generates the CFLAGS") Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
2022-02-16 | libxl: force netback to wait for hotplug execution before connecting | Roger Pau Monné
By writing an empty "hotplug-status" xenstore node in the backend path libxl can force Linux netback to wait for hotplug script execution before proceeding to the 'connected' state. This is required so that netback doesn't skip state 2 (InitWait) and thus blocks libxl waiting for such state in order to launch the hotplug script (see libxl__wait_device_connection). Reported-by: James Dingwall <james-xen@dingwall.me.uk> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: James Dingwall <james-xen@dingwall.me.uk> Reviewed-by: Paul Durrant <paul@xen.org> Tested-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Wei Liu <wei.liu@kernel.org> master commit: 0bdc43c8dec993258e930b34855853c22b917519 master date: 2022-01-27 13:51:19 +0100
2022-02-16 | tools/libs/light: don't touch nr_vcpus_out if listing vcpus and returning NULL | Dario Faggioli
If we are in libxl_list_vcpu() and we are returning NULL, let's avoid touching the output parameter *nr_vcpus_out, which the caller should have initialized to 0. The current behavior could be problematic if we are creating a domain and, in the meantime, an existing one is destroyed when we have already done some steps of the loop. At that point, we'd return a NULL list of vcpus but with something different than 0 as the number of vcpus in that list. This can cause trouble in the callers (e.g., nr_vcpus_on_nodes()) when they do a libxl_vcpuinfo_list_free(). Crashes due to this are rare and difficult to reproduce, but have been observed, with stack traces looking like this one:
  #0 libxl_bitmap_dispose (map=map@entry=0x50) at libxl_utils.c:626
  #1 0x00007fe72c993a32 in libxl_vcpuinfo_dispose (p=p@entry=0x38) at _libxl_types.c:692
  #2 0x00007fe72c94e3c4 in libxl_vcpuinfo_list_free (list=0x0, nr=<optimized out>) at libxl_utils.c:1059
  #3 0x00007fe72c9528bf in nr_vcpus_on_nodes (vcpus_on_node=0x7fe71000eb60, suitable_cpumap=0x7fe721df0d38, tinfo_elements=48, tinfo=0x7fe7101b3900, gc=0x7fe7101bbfa0) at libxl_numa.c:258
  #4 libxl__get_numa_candidate (gc=gc@entry=0x7fe7100033a0, min_free_memkb=4233216, min_cpus=4, min_nodes=min_nodes@entry=0, max_nodes=max_nodes@entry=0, suitable_cpumap=suitable_cpumap@entry=0x7fe721df0d38, numa_cmpf=0x7fe72c940110 <numa_cmpf>, cndt_out=0x7fe721df0cf0, cndt_found=0x7fe721df0cb4) at libxl_numa.c:394
  #5 0x00007fe72c94152b in numa_place_domain (d_config=0x7fe721df11b0, domid=975, gc=0x7fe7100033a0) at libxl_dom.c:209
  #6 libxl__build_pre (gc=gc@entry=0x7fe7100033a0, domid=domid@entry=975, d_config=d_config@entry=0x7fe721df11b0, state=state@entry=0x7fe710077700) at libxl_dom.c:436
  #7 0x00007fe72c92c4a5 in libxl__domain_build (gc=0x7fe7100033a0, d_config=d_config@entry=0x7fe721df11b0, domid=975, state=0x7fe710077700) at libxl_create.c:444
  #8 0x00007fe72c92de8b in domcreate_bootloader_done (egc=0x7fe721df0f60, bl=0x7fe7100778c0, rc=<optimized out>) at libxl_create.c:1222
  #9 0x00007fe72c980425 in libxl__bootloader_run (egc=egc@entry=0x7fe721df0f60, bl=bl@entry=0x7fe7100778c0) at libxl_bootloader.c:403
  #10 0x00007fe72c92f281 in initiate_domain_create (egc=egc@entry=0x7fe721df0f60, dcs=dcs@entry=0x7fe7100771b0) at libxl_create.c:1159
  #11 0x00007fe72c92f456 in do_domain_create (ctx=ctx@entry=0x7fe71001c840, d_config=d_config@entry=0x7fe721df11b0, domid=domid@entry=0x7fe721df10a8, restore_fd=restore_fd@entry=-1, send_back_fd=send_back_fd@entry=-1, params=params@entry=0x0, ao_how=0x0, aop_console_how=0x7fe721df10f0) at libxl_create.c:1856
  #12 0x00007fe72c92f776 in libxl_domain_create_new (ctx=0x7fe71001c840, d_config=d_config@entry=0x7fe721df11b0, domid=domid@entry=0x7fe721df10a8, ao_how=ao_how@entry=0x0, aop_console_how=aop_console_how@entry=0x7fe721df10f0) at libxl_create.c:2075
Signed-off-by: Dario Faggioli <dfaggioli@suse.com> Tested-by: James Fehlig <jfehlig@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com> master commit: d9d3496e817ace919092d70d4730257b37c2e743 master date: 2022-01-31 10:58:07 +0100
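The rule this commit enforces (leave the output count alone whenever NULL is returned) can be sketched in isolation. The code below is not libxl's; `struct item`, `list_items()` and the `fail_at` parameter are hypothetical stand-ins used only to simulate a failure part-way through building the list.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical element type; libxl's real type is libxl_vcpuinfo. */
struct item {
    int id;
};

/*
 * Build a list of 'want' items. 'fail_at' simulates an error on that
 * iteration (pass -1 for no failure). On failure, return NULL and leave
 * *nr_out exactly as the caller initialised it.
 */
struct item *list_items(int want, int fail_at, int *nr_out)
{
    struct item *list;
    int i;

    assert(nr_out != NULL);

    list = calloc(want > 0 ? (size_t)want : 1, sizeof(*list));
    if (!list)
        return NULL;

    for (i = 0; i < want; i++) {
        if (i == fail_at) {
            free(list);
            return NULL;        /* *nr_out deliberately not written */
        }
        list[i].id = i;
    }

    *nr_out = want;             /* only published together with a valid list */
    return list;
}
```

A caller that initialises the count to 0 and later frees "count" elements of the returned list is then safe even when the list is NULL.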
2022-02-08 | x86/spec-ctrl: Support Intel PSFD for guests | Andrew Cooper
The Feb 2022 microcode from Intel retrofits AMD's MSR_SPEC_CTRL.PSFD interface to Sunny Cove (IceLake) and later cores. Update the MSR_SPEC_CTRL emulation, and expose it to guests. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit 52ce1c97844db213de01c5300eaaa8cf101a285f)
2022-02-08 | x86/cpuid: Infrastructure for cpuid word 7:2.edx | Andrew Cooper
While in principle it would be nice to keep leaf 7 in order, that would involve having an extra 5 words of zeros in a featureset. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit f3709b15fc86c6c6a0959cec8d97f21d0e9f9629)
2022-02-08 | tests/tsx: Extend test-tsx to check MSR_MCU_OPT_CTRL | Andrew Cooper
This MSR needs to be identical across the system for TSX to have identical behaviour everywhere. Furthermore, its CPUID bit (SRBDS_CTRL) shouldn't be visible to guests. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit 4b45c4faa8c0637eb41cb4b143ccd4e9548c4908)
2022-02-08 | x86/tsx: Cope with TSX deprecation on WHL-R/CFL-R | Andrew Cooper
The February 2022 microcode is formally de-featuring TSX on the TAA-impacted client CPUs. The backup TAA mitigation (VERW regaining its flushing side effect) is being dropped, meaning that `smt=0 spec-ctrl=md-clear` no longer protects against TAA on these parts. The new functionality enumerates itself via the RTM_ALWAYS_ABORT CPUID bit (the same as June 2021), but has its control in MSR_MCU_OPT_CTRL as opposed to MSR_TSX_FORCE_ABORT. TSX now defaults to being disabled on ucode load. Furthermore, if SGX is enabled in the BIOS, TSX is locked and cannot be re-enabled. In this case, override opt_tsx to 0, so the RTM/HLE CPUID bits get hidden by default. While updating the command line documentation, take the opportunity to add a paragraph explaining what TSX being disabled actually means, and how migration compatibility works. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit ad9f7c3b2e0df38ad6d54f4769d4dccf765fbcee)
2022-02-08 | x86/tsx: Move has_rtm_always_abort to an outer scope | Andrew Cooper
We are about to introduce a second path which needs to conditionally force the presence of RTM_ALWAYS_ABORT. No functional change. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit 4116139131e93b4f075e5442e3c1b424280f6f1f)
2022-02-08 | x86/spec-ctrl: Clean up MSR_MCU_OPT_CTRL handling | Andrew Cooper
Introduce cpu_has_srbds_ctrl as more users are going to appear shortly. MSR_MCU_OPT_CTRL is gaining extra functionality, meaning that the current default_xen_mcu_opt_ctrl is no longer a good fit. Introduce two new helpers, update_mcu_opt_ctrl() which does a full RMW cycle on the MSR, and set_in_mcu_opt_ctrl() which lets callers configure specific bits at a time without clobbering each others settings. No functional change. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit 39a40f3835efcc25c1b05a25c321a01d7e11cbd7)
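One way two such helpers could fit together is sketched below. This is an illustration under stated assumptions, not the code from the commit: the MSR is simulated by a plain variable (`fake_msr`), and the names `owned_mask`, `owned_val`, `update_ctrl()` and `set_in_ctrl()` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t fake_msr;     /* simulated register, stands in for rdmsr/wrmsr */
static uint32_t owned_mask;   /* bits some caller has taken control of */
static uint32_t owned_val;    /* desired values for the owned bits */

/* Full read-modify-write: rewrite only the owned bits, preserve the rest. */
static void update_ctrl(void)
{
    uint32_t v = fake_msr;            /* "read" */

    v &= ~owned_mask;                 /* "modify" */
    v |= owned_val;
    fake_msr = v;                     /* "write" */
}

/* Let one caller configure its bits without disturbing other callers'. */
static void set_in_ctrl(uint32_t mask, uint32_t val)
{
    owned_mask |= mask;
    owned_val &= ~mask;
    owned_val |= val & mask;
    assert((owned_val & ~owned_mask) == 0);

    update_ctrl();
}
```

Because each call only touches the bits in its own mask, a later caller cannot clobber a setting made by an earlier one, and bits nobody has claimed keep their hardware value.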
2022-02-04 | x86/cpuid: Infrastructure for leaf 7:1.ebx | Jan Beulich
Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> (cherry picked from commit e1828e3032ebfe036023cd733adfd2d4ec856688)
2022-02-04 | x86/cpuid: Disentangle logic for new feature leaves | Andrew Cooper
Adding a new feature leaf is a reasonable amount of boilerplate and for the patch to build, at least one feature from the new leaf needs defining. This typically causes two non-trivial changes to be merged together. First, have gen-cpuid.py write out some extra placeholder defines:
  #define CPUID_BITFIELD_11 bool :1, :1, lfence_dispatch:1, ...
  #define CPUID_BITFIELD_12 uint32_t :32 /* placeholder */
  #define CPUID_BITFIELD_13 uint32_t :32 /* placeholder */
  #define CPUID_BITFIELD_14 uint32_t :32 /* placeholder */
  #define CPUID_BITFIELD_15 uint32_t :32 /* placeholder */
This allows DECL_BITFIELD() to be added to struct cpuid_policy without requiring a XEN_CPUFEATURE() declared for the leaf. The choice of 4 is arbitrary, and allows us to add more than one leaf at a time if necessary. Second, rework generic_identify() to not use specific feature names. The choice of deriving the index from a feature was to avoid mismatches, but its correctness depends on bugs like c/s 249e0f1d8f20 ("x86/cpuid: Fix TSXLDTRK definition") not happening. Switch to using FEATURESET_* just like the policy/featureset helpers. This breaks the cognitive complexity of needing to know which leaf a specifically named feature should reside in, and is shorter to write. It is also far easier to identify as correct at a glance, given the correlation with the CPUID leaf being read. In addition, tidy up some other bits of generic_identify():
  * Drop leading zeros from leaf numbers.
  * Don't use a locked update for X86_FEATURE_APERFMPERF.
  * Rework extended_cpuid_level calculation to avoid setting it twice.
  * Use "leaf >= $N" consistently so $N matches with the CPUID input.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit e3662437eb43cc8002bd39be077ef68b131649c5)
2022-02-04 | x86/cpuid: Enable MSR_SPEC_CTRL in SVM guests by default | Andrew Cooper
With all other pieces in place, MSR_SPEC_CTRL is fully working for HVM guests. Update the CPUID derivation logic (both PV and HVM to avoid losing subtle changes), drop the MSR intercept, and explicitly enable the CPUID bits for HVM guests. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit a7e7c7260cde78a148810db5320cbf39686c3e09)
2022-02-04 | x86/msr: AMD MSR_SPEC_CTRL infrastructure | Andrew Cooper
Fill in VMCB accessors for spec_ctrl in svm_{get,set}_reg(), and CPUID checks for all supported bits in guest_{rd,wr}msr(). Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit 22b9add22b4a9af37305c8441fec12cb26bd142b)
2022-02-04 | x86/svm: VMEntry/Exit logic for MSR_SPEC_CTRL | Andrew Cooper
Hardware maintains both host and guest versions of MSR_SPEC_CTRL, but guests run with the logical OR of both values. Therefore, in principle we want to clear Xen's value before entering the guest. However, for migration compatibility (future work), and for performance reasons with SEV-SNP guests, we want the ability to use a nonzero value behind the guest's back. Use vcpu_msrs to hold this value, with the guest value in the VMCB. On the VMEntry path, adjusting MSR_SPEC_CTRL must be done after CLGI so as to be atomic with respect to NMIs/etc. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit 614cec7d79d76786f5638a6e4da0576b57732ca1)
2022-02-04 | x86/spec-ctrl: Use common MSR_SPEC_CTRL logic for AMD | Andrew Cooper
Currently, amd_init_ssbd() works by being the only write to MSR_SPEC_CTRL in the system. This ceases to be true when using the common logic. Include AMD MSR_SPEC_CTRL in has_spec_ctrl to activate the common paths, and introduce an AMD specific block to control alternatives. Also update the boot/resume paths to configure default_xen_spec_ctrl. svm.h needs an adjustment to remove a dependency on include order. For now, only activate alternatives for HVM - PV will require more work. No functional change, as no alternatives are defined for HVM yet. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit 378f2e6df31442396f0afda19794c5c6091d96f9)
2022-02-04 | x86/spec-ctrl: Record the last write to MSR_SPEC_CTRL | Andrew Cooper
In some cases, writes to MSR_SPEC_CTRL do not have interesting side effects, and we should implement lazy context switching like we do with other MSRs. In the short term, this will be used by the SVM infrastructure, but I expect to extend it to other contexts in due course. Introduce cpu_info.last_spec_ctrl for the purpose, and cache writes made from the boot/resume paths. The value can't live in regular per-cpu data when it is eventually used for PV guests when XPTI might be active. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit 00f2992b6c7a9d4090443c1a85bf83224a87eeb9)
2022-02-04 | x86/spec-ctrl: Don't use spec_ctrl_{enter,exit}_idle() for S3 | Andrew Cooper
'idle' here refers to hlt/mwait. The S3 path isn't an idle path - it is a platform reset. We need to load default_xen_spec_ctrl unilaterally on the way back up. Currently it happens as a side effect of X86_FEATURE_SC_MSR_IDLE or the next return-to-guest, but that's fragile behaviour. Conversely, there is no need to clear IBRS and flush the store buffers on the way down; we're microseconds away from cutting power. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit 71fac402e05ade7b0af2c34f77517449f6f7e2c1)
2022-02-04 | x86/spec-ctrl: Introduce new has_spec_ctrl boolean | Andrew Cooper
Most MSR_SPEC_CTRL setup will be common between Intel and AMD. Instead of opencoding an OR of two features everywhere, introduce has_spec_ctrl instead. Reword the comment above the Intel specific alternatives block to highlight that it is Intel specific, and pull the setting of default_xen_spec_ctrl.IBRS out because it will want to be common. No functional change. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit 5d9eff3a312763d889cfbf3c8468b6dfb3ab490c)
2022-02-04 | x86/spec-ctrl: Drop use_spec_ctrl boolean | Andrew Cooper
Several bugfixes have reduced the utility of this variable from its original purpose, and now all it does is aid in the setup of SCF_ist_wrmsr. Simplify the logic by dropping the variable, and doubling up the setting of SCF_ist_wrmsr for the PV and HVM blocks, which will make the AMD SPEC_CTRL support easier to follow. Leave a comment explaining why SCF_ist_wrmsr is still necessary for the VMExit case. No functional change. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit ec083bf552c35e10347449e21809f4780f8155d2)
2022-02-04 | x86/cpuid: Advertise SSB_NO to guests by default | Andrew Cooper
This is a statement of hardware behaviour, and not related to controls for the guest kernel to use. Pass it straight through from hardware. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit 15b7611efd497c4b65f350483857082cb70fc348)
2022-02-04 | x86/msr: Fix migration compatibility issue with MSR_SPEC_CTRL | Andrew Cooper
This bug existed early in 2018, between MSR_SPEC_CTRL arriving in microcode and SSBD arriving a few months later. It went unnoticed presumably because everyone was busy rebooting everything. The same bug will reappear when adding PSFD support. Clamp the guest MSR_SPEC_CTRL value to that permitted by CPUID on migrate. The guest is already playing with reserved bits at this point, and clamping the value will prevent a migration to a less capable host from failing. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit 969a57f73f6b011b2ebf4c0ab1715efc65837335)
2022-02-04 | tools/guest: Fix comment regarding CPUID compatibility | Andrew Cooper
It was Xen 4.14 where CPUID data was added to the migration stream, and 4.13 that we need to worry about with regards to compatibility. Xen 4.12 isn't relevant. Expand and correct the commentary. Fixes: 111c8c33a8a1 ("x86/cpuid: do not expand max leaves on restore") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit 820cc393434097f3b7976acdccbf1d96071d6d23)
2022-02-04 | x86/vmx: Drop spec_ctrl load in VMEntry path | Andrew Cooper
This is not needed now that the VMEntry path is not responsible for loading the guest's MSR_SPEC_CTRL value. Fixes: 81f0eaadf84d ("x86/spec-ctrl: Fix NMI race condition with VT-x MSR_SPEC_CTRL handling") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> (cherry picked from commit 9ce3ef20b4f085a7dc8ee41b0fec6fdeced3773e)
2022-02-03 | MAINTAINERS: Anthony is stable branch tools maintainer | Jan Beulich
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
2022-01-26 | x86/pvh: fix population of the low 1MB for dom0 | Roger Pau Monné
RMRRs are set up ahead of populating the p2m, and hence the ASSERT when populating the low 1MB needs to be relaxed when it finds an existing entry: it's either RAM or an RMRR resulting from the IOMMU setup. Rework the logic a bit and introduce a local mfn variable in order to assert that if the gfn is populated and not RAM it is an identity map. Fixes: 6b4f6a31ac ('x86/PVH: de-duplicate mappings for first Mb of Dom0 memory') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> master commit: 2d5fc9120d556ec3c4b1acf0ab5660a6d3f7ebeb master date: 2022-01-25 10:52:24 +0000
2022-01-26 | x86: Fix build with the get/set_reg() infrastructure | Andrew Cooper
I clearly messed up concluding that the stubs were safe to drop. The is_{pv,hvm}_domain() predicates are not symmetrical with both CONFIG_PV and CONFIG_HVM. As a result logic of the form `if ( pv/hvm ) ... else ...` will always have one side which can't be DCE'd. While technically only the hvm stubs are needed, due to the use of the is_pv_domain() predicate in guest_{rd,wr}msr(), sort out the pv stubs too to avoid leaving a bear trap for future users. Fixes: 88d3ff7ab15d ("x86/guest: Introduce {get,set}_reg() infrastructure") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> master commit: 13caa585791234fe3e3719c8376f7ea731012451 master date: 2022-01-21 12:42:11 +0000
2022-01-25 | x86/spec-ctrl: Fix NMI race condition with VT-x MSR_SPEC_CTRL handling | Andrew Cooper
The logic was based on a mistaken understanding of how NMI blocking on vmexit works. NMIs are only blocked for EXIT_REASON_NMI, and not for general exits. Therefore, an NMI can in general hit early in the vmx_asm_vmexit_handler path, and the guest's value will be clobbered before it is saved. Switch to using MSR load/save lists. This causes the guest value to be saved atomically with respect to NMIs/MCEs/etc. First, update vmx_cpuid_policy_changed() to configure the load/save lists at the same time as configuring the intercepts. This function is always used in remote context, so extend the vmx_vmcs_{enter,exit}() block to cover the whole function, rather than having multiple remote acquisitions of the same VMCS. Both of vmx_{add,del}_guest_msr() can fail. The -ESRCH delete case is fine, but all others are fatal to the running of the VM, so handle them using domain_crash() - this path is only used during domain construction anyway. Second, update vmx_{get,set}_reg() to use the MSR load/save lists rather than vcpu_msrs, and update the vcpu_msrs comment to describe the new state location. Finally, adjust the entry/exit asm. Because the guest value is saved and loaded atomically, we do not need to manually load the guest value, nor do we need to enable SCF_use_shadow. This lets us remove the use of DO_SPEC_CTRL_EXIT_TO_GUEST. Additionally, SPEC_CTRL_ENTRY_FROM_PV gets removed too, because on an early entry failure, we're no longer in the guest MSR_SPEC_CTRL context needing to switch back to Xen's context. The only action remaining is to load Xen's MSR_SPEC_CTRL value on vmexit. We could in principle use the host MSR list, but that is expected to complicate future work. Delete DO_SPEC_CTRL_ENTRY_FROM_HVM entirely, and use a shorter code sequence to simply reload Xen's setting from the top-of-stack block. Adjust the comment at the top of spec_ctrl_asm.h in light of this bugfix.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> master commit: 81f0eaadf84d273a6ff8df3660b874a02d0e7677 master date: 2022-01-20 16:32:11 +0000
2022-01-25 | x86/spec-ctrl: Drop SPEC_CTRL_{ENTRY_FROM,EXIT_TO}_HVM | Andrew Cooper
These were written before Spectre/Meltdown went public, and there was large uncertainty in how the protections would evolve. As it turns out, they're very specific to Intel hardware, and not very suitable for AMD. Drop the macros, opencoding the relevant subset of functionality, and leaving grep-fodder to locate the logic. No change at all for VT-x. For AMD, the only relevant piece of functionality is DO_OVERWRITE_RSB, although we will soon be adding (different) logic to handle MSR_SPEC_CTRL. This has a marginal improvement of removing an unconditional pile of long-nops from the vmentry/exit path. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> master commit: 95b13fa43e0753b7514bef13abe28253e8614f62 master date: 2022-01-20 16:32:11 +0000
2022-01-25 | x86/msr: Split MSR_SPEC_CTRL handling | Andrew Cooper
In order to fix a VT-x bug, and support MSR_SPEC_CTRL on AMD, move MSR_SPEC_CTRL handling into the new {pv,hvm}_{get,set}_reg() infrastructure. Duplicate the msrs->spec_ctrl.raw accesses in the PV and VT-x paths for now. The SVM path is currently unreachable because of the CPUID policy. No functional change. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> master commit: 6536688439dbca1d08fd6db5be29c39e3917fb2f master date: 2022-01-20 16:32:11 +0000
2022-01-25 | x86/guest: Introduce {get,set}_reg() infrastructure | Andrew Cooper
Various registers have per-guest-type or per-vendor locations or access requirements. To support their use from common code, provide accessors which allow for per-guest-type behaviour. For now, just infrastructure handling default cases and expectations. Subsequent patches will start handling registers using this infrastructure. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> master commit: 88d3ff7ab15da277a85b39735797293fb541c718 master date: 2022-01-20 16:32:11 +0000
2022-01-25 | libxl/PCI: Fix PV hotplug & stubdom coldplug | Jason Andryuk
Commit 0fdb48ffe7a1 "libxl: Make sure devices added by pci-attach are reflected in the config" broke PCI hotplug (xl pci-attach) for PV domains when it moved libxl__create_pci_backend() later in the function. This also broke HVM + stubdom PCI passthrough coldplug. For that, the PCI devices are hotplugged to a running PV stubdom, and then the QEMU QMP device_add commands are made to QEMU inside the stubdom. A running PV domain calls libxl__wait_for_backend(). With the current placement of libxl__create_pci_backend(), the path does not exist and the call immediately fails:
  libxl: error: libxl_device.c:1388:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/43/0 does not exist
  libxl: error: libxl_pci.c:1764:device_pci_add_done: Domain 42:libxl__device_pci_add failed for PCI device 0:2:0.0 (rc -3)
  libxl: error: libxl_create.c:1857:domcreate_attach_devices: Domain 42:unable to add pci devices
The wait is only relevant when:
  1) The domain is PV
  2) The domain is running
  3) The backend is already present
This is because:
  1) xen-pcifront is only used for PV. It does not load for HVM domains where QEMU is used.
  2) If the domain is not running (starting), then the frontend state will be Initialising. xen-pciback waits for the frontend to transition to Initialised before attempting to connect. So a wait for a non-running domain is not applicable as the backend will not transition to Connected.
  3) For presence, num_devs is already used to determine if the backend needs to be created. Re-use num_devs to determine if the backend wait is necessary.
The wait is necessary to avoid racing with another PCI attachment reconfiguring the front/back or changing to some other state like closing. If we are creating the backend, then we don't have to worry about the state since it is being created.
Fixes: 0fdb48ffe7a1 ("libxl: Make sure devices added by pci-attach are reflected in the config") Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com> master commit: 73ee2795aaef2cb086ac078bffe1c6b33c0ea91b master date: 2022-01-13 14:33:16 +0100
2022-01-25 | x86/time: improve TSC / CPU freq calibration accuracy | Jan Beulich
While the problem report was for extreme errors, even smaller ones would better be avoided: The calculated period to run calibration loops over can (and usually will) be shorter than the actual time elapsed between first and last platform timer and TSC reads. Adjust values returned from the init functions accordingly. On a Skylake system I've tested this on accuracy (using HPET) went from detecting in some cases more than 220kHz too high a value to about ±2kHz. On other systems (or on this system, but with PMTMR) the original error range was much smaller, with less (in some cases only very little) improvement. Reported-by: James Dingwall <james-xen@dingwall.me.uk> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> master commit: a5c9a80af34eefcd6e31d0ed2b083f452cd9076d master date: 2022-01-13 14:31:52 +0100
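The effect of dividing by the elapsed time rather than the intended period can be shown with a small arithmetic sketch. This is not the patch's code: `calibrate_hz()` and its parameters are hypothetical, and the numbers in the usage note are made up for illustration rather than taken from the Skylake measurement.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate the TSC frequency from one calibration run.
 *   tsc_delta   - TSC ticks between the first and last TSC read
 *   timer_delta - platform timer ticks covering the same interval
 *   timer_hz    - platform timer frequency
 * The caller is assumed to keep tsc_delta * timer_hz within 64 bits.
 */
static uint64_t calibrate_hz(uint64_t tsc_delta, uint64_t timer_delta,
                             uint64_t timer_hz)
{
    assert(timer_delta != 0);
    return tsc_delta * timer_hz / timer_delta;
}
```

For example, with a 1 MHz timer, an intended period of 50000 ticks, and a loop that actually ran for 50010 ticks while the TSC advanced by 150030000, dividing by the intended 50000 yields 3000600000 Hz, whereas dividing by the 50010 ticks that really elapsed yields 3000000000 Hz.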
2022-01-25 | x86/time: use relative counts in calibration loops | Jan Beulich
Looping until reaching/exceeding a certain value is error prone: If the target value is close enough to the wrapping point, the loop may not terminate at all. Switch to using delta values, which then allows folding each pair of loops into just one. Fixes: 93340297802b ("x86/time: calibrate TSC against platform timer") Reported-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> master commit: 467191641d2a2fd2e43b3ae7b80399f89d339980 master date: 2022-01-13 14:30:18 +0100
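A delta-based wait of this kind can be sketched as follows. The counter here is simulated (`counter`, `read_counter()` and its step of 3 are assumptions for the example, as is the `wait_ticks()` name); the point is only that unsigned subtraction gives the elapsed count correctly even when the counter wraps.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated 32-bit free-running counter; each read advances it by 3. */
static uint32_t counter;

static uint32_t read_counter(void)
{
    uint32_t v = counter;

    counter += 3;
    return v;
}

/*
 * Spin until at least 'ticks' counter ticks have elapsed and return the
 * number that actually elapsed. Unsigned arithmetic wraps modulo 2^32,
 * so 'now - start' stays correct across the wrapping point.
 */
static uint32_t wait_ticks(uint32_t ticks)
{
    uint32_t start = read_counter();
    uint32_t elapsed;

    assert(ticks > 0);

    do {
        elapsed = read_counter() - start;
    } while (elapsed < ticks);

    return elapsed;
}
```

Comparing against an absolute target such as `start + ticks` instead would misbehave when that sum lands at or just past the wrapping point: the loop can exit immediately or, if reads step over the target, keep spinning.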
2022-01-25 | passthrough/x86: stop pirq iteration immediately in case of error | Julien Grall
pt_pirq_iterate() will iterate in batch over all the PIRQs. The outer loop will bail out if 'rc' is non-zero but the inner loop will continue. This means 'rc' will get clobbered and we may miss any errors (such as -ERESTART in the case of the callback pci_clean_dpci_irq()). This is CVE-2022-23035 / XSA-395. Fixes: c24536b636f2 ("replace d->nr_pirqs sized arrays with radix tree") Fixes: f6dd295381f4 ("dpci: replace tasklet with softirq") Signed-off-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> master commit: 9480a1a519cf016623f657dc544cb372a82b5708 master date: 2022-01-25 13:27:02 +0100
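The shape of the bug and its fix can be reproduced with a generic batched iterator. This sketch is not pt_pirq_iterate(): `iterate()`, `BATCH`, and the sample callback `fail_on_negative()` are hypothetical, and plain arrays stand in for the radix-tree lookups.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH 4

typedef int (*item_cb)(int item, void *arg);

/* Walk 'items' in batches; stop at the first callback error. */
static int iterate(const int *items, size_t n, item_cb cb, void *arg)
{
    int rc = 0;
    size_t base, i;

    assert(cb != NULL);

    for (base = 0; !rc && base < n; base += BATCH) {
        size_t end = base + BATCH < n ? base + BATCH : n;

        for (i = base; i < end; i++) {
            rc = cb(items[i], arg);
            if (rc)
                break;  /* without this, a later item could overwrite rc */
        }
    }

    return rc;
}

/* Sample callback: counts visits and fails on negative items. */
static int fail_on_negative(int item, void *arg)
{
    int *visited = arg;

    (*visited)++;
    return item < 0 ? -1 : 0;
}
```

Without the inner `break`, the remaining items of the batch would still be visited, and a successful callback after the failing one would reset `rc` to 0 and hide the error from the outer loop.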
2022-01-25 | xen/grant-table: Only decrement the refcounter when grant is fully unmapped | Julien Grall
The grant unmapping hypercall (GNTTABOP_unmap_grant_ref) is not a simple revert of the changes done by the grant mapping hypercall (GNTTABOP_map_grant_ref). Instead, it is possible to partially (or even not) clear some flags. This will leave the grant mapped until a future call where all the flags are cleared. XSA-380 introduced refcounting that is meant to only be dropped when the grant is fully unmapped. Unfortunately, unmap_common() will decrement the refcount for every successful call. A consequence is that a domain would be able to underflow the refcount and trigger a BUG(). Looking at the code, it is not clear to me why a domain would want to partially clear some flags in the grant-table. But as this is part of the ABI, it is better to not change the behavior for now. Fix it by checking if the maptrack handle has been released before decrementing the refcount. This is CVE-2022-23034 / XSA-394. Fixes: 9781b51efde2 ("gnttab: replace mapkind()") Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> master commit: 975a8fb45ca186b3476e5656c6ad5dad1122dbfd master date: 2022-01-25 13:25:49 +0100
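The refcounting rule can be modelled with a toy mapping structure. This is a simplified sketch, not the grant-table code: `struct mapping`, the `MAP_*` flags, the `released` field and `unmap()` are all hypothetical, and "released" merely stands in for the maptrack handle having been released.

```c
#include <assert.h>
#include <stdbool.h>

#define MAP_HOST   (1u << 0)
#define MAP_DEVICE (1u << 1)

/* Hypothetical stand-in for a maptrack entry. */
struct mapping {
    unsigned int flags;   /* which kinds of mapping are still active */
    bool released;        /* set once the entry has been fully unmapped */
};

/*
 * Clear some mapping flags. The shared refcount is dropped exactly once,
 * at the moment the last flag goes away, never on partial unmaps.
 */
static void unmap(struct mapping *m, unsigned int clear, unsigned int *refcnt)
{
    m->flags &= ~clear;

    if (m->flags == 0 && !m->released) {
        m->released = true;
        assert(*refcnt > 0);
        (*refcnt)--;
    }
}
```

Repeated or partial unmap calls leave the count untouched until the mapping is completely gone, so the count cannot be driven below zero by calling unmap more often than map.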
2022-01-25 | xen/arm: p2m: Always clear the P2M entry when the mapping is removed | Julien Grall
Commit 2148a125b73b ("xen/arm: Track page accessed between batch of Set/Way operations") allowed an entry to be invalid from the CPU PoV (lpae_is_valid()) but valid for Xen (p2m_is_valid()). This is useful to track which pages are accessed and only perform an action on them (e.g. clean & invalidate the cache after a set/way instruction). Unfortunately, __p2m_set_entry() is only zeroing the P2M entry when lpae_is_valid() returns true. This means the entry will not be zeroed if the entry was valid from Xen's PoV but invalid from the CPU PoV for tracking purposes. As a consequence, this will allow a domain to continue to access the page after it was removed. Resolve the issue by always zeroing the entry if the LPAE bit is set or the entry is about to be removed. This is CVE-2022-23033 / XSA-393. Reported-by: Dmytro Firsov <Dmytro_Firsov@epam.com> Fixes: 2148a125b73b ("xen/arm: Track page accessed between batch of Set/Way operations") Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Julien Grall <jgrall@amazon.com> master commit: a428b913a002eb2b7425b48029c20a52eeee1b5a master date: 2022-01-25 13:25:01 +0100
2022-01-06 x86/spec-ctrl: Fix default calculation of opt_srb_lock (Andrew Cooper)
Since this logic was introduced, opt_tsx has become more complicated and shouldn't be compared to 0 directly. While there are no buggy logic paths, the correct expression is !(opt_tsx & 1) but the rtm_disabled boolean is easier and clearer to use. Fixes: 8fe24090d940 ("x86/cpuid: Rework HLE and RTM handling") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> master commit: 31f3bc97f4508687215e459a5e35676eecf1772b master date: 2022-01-05 09:44:26 +0000
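To illustrate why comparing against 0 and testing bit 0 are not the same check, here is a small C sketch. It assumes only what the message states, namely that `!(opt_tsx & 1)` is the correct "TSX off" expression; the helper names and the sample value -2 are hypothetical and not taken from Xen. As the message notes, no existing logic path actually hits the difference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old-style check: treats only the exact value 0 as "TSX off". */
static bool toy_tsx_off_naive(int8_t opt_tsx)
{
    return opt_tsx == 0;
}

/* Check named in the commit message: only bit 0 carries on/off. */
static bool toy_tsx_off_bit0(int8_t opt_tsx)
{
    return !(opt_tsx & 1);
}
```

For a value such as -2 (bit 0 clear, but nonzero) the two helpers disagree, which is why a dedicated boolean like rtm_disabled is easier to reason about.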
2022-01-06 x86/cpuid: Fix TSXLDTRK definition (Andrew Cooper)
TSXLDTRK lives in CPUID leaf 7[0].edx, not 7[0].ecx. Bit 16 in ecx is LA57. Fixes: a6d1b558471f ("x86emul: support X{SUS,RES}LDTRK") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> master commit: 249e0f1d8f203188ccdcced5a05c2149739e1566 master date: 2021-12-14 12:30:48 +0000
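The distinction can be shown with a short C sketch that decodes an already-fetched copy of CPUID leaf 7, subleaf 0. The `toy_leaf7` struct and helper names are hypothetical; only the bit positions come from the message above (bit 16 of edx for TSXLDTRK, bit 16 of ecx for LA57).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the registers returned by CPUID leaf 7[0]. */
struct toy_leaf7 {
    uint32_t ebx, ecx, edx;
};

/* TSXLDTRK: leaf 7[0].edx, bit 16. */
static bool toy_has_tsxldtrk(const struct toy_leaf7 *l)
{
    return (l->edx >> 16) & 1;
}

/* LA57: leaf 7[0].ecx, bit 16 (the bit the old definition pointed at). */
static bool toy_has_la57(const struct toy_leaf7 *l)
{
    return (l->ecx >> 16) & 1;
}
```

With the old definition, a CPU advertising LA57 would have been misreported as supporting TSXLDTRK.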
2022-01-06 revert "hvmloader: PA range 0xfc000000-0xffffffff should be UC" (Jan Beulich)
This reverts commit c22bd567ce22f6ad9bd93318ad0d7fd1c2eadb0d. While its description is correct from an abstract or real hardware pov, the range is special inside HVM guests. The range being UC in particular gets in the way of OVMF, which places itself at [FFE00000,FFFFFFFF]. While this is benign to epte_get_entry_emt() as long as the IOMMU isn't enabled for a guest, it becomes a very noticeable problem otherwise: It takes about half a minute for OVMF to decompress itself into its designated address range. And even beyond OVMF there's no reason to have e.g. the ACPI memory range marked UC. Fixes: c22bd567ce22 ("hvmloader: PA range 0xfc000000-0xffffffff should be UC") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> master commit: ea187c0b7a73c26258c0e91e4f3656989804555f master date: 2021-12-17 08:56:15 +0100
2022-01-06 x86/HVM: permit CLFLUSH{,OPT} on execute-only code segments (Jan Beulich)
Both SDM and PM explicitly permit this. Fixes: 52dba7bd0b36 ("x86emul: generalize wbinvd() hook") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Paul Durrant <paul@xen.org> master commit: df3e1a5efe700a9f59eced801cac73f9fd02a0e2 master date: 2021-12-10 14:03:56 +0100
2022-01-06 x86: avoid wrong use of all-but-self IPI shorthand (Jan Beulich)
With "nosmp" I did observe a flood of "APIC error on CPU0: 04(04), Send accept error" log messages on an AMD system. And rightly so - nothing excludes the use of the shorthand in send_IPI_mask() in this case. Set "unaccounted_cpus" to "true" also when command line restrictions are the cause. Note that PV-shim mode is unaffected by this change, first and foremost because "nosmp" and "maxcpus=" are ignored in this case. Fixes: 5500d265a2a8 ("x86/smp: use APIC ALLBUT destination shorthand when possible") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> master commit: 7621880de0bb40bae6436a5b106babc0e4718f4d master date: 2021-12-10 10:26:52 +0100
2022-01-06 x86/HVM: fail virt-to-linear conversion for insn fetches from non-code segments (Jan Beulich)
Just like (in protected mode) reads may not go to exec-only segments and writes may not go to non-writable ones, insn fetches may not access data segments. Fixes: 623e83716791 ("hvm: Support hardware task switching") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> master commit: 311297f4216a4387bdae6df6cfbb1f5edb06618a master date: 2021-12-06 14:15:05 +0100
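The three rules named above can be sketched in C against the 4-bit type field of an x86 code/data segment descriptor (bit 3: code segment, bit 1: readable for code or writable for data). The enum and function names are hypothetical, and the sketch ignores everything else the real virt-to-linear conversion checks, such as limits, privilege levels, and non-protected modes.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_access { TOY_READ, TOY_WRITE, TOY_FETCH };

/* Check one access kind against a code/data segment descriptor type. */
static bool toy_seg_access_ok(unsigned int type, enum toy_access acc)
{
    bool is_code = (type & 0x8) != 0;
    bool rw_bit  = (type & 0x2) != 0;  /* readable (code) / writable (data) */

    switch (acc) {
    case TOY_READ:   /* no reads from execute-only code segments */
        return !is_code || rw_bit;
    case TOY_WRITE:  /* writes only to writable data segments */
        return !is_code && rw_bit;
    case TOY_FETCH:  /* insn fetches only from code segments */
        return is_code;
    }

    return false;
}
```

The `TOY_FETCH` case corresponds to the check this change adds; the other two cases correspond to the checks the message describes as already being in place.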
2022-01-06 x86/Viridian: fix error code use (Jan Beulich)
Both the wrong use of HV_STATUS_* and the return type of hv_vpset_to_vpmask() can lead to viridian_hypercall()'s ASSERT_UNREACHABLE() triggering when translating error codes from Xen to Viridian representation. Fixes: b4124682db6e ("viridian: add ExProcessorMasks variants of the flush hypercalls") Fixes: 9afa867d42ba ("viridian: add ExProcessorMasks variant of the IPI hypercall") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> master commit: 857fee77845be0c5c35fd51bac64455369d32a6f master date: 2021-11-24 11:09:56 +0100
2022-01-06 VT-d: don't leak domid mapping on error path (Jan Beulich)
While domain_context_mapping() invokes domain_context_unmap() in a sub-case of handling DEV_TYPE_PCI when encountering an error, thus avoiding a leak, individual calls to domain_context_mapping_one() aren't similarly covered. Such a leak might persist until domain destruction. Leverage that these cases can be recognized by pdev being non-NULL. Fixes: dec403cc668f ("VT-d: fix iommu_domid for PCI/PCIx devices assignment") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> master commit: e6252a51faf42c892eb5fc71f8a2617580832196 master date: 2021-11-24 11:07:11 +0100
2022-01-06 VT-d: split domid map cleanup check into a function (Jan Beulich)
This logic will want invoking from elsewhere. No functional change intended. Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> master commit: 9fdc10abe9457e4c9879a266f82372cb08e88ffb master date: 2021-11-24 11:06:20 +0100
2021-12-13 docs/efi: Fix wrong compatible in dts example (Luca Fancellu)
The example in section "UEFI boot and dom0less on ARM" has a wrong compatible for the DTB passthrough, it is "ramdisk" instead of "device-tree". This patch fixes the example. Fixes: a1743fc3a9fe ("arm/efi: Use dom0less configuration when using EFI boot") Signed-off-by: Luca Fancellu <luca.fancellu@arm.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com> Acked-by: Julien Grall <jgrall@amazon.com> (cherry picked from commit 620ed2c8c777282154a91abca69083a40c9d918d)
2021-12-06 REAMDE: trim over-long lines around figlet (Ian Jackson)
Signed-off-by: Ian Jackson <iwj@xenproject.org>