Age | Commit message | Author |
|
When the system turns NUMA off or lacks NUMA support,
Xen fakes a NUMA node so that the system works as a single-node
NUMA system.
In this case the memory node map doesn't need to be allocated
from boot pages; it uses _memnodemap directly. But
memnodemapsize hasn't been set, so Xen should hit the assertion in phys_to_nid.
Because x86 was using an empty macro "VIRTUAL_BUG_ON" in place of
ASSERT, this bug is not triggered on x86.
Actually, Xen only uses 1 slot of memnodemap in this case,
so this patch sets memnodemap[0] to 0 and memnodemapsize to 1
to fix it.
Signed-off-by: Wei Chen <wei.chen@arm.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
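As a rough illustration of the failure mode described above, the following sketch models a node lookup guarded by a bounds assertion. The names idx_to_nid(), setup_fake_node() and NODEMAP_STATIC_SLOTS are simplified stand-ins chosen for this example, not the actual Xen definitions.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

typedef uint8_t nodeid_t;

#define NODEMAP_STATIC_SLOTS 64      /* hypothetical size of the static map */

static nodeid_t _memnodemap[NODEMAP_STATIC_SLOTS];
static nodeid_t *memnodemap = _memnodemap;
static unsigned long memnodemapsize; /* stays 0 unless explicitly set */
static unsigned int memnode_shift;

/* Set up the single fake node: one slot, which maps to node 0. */
static void setup_fake_node(void)
{
    /* With the largest valid shift, every index lands in slot 0. */
    memnode_shift = sizeof(unsigned long) * CHAR_BIT - 1;
    memnodemap[0] = 0;
    memnodemapsize = 1;
}

/* Look up the node for a page index, asserting the slot is in range. */
static nodeid_t idx_to_nid(unsigned long idx)
{
    unsigned long slot = idx >> memnode_shift;

    /* Fires for every lookup if memnodemapsize was left at 0. */
    assert(slot < memnodemapsize);
    return memnodemap[slot];
}
```

Without the two assignments in setup_fake_node(), the very first lookup would trip the assertion, which is the situation the message describes for builds where the check is not compiled out.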
|
|
This patch adds a new domain_tot_pages() inline helper function into
sched.h, which will be needed by a subsequent patch.
No functional change.
NOTE: While modifying the comment for 'tot_pages' in sched.h this patch
makes some cosmetic fixes to surrounding comments.
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: George Dunlap <George.Dunlap@eu.citrix.com>
Acked-by: Julien Grall <julien@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
|
|
Move the parameter related definitions from init.h into a new header
file param.h. This will avoid include hell when new dependencies are
added to parameter definitions.
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <julien@xen.org>
Acked-by: Dario Faggioli <dfaggioli@suse.com>
Acked-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
|
|
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
|
|
With almost all users of keyhandler_scratch gone, clean up the 3 remaining
users and drop the buffer.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
|
|
For different keyhandlers, replace the hex-with-delimiter representation
of time with PRI_stime, which is currently decimal ns.
Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
|
|
Callers are inconsistent with whether they pass a newline to panic(),
including adjacent calls in the same function using different styles.
panic() not expecting a newline is inconsistent with most other printing
functions, which is most likely why we've gained so many inconsistencies.
Switch panic() to expect a newline, and update all callers which currently
lack a newline to include one.
This actually reduces the size of .rodata (0x07e3e8 down to 0x07e3a8) because
a number of strings are passed to both panic() and printk(). As they
previously differed by \n alone, they couldn't be merged.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <julien.grall@arm.com>
|
|
Most of the users of page_to_mfn and mfn_to_page are either overriding
the macros to make them work with mfn_t or use mfn_x/_mfn because the
rest of the function uses mfn_t.
So make page_to_mfn and mfn_to_page return mfn_t by default. The __*
versions are now dropped as this patch will convert all the remaining
non-typesafe callers.
Only reasonable clean-ups are done in this patch. The rest will use
_mfn/mfn_x for the time being.
Lastly, domain_page_to_mfn is also converted to use mfn_t given that
most of the callers are now switched to _mfn(domain_page_to_mfn(...)).
Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
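The typesafe pattern referred to above can be sketched as follows. This is a minimal model, assuming a struct-wrapped frame number; frame_table, FRAME_COUNT and the one-field struct page_info are hypothetical placeholders, and the real Xen macros differ in detail.

```c
#include <assert.h>

/* Wrapping the raw number in a struct turns accidental mixing with
 * plain integers into a compile-time error. */
typedef struct { unsigned long mfn; } mfn_t;

static inline mfn_t _mfn(unsigned long n)  { return (mfn_t){ n }; }
static inline unsigned long mfn_x(mfn_t m) { return m.mfn; }

/* Hypothetical stand-ins for the frame table. */
#define FRAME_COUNT 16
struct page_info { int placeholder; };
static struct page_info frame_table[FRAME_COUNT];

/* Both conversions work in terms of mfn_t rather than unsigned long. */
static inline mfn_t page_to_mfn(const struct page_info *pg)
{
    return _mfn((unsigned long)(pg - frame_table));
}

static inline struct page_info *mfn_to_page(mfn_t mfn)
{
    assert(mfn_x(mfn) < FRAME_COUNT);
    return &frame_table[mfn_x(mfn)];
}
```

A caller that still holds a raw number wraps it with _mfn() before the call, and one that needs a raw number unwraps the result with mfn_x(), which is the interim state the message mentions.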
|
|
At the moment, most of the callers will have to use mfn_x. However
follow-up patches will remove some of them by propagating type safety a
bit further.
Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
|
|
Modify the custom parameter parsing routines in:
xen/arch/x86/numa.c
to indicate whether the parameter value was parsed successfully.
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
|
|
alloc_boot_pages will panic if it is not possible to allocate. So the
check in the caller is pointless.
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
|
|
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
|
|
- drop the only left CONFIG_NUMA conditional (this is always true)
- drop struct node_data's node_id field (being always equal to the
node_data[] array index used)
- don't open code node_{start,end}_pfn() nor node_spanned_pages()
except when used as lvalues (those could be converted too, but this
seems a little awkward)
- no longer open code pfn_to_paddr() in an expression being modified
anyway
- make dump less verbose by logging actual vs intended node IDs only
when they don't match
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
|
|
When node zero has no memory, the DMA bit width will end up getting set
to 9, which is obviously not helpful to hold back a reasonable amount
of low enough memory for Dom0 to use for DMA purposes. Find the lowest
node with memory below 4Gb instead.
Introduce arch_get_dma_bitsize() to keep this arch-specific logic out
of common code.
Also adjust the original calculation: I think the subtraction of 1
should have been part of the flsl() argument rather than getting
applied to its result. And while previously the division by 4 was valid
to be done on the flsl() result, this now also needs to be converted,
as it should only be applied to the spanned pages value.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
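The arithmetic can be worked through with a small sketch. Both formulas below are reconstructed from the description above and are assumptions rather than the committed code; dma_bits_old(), dma_bits_new() and min_u() are hypothetical names, and flsl() is reimplemented locally.

```c
#include <assert.h>

#define PAGE_SHIFT 12

/* Find last set bit, 1-based; returns 0 when x is 0. */
static unsigned int flsl(unsigned long x)
{
    unsigned int r = 0;

    while (x) {
        r++;
        x >>= 1;
    }
    return r;
}

static unsigned int min_u(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* Old form: the "- 1" and the division by 4 (as "- 2") are applied to
 * the flsl() result. Unsigned wrap-around makes an empty node yield 9. */
static unsigned int dma_bits_old(unsigned long spanned_pages)
{
    return min_u(flsl(spanned_pages) - 1 + PAGE_SHIFT - 2, 32);
}

/* Adjusted form: both corrections move into the flsl() argument, and
 * the division only applies to the spanned pages value. */
static unsigned int dma_bits_new(unsigned long start_pfn,
                                 unsigned long spanned_pages)
{
    assert(start_pfn + spanned_pages / 4 > 0);
    return min_u(flsl(start_pfn + spanned_pages / 4 - 1) + PAGE_SHIFT, 32);
}
```

For a node starting at pfn 0 and spanning 0x100000 pages (4GiB), both forms give 30 bits, while a node spanning 0 pages gives 9 with the old form.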
|
|
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
|
|
While reviewing those patches I noticed a few types that could do with
tweaking.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
|
|
x86 is the only architecture which uses __devinitdata, and also has
CONFIG_HOTPLUG enabled, making the annotation empty.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
|
|
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
|
|
- don't overrun apicid_to_node[] (possible in the x2APIC case)
- don't limit number of processor related SRAT entries we can consume
- make acpi_numa_{processor,x2apic}_affinity_init() as similar to one
another as possible
- print APIC IDs in hex (to ease matching with other log messages), at
once making legacy and x2APIC ones distinguishable (by width)
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
|
|
struct keyhandler does not contain much information, and requires a lot
of boilerplate to use. It is far more convenient to have
register_keyhandler() take each piece of information as a parameter,
especially when introducing temporary debugging keyhandlers.
This in turn allows struct keyhandler itself to become private to
keyhandler.c and for the key_table to become more efficient.
key_table doesn't need to contain 256 entries; all keys are ASCII which
limits them to 7 bits of index, rather than 8. It can also become a
straight array, rather than an array of pointers. The overall effect of
this is the key_table grows in size by 50%, but there are no longer
24-byte keyhandler structures all over the data section.
All of the key_table entries in keyhandler.c can be initialised at
compile time rather than runtime.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
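A minimal sketch of the resulting shape, assuming a simplified handler signature: the field layout, handle_keypress() and demo_handler() are hypothetical and only illustrate a 128-entry straight array filled through a parameter-taking registration function.

```c
#include <assert.h>
#include <stddef.h>

typedef void keyhandler_fn_t(unsigned char key);

/* Handler record, private to this file in the scheme described above. */
struct keyhandler {
    keyhandler_fn_t *fn;
    const char *desc;
    int diagnostic;
};

/* All keys are ASCII, so 7 bits of index are enough. */
#define KEY_TABLE_SIZE 128
static struct keyhandler key_table[KEY_TABLE_SIZE];

/* Registration takes each piece of information as a parameter. */
static void register_keyhandler(unsigned char key, keyhandler_fn_t *fn,
                                const char *desc, int diagnostic)
{
    assert(key < KEY_TABLE_SIZE);
    assert(key_table[key].fn == NULL);   /* refuse double registration */
    key_table[key] = (struct keyhandler){ fn, desc, diagnostic };
}

/* Dispatch a key press; returns 1 if a handler ran, 0 otherwise. */
static int handle_keypress(unsigned char key)
{
    if (key >= KEY_TABLE_SIZE || key_table[key].fn == NULL)
        return 0;
    key_table[key].fn(key);
    return 1;
}

/* Example handler used to exercise the table. */
static unsigned int demo_hits;
static void demo_handler(unsigned char key)
{
    (void)key;
    demo_hits++;
}
```

A temporary debugging handler then needs a single call such as register_keyhandler('u', demo_handler, "demo", 1), with no separate structure to define.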
|
|
... by grouping sequences of contiguous CPUs.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
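Grouping contiguous CPUs into ranges could look like the sketch below. format_cpu_ranges() is a hypothetical helper that works on a single unsigned long mask and writes to a caller-supplied buffer; the committed code operates on Xen's own CPU mask types and log functions.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Format the set bits of mask as comma-separated ranges, for example
 * "0-3,8-11,14". Returns the number of characters written. */
static size_t format_cpu_ranges(unsigned long mask, unsigned int nr_cpus,
                                char *buf, size_t len)
{
    size_t used = 0;
    unsigned int cpu = 0;

    assert(len > 0);
    assert(nr_cpus <= sizeof(unsigned long) * CHAR_BIT);
    buf[0] = '\0';

    while (cpu < nr_cpus) {
        unsigned int first, last;
        int n;

        if (!(mask & (1UL << cpu))) {
            cpu++;
            continue;
        }

        /* Extend the run for as long as the next CPU is also set. */
        first = cpu;
        while (cpu + 1 < nr_cpus && (mask & (1UL << (cpu + 1))))
            cpu++;
        last = cpu++;

        if (first == last)
            n = snprintf(buf + used, len - used, "%s%u",
                         used ? "," : "", first);
        else
            n = snprintf(buf + used, len - used, "%s%u-%u",
                         used ? "," : "", first, last);

        if (n < 0 || (size_t)n >= len - used)
            break;                       /* output truncated, stop here */
        used += (size_t)n;
    }
    return used;
}
```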
|
|
Use u8-sized node IDs and unsigned PXMs consistently throughout
code (and introduce nodeid_t type).
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
|
|
Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
|
|
Convert to Xen coding style from mixed one.
Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
|
|
Signed-off-by: Keir Fraser <keir@xen.org>
|
|
By showing the number of free pages on each node.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Committed-by: Keir Fraser <keir@xen.org>
|
|
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
|
|
Use cpumask_copy() instead of direct variable assignments for copying
CPU masks. While direct assignments are not a problem when both sides
are variables actually defined as cpumask_t (except for possibly
copying *much* more than would actually need to be copied), they must
not happen when the original variable is of type cpumask_var_t (which
may have less space allocated to it than a full cpumask_t). Eliminate
as many of such assignments as possible (in several cases it's even
possible to collapse two operations [copy then clear one bit] into one
[cpumask_andnot()]), and thus set the way for reducing the allocation
size in alloc_cpumask_var().
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
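Why a plain assignment is unsafe for a dynamically sized mask can be sketched with an ordinary bitmap. The mask_*() helpers and nr_mask_bits below are hypothetical stand-ins, not the Xen cpumask API; they only show that copy and and-not must be bounded by the allocated size rather than by a compile-time maximum.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define BITS_PER_LONG    (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Hypothetical runtime bit count; the allocation covers only this much. */
static unsigned int nr_mask_bits = 8;

static size_t mask_bytes(void)
{
    return BITS_TO_LONGS(nr_mask_bits) * sizeof(unsigned long);
}

/* Allocate a zeroed mask sized for nr_mask_bits, not for a fixed maximum. */
static unsigned long *mask_alloc(void)
{
    return calloc(1, mask_bytes());
}

/* Copy exactly the allocated size; a struct assignment would copy the
 * full static size and read or write past a smaller allocation. */
static void mask_copy(unsigned long *dst, const unsigned long *src)
{
    memcpy(dst, src, mask_bytes());
}

/* dst = a & ~b, replacing "copy, then clear bits" with one operation. */
static void mask_andnot(unsigned long *dst, const unsigned long *a,
                        const unsigned long *b)
{
    size_t i;

    for (i = 0; i < BITS_TO_LONGS(nr_mask_bits); i++)
        dst[i] = a[i] & ~b[i];
}
```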
|
|
The former is the runtime equivalent of NR_CPUS (and users of NR_CPUS,
where necessary, get adjusted accordingly), while the latter is for the
sole use of determining the allocation size when dynamically allocating
CPU masks (done later in this series).
Adjust accessors to use either of the two to bound their bitmap
operations - which one gets used depends on whether accessing the bits
in the gap between nr_cpu_ids and nr_cpumask_bits is benign but more
efficient.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
|
|
The specifier only needs to be added to the function's definition.
At the same time, fix init_cpu_to_node() to be __init rather than
__devinit (it is only called at boot time).
Signed-off-by: Keir Fraser <keir@xen.org>
|
|
Signed-off-by: Keir Fraser <keir@xen.org>
|
|
This also includes the removal of some entirely unused functions.
The patch builds upon the makefile adjustments done in the earlier
sent patch titled "move more kernel decompression bits to .init.*
sections".
Signed-off-by: Jan Beulich <jbeulich@novell.com>
|
|
There are a few places in Xen where we walk a domain's page lists
without holding the page_alloc lock. They race with updates to the
page lists, which are normally rare but can be quite common under PoD
when the domain is close to its memory limit and the PoD reclaimer is
busy. This patch protects those places by taking the page_alloc lock.
I think this is OK for the two debug-key printouts - they don't run
from irq context and look deadlock-free. The tboot change seems safe
too unless tboot shutdown functions are called from irq context or
with the page_alloc lock held. The p2m one is the scariest but there
are already code paths in PoD that take the page_alloc lock with the
p2m lock held so it's no worse than existing code.
Signed-off-by: Tim Deegan <Tim.Deegan@citrix.com>
|
|
...and fix up the ensuing fall-out of implicit dependencies
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
|
|
Signed-off-by: Jan Beulich <jbeulich@novell.com>
|
|
c/s 20599 caused the hash shift to become significantly smaller on
systems with an SRAT like this
(XEN) SRAT: Node 0 PXM 0 0-a0000
(XEN) SRAT: Node 0 PXM 0 100000-80000000
(XEN) SRAT: Node 1 PXM 1 80000000-d0000000
(XEN) SRAT: Node 1 PXM 1 100000000-130000000
Combined with the static size of the memnodemap[] array, NUMA
therefore got disabled on such systems. The backport from Linux was really
incomplete, as Linux much earlier had already introduced a dynamically
allocated memnodemap[].
Further, doing to/from pdx translations on addresses just past a valid
range is not correct, as it may strip/fail to insert non-zero bits in
this case.
Finally, using 63 as the cover-it-all shift value is invalid on 32bit,
since pdx values are unsigned long.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
|
|
In changeset 20599, a node that has no memory populated is marked
parsed, but not online. However, if there are CPUs populated in this
node, the corresponding CPU mapping (i.e. cpu_to_node) is still
set up to point at the offline node, which causes trouble for memory
allocation.
This patch changes init_cpu_to_node() and srat_detect_node() to
take the offlined-node situation into account.
Now apicid_to_node is only used to keep the mapping between
cpu/node provided by the BIOS, and should not be used for memory
allocation anymore.
One thing left is to update the cpu_to_node mapping after memory
is populated by memory hot-add.
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
This is a reintroduction of 20726:ddb8c5e798f9, which I incorrectly
reverted in 20745:d3215a968db9
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
|
|
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
|
|
In changeset 20599, a node that has no memory populated is marked
parsed, but not online. However, if there are CPUs populated in this
node, the corresponding CPU mapping (i.e. cpu_to_node) is still
set up to point at the offline node, which causes trouble for memory
allocation.
This patch changes init_cpu_to_node() and srat_detect_node() to
take the offlined-node situation into account.
Now apicid_to_node is only used to keep the mapping between
cpu/node provided by the BIOS, and should not be used for memory
allocation anymore.
One thing left is to update the cpu_to_node mapping after memory
is populated by memory hot-add.
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
|
|
Currently the Xen hypervisor uses nodes to keep the start/end address of
each node. It assumes memory among nodes has no overlap; this is not always
true, especially if we have memory hotplug support in the system.
This patch backports the Linux kernel's memblks to support overlap
among nodes. The memblks are used both for checking conflicts and for
calculating memnode_shift.
Also, currently if there is no memory populated in a node when the system
boots, the node will be unparsed later, and the corresponding CPU's
NUMA information will be removed as well. This patch keeps the CPU
information.
One thing to note is that currently we calculate memnode_shift with all
memory, including un-populated ranges. This should work if the smallest
chunk is not too small. Another option would be flags in the page_info
structure, etc.
The memnodemap is changed from paddr to pdx, both to save space and
because currently most accesses are by pfn.
A mem_hotplug flag is added for the case where there is a hotplug memory range.
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
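The conflict check on memory blocks can be sketched as a simple interval test. struct memblk and conflicting_memblk() below are hypothetical simplifications for illustration; the backported code may differ in field names and in how it treats exact duplicates.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t paddr_t;

/* One memory block: the half-open range [start, end) owned by node nid. */
struct memblk {
    paddr_t start, end;
    int nid;
};

/* Return the index of the first recorded block overlapping [start, end),
 * or -1 if the new range conflicts with none of them. */
static int conflicting_memblk(const struct memblk *blks, int nr,
                              paddr_t start, paddr_t end)
{
    int i;

    assert(start < end);
    for (i = 0; i < nr; i++) {
        if (blks[i].start == blks[i].end)
            continue;                    /* skip empty blocks */
        if (blks[i].start < end && start < blks[i].end)
            return i;
    }
    return -1;
}
```

Tracking blocks rather than one start/end pair per node lets a node own several disjoint ranges, so a range belonging to another node can sit between them without being reported as a conflict.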
|
|
I did some benchmark runs (lmbench & kernel compile) with a number of
guests running in parallel to compare the performance of numa=on vs.
numa=off. As soon as one starts to load the machine, the performance
goes down in the numa=off case. The tests were done on an 8-node
machine (4 cores each). lmbench (actually copying large amounts of
memory) shows a dramatic drop, but I even noticed a significant
performance decrease for a tmpfs based Linux kernel compile. Here is a
summary of the data:
lmbench's rd benchmark (normalized to native Linux (=100)):
guests        numa=off               numa=on         avg increase
          min    avg    max    min    avg    max
     1          78.0                 102.3
     7   37.4   45.6   62.0   90.6  102.3  110.9      124.4%
    15   21.0   25.8   31.7   41.7   48.7   54.1       88.2%
    23   13.4   17.5   23.2   25.0   28.0   30.1       60.2%
kernel compile in tmpfs, 1 VCPU, 2GB RAM, average of elapsed time:
guests   numa=off    numa=on   increase
     1    480.610    464.320       3.4%
     7    482.109    461.721       4.2%
    15    515.297    477.669       7.3%
    23    548.427    495.180       9.7%
again with 2 VCPUs and make -j2:
guests   numa=off    numa=on   increase
     1    264.580    261.690       1.1%
     7    279.763    258.907       7.7%
    15    330.385    272.762      17.4%
    23    463.510    390.547      15.7% (46 VCPUs on 32pCPUs)
Selected tests on a 4-node machine showed similar behavior (7.9 %
increase with 6 parallel guests on the 2 VCPU kernel compile
benchmark).
Note that this does not affect non-NUMA machines at all, since NUMA
will be turned off again by the code if no NUMA topology is detected.
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
|
|
This patch adds CPU hot-add support to the system.
a) It marks all CPUs as possible when booting, if CONFIG_HOTPLUG_CPU is
set. BTW, this will increase the per_cpu area.
b) When a CPU is added through a hypercall, the CPU is marked as
present and offline, and the NUMA information is set up if NUMA is
supported. The CPU is then brought online explicitly by dom0.
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
|
|
Make various data items const or __read_mostly where
possible/reasonable.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
|
|
Add a new keyhandler that triggers all the side-effect-free
keyhandlers. This lets automated tests (and users) log the full set
of keyhandlers without having to be aware of which ones might reboot
the host.
Signed-off-by: Tim Deegan <Tim.Deegan@citrix.com>
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
|
|
Signed-off-by: Yang Xiaowei <xiaowei.yang@intel.com>
|
|
Unless more than 16Tb are going to ever be supported in Xen, this will
allow reducing the linked list entries in struct page_info from 16 to
8 bytes.
This doesn't modify struct shadow_page_info, yet, so in order to meet
the constraints of that 'mirror' structure the list entry gets
artificially forced to be 16 bytes in size. That workaround will be
removed in a subsequent patch.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
|
|
This patch collects the memory location of each domain (how many pages
the domain has on each node) and displays it when the debug key is pressed.
Signed-off-by: Zhou Ting <ting.g.zhou@intel.com>
|
|
Signed-off-by: Dexuan Cui <dexuan.cui@intel.com>
|
|
If a memory address is above 4G, it will overflow in some NUMA code when
unsigned long is used to hold a physical address on the PAE architecture.
Replace "unsigned long" with paddr_t to avoid the overflow.
Signed-off-by: Duan Ronghui <ronghui.duan@intel.com>
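The overflow can be reproduced in isolation. In the sketch below, the 32-bit unsigned long of a PAE build is modelled explicitly with uint32_t so that the example behaves the same on any host; pfn_to_addr_ulong32() and pfn_to_paddr64() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Physical addresses need 64 bits even when unsigned long has only 32. */
typedef uint64_t paddr_t;

/* Models the problematic code: the shift happens in 32 bits, so any
 * address at or above 4G loses its upper bits. */
static uint32_t pfn_to_addr_ulong32(uint32_t pfn)
{
    return (uint32_t)(pfn << PAGE_SHIFT);
}

/* Widening before the shift keeps the full physical address. */
static paddr_t pfn_to_paddr64(uint32_t pfn)
{
    return (paddr_t)pfn << PAGE_SHIFT;
}
```

For page frame 0x100000, which sits exactly at 4G, the 32-bit variant yields 0 while the widened variant yields 0x100000000.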
|
|
I don't know how significant this is (most of the NUMA node data seems
unused at this point), but anyway: enable proper operation of NUMA
emulation and the fake NUMA node in case there's no SRAT table on
x86-32. This will at least make the "Faking node ..." message not
print confusing information anymore.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
|