aboutsummaryrefslogtreecommitdiff
path: root/arch/powerpc/include
AgeCommit message (Collapse)Author
2014-02-06powerpc: Fix the setup of CPU-to-Node mappings during CPU onlineSrivatsa S. Bhat
commit d4edc5b6c480a0917e61d93d55531d7efa6230be upstream. On POWER platforms, the hypervisor can notify the guest kernel about dynamic changes in the cpu-numa associativity (VPHN topology update). Hence the cpu-to-node mappings that we got from the firmware during boot, may no longer be valid after such updates. This is handled using the arch_update_cpu_topology() hook in the scheduler, and the sched-domains are rebuilt according to the new mappings. But unfortunately, at the moment, CPU hotplug ignores these updated mappings and instead queries the firmware for the cpu-to-numa relationships and uses them during CPU online. So the kernel can end up assigning wrong NUMA nodes to CPUs during subsequent CPU hotplug online operations (after booting). Further, a particularly problematic scenario can result from this bug: On POWER platforms, the SMT mode can be switched between 1, 2, 4 (and even 8) threads per core. The switch to Single-Threaded (ST) mode is performed by offlining all except the first CPU thread in each core. Switching back to SMT mode involves onlining those other threads back, in each core. Now consider this scenario: 1. During boot, the kernel gets the cpu-to-node mappings from the firmware and assigns the CPUs to NUMA nodes appropriately, during CPU online. 2. Later on, the hypervisor updates the cpu-to-node mappings dynamically and communicates this update to the kernel. The kernel in turn updates its cpu-to-node associations and rebuilds its sched domains. Everything is fine so far. 3. Now, the user switches the machine from SMT to ST mode (say, by running ppc64_cpu --smt=1). This involves offlining all except 1 thread in each core. 4. The user then tries to switch back from ST to SMT mode (say, by running ppc64_cpu --smt=4), and this involves onlining those threads back. Since CPU hotplug ignores the new mappings, it queries the firmware and tries to associate the newly onlined sibling threads to the old NUMA nodes. This results in sibling threads within the same core getting associated with different NUMA nodes, which is incorrect. The scheduler's build-sched-domains code gets thoroughly confused with this and enters an infinite loop and causes soft-lockups, as explained in detail in commit 3be7db6ab (powerpc: VPHN topology change updates all siblings). So to fix this, use the numa_cpu_lookup_table to remember the updated cpu-to-node mappings, and use them during CPU hotplug online operations. Further, we also need to ensure that all threads in a core are assigned to a common NUMA node, irrespective of whether all those threads were online during the topology update. To achieve this, we take care not to use cpu_sibling_mask() since it is not hotplug invariant. Instead, we use cpu_first_sibling_thread() and set up the mappings manually using the 'threads_per_core' value for that particular platform. This helps us ensure that we don't hit this bug with any combination of CPU hotplug and SMT mode switching. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09powerpc: Fix bad stack check in exception entryMichael Neuling
commit 90ff5d688e61f49f23545ffab6228bd7e87e6dc7 upstream. In EXCEPTION_PROLOG_COMMON() we check to see if the stack pointer (r1) is valid when coming from the kernel. If it's not valid, we die but with a nice oops message. Currently we allocate a stack frame (subtract INT_FRAME_SIZE) before we check to see if the stack pointer is negative. Unfortunately, this won't detect a bad stack where r1 is less than INT_FRAME_SIZE. This patch fixes the check to compare the modified r1 with -INT_FRAME_SIZE. With this, bad kernel stack pointers (including NULL pointers) are correctly detected again. Kudos to Paulus for finding this. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20powerpc: Fix PTE page address mismatch in pgtable ctor/dtorHong H. Pham
commit cf77ee54362a245f9a01f240adce03a06c05eb68 upstream. In pte_alloc_one(), pgtable_page_ctor() is passed an address that has not been converted by page_address() to the newly allocated PTE page. When the PTE is freed, __pte_free_tlb() calls pgtable_page_dtor() with an address to the PTE page that has been converted by page_address(). The mismatch in the PTE's page address causes pgtable_page_dtor() to access invalid memory, so resources for that PTE (such as the page lock) is not properly cleaned up. On PPC32, only SMP kernels are affected. On PPC64, only SMP kernels with 4K page size are affected. This bug was introduced by commit d614bb041209fd7cb5e4b35e11a7b2f6ee8f62b8 "powerpc: Move the pte free routines from common header". On a preempt-rt kernel, a spinlock is dynamically allocated for each PTE in pgtable_page_ctor(). When the PTE is freed, calling pgtable_page_dtor() with a mismatched page address causes a memory leak, as the pointer to the PTE's spinlock is bogus. On mainline, there isn't any immediately obvious symptoms, but the problem still exists here. Fixes: d614bb041209fd7c "powerpc: Move the pte free routes from common header" Cc: Paul Mackerras <paulus@samba.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Hong H. Pham <hong.pham@windriver.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-18compiler/gcc4: Add quirk for 'asm goto' miscompilation bugIngo Molnar
commit 3f0116c3238a96bc18ad4b4acefe4e7be32fa861 upstream. Fengguang Wu, Oleg Nesterov and Peter Zijlstra tracked down a kernel crash to a GCC bug: GCC miscompiles certain 'asm goto' constructs, as outlined here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58670 Implement a workaround suggested by Jakub Jelinek. Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Suggested-by: Jakub Jelinek <jakub@redhat.com> Reviewed-by: Richard Henderson <rth@twiddle.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20131015062351.GA4666@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-07powerpc: Work around gcc miscompilation of __pa() on 64-bitPaul Mackerras
commit bdbc29c19b2633b1d9c52638fb732bcde7a2031a upstream. On 64-bit, __pa(&static_var) gets miscompiled by recent versions of gcc as something like: addis 3,2,.LANCHOR1+4611686018427387904@toc@ha addi 3,3,.LANCHOR1+4611686018427387904@toc@l This ends up effectively ignoring the offset, since its bottom 32 bits are zero, and means that the result of __pa() still has 0xC in the top nibble. This happens with gcc 4.8.1, at least. To work around this, for 64-bit we make __pa() use an AND operator, and for symmetry, we make __va() use an OR operator. Using an AND operator rather than a subtraction ends up with slightly shorter code since it can be done with a single clrldi instruction, whereas it takes three instructions to form the constant (-PAGE_OFFSET) and add it on. (Note that MEMORY_START is always 0 on 64-bit.) Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-14powerpc/tm: Fix context switching TAR, PPR and DSCR SPRsMichael Neuling
commit 28e61cc466d8daace4b0f04ba2b83e0bd68f5832 upstream. If a transaction is rolled back, the Target Address Register (TAR), Processor Priority Register (PPR) and Data Stream Control Register (DSCR) should be restored to the checkpointed values before the transaction began. Any changes to these SPRs inside the transaction should not be visible in the abort handler. Currently Linux doesn't save or restore the checkpointed TAR, PPR or DSCR. If we preempt a processes inside a transaction which has modified any of these, on process restore, that same transaction may be aborted we but we won't see the checkpointed versions of these SPRs. This adds checkpointed versions of these SPRs to the thread_struct and adds the save/restore of these three SPRs to the treclaim/trechkpt code. Without this if any of these SPRs are modified during a transaction, users may incorrectly see a speculated SPR value even if the transaction is aborted. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-14powerpc: Save the TAR register earlierMichael Neuling
commit c2d52644e2da8a07ecab5ca62dd0bc563089e8dc upstream. This moves us to save the Target Address Register (TAR) a earlier in __switch_to. It introduces a new function save_tar() to do this. We need to save the TAR earlier as we will overwrite it in the transactional memory reclaim/recheckpoint path. We are going to do this in a subsequent patch which will fix saving the TAR register when it's modified inside a transaction. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-14powerpc: Rework setting up H/FSCR bit definitionsMichael Neuling
commit 74e400cee6c0266ba2d940ed78d981f1e24a8167 upstream. This reworks the Facility Status and Control Regsiter (FSCR) config bit definitions so that we can access the bit numbers. This is needed for a subsequent patch to fix the userspace DSCR handling. HFSCR and FSCR bit definitions are the same, so reuse them. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-11powerpc: VPHN topology change updates all siblingsRobert Jennings
commit 3be7db6ab45b21345386d1a466da133b19cde5e4 upstream. When an associativity level change is found for one thread, the siblings threads need to be updated as well. This is done today for PRRN in stage_topology_update() but is missing for VPHN in update_cpu_associativity_changes_mask(). This patch will correctly update all thread siblings during a topology change. Without this patch a topology update can result in a CPU in init_sched_groups_power() getting stuck indefinitely in a loop. This loop is built in build_sched_groups(). As a result of the thread moving to a node separate from its siblings the struct sched_group will have its next pointer set to point to itself rather than the sched_group struct of the next thread. This happens because we have a domain without the SD_OVERLAP flag, which is correct, and a topology that doesn't conform with reality (threads on the same core assigned to different numa nodes). When this list is traversed by init_sched_groups_power() it will reach the thread's sched_group structure and loop indefinitely; the cpu will be stuck at this point. The bug was exposed when VPHN was enabled in commit b7abef0 (v3.9). Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-04powerpc/modules: Module CRC relocation fix causes perf issuesAnton Blanchard
commit 0e0ed6406e61434d3f38fb58aa8464ec4722b77e upstream. Module CRCs are implemented as absolute symbols that get resolved by a linker script. We build an intermediate .o that contains an unresolved symbol for each CRC. genksysms parses this .o, calculates the CRCs and writes a linker script that "resolves" the symbols to the calculated CRC. Unfortunately the ppc64 relocatable kernel sees these CRCs as symbols that need relocating and relocates them at boot. Commit d4703aef (module: handle ppc64 relocating kcrctabs when CONFIG_RELOCATABLE=y) added a hook to reverse the bogus relocations. Part of this patch created a symbol at 0x0: # head -2 /proc/kallsyms 0000000000000000 T reloc_start c000000000000000 T .__start This reloc_start symbol is causing lots of confusion to perf. It thinks reloc_start is a massive function that stretches from 0x0 to 0xc000000000000000 and we get various cryptic errors out of perf, including: problem incrementing symbol count, skipping event This patch removes the reloc_start linker script label and instead defines it as PHYSICAL_START. We also need to wrap it with CONFIG_PPC64 because the ppc32 kernel can set a non zero PHYSICAL_START at compile time and we wouldn't want to subtract it from the CRCs in that case. Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-25powerpc/perf: Freeze PMC5/6 if we're not using themMichael Ellerman
commit 7a7a41f9d5b28ac3a916b057a7d3cd3f435ee9a6 upstream. On Power8 we can freeze PMC5 and 6 if we're not using them. Normally they run all the time. As noticed by Anshuman, we should unfreeze them when we disable the PMU as there are legacy tools which expect them to run all the time. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-25powerpc: Remove KVMTEST from RELON exception handlersMichael Ellerman
commit c9f69518e5f08170bc857984a077f693d63171df upstream. KVMTEST is a macro which checks whether we are taking an exception from guest context, if so we branch out of line and eventually call into the KVM code to handle the switch. When running real guests on bare metal (HV KVM) the hardware ensures that we never take a relocation on exception when transitioning from guest to host. For PR KVM we disable relocation on exceptions ourself in kvmppc_core_init_vm(), as of commit a413f47 "Disable relocation on exceptions whenever PR KVM is active". So convert all the RELON macros to use NOTEST, and drop the remaining KVM_HANDLER() definitions we have for 0xe40 and 0xe80. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-14Merge branch 'merge' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc fixes from Benjamin Herrenschmidt: "So here are 3 fixes still for 3.10. Fixes are simple, bugs are nasty (though not recent regressions, nasty enough) and all targeted at stable" * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: powerpc: Fix missing/delayed calls to irq_work powerpc: Fix emulation of illegal instructions on PowerNV platform powerpc: Fix stack overflow crash in resume_kernel when ftracing
2013-06-15powerpc: Fix stack overflow crash in resume_kernel when ftracingMichael Ellerman
It's possible for us to crash when running with ftrace enabled, eg: Bad kernel stack pointer bffffd12 at c00000000000a454 cpu 0x3: Vector: 300 (Data Access) at [c00000000ffe3d40] pc: c00000000000a454: resume_kernel+0x34/0x60 lr: c00000000000335c: performance_monitor_common+0x15c/0x180 sp: bffffd12 msr: 8000000000001032 dar: bffffd12 dsisr: 42000000 If we look at current's stack (paca->__current->stack) we see it is equal to c0000002ecab0000. Our stack is 16K, and comparing to paca->kstack (c0000002ecab3e30) we can see that we have overflowed our kernel stack. This leads to us writing over our struct thread_info, and in this case we have corrupted thread_info->flags and set _TIF_EMULATE_STACK_STORE. Dumping the stack we see: 3:mon> t c0000002ecab0000 [c0000002ecab0000] c00000000002131c .performance_monitor_exception+0x5c/0x70 [c0000002ecab0080] c00000000000335c performance_monitor_common+0x15c/0x180 --- Exception: f01 (Performance Monitor) at c0000000000fb2ec .trace_hardirqs_off+0x1c/0x30 [c0000002ecab0370] c00000000016fdb0 .trace_graph_entry+0xb0/0x280 (unreliable) [c0000002ecab0410] c00000000003d038 .prepare_ftrace_return+0x98/0x130 [c0000002ecab04b0] c00000000000a920 .ftrace_graph_caller+0x14/0x28 [c0000002ecab0520] c0000000000d6b58 .idle_cpu+0x18/0x90 [c0000002ecab05a0] c00000000000a934 .return_to_handler+0x0/0x34 [c0000002ecab0620] c00000000001e660 .timer_interrupt+0x160/0x300 [c0000002ecab06d0] c0000000000025dc decrementer_common+0x15c/0x180 --- Exception: 901 (Decrementer) at c0000000000104d4 .arch_local_irq_restore+0x74/0xa0 [c0000002ecab09c0] c0000000000fe044 .trace_hardirqs_on+0x14/0x30 (unreliable) [c0000002ecab0fb0] c00000000016fe3c .trace_graph_entry+0x13c/0x280 [c0000002ecab1050] c00000000003d038 .prepare_ftrace_return+0x98/0x130 [c0000002ecab10f0] c00000000000a920 .ftrace_graph_caller+0x14/0x28 [c0000002ecab1160] c0000000000161f0 .__ppc64_runlatch_on+0x10/0x40 [c0000002ecab11d0] c00000000000a934 .return_to_handler+0x0/0x34 --- Exception: 901 (Decrementer) at c0000000000104d4 .arch_local_irq_restore+0x74/0xa0 ... and so on __ppc64_runlatch_on() is called from RUNLATCH_ON in the exception entry path. At that point the irq state is not consistent, ie. interrupts are hard disabled (by the exception entry), but the paca soft-enabled flag may be out of sync. This leads to the local_irq_restore() in trace_graph_entry() actually enabling interrupts, which we do not want. Because we have not yet reprogrammed the decrementer we immediately take another decrementer exception, and recurse. The fix is twofold. Firstly make sure we call DISABLE_INTS before calling RUNLATCH_ON. The badly named DISABLE_INTS actually reconciles the irq state in the paca with the hardware, making it safe again to call local_irq_save/restore(). Although that should be sufficient to fix the bug, we also mark the runlatch routines as notrace. They are called very early in the exception entry and we are asking for trouble tracing them. They are also fairly uninteresting and tracing them just adds unnecessary overhead. [ This regression was introduced by fe1952fc0afb9a2e4c79f103c08aef5d13db1873 "powerpc: Rework runlatch code" by myself --BenH ] CC: <stable@vger.kernel.org> [v3.4+] Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-11Merge branch 'fixes' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm bugfixes from Gleb Natapov: "There is one more fix for MIPS KVM ABI here, MIPS and PPC build breakage fixes and a couple of PPC bug fixes" * 'fixes' of git://git.kernel.org/pub/scm/virt/kvm/kvm: kvm/ppc/booke64: Fix lazy ee handling in kvmppc_handle_exit() kvm/ppc/booke: Hold srcu lock when calling gfn functions kvm/ppc/booke64: Disable e6500 support kvm/ppc/booke64: Fix AltiVec interrupt numbers and build breakage mips/kvm: Use KVM_REG_MIPS and proper size indicators for *_ONE_REG kvm: Add definition of KVM_REG_MIPS KVM: add kvm_para_available to asm-generic/kvm_para.h
2013-06-11kvm/ppc/booke64: Fix AltiVec interrupt numbers and build breakageMihai Caraman
Interrupt numbers defined for Book3E follows IVORs definition. Align BOOKE_INTERRUPT_ALTIVEC_UNAVAIL and BOOKE_INTERRUPT_ALTIVEC_ASSIST to this rule which also fixes the build breakage. IVORs 32 and 33 are shared so reflect this in the interrupts naming. This fixes a build break for 64-bit booke KVM. Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com> Signed-off-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-10powerpc/hw_breakpoints: Add DABRX cpu feature to fix 32-bit regressionMichael Neuling
When introducing support for DABRX in 4474ef0, we broke older 32-bit CPUs that don't have that register. Some CPUs have a DABR but not DABRX. Configuration are: - No 32bit CPUs have DABRX but some have DABR. - POWER4+ and below have the DABR but no DABRX. - 970 and POWER5 and above have DABR and DABRX. - POWER8 has DAWR, hence no DABRX. This introduces CPU_FTR_DABRX and sets it on appropriate CPUs. We use the top 64 bits for CPU FTR bits since only 64 bit CPUs have this. Processors that don't have the DABRX will still work as they will fall back to software filtering these breakpoints via perf_exclude_event(). Signed-off-by: Michael Neuling <mikey@neuling.org> Reported-by: "Gorelik, Jacob (335F)" <jacob.gorelik@jpl.nasa.gov> cc: stable@vger.kernel.org (v3.9 only) Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-01powerpc/kvm/book3s: Add support for H_IPOLL and H_XIRR_X in XICS emulationPaul Mackerras
This adds the remaining two hypercalls defined by PAPR for manipulating the XICS interrupt controller, H_IPOLL and H_XIRR_X. H_IPOLL returns information about the priority and pending interrupts for a virtual cpu, without changing any state. H_XIRR_X is like H_XIRR in that it reads and acknowledges the highest-priority pending interrupt, but it also returns the timestamp (timebase register value) from when the interrupt was first received by the hypervisor. Currently we just return the current time, since we don't do any software queueing of virtual interrupts inside the XICS emulation code. These hcalls are not currently used by Linux guests, but may be in future. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-01powerpc/pseries: Kill all prefetch streams on context switchMichael Neuling
On context switch, we should have no prefetch streams leak from one userspace process to another. This frees up prefetch resources for the next process. Based on patch from Milton Miller. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-01powerpc/tm: Fix userspace stack corruption on signal delivery for active ↵Michael Neuling
transactions When in an active transaction that takes a signal, we need to be careful with the stack. It's possible that the stack has moved back up after the tbegin. The obvious case here is when the tbegin is called inside a function that returns before a tend. In this case, the stack is part of the checkpointed transactional memory state. If we write over this non transactionally or in suspend, we are in trouble because if we get a tm abort, the program counter and stack pointer will be back at the tbegin but our in memory stack won't be valid anymore. To avoid this, when taking a signal in an active transaction, we need to use the stack pointer from the checkpointed state, rather than the speculated state. This ensures that the signal context (written tm suspended) will be written below the stack required for the rollback. The transaction is aborted becuase of the treclaim, so any memory written between the tbegin and the signal will be rolled back anyway. For signals taken in non-TM or suspended mode, we use the normal/non-checkpointed stack pointer. Tested with 64 and 32 bit signals Signed-off-by: Michael Neuling <mikey@neuling.org> Cc: <stable@vger.kernel.org> # v3.9 Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-01powerpc/tm: Move TM abort cause codes to uapiMichael Neuling
These cause codes are usable by userspace, so let's export to uapi. Signed-off-by: Michael Neuling <mikey@neuling.org> Cc: <stable@vger.kernel.org> # v3.9 Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-01powerpc/tm: Abort on emulation and alignment faultsMichael Neuling
If we are emulating an instruction inside an active user transaction that touches memory, the kernel can't emulate it as it operates in transactional suspend context. We need to abort these transactions and send them back to userspace for the hardware to rollback. We can service these if the user transaction is in suspend mode, since the kernel will operate in the same suspend context. This adds a check to all alignment faults and to specific instruction emulations (only string instructions for now). If the user process is in an active (non-suspended) transaction, we abort the transaction go back to userspace allowing the HW to roll back the transaction and tell the user of the failure. This also adds new tm abort cause codes to report the reason of the persistent error to the user. Crappy test case here http://neuling.org/devel/junkcode/aligntm.c Signed-off-by: Michael Neuling <mikey@neuling.org> Cc: <stable@vger.kernel.org> # v3.9 Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-01powerpc/tm: Make room for hypervisor in abort cause codesMichael Neuling
PAPR carves out 0xff-0xe0 for hypervisor use of transactional memory software abort cause codes. Unfortunately we don't respect this currently. Below fixes this to move our cause codes to below this region. Signed-off-by: Michael Neuling <mikey@neuling.org> Cc: <stable@vger.kernel.org> # 3.9 only Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-24powerpc: Make radeon 32-bit MSI quirk work on powernvBenjamin Herrenschmidt
This moves the quirk itself to pci_64.c as to get built on all ppc64 platforms (the only ones with a pci_dn), factors the two implementations of get_pdn() into a single pci_get_dn() and use the quirk to do 32-bit MSIs on IODA based powernv platforms. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-24powerpc: Context switch more PMU related SPRsMichael Ellerman
In commit 9353374 "Context switch the new EBB SPRs" we added support for context switching some new EBB SPRs. However despite four of us signing off on that patch we missed some. To be fair these are not actually new SPRs, but they are now potentially user accessible so need to be context switched. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-14powerpc: Use the new schedule_user API on userspace preemptionLi Zhong
This patch corresponds to [PATCH] x86: Use the new schedule_user API on userspace preemption commit 0430499ce9d78691f3985962021b16bf8f8a8048 Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-14powerpc: Syscall hooks for context tracking subsystemLi Zhong
This is the syscall slow path hooks for context tracking subsystem, corresponding to [PATCH] x86: Syscall hooks for userspace RCU extended QS commit bf5a3c13b939813d28ce26c01425054c740d6731 TIF_MEMDIE is moved to the second 16-bits (with value 17), as it seems there is no asm code using it. TIF_NOHZ is added to _TIF_SYCALL_T_OR_A, so it is better for it to be in the same 16 bits with others in the group, so in the asm code, andi. with this group could work. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-14powerpc/powernv: Detect OPAL v3 API versionBenjamin Herrenschmidt
Future firmwares will support that new version Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-14powerpc: Bring all threads online prior to migration/hibernationRobert Jennings
This patch brings online all threads which are present but not online prior to migration/hibernation. After migration/hibernation those threads are taken back offline. During migration/hibernation all online CPUs must call H_JOIN, this is required by the hypervisor. Without this patch, threads that are offline (H_CEDE'd) will not be woken to make the H_JOIN call and the OS will be deadlocked (all threads either JOIN'd or CEDE'd). Cc: <stable@kernel.org> Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-14powerpc: Fix build errors STRICT_MM_TYPECHECKSAneesh Kumar K.V
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-14powerpc/mm: Use the correct mask value when looking at pgtable addressAneesh Kumar K.V
Our pgtable are 2*sizeof(pte_t)*PTRS_PER_PTE which is PTE_FRAG_SIZE. Instead of depending on frag size, mask with PMD_MASKED_BITS. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-10powerpc: hard_irq_disable(): Call trace_hardirqs_off after disablingScott Wood
lockdep.c has this: /* * So we're supposed to get called after you mask local IRQs, * but for some reason the hardware doesn't quite think you did * a proper job. */ if (DEBUG_LOCKS_WARN_ON(!irqs_disabled())) return; Since irqs_disabled() is based on soft_enabled(), that (not just the hard EE bit) needs to be 0 before we call trace_hardirqs_off. Signed-off-by: Scott Wood <scottwood@freescale.com>
2013-05-10powerpc/powernv: Improve kexec reliabilityBenjamin Herrenschmidt
We add a machine_shutdown hook that frees the OPAL interrupts (so they get masked at the source and don't fire while kexec'ing) and which triggers an IODA reset on all the PCIe host bridges which will have the effect of blocking all DMAs and subsequent PCIs interrupts. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-08powerpc: Add an in memory udbg consoleAlistair Popple
This patch adds a new udbg early debug console which utilises statically defined input and output buffers stored within the kernel BSS. It is primarily designed to assist with bring up of new hardware which may not have a working console but which has a method of reading/writing kernel memory. This version incorporates comments made by Ben H (thanks!). Changes from v1: - Add memory barriers. - Ensure updating of read/write positions is atomic. Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-07powerpc: Make hard_irq_disable() do the right thing vs. irq tracingBenjamin Herrenschmidt
If hard_irq_disable() is called while interrupts are already soft-disabled (which is the most common case) all is already well. However you can (and in some cases want) to call it while everything is enabled (to make sure you don't get a lazy even, for example before entry into KVM guests) and in this case we need to inform the irq tracer that the irqs are going off. We have to change the inline into a macro to avoid an include circular dependency hell hole. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-06powerpc/pci: Support per-aperture memory offsetBenjamin Herrenschmidt
The PCI core supports an offset per aperture nowadays but our arch code still has a single offset per host bridge representing the difference betwen CPU memory addresses and PCI MMIO addresses. This is a problem as new machines and hypervisor versions are coming out where the 64-bit windows will have a different offset (basically mapped 1:1) from the 32-bit windows. This fixes it by using separate offsets. In the long run, we probably want to get rid of that intermediary struct pci_controller and have those directly stored into the pci_host_bridge as they are parsed but this will be a more invasive change. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-06powerpc/cputable: Reserve bits in HWCAP2 for new featuresNishanth Aravamudan
Also, make HTM's presence dependent on the .config option. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-06powerpc/pseries: Perform proper max_bus_speed detectionKleber Sacilotto de Souza
On pseries machines the detection for max_bus_speed should be done through an OpenFirmware property. This patch adds a function to perform this detection and a hook to perform dynamic adding of the function only for pseries. This is done by overwriting the weak pcibios_root_bridge_prepare function which is called by pci_create_root_bus(). From: Lucas Kannebley Tavares <lucaskt@linux.vnet.ibm.com> Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-06powerpc/pseries: Force 32 bit MSIs for devices that require itBrian King
The following patch implements a new PAPR change which allows the OS to force the use of 32 bit MSIs, regardless of what the PCI capabilities indicate. This is required for some devices that advertise support for 64 bit MSIs but don't actually support them. Signed-off-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-06powerpc: Emulate non privileged DSCR read and writeAnton Blanchard
POWER8 allows read and write of the DSCR in userspace. We added kernel emulation so applications could always use the instructions regardless of the CPU type. Unfortunately there are two SPRs for the DSCR and we only added emulation for the privileged one. Add code to match the non privileged one. A simple test was created to verify the fix: http://ozlabs.org/~anton/junkcode/user_dscr_test.c Without the patch we get a SIGILL and it passes with the patch. Signed-off-by: Anton Blanchard <anton@samba.org> Cc: <stable@kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-05Merge tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Gleb Natapov: "Highlights of the updates are: general: - new emulated device API - legacy device assignment is now optional - irqfd interface is more generic and can be shared between arches x86: - VMCS shadow support and other nested VMX improvements - APIC virtualization and Posted Interrupt hardware support - Optimize mmio spte zapping ppc: - BookE: in-kernel MPIC emulation with irqfd support - Book3S: in-kernel XICS emulation (incomplete) - Book3S: HV: migration fixes - BookE: more debug support preparation - BookE: e6500 support ARM: - reworking of Hyp idmaps s390: - ioeventfd for virtio-ccw And many other bug fixes, cleanups and improvements" * tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits) kvm: Add compat_ioctl for device control API KVM: x86: Account for failing enable_irq_window for NMI window request KVM: PPC: Book3S: Add API for in-kernel XICS emulation kvm/ppc/mpic: fix missing unlock in set_base_addr() kvm/ppc: Hold srcu lock when calling kvm_io_bus_read/write kvm/ppc/mpic: remove users kvm/ppc/mpic: fix mmio region lists when multiple guests used kvm/ppc/mpic: remove default routes from documentation kvm: KVM_CAP_IOMMU only available with device assignment ARM: KVM: iterate over all CPUs for CPU compatibility check KVM: ARM: Fix spelling in error message ARM: KVM: define KVM_ARM_MAX_VCPUS unconditionally KVM: ARM: Fix API documentation for ONE_REG encoding ARM: KVM: promote vfp_host pointer to generic host cpu context ARM: KVM: add architecture specific hook for capabilities ARM: KVM: perform HYP initilization for hotplugged CPUs ARM: KVM: switch to a dual-step HYP init code ARM: KVM: rework HYP page table freeing ARM: KVM: enforce maximum size for identity mapped code ARM: KVM: move to a KVM provided HYP idmap ...
2013-05-02Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc update from Benjamin Herrenschmidt: "The main highlights this time around are: - A pile of addition POWER8 bits and nits, such as updated performance counter support (Michael Ellerman), new branch history buffer support (Anshuman Khandual), base support for the new PCI host bridge when not using the hypervisor (Gavin Shan) and other random related bits and fixes from various contributors. - Some rework of our page table format by Aneesh Kumar which fixes a thing or two and paves the way for THP support. THP itself will not make it this time around however. - More Freescale updates, including Altivec support on the new e6500 cores, new PCI controller support, and a pile of new boards support and updates. - The usual batch of trivial cleanups & fixes" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits) powerpc: Fix build error for book3e powerpc: Context switch the new EBB SPRs powerpc: Turn on the EBB H/FSCR bits powerpc: Replace CPU_FTR_BCTAR with CPU_FTR_ARCH_207S powerpc: Setup BHRB instructions facility in HFSCR for POWER8 powerpc: Fix interrupt range check on debug exception powerpc: Update tlbie/tlbiel as per ISA doc powerpc: Print page size info during boot powerpc: print both base and actual page size on hash failure powerpc: Fix hpte_decode to use the correct decoding for page sizes powerpc: Decode the pte-lp-encoding bits correctly. powerpc: Use encode avpn where we need only avpn values powerpc: Reduce PTE table memory wastage powerpc: Move the pte free routines from common header powerpc: Reduce the PTE_INDEX_SIZE powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format powerpc: New hugepage directory format powerpc: Don't truncate pgd_index wrongly powerpc: Don't hard code the size of pte page powerpc: Save DAR and DSISR in pt_regs on MCE ...
2013-05-02KVM: PPC: Book3S: Add API for in-kernel XICS emulationPaul Mackerras
This adds the API for userspace to instantiate an XICS device in a VM and connect VCPUs to it. The API consists of a new device type for the KVM_CREATE_DEVICE ioctl, a new capability KVM_CAP_IRQ_XICS, which functions similarly to KVM_CAP_IRQ_MPIC, and the KVM_IRQ_LINE ioctl, which is used to assert and deassert interrupt inputs of the XICS. The XICS device has one attribute group, KVM_DEV_XICS_GRP_SOURCES. Each attribute within this group corresponds to the state of one interrupt source. The attribute number is the same as the interrupt source number. This does not support irq routing or irqfd yet. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexander Graf <agraf@suse.de>
2013-05-02powerpc: Fix build error for book3eAneesh Kumar K.V
We moved the definition of shift_to_mmu_psize and mmu_psize_to_shift out of hugetlbpage.c in patch "powerpc: New hugepage directory format". These functions are not related to hugetlbpage and we want to use them outside hugetlbpage.c We missed a definition for book3e when we moved these functions. Add similar functions to mmu-book3e.h Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-02powerpc: Context switch the new EBB SPRsMichael Ellerman
This context switches the new Event Based Branching (EBB) SPRs. The three new SPRs are: - Event Based Branch Handler Register (EBBHR) - Event Based Branch Return Register (EBBRR) - Branch Event Status and Control Register (BESCR) Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Matt Evans <matt@ozlabs.org> Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-02powerpc: Turn on the EBB H/FSCR bitsMichael Neuling
This turns Event Based Branching (EBB) on in the Hypervisor Facility Status and Control Register (HFSCR) and Facility Status and Control Register (FSCR). Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-02powerpc: Replace CPU_FTR_BCTAR with CPU_FTR_ARCH_207SMichael Ellerman
We are getting low on cpu feature bits. So rather than add a separate bit for every new Power8 feature, add a bit for arch 2.07 server catagory and use that instead. Hijack the value we had for BCTAR, but swap the value with CFAR so that all the ARCH defines are together. Note we don't touch CPU_FTR_TM, because it is conditionally enabled if the kernel is built with TM support. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-02powerpc: Setup BHRB instructions facility in HFSCR for POWER8Anshuman Khandual
Make BHRB instructions available in problem and privileged states. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: "Highlights (1721 non-merge commits, this has to be a record of some sort): 1) Add 'random' mode to team driver, from Jiri Pirko and Eric Dumazet. 2) Make it so that any driver that supports configuration of multiple MAC addresses can provide the forwarding database add and del calls by providing a default implementation and hooking that up if the driver doesn't have an explicit set of handlers. From Vlad Yasevich. 3) Support GSO segmentation over tunnels and other encapsulating devices such as VXLAN, from Pravin B Shelar. 4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton. 5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita Dukkipati. 6) In the PHY layer, allow supporting wake-on-lan in situations where the PHY registers have to be written for it to be configured. Use it to support wake-on-lan in mv643xx_eth. From Michael Stapelberg. 7) Significantly improve firewire IPV6 support, from YOSHIFUJI Hideaki. 8) Allow multiple packets to be sent in a single transmission using network coding in batman-adv, from Martin Hundebøll. 9) Add support for T5 cxgb4 chips, from Santosh Rastapur. 10) Generalize the VXLAN forwarding tables so that there is more flexibility in configurating various aspects of the endpoints. From David Stevens. 11) Support RSS and TSO in hardware over GRE tunnels in bxn2x driver, from Dmitry Kravkov. 12) Zero copy support in nfnelink_queue, from Eric Dumazet and Pablo Neira Ayuso. 13) Start adding networking selftests. 14) In situations of overload on the same AF_PACKET fanout socket, or per-cpu packet receive queue, minimize drop by distributing the load to other cpus/fanouts. From Willem de Bruijn and Eric Dumazet. 15) Add support for new payload offset BPF instruction, from Daniel Borkmann. 16) Convert several drivers over to mdoule_platform_driver(), from Sachin Kamat. 17) Provide a minimal BPF JIT image disassembler userspace tool, from Daniel Borkmann. 18) Rewrite F-RTO implementation in TCP to match the final specification of it in RFC4138 and RFC5682. From Yuchung Cheng. 19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear you like netlink, so I implemented netlink dumping of netlink sockets.") From Andrey Vagin. 20) Remove ugly passing of rtnetlink attributes into rtnl_doit functions, from Thomas Graf. 21) Allow userspace to be able to see if a configuration change occurs in the middle of an address or device list dump, from Nicolas Dichtel. 22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes Frederic Sowa. 23) Increase accuracy of packet length used by packet scheduler, from Jason Wang. 24) Beginning set of changes to make ipv4/ipv6 fragment handling more scalable and less susceptible to overload and locking contention, from Jesper Dangaard Brouer. 25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*() instead. From Hong Zhiguo. 26) Optimize route usage in IPVS by avoiding reference counting where possible, from Julian Anastasov. 27) Convert IPVS schedulers to RCU, also from Julian Anastasov. 28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger Eitzenberger. 29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG, nfnetlink_log, and nfnetlink_queue. From Gao feng. 30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa. 31) Support several new r8169 chips, from Hayes Wang. 32) Support tokenized interface identifiers in ipv6, from Daniel Borkmann. 33) Use usbnet_link_change() helper in USB net driver, from Ming Lei. 34) Add 802.1ad vlan offload support, from Patrick McHardy. 35) Support mmap() based netlink communication, also from Patrick McHardy. 36) Support HW timestamping in mlx4 driver, from Amir Vadai. 37) Rationalize AF_PACKET packet timestamping when transmitting, from Willem de Bruijn and Daniel Borkmann. 38) Bring parity to what's provided by /proc/net/packet socket dumping and the info provided by netlink socket dumping of AF_PACKET sockets. From Nicolas Dichtel. 39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin Poirier" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits) filter: fix va_list build error af_unix: fix a fatal race with bit fields bnx2x: Prevent memory leak when cnic is absent bnx2x: correct reading of speed capabilities net: sctp: attribute printl with __printf for gcc fmt checks netlink: kconfig: move mmap i/o into netlink kconfig netpoll: convert mutex into a semaphore netlink: Fix skb ref counting. net_sched: act_ipt forward compat with xtables mlx4_en: fix a build error on 32bit arches Revert "bnx2x: allow nvram test to run when device is down" bridge: avoid OOPS if root port not found drivers: net: cpsw: fix kernel warn on cpsw irq enable sh_eth: use random MAC address if no valid one supplied 3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA) tg3: fix to append hardware time stamping flags unix/stream: fix peeking with an offset larger than data in queue unix/dgram: fix peeking with an offset larger than data in queue unix/dgram: peek beyond 0-sized skbs openvswitch: Remove unneeded ovs_netdev_get_ifindex() ...
2013-05-01Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal Pull compat cleanup from Al Viro: "Mostly about syscall wrappers this time; there will be another pile with patches in the same general area from various people, but I'd rather push those after both that and vfs.git pile are in." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: syscalls.h: slightly reduce the jungles of macros get rid of union semop in sys_semctl(2) arguments make do_mremap() static sparc: no need to sign-extend in sync_file_range() wrapper ppc compat wrappers for add_key(2) and request_key(2) are pointless x86: trim sys_ia32.h x86: sys32_kill and sys32_mprotect are pointless get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC merge compat sys_ipc instances consolidate compat lookup_dcookie() convert vmsplice to COMPAT_SYSCALL_DEFINE switch getrusage() to COMPAT_SYSCALL_DEFINE switch epoll_pwait to COMPAT_SYSCALL_DEFINE convert sendfile{,64} to COMPAT_SYSCALL_DEFINE switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect make HAVE_SYSCALL_WRAPPERS unconditional consolidate cond_syscall and SYSCALL_ALIAS declarations teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long get rid of duplicate logics in __SC_....[1-6] definitions