linux-linaro-stable-pfalcon-test.git - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author
2016-01-22	Merge branch 'linux-linaro-lsk-v4.1' into linux-linaro-lsk-v4.1-androidlinux-linaro-lsk-v4.1-android	Alex Shi

2016-01-22	Merge remote-tracking branch 'v4.1/topic/PSCI' into linux-linaro-lsk-v4.1linux-linaro-lsk-v4.1	Alex Shi

2016-01-20	drivers: psci: make PSCI 1.0 functions initialization version dependentv4.1/topic/PSCI	Lorenzo Pieralisi
	The PSCI specifications [1] and the SMC calling convention mandate that unimplemented functions ids must return NOT_SUPPORTED (0xffffffff) if a function id is called but it is not implemented. Consequently, PSCI 1.0 function ids that require the 1.0 PSCI_FEATURES call to be initialized: CPU_SUSPEND (psci_init_cpu_suspend()) SYSTEM_SUSPEND (psci_init_system_suspend()) call the PSCI_FEATURES function id independently of the detected PSCI firmware version, since, if the PSCI_FEATURES function id is not implemented, it must return NOT_SUPPORTED according to the PSCI specifications, causing the initialization functions to fail as expected. Some existing PSCI implementations (ie Qemu PSCI emulation), do not comply with the SMC calling convention and fail if function ids that are not implemented are called from the OS, causing boot failures. To solve this issue, this patch adds code that checks the PSCI firmware version before calling PSCI 1.0 initialization functions so that the OS makes sure that it is calling 1.0 functions only if the firmware version detected is 1.0 or greater, therefore avoiding PSCI calls that are bound to fail and might cause system boot failures owing to non-compliant PSCI firmware implementations. [1] http://infocenter.arm.com/help/topic/com.arm.doc.den0022c/DEN0022C_Power_State_Coordination_Interface.pdf Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Mark Rutland <mark.rutland@arm.com> Tested-by: Kevin Hilman <khilman@kernel.org> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Olof Johansson <olof@lixom.net> (cherry picked from commit 79b04beb1e0ac7754e667f0aa47b57a197dc343a) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	ARM: Remove __ref on hotplug cpu die path	Stephen Boyd
	Now that __cpuinit has been removed, the __ref markings on these functions are useless. Remove them. This also reduces the size of the multi_v7_defconfig image: $ size before after text data bss dec hex filename 12683578 1470996 348904 14503478 dd4e36 before 12683274 1470996 348904 14503174 dd4d06 after presumably because now we don't have to jump to code in the .ref.text section and/or the noinline marking is removed. Cc: Shiraz Hashim <shiraz.linux.kernel@gmail.com> Cc: Stephen Warren <swarren@wwwdotorg.org> Cc: Alexandre Courbot <gnurou@gmail.com> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: <linux-omap@vger.kernel.org> Cc: <linux-arm-msm@vger.kernel.org> Cc: <spear-devel@list.st.com> Cc: <linux-tegra@vger.kernel.org> Acked-by: Tony Lindgren <tony@atomide.com> Acked-by: Barry Song <baohua@kernel.org> Acked-by: Andy Gross <agross@codeaurora.org> Acked-by: Viresh Kumar <vireshk@kernel.org> Acked-by: Thierry Reding <thierry.reding@gmail.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Olof Johansson <olof@lixom.net> (cherry picked from commit b96fc2f3c145815359ac1f9f12cc5c852b9ba3f5) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	drivers: firmware: psci: add system suspend support	Sudeep Holla
	PSCI v1.0 introduces a new API called PSCI_SYSTEM_SUSPEND. This API provides the mechanism by which the calling OS can request entry into the deepest possible system sleep state. It meets all the necessary preconditions for entering suspend to RAM state in Linux. This patch adds support for PSCI_SYSTEM_SUSPEND in psci firmware and registers a psci system suspend operation to implement the suspend-to-RAM(s2r) in a generic way on all the platforms implementing PSCI. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> (cherry picked from commit faf7ec4a92c0231d1079177095077c162eb9b466) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	drivers: firmware: psci: define more generic PSCI_FN_NATIVE macro	Sudeep Holla
	This patch replaces the definition and usage of PSCI_0_2_FN_NATIVE with the new and more generic macro PSCI_FN_NATIVE that can be used with any version. This will be useful for the new features introduced in PSCIv1.0 and for any future revisions. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> (cherry picked from commit 029180b1c99046831c33ed43fdbdb620506cb15b) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	drivers: firmware: psci: add PSCI v1.0 DT bindings	Lorenzo Pieralisi
	PSCI 1.0 is designed to be fully compliant to the PSCI 0.2 specification, with minor differences that are described in the PSCI specification. In particular, PSCI v1.0 augments the specification with a new power_state format (extended stateid - probeable through the PSCI_FEATURES call), changes some function return codes and functions usage requirements wrt PSCI 0.2. These changes mean that 1.0 vs 0.2 compliancy should be enforced through a DT compatible string that allows firmware to specify 1.0 only compliancy so that older kernels are prevented from using PSCI 1.0 FW implementations in a non-compatible way (eg by calling a 1.0 FW implementation and expecting 0.2 behaviour). This patch adds PSCI 1.0 DT bindings and related compatible string. Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Jisheng Zhang <jszhang@marvell.com> Cc: Mark Rutland <mark.rutland@arm.com> (cherry picked from commit 0fc197c7cb3b1139fccb3b92e8db19a93f81f6fb) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	drivers: firmware: psci: add extended stateid power_state support	Lorenzo Pieralisi
	PSCI v1.0 augmented the power_state parameter format specification (extended stateid) and introduced a way to probe it through the PSCI_FEATURES interface. This patch implements code that detects the power_state format at run-time through the PSCI_FEATURES interface, so that the power_state argument can be properly detected and validated in the kernel according to the information provided through firmware. Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Tested-by: Jisheng Zhang <jszhang@marvell.com> Cc: Mark Rutland <mark.rutland@arm.com> (cherry picked from commit a5c00bb28da0bb34f901d090839fc448246aa996) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	drivers: firmware: psci: add PSCI_FEATURES call	Lorenzo Pieralisi
	PSCI v1.0 introduces a PSCI_FEATURES call that allows to probe for features related to a specific function identifier. This patch adds PSCI_FEATURES support to the PSCI firmware layer. Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Tested-by: Jisheng Zhang <jszhang@marvell.com> Cc: Mark Rutland <mark.rutland@arm.com> (cherry picked from commit 5f004e0c9fb152a080b47d06dc48bdd29765a734) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	drivers: firmware: psci: move power_state handling to generic code	Lorenzo Pieralisi
	Functions implemented on arm64 to check if a power_state parameter is valid and if the power_state implies context loss are not arm64 specific and should be moved to generic code so that they can be reused on arm systems too. This patch moves the functions handling the power_state parameter to generic PSCI firmware layer code. Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Jisheng Zhang <jszhang@marvell.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> (cherry picked from commit 068654c200cc32966ce7906ca0bd096b9b97e988) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	drivers: firmware: psci: add INVALID_ADDRESS return value	Lorenzo Pieralisi
	PSCI 1.0 introduces the INVALID_ADDRESS return value for functions that take an address as input parameter (eg CPU_SUSPEND). This patch adds INVALID_ADDRESS return value to kernel code and updates the PSCI to linux error conversion to take it into account. The kernel error value associated to INVALID_ADDRESS is set to the error returned when the PSCI error code is INVALID_PARAMETERS to comply with current call sites expected return value, given that the kernel at present has no use for the additional error information reported. Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Jisheng Zhang <jszhang@marvell.com> (cherry picked from commit 2217d7c68e5caf50ec86b8c75c76bf06eb4b2c45) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	arm/arm64: KVM: Fix PSCI affinity info return value for non valid cores	Alexander Spyridakis
	If a guest requests the affinity info for a non-existing vCPU we need to properly return an error, instead of erroneously reporting an off state. Signed-off-by: Alexander Spyridakis <a.spyridakis@virtualopensystems.com> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> (cherry picked from commit 0c0672922dcc70ffba11d96385e98e42fb3ae08d) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	ARM: migrate to common PSCI client code	Mark Rutland
	Now that the common PSCI client code has been factored out to drivers/firmware, and made safe for 32-bit use, move the 32-bit ARM code over to it. This results in a moderate reduction of duplicated lines, and will prevent further duplication as the PSCI client code is updated for PSCI 1.0 and beyond. The two legacy platform users of the PSCI invocation code are updated to account for interface changes. In both cases the power state parameter (which is constant) is now generated using macros, so that the pack/unpack logic can be killed in preparation for PSCI 1.0 power state changes. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Rob Herring <robh@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ashwin Chaugule <ashwin.chaugule@linaro.org> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> (cherry picked from commit be120397e7709d9d5ed88317a385ce864a2603bc) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	ARM: psci: boot_secondary: replace __pa with virt_to_idmap	Grygorii Strashko
	On some PAE systems (e.g. TI Keystone), memory is above the 32-bit addressable limit, and the interconnect provides an aliased view of parts of physical memory in the 32-bit addressable space. This alias is strictly for boot time usage, and is not otherwise usable because of coherency limitations. In this case, virt_to_phys(secondary_startup) would return the physical address of the secondary CPU boot entry point, but on such systems, this would be above the 4GB limit. A separate function, virt_to_idmap(), has been provided to return a usable physical address for functions in the identity mapping, and this must be used in preference to virt_to_phys() or __pa() to find the physical entry point for functions in the identity mapping range. For other systems, virt_to_idmap() and virt_to_phys() return identical physical addresses. Acked-by: Santosh Shilimkar <ssantosh@kernel.org> Acked-by: Nicolas Pitre <nico@linaro.org> Tested-by Vitaly Andrianov <vitalya@ti.com> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> [Mark: apply rmk's suggested rewording] Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Will Deacon <will.deacon@arm.com> (cherry picked from commit 37cf524f9360f9165d67459b7bf795c01824df98) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	drivers: psci: support native SMC{32,64} calls	Mark Rutland
	A 32-bit OS cannot make calls with SMC64 IDs, while a 64-bit OS must invoke some PSCI functions with SMC64 IDs. This patch introduces and makes use of a new macro to choose the appropriate IDs based on the register width of the OS, which will allow 32-bit callers to use the PSCI client code. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Hanjun Guo <hanjun.guo@linaro.org> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> (cherry picked from commit 5211df00a4b595b96a7721a1253074b327945d33) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	arm64: psci: factor invocation code to drivers	Mark Rutland
	To enable sharing with arm, move the core PSCI framework code to drivers/firmware. This results in a minor gain in lines of code, but this will quickly be amortised by the removal of code currently duplicated in arch/arm. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org> Tested-by: Hanjun Guo <hanjun.guo@linaro.org> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> (cherry picked from commit bff60792f994a87324ab57e89e945b4572b1ef77) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	arm64: force CONFIG_SMP=y and remove redundant #ifdefs	Will Deacon
	Nobody seems to be producing !SMP systems anymore, so this is just becoming a source of kernel bugs, particularly if people want to use coherent DMA with non-shared pages. This patch forces CONFIG_SMP=y for arm64, removing a modest amount of code in the process. Signed-off-by: Will Deacon <will.deacon@arm.com> (cherry picked from commit 4b3dc9679cf779339d9049800803dfc3c83433d1) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	arm64: drop sleep_idmap_phys and clean up cpu_resume()	Ard Biesheuvel
	Two cleanups of the asm function cpu_resume(): - The global variable sleep_idmap_phys always points to idmap_pg_dir, so we can just use that value directly in the CPU resume path. - Unclutter the load of sleep_save_sp::save_ptr_stash_phys. Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Tested-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> (cherry picked from commit 9acdc2af0c0b836183b7f31f630bbed341a7cf4d) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	arm64: reduce ID map to a single page	Ard Biesheuvel
	Commit ea8c2e112445 ("arm64: Extend the idmap to the whole kernel image") changed the early page table code so that the entire kernel Image is covered by the identity map. This allows functions that need to enable or disable the MMU to reside anywhere in the kernel Image. However, this change has the unfortunate side effect that the Image cannot cross a physical 512 MB alignment boundary anymore, since the early page table code cannot deal with the Image crossing a /virtual/ 512 MB alignment boundary. So instead, reduce the ID map to a single page, that is populated by the contents of the .idmap.text section. Only three functions reside there at the moment: __enable_mmu(), cpu_resume_mmu() and cpu_reset(). If new code is introduced that needs to manipulate the MMU state, it should be added to this section as well. Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> (cherry picked from commit 5dfe9d7d23c26d029415379630523f141a748c5b) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	arm64: psci: fix !CONFIG_HOTPLUG_CPU build warning	Will Deacon
	When building without CONFIG_HOTPLUG_CPU, GCC complains (rightly) that psci_tos_resident_on is unused: arch/arm64/kernel/psci.c:61:13: warning: ‘psci_tos_resident_on’ defined but not used [-Wunused-function] static bool psci_tos_resident_on(int cpu) As it's only ever used when CONFIG_HOTPLUG_CPU is selected, let's move it into the existing ifdef. Signed-off-by: Will Deacon <will.deacon@arm.com> [Mark: write commit message] Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> (cherry picked from commit 73bf8412e4f24b114c853012663fff4d3cde06a2) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	arm64: psci: remove ACPI coupling	Mark Rutland
	The 32-bit ARM port doesn't have ACPI headers, and conditionally including them is going to look horrendous. In preparation for sharing the PSCI invocation code with 32-bit, move the acpi_psci_* function declarations and definitions such that the PSCI client code need not include ACPI headers. While it would seem like we could simply hide the ACPI includes in psci.h, the ACPI headers have hilarious circular dependencies which make this infeasible without reorganising most of ACPICA. So rather than doing that, move the acpi_psci_* prototypes into psci.h. The psci_acpi_init function is made dependent on CONFIG_ACPI (with a stub implementation in asm/psci.h) such that it need not be built for 32-bit ARM or kernels without ACPI support. The currently missing __init annotations are added to the prototypes in the header. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Hanjun Guo <hanjun.guo@linaro.org> Reviewed-by: Al Stone <al.stone@linaro.org> Reviewed-by: Ashwin Chaugule <ashwin.chaugule@linaro.org> Tested-by: Hanjun Guo <hanjun.guo@linaro.org> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: Will Deacon <will.deacon@arm.com> (cherry picked from commit c5a1330573c1748179898f4799f130e416ce4738) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	arm64: psci: kill psci_power_state	Mark Rutland
	A PSCI 1.0 implementation may choose to use the new extended StateID format, the presence of which may be queried via the PSCI_FEATURES call. The layout of this new StateID format is incompatible with the existing format, and so to handle both we must abstract attempts to parse the fields. In preparation for PSCI 1.0 support, this patch introduces psci_power_state_loses_context and psci_power_state_is_valid functions to query information from a PSCI power state, which is no longer decomposed (and hence the pack/unpack functions are removed). As it is no longer decomposed, it is now passed round as an opaque u32 token. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Tested-by: Hanjun Guo <hanjun.guo@linaro.org> Cc: Will Deacon <will.deacon@arm.com> (cherry picked from commit c8cc42737788537ebef810ee22400f757e1819ca) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	arm64: psci: account for Trusted OS instances	Mark Rutland
	Software resident in the secure world (a "Trusted OS") may cause CPU_OFF calls for the CPU it is resident on to be denied. Such a denial would be fatal for the kernel, and so we must detect when this can happen before the point of no return. This patch implements Trusted OS detection for PSCI 0.2+ systems, using MIGRATE_INFO_TYPE and MIGRATE_INFO_UP_CPU. When a trusted OS is detected as resident on a particular CPU, attempts to hot unplug that CPU will be denied early, before they can prove fatal. Trusted OS migration is not implemented by this patch. Implementation of migratable UP trusted OSs seems unlikely, and the right policy for migration is unclear (and will likely differ across implementations). As such, it is likely that migration will require cooperation with Trusted OS drivers. PSCI implementations prior to 0.1 do not provide the facility to detect the presence of a Trusted OS, nor the CPU any such OS is resident on, so without additional information it is not possible to handle Trusted OSs with PSCI 0.1. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Hanjun Guo <hanjun.guo@linaro.org> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: Will Deacon <will.deacon@arm.com> (cherry picked from commit ff3010e6fcdb5f7e6999c6026ab7fcf835d54c5a) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	arm64: psci: support unsigned return values	Mark Rutland
	PSCI_VERSION and MIGRATE_INFO_TYPE_UP_CPU return unsigned values, with the latter returning a 64-bit value. However, the PSCI invocation functions have prototypes returning int. This patch upgrades the invocation functions to return unsigned long, with a new typedef to keep things legible. As PSCI_VERSION cannot return a negative value, the erroneous check against PSCI_RET_NOT_SUPPORTED is also removed. The unrelated psci_initcall_t typedef is moved closer to its first user, to avoid confusion with the invocation functions. In preparation for sharing the code with ARM, unsigned long is used in preference of u64. In the SMC32 calling convention, the relevant fields will be 32 bits wide. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org> Tested-by: Hanjun Guo <hanjun.guo@linaro.org> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: Will Deacon <will.deacon@arm.com> (cherry picked from commit a06eed3e90c272675f2ef50f5bc5b3ec91652d77) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	arm64: psci: remove unnecessary id indirection	Mark Rutland
	PSCI 0.1 did not define canonical IDs for CPU_ON, CPU_OFF, CPU_SUSPEND, or MIGRATE, and so these need to be provided when using firmware compliant to PSCI 0.1. However, functions introduced in 0.2 or later have canonical IDs, and these cannot be provided via DT. There's no need to indirect the IDs via a table; they can be used directly at callsites (and already are for SYSTEM_OFF and SYSTEM_RESET). This patch removes the unnecessary function ID indirection for AFFINITY_INFO and MIGRATE_INFO_TYPE. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org> Tested-by: Hanjun Guo <hanjun.guo@linaro.org> Cc: Will Deacon <will.deacon@arm.com> (cherry picked from commit 2a7cd0ebfc0a5ac2e692e63871e0ff6a50d5de46) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	arm64: smp: consistently use error codes	Mark Rutland
	cpu_kill currently returns one for success and zero for failure, which is unlike all the other cpu_operations, which return zero for success and an error code upon failure. This difference is unnecessarily confusing. Make cpu_kill consistent with the other cpu_operations. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org> Tested-by: Hanjun Guo <hanjun.guo@linaro.org> Cc: Will Deacon <will.deacon@arm.com> (cherry picked from commit 6b99c68cb5dd274d79451e5135f9450f7c01ca52) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	arm64: smp_plat: add get_logical_index	Mark Rutland
	The PSCI MIGRATE_INFO_UP_CPU call returns a physical ID, which we will need to map back to a Linux logical ID. Implement a reusable get_logical_index to map from a physical ID to a logical ID. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org> Tested-by: Hanjun Guo <hanjun.guo@linaro.org> Cc: Will Deacon <will.deacon@arm.com> (cherry picked from commit 6ee3c78cecc795e87de9552baca76ea88292556d) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	arm/arm64: kvm: add missing PSCI include	Mark Rutland
	We make use of the PSCI function IDs, but don't explicitly include the header which defines them. Relying on transitive header includes is fragile and will be broken as headers are refactored. This patch includes the relevant header file directly so as to avoid future breakage. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org> Tested-by: Hanjun Guo <hanjun.guo@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> (cherry picked from commit 538b9b25fa5dd372eeb3eabc85cb24d5df9e317b) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	ARM64: kernel: unify ACPI and DT cpus initialization	Lorenzo Pieralisi
	The code that initializes cpus on arm64 is currently split in two different code paths that carry out DT and ACPI cpus initialization. Most of the code executing SMP initialization is common and should be merged to reduce discrepancies between ACPI and DT initialization and to have code initializing cpus in a single common place in the kernel. This patch refactors arm64 SMP cpus initialization code to merge ACPI and DT boot paths in a common file and to create sanity checks that can be reused by both boot methods. Current code assumes PSCI is the only available boot method when arm64 boots with ACPI; this can be easily extended if/when the ACPI parking protocol is merged into the kernel. Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Hanjun Guo <hanjun.guo@linaro.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Hanjun Guo <hanjun.guo@linaro.org> Tested-by: Mark Rutland <mark.rutland@arm.com> [DT] Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> (cherry picked from commit 0f0783365cbb7ec13a8f02198f6e1a146d94a5a9) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-20	ARM64: kernel: make cpu_ops hooks DT agnostic	Lorenzo Pieralisi
	ARM64 CPU operations such as cpu_init and cpu_init_idle take a struct device_node pointer as a parameter, which corresponds to the device tree node of the logical cpu on which the operation has to be applied. With the advent of ACPI on arm64, where MADT static table entries are used to initialize cpus, the device tree node parameter in cpu_ops hooks become useless when booting with ACPI, since in that case cpu device tree nodes are not present and can not be used for cpu initialization. The current cpu_init hook requires a struct device_node pointer parameter because it is called while parsing the device tree to initialize CPUs, when the cpu_logical_map (that is used to match a cpu node reg property to a device tree node) for a given logical cpu id is not set up yet. This means that the cpu_init hook cannot rely on the of_get_cpu_node function to retrieve the device tree node corresponding to the logical cpu id passed in as parameter, so the cpu device tree node must be passed in as a parameter to fix this catch-22 dependency cycle. This patch reshuffles the cpu_logical_map initialization code so that the cpu_init cpu_ops hook can safely use the of_get_cpu_node function to retrieve the cpu device tree node, removing the need for the device tree node pointer parameter. In the process, the patch removes device tree node parameters from all cpu_ops hooks, in preparation for SMP DT/ACPI cpus initialization consolidation. Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Hanjun Guo <hanjun.guo@linaro.org> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Hanjun Guo <hanjun.guo@linaro.org> Tested-by: Mark Rutland <mark.rutland@arm.com> [DT] Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> (cherry picked from commit 819a88263d5dbe398edd59cc1cf725ed1fdcfd79) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-01-14	Merge branch 'linux-linaro-lsk-v4.1' into linux-linaro-lsk-v4.1-android	Alex Shi

2016-01-14	Merge remote-tracking branch 'lts/linux-4.1.y' into linux-linaro-lsk-v4.1	Alex Shi
	Merge tag 'v4.1.15' into of git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable into linux-linaro-lsk-v4.1
2015-12-14	Linux 4.1.15v4.1.15	Greg Kroah-Hartman

2015-12-14	ALSA: hda/hdmi - apply Skylake fix-ups to Broxton display codec	Lu, Han
	commit e2656412f2a7343ecfd13eb74bac0a6e6e9c5aad upstream. Broxton and Skylake have the same behavior on display audio. So this patch applys Skylake fix-ups to Broxton. Signed-off-by: Lu, Han <han.lu@intel.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14	ceph: fix message length computation	Arnd Bergmann
	commit 777d738a5e58ba3b6f3932ab1543ce93703f4873 upstream. create_request_message() computes the maximum length of a message, but uses the wrong type for the time stamp: sizeof(struct timespec) may be 8 or 16 depending on the architecture, while sizeof(struct ceph_timespec) is always 8, and that is what gets put into the message. Found while auditing the uses of timespec for y2038 problems. Fixes: b8e69066d8af ("ceph: include time stamp in every MDS request") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14	ocfs2: fix umask ignored issue	Junxiao Bi
	commit 8f1eb48758aacf6c1ffce18179295adbf3bd7640 upstream. New created file's mode is not masked with umask, and this makes umask not work for ocfs2 volume. Fixes: 702e5bc ("ocfs2: use generic posix ACL infrastructure") Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Gang He <ghe@suse.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14	nfs: if we have no valid attrs, then don't declare the attribute cache valid	Jeff Layton
	commit c812012f9ca7cf89c9e1a1cd512e6c3b5be04b85 upstream. If we pass in an empty nfs_fattr struct to nfs_update_inode, it will (correctly) not update any of the attributes, but it then clears the NFS_INO_INVALID_ATTR flag, which indicates that the attributes are up to date. Don't clear the flag if the fattr struct has no valid attrs to apply. Reviewed-by: Steve French <steve.french@primarydata.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14	nfs4: start callback_ident at idr 1	Benjamin Coddington
	commit c68a027c05709330fe5b2f50c50d5fa02124b5d8 upstream. If clp->cl_cb_ident is zero, then nfs_cb_idr_remove_locked() skips removing it when the nfs_client is freed. A decoding or server bug can then find and try to put that first nfs_client which would lead to a crash. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Fixes: d6870312659d ("nfs4client: convert to idr_alloc()") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14	debugfs: fix refcount imbalance in start_creating	Daniel Borkmann
	commit 0ee9608c89e81a1ccee52ecb58a7ff040e2522d9 upstream. In debugfs' start_creating(), we pin the file system to safely access its root. When we failed to create a file, we unpin the file system via failed_creating() to release the mount count and eventually the reference of the vfsmount. However, when we run into an error during lookup_one_len() when still in start_creating(), we only release the parent's mutex but not so the reference on the mount. Looks like it was done in the past, but after splitting portions of __create_file() into start_creating() and end_creating() via 190afd81e4a5 ("debugfs: split the beginning and the end of __create_file() off"), this seemed missed. Noticed during code review. Fixes: 190afd81e4a5 ("debugfs: split the beginning and the end of __create_file() off") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14	nfsd: eliminate sending duplicate and repeated delegations	Andrew Elble
	commit 34ed9872e745fa56f10e9bef2cf3d2336c6c8816 upstream. We've observed the nfsd server in a state where there are multiple delegations on the same nfs4_file for the same client. The nfs client does attempt to DELEGRETURN these when they are presented to it - but apparently under some (unknown) circumstances the client does not manage to return all of them. This leads to the eventual attempt to CB_RECALL more than one delegation with the same nfs filehandle to the same client. The first recall will succeed, but the next recall will fail with NFS4ERR_BADHANDLE. This leads to the server having delegations on cl_revoked that the client has no way to FREE or DELEGRETURN, with resulting inability to recover. The state manager on the server will continually assert SEQ4_STATUS_RECALLABLE_STATE_REVOKED, and the state manager on the client will be looping unable to satisfy the server. List discussion also reports a race between OPEN and DELEGRETURN that will be avoided by only sending the delegation once to the client. This is also logically in accordance with RFC5561 9.1.1 and 10.2. So, let's: 1.) Not hand out duplicate delegations. 2.) Only send them to the client once. RFC 5561: 9.1.1: "Delegations and layouts, on the other hand, are not associated with a specific owner but are associated with the client as a whole (identified by a client ID)." 10.2: "...the stateid for a delegation is associated with a client ID and may be used on behalf of all the open-owners for the given client. A delegation is made to the client as a whole and not to any specific process or thread of control within it." Reported-by: Eric Meddaugh <etmsys@rit.edu> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Olga Kornievskaia <aglo@umich.edu> Signed-off-by: Andrew Elble <aweits@rit.edu> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14	nfsd: serialize state seqid morphing operations	Jeff Layton
	commit 35a92fe8770ce54c5eb275cd76128645bea2d200 upstream. Andrew was seeing a race occur when an OPEN and OPEN_DOWNGRADE were running in parallel. The server would receive the OPEN_DOWNGRADE first and check its seqid, but then an OPEN would race in and bump it. The OPEN_DOWNGRADE would then complete and bump the seqid again. The result was that the OPEN_DOWNGRADE would be applied after the OPEN, even though it should have been rejected since the seqid changed. The only recourse we have here I think is to serialize operations that bump the seqid in a stateid, particularly when we're given a seqid in the call. To address this, we add a new rw_semaphore to the nfs4_ol_stateid struct. We do a down_write prior to checking the seqid after looking up the stateid to ensure that nothing else is going to bump it while we're operating on it. In the case of OPEN, we do a down_read, as the call doesn't contain a seqid. Those can run in parallel -- we just need to serialize them when there is a concurrent OPEN_DOWNGRADE or CLOSE. LOCK and LOCKU however always take the write lock as there is no opportunity for parallelizing those. Reported-and-Tested-by: Andrew W Elble <aweits@rit.edu> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14	firewire: ohci: fix JMicron JMB38x IT context discovery	Stefan Richter
	commit 100ceb66d5c40cc0c7018e06a9474302470be73c upstream. Reported by Clifford and Craig for JMicron OHCI-1394 + SDHCI combo controllers: Often or even most of the time, the controller is initialized with the message "added OHCI v1.10 device as card 0, 4 IR + 0 IT contexts, quirks 0x10". With 0 isochronous transmit DMA contexts (IT contexts), applications like audio output are impossible. However, OHCI-1394 demands that at least 4 IT contexts are implemented by the link layer controller, and indeed JMicron JMB38x do implement four of them. Only their IsoXmitIntMask register is unreliable at early access. With my own JMB381 single function controller I found: - I can reproduce the problem with a lower probability than Craig's. - If I put a loop around the section which clears and reads IsoXmitIntMask, then either the first or the second attempt will return the correct initial mask of 0x0000000f. I never encountered a case of needing more than a second attempt. - Consequently, if I put a dummy reg_read(...IsoXmitIntMaskSet) before the first write, the subsequent read will return the correct result. - If I merely ignore a wrong read result and force the known real result, later isochronous transmit DMA usage works just fine. So let's just fix this chip bug up by the latter method. Tested with JMB381 on kernel 3.13 and 4.3. Since OHCI-1394 generally requires 4 IT contexts at a minium, this workaround is simply applied whenever the initial read of IsoXmitIntMask returns 0, regardless whether it's a JMicron chip or not. I never heard of this issue together with any other chip though. I am not 100% sure that this fix works on the OHCI-1394 part of JMB380 and JMB388 combo controllers exactly the same as on the JMB381 single- function controller, but so far I haven't had a chance to let an owner of a combo chip run a patched kernel. Strangely enough, IsoRecvIntMask is always reported correctly, even though it is probed right before IsoXmitIntMask. Reported-by: Clifford Dunn Reported-by: Craig Moore <craig.moore@qenos.com> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14	ext4, jbd2: ensure entering into panic after recording an error in superblock	Daeho Jeong
	commit 4327ba52afd03fc4b5afa0ee1d774c9c5b0e85c5 upstream. If a EXT4 filesystem utilizes JBD2 journaling and an error occurs, the journaling will be aborted first and the error number will be recorded into JBD2 superblock and, finally, the system will enter into the panic state in "errors=panic" option. But, in the rare case, this sequence is little twisted like the below figure and it will happen that the system enters into panic state, which means the system reset in mobile environment, before completion of recording an error in the journal superblock. In this case, e2fsck cannot recognize that the filesystem failure occurred in the previous run and the corruption wouldn't be fixed. Task A Task B ext4_handle_error() -> jbd2_journal_abort() -> __journal_abort_soft() -> __jbd2_journal_abort_hard() \| -> journal->j_flags \|= JBD2_ABORT; \| \| __ext4_abort() \| -> jbd2_journal_abort() \| \| -> __journal_abort_soft() \| \| -> if (journal->j_flags & JBD2_ABORT) \| \| return; \| -> panic() \| -> jbd2_journal_update_sb_errno() Tested-by: Hobin Woo <hobin.woo@samsung.com> Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14	ext4: fix potential use after free in __ext4_journal_stop	Lukas Czerner
	commit 6934da9238da947628be83635e365df41064b09b upstream. There is a use-after-free possibility in __ext4_journal_stop() in the case that we free the handle in the first jbd2_journal_stop() because we're referencing handle->h_err afterwards. This was introduced in 9705acd63b125dee8b15c705216d7186daea4625 and it is wrong. Fix it by storing the handle->h_err value beforehand and avoid referencing potentially freed handle. Fixes: 9705acd63b125dee8b15c705216d7186daea4625 Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14	ext4 crypto: fix memory leak in ext4_bio_write_page()	Theodore Ts'o
	commit 937d7b84dca58f2565715f2c8e52f14c3d65fb22 upstream. There are times when ext4_bio_write_page() is called even though we don't actually need to do any I/O. This happens when ext4_writepage() gets called by the jbd2 commit path when an inode needs to force its pages written out in order to provide data=ordered guarantees --- and a page is backed by an unwritten (e.g., uninitialized) block on disk, or if delayed allocation means the page's backing store hasn't been allocated yet. In that case, we need to skip the call to ext4_encrypt_page(), since in addition to wasting CPU, it leads to a bounce page and an ext4 crypto context getting leaked. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14	rbd: don't put snap_context twice in rbd_queue_workfn()	Ilya Dryomov
	commit 70b16db86f564977df074072143284aec2cb1162 upstream. Commit 4e752f0ab0e8 ("rbd: access snapshot context and mapping size safely") moved ceph_get_snap_context() out of rbd_img_request_create() and into rbd_queue_workfn(), adding a ceph_put_snap_context() to the error path in rbd_queue_workfn(). However, rbd_img_request_create() consumes a ref on snapc, so calling ceph_put_snap_context() after a successful rbd_img_request_create() leads to an extra put. Fix it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14	Btrfs: fix race when listing an inode's xattrs	Filipe Manana
	commit f1cd1f0b7d1b5d4aaa5711e8f4e4898b0045cb6d upstream. When listing a inode's xattrs we have a time window where we race against a concurrent operation for adding a new hard link for our inode that makes us not return any xattr to user space. In order for this to happen, the first xattr of our inode needs to be at slot 0 of a leaf and the previous leaf must still have room for an inode ref (or extref) item, and this can happen because an inode's listxattrs callback does not lock the inode's i_mutex (nor does the VFS does it for us), but adding a hard link to an inode makes the VFS lock the inode's i_mutex before calling the inode's link callback. If we have the following leafs: Leaf X (has N items) Leaf Y [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 XATTR_ITEM 12345), ... ] slot N - 2 slot N - 1 slot 0 The race illustrated by the following sequence diagram is possible: CPU 1 CPU 2 btrfs_listxattr() searches for key (257 XATTR_ITEM 0) gets path with path->nodes[0] == leaf X and path->slots[0] == N because path->slots[0] is >= btrfs_header_nritems(leaf X), it calls btrfs_next_leaf() btrfs_next_leaf() releases the path adds key (257 INODE_REF 666) to the end of leaf X (slot N), and leaf X now has N + 1 items searches for the key (257 INODE_REF 256), with path->keep_locks == 1, because that is the last key it saw in leaf X before releasing the path ends up at leaf X again and it verifies that the key (257 INODE_REF 256) is no longer the last key in leaf X, so it returns with path->nodes[0] == leaf X and path->slots[0] == N, pointing to the new item with key (257 INODE_REF 666) btrfs_listxattr's loop iteration sees that the type of the key pointed by the path is different from the type BTRFS_XATTR_ITEM_KEY and so it breaks the loop and stops looking for more xattr items --> the application doesn't get any xattr listed for our inode So fix this by breaking the loop only if the key's type is greater than BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14	Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow	Filipe Manana
	commit 1d512cb77bdbda80f0dd0620a3b260d697fd581d upstream. If we are using the NO_HOLES feature, we have a tiny time window when running delalloc for a nodatacow inode where we can race with a concurrent link or xattr add operation leading to a BUG_ON. This happens because at run_delalloc_nocow() we end up casting a leaf item of type BTRFS_INODE_[REF\|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a file extent item (struct btrfs_file_extent_item) and then analyse its extent type field, which won't match any of the expected extent types (values BTRFS_FILE_EXTENT_[REG\|PREALLOC\|INLINE]) and therefore trigger an explicit BUG_ON(1). The following sequence diagram shows how the race happens when running a no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following neighbour leafs: Leaf X (has N items) Leaf Y [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ] slot N - 2 slot N - 1 slot 0 (Note the implicit hole for inode 257 regarding the [0, 8K[ range) CPU 1 CPU 2 run_dealloc_nocow() btrfs_lookup_file_extent() --> searches for a key with value (257 EXTENT_DATA 4096) in the fs/subvol tree --> returns us a path with path->nodes[0] == leaf X and path->slots[0] == N because path->slots[0] is >= btrfs_header_nritems(leaf X), it calls btrfs_next_leaf() btrfs_next_leaf() --> releases the path hard link added to our inode, with key (257 INODE_REF 500) added to the end of leaf X, so leaf X now has N + 1 keys --> searches for the key (257 INODE_REF 256), because it was the last key in leaf X before it released the path, with path->keep_locks set to 1 --> ends up at leaf X again and it verifies that the key (257 INODE_REF 256) is no longer the last key in the leaf, so it returns with path->nodes[0] == leaf X and path->slots[0] == N, pointing to the new item with key (257 INODE_REF 500) the loop iteration of run_dealloc_nocow() does not break out the loop and continues because the key referenced in the path at path->nodes[0] and path->slots[0] is for inode 257, its type is < BTRFS_EXTENT_DATA_KEY and its offset (500) is less then our delalloc range's end (8192) the item pointed by the path, an inode reference item, is (incorrectly) interpreted as a file extent item and we get an invalid extent type, leading to the BUG_ON(1): if (extent_type == BTRFS_FILE_EXTENT_REG \|\| extent_type == BTRFS_FILE_EXTENT_PREALLOC) { (...) } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) { (...) } else { BUG_ON(1) } The same can happen if a xattr is added concurrently and ends up having a key with an offset smaller then the delalloc's range end. So fix this by skipping keys with a type smaller than BTRFS_EXTENT_DATA_KEY. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14	Btrfs: fix race leading to incorrect item deletion when dropping extents	Filipe Manana
	commit aeafbf8486c9e2bd53f5cc3c10c0b7fd7149d69c upstream. While running a stress test I got the following warning triggered: [191627.672810] ------------[ cut here ]------------ [191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]() (...) [191627.701485] Call Trace: [191627.702037] [<ffffffff8145f077>] dump_stack+0x4f/0x7b [191627.702992] [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2 [191627.704091] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb [191627.705380] [<ffffffffa0664499>] ? __btrfs_drop_extents+0x391/0xa50 [btrfs] [191627.706637] [<ffffffff8104b46d>] warn_slowpath_null+0x1a/0x1c [191627.707789] [<ffffffffa0664499>] __btrfs_drop_extents+0x391/0xa50 [btrfs] [191627.709155] [<ffffffff8115663c>] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0 [191627.712444] [<ffffffff81155007>] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18 [191627.714162] [<ffffffffa06570c9>] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs] [191627.715887] [<ffffffffa065422b>] ? start_transaction+0x3bb/0x610 [btrfs] [191627.717287] [<ffffffffa065b604>] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs] [191627.728865] [<ffffffffa065b888>] finish_ordered_fn+0x15/0x17 [btrfs] [191627.730045] [<ffffffffa067d688>] normal_work_helper+0x14c/0x32c [btrfs] [191627.731256] [<ffffffffa067d96a>] btrfs_endio_write_helper+0x12/0x14 [btrfs] [191627.732661] [<ffffffff81061119>] process_one_work+0x24c/0x4ae [191627.733822] [<ffffffff810615b0>] worker_thread+0x206/0x2c2 [191627.734857] [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f [191627.736052] [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f [191627.737349] [<ffffffff810669a6>] kthread+0xef/0xf7 [191627.738267] [<ffffffff810f3b3a>] ? time_hardirqs_on+0x15/0x28 [191627.739330] [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad [191627.741976] [<ffffffff81465592>] ret_from_fork+0x42/0x70 [191627.743080] [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad [191627.744206] ---[ end trace bbfddacb7aaada8d ]--- $ cat -n fs/btrfs/file.c 691 int __btrfs_drop_extents(struct btrfs_trans_handle *trans, (...) 758 btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); 759 if (key.objectid > ino \|\| 760 key.type > BTRFS_EXTENT_DATA_KEY \|\| key.offset >= end) 761 break; 762 763 fi = btrfs_item_ptr(leaf, path->slots[0], 764 struct btrfs_file_extent_item); 765 extent_type = btrfs_file_extent_type(leaf, fi); 766 767 if (extent_type == BTRFS_FILE_EXTENT_REG \|\| 768 extent_type == BTRFS_FILE_EXTENT_PREALLOC) { (...) 774 } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) { (...) 778 } else { 779 WARN_ON(1); 780 extent_end = search_start; 781 } (...) This happened because the item we were processing did not match a file extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this case we cast the item to a struct btrfs_file_extent_item pointer and then find a type field value that does not match any of the expected values (BTRFS_FILE_EXTENT_[REG\|PREALLOC\|INLINE]). This scenario happens due to a tiny time window where a race can happen as exemplified below. For example, consider the following scenario where we're using the NO_HOLES feature and we have the following two neighbour leafs: Leaf X (has N items) Leaf Y [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ] slot N - 2 slot N - 1 slot 0 Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather than explicit because NO_HOLES is enabled). Now if our inode has an ordered extent for the range [4K, 8K[ that is finishing, the following can happen: CPU 1 CPU 2 btrfs_finish_ordered_io() insert_reserved_file_extent() __btrfs_drop_extents() Searches for the key (257 EXTENT_DATA 4096) through btrfs_lookup_file_extent() Key not found and we get a path where path->nodes[0] == leaf X and path->slots[0] == N Because path->slots[0] is >= btrfs_header_nritems(leaf X), we call btrfs_next_leaf() btrfs_next_leaf() releases the path inserts key (257 INODE_REF 4096) at the end of leaf X, leaf X now has N + 1 keys, and the new key is at slot N btrfs_next_leaf() searches for key (257 INODE_REF 256), with path->keep_locks set to 1, because it was the last key it saw in leaf X finds it in leaf X again and notices it's no longer the last key of the leaf, so it returns 0 with path->nodes[0] == leaf X and path->slots[0] == N (which is now < btrfs_header_nritems(leaf X)), pointing to the new key (257 INODE_REF 4096) __btrfs_drop_extents() casts the item at path->nodes[0], slot path->slots[0], to a struct btrfs_file_extent_item - it does not skip keys for the target inode with a type less than BTRFS_EXTENT_DATA_KEY (BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY) sees a bogus value for the type field triggering the WARN_ON in the trace shown above, and sets extent_end = search_start (4096) does the if-then-else logic to fixup 0 length extent items created by a past bug from hole punching: if (extent_end == key.offset && extent_end >= search_start) goto delete_extent_item; that evaluates to true and it ends up deleting the key pointed to by path->slots[0], (257 INODE_REF 4096), from leaf X The same could happen for example for a xattr that ends up having a key with an offset value that matches search_start (very unlikely but not impossible). So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are skipped, never casted to struct btrfs_file_extent_item and never deleted by accident. Also protect against the unexpected case of getting a key for a lower inode number by skipping that key and issuing a warning. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14	Btrfs: fix truncation of compressed and inlined extents	Filipe Manana
	commit 0305cd5f7fca85dae392b9ba85b116896eb7c1c7 upstream. When truncating a file to a smaller size which consists of an inline extent that is compressed, we did not discard (or made unusable) the data between the new file size and the old file size, wasting metadata space and allowing for the truncated data to be leaked and the data corruption/loss mentioned below. We were also not correctly decrementing the number of bytes used by the inode, we were setting it to zero, giving a wrong report for callers of the stat(2) syscall. The fsck tool also reported an error about a mismatch between the nbytes of the file versus the real space used by the file. Now because we weren't discarding the truncated region of the file, it was possible for a caller of the clone ioctl to actually read the data that was truncated, allowing for a security breach without requiring root access to the system, using only standard filesystem operations. The scenario is the following: 1) User A creates a file which consists of an inline and compressed extent with a size of 2000 bytes - the file is not accessible to any other users (no read, write or execution permission for anyone else); 2) The user truncates the file to a size of 1000 bytes; 3) User A makes the file world readable; 4) User B creates a file consisting of an inline extent of 2000 bytes; 5) User B issues a clone operation from user A's file into its own file (using a length argument of 0, clone the whole range); 6) User B now gets to see the 1000 bytes that user A truncated from its file before it made its file world readbale. User B also lost the bytes in the range [1000, 2000[ bytes from its own file, but that might be ok if his/her intention was reading stale data from user A that was never supposed to be public. Note that this contrasts with the case where we truncate a file from 2000 bytes to 1000 bytes and then truncate it back from 1000 to 2000 bytes. In this case reading any byte from the range [1000, 2000[ will return a value of 0x00, instead of the original data. This problem exists since the clone ioctl was added and happens both with and without my recent data loss and file corruption fixes for the clone ioctl (patch "Btrfs: fix file corruption and data loss after cloning inline extents"). So fix this by truncating the compressed inline extents as we do for the non-compressed case, which involves decompressing, if the data isn't already in the page cache, compressing the truncated version of the extent, writing the compressed content into the inline extent and then truncate it. The following test case for fstests reproduces the problem. In order for the test to pass both this fix and my previous fix for the clone ioctl that forbids cloning a smaller inline extent into a larger one, which is titled "Btrfs: fix file corruption and data loss after cloning inline extents", are needed. Without that other fix the test fails in a different way that does not leak the truncated data, instead part of destination file gets replaced with zeroes (because the destination file has a larger inline extent than the source). seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount "-o compress" # Create our test files. File foo is going to be the source of a clone operation # and consists of a single inline extent with an uncompressed size of 512 bytes, # while file bar consists of a single inline extent with an uncompressed size of # 256 bytes. For our test's purpose, it's important that file bar has an inline # extent with a size smaller than foo's inline extent. $XFS_IO_PROG -f -c "pwrite -S 0xa1 0 128" \ -c "pwrite -S 0x2a 128 384" \ $SCRATCH_MNT/foo \| _filter_xfs_io $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 256" $SCRATCH_MNT/bar \| _filter_xfs_io # Now durably persist all metadata and data. We do this to make sure that we get # on disk an inline extent with a size of 512 bytes for file foo. sync # Now truncate our file foo to a smaller size. Because it consists of a # compressed and inline extent, btrfs did not shrink the inline extent to the # new size (if the extent was not compressed, btrfs would shrink it to 128 # bytes), it only updates the inode's i_size to 128 bytes. $XFS_IO_PROG -c "truncate 128" $SCRATCH_MNT/foo # Now clone foo's inline extent into bar. # This clone operation should fail with errno EOPNOTSUPP because the source # file consists only of an inline extent and the file's size is smaller than # the inline extent of the destination (128 bytes < 256 bytes). However the # clone ioctl was not prepared to deal with a file that has a size smaller # than the size of its inline extent (something that happens only for compressed # inline extents), resulting in copying the full inline extent from the source # file into the destination file. # # Note that btrfs' clone operation for inline extents consists of removing the # inline extent from the destination inode and copy the inline extent from the # source inode into the destination inode, meaning that if the destination # inode's inline extent is larger (N bytes) than the source inode's inline # extent (M bytes), some bytes (N - M bytes) will be lost from the destination # file. Btrfs could copy the source inline extent's data into the destination's # inline extent so that we would not lose any data, but that's currently not # done due to the complexity that would be needed to deal with such cases # (specially when one or both extents are compressed), returning EOPNOTSUPP, as # it's normally not a very common case to clone very small files (only case # where we get inline extents) and copying inline extents does not save any # space (unlike for normal, non-inlined extents). $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar # Now because the above clone operation used to succeed, and due to foo's inline # extent not being shinked by the truncate operation, our file bar got the whole # inline extent copied from foo, making us lose the last 128 bytes from bar # which got replaced by the bytes in range [128, 256[ from foo before foo was # truncated - in other words, data loss from bar and being able to read old and # stale data from foo that should not be possible to read anymore through normal # filesystem operations. Contrast with the case where we truncate a file from a # size N to a smaller size M, truncate it back to size N and then read the range # [M, N[, we should always get the value 0x00 for all the bytes in that range. # We expected the clone operation to fail with errno EOPNOTSUPP and therefore # not modify our file's bar data/metadata. So its content should be 256 bytes # long with all bytes having the value 0xbb. # # Without the btrfs bug fix, the clone operation succeeded and resulted in # leaking truncated data from foo, the bytes that belonged to its range # [128, 256[, and losing data from bar in that same range. So reading the # file gave us the following content: # # 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 # * # 0000200 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a # * # 0000400 echo "File bar's content after the clone operation:" od -t x1 $SCRATCH_MNT/bar # Also because the foo's inline extent was not shrunk by the truncate # operation, btrfs' fsck, which is run by the fstests framework everytime a # test completes, failed reporting the following error: # # root 5 inode 257 errors 400, nbytes wrong status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>