linux-linaro-lng.git - Linaro Networking Group Kernels

Age	Commit message	Author
2014-05-07	Linux 3.10.37-rt38 REBASE (linux-lng-preempt-rt-v3.10.37-rt38-final, linux-lng-preempt-rt-v3.10.x, linux-lng-preempt-rt)	Steven Rostedt (Red Hat)

2014-05-07	rcu: make RCU_BOOST default on RT	Sebastian Andrzej Siewior
	Since it is no longer invoked from the softirq, people run into OOM more often if the priority of the RCU thread is too low. Making boosting default on RT should help in those cases, and it can be switched off if someone knows better. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-07	net: gianfar: do not try to cleanup TX packets if they are not done	Sebastian Andrzej Siewior
	What I observe is that the TX queue is not empty and does not make any progress. gfar_clean_tx_ring() does not clean up the packet because it is not completed yet. The root cause is that the DMA engine did not start yet (it was preempted before doing so) and that dumb loop loops until that packet is gone. This is broken since c233cf4 ("gianfar: Fix tx napi polling"). What remains are spurious interrupts if CPU0 cleans up TX packets and CPU1 returns with IRQ_NONE. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> [ added return howmany; ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-07	net: gianfar: do not disable interrupts	Sebastian Andrzej Siewior
	Each per-queue lock is taken with spin_lock_irqsave() except in the case where all of them are taken for some kind of serialisation. As an optimisation local_irq_save() is used so that lock_tx_qs() and lock_rx_qs() can use just the spin_lock() variant instead. On RT local_irq_save() behaves differently so we use the nort() variant. Lockdep screams easily, triggered by "ethtool -K eth0 rx off tx off". What remains is a missing lockdep annotation that makes lockdep think lock_tx_qs() may cause a deadlock. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-07	crypto: Reduce preempt disabled regions, more algos	Sebastian Andrzej Siewior
	Don Estabrook reported \| kernel: WARNING: CPU: 2 PID: 858 at kernel/sched/core.c:2428 migrate_disable+0xed/0x100() \| kernel: WARNING: CPU: 2 PID: 858 at kernel/sched/core.c:2462 migrate_enable+0x17b/0x200() \| kernel: WARNING: CPU: 3 PID: 865 at kernel/sched/core.c:2428 migrate_disable+0xed/0x100() and his backtrace showed some crypto functions which looked fine. The problem is the following sequence: glue_xts_crypt_128bit() { blkcipher_walk_virt(); /* normal migrate_disable() */ glue_fpu_begin(); /* get atomic */ while (nbytes) { __glue_xts_crypt_128bit(); blkcipher_walk_done(); /* with nbytes = 0, migrate_enable() while we are atomic */ }; glue_fpu_end() /* no longer atomic */ } and this is why the counter gets out of sync and the warning is printed. The other problem is that we are non-preemptible between glue_fpu_begin() and glue_fpu_end() and the latency grows. To fix this, I shorten the FPU off region and ensure blkcipher_walk_done() is called with preemption enabled. This might hurt the performance because we now enable/disable the FPU state more often but we gain lower latency and the bug is gone. Cc: stable-rt@vger.kernel.org Reported-by: Don Estabrook <don.estabrook@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
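The ordering problem in this commit message can be modelled in userspace. What follows is a minimal sketch, not kernel code: `preempt_depth`, `model_fpu_begin()`, `model_fpu_end()` and `model_walk_done()` are hypothetical stand-ins for the preempt counter, glue_fpu_begin(), glue_fpu_end() and blkcipher_walk_done(). It only counts how often the walk-done step runs while the model is "atomic", once for the old ordering and once for the shortened FPU region.

```c
#include <assert.h>

/* Hypothetical userspace model; none of these are kernel APIs. */
static int preempt_depth;      /* > 0 means "atomic" (FPU region open) */
static int atomic_walk_done;   /* walk-done calls made while atomic    */

static void model_fpu_begin(void) { preempt_depth++; }

static void model_fpu_end(void)
{
    assert(preempt_depth > 0); /* an end must match an earlier begin */
    preempt_depth--;
}

/* Stand-in for blkcipher_walk_done(): records calls made while atomic. */
static void model_walk_done(void)
{
    if (preempt_depth > 0)
        atomic_walk_done++;
}

/* Old ordering: one FPU region spans the whole loop. */
static int run_old_ordering(int blocks)
{
    preempt_depth = 0;
    atomic_walk_done = 0;
    model_fpu_begin();
    for (int i = 0; i < blocks; i++)
        model_walk_done();     /* runs while atomic */
    model_fpu_end();
    return atomic_walk_done;
}

/* Shortened region: the FPU section is closed before each walk-done call. */
static int run_short_region(int blocks)
{
    preempt_depth = 0;
    atomic_walk_done = 0;
    for (int i = 0; i < blocks; i++) {
        model_fpu_begin();
        /* the cipher work would happen here */
        model_fpu_end();
        model_walk_done();     /* runs with preemption enabled */
    }
    return atomic_walk_done;
}
```

In this model the old ordering makes every walk-done call while atomic, and the shortened region makes none, at the price of opening and closing the FPU section once per block.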
2014-05-07	cpu_chill: Add a UNINTERRUPTIBLE hrtimer_nanosleep	Steven Rostedt
	We hit another bug that was caused by switching cpu_chill() from msleep() to hrtimer_nanosleep(). This time it is a livelock. The problem is that hrtimer_nanosleep() calls schedule with the state == TASK_INTERRUPTIBLE. But this means that if a signal is pending, the scheduler won't schedule, and will simply change the current task state back to TASK_RUNNING. This nullifies the whole point of cpu_chill() in the first place. That is, if a task is spinning on a try_lock() and it preempted the owner of the lock, if it has a signal pending, it will never give up the CPU to let the owner of the lock run. I made a static function __hrtimer_nanosleep() that takes a fifth parameter "state", which determines the task state that the nanosleep() will be in. The normal hrtimer_nanosleep() will act the same, but cpu_chill() will call the __hrtimer_nanosleep() directly with the TASK_UNINTERRUPTIBLE state. cpu_chill() only cares that the first sleep happens, and does not care about the state of the restart schedule (in hrtimer_nanosleep_restart). Cc: stable-rt@vger.kernel.org Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-07	fs: jbd2: pull your plug when waiting for space	Sebastian Andrzej Siewior
	Two cps in parallel managed to stall the ext4 fs. It seems that journal code is either waiting for locks or sleeping waiting for something to happen. This seems similar to what Mike observed on ext3, here is his description: \|With an -rt kernel, and a heavy sync IO load, tasks can jam \|up on journal locks without unplugging, which can lead to \|terminal IO starvation. Unplug and schedule when waiting \|for space. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-07	net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait ↵	Marc Kleine-Budde
	for outstanding qdisc_run calls On PREEMPT_RT enabled systems the interrupt handler run as threads at prio 50 (by default). If a high priority userspace process tries to shut down a busy network interface it might spin in a yield loop waiting for the device to become idle. With the interrupt thread having a lower priority than the looping process it might never be scheduled and so result in a deadlock on UP systems. With Magic SysRq the following backtrace can be produced: > test_app R running 0 174 168 0x00000000 > [<c02c7070>] (__schedule+0x220/0x3fc) from [<c02c7870>] (preempt_schedule_irq+0x48/0x80) > [<c02c7870>] (preempt_schedule_irq+0x48/0x80) from [<c0008fa8>] (svc_preempt+0x8/0x20) > [<c0008fa8>] (svc_preempt+0x8/0x20) from [<c001a984>] (local_bh_enable+0x18/0x88) > [<c001a984>] (local_bh_enable+0x18/0x88) from [<c025316c>] (dev_deactivate_many+0x220/0x264) > [<c025316c>] (dev_deactivate_many+0x220/0x264) from [<c023be04>] (__dev_close_many+0x64/0xd4) > [<c023be04>] (__dev_close_many+0x64/0xd4) from [<c023be9c>] (__dev_close+0x28/0x3c) > [<c023be9c>] (__dev_close+0x28/0x3c) from [<c023f7f0>] (__dev_change_flags+0x88/0x130) > [<c023f7f0>] (__dev_change_flags+0x88/0x130) from [<c023f904>] (dev_change_flags+0x10/0x48) > [<c023f904>] (dev_change_flags+0x10/0x48) from [<c024c140>] (do_setlink+0x370/0x7ec) > [<c024c140>] (do_setlink+0x370/0x7ec) from [<c024d2f0>] (rtnl_newlink+0x2b4/0x450) > [<c024d2f0>] (rtnl_newlink+0x2b4/0x450) from [<c024cfa0>] (rtnetlink_rcv_msg+0x158/0x1f4) > [<c024cfa0>] (rtnetlink_rcv_msg+0x158/0x1f4) from [<c0256740>] (netlink_rcv_skb+0xac/0xc0) > [<c0256740>] (netlink_rcv_skb+0xac/0xc0) from [<c024bbd8>] (rtnetlink_rcv+0x18/0x24) > [<c024bbd8>] (rtnetlink_rcv+0x18/0x24) from [<c02561b8>] (netlink_unicast+0x13c/0x198) > [<c02561b8>] (netlink_unicast+0x13c/0x198) from [<c025651c>] (netlink_sendmsg+0x264/0x2e0) > [<c025651c>] (netlink_sendmsg+0x264/0x2e0) from [<c022af98>] (sock_sendmsg+0x78/0x98) > [<c022af98>] (sock_sendmsg+0x78/0x98) 
from [<c022bb50>] (___sys_sendmsg.part.25+0x268/0x278) > [<c022bb50>] (___sys_sendmsg.part.25+0x268/0x278) from [<c022cf08>] (__sys_sendmsg+0x48/0x78) > [<c022cf08>] (__sys_sendmsg+0x48/0x78) from [<c0009320>] (ret_fast_syscall+0x0/0x2c) This patch works around the problem by replacing yield() by msleep(1), giving the interrupt thread time to finish, similar to other changes contained in the rt patch set. Using wait_for_completion() instead would probably be a better solution. Cc: stable-rt@vger.kernel.org Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-07	rcu: Eliminate softirq processing from rcutree	Paul E. McKenney
	Running RCU out of softirq is a problem for some workloads that would like to manage RCU core processing independently of other softirq work, for example, setting kthread priority. This commit therefore moves the RCU core work from softirq to a per-CPU/per-flavor SCHED_OTHER kthread named rcuc. The SCHED_OTHER approach avoids the scalability problems that appeared with the earlier attempt to move RCU core processing from softirq to kthreads. That said, kernels built with RCU_BOOST=y will run the rcuc kthreads at the RCU-boosting priority. Cc: stable-rt@vger.kernel.org Reported-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-07	leds: trigger: disable CPU trigger on -RT	Sebastian Andrzej Siewior
	as it triggers: \|CPU: 0 PID: 0 Comm: swapper Not tainted 3.12.8-rt10 #141 \|[<c0014aa4>] (unwind_backtrace+0x0/0xf8) from [<c0012788>] (show_stack+0x1c/0x20) \|[<c0012788>] (show_stack+0x1c/0x20) from [<c043c8dc>] (dump_stack+0x20/0x2c) \|[<c043c8dc>] (dump_stack+0x20/0x2c) from [<c004c5e8>] (__might_sleep+0x13c/0x170) \|[<c004c5e8>] (__might_sleep+0x13c/0x170) from [<c043f270>] (__rt_spin_lock+0x28/0x38) \|[<c043f270>] (__rt_spin_lock+0x28/0x38) from [<c043fa00>] (rt_read_lock+0x68/0x7c) \|[<c043fa00>] (rt_read_lock+0x68/0x7c) from [<c036cf74>] (led_trigger_event+0x2c/0x5c) \|[<c036cf74>] (led_trigger_event+0x2c/0x5c) from [<c036e0bc>] (ledtrig_cpu+0x54/0x5c) \|[<c036e0bc>] (ledtrig_cpu+0x54/0x5c) from [<c000ffd8>] (arch_cpu_idle_exit+0x18/0x1c) \|[<c000ffd8>] (arch_cpu_idle_exit+0x18/0x1c) from [<c00590b8>] (cpu_startup_entry+0xa8/0x234) \|[<c00590b8>] (cpu_startup_entry+0xa8/0x234) from [<c043b2cc>] (rest_init+0xb8/0xe0) \|[<c043b2cc>] (rest_init+0xb8/0xe0) from [<c061ebe0>] (start_kernel+0x2c4/0x380) Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-07	net: ip_send_unicast_reply: add missing local serialization	Nicholas Mc Guire
	In response to the oops in ip_output.c:ip_send_unicast_reply under high network load with CONFIG_PREEMPT_RT_FULL=y, reported by Sami Pietikainen <Sami.Pietikainen@wapice.com>, this patch adds local serialization in ip_send_unicast_reply. from ip_output.c: /* * Generic function to send a packet as reply to another packet. * Used to send some TCP resets/acks so far. * * Use a fake percpu inet socket to avoid false sharing and contention. */ static DEFINE_PER_CPU(struct inet_sock, unicast_sock) = { ... which was added in commit be9f4a44 in linux-stable. The git log, which introduced the PER_CPU unicast_sock, states: <snip> commit be9f4a44e7d41cee50ddb5f038fc2391cbbb4046 Author: Eric Dumazet <edumazet@google.com> Date: Thu Jul 19 07:34:03 2012 +0000 ipv4: tcp: remove per net tcp_sock tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket per network namespace. This leads to bad behavior on multiqueue NICS, because many cpus contend for the socket lock and once socket lock is acquired, extra false sharing on various socket fields slow down the operations. To better resist to attacks, we use a percpu socket. Each cpu can run without contention, using appropriate memory (local node) <snip> The per-cpu here thus is assuming exclusivity serializing per cpu - so the use of get_cpu_light introduced in net-use-cpu-light-in-ip-send-unicast-reply.patch, which dropped the preempt_disable in favor of a migrate_disable is probably wrong as this only handles the referential consistency but not the serialization. To evade a preempt_disable here a local lock would be needed. 
Therapy: add a local lock and re-introduce local serialization. Tested on x86 with high network load using the testcase from Sami Pietikainen: while : ; do wget -O - ftp://LOCAL_SERVER/empty_file > /dev/null 2>&1; done Link: http://www.spinics.net/lists/linux-rt-users/msg11007.html Cc: stable-rt@vger.kernel.org Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-07	arm/unwind: use a raw_spin_lock	Sebastian Andrzej Siewior
	Mostly unwind is done with irqs enabled; however, SLUB may call it with irqs disabled while creating a new SLUB cache. I had a system freeze while loading a module which called kmem_cache_create() on init. That means SLUB's __slab_alloc() disabled interrupts and then ->new_slab_objects() ->new_slab() ->setup_object() ->setup_object_debug() ->init_tracking() ->set_track() ->save_stack_trace() ->save_stack_trace_tsk() ->walk_stackframe() ->unwind_frame() ->unwind_find_idx() =>spin_lock_irqsave(&unwind_lock); Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-07	irq_work: allow certain work in hard irq context	Sebastian Andrzej Siewior
	irq_work is processed in softirq context on -RT because we want to avoid long latencies which might arise from processing lots of perf events. The noHZ-full mode requires its callback to be called from real hardirq context (commit 76c24fb ("nohz: New APIs to re-evaluate the tick on full dynticks CPUs")). If it is called from a thread context we might get wrong results for checks like "is_idle_task(current)". This patch introduces a second list (hirq_work_list) which will be used if irq_work_run() has been invoked from hardirq context and processes only work items marked with IRQ_WORK_HARD_IRQ. This patch also removes arch_irq_work_raise() from sparc & powerpc like it is already done for x86. At least for powerpc it is somehow superfluous because it is called from the timer interrupt which should invoke update_process_times(). Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-07	kernel/hrtimer: be non-freezeable in cpu_chill()	Sebastian Andrzej Siewior
	Since we replaced msleep() by hrtimer I see now and then (rarely) this: \| [....] Waiting for /dev to be fully populated... \| ===================================== \| [ BUG: udevd/229 still has locks held! ] \| 3.12.11-rt17 #23 Not tainted \| ------------------------------------- \| 1 lock held by udevd/229: \| #0: (&type->i_mutex_dir_key#2){+.+.+.}, at: lookup_slow+0x28/0x98 \| \| stack backtrace: \| CPU: 0 PID: 229 Comm: udevd Not tainted 3.12.11-rt17 #23 \| (unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14) \| (show_stack+0x10/0x14) from (dump_stack+0x74/0xbc) \| (dump_stack+0x74/0xbc) from (do_nanosleep+0x120/0x160) \| (do_nanosleep+0x120/0x160) from (hrtimer_nanosleep+0x90/0x110) \| (hrtimer_nanosleep+0x90/0x110) from (cpu_chill+0x30/0x38) \| (cpu_chill+0x30/0x38) from (dentry_kill+0x158/0x1ec) \| (dentry_kill+0x158/0x1ec) from (dput+0x74/0x15c) \| (dput+0x74/0x15c) from (lookup_real+0x4c/0x50) \| (lookup_real+0x4c/0x50) from (__lookup_hash+0x34/0x44) \| (__lookup_hash+0x34/0x44) from (lookup_slow+0x38/0x98) \| (lookup_slow+0x38/0x98) from (path_lookupat+0x208/0x7fc) \| (path_lookupat+0x208/0x7fc) from (filename_lookup+0x20/0x60) \| (filename_lookup+0x20/0x60) from (user_path_at_empty+0x50/0x7c) \| (user_path_at_empty+0x50/0x7c) from (user_path_at+0x14/0x1c) \| (user_path_at+0x14/0x1c) from (vfs_fstatat+0x48/0x94) \| (vfs_fstatat+0x48/0x94) from (SyS_stat64+0x14/0x30) \| (SyS_stat64+0x14/0x30) from (ret_fast_syscall+0x0/0x48) For now I see no better way but to disable the freezer for the sleep period. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-07	rt: Make cpu_chill() use hrtimer instead of msleep()	Steven Rostedt
	Ulrich Obergfell pointed out that cpu_chill() calls msleep() which is woken up by the ksoftirqd running the TIMER softirq. But as the cpu_chill() is called from softirq context, it may block the ksoftirqd() from running, in which case, it may never wake up the msleep() causing the deadlock. I checked the vmcore, and irq/74-qla2xxx is stuck in the msleep() call, running on CPU 8. The one ksoftirqd that is stuck, happens to be the one that runs on CPU 8, and it is blocked on a lock held by irq/74-qla2xxx. As that ksoftirqd is the one that will wake up irq/74-qla2xxx, and it happens to be blocked on a lock that irq/74-qla2xxx holds, we have our deadlock. The solution is not to convert the cpu_chill() back to a cpu_relax() as that will re-create a possible live lock that the cpu_chill() fixed earlier, and may also leave this bug open on other softirqs. The fix is to remove the dependency on ksoftirqd from cpu_chill(). That is, instead of calling msleep() that requires ksoftirqd to wake it up, use the hrtimer_nanosleep() code that does the wakeup from hard irq context. \|Looks to be the lock of the block softirq. I don't have the core dump \|anymore, but from what I could tell the ksoftirqd was blocked on the \|block softirq lock, where the block softirq handler did a msleep \|(called by the qla2xxx interrupt handler). \| \|Looking at trigger_softirq() in block/blk-softirq.c, it can do a \|smp_callfunction() to another cpu to run the block softirq. If that \|happens to be the cpu where the qla2xx irq handler is doing the block \|softirq and is in a middle of a msleep(), I believe the ksoftirqd will \|try to run the softirq. If it does that, then BOOM, it's deadlocked \|because the ksoftirqd will never run the timer softirq either. \|I should have also stated that it was only one lock that was involved. \|But the lock owner was doing a msleep() that requires a wakeup by \|ksoftirqd to continue. 
If ksoftirqd happens to be blocked on a lock \|held by the msleep() caller, then you have your deadlock. \| \|It's best not to have any softirqs going to sleep requiring another \|softirq to wake it up. Note, if we ever require a timer softirq to do a \|cpu_chill() it will most definitely hit this deadlock. Cc: stable-rt@vger.kernel.org Found-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> [bigeasy: add the 4 \| chapters from email] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-07	Revert "x86: Disable IST stacks for debug/int 3/stack fault for PREEMPT_RT"	Sebastian Andrzej Siewior
	where do I start. Let me explain what is going on here. The code sequence \| pushf \| pop %edx \| or $0x1,%dh \| push %edx \| mov $0xe0,%eax \| popf \| sysenter triggers the bug. On 64bit kernel we see the double fault (with 32bit and 64bit userland) and on 32bit kernel there is no problem. The reporter said that double fault does not happen on 64bit kernel with 64bit userland and this is because in that case the VDSO uses the "syscall" interface instead of "sysenter". The bug. "popf" loads the flags with the TF bit set which enables "single stepping" and this leads to a debug exception. Usually on 64bit we have a special IST stack for the debug exception. Due to patch [0] we do not use the IST stack but the kernel stack instead. On 64bit the sysenter instruction starts in kernel with the stack address NULL. The code sequence above enters the debug exception (TF flag) after the sysenter instruction was executed which sets the stack pointer to NULL and we have a fault (it seems that the debug exception saves some bytes on the stack). To fix the double fault I'm going to drop patch [0]. It is completely pointless. In do_debug() and do_stack_segment() we disable preemption which means the task can't leave the CPU. So it does not matter if we run on IST or on kernel stack. There is a patch [1] which drops preempt_disable() call for a 32bit kernel but not for 64bit so there should be no regression. And [1] seems valid even for this code sequence. We enter the debug exception with a 256bytes long per cpu stack and migrate to the kernel stack before calling do_debug(). [0] x86-disable-debug-stack.patch [1] fix-rt-int3-x86_32-3.2-rt.patch Cc: stable-rt@vger.kernel.org Reported-by: Brian Silverman <bsilver16384@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-07	rcutree/rcu_bh_qs: disable irq while calling rcu_preempt_qs()	Tiejun Chen
	Any caller of rcu_preempt_qs() must disable irqs in order to protect the assignment to ->rcu_read_unlock_special. In the RT case, rcu_bh_qs(), as the wrapper of rcu_preempt_qs(), is called in some scenarios where irqs are enabled, like this path: do_single_softirq() \| + local_irq_enable(); + handle_softirq() \| \| \| + rcu_bh_qs() \| \| \| + rcu_preempt_qs() \| + local_irq_disable() So here we'd better disable irqs directly inside of rcu_bh_qs() to fix this, otherwise the kernel may occasionally freeze, as observed. This approach is also safe for any potential rcu_bh_qs() usage elsewhere in the future. Cc: stable-rt@vger.kernel.org Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com> Signed-off-by: Bin Jiang <bin.jiang@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-07	timer/rt: Always raise the softirq if there's irq_work to be done	Steven Rostedt
	It was previously discovered that some systems would hang on boot up with a previous version of 3.12-rt. This was due to RCU using irq_work, and RT defers the irq_work to a softirq. But if there are no active timers, the softirq will not be raised, and RCU work will not get done, causing the system to hang. The fix was to check that if there were no active timers but irq_work to be done, then we should raise the softirq. But this fix was not 100% correct. It left out the case that there were active timers that were not expired yet. This would have the softirq not get raised even if there was irq work to be done. If there is irq_work to be done, then we must raise the timer softirq regardless of whether there are active timers or whether they are expired or not. The softirq can handle those cases. But we can never ignore irq_work. As it is only PREEMPT_RT_FULL that requires irq_work to be done in the softirq, we can pull out the check in the active_timers condition, and make the code a bit cleaner by having the irq_work check separate, and put the code in with the other #ifdef PREEMPT_RT. If there is irq_work to be done, there's no need to check the active timers or if they are expired. Just raise the timer softirq and be done with it. Otherwise, we can do the timer checks just like we do with non -rt. Cc: stable-rt@vger.kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
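The decision order described in this commit message can be written down as a small pure function. This is a hedged userspace sketch, not the patch itself: `struct timer_state`, `ticks_after_eq()` and `should_raise_timer_softirq()` are hypothetical names, `irq_work_pending` stands in for irq_work_needs_cpu(), and `rt_full` stands in for the PREEMPT_RT_FULL build option.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the decision described above; not the kernel code. */
struct timer_state {
    bool irq_work_pending;      /* stand-in for irq_work_needs_cpu()   */
    bool has_active_timers;
    unsigned long next_expiry;  /* tick value of the earliest timer    */
};

/* Wrap-safe "a >= b" for jiffies-style tick counters. */
static bool ticks_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

static bool should_raise_timer_softirq(const struct timer_state *s,
                                       unsigned long now, bool rt_full)
{
    assert(s != NULL);

    /* On RT, pending irq_work always wins: no timer checks needed. */
    if (rt_full && s->irq_work_pending)
        return true;

    /* Otherwise do the same timer checks as a non-RT kernel. */
    if (!s->has_active_timers)
        return false;
    return ticks_after_eq(now, s->next_expiry);
}
```

The case the earlier fix missed corresponds to `irq_work_pending` and `has_active_timers` both being true while `next_expiry` is still in the future: with the irq_work check placed first, the function returns true there as well.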
2014-05-07	timer: Raise softirq if there's irq_work	Steven Rostedt
	[ Talking with Sebastian on IRC, it seems that doing the irq_work_run() from the interrupt in -rt is a bad thing. Here we simply raise the softirq if there's irq work to do. This too boots on my i7 ] After trying hard to figure out why my i7 box was locking up with the new active_timers code, that does not run the timer softirq if there are no active timers, I took an extra look at the softirq handler and noticed that it doesn't just run timer softirqs, it also runs irq work. This was the bug that was locking up the system. It wasn't missing a timer, it was missing irq work. By always doing the irq work callbacks, the system boots fine. The missing irq work callback was the RCU's sp_wakeup() function. No need to check for defined(CONFIG_IRQ_WORK). When that's not set the "irq_work_needs_cpu()" is a static inline that returns false. Cc: stable-rt@vger.kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-07	timers: do not raise softirq unconditionally	Thomas Gleixner
	Mike, On Thu, 7 Nov 2013, Mike Galbraith wrote: > On Thu, 2013-11-07 at 04:26 +0100, Mike Galbraith wrote: > > On Wed, 2013-11-06 at 18:49 +0100, Thomas Gleixner wrote: > > > > I bet you are trying to work around some of the side effects of the > > > occasional tick which is still necessary despite of full nohz, right? > > > > Nope, I wanted to check out cost of nohz_full for rt, and found that it > > doesn't work at all instead, looked, and found that the sole running > > task has just awakened ksoftirqd when it wants to shut the tick down, so > > that shutdown never happens. > > Like so in virgin 3.10-rt. Box is x3550 M3 booted nowatchdog > rcu_nocbs=1-3 nohz_full=1-3, and CPUs1-3 are completely isolated via > cpusets as well. well, that very same problem is in mainline if you add "threadirqs" to the command line. But we can be smart about this. The untested patch below should address that issue. If that works on mainline we can adapt it for RT (needs a trylock(&base->lock) there). Though it's not a full solution. It needs some thought versus the softirq code of timers. Assume we have only one timer queued 1000 ticks into the future. So this change will cause the timer softirq not to be called until that timer expires and then the timer softirq is going to do 1000 loops until it catches up with jiffies. That's anything but pretty ... What worries me more is this one: pert-5229 [003] d..h1.. 684.482618: softirq_raise: vec=9 [action=RCU] The CPU has no callbacks as you shoved them over to cpu 0, so why is the RCU softirq raised? Thanks, tglx ------------------ Message-id: <alpine.DEB.2.02.1311071158350.23353@ionos.tec.linutronix.de> \|CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
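The catch-up cost mentioned in this mail (one timer queued 1000 ticks ahead, then a softirq that loops until it reaches jiffies) can be illustrated with a toy loop. This is a simplified sketch under stated assumptions: `catch_up_iterations()` is a hypothetical helper that advances a base tick counter one step per iteration, which is only an approximation of what the timer softirq does.

```c
#include <assert.h>

/* Hypothetical, simplified model: advance the base one tick per iteration. */
static unsigned long catch_up_iterations(unsigned long base_ticks,
                                         unsigned long now)
{
    unsigned long loops = 0;

    /* The base must not be ahead of the current tick (wrap-safe check). */
    assert((long)(now - base_ticks) >= 0);

    while (base_ticks != now) {
        base_ticks++;
        loops++;
    }
    return loops;
}
```

With a base that is 1000 ticks behind, the helper performs 1000 iterations in a single call, which is the burst of work the mail calls "anything but pretty".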
2014-05-07	rcu: Don't activate RCU core on NO_HZ_FULL CPUs	Paul E. McKenney
	Whenever a CPU receives a scheduling-clock interrupt, RCU checks to see if the RCU core needs anything from this CPU. If so, RCU raises RCU_SOFTIRQ to carry out any needed processing. This approach has worked well historically, but it is undesirable on NO_HZ_FULL CPUs. Such CPUs are expected to spend almost all of their time in userspace, so that scheduling-clock interrupts can be disabled while there is only one runnable task on the CPU in question. Unfortunately, raising any softirq has the potential to wake up ksoftirqd, which would provide the second runnable task on that CPU, preventing disabling of scheduling-clock interrupts. What is needed instead is for RCU to leave NO_HZ_FULL CPUs alone, relying on the grace-period kthreads' quiescent-state forcing to do any needed RCU work on behalf of those CPUs. This commit therefore refrains from raising RCU_SOFTIRQ on any NO_HZ_FULL CPUs during any grace periods that have been in effect for less than one second. The one-second limit handles the case where an inappropriate workload is running on a NO_HZ_FULL CPU that features lots of scheduling-clock interrupts, but no idle or userspace time. Cc: stable-rt@vger.kernel.org Reported-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Mike Galbraith <bitbucket@online.de> Tested-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-07	net: make neigh_priv_len in struct net_device 16bit instead of 8bit	Sebastian Siewior
	neigh_priv_len is defined as u8. With all debug enabled struct ipoib_neigh has 200 bytes. The largest part is sk_buff_head with 96 bytes and here the spinlock with 72 bytes. The size value still fits in this u8 leaving some room for more. On -RT struct ipoib_neigh put on weight and has 392 bytes. The main reason is sk_buff_head with 288 and the fatty here is spinlock with 192 bytes. This no longer fits into neigh_priv_len and gcc complains. This patch changes neigh_priv_len from being 8bit to 16bit. Since the following element (dev_id) is 16bit followed by a spinlock which is aligned, the struct remains with a total size of 3200 (allmodconfig) / 2048 (with as much debug off as possible) bytes on x86-64. On x86-32 the struct is 1856 (allmodconfig) / 1216 (with as much debug off as possible) bytes long. The numbers were gained with and without the patch to prove that this change does not increase the size of the struct. Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
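The size arithmetic in this commit message is easy to check in plain C. The sketch below uses the two struct sizes quoted in the message (200 and 392 bytes); the helper names `fits_u8()`, `fits_u16()` and `stored_in_u8()` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sizes quoted in the commit message above. */
enum { IPOIB_NEIGH_SIZE = 200, IPOIB_NEIGH_SIZE_RT = 392 };

static bool fits_u8(unsigned int size)  { return size <= UINT8_MAX; }
static bool fits_u16(unsigned int size) { return size <= UINT16_MAX; }

/* What an 8-bit field would actually hold after the assignment. */
static unsigned int stored_in_u8(unsigned int size)
{
    uint8_t field = (uint8_t)size;   /* value is reduced modulo 256 */
    return field;
}
```

An 8-bit field tops out at 255, so 200 fits and 392 does not; stored in a u8, 392 would silently become 392 - 256 = 136. A 16-bit field holds either value.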
2014-05-07	rtmutex: use a trylock for waiter lock in trylock	Sebastian Andrzej Siewior
	Mike Galbraith captured the following: \| >#11 [ffff88017b243e90] _raw_spin_lock at ffffffff815d2596 \| >#12 [ffff88017b243e90] rt_mutex_trylock at ffffffff815d15be \| >#13 [ffff88017b243eb0] get_next_timer_interrupt at ffffffff81063b42 \| >#14 [ffff88017b243f00] tick_nohz_stop_sched_tick at ffffffff810bd1fd \| >#15 [ffff88017b243f70] tick_nohz_irq_exit at ffffffff810bd7d2 \| >#16 [ffff88017b243f90] irq_exit at ffffffff8105b02d \| >#17 [ffff88017b243fb0] reschedule_interrupt at ffffffff815db3dd \| >--- <IRQ stack> --- \| >#18 [ffff88017a2a9bc8] reschedule_interrupt at ffffffff815db3dd \| > [exception RIP: task_blocks_on_rt_mutex+51] \| >#19 [ffff88017a2a9ce0] rt_spin_lock_slowlock at ffffffff815d183c \| >#20 [ffff88017a2a9da0] lock_timer_base.isra.35 at ffffffff81061cbf \| >#21 [ffff88017a2a9dd0] schedule_timeout at ffffffff815cf1ce \| >#22 [ffff88017a2a9e50] rcu_gp_kthread at ffffffff810f9bbb \| >#23 [ffff88017a2a9ed0] kthread at ffffffff810796d5 \| >#24 [ffff88017a2a9f50] ret_from_fork at ffffffff815da04c lock_timer_base() does a try_lock() which deadlocks on the waiter lock, not the lock itself. This patch takes the waiter_lock with trylock so it should work from interrupt context as well. If the fastpath doesn't work and the waiter_lock itself is taken then it seems that the lock itself is taken. This patch also adds a "rt_spin_try_unlock" to keep lockdep happy. If we managed to take the wait_lock in the first place we should also be able to take it in the unlock path. Cc: stable-rt@vger.kernel.org Reported-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-07	lockdep: Correctly annotate hardirq context in irq_exit()	Peter Zijlstra
	There was a reported deadlock on -rt which lockdep didn't report. It turns out that in irq_exit() we tell lockdep that the hardirq context ends and then do all kinds of locking afterwards. To fix it, move trace_hardirq_exit() to the very end of irq_exit(), this ensures all locking in tick_irq_exit() and rcu_irq_exit() are properly recorded as happening from hardirq context. This however leads to the 'fun' little problem of running softirqs while in hardirq context. To cure this make the softirq code a little more complex (in the CONFIG_TRACE_IRQFLAGS case). Due to stack swizzling arch dependent trickery we cannot pass an argument to __do_softirq() to tell it if it was done from hardirq context or not; so use a side-band argument. When we do __do_softirq() from hardirq context, 'atomically' flip to softirq context and back, so that no locking goes without being in either hard- or soft-irq context. I didn't find any new problems in mainline using this patch, but it did show the -rt problem. Cc: stable-rt@vger.kernel.org Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-dgwc5cdksbn0jk09vbmcc9sa@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-07	swait: Add a few more users	Sebastian Andrzej Siewior
	The wait-simple queue is lighter weight and more efficient than the full wait queue, and may be used in atomic context on PREEMPT_RT. Fix up some places that needed to call the swait_() functions instead of the wait_() functions. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-07	cpu_down: move migrate_enable() back	Tiejun Chen
	Commit 08c1ab68, "hotplug-use-migrate-disable.patch", intends to use migrate_enable()/migrate_disable() to replace the combination of preempt_enable() and preempt_disable(), but in the !CONFIG_PREEMPT_RT_FULL case migrate_enable()/migrate_disable() are still equal to preempt_enable()/preempt_disable(). So the cpu_hotplug_begin()/cpu_unplug_begin(cpu) call that follows may call schedule() and trigger schedule_debug() like this: _cpu_down() \| + migrate_disable() = preempt_disable() \| + cpu_hotplug_begin() or cpu_unplug_begin() \| + schedule() \| + __schedule() \| + preempt_disable(); \| + __schedule_bug() is true! So we should move migrate_enable() back to where it was in the original scheme. Cc: stable-rt@vger.kernel.org Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-07	kconfig-preempt-rt-full.patch	Thomas Gleixner
	Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-07	kconfig-disable-a-few-options-rt.patch	Thomas Gleixner
	Disable stuff which is known to have issues on RT Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-07	md: disable bcache	Sebastian Andrzej Siewior
	It uses anon semaphores \|drivers/md/bcache/request.c: In function ‘cached_dev_write_complete’: \|drivers/md/bcache/request.c:1007:2: error: implicit declaration of function ‘up_read_non_owner’ [-Werror=implicit-function-declaration] \| up_read_non_owner(&dc->writeback_lock); \| ^ \|drivers/md/bcache/request.c: In function ‘request_write’: \|drivers/md/bcache/request.c:1033:2: error: implicit declaration of function ‘down_read_non_owner’ [-Werror=implicit-function-declaration] \| down_read_non_owner(&dc->writeback_lock); \| ^ either we get rid of those or we have to introduce them… Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-07	rt,ntp: Move call to schedule_delayed_work() to helper thread	Steven Rostedt
	The ntp code for notify_cmos_timer() is called from a hard interrupt context. schedule_delayed_work() under PREEMPT_RT_FULL calls spinlocks that have been converted to mutexes, thus calling schedule_delayed_work() from interrupt is not safe. Add a helper thread that does the call to schedule_delayed_work and wake up that thread instead of calling schedule_delayed_work() directly. This is only for CONFIG_PREEMPT_RT_FULL, otherwise the code still calls schedule_delayed_work() directly in irq context. Note: There's a few places in the kernel that do this. Perhaps the RT code should have a dedicated thread that does the checks. Just register a notifier on boot up for your check and wake up the thread when needed. This will be a todo. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-07	completion: Use simple wait queues	Thomas Gleixner
	Completions have no long lasting callbacks and therefore do not need the complex waitqueue variant. Use simple waitqueues, which reduce the contention on the waitqueue lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-07	rcu-more-swait-conversions.patch	Thomas Gleixner
	Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Merged Steven's static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp) { - swait_wake(&rnp->nocb_gp_wq[rnp->completed & 0x1]); + wake_up_all(&rnp->nocb_gp_wq[rnp->completed & 0x1]); } Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-07	kernel/treercu: use a simple waitqueue	Sebastian Andrzej Siewior
	Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-07	rcutiny: Use simple waitqueue	Thomas Gleixner
	Simple waitqueues can be handled from interrupt disabled contexts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-07	simple-wait: rename and export the equivalent of waitqueue_active()	Paul Gortmaker
	The function "swait_head_has_waiters()" was internalized into wait-simple.c but it parallels the waitqueue_active() of normal waitqueue support. Given that there are over 150 waitqueue_active users in drivers/ fs/ kernel/ and the like, let's make it globally visible, and rename it to parallel waitqueue_active accordingly. We'll need to do this if we expect to expand its usage beyond RT. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-07	wait-simple: Rework for use with completions	Thomas Gleixner
	Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-07	wait-simple: Simple waitqueue implementation	Thomas Gleixner
	wait_queue is a swiss army knife and in most of the cases the complexity is not needed. For RT waitqueues are a constant source of trouble as we can't convert the head lock to a raw spinlock due to fancy and long lasting callbacks. Provide a slim version, which allows RT to replace wait queues. This should go mainline as well, as it lowers memory consumption and runtime overhead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> smp_mb() added by Steven Rostedt to fix a race condition with swait wakeups vs adding items to the list.
2014-05-07	drm/i915: drop trace_i915_gem_ring_dispatch on rt	Sebastian Andrzej Siewior
	This tracepoint is responsible for: \|[<814cc358>] __schedule_bug+0x4d/0x59 \|[<814d24cc>] __schedule+0x88c/0x930 \|[<814d3b90>] ? _raw_spin_unlock_irqrestore+0x40/0x50 \|[<814d3b95>] ? _raw_spin_unlock_irqrestore+0x45/0x50 \|[<810b57b5>] ? task_blocks_on_rt_mutex+0x1f5/0x250 \|[<814d27d9>] schedule+0x29/0x70 \|[<814d3423>] rt_spin_lock_slowlock+0x15b/0x278 \|[<814d3786>] rt_spin_lock+0x26/0x30 \|[<a00dced9>] gen6_gt_force_wake_get+0x29/0x60 [i915] \|[<a00e183f>] gen6_ring_get_irq+0x5f/0x100 [i915] \|[<a00b2a33>] ftrace_raw_event_i915_gem_ring_dispatch+0xe3/0x100 [i915] \|[<a00ac1b3>] i915_gem_do_execbuffer.isra.13+0xbd3/0x1430 [i915] \|[<810f8943>] ? trace_buffer_unlock_commit+0x43/0x60 \|[<8113e8d2>] ? ftrace_raw_event_kmem_alloc+0xd2/0x180 \|[<8101d063>] ? native_sched_clock+0x13/0x80 \|[<a00acf29>] i915_gem_execbuffer2+0x99/0x280 [i915] \|[<a00114a3>] drm_ioctl+0x4c3/0x570 [drm] \|[<8101d0d9>] ? sched_clock+0x9/0x10 \|[<a00ace90>] ? i915_gem_execbuffer+0x480/0x480 [i915] \|[<810f1c18>] ? rb_commit+0x68/0xa0 \|[<810f1c6c>] ? ring_buffer_unlock_commit+0x1c/0xa0 \|[<81197467>] do_vfs_ioctl+0x97/0x540 \|[<81021318>] ? ftrace_raw_event_sys_enter+0xd8/0x130 \|[<811979a1>] sys_ioctl+0x91/0xb0 \|[<814db931>] tracesys+0xe1/0xe6 Chris Wilson does not like to move i915_trace_irq_get() out of the macro \|No. This enables the IRQ, as well as making a number of \|very expensively serialised read, unconditionally. so it is gone now on RT. Cc: stable-rt@vger.kernel.org Reported-by: Joakim Hernberg <jbh@alchemy.lu> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-07	gpu/i915: don't open code these things	Sebastian Andrzej Siewior
	The opencode part is gone in 1f83fee0 ("drm/i915: clear up wedged transitions") the owner check is still there. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-07	drm: remove preempt_disable() from drm_calc_vbltimestamp_from_scanoutpos()	Sebastian Andrzej Siewior
	Luis captured the following: \| BUG: sleeping function called from invalid context at kernel/rtmutex.c:659 \| in_atomic(): 1, irqs_disabled(): 0, pid: 517, name: Xorg \| 2 locks held by Xorg/517: \| #0: \| ( \| &dev->vbl_lock \| ){......} \| , at: \| [<ffffffffa0024c60>] drm_vblank_get+0x30/0x2b0 [drm] \| #1: \| ( \| &dev->vblank_time_lock \| ){......} \| , at: \| [<ffffffffa0024ce1>] drm_vblank_get+0xb1/0x2b0 [drm] \| Preemption disabled at: \| [<ffffffffa008bc95>] i915_get_vblank_timestamp+0x45/0xa0 [i915] \| CPU: 3 PID: 517 Comm: Xorg Not tainted 3.10.10-rt7+ #5 \| Call Trace: \| [<ffffffff8164b790>] dump_stack+0x19/0x1b \| [<ffffffff8107e62f>] __might_sleep+0xff/0x170 \| [<ffffffff81651ac4>] rt_spin_lock+0x24/0x60 \| [<ffffffffa0084e67>] i915_read32+0x27/0x170 [i915] \| [<ffffffffa008a591>] i915_pipe_enabled+0x31/0x40 [i915] \| [<ffffffffa008a6be>] i915_get_crtc_scanoutpos+0x3e/0x1b0 [i915] \| [<ffffffffa00245d4>] drm_calc_vbltimestamp_from_scanoutpos+0xf4/0x430 [drm] \| [<ffffffffa008bc95>] i915_get_vblank_timestamp+0x45/0xa0 [i915] \| [<ffffffffa0024998>] drm_get_last_vbltimestamp+0x48/0x70 [drm] \| [<ffffffffa0024db5>] drm_vblank_get+0x185/0x2b0 [drm] \| [<ffffffffa0025d03>] drm_wait_vblank+0x83/0x5d0 [drm] \| [<ffffffffa00212a2>] drm_ioctl+0x552/0x6a0 [drm] \| [<ffffffff811a0095>] do_vfs_ioctl+0x325/0x5b0 \| [<ffffffff811a03a1>] SyS_ioctl+0x81/0xa0 \| [<ffffffff8165a342>] tracesys+0xdd/0xe2 After a longer thread it was decided to drop the preempt_disable()/ enable() invocations which were meant for -RT and Mario Kleiner looks for a replacement. Cc: stable-rt@vger.kernel.org Reported-By: Luis Claudio R. Goncalves <lclaudio@uudg.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-07	mmci: Remove bogus local_irq_save()	Thomas Gleixner
	On !RT interrupt runs with interrupts disabled. On RT it's in a thread, so no need to disable interrupts at all. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-07	i2c/omap: drop the lock hard irq context	Sebastian Andrzej Siewior
	The lock is taken while reading two registers. On RT the first lock is taken in hard irq, where it might sleep, and in the threaded irq. The threaded irq runs in oneshot mode, so the hard irq does not run until the thread completes, and there is no reason to grab the lock. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2014-05-07	powerpc-preempt-lazy-support.patch	Thomas Gleixner
	Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-07	arm-preempt-lazy-support.patch	Thomas Gleixner
	Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-07	x86-preempt-lazy.patch	Thomas Gleixner
	Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-07	sched: Add support for lazy preemption	Thomas Gleixner
	It has become an obsession to mitigate the determinism vs. throughput loss of RT. Looking at the mainline semantics of preemption points gives a hint why RT sucks throughput wise for ordinary SCHED_OTHER tasks. One major issue is the wakeup of tasks which are right away preempting the waking task while the waking task holds a lock on which the woken task will block right after having preempted the waker. In mainline this is prevented due to the implicit preemption disable of spin/rw_lock held regions. On RT this is not possible due to the fully preemptible nature of sleeping spinlocks. Though for a SCHED_OTHER task preempting another SCHED_OTHER task this is really not a correctness issue. RT folks are concerned about SCHED_FIFO/RR tasks preemption and not about the purely fairness driven SCHED_OTHER preemption latencies. So I introduced a lazy preemption mechanism which only applies to SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside from the existing preempt_count, each task now sports a preempt_lazy_count which is manipulated on lock acquisition and release. This is slightly incorrect as for laziness reasons I coupled this on migrate_disable/enable so some other mechanisms get the same treatment (e.g. get_cpu_light). Now on the scheduler side instead of setting NEED_RESCHED this sets NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and therefore allows the waking task to exit the lock held region before the woken task preempts. That also works better for cross CPU wakeups as the other side can stay in the adaptive spinning loop. For RT class preemption there is no change. This simply sets NEED_RESCHED and forgoes the lazy preemption counter. 
Initial tests do not expose any observable latency increase, but history shows that I've been proven wrong before :) The lazy preemption mode is on by default, but with CONFIG_SCHED_DEBUG enabled it can be disabled via: # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features and reenabled via # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features The test results so far are very machine and workload dependent, but there is a clear trend that it enhances the non RT workload performance. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
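The decision described above can be modelled in a few lines of plain C. This is a minimal sketch under assumptions, not the scheduler code: `struct task_model`, `resched_model()`, `lazy_lock_acquire()`, `lazy_lock_release()` and `should_preempt_now()` are hypothetical names, and the two flags merely stand in for NEED_RESCHED and NEED_RESCHED_LAZY.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; names echo the commit text only. */
enum sched_policy_model { POLICY_OTHER, POLICY_FIFO, POLICY_RR };

struct task_model {
    enum sched_policy_model policy;
    int  preempt_lazy_count;   /* raised on lock acquisition, dropped on release */
    bool need_resched;         /* stands in for NEED_RESCHED */
    bool need_resched_lazy;    /* stands in for NEED_RESCHED_LAZY */
};

static void lazy_lock_acquire(struct task_model *t)
{
    t->preempt_lazy_count++;
}

static void lazy_lock_release(struct task_model *t)
{
    assert(t->preempt_lazy_count > 0);
    t->preempt_lazy_count--;
}

/* Called when 'woken' should preempt the currently running task 'curr'. */
static void resched_model(struct task_model *curr,
                          const struct task_model *woken,
                          bool lazy_enabled)
{
    if (lazy_enabled &&
        curr->policy == POLICY_OTHER && woken->policy == POLICY_OTHER)
        curr->need_resched_lazy = true;   /* fairness only: may be deferred */
    else
        curr->need_resched = true;        /* RT class involved: preempt now */
}

/* The lazy flag is only honoured once the task holds no more locks. */
static bool should_preempt_now(const struct task_model *curr)
{
    if (curr->need_resched)
        return true;
    return curr->need_resched_lazy && curr->preempt_lazy_count == 0;
}
```

In this model a SCHED_OTHER waker that holds a lock keeps running until `lazy_lock_release()` brings the counter back to zero, while a SCHED_FIFO/RR wakeup ignores the counter entirely.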
2014-05-07	rcu: Disable RCU_FAST_NO_HZ on RT	Thomas Gleixner
	This uses a timer_list timer from the irq disabled guts of the idle code. Disable it for now to prevent wreckage. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
2014-05-07	rcu: rcutiny: Prevent RCU stall	Thomas Gleixner
	rcu_read_unlock_special() checks in_serving_softirq() and leaves early when true. On RT this is obviously wrong as softirq processing context can be preempted and therefore such a task can be on the gp_tasks list. Leaving early here will leave the task on the list and therefore block RCU processing forever. This cannot happen on mainline because softirq processing context cannot be preempted there. In fact this check looks quite questionable in general. Neither irq context nor softirq processing context can ever be preempted in mainline, so the special unlock case should not ever be invoked in such context. Now the only explanation might be a rcu_read_unlock() being interrupted and therefore leaving the rcu nest count at 0 before the special unlock bit has been cleared. That looks fragile. At least it's missing a big fat comment. Paul ???? See mainline commits: ec433f0c5 and 8762705a for further enlightenment. Reported-by: Kristian Lehmann <krleit00@hs-esslingen.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
2014-05-07	softirq: Adapt NOHZ softirq pending check to new RT scheme	Thomas Gleixner
	We can't rely on ksoftirqd anymore and we need to check the tasks which run a particular softirq and if such a task is pi blocked ignore the other pending bits of that task as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-07	softirq: Split softirq locks	Thomas Gleixner
	The 3.x RT series removed the split softirq implementation in favour of pushing softirq processing into the context of the thread which raised it. Though this prevents us from handling the various softirqs at different priorities. Now instead of reintroducing the split softirq threads we split the locks which serialize the softirq processing. If a softirq is raised in context of a thread, then the softirq is noted on a per thread field, if the thread is in a bh disabled region. If the softirq is raised from hard interrupt context, then the bit is set in the flag field of ksoftirqd and ksoftirqd is invoked. When a thread leaves a bh disabled region, then it tries to execute the softirqs which have been raised in its own context. It acquires the per softirq / per cpu lock for the softirq and then checks, whether the softirq is still pending in the per cpu local_softirq_pending() field. If yes, it runs the softirq. If no, then some other task executed it already. This allows for zero config softirq elevation in the context of user space tasks or interrupt threads. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
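The flow described above (note the softirq per thread, take the per softirq lock on leaving the bh disabled region, then re-check the per CPU pending field) can be sketched as a single-threaded C model. It is an illustration under assumptions, not the RT implementation: all names are hypothetical, the "lock" is a plain flag, and the model assumes that raising a softirq sets both the per thread bit and the per CPU pending bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_SOFTIRQS_MODEL 4

/* Hypothetical per CPU state. */
struct cpu_model {
    uint32_t pending;                        /* models local_softirq_pending() */
    bool     lock_held[NR_SOFTIRQS_MODEL];   /* one lock per softirq, per CPU */
    int      runs[NR_SOFTIRQS_MODEL];        /* how often each softirq ran */
};

/* Hypothetical per thread state. */
struct thread_model {
    uint32_t raised;   /* softirqs raised by this thread while bh was disabled */
};

/* Raise softirq 'nr' from thread context inside a bh disabled region. */
static void raise_in_thread(struct cpu_model *cpu, struct thread_model *t, int nr)
{
    assert(nr >= 0 && nr < NR_SOFTIRQS_MODEL);
    t->raised   |= 1u << nr;
    cpu->pending |= 1u << nr;
}

/* Leaving the bh disabled region: run what this thread raised, if still pending. */
static void leave_bh_disabled(struct cpu_model *cpu, struct thread_model *t)
{
    for (int nr = 0; nr < NR_SOFTIRQS_MODEL; nr++) {
        uint32_t bit = 1u << nr;

        if (!(t->raised & bit))
            continue;
        t->raised &= ~bit;

        assert(!cpu->lock_held[nr]);   /* single-threaded model: never contended */
        cpu->lock_held[nr] = true;     /* take the per softirq / per cpu lock */

        if (cpu->pending & bit) {      /* still pending: this thread runs it */
            cpu->pending &= ~bit;
            cpu->runs[nr]++;
        }                              /* else: another task already ran it */

        cpu->lock_held[nr] = false;
    }
}
```

If two threads raise the same softirq, the first one to leave its bh disabled region runs it and the second finds the pending bit already cleared, which is the "some other task executed it already" case from the commit text.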