remote: Fix a stuck remote call pipeline causing testing to hang

Fix a stuck remote call pipeline comprised of multiple processes causing
testing to hang and requiring a manual intervention to either terminate
or proceed, like below (here with the GCC `c' testsuite invoked with
`execute.exp=postmod-1.c' for 8 compilation and 8 execution tests on a
remote QEMU target run in the system emulation mode):

PASS: gcc.c-torture/execute/postmod-1.c -O0 (test for excess errors)
Executing on remote-localhost: .../gcc/testsuite/gcc/postmod-1.exe (timeout = 15)
spawn [open ...]
WARNING: program timed out
got a INT signal, interrupted by user

		=== gcc Summary ===

# of expected passes		1

by not killing the pending force-kills in `close_wait_program' and also
by setting the channel associated with the pipeline to the nonblocking
mode when it is about to be closed afterwards.

The situation here is as follows.  A connection to the remote target
board is requested by `rsh_exec' with input redirection requested from
`/dev/null'.  The request is handled by `local_exec' and the redirection
causes a Tcl command pipeline channel to be opened.  The list of PIDs of
the processes comprising the pipeline is determined and then the channel
is assigned an Expect spawn ID.

The spawn ID is then monitored for output produced by the remote target
(here accessed with SSH) and, ultimately, for completion marked by the
end-of-file condition.  As SSH gets stuck and does not complete, the
timeout eventually fires and a kill sequence is initiated, by calling
`close_wait_program' with the previously obtained list of PIDs to kill
given as one of the procedure's arguments.

Seeing the list of PIDs rather than -1, `close_wait_program' issues
SIGINT to all the requested processes right away and schedules a delayed
sequence called "force-kills" against them, which sends SIGTERM and
then, after a further delay, SIGKILL.  Now `close_wait_program' calls
`close' on the spawn ID associated with the pipeline, but this call
doesn't affect the pipeline as its input has been redirected from
`/dev/null'.

As the next step `wait' is called on the same spawn ID and returns
successfully right away with a result like {0 exp8 0 0} in `wres', where
no PID is indicated, consistently with the null PID result of the
original `spawn' command that assigned the spawn ID (`exp8' here) to the
pipeline.  The return from the `wait' command causes code to be executed
for the pending force-kills to be killed.

At this point the process situation is like below:

  PID TTY      STAT   TIME COMMAND
 6908 pts/3    Sl     0:00 expect -- .../share/dejagnu/runtest.exp --tool gcc --target_board remote-localhost execute.exp=postmod-1.c
 6976 pts/3    S      0:00  \_ ssh -p 2222 -l macro localhost sh -c '.../gcc/testsuite/gcc/postmod-1.exe ; echo XYZ${?}ZYX'
 6977 pts/3    Z      0:00  \_ [cat] <defunct>
 6991 pts/3    Z      0:00  \_ [sh] <defunct>

so `cat' and `sh' have already terminated, the former presumably due to
SIGINT sent previously and the latter having been the force-kills just
killed, and only await being wait(2)ed for.  However, `ssh' is still
live and in the interruptible sleep, presumably awaiting communication
with the remote end.

Since there is nothing else to do for `close_wait_program' it returns
success to `local_exec', which then calls `close' on the pipeline to
clean up after it.  But that in turn causes wait(2) to be called on the
individual PIDs comprising the pipeline, and when the PID associated
with `ssh' is reached the call hangs indefinitely, preventing the whole
testsuite from proceeding.

A similar situation triggers with GDB testing where a Tcl command
pipeline channel is opened in `remote_spawn' instead, and then closed,
after `close_wait_program' has been called, in `standard_close'.

So the solution to the problem is twofold.  First, pending force-kills
are not killed after `wait' if there is more than one PID in the list
passed to `close_wait_program'.
This follows the observation that if there was only one PID on the list,
then the process must have been created directly by `spawn' rather than
by assigning a spawn ID to a pipeline and the return from `wait' would
mean the process associated with the PID must have already been cleaned
up after, so it is only when there are more that any may have been left
behind live.

Second, if a pipeline has been used, then the channel associated with
the pipeline is set to the nonblocking mode in case any of the processes
that may have been left live is stuck in the uninterruptible sleep (aka
D) state.  Such a process would necessarily ignore even SIGKILL so long
as it remains in that state and would cause wait(2) called by `close' to
hang possibly indefinitely, and we want the testsuite to proceed rather
than hang even in bad circumstances.

Finally it appears to be safe to leave pending force-kills to complete
their job after `wait' has been called in `close_wait_program', because
based on the observation made here the command does not actually call
wait(2) if issued on a spawn ID associated with a pipeline created by
`open' rather than a process created by `spawn'.  Instead the PIDs from
a pipeline are supposed to be cleaned up after by calling wait(2) from
the `close' command call made on the pipeline channel.  If on the other
hand the channel is set to the nonblocking mode before `close', then
even that command does not call wait(2) on the associated PIDs.

Therefore the PIDs on the list passed are not subject to PID reuse and
the force-kills won't accidentally kill an unrelated process, as a PID
cannot be allocated by the kernel for a new process until any previous
process's status has been consumed from its PID by wait(2).  And then
PIDs of any children that have actually terminated one way or another
are wait(2)ed for by Tcl automatically in the event loop, so no mess is
left behind.

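The PID reuse argument can be checked with a small shell experiment.
This is a sketch under stated assumptions rather than anything taken
from DejaGnu: the scratch file, the variable names and the delays are
all made up for the illustration.  A child is left unreaped on purpose
and signal 0 is then used to probe whether its PID is still allocated.

```shell
#!/bin/sh
# Scratch file used to pass the child's PID out; the name is arbitrary.
pidfile=$(mktemp)

# The inner shell starts a short-lived child, records its PID and then
# replaces itself with `sleep', which never calls wait(2).  Once the
# child exits it therefore lingers as a zombie until its parent is gone.
sh -c 'sleep 1 & echo $! > "$1"; exec sleep 4' sh "$pidfile" > /dev/null 2>&1 &
parent=$!

# Give the child time to exit while its parent is still running.
sleep 2
zpid=$(cat "$pidfile")

# Signal 0 delivers nothing; it only checks that the PID is allocated.
if kill -0 "$zpid" 2>/dev/null; then
    pid_reserved=yes
else
    pid_reserved=no
fi

wait "$parent"
rm -f "$pidfile"
echo "unreaped child $zpid still holds its PID: $pid_reserved"
```

On Linux the probe is expected to succeed, as the exited child keeps its
PID for as long as nobody has collected its status, which is the
property the force-kills rely on here.
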
	* lib/remote.exp (close_wait_program): Only kill the pending
	force-kills if the PID list has a single entry.
	(local_exec): Set the channel about to be closed to the
	nonblocking mode if we didn't see an EOF.
	(standard_close): Likewise, unconditionally.

Signed-off-by: Maciej W. Rozycki <macro@wdc.com>

author: Maciej W. Rozycki <macro@wdc.com> 2020-06-11 02:31:02 +0100
committer: Jacob Bachmeyer <jcb62281+dev@gmail.com> 2020-06-22 23:33:49 -0500
commit: 1b09d3a7b9912aab0a3e3ba2a63ebaa8e61f3238
tree: 3d7355beda896feaf3cca146f10c329eeeb68c7d /lib
parent: 04668d6771f583c5b0a782e075acb71191c33b55
1 files changed, 17 insertions, 1 deletions
diff --git a/lib/remote.exp b/lib/remote.exp
index 83472f7..1c9971a 100644
--- a/lib/remote.exp
+++ b/lib/remote.exp
@@ -109,11 +109,15 @@ proc close_wait_program { program_id pid {wres_varname ""} } {
 
     # Reap it.
     set res [catch "wait -i $program_id" wres]
-    if {$exec_pid != -1} {
+    if { $exec_pid != -1 && [llength $pid] == 1 } {
 	# We reaped the process, so cancel the pending force-kills, as
 	# otherwise if the PID is reused for some other unrelated
 	# process, we'd kill the wrong process.
 	#
+	# Do this if the PID list only has a single entry however, as
+	# otherwise `wait' will have returned right away regardless of
+	# whether any process of the pipeline has exited.
+	#
 	# Use `catch' in case the force-kills have completed, so as not
 	# to cause TCL to choke if `kill' returns a failure.
 	catch {exec sh -c "kill -9 $exec_pid" >& /dev/null}
@@ -242,6 +246,12 @@ proc local_exec { commandline inp outp timeout } {
     }
     set r2 [close_wait_program $spawn_id $pid wres]
     if { $id > 0 } {
+	if { $pid > 0 } {
+	    # If timed-out, don't wait for all the processes associated
+	    # with the pipeline to terminate as a stuck one would cause
+	    # us to hang.
+	    catch {fconfigure $id -blocking false}
+	}
 	set r2 [catch "close $id" res]
     } else {
 	verbose "waitres is $wres" 2
@@ -387,6 +397,12 @@ proc standard_close { host } {
 	close_wait_program $shell_id $pid
 
 	if {[info exists oid]} {
+	    if { $pid > 0 } {
+		# Don't wait for all the processes associated with the
+		# pipeline to terminate as a stuck one would cause us
+		# to hang.
+		catch {fconfigure $oid -blocking false}
+	    }
 	    catch "close $oid"
 	}