
Code

January 23, 2006

Signals, C and copyover

Moderate technical stuff ahead

I've been playing in the mud code lately, a lot of it at a signal handling and stack-frame level, which has given us a nice stack tracing function as part of the signal handler (which traps larger errors and pinpoints their location in the code).
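For anyone curious what a stack-tracing signal handler looks like, here is a minimal sketch. It is not the mud's actual handler: the names crash_handler and install_crash_handlers are made up for illustration, and it assumes a libc that provides backtrace() in <execinfo.h> (glibc and the BSDs do, but it is not POSIX). Strictly speaking backtrace() is not guaranteed to be safe inside a signal handler either, so treat this as a debugging aid rather than something robust.

```c
#define _POSIX_C_SOURCE 200809L
#include <execinfo.h>   /* backtrace(), backtrace_symbols_fd(): libc extension, not POSIX */
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define MAX_FRAMES 64

/* Hypothetical handler: dump the call stack to stderr, then let the
 * default action (usually a core dump) take over. */
static void crash_handler(int signo)
{
    void *frames[MAX_FRAMES];
    int count = backtrace(frames, MAX_FRAMES);

    /* Writes one line per frame straight to the descriptor, no malloc. */
    backtrace_symbols_fd(frames, count, STDERR_FILENO);

    /* Restore the default action and re-raise. The signal is blocked while
     * this handler runs, so it is delivered once the handler returns. */
    signal(signo, SIG_DFL);
    raise(signo);
}

/* Hook the usual "large error" signals. Returns 0 on success, -1 on failure. */
int install_crash_handlers(void)
{
    static const int signals[] = { SIGSEGV, SIGBUS, SIGFPE, SIGILL };
    struct sigaction sa;
    size_t i;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = crash_handler;
    sigemptyset(&sa.sa_mask);

    for (i = 0; i < sizeof signals / sizeof signals[0]; i++) {
        if (sigaction(signals[i], &sa, NULL) != 0)
            return -1;
    }
    return 0;
}
```

Compiling with -g (and -rdynamic on Linux) makes the symbol names in the trace far more useful.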

Next was copyover, a clever command that allows you to restart the mud without disconnecting anyone. It starts a new process image using execl(), handing over control, and lets it know which characters to reconnect (and how). I implemented it (well, I pasted in working copyover code from the Chuck codebase) thinking that should be fairly straightforward.
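The handover itself can be sketched roughly like this. It is an illustration only, not the Chuck code: the function names, the "copyover" argument and the idea of passing the listening socket's descriptor number on the command line are all assumptions. What it relies on is that open descriptors survive execl() unless they are marked close-on-exec, which is what keeps the players connected.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Format a descriptor number as a command-line argument.
 * Returns 0 on success, -1 if the buffer is too small. */
int format_fd_arg(char *buf, size_t size, int fd)
{
    int n = snprintf(buf, size, "%d", fd);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Replace the running mud with a fresh image of the same executable.
 * exe_path and the "copyover" flag are hypothetical conventions.
 * Only returns (with -1) if something went wrong. */
int copyover_exec(const char *exe_path, int listen_fd)
{
    char fd_arg[16];

    if (format_fd_arg(fd_arg, sizeof fd_arg, listen_fd) != 0)
        return -1;

    /* Arguments: argv[0], a flag telling the new image it is a copyover,
     * and the descriptor to reuse. The list must end with (char *)NULL. */
    execl(exe_path, exe_path, "copyover", fd_arg, (char *)NULL);

    /* execl() only returns on failure, with errno set. */
    return -1;
}
```

The list of characters to reconnect would typically go through a temporary file that the new image reads on startup; that part is left out here.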

And in testing it worked fairly well. Except for the crash recovery, which, for reasons that left me stumped for a long while, would recover from one crash but never a second one. Ever. Further testing showed it wasn't just me - the behaviour existed in copyover previously.

Debugging was made slightly harder because it does create a new process - and the debugger can't follow code execution into it. The code was tweaked to start the debugger rather than the new process. It didn't help much. The signal continued to be unhandled.

My first guess was bad, and I wasted a bit of time wondering whether the signals were somehow still hooked and locked up by the previous code. Stepping through the signal hooking code (which had always worked) showed that the new process was still hooking all the signals correctly.

Which means it was time to consult an internet reference for the details of execl() and signals. I'm glad I did it in that order.

Delving into the Single UNIX Specification documentation from The Open Group reveals some interesting bits about execl() that no printed matter I've seen mentions:


The new process also inherits at least the following attributes from the calling process image:
nice value (see nice())
semadj values (see semop())
process ID
parent process ID
process group ID
session membership
real user ID
real group ID
supplementary group IDs
time left until an alarm clock signal (see alarm())
current working directory
root directory
file mode creation mask (see umask())
file size limit (see ulimit())
process signal mask (see sigprocmask())
pending signal (see sigpending())
tms_utime, tms_stime, tms_cutime, and tms_cstime (see times())
resource limits
controlling terminal
interval timers

The entry that matters here is the process signal mask. While a signal handler is running, the signal that triggered it is blocked, and it only gets unblocked when the handler returns. The crash copyover calls execl() before the handler ever returns, so after a SIGSEGV crash-copyover the signal mask still blocks SIGSEGV. This mask is subsequently passed on to the new copy - the second SIGSEGV never reached the code's error handling and crashed the mud.
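The masking is easy to see for yourself. The sketch below (my own helper names, and SIGUSR1 instead of SIGSEGV so nothing actually has to crash) asks sigprocmask() for the current mask from inside a handler and again after the handler has returned. It assumes a single-threaded program and a handler installed without SA_NODEFER.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Set by the handler: 1 if the signal was in the mask, 0 if not, -1 if unknown. */
static volatile sig_atomic_t blocked_in_handler = -1;

static void probe_handler(int signo)
{
    sigset_t current;

    /* With a NULL set the "how" argument is ignored; this only reads the mask. */
    if (sigprocmask(SIG_BLOCK, NULL, &current) == 0)
        blocked_in_handler = sigismember(&current, signo);
}

/* Install the probe, raise the signal, and report what the handler saw. */
int signal_blocked_in_handler(int signo)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = probe_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                /* no SA_NODEFER: default masking applies */

    blocked_in_handler = -1;
    if (sigaction(signo, &sa, &old) != 0)
        return -1;

    raise(signo);                   /* handler runs before raise() returns */
    sigaction(signo, &old, NULL);   /* put the previous disposition back */
    return blocked_in_handler;
}

/* Is the signal blocked right now, outside of any handler? */
int signal_blocked_now(int signo)
{
    sigset_t current;

    if (sigprocmask(SIG_BLOCK, NULL, &current) != 0)
        return -1;
    return sigismember(&current, signo);
}
```

Inside the handler the signal is in the mask; once the handler returns it is not. A process that calls execl() from inside the handler never gets that second part, and the new image starts life with the signal still blocked.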

The crash copyover code now passes the signal number to the new process, allowing it to unblock that signal with sigprocmask() and handle everything correctly.
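The unblocking step comes down to something like the following. The function name is mine, and how the signal number reaches the new process (command-line argument, copyover file, whatever) is left out; this is a sketch of the idea rather than the code that went into the mud.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Remove one signal from the inherited process signal mask.
 * Returns 0 on success, -1 on failure. */
int unblock_signal(int signo)
{
    sigset_t set;

    if (sigemptyset(&set) != 0 || sigaddset(&set, signo) != 0)
        return -1;

    /* SIG_UNBLOCK removes the signals in "set" from the current mask
     * and leaves every other entry in the mask alone. */
    return sigprocmask(SIG_UNBLOCK, &set, NULL);
}
```

The new process would call this early in startup with the signal number it was handed, for example unblock_signal(SIGSEGV), before re-hooking its handlers.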

Scrawled illegibly by Meathe at January 23, 2006 03:15 PM
