Friday, October 28, 2022

[SOLVED] How to detect Register Smashing on an Intel CPU in a Multithreaded CUDA application written in C and Python under Linux?

Issue

I am currently trying to debug a very large application with many modules, some written in C and some in Python. It uses both multithreading and CUDA, and runs on a modern Intel processor under Linux.

Currently I have a test case that runs in a loop for about an hour and then segfaults with an assertion error. The stack trace shows that I am calling g_signal_disconnect(obj, sig) with a valid value for sig, but that g_signal_disconnect is seeing a nonsensical value for sig. It appears that between the registers being set up for the call and the actual call, something changes the %rsi register that holds the sig value (the second argument under the x86-64 calling convention). That is, the caller's stack frame shows the correct value for sig in the local variable and in the register, but the callee sees a large random number instead. My guess is that some other task runs, or an external interrupt occurs, and causes the issue, but that is purely a guess.

This bug is consistent in that it's always this particular call that gets smashed, but it only happens randomly, once in thousands (hundreds of thousands?) of executions of that call. It also doesn't seem to matter whether I run natively, under gdb, or under valgrind; it still happens.

Because it's a register being changed, I can't get gdb to set a watchpoint on it to see what is changing it. Nor can gdb run code in reverse in a multithreaded environment.

Because it's a CUDA application, I cannot use rr-debugger to record the exact stream of instructions that causes the issue.

And although I can run the program under valgrind and get some results, it only tells me that the sig value is undefined when I go to use it, not where or when it became undefined. Nor does valgrind show any memory or multitasking errors that might reasonably be the culprit.

Now, I do have full access to the source code of the module in which the bug happens, so I can instrument it any way that makes sense, or recompile it, so long as the compilation options are compatible with the rest of the Linux stack it runs on. There may be something I can do there, but I don't know what.

Just finding some way to know which tasks run and/or which interrupts occur during the register-smashing window would go a long way toward narrowing things down, but I don't know how to obtain that information either.

Does anyone know of any tools, tips, techniques, or whatnot that will allow me to catch the register-smasher in the act? Once I know what routine is to blame, it should be possible to fix it.


Solution

Okay, thanks to everyone for their help. To address the actual question I asked: this kind of problem is currently best tackled with a debugger that can record and replay multithreaded instruction streams. rr-debugger does that and is open source, but it does not support CUDA. Undo UDB is commercial and has partial CUDA support; it is currently your best bet in similar circumstances (although in my case its CUDA support was insufficient). Both work through GDB and its reverse-execution commands.
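For anyone hitting a similar wall without CUDA in the picture, the workflow looks roughly like this (a sketch of an rr session, not taken from my project; the program name and the watched expression are placeholders):

$ rr record ./myapp          # record the run, repeating until it fails
$ rr replay                  # replay exactly the same execution under gdb
(rr) continue                # run forward to the assertion failure
(rr) watch -l data->sig      # hardware watchpoint on the suspect location
(rr) reverse-continue        # run backwards to the last write of that location

The point of record/replay is that the failing execution becomes deterministic, so you can chase a once-in-200,000 event backwards instead of trying to catch it live.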

Now, as to the actual bug, which has finally been found and fixed: it turned out NOT to be register corruption, it merely looked like it. It was a data race. I'm rather embarrassed to have made this particular mistake, but it is what it is. A rough paraphrase of the code follows:

void signal_setup(...)
  { struct signal_data * data = malloc(sizeof(struct signal_data));

    data->a = ...
    data->b = ...

    /* The callback is live as soon as g_signal_connect returns (and it may
       run in another thread), but data->sig is only written afterwards. */
    data->sig = g_signal_connect(obj, "sig", signal_cb, data, ...);

    ...
  }

void signal_cb( GObject * obj, void * user_data )
  { struct signal_data * data = user_data;

    /* Reads data->sig, which signal_setup may not have stored yet. */
    g_signal_disconnect(obj, data->sig);

    ...

    free(data);
  }

It turns out that about one time in every 200,000 calls or so, the signal would be triggered between the call to g_signal_connect and its signal id being stored in data->sig. As a result, the value the callback pulled out of data->sig was random junk, which g_signal_disconnect would (rightly) complain about.

However, because the callback ran in a different thread than the signal_setup routine, signal_setup would complete a few milliseconds later and finish filling in the struct signal_data, so that by then it was correct. The upshot was that when I looked at the stack frames in the debugger, the data structure held valid data, but the register that had been loaded from it was garbage. I therefore assumed register corruption in a narrow window.
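The cure is to make sure the callback never depends on a field that signal_setup may not have written yet. One way to close that window (a sketch only, not the exact change that went into the project; it assumes GLib's g_signal_handlers_disconnect_by_func, which disconnects by callback and data rather than by the stored handler id):

void signal_setup(...)
  { struct signal_data * data = malloc(sizeof(struct signal_data));

    /* Fill in everything the callback needs BEFORE connecting, so the
       callback can never observe a partially initialized struct. */
    data->a = ...
    data->b = ...

    g_signal_connect(obj, "sig", G_CALLBACK(signal_cb), data);

    ...
  }

void signal_cb( GObject * obj, void * user_data )
  { struct signal_data * data = user_data;

    /* Disconnect by callback+data; no dependency on a handler id that is
       only stored after the connect returns. */
    g_signal_handlers_disconnect_by_func(obj, signal_cb, data);

    ...

    free(data);
  }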

I didn't find the real bug until I put in timestamped logging of each signal setup and each signal callback, and saw a callback fire before its setup had completed, just before the crash.
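The logging itself was nothing fancy. Something along these lines (a sketch, not the project's actual logging code) is enough to make the ordering visible, because it stamps each event with a monotonic time and the thread that produced it:

#include <stdio.h>
#include <time.h>
#include <pthread.h>

static void log_event(const char * what, void * data)
  { struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    fprintf(stderr, "%ld.%09ld thread=%lu %s data=%p\n",
            (long) ts.tv_sec, ts.tv_nsec,
            (unsigned long) pthread_self(), what, data);
  }

/* Called as log_event("setup", data) right after g_signal_connect, and as
   log_event("callback", user_data) at the top of signal_cb; a "callback"
   line appearing before its matching "setup" line exposes the race. */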



Answered By - swestrup
Answer Checked By - Dawn Plyler (WPSolving Volunteer)