How to survive obscure random hangs and crashes?

A client was experiencing random hangs and crashes in one of our applications. How do you troubleshoot this kind of issue? Here is one possible approach.

Because of the nature of these issues, and because the user gave us little information about the circumstances surrounding them, we had no clue about their origin, and of course we could not apply our developer tools to investigate. The application's custom log gave no useful information either.
The only description we were getting was “…the application freezes and then we have to go to Task Manager and kill the application and start over…”

So, let’s start by getting better information. I prepared a small batch script to install and set up a noninvasive tool and to create a shortcut on their desktop and/or toolbar.
From then on, every time the user noticed a freeze, we asked them to click the shortcut we had installed to collect better information. The tool took a screenshot, created a memory dump, and finally killed the application.
The first time we looked at the screenshot and checked it against the memory dump, we learned that the application had crashed, not hung.

With this information, we used a “.reg” file to create specific Windows registry keys and values ([WER settings|https://msdn.microsoft.com/en-us/library/windows/desktop/bb513638%28v=vs.85%29.aspx||WER Settings]) and configured the workstation to create a memory dump every time the application crashed.
OK, now we had a couple of memory dump files, and from them we learned that we were facing the dreaded heap corruption. It showed up sometimes when the application was closing and other times when the garbage collector was trying to release memory.
Using “gflags” first and later “Application Verifier” from Microsoft, we tried to reproduce the problem in house and could detect where the heap corruption was happening.

Fine, we had now pinpointed the source code involved in the heap corruption, so let’s look at how to fix it.
Well, this part was not easy either. It was happening in a transition between native C++ and managed C++ code (C++/CLI).
It was happening in a class shared between these two worlds. The transition between them did not take into account the byte alignment of the different primitive types, specifically, in this case, the “bool” type.
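
To make the layout problem more concrete, here is a minimal sketch in portable C++. It is not the code from the application: the struct names, the field names, and the idea that one side treats each flag as a 4-byte value are all assumptions made for illustration. It only compares the two layouts and never performs the bad write.

```cpp
#include <cstddef>   // offsetof, std::size_t
#include <cstdint>   // std::int32_t

// Hypothetical record as one side lays it out: each bool takes 1 byte on
// mainstream compilers (the standard leaves sizeof(bool) implementation-defined).
struct NativeView {
    bool ready;
    bool dirty;
    std::int32_t count;
};

// The same record as the other side is ASSUMED to see it: each flag takes 4 bytes.
struct AssumedView {
    std::int32_t ready;
    std::int32_t dirty;
    std::int32_t count;
};

// True only when both sides agree on the total size and on every field offset.
constexpr bool layoutsAgree() {
    return sizeof(NativeView) == sizeof(AssumedView)
        && offsetof(NativeView, dirty) == offsetof(AssumedView, dirty)
        && offsetof(NativeView, count) == offsetof(AssumedView, count);
}

// Number of bytes by which the larger view extends past the smaller object.
constexpr std::size_t overrunBytes() {
    return sizeof(AssumedView) > sizeof(NativeView)
        ? sizeof(AssumedView) - sizeof(NativeView)
        : 0;
}
```

On a typical compiler `layoutsAgree()` is false and `overrunBytes()` is greater than zero. Code that writes through the larger layout into a heap object allocated with the smaller one touches memory that object never owned, which is the kind of mistake that only shows up later, for example when the heap is walked at shutdown or during a garbage collection.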

Although there are several ways to solve this situation, I chose the easiest one: replacing the “bool” type with an “int” type in the class declaration.
We rebuilt the module and voilà! The client has not had any more random freezes or crashes since then.
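
The change can be sketched roughly as follows. The class name, its members, and the helper functions are hypothetical, since the real class declaration is not shown here; the `static_assert` lines are an extra guard I am adding for illustration, and they assume a platform where `int` is 4 bytes.

```cpp
#include <cstddef>      // offsetof
#include <cstdint>      // std::int32_t
#include <type_traits>  // std::is_standard_layout

// Hypothetical shared class after the fix: the flag is stored as an int
// so both sides see a field of the same size and alignment.
struct SharedState {
    int flag;            // was: bool flag;
    std::int32_t value;

    // Small helpers so callers can keep treating the field as a boolean.
    bool isSet() const { return flag != 0; }
    void set(bool on) { flag = on ? 1 : 0; }
};

// Compile-time guards (an addition for illustration): the build fails if the
// layout ever drifts from what both sides are assumed to expect.
static_assert(std::is_standard_layout<SharedState>::value,
              "SharedState must keep a predictable layout");
static_assert(offsetof(SharedState, value) == sizeof(int),
              "no padding expected between flag and value");
static_assert(sizeof(SharedState) == sizeof(int) + sizeof(std::int32_t),
              "no trailing padding expected");
```

Guards like these turn a silent layout disagreement into a build error, which is much cheaper to diagnose than a memory dump from a client workstation.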

Happy debugging!