Disclaimer
This post is not for the faint of heart. This post assumes you know the C language (especially the Microsoft implementation), C compilers (especially Visual C++), what intrinsic functions are, what SEH is and how it works, and how to read x86 assembly. I could spend entire blog posts, no, entire blogs, just explaining to you what the words alone mean, so I'll just go ahead and assume you know who you are.
… baby Jesus?
Implementing open source intrinsics for commercial compilers is a horrible liability, and a thankless job: almost nobody will even notice you have done anything, and half of those who do notice will accuse you of copyright violation. Since my work will not only be used by ReactOS, but also by projects like mingw-w64 whose reputation has not been compromised yet, I have to be extra careful with it.
For this reason, and because I believe I'm not alone in enjoying this kind of crazy shit, I have decided to blog to document my thought processes while writing obscure, low-level code. I hope at least some of you will enjoy it.
Why do you do it, anyway?
Strictly speaking, this work is not required. For the purposes of compiling ReactOS with Visual C++, we could simply link the original Microsoft implementations.
However, I collaborate with the mingw-w64 project as well, which aims to create a Windows version of gcc that's as close as possible to both the GNU and the Win32/Win64 platforms. Being able to link code (e.g. static libraries) compiled with Microsoft tools would be a nice plus, and to achieve that, the runtime library that comes with mingw-w64 has to provide all the Microsoft compiler intrinsics.
Additionally, I'm very nervous about linking code in ReactOS executables that comes from outside our source tree, for several reasons (for one, we have no guarantee about the compiler options that were used to compile the third-party code, and this has come to bite us in the ass before). This seems to be a common sentiment among operating system developers, as I've seen more than one tutorial on how to provide missing compiler intrinsics.
Finally, writing compiler intrinsics is work I enjoy a lot (I am the author of PSEH). I have no idea why.
And now for the main course
We are all familiar with the code that Visual C++ generates for a function containing SEH, as it has not changed in more than ten years. Given C code like this:
void a(void)
{
__try { }
__except(1) { }
}
the compiler will emit assembler like this, with the recognizable and well-known prologs and epilogs (which I marked in bold):
_TEXT SEGMENT
__$SEHRec$ = -24
_a PROC
push ebp
mov ebp, esp
push -1
push OFFSET __sehtable$_a
push OFFSET __except_handler3
mov eax, DWORD PTR fs:0
push eax
mov DWORD PTR fs:0, esp
sub esp, 8
push ebx
push esi
push edi
mov DWORD PTR __$SEHRec$[ebp], esp
mov DWORD PTR __$SEHRec$[ebp+20], 0
jmp SHORT $LN9@a
$LN5@a:
mov eax, 1
$LN7@a:
ret 0
$LN6@a:
mov esp, DWORD PTR __$SEHRec$[ebp]
$LN9@a:
mov DWORD PTR __$SEHRec$[ebp+20], -1
mov ecx, DWORD PTR __$SEHRec$[ebp+8]
mov DWORD PTR fs:0, ecx
pop edi
pop esi
pop ebx
mov esp, ebp
pop ebp
ret 0
_a ENDP
_TEXT ENDS
But, like many things, this changed with the release of Visual Studio .NET, which was a huge leap forward from its predecessor Visual Studio 6.0, and in many ways a departure from the old ways.
Specifically, since Visual Studio .NET, when the Microsoft compiler optimizes for code size (option /O1 on the command line), rather than speed (/O2), it will emit assembler like this, instead:
_TEXT SEGMENT
__$SEHRec$ = -24
_a PROC
push 8
push OFFSET __sehtable$_a
call __SEH_prolog
and DWORD PTR __$SEHRec$[ebp+20], 0
jmp SHORT $LN9@a
$LN5@a:
xor eax, eax
inc eax
$LN7@a:
ret 0
$LN6@a:
mov esp, DWORD PTR __$SEHRec$[ebp]
$LN9@a:
or DWORD PTR __$SEHRec$[ebp+20], -1
call __SEH_epilog
ret 0
_a ENDP
_TEXT ENDS
The prolog code was collapsed into a call to _SEH_prolog, and the epilog code into a call to _SEH_epilog.
It's easy to see why: a fixed per-function overhead of 54 bytes is turned into a fixed per-function overhead of 11 bytes (to the benefit of code size) by moving the invariant code to a library function (to the detriment of code speed, because function calls are expensive).
Apparently, nobody documents what _SEH_prolog and _SEH_epilog do, so I had to find out on my own. As soon as I discovered how to reliably make the compiler emit the inline code or the calls for the same code (/O2 vs /O1 on the command line, it turned out), though, what each function did was self-evident. If we look at the assembly listings above, we'll see that they are virtually identical, save for the bolded sections, therefore the functions used in the second listing are completely equivalent to the prolog/epilog code of the first listing. Since the functions have only one implementation each, the inline code we are looking at is the only possible prolog/epilog. In other words, the only requirement for _SEH_prolog is that, if it's invoked like this:
push 8
push OFFSET __sehtable$_func
call __SEH_prolog
it will have the same effect as the following code:
push ebp
mov ebp, esp
push -1
push OFFSET __sehtable$_func
push OFFSET __except_handler3
mov eax, DWORD PTR fs:0
push eax
mov DWORD PTR fs:0, esp
sub esp, 8
push ebx
push esi
push edi
mov DWORD PTR __$SEHRec$[ebp], esp
and the only requirement for _SEH_epilog is that, when invoked, it will have the same effect as the following code:
mov ecx, DWORD PTR __$SEHRec$[ebp+8]
mov DWORD PTR fs:0, ecx
pop edi
pop esi
pop ebx
mov esp, ebp
pop ebp
We can see that _SEH_prolog and _SEH_epilog cannot be regular functions, because they modify the stack of the caller: _SEH_prolog allocates stack space, saves registers and sets up a SEH frame, and _SEH_epilog undoes it. Since the stack pointer when the functions exit is at the "wrong" height (_SEH_prolog allocates stack space, _SEH_epilog deallocates), we can also reasonably assume that the return address is extracted from the stack into a register. We can also reasonably assume that the functions end with a ret instruction, so they don't screw too much with the CPU's branch predictor (which expects each call to be followed by a ret, not by a jmp that sometimes happens to go to the right place), so their last instructions have to look a lot like this:
; put the return address on the stack
push register
; return to the return address
ret
The stack tricks alone are a guarantee that the functions cannot be written in C, and must be written in raw x86 assembler.
After this discovery, I decided to take a closer look at how, exactly, the prolog code sets up the stack (and, by extension, what the epilog code would have to do to undo it).
Stack layout of a SEH-using function
By "manually" executing the instructions of the prolog, one by one, it's easy to see how the stack is being set up, and which special locations on the stack will be pointed to by which special registers:
-00000004: original ebp ; ebp points here
-00000008: -1
-0000000c: OFFSET __sehtable$_func
-00000010: OFFSET __except_handler3
-00000014: original fs:0 ; fs:0 points here
-00000018: undefined
-0000001C: new esp ; (ebp+__$SEHRec$)
-00000020: ebx
-00000024: esi
-00000028: edi ; esp points here
(instead of actual, absolute stack addresses, I'll use offsets from the initial value)
The 8 in the sub esp, 8 instruction is the size of the space between stack locations -00000014 and -0000001C. We can easily verify that this number is actually 8+0, where 0 is the combined size of the stack-allocated variables.
Summing up, for a function with N bytes of local variables, the final layout will be:
-00000004: original ebp
-00000008: -1
-0000000c: OFFSET __sehtable$_func
-00000010: OFFSET __except_handler3
-00000014: original fs:0
-00000018: undefined
-0000001c: new esp ; (ebp+__$SEHRec$)
-00000020: variable -4 ; (ebp+__$SEHRec$) -4
-00000024: variable -8 ; (ebp+__$SEHRec$) -8
... variable ... ; (ebp+__$SEHRec$) ...
-00000020 - N: ebx
-00000024 - N: esi
-00000028 - N: edi ; esp points here
The _SEH_prolog is therefore defined as the function that can turn the stack from its initial state to the final layout, and set special registers ebp and fs:0 (actually a special memory location, but a software register in practice) to the expected values, and _SEH_epilog as the function that can turn the final layout back into the initial state, and reset the special registers ebx, esi, edi, ebp and fs:0 to their initial values.
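To make the fixed part of this layout concrete, here is a minimal C sketch that models it as a struct, starting at the slot that __$SEHRec$ points to (the variable-sized part and the saved ebx, esi and edi sit below it and are left out). The struct, its field names and the helper macro are made up for illustration and are not taken from any Microsoft header; the only facts used are the offsets visible in the listings above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the fixed part of the frame, lowest address first. */
typedef struct seh_fixed_frame {
    uint32_t saved_esp;     /* ebp-24: "new esp", i.e. __$SEHRec$[ebp]        */
    uint32_t undefined;     /* ebp-20: not written by the prolog              */
    uint32_t previous_fs0;  /* ebp-16: original fs:0; fs:0 points here        */
    uint32_t handler;       /* ebp-12: OFFSET __except_handler3               */
    uint32_t scope_table;   /* ebp-8:  OFFSET __sehtable$_func                */
    uint32_t try_level;     /* ebp-4:  -1 on entry, 0 inside the __try block  */
    uint32_t saved_ebp;     /* ebp+0:  original ebp; ebp points here          */
} seh_fixed_frame;

/* Same value as "__$SEHRec$ = -24" in the compiler listings. */
enum { SEHREC = -24 };

/* Offset of a field relative to ebp. */
#define EBP_OFFSET(field) ((int)offsetof(seh_fixed_frame, field) + SEHREC)

/* Returns 1 when the model agrees with the offsets used in the listings. */
int seh_fixed_frame_matches_listing(void)
{
    return EBP_OFFSET(saved_esp) == -24
        && EBP_OFFSET(previous_fs0) == -16
        && EBP_OFFSET(try_level) == -4
        && EBP_OFFSET(saved_ebp) == 0
        && offsetof(seh_fixed_frame, previous_fs0) == 8   /* __$SEHRec$[ebp+8]  */
        && offsetof(seh_fixed_frame, try_level) == 20     /* __$SEHRec$[ebp+20] */
        && sizeof(seh_fixed_frame) == 28;
}

/* Convenience wrapper for use in a test or at program start. */
void check_seh_fixed_frame(void)
{
    assert(seh_fixed_frame_matches_listing());
}
```

The two offsetof checks correspond to the __$SEHRec$[ebp+8] and __$SEHRec$[ebp+20] operands in the compiler output, which is a quick way to convince yourself that the table and the generated code describe the same frame.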
Implementing _SEH_prolog
We have to write the _SEH_prolog function so that it can create the final layout depicted above from an initial layout of:
-00000004: 8 + N
-00000008: OFFSET __sehtable$_func
-0000000c: return address
Additionally, _SEH_prolog has to do so without disturbing the calling function's execution:
- it must return to the instruction after the call __SEH_prolog;
- it must not overwrite the ecx register (which contains the this pointer in __thiscall functions, and the first argument in __fastcall functions) nor the edx register (which contains the second argument in __fastcall functions);
- although I cannot prove this, it's most probably expected to not overwrite the ebx, esi or edi registers.
These limitations only leave the eax register free. We will try our best to only ever use eax.
The current contents of the stack are all in the wrong place, but they are all information that we will need later, so we leave them alone. The next word on the stack is the address of __except_handler3, and the word after that is the current contents of fs:0. We can simply push them on the stack:
__SEH_prolog PROC
push OFFSET __except_handler3
mov eax, DWORD PTR fs:0
push eax
...
__SEH_prolog ENDP
At this point, the stack would be at the correct height to store its pointer in fs:0, but we can't do it yet: we cannot push a SEH frame before we have finished initializing it. We hold our horses, and move on to the next stack location.
The next stack location is the variable-sized part. We know the size of this part, because it's on the stack, passed as an argument to _SEH_prolog. All we need to do is to get it, and lower the stack by that amount. The current layout of the stack is:
-00000004: 8 + N
-00000008: OFFSET __sehtable$_func
-0000000c: return address
-00000010: OFFSET __except_handler3
-00000014: original fs:0 ; esp points here
so the argument is at [esp+16]. We cannot lower the stack yet, though, because then we wouldn't have a nice fixed offset to access the second argument (the pointer to __sehtable$_func) and the return address. Sounds like as good a time as any to initialize the ebp register. We'll copy the argument from offset -00000004 to a register, save the current value of ebp to offset -00000004, and then set the new, final value of ebp:
__SEH_prolog PROC
push OFFSET __except_handler3
mov eax, DWORD PTR fs:0
push eax
mov eax, [esp+16]
mov [esp+16], ebp
lea ebp, [esp+16]
...
__SEH_prolog ENDP
With a pointer to a well-known location of the stack, we don't need to worry anymore about the variable-sized part, which we can now simply allocate:
__SEH_prolog PROC
push OFFSET __except_handler3
mov eax, DWORD PTR fs:0
push eax
mov eax, [esp+16]
mov [esp+16], ebp
lea ebp, [esp+16]
sub esp, eax
...
__SEH_prolog ENDP
The stack is looking a little better now:
-00000004: original ebp
-00000008: OFFSET __sehtable$_func
-0000000c: return address
-00000010: OFFSET __except_handler3
-00000014: original fs:0
-00000018: undefined
-0000001c: new esp
-00000020: ...
We have to move the return address to eax, and the pointer to __sehtable$_func to stack offset -0000000c. Unfortunately, we cannot do two overlapping moves with a single register, so we'll store the return address on top of the stack instead, where it belongs anyway. To do so, we must first push the ebx, esi and edi registers, because _func expects them to be on the top of the stack when _SEH_prolog returns; we will then store the current value of esp in [ebp-24], since the stack is now at the expected height:
__SEH_prolog PROC
push OFFSET __except_handler3
mov eax, DWORD PTR fs:0
push eax
mov eax, [esp+16]
mov [esp+16], ebp
lea ebp, [esp+16]
sub esp, eax
push ebx
push esi
push edi
mov [ebp-24], esp
...
__SEH_prolog ENDP
We can then move the return address from -0000000c ([ebp-8]) to the top of the stack, then move __sehtable$_func from -00000008 ([ebp-4]) to -0000000c ([ebp-8]), and finally set -00000008 ([ebp-4]) to constant value -1:
__SEH_prolog PROC
push OFFSET __except_handler3
mov eax, DWORD PTR fs:0
push eax
mov eax, [esp+16]
mov [esp+16], ebp
lea ebp, [esp+16]
sub esp, eax
push ebx
push esi
push edi
mov [ebp-24], esp
mov eax, [ebp-8]
push eax
mov eax, [ebp-4]
mov [ebp-8], eax
mov DWORD PTR [ebp-4], -1
...
__SEH_prolog ENDP
The stack is now ready, with an extra word on top containing the return address. Initialization is done, and we can, at last, set fs:0 to the new SEH frame, starting at -00000014 ([ebp-16]), and return to the caller:
__SEH_prolog PROC
push OFFSET __except_handler3
mov eax, DWORD PTR fs:0
push eax
mov eax, [esp+16]
mov [esp+16], ebp
lea ebp, [esp+16]
sub esp, eax
push ebx
push esi
push edi
mov [ebp-24], esp
mov eax, [ebp-8]
push eax
mov eax, [ebp-4]
mov [ebp-8], eax
mov DWORD PTR [ebp-4], -1
lea eax, [ebp-16]
mov DWORD PTR fs:0, eax
ret
__SEH_prolog ENDP
And with this, _SEH_prolog is done.
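If you don't trust stack diagrams drawn by hand, the walkthrough can be replayed on a toy machine. What follows is a minimal C sketch, not a real implementation: the machine struct, the helper functions and the constants HANDLER, SEHTABLE and RETADDR are hypothetical stand-ins invented for this example. It executes one C statement per instruction of _SEH_prolog, with esp saved to [ebp-24] after the three register pushes (as the inline prolog does), and then compares the result against the final layout from the previous section.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy machine: a small stack plus the registers the prolog touches.
   Addresses are byte offsets into mem; BASE is the initial esp. */
enum { STACK_BYTES = 256, BASE = STACK_BYTES };
enum { HANDLER = 0x1000, SEHTABLE = 0x2000, RETADDR = 0x3000 }; /* made-up values */

typedef struct machine {
    uint32_t mem[STACK_BYTES / 4];
    uint32_t eax, ebx, esi, edi, ebp, esp, fs0;
} machine;

static uint32_t load(const machine *m, uint32_t addr) { return m->mem[addr / 4]; }
static void store(machine *m, uint32_t addr, uint32_t v) { m->mem[addr / 4] = v; }
static void push(machine *m, uint32_t v) { m->esp -= 4; store(m, m->esp, v); }
static uint32_t pop(machine *m) { uint32_t v = load(m, m->esp); m->esp += 4; return v; }

/* Model of _SEH_prolog; returns the address that "ret" jumps to. */
static uint32_t seh_prolog_model(machine *m)
{
    push(m, HANDLER);                  /* push OFFSET __except_handler3 */
    m->eax = m->fs0;                   /* mov eax, DWORD PTR fs:0 */
    push(m, m->eax);                   /* push eax */
    m->eax = load(m, m->esp + 16);     /* mov eax, [esp+16] */
    store(m, m->esp + 16, m->ebp);     /* mov [esp+16], ebp */
    m->ebp = m->esp + 16;              /* lea ebp, [esp+16] */
    m->esp -= m->eax;                  /* sub esp, eax */
    push(m, m->ebx);                   /* push ebx */
    push(m, m->esi);                   /* push esi */
    push(m, m->edi);                   /* push edi */
    store(m, m->ebp - 24, m->esp);     /* mov [ebp-24], esp */
    m->eax = load(m, m->ebp - 8);      /* mov eax, [ebp-8] */
    push(m, m->eax);                   /* push eax */
    m->eax = load(m, m->ebp - 4);      /* mov eax, [ebp-4] */
    store(m, m->ebp - 8, m->eax);      /* mov [ebp-8], eax */
    store(m, m->ebp - 4, 0xFFFFFFFFu); /* mov DWORD PTR [ebp-4], -1 */
    m->eax = m->ebp - 16;              /* lea eax, [ebp-16] */
    m->fs0 = m->eax;                   /* mov DWORD PTR fs:0, eax */
    return pop(m);                     /* ret */
}

/* Runs "push 8+N / push table / call" and checks the final layout.
   Returns 1 on success. locals must be a small multiple of 4. */
int prolog_model_builds_layout(uint32_t locals)
{
    machine m;
    if (locals % 4 != 0 || locals > 128) return 0;
    memset(&m, 0, sizeof m);
    m.esp = BASE; m.ebp = 0xEB9; m.ebx = 0xB0; m.esi = 0x51; m.edi = 0xD1; m.fs0 = 0xF50;

    push(&m, 8 + locals);              /* push 8 + N */
    push(&m, SEHTABLE);                /* push OFFSET __sehtable$_func */
    push(&m, RETADDR);                 /* the call pushes the return address */
    if (seh_prolog_model(&m) != RETADDR) return 0;

    return m.ebp == BASE - 0x04 && load(&m, BASE - 0x04) == 0xEB9
        && load(&m, BASE - 0x08) == 0xFFFFFFFFu
        && load(&m, BASE - 0x0c) == SEHTABLE
        && load(&m, BASE - 0x10) == HANDLER
        && load(&m, BASE - 0x14) == 0xF50 && m.fs0 == BASE - 0x14
        && load(&m, BASE - 0x1c) == m.esp
        && load(&m, BASE - 0x20 - locals) == 0xB0
        && load(&m, BASE - 0x24 - locals) == 0x51
        && load(&m, BASE - 0x28 - locals) == 0xD1
        && m.esp == BASE - 0x28 - locals
        && m.ebx == 0xB0 && m.esi == 0x51 && m.edi == 0xD1;
}
```

Calling prolog_model_builds_layout(0) replays the empty function from the beginning of the post; a non-zero argument such as 16 exercises the variable-sized part. The model only checks stack and register bookkeeping. It says nothing about assembler syntax or about how the real exception handler consumes the frame.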
Implementing _SEH_epilog
The _SEH_epilog function has restrictions similar to those of _SEH_prolog, and then more:
- it must return to the instruction after the call __SEH_epilog;
- it cannot use the eax or edx registers, because they contain the return value of the function;
- it must restore the initial values of ebx, esi, edi, ebp and fs:0.
As we limited ourselves to using eax in _SEH_prolog, we'll limit ourselves to ecx in _SEH_epilog. We know ecx is safe to overwrite in the epilog (even in __fastcall and __thiscall functions) because the inline epilog code generated by the compiler does so.
The stack layout on entering _SEH_epilog is:
-00000004: original ebp ; ebp points here
-00000008: -1
-0000000c: OFFSET __sehtable$_func
-00000010: OFFSET __except_handler3
-00000014: original fs:0 ; fs:0 points here
-00000018: undefined
-0000001C: new esp
-00000020: ebx
-00000024: esi
-00000028: edi
-0000002c: return address ; esp points here
Following the example of the inline code, we restore the original value of fs:0 first:
__SEH_epilog PROC
mov ecx, DWORD PTR [ebp-16]
mov DWORD PTR fs:0, ecx
...
__SEH_epilog ENDP
Then we pop the return address off the stack, so we can restore ebx, esi and edi:
__SEH_epilog PROC
mov ecx, DWORD PTR [ebp-16]
mov DWORD PTR fs:0, ecx
pop ecx
pop edi
pop esi
pop ebx
...
__SEH_epilog ENDP
The next operation the inline code does is restoring esp and ebp, so let's do it too:
__SEH_epilog PROC
mov ecx, DWORD PTR [ebp-16]
mov DWORD PTR fs:0, ecx
pop ecx
pop edi
pop esi
pop ebx
mov esp, ebp
pop ebp
...
__SEH_epilog ENDP
Finally, let's put the return address back on the stack, and then return to it:
__SEH_epilog PROC
mov ecx, DWORD PTR [ebp-16]
mov DWORD PTR fs:0, ecx
pop ecx
pop edi
pop esi
pop ebx
mov esp, ebp
pop ebp
push ecx
ret 0
__SEH_epilog ENDP
And we're done. Any questions?
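The epilog can be checked the same way. The sketch below is again only a toy model in C under made-up assumptions: the machine struct, the helpers and the constants are hypothetical. It first builds the frame with the inline prolog from the compiler listing, trashes the callee-saved registers as a function body might, simulates the call to _SEH_epilog, and then verifies that everything is back where it started.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy machine: a small stack plus the registers the epilog touches. */
enum { STACK_BYTES = 256, BASE = STACK_BYTES };
enum { HANDLER = 0x1000, SEHTABLE = 0x2000, RETADDR = 0x3000 }; /* made-up values */

typedef struct machine {
    uint32_t mem[STACK_BYTES / 4];
    uint32_t ecx, ebx, esi, edi, ebp, esp, fs0;
} machine;

static uint32_t load(const machine *m, uint32_t addr) { return m->mem[addr / 4]; }
static void store(machine *m, uint32_t addr, uint32_t v) { m->mem[addr / 4] = v; }
static void push(machine *m, uint32_t v) { m->esp -= 4; store(m, m->esp, v); }
static uint32_t pop(machine *m) { uint32_t v = load(m, m->esp); m->esp += 4; return v; }

/* The inline prolog from the /O2 listing, used here only to build the frame. */
static void inline_prolog_model(machine *m, uint32_t locals)
{
    push(m, m->ebp);                /* push ebp */
    m->ebp = m->esp;                /* mov ebp, esp */
    push(m, 0xFFFFFFFFu);           /* push -1 */
    push(m, SEHTABLE);              /* push OFFSET __sehtable$_func */
    push(m, HANDLER);               /* push OFFSET __except_handler3 */
    push(m, m->fs0);                /* mov eax, fs:0 / push eax */
    m->fs0 = m->esp;                /* mov DWORD PTR fs:0, esp */
    m->esp -= 8 + locals;           /* sub esp, 8 + N */
    push(m, m->ebx);                /* push ebx */
    push(m, m->esi);                /* push esi */
    push(m, m->edi);                /* push edi */
    store(m, m->ebp - 24, m->esp);  /* mov [ebp-24], esp */
}

/* Model of _SEH_epilog; returns the address that "ret" jumps to. */
static uint32_t seh_epilog_model(machine *m)
{
    m->ecx = load(m, m->ebp - 16);  /* mov ecx, DWORD PTR [ebp-16] */
    m->fs0 = m->ecx;                /* mov DWORD PTR fs:0, ecx */
    m->ecx = pop(m);                /* pop ecx (return address) */
    m->edi = pop(m);                /* pop edi */
    m->esi = pop(m);                /* pop esi */
    m->ebx = pop(m);                /* pop ebx */
    m->esp = m->ebp;                /* mov esp, ebp */
    m->ebp = pop(m);                /* pop ebp */
    push(m, m->ecx);                /* push ecx */
    return pop(m);                  /* ret */
}

/* Returns 1 if the epilog restores the initial state. locals must be a
   small multiple of 4. */
int epilog_model_restores_state(uint32_t locals)
{
    machine m;
    if (locals % 4 != 0 || locals > 128) return 0;
    memset(&m, 0, sizeof m);
    m.esp = BASE; m.ebp = 0xEB9; m.ebx = 0xB0; m.esi = 0x51; m.edi = 0xD1; m.fs0 = 0xF50;

    inline_prolog_model(&m, locals);
    m.ebx = m.esi = m.edi = 0xDEAD; /* the function body clobbers them */
    push(&m, RETADDR);              /* call __SEH_epilog */
    if (seh_epilog_model(&m) != RETADDR) return 0;

    return m.esp == BASE && m.ebp == 0xEB9 && m.fs0 == 0xF50
        && m.ebx == 0xB0 && m.esi == 0x51 && m.edi == 0xD1;
}
```

Note that the three pops work for any amount of local variables, because the saved registers always sit directly above the return address, no matter how large the variable-sized part is.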
Q&A
What about functions that have more than 4096 bytes of local variables?
If a function has more than 4096 bytes of local variables (4068, actually — the hidden 28 bytes are the SEH frame minus the saved ebx, esi and edi), then the compiler will emit inline prologs/epilogs; the prolog will use _chkstk instead of sub esp.
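The arithmetic behind the 4068 figure can be spelled out as a tiny C sketch. The names are hypothetical; the 4096 limit is taken from the answer above as-is, and the seven slots are the ones from the stack layout section (saved ebp, the -1 slot, the table pointer, the handler pointer, the old fs:0, the undefined slot and the saved esp).

```c
#include <assert.h>

enum {
    SLOT_BYTES  = 4,    /* every slot in the layout is one 32-bit word  */
    FIXED_SLOTS = 7,    /* the SEH frame without saved ebx, esi and edi */
    LIMIT_BYTES = 4096  /* limit quoted in the text                     */
};

/* Bytes taken by the fixed part of the frame. */
unsigned fixed_frame_bytes(void)
{
    return FIXED_SLOTS * SLOT_BYTES;
}

/* Largest amount of local variables that still fits under the limit. */
unsigned max_locals_under_limit(void)
{
    return LIMIT_BYTES - fixed_frame_bytes();
}
```

Seven slots of four bytes each give the hidden 28 bytes, and 4096 minus 28 gives 4068.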
What about code compiled with buffer security checks (/GS)?
Functions compiled with buffer security checks have different prologs/epilogs; the external implementations are called _SEH_prolog4 and _SEH_epilog4, and maybe we'll see them in a future episode.
Conclusions
In this first installment of "Inside the mind of a ReactOS developer" we have seen what the _SEH_prolog and _SEH_epilog functions are, what they do and how to implement them without looking at so much as one byte of copyrighted Microsoft code.
This blog post is the first public source on the internet (that I know of) to describe the aforementioned functions in detail, and if there is interest, I will write more like this.
Comments are welcome.
That's all.