Why GCC and Windows don't play nicely together when using SSE

GCC, SSE and why you may want to use them

The SSE instruction set was introduced with the Pentium III processor in 1999 and is now widely supported. Compilers can auto-vectorize code, so often all that's required to take advantage of SSE is a compiler switch (if it's not already enabled by default). SSE typically gives faster floating point performance, and it can be useful when you want single-precision arithmetic without higher-precision intermediate results.

GCC is available under Windows with MinGW. It's a good compiler and has the advantage of making cross-platform development straightforward - it's available for all major platforms and you just need a single Makefile (with some minor platform-specific tweaks) to build the same code across Windows, Mac OS X and Linux (and possibly others). You don't need to keep multiple project workspaces (e.g. Visual Studio, Xcode etc.) in sync just to do a build, and the operations and settings are all clear and unambiguous (in my experience Xcode is particularly prone to hiding what it does, even when you might want to know about it).

However, there is a pitfall that may trip you up!

SSE, the stack, Windows and GCC

SSE and alignment

Some SSE instructions need the data to be aligned to certain boundaries in memory (e.g. begin on a 16-byte boundary). For the casual SSE user (like me), writing no assembly code and relying on the compiler working out auto-vectorization where it can, this should not be of any significance - the compiler should sort out these details. Indeed, it does - but not quite. Sometimes there is a problem.

SSE data on the stack

Some data used by SSE instructions is going to be put on the stack. The alignment (or not) of this is something that the compiler should take care of. Here is where there is a problem with the interaction between GCC and Windows.

Windows and GCC stack alignment conventions

32-bit Windows only keeps the stack aligned on a 4-byte boundary. GCC prefers to keep the stack aligned to a 16-byte boundary, so it fixes up whatever it gets from Windows (or any other operating system) when it first gets control in main() (or WinMain()). It then assumes the stack remains 16-byte aligned throughout the rest of the program, which lets function entry and exit be more efficient. I think the Microsoft compiler fixes up the stack to a 16-byte boundary on entry to every function that uses SSE instructions, so this problem does not occur there.

This strategy has a flaw though - when Windows calls back into your code (for example WndProc(), or the function you pass to CreateThread()), your program gets control without GCC knowing it needs to fix up the stack to a 16-byte boundary. GCC assumes the stack is already 16-byte aligned, but it isn't, so your program may crash when it reaches an SSE instruction that requires 16-byte aligned data. For me, this was a movaps instruction, and as I don't know much about assembly, working out what was going wrong took me ages! Here's the debugger output:

Program received signal SIGSEGV, Segmentation fault.
Renderer::Initialise (clear=..., text=...) at ../../../Renderer.cpp:659
659             GLfloat globalAmbient[4] = {0.0f, 0.0f, 0.0f, 1.0f};

(gdb) disassemble
Dump of assembler code for function Renderer::Initialise(Colour const&, Colour const&):
   0x0041c630 <+0>:     push   %ebp
   0x0041c631 <+1>:     push   %edi
   0x0041c632 <+2>:     push   %esi
   0x0041c633 <+3>:     push   %ebx
   0x0041c634 <+4>:     sub    $0x6c,%esp
   0x0041c637 <+7>:     cmpb   $0x0,0x50a0e0
   0x0041c63e <+14>:    movaps 0x4d96c0,%xmm0
   0x0041c645 <+21>:    mov    0x80(%esp),%esi
   0x0041c64c <+28>:    mov    0x84(%esp),%ebx
=> 0x0041c653 <+35>:    movaps %xmm0,0x50(%esp)
   0x0041c658 <+40>:    je     0x41cbe0 <Renderer::Initialise(Colour const&, Colour const&)+1456>
   0x0041c65e <+46>:    movl   $0x4bc820,(%esp)
   0x0041c665 <+53>:    lea    0x4c(%esp),%edi
   0x0041c669 <+57>:    movb   $0x0,0x50a0e0
   0x0041c670 <+64>:    call   0x425bd0 <Output(char const*, ...)>

(gdb) info register esp
esp            0x22f73c 0x22f73c

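GDB shows x86 assembly in AT&T syntax by default, where instructions have the form 'operation source, destination'. So the faulting movaps is trying to store xmm0 at the address esp + 0x50. The problem here is that 0x22f73c + 0x50 = 0x22f78c is not 16-byte aligned (a 16-byte aligned address ends in 0 when written in hex).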

How to fix the problem

There are two ways you can fix this problem.

  1. Tell GCC to assume the stack is not 16-byte aligned and fix it up on entry to every function. To do this, pass the -mstackrealign flag to GCC.
  2. Tell GCC to fix up the stack on each of the possible entry points. To do this, mark every entry point function in your code with __attribute__((force_align_arg_pointer)).

The second option is more efficient and in my case was easy - there were only 3 entry points in my code: WinMain(), WndProc() and the function I pass to CreateThread().


I hope this is useful to someone - I've found lots of useful information on the Internet that people have taken the time to write, and I decided that this was something I could do here.

A discussion that I found that helped me understand the problem and the solution.
A nice tutorial covering the basics of using GDB.