A better explanation of the 32-bit x86 calling convention
---------------------------------------------------------

In gcc's cdecl, the stack looks like this on entry to a function,
immediately after the "call" instruction:

BP?-->	
	.
	.
	arg n
	.
	.
	arg 1
	arg 0
SP-->	return address
  

The standard function prologue pushes EBP, copies ESP into EBP, saves
other registers as needed, then lowers ESP to make room for local
variables and for the arguments of any functions it will call.  If no
saved registers are needed, the stack looks like:

	.
	.
	arg n
	.
	.
	arg 0
	return address
BP-->	saved BP
	var 0
	var 1
	.
	.
	var n
	subarg n
	.
	.
SP-->	subarg 0
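
The two pictures above can be replayed on a toy stack.  The sketch below
is a portable C model, not real machine code: the struct "toy_cpu", the
helpers "push32", "toy_call", "toy_prologue", "toy_arg" and "toy_var",
and the use of slot indices in place of addresses are all inventions for
illustration.  It assumes 4-byte stack slots and no saved registers, as
in the diagram.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 64

/* Toy machine: a downward-growing stack of 32-bit slots.
   sp and bp are slot indices, not real addresses. */
struct toy_cpu {
    uint32_t mem[STACK_WORDS];
    int sp;   /* slot that ESP would point at */
    int bp;   /* slot that EBP would point at */
};

void toy_init(struct toy_cpu *c)
{
    c->sp = STACK_WORDS;   /* empty stack: SP just past the last slot */
    c->bp = STACK_WORDS;
}

void push32(struct toy_cpu *c, uint32_t v)
{
    assert(c->sp > 0);     /* toy stack overflow check */
    c->mem[--c->sp] = v;
}

/* Caller side: push the arguments last-to-first, then the return
   address (the part that the "call" instruction does). */
void toy_call(struct toy_cpu *c, const uint32_t *args, int nargs,
              uint32_t ret_addr)
{
    for (int i = nargs - 1; i >= 0; i--)
        push32(c, args[i]);
    push32(c, ret_addr);
}

/* Callee prologue: push ebp; mov ebp, esp; sub esp, 4*nlocals */
void toy_prologue(struct toy_cpu *c, int nlocals)
{
    push32(c, (uint32_t)c->bp);
    c->bp = c->sp;
    assert(nlocals >= 0 && c->sp >= nlocals);
    c->sp -= nlocals;
}

/* arg k sits at [ebp + 8 + 4k]: above the saved BP and return address. */
uint32_t toy_arg(const struct toy_cpu *c, int k)
{
    return c->mem[c->bp + 2 + k];
}

/* var k sits at [ebp - 4 - 4k], just below the saved BP. */
uint32_t *toy_var(struct toy_cpu *c, int k)
{
    return &c->mem[c->bp - 1 - k];
}
```

After toy_call() with three arguments and toy_prologue() with two
locals, toy_arg(c, 0) reads the slot two above BP and SP ends up two
slots below BP, matching the second picture.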

Functions returning 32-bit results (int, long, pointer) do so in EAX, so
that register is not preserved across a function call.  (short) and
(char) results are returned in AX and AL, the low 16 and low 8 bits of
EAX.  (long long) is returned in EDX:EAX, with the high half in EDX.
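
As a sketch of how those return registers overlap, the following
portable C models EAX and EDX as plain 32-bit integers.  The struct
"ret_regs" and the function names are made up for this note; no real
registers are touched.

```c
#include <stdint.h>

/* Hypothetical model of the two integer return registers. */
struct ret_regs {
    uint32_t eax;
    uint32_t edx;
};

/* A (long long) result is split: low half in EAX, high half in EDX. */
struct ret_regs return_int64(int64_t v)
{
    struct ret_regs r;
    r.eax = (uint32_t)((uint64_t)v & 0xffffffffu);
    r.edx = (uint32_t)((uint64_t)v >> 32);
    return r;
}

/* The caller reassembles the value from EDX:EAX. */
int64_t read_int64(struct ret_regs r)
{
    return (int64_t)(((uint64_t)r.edx << 32) | r.eax);
}

/* AX and AL are simply the low 16 and low 8 bits of EAX. */
uint16_t read_ax(struct ret_regs r) { return (uint16_t)(r.eax & 0xffffu); }
uint8_t  read_al(struct ret_regs r) { return (uint8_t)(r.eax & 0xffu); }
```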

ECX and EDX are also scratch registers.  On the other hand, EBX, ESI and
EDI are supposed to be preserved across function calls.  If any are
needed, the compiler chooses them in that order and pushes them in
reverse order during the prologue.

At the end of the function (or as soon as the stack frame is not needed,
under optimization), the compiler resets the stack by moving EBP to ESP
and popping the saved EBP into EBP.  All the "ret" instruction does is
pop the return address into EIP.  Under cdecl the caller then removes
the arguments it pushed.

So far, nothing described requires EBP: the function epilogue could add
a constant to ESP rather than copy EBP into it.  That would even save
one clock cycle on a 486; however, its effect on a superscalar chip is
uncertain.  Keeping the frame pointer does make it easier to "walk up"
the stack in a debugger.
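
The "walking up" can be sketched on a toy memory image.  This is a
model, not debugger code: the function "walk_frames" is hypothetical,
"addresses" are indices into an array of 32-bit slots, and a saved BP of
0 is assumed to mark the outermost frame (real startup code varies).

```c
#include <stddef.h>
#include <stdint.h>

/* Walk a chain of frames in a toy memory image.
   At each frame: mem[bp] = caller's saved BP, mem[bp + 1] = return address.
   A saved BP of 0 ends the chain (an assumption of this sketch).
   Stores up to max return addresses in out[] and returns the count. */
size_t walk_frames(const uint32_t *mem, size_t mem_len, uint32_t bp,
                   uint32_t *out, size_t max)
{
    size_t n = 0;
    while (bp != 0 && (size_t)bp + 1 < mem_len && n < max) {
        uint32_t next = mem[bp];
        out[n++] = mem[bp + 1];
        /* Callers live at higher addresses; anything else is a corrupt chain. */
        if (next != 0 && next <= bp)
            break;
        bp = next;
    }
    return n;
}
```

Without a frame pointer there is no such linked list to follow, and a
debugger has to rely on other information to find each caller.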

Decoupling SP/BP allows implementing the "alloca" function.  All the
compiler needs to do is decrease SP by the requested size, then return
a pointer to the bottom of the newly reserved space (which lies above
SP if the area for outgoing arguments is kept at SP).  A similar
operation can be used to allocate a local array whose size is known
only at runtime; gcc supports this construct, but actually generates
more code for it than for alloca, even with '-O2' optimization.
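
For comparison, here are the two constructs side by side.  This is a
minimal sketch that assumes a glibc-style <alloca.h> and a compiler with
C99 variable-length arrays (both true of gcc on Linux); the function
names are mine.

```c
#include <alloca.h>   /* not ISO C; provided by glibc */
#include <stddef.h>
#include <string.h>

/* Reverse s in place via a scratch buffer obtained with alloca.
   The buffer disappears when the function returns. */
void reverse_with_alloca(char *s)
{
    size_t n = strlen(s);
    char *tmp = alloca(n + 1);
    for (size_t i = 0; i < n; i++)
        tmp[i] = s[n - 1 - i];
    tmp[n] = '\0';
    memcpy(s, tmp, n + 1);
}

/* The same job with a local array whose size is known only at runtime. */
void reverse_with_vla(char *s)
{
    size_t n = strlen(s);
    char tmp[n + 1];
    for (size_t i = 0; i < n; i++)
        tmp[i] = s[n - 1 - i];
    tmp[n] = '\0';
    memcpy(s, tmp, n + 1);
}
```

Compiling both with "gcc -m32 -S" and reading the assembly is one way
to compare the code generated for each.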

Float and double arguments are passed on the stack.  A float or double
return value comes back in ST(0), the top of the x87 floating-point
register stack.

To recap, as a real example, here is the stack before the call to
"printf" but after "fgets" in the little program "fgd.c":

					     
0xbf89f420:  0x00000000      0x442d0cc0      0xbf89f498      0x0804845b
               ........	       ........	      .(old BP).     .(ret addr).

0xbf89f430:  0xbf89f440      0x00000050      0x0804a008      0x00000000 
               arg0=buf        arg1=80         arg2=f         (padding)

		"#inc"		"lude"		" <st"		"dio."
0xbf89f440:  0x636e6923      0x6564756c      0x74733c20      0x2e6f6964
	       ssssssss	       ssssssss	       ssssssss	       ssssssss

	     "h>",LF,NUL 
0xbf89f450:  0x000a3e68      0x444065f4      0xbf89f468      0x0804838e
	       zzssssss	       ........	       ........        ........
.
.  [48 more bytes of unknown data]
.		  		  
0xbf89f490:  0x00000000      0x44407ff4      0xbf89f4e8      0x442e8eb0
	       (padding)       (old BX)        (old BP)        (ret addr)

0xbf89f4a0:  0x00000001      0xbf89f514      0xbf89f51c      0x00000001
               (argc)          (argv)          (env)           ????????
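
The quoted strings above the buffer words come from reading each 32-bit
word least-significant byte first, since x86 is little-endian.  A small
sketch of that decoding (the function name "words_to_bytes" is mine):

```c
#include <stddef.h>
#include <stdint.h>

/* Unpack 32-bit words the way they sit in little-endian memory:
   the least significant byte of each word comes first.
   out must have room for 4 * nwords bytes. */
void words_to_bytes(const uint32_t *words, size_t nwords, unsigned char *out)
{
    for (size_t i = 0; i < nwords; i++)
        for (int b = 0; b < 4; b++)
            out[4 * i + b] = (unsigned char)((words[i] >> (8 * b)) & 0xffu);
}
```

Feeding it the five words starting at 0xbf89f440 yields the bytes of
"#include <stdio.h>", a line feed and the terminating NUL.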

While this particular example was compiled using gcc and executed
under gdb, the calling convention I have described is the usual C
convention on the i386 and later chips (on Unix-like systems it is
specified by the System V i386 ABI rather than by the processor
itself).  It is based on the calling convention of the 8086; the
primary difference is that the stack word size is 32 bits, and some
operations are faster when done on 8-byte or even 16-byte boundaries,
hence the (optional) padding.
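
The padding amounts to rounding a size up to the next multiple of the
chosen boundary.  A minimal sketch of that arithmetic follows; the
helper "align_up" is hypothetical, and gcc's actual rules for how much
padding to insert are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of align (a power of two). */
size_t align_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}
```

For instance, a 20-byte object rounded to an 8-byte boundary takes 24
bytes, while an 80-byte buffer already fits a 16-byte boundary exactly.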

--
David Lee Lambert <davidl@lmert.com>
12Oct'06