(preface - boring stuff, feel free to skip down to the implementation details)
I need to add exception handling to a language I am working on. It "compiles" to a subset of C, and since I don't want to make C++ a dependency, and found the few available C libraries rather stiff and lacking, the only way forward is to come up with an implementation of my own.
Naturally, I'd like it to be as efficient as possible. Exception handling schemes always seem to come with overhead: in some cases drastic performance hits, in others (almost) zero cost, at least when executing the non-throwing path, but even then there is overhead in terms of code size, memory usage and CPU time when exceptions are actually thrown.
Going down to the assembly level to take control of the stack is not an option either: although I could cook up a basic compiler that targets assembly for a platform or two, I am in no capacity to produce a professional-grade compiler, with all its optimizations and such, that can target as many platforms as, say, GCC. So I am stuck with C.
Which is not necessarily a bad thing. How would one go about handling errors in C? Check return values, check array bounds, check pointers before use, check for 0 before division - the good old tried and true ways.
But not necessarily convenient ones. Exceptions are intended to be more "coarse-grained", covering deep trees of operations that may fail in different configurations, which makes it an arduous task (to put it mildly) to propagate such a naive error checking scheme through all the code.
Implementation details
Luckily, the language I am working on has advanced meta capabilities, one of which is context-aware code that can modify itself and generate extra code. This means that tasks which would otherwise be considered impossibly tedious to do manually can be automated fairly easily. Thus my exception handling strategy begins to take shape.
Functions and operators that may fail are marked by a failSafe specifier, and as such are given an implicit parameter - essentially an integer, where 0 means all is good and nonzero means something went wrong. Internally, they do the good old fine-grained error checking and use the integer to signal the error, which is the actual exception throwing.
Functions can have both an unsafe and a fail-safe flavor. Functions that do not fail themselves, but contain invocations of functions that may fail, are also implicitly fail-safe.
A try block essentially creates that error code integer and forces the compiler to "roll out" all the code nested in the try block, switching to the overloads that propagate the integer through every stack frame containing code which may fail. Every explicitly fail-safe function gets a check inserted after its invocation; in case of an error, the caller does not resume but returns, hitting another such check all the way down to the try block, essentially unwinding the stack and collecting all the locals along the way.
It may seem that all the checks which reveal no error could be omitted in favor of an "escape" code path, entered on the first error and normally skipped on success, but saving the state needed to make all those jumps will likely cost more than skipping the checks of successful operations would save, on top of the extra complexity.
Another useful feature that is fairly easy to implement is the ability to filter what kinds of errors you want protection against. A try block may be set to protect only against specific errors likely or known to occur in that scenario, reducing the overhead from error checking and code duplication.
A 32-bit integer ought to be efficient enough to pass around, while still providing an ample 4+ billion error codes. Negative values are reserved for language and library use, positive ones for user exceptions. The exception's error string can be retrieved from a global enum which accumulates the available exception types transparently for the user - cryptic integer values are not a concern, since both the error and the exception itself are available as readable text. There is also no need for any extra allocations for the exception. Destructors are forbidden from throwing exceptions so as not to interfere with the stack unwinding. Lastly, nested exceptions are possible, in the sense that there can be nested try blocks, directly or indirectly, and exceptions which are not caught propagate down through the try blocks until the root try block. The application does not "crash" if an exception is not handled; the operation attempted in the root try block simply fails and program execution continues.
Unfortunately, it will inevitably lead to some code duplication, having both unsafe and exception-safe overloads, although I am not sufficiently aware of the details of "zero cost" implementations to really compare the amount of code they generate. The unwinding of the stack itself seems like it will be a little more efficient, since it does not involve loading any exception handling data, progress lookups or extra jumps - just call destructors, if any, and return to the previous stack frame. It requires no extras such as a particular well-defined ABI, no exception frames/tables/decoding, no PC lookups or any of the complexity typically associated with exception handling implementations, and it requires no compiler support (from the C compiler, that is, not from the compiler for my language). It works as if you had actually gone and painstakingly propagated error checking over the whole call chain.
A couple of things to consider in the context of CPU time overhead:
My compiler does a lot of inlining; practically all trivial calls, such as accessors and operators, are eliminated, since they cost more to execute as function calls than inline. On top of that, the C compiler is free to do even more inlining and optimization. Over 90% of the calls are eliminated (stats from the library code that accompanies the language) even before the code reaches the C compiler. This means a lot of the error code propagation is eliminated as well, and the variable will spend most of its time in a register, requiring only a single clock cycle per check.
Error handling is typically used around resource creation and allocation, involving high-latency operations such as memory allocation or random memory access, which cost a dozen to dozens of cycles each. I expect this to mask and mitigate the overhead somewhat.
Having outlined the design, here is what I am interested in:
- is it a sound design, does it make sense?
- am I missing something important?
- is there room for improvement?
- are there any downsides, flaws or limitations?
- how does it compare to existing implementations in regards to memory and CPU time overheads and capabilities (no concrete numbers naturally)?
Since several people mentioned profiling, it is testing time, using GCC 4.9.2 with -O3:
int datacount, callcount, repeatcount, limit;
int * data = 0;
void foo1();
void foo2();
void foo3();
void foo4();
bool bar1();
bool bar2();
bool bar3();
bool bar4();
void __attribute__ ((noinline)) foo1() {
++callcount;
for (int i = 0; i < datacount; ++i) data[i] += 1;
if ((callcount < (limit / 10)) && !(callcount % 3)) foo2();
if ((callcount < (limit / 5)) && !(callcount % 2)) foo3();
if (callcount < limit) foo4();
return;
}
bool __attribute__ ((noinline)) bar1() {
++callcount;
for (int i = 0; i < datacount; ++i) data[i] += 1;
if ((callcount < (limit / 10)) && !(callcount % 3)) if (!bar2()) return false;
if ((callcount < (limit / 5)) && !(callcount % 2)) if (!bar3()) return false;
if (callcount < limit) if (!bar4()) return false;
return callcount;
}
void __attribute__ ((noinline)) foo2() {
++callcount;
for (int i = 0; i < datacount; ++i) data[i] -= 1;
if ((callcount < (limit / 10)) && !(callcount % 3)) foo3();
if ((callcount < (limit / 5)) && !(callcount % 2)) foo4();
if (callcount < limit) foo1();
return;
}
bool __attribute__ ((noinline)) bar2() {
++callcount;
for (int i = 0; i < datacount; ++i) data[i] -= 1;
if ((callcount < (limit / 10)) && !(callcount % 3)) if (!bar3()) return false;
if ((callcount < (limit / 5)) && !(callcount % 2)) if (!bar4()) return false;
if (callcount < limit) if (!bar1()) return false;
return callcount;
}
void __attribute__ ((noinline)) foo3() {
++callcount;
for (int i = 0; i < datacount; ++i) data[i] *= 2;
if ((callcount < (limit / 10)) && !(callcount % 3)) foo4();
if ((callcount < (limit / 5)) && !(callcount % 2)) foo1();
if (callcount < limit) foo2();
return;
}
bool __attribute__ ((noinline)) bar3() {
++callcount;
for (int i = 0; i < datacount; ++i) data[i] *= 2;
if ((callcount < (limit / 10)) && !(callcount % 3)) if (!bar4()) return false;
if ((callcount < (limit / 5)) && !(callcount % 2)) if (!bar1()) return false;
if (callcount < limit) if (!bar2()) return false;
return callcount;
}
void __attribute__ ((noinline)) foo4() {
++callcount;
for (int i = 0; i < datacount; ++i) data[i] /= 2;
if ((callcount < (limit / 10)) && !(callcount % 3)) foo1();
if ((callcount < (limit / 5)) && !(callcount % 2)) foo2();
if (callcount < limit) foo3();
return;
}
bool __attribute__ ((noinline)) bar4() {
++callcount;
for (int i = 0; i < datacount; ++i) data[i] /= 2;
if ((callcount < (limit / 10)) && !(callcount % 3)) if (!bar1()) return false;
if ((callcount < (limit / 5)) && !(callcount % 2)) if (!bar2()) return false;
if (callcount < limit) if (!bar3()) return false;
return callcount;
}
Generated assembly:
foo4():
subq $8, %rsp
movl callcount(%rip), %eax
leal 1(%rax), %ecx
movl datacount(%rip), %eax
movl %ecx, callcount(%rip)
testl %eax, %eax
jle .L2
movq data(%rip), %rdx
xorl %ecx, %ecx
.L3:
movl (%rdx), %eax
addl $1, %ecx
addq $4, %rdx
movl %eax, %esi
shrl $31, %esi
addl %esi, %eax
sarl %eax
movl %eax, -4(%rdx)
cmpl %ecx, datacount(%rip)
jg .L3
movl callcount(%rip), %ecx
.L2:
movl limit(%rip), %esi
movl $1717986919, %edx
movl %esi, %eax
imull %edx
movl %esi, %eax
sarl $31, %eax
sarl $2, %edx
subl %eax, %edx
cmpl %ecx, %edx
jle .L4
movl %ecx, %eax
movl $1431655766, %edx
imull %edx
movl %ecx, %eax
sarl $31, %eax
subl %eax, %edx
leal (%rdx,%rdx,2), %eax
cmpl %eax, %ecx
je .L10
.L4:
movl %esi, %eax
movl $1717986919, %edx
imull %edx
movl %esi, %eax
sarl $31, %eax
sarl %edx
subl %eax, %edx
cmpl %ecx, %edx
jle .L5
testb $1, %cl
je .L11
.L5:
cmpl %esi, %ecx
jl .L12
addq $8, %rsp
ret
.L12:
addq $8, %rsp
jmp foo3()
.L11:
call foo2()
movl callcount(%rip), %ecx
movl limit(%rip), %esi
jmp .L5
.L10:
call foo1()
movl limit(%rip), %esi
movl callcount(%rip), %ecx
jmp .L4
foo1():
subq $8, %rsp
movl callcount(%rip), %eax
leal 1(%rax), %ecx
movl datacount(%rip), %eax
movl %ecx, callcount(%rip)
testl %eax, %eax
jle .L14
movq data(%rip), %rax
xorl %edx, %edx
.L15:
addl $1, (%rax)
addl $1, %edx
addq $4, %rax
cmpl %edx, datacount(%rip)
jg .L15
movl callcount(%rip), %ecx
.L14:
movl limit(%rip), %esi
movl $1717986919, %edx
movl %esi, %eax
imull %edx
movl %esi, %eax
sarl $31, %eax
sarl $2, %edx
subl %eax, %edx
cmpl %ecx, %edx
jle .L16
movl %ecx, %eax
movl $1431655766, %edx
imull %edx
movl %ecx, %eax
sarl $31, %eax
subl %eax, %edx
leal (%rdx,%rdx,2), %eax
cmpl %eax, %ecx
je .L21
.L16:
movl %esi, %eax
movl $1717986919, %edx
imull %edx
movl %esi, %eax
sarl $31, %eax
sarl %edx
subl %eax, %edx
cmpl %ecx, %edx
jle .L17
testb $1, %cl
je .L22
.L17:
cmpl %esi, %ecx
jl .L23
addq $8, %rsp
ret
.L23:
addq $8, %rsp
jmp foo4()
.L22:
call foo3()
movl callcount(%rip), %ecx
movl limit(%rip), %esi
jmp .L17
.L21:
call foo2()
movl limit(%rip), %esi
movl callcount(%rip), %ecx
jmp .L16
foo2():
subq $8, %rsp
movl callcount(%rip), %eax
leal 1(%rax), %ecx
movl datacount(%rip), %eax
movl %ecx, callcount(%rip)
testl %eax, %eax
jle .L25
movq data(%rip), %rax
xorl %edx, %edx
.L26:
subl $1, (%rax)
addl $1, %edx
addq $4, %rax
cmpl %edx, datacount(%rip)
jg .L26
movl callcount(%rip), %ecx
.L25:
movl limit(%rip), %esi
movl $1717986919, %edx
movl %esi, %eax
imull %edx
movl %esi, %eax
sarl $31, %eax
sarl $2, %edx
subl %eax, %edx
cmpl %ecx, %edx
jle .L27
movl %ecx, %eax
movl $1431655766, %edx
imull %edx
movl %ecx, %eax
sarl $31, %eax
subl %eax, %edx
leal (%rdx,%rdx,2), %eax
cmpl %eax, %ecx
je .L32
.L27:
movl %esi, %eax
movl $1717986919, %edx
imull %edx
movl %esi, %eax
sarl $31, %eax
sarl %edx
subl %eax, %edx
cmpl %ecx, %edx
jle .L28
testb $1, %cl
je .L33
.L28:
cmpl %esi, %ecx
jl .L34
addq $8, %rsp
ret
.L34:
addq $8, %rsp
jmp foo1()
.L33:
call foo4()
movl callcount(%rip), %ecx
movl limit(%rip), %esi
jmp .L28
.L32:
call foo3()
movl limit(%rip), %esi
movl callcount(%rip), %ecx
jmp .L27
foo3():
subq $8, %rsp
movl callcount(%rip), %eax
leal 1(%rax), %ecx
movl datacount(%rip), %eax
movl %ecx, callcount(%rip)
testl %eax, %eax
jle .L36
movq data(%rip), %rax
xorl %edx, %edx
.L37:
sall (%rax)
addl $1, %edx
addq $4, %rax
cmpl %edx, datacount(%rip)
jg .L37
movl callcount(%rip), %ecx
.L36:
movl limit(%rip), %esi
movl $1717986919, %edx
movl %esi, %eax
imull %edx
movl %esi, %eax
sarl $31, %eax
sarl $2, %edx
subl %eax, %edx
cmpl %ecx, %edx
jle .L38
movl %ecx, %eax
movl $1431655766, %edx
imull %edx
movl %ecx, %eax
sarl $31, %eax
subl %eax, %edx
leal (%rdx,%rdx,2), %eax
cmpl %eax, %ecx
je .L43
.L38:
movl %esi, %eax
movl $1717986919, %edx
imull %edx
movl %esi, %eax
sarl $31, %eax
sarl %edx
subl %eax, %edx
cmpl %ecx, %edx
jle .L39
testb $1, %cl
je .L44
.L39:
cmpl %esi, %ecx
jl .L45
addq $8, %rsp
ret
.L45:
addq $8, %rsp
jmp foo2()
.L44:
call foo1()
movl callcount(%rip), %ecx
movl limit(%rip), %esi
jmp .L39
.L43:
call foo4()
movl limit(%rip), %esi
movl callcount(%rip), %ecx
jmp .L38
bar4():
subq $8, %rsp
movl callcount(%rip), %eax
leal 1(%rax), %ecx
movl datacount(%rip), %eax
movl %ecx, callcount(%rip)
testl %eax, %eax
jle .L47
movq data(%rip), %rdx
xorl %ecx, %ecx
.L48:
movl (%rdx), %eax
addl $1, %ecx
addq $4, %rdx
movl %eax, %esi
shrl $31, %esi
addl %esi, %eax
sarl %eax
movl %eax, -4(%rdx)
cmpl %ecx, datacount(%rip)
jg .L48
movl callcount(%rip), %ecx
.L47:
movl limit(%rip), %esi
movl $1717986919, %edx
movl %esi, %eax
imull %edx
movl %esi, %eax
sarl $31, %eax
sarl $2, %edx
subl %eax, %edx
cmpl %ecx, %edx
jle .L49
movl %ecx, %eax
movl $1431655766, %edx
imull %edx
movl %ecx, %eax
sarl $31, %eax
subl %eax, %edx
leal (%rdx,%rdx,2), %eax
cmpl %eax, %ecx
je .L63
.L49:
movl %esi, %eax
movl $1717986919, %edx
imull %edx
movl %esi, %eax
sarl $31, %eax
sarl %edx
subl %eax, %edx
cmpl %ecx, %edx
jg .L64
.L52:
cmpl %esi, %ecx
jl .L65
.L54:
testl %ecx, %ecx
setne %al
addq $8, %rsp
ret
.L64:
testb $1, %cl
jne .L52
call bar2()
testb %al, %al
je .L53
movl callcount(%rip), %ecx
movl limit(%rip), %esi
jmp .L52
.L65:
call bar3()
testb %al, %al
je .L53
movl callcount(%rip), %ecx
jmp .L54
.L63:
call bar1()
testb %al, %al
movl limit(%rip), %esi
movl callcount(%rip), %ecx
jne .L49
.L53:
xorl %eax, %eax
addq $8, %rsp
ret
bar1():
subq $8, %rsp
movl callcount(%rip), %eax
leal 1(%rax), %ecx
movl datacount(%rip), %eax
movl %ecx, callcount(%rip)
testl %eax, %eax
jle .L67
movq data(%rip), %rax
xorl %edx, %edx
.L68:
addl $1, (%rax)
addl $1, %edx
addq $4, %rax
cmpl %edx, datacount(%rip)
jg .L68
movl callcount(%rip), %ecx
.L67:
movl limit(%rip), %esi
movl $1717986919, %edx
movl %esi, %eax
imull %edx
movl %esi, %eax
sarl $31, %eax
sarl $2, %edx
subl %eax, %edx
cmpl %ecx, %edx
jle .L69
movl %ecx, %eax
movl $1431655766, %edx
imull %edx
movl %ecx, %eax
sarl $31, %eax
subl %eax, %edx
leal (%rdx,%rdx,2), %eax
cmpl %eax, %ecx
je .L83
.L69:
movl %esi, %eax
movl $1717986919, %edx
imull %edx
movl %esi, %eax
sarl $31, %eax
sarl %edx
subl %eax, %edx
cmpl %ecx, %edx
jg .L84
.L72:
cmpl %esi, %ecx
jl .L85
.L74:
testl %ecx, %ecx
setne %al
addq $8, %rsp
ret
.L84:
testb $1, %cl
jne .L72
call bar3()
testb %al, %al
je .L73
movl callcount(%rip), %ecx
movl limit(%rip), %esi
jmp .L72
.L85:
call bar4()
testb %al, %al
je .L73
movl callcount(%rip), %ecx
jmp .L74
.L83:
call bar2()
testb %al, %al
movl limit(%rip), %esi
movl callcount(%rip), %ecx
jne .L69
.L73:
xorl %eax, %eax
addq $8, %rsp
ret
bar2():
subq $8, %rsp
movl callcount(%rip), %eax
leal 1(%rax), %ecx
movl datacount(%rip), %eax
movl %ecx, callcount(%rip)
testl %eax, %eax
jle .L87
movq data(%rip), %rax
xorl %edx, %edx
.L88:
subl $1, (%rax)
addl $1, %edx
addq $4, %rax
cmpl %edx, datacount(%rip)
jg .L88
movl callcount(%rip), %ecx
.L87:
movl limit(%rip), %esi
movl $1717986919, %edx
movl %esi, %eax
imull %edx
movl %esi, %eax
sarl $31, %eax
sarl $2, %edx
subl %eax, %edx
cmpl %ecx, %edx
jle .L89
movl %ecx, %eax
movl $1431655766, %edx
imull %edx
movl %ecx, %eax
sarl $31, %eax
subl %eax, %edx
leal (%rdx,%rdx,2), %eax
cmpl %eax, %ecx
je .L103
.L89:
movl %esi, %eax
movl $1717986919, %edx
imull %edx
movl %esi, %eax
sarl $31, %eax
sarl %edx
subl %eax, %edx
cmpl %ecx, %edx
jg .L104
.L92:
cmpl %esi, %ecx
jl .L105
.L94:
testl %ecx, %ecx
setne %al
addq $8, %rsp
ret
.L104:
testb $1, %cl
jne .L92
call bar4()
testb %al, %al
je .L93
movl callcount(%rip), %ecx
movl limit(%rip), %esi
jmp .L92
.L105:
call bar1()
testb %al, %al
je .L93
movl callcount(%rip), %ecx
jmp .L94
.L103:
call bar3()
testb %al, %al
movl limit(%rip), %esi
movl callcount(%rip), %ecx
jne .L89
.L93:
xorl %eax, %eax
addq $8, %rsp
ret
bar3():
subq $8, %rsp
movl callcount(%rip), %eax
leal 1(%rax), %ecx
movl datacount(%rip), %eax
movl %ecx, callcount(%rip)
testl %eax, %eax
jle .L107
movq data(%rip), %rax
xorl %edx, %edx
.L108:
sall (%rax)
addl $1, %edx
addq $4, %rax
cmpl %edx, datacount(%rip)
jg .L108
movl callcount(%rip), %ecx
.L107:
movl limit(%rip), %esi
movl $1717986919, %edx
movl %esi, %eax
imull %edx
movl %esi, %eax
sarl $31, %eax
sarl $2, %edx
subl %eax, %edx
cmpl %ecx, %edx
jle .L109
movl %ecx, %eax
movl $1431655766, %edx
imull %edx
movl %ecx, %eax
sarl $31, %eax
subl %eax, %edx
leal (%rdx,%rdx,2), %eax
cmpl %eax, %ecx
je .L123
.L109:
movl %esi, %eax
movl $1717986919, %edx
imull %edx
movl %esi, %eax
sarl $31, %eax
sarl %edx
subl %eax, %edx
cmpl %ecx, %edx
jg .L124
.L112:
cmpl %esi, %ecx
jl .L125
.L114:
testl %ecx, %ecx
setne %al
addq $8, %rsp
ret
.L124:
testb $1, %cl
jne .L112
call bar1()
testb %al, %al
je .L113
movl callcount(%rip), %ecx
movl limit(%rip), %esi
jmp .L112
.L125:
call bar2()
testb %al, %al
je .L113
movl callcount(%rip), %ecx
jmp .L114
.L123:
call bar4()
testb %al, %al
movl limit(%rip), %esi
movl callcount(%rip), %ecx
jne .L109
.L113:
xorl %eax, %eax
addq $8, %rsp
ret
I ran the test against an input parameter data set loaded from a file, to avoid any potential optimizations. The test ran 5 times in a row, for a total run time of about 38 minutes.
Initially foo is considerably faster - by 13.1% - but soon enough both equalize, with a peak of 13.5% in favor of bar in the last quarter of the test. All in all, over the duration of the entire test, bar actually takes the lead with an advantage of 3.36%.
I made a bunch of charts, hoping to find some correlation between the input parameters and the results, but so far I don't seem to detect any.
The following set of charts shows the relation between the result and the input parameters for every test run:
Total calls are complementary to the data size, since I aimed for fairly uniform running times. The time does fluctuate, though, from a little under 1 second to over 25 seconds.
Here is another set of charts, in which the test runs and results are sorted from best for foo to best for bar, overlaid with the test input parameters. Again, no correlation between parameters and results is visible to me:
Performance vs. test run time, which for some reason I forgot about earlier:
Finally, this mess: all of the charts from the last set laid over one another without any scale, in the hope of revealing a correlation: