Based on the train of thought from the previous article, I wrote two code snippets and put them into Compiler Explorer, compiled with O3, to observe the generated machine code.
I tried my best. Originally wanted to do side-by-side code blocks for comparison, but unfortunately lacking HTML knowledge, couldn’t manage it. Just bear with the text view.
Source Code
pub fn recurse_without_result(arr: &[u8], start: usize, end: usize) -> usize{
if end <= start { return 0; }
if arr[start] == arr[end]{
recurse_without_result(arr, start+1, end-1)
}
else{
std::cmp::min(
recurse_without_result(arr, start+1, end),
recurse_without_result(arr, start, end-1)
) + 1
}
}
pub fn recurse_with_result(arr: &[u8], start: usize, end: usize, ret: usize) -> usize{
if end <= start { return ret; }
if arr[start] == arr[end]{
recurse_with_result(arr, start+1, end-1, ret)
}
else{
std::cmp::min(
recurse_with_result(arr, start+1, end, ret + 1),
recurse_with_result(arr, start, end-1, ret + 1)
)
}
}
To simplify the code, this time the array is not using Generics, but directly using u8, which is a byte array.
Analysis of recurse_without_result
Machine Code
Now let’s analyze line by line the first one, without the return value.
First Section (base case section)
Corresponding to if end <= start { return 0; }
example::recurse_without_result:
; Linux calling convention - Save Callee Saved Registers
; Store these registers in the stack
pushq %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
pushq %rax
;%rcx = end
;%rdx = start
;In this assembly, you can understand it as if %rcx <= %rdx { goto .LBB0_1; } corresponding to the first line in the source code
cmpq %rdx, %rcx
jbe .LBB0_1 ; Jump to .LBB0_1, will reset %r14 to zero and return %r14, as seen in the code block below
;Initialize the registers
movq %rcx, %rbx ; %rbx = end
; Rust stores Slice as a fat pointer, saving the length and a pointer to the first item
movq %rsi, %r15 ; %r15 = arr.len
movq %rdi, %r12 ; %r12 = &arr
xorl %r14d, %r14d ; %r14 = 0
movq %rdx, %r13 ; %r13 = start
The jump target .LBB0_1
.LBB0_1:
xorl %r14d, %r14d ; %r14 = 0
jmp .LBB0_7 ; Jump to .LBB0_7, this block of code is for function return, explained below, for now just understand it as returning 0
Second Section (tail recursion section)
Corresponding to if arr[start] == arr[end]{ return recurse_without_result(arr, start+1, end-1); }
.LBB0_3:
cmpq %r15, %r13
jae .LBB0_9
; If %r13 >= %r15 { goto .LBB0_9; } This section actually jumps to panic code, which Rust uses to protect array index within its length range (preventing Index Out Of Range Error)
; Just a note, in Rust, unsafe code can optimize out these bound checks
cmpq %r15, %rbx
jae .LBB0_10
; If %rbx > %r15 { goto .LBB0_10; } Also jumps to panic, similar to the previous one, indicates array slice bound checking, as we need to access these two array indexes
movzbl (%r12, %r13), %ecx ; %ecx = arr[%r13]
leaq 1(%r13), %rax ; %rax = 1 + %r13
cmpb (%r12, %rbx), %cl ; %cl is the same register as %ecx, holding the same byte
jne .LBB0_6
; If %ecx != arr[%rbx] { goto .LBB0_6; } Below you can see, it jumps to the recursive code
addq $-1, %rbx ; %rbx -= 1
movq %rax, %r13 ; %r13 = %rax
cmpq %rax, %rbx
ja .LBB0_3 ; If %rbx > %rax { goto .LBB0_3; }
jmp .LBB0_7 ; Jump to .LBB0_7, go to function return code, returns the value in current %r15
You can see, .LBB0_3 implements a while loop, with the condition %rbx > %rax
%rbx decrements by 1 each loop, %rax increments by 1 each loop, corresponding to the end and start in recursion
It’s evident that the compiler has optimized a recursion into a tail recursion, turning it into a loop. So why is this done? Remember the Linux Calling Convention at the beginning; each function call causes a large number of registers to be stored in the stack. By optimizing tail recursion, the stack remains unchanged from start to finish, avoiding additional space consumption and eliminating high time costs of reading data from memory.
Third Section (recursion section)
Corresponding to else{ return std::cmp::min( recurse_without_result(arr, start+1, end-1), recurse_without_result(arr, start, end-1) ) + 1;
.LBB0_6:
movq example::recurse_without_result@GOTPCREL(%rip), %rbp
; Save the pointer to the recursive function, similar to a function pointer in C, during the final dynamic linking, it will be converted to the start of this function's code
; The next 4 lines are the arguments to be passed to the recursive functions, think of it as passing the arguments
movq %r12, %rdi ; %rdi = %r12 &arr
movq %r15, %rsi ; %rsi = %r15 arr.len
movq %rax, %rdx ; %rdx = %rax - %r13 stored the start value, before jumping to .LBB0_6 %rax = %r13 + 1, so this section corresponds to start+1
movq %rbx, %rcx ; %rcx = %rbx - corresponds to end-1
callq *%rbp
; Recursion, the passed parameters are current %rsi, %rdi (information of Rust Slice "arr" that won't change no matter how many times it's passed), and current %rcx, %rdx, corresponding to end, start
; After the function returns, according to the Linux Calling Convention, the return value will be in %rax, but since the value of %rax will change in the next recursion, it needs to be saved first
movq %rax, %r14 ; %r14 = %rax
; Prepare the parameters for the next recursion
movq %r12, %rdi ; %rdi = %r12 &arr
movq %r15, %rsi ; %rsi = %r15 arr.len
movq %r13, %rdx ; %rdx = %r13 - the start value stored here
addq $-1, %rbx ; %rbx -= 1
movq %rbx, %rcx ; %rcx = %rbx - corresponds to end-1
callq *%rbp ; Recursion
cmpq %rax, %r14 ; Compare the return value of the second recursion (%rax) with that of the first recursion (%r14)
cmovaq %rax, %r14 ; If %r14 > %rax { %r14 = %rax; }
; Note the above is a function %r14 = min(%r14, %rax)
addq $1, %r14 ; %r14 += 1
Some may question, aren’t %rsi and %rdi supposed to be constant and not change? Why store them temporarily in %r12 and %r14 every recursion?
Good question, I think the compiler didn’t optimize this out, actually it could be optimized away.
Fourth Section (return value section)
.LBB0_7:
; If it came from the base case section, then %r14 is 0, set 0 to %rax as the return value
; If it came from the tail recursion section(.LBB0_3)
; If it came from the section above, then %r14 is the minimum of the two recursions plus 1
movq %r14, %rax
; The following lines are restoring the registers and stack values before this function was called
addq $8, %rsp ; The bottom 8 bytes stored the return value %rax before calling this function, you can verify this by looking back at the first code section for confirmation
; Why it was stored in the beginning I'm not clear as of now, maybe someday looking back can make it clear
; Restore all callee saved registers
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
retq ; Return to the line of code that called this function
Fifth Section (Panic section)
This code deals with the error information displayed when a Rust panic occurs
You can see .L__unnamed_* stores the code name and some error info
This section doesn’t require a deep understanding, it’s a unique feature of Rust. But one can guess that frequent operations on %r15 are related to displaying error messages like “Array length is xxx, accessed xxx”, since at this point %r15 holds the previous end value
Remember, if the bound checking was optimized away through unsafe means earlier, this code segment will disappear
Thus, knowing how the code will be compiled, one can optimize the code significantly, reducing it by a whopping 27 lines (this segment has 23 lines, and the previous bound checking has 4 lines), and also slightly boosting performance by having the CPU execute fewer lines of code (though this performance boost is minimal, just imagine if you are accessing arrays many times, sometimes most of the program is just doing bound checking).
C language doesn’t have bound checking, so directly-written code will be faster
This also highlights Rust’s feature; Rust can be fast, but you must understand that to make it fast, you might have to compromise on safety.
Extrapolate - What Exactly is Inlining
At the beginning of learning code, one might wonder if giving only one line of code the subprogram’s function, wouldn’t the program constantly jump around, and allocate a lot of memory on the stack?
In this case, one typically gets the answer, “The compiler will inline them, essentially copying that line directly.”
So, from the assembly interpretation in this article, you should gain a deeper understanding of inlining.
Every function call causes registers to be stored in the stack (the number of registers depends on the function’s complexity, registers not needed by the function won’t be stored in the stack), and inlining is directly integrating the original code into the assembly code.
As seen in this code, the compiler does everything possible to reduce the number of function calls, using tail recursion to optimize function calls into loops, and using conditional move to handle the min() function in just one line of assembly.
Extrapolate Four - Why I Should Use Standard Library as Much as Possible
Imagine if you wanted to implement a min(x,y) function yourself, how would you handle it? I guess it would be something like this:
#[inline]
fn min<T: std::cmp::PartialOrd + Copy>(x: &T, y: &T) -> T{
if x > y { y.clone() } else { x.clone() } //Note the clone here; it's easier to compare by direct pointers in C, but it's more cumbersome in Rust
// But generally, types with the Copy Trait are quite small, like integers, so cloning one costs almost the same as a pointer
// For larger types, you could use a pointer version, but this code segment is just for demonstration, not very rigorous
}
Now, using the standard library directly, it gets inlined into assembly, saving time copying, and allowing more optimization room for the compiler.
For example, in the code of this article, the compiler noticed that we wouldn’t need the larger value afterward, so it compacted it into a single line of conditional move.
Many standard library functions are more optimized than self-implemented ones, so studying a language’s standard library can greatly enhance code efficiency.
Now Let’s Look at Passing Value in Recursion
In a lateral comparison, in recursion with a passed value, an extra parameter is passed at the beginning, saved in the %r14 register
|
|
And the purpose of %r14 is the same as recursion without passing a value, storing the return value
The code resetting %r14 to zero at the beginning is absent, as 0 will be provided in the main program
*The absence of this code eliminates one jump instruction, but this jump is a jmp, a forced jump type; thus it’s predictable and has no impact on performance.
Even the tail recursion section is exactly the same
The only difference is in the recursion call section
.LBB0_5:
addq $1, %r14 ;As both recursions require ret+1, it's directly incremented here
movq example::recurse_with_result@GOTPCREL(%rip), %rbp
;Same passing as before
movq %r12, %rdi
movq %r15, %rsi
movq %rax, %rdx
movq %rbx, %rcx
movq %r14, %r8 ;ret is passed to the next recursion
callq *%rbp ;Recursion
;The following two lines are somewhat mysterious, not sure why this is done
movq %rbp, %r9 ;First store the function pointer from %rbp to %r9
movq %rax, %rbp ;Then store the return value of the previous function in the just evacuated %rbp
;The next 5 lines are just as in recursion without passing, handling parameters for the next recursion
addq $-1, %rbx
movq %r12, %rdi
movq %r15, %rsi
movq %r13, %rdx
movq %rbx, %rcx
movq %r14, %r8 ;An additional step passing the return value
callq *%r9 ;Recursion
;Taking the minimum value
cmpq %rax, %rbp
cmovaq %rax, %rbp
movq %rbp, %r14
;Follows the function return code
Back to Basics - Which Approach is Better
In terms of time, there’s no significant difference; the extra or fewer lines of code hardly make any impact with today’s 4GHz processors, a mere difference of 10 ticks maximum.
So, how much difference in space?
Answer: 0
Because the space holding the ret passes through a register (%r8), in each recursion, %r8 is saved in %r14, and %r14 originally also stores the return value in the case of recursion without passing a value. Therefore, you can see with both methods, right at the beginning, 7 registers are stored in the stack. Space occupied on the stack altogether is 7*8 = 56 bytes. So the total stack space occupied will be (recursion depth-1) * 56 bytes.
Therefore, no substantial difference exists for the two methods, at least not in this context.
Summary
This article compared recursion with and without passing values, concluding that there’s virtually no difference, almost equivalent.
However, in actual programming, not every recursive algorithm can choose whether to pass the return value on the way down or on the way up. For example, some backtracking algorithms must carry parameters down during recursion.
So, don’t take this article as a basis for comparison. The purpose here is to help readers understand how code changes once compiled into assembly language and some common optimization methods.
Appendix - Machine Code Source
|
|