How to Pass Recursive Return Values? - Let Me Check Your Assembly Code!

Based on the train of thought from the previous article, I wrote two code snippets and put them into Compiler Explorer, compiled with O3, to observe the generated machine code.

I tried my best. Originally wanted to do side-by-side code blocks for comparison, but unfortunately lacking HTML knowledge, couldn’t manage it. Just bear with the text view.

Source Code

pub fn recurse_without_result(arr: &[u8], start: usize, end: usize) -> usize{
  if end <= start { return 0; }
  if arr[start] == arr[end]{
      recurse_without_result(arr, start+1, end-1)
  }
  else{
    std::cmp::min(
      recurse_without_result(arr, start+1, end), 
      recurse_without_result(arr, start, end-1)
    ) + 1
  }
}

pub fn recurse_with_result(arr: &[u8], start: usize, end: usize, ret: usize) -> usize{
  if end <= start { return ret; }
  if arr[start] == arr[end]{
      recurse_with_result(arr, start+1, end-1, ret)
  }
  else{
    std::cmp::min(
      recurse_with_result(arr, start+1, end, ret + 1), 
      recurse_with_result(arr, start, end-1, ret + 1)
    )
  }
}

To simplify the code, this time the array is not using Generics, but directly using u8, which is a byte array.

Analysis of `recurse_without_result` Machine Code

Now let’s analyze line by line the first one, without the return value.

First Section (base case section)

Corresponding to if end <= start { return 0; }

example::recurse_without_result:
  ; Linux calling convention - Save Callee Saved Registers
  ; Store these registers in the stack
  pushq   %rbp
  pushq   %r15
  pushq   %r14
  pushq   %r13
  pushq   %r12
  pushq   %rbx
  pushq   %rax
  ;%rcx = end
  ;%rdx = start
  ;In this assembly, you can understand it as if %rcx <= %rdx { goto .LBB0_1; } corresponding to the first line in the source code
  cmpq    %rdx, %rcx
  jbe     .LBB0_1 ; Jump to .LBB0_1, will reset %r14 to zero and return %r14, as seen in the code block below

  ;Initialize the registers
  movq    %rcx, %rbx ; %rbx = end
  ; Rust stores Slice as a fat pointer, saving the length and a pointer to the first item
  movq    %rsi, %r15 ; %r15 = arr.len 
  movq    %rdi, %r12 ; %r12 = &arr
  xorl    %r14d, %r14d ; %r14 = 0
  movq    %rdx, %r13 ; %r13 = start

The jump target .LBB0_1

.LBB0_1:
  xorl    %r14d, %r14d ; %r14 = 0
  jmp     .LBB0_7 ; Jump to .LBB0_7, this block of code is for function return, explained below, for now just understand it as returning 0

Second Section (tail recursion section)

Corresponding to if arr[start] == arr[end]{ return recurse_without_result(arr, start+1, end-1); }

.LBB0_3:
  cmpq    %r15, %r13
  jae     .LBB0_9
  ; If %r13 >= %r15 { goto .LBB0_9; } This section actually jumps to panic code, which Rust uses to protect array index within its length range (preventing Index Out Of Range Error)
  ; Just a note, in Rust, unsafe code can optimize out these bound checks
  cmpq    %r15, %rbx
  jae     .LBB0_10
  ; If %rbx > %r15 { goto .LBB0_10; } Also jumps to panic, similar to the previous one, indicates array slice bound checking, as we need to access these two array indexes
  movzbl  (%r12, %r13), %ecx ; %ecx = arr[%r13]
  leaq    1(%r13), %rax ; %rax = 1 + %r13
  cmpb    (%r12, %rbx), %cl ; %cl is the same register as %ecx, holding the same byte
  jne     .LBB0_6
  ; If %ecx != arr[%rbx] { goto .LBB0_6; } Below you can see, it jumps to the recursive code
  addq    $-1, %rbx ; %rbx -= 1
  movq    %rax, %r13 ; %r13 = %rax
  cmpq    %rax, %rbx
  ja      .LBB0_3 ; If %rbx > %rax { goto .LBB0_3; } 
  jmp     .LBB0_7 ; Jump to .LBB0_7, go to function return code, returns the value in current %r15

You can see, .LBB0_3 implements a while loop, with the condition %rbx > %rax

%rbx decrements by 1 each loop, %rax increments by 1 each loop, corresponding to the end and start in recursion

It’s evident that the compiler has optimized a recursion into a tail recursion, turning it into a loop. So why is this done? Remember the Linux Calling Convention at the beginning; each function call causes a large number of registers to be stored in the stack. By optimizing tail recursion, the stack remains unchanged from start to finish, avoiding additional space consumption and eliminating high time costs of reading data from memory.

Third Section (recursion section)

Corresponding to else{ return std::cmp::min( recurse_without_result(arr, start+1, end-1), recurse_without_result(arr, start, end-1) ) + 1;

.LBB0_6:
  movq    example::recurse_without_result@GOTPCREL(%rip), %rbp
  ; Save the pointer to the recursive function, similar to a function pointer in C, during the final dynamic linking, it will be converted to the start of this function's code
  ; The next 4 lines are the arguments to be passed to the recursive functions, think of it as passing the arguments
  movq    %r12, %rdi ; %rdi = %r12 &arr
  movq    %r15, %rsi ; %rsi = %r15 arr.len
  movq    %rax, %rdx ; %rdx = %rax - %r13 stored the start value, before jumping to .LBB0_6 %rax = %r13 + 1, so this section corresponds to start+1
  movq    %rbx, %rcx ; %rcx = %rbx - corresponds to end-1
  callq   *%rbp
  ; Recursion, the passed parameters are current %rsi, %rdi (information of Rust Slice "arr" that won't change no matter how many times it's passed), and current %rcx, %rdx, corresponding to end, start
  ; After the function returns, according to the Linux Calling Convention, the return value will be in %rax, but since the value of %rax will change in the next recursion, it needs to be saved first
  movq    %rax, %r14 ; %r14 = %rax
  ; Prepare the parameters for the next recursion
  movq    %r12, %rdi ; %rdi = %r12 &arr
  movq    %r15, %rsi ; %rsi = %r15 arr.len
  movq    %r13, %rdx ; %rdx = %r13 - the start value stored here
  addq    $-1, %rbx ; %rbx -= 1
  movq    %rbx, %rcx ; %rcx = %rbx - corresponds to end-1
  callq   *%rbp ; Recursion
  cmpq    %rax, %r14 ; Compare the return value of the second recursion (%rax) with that of the first recursion (%r14)
  cmovaq  %rax, %r14 ; If %r14 > %rax { %r14 = %rax; }
  ; Note the above is a function %r14 = min(%r14, %rax)
  addq    $1, %r14 ; %r14 += 1

Some may question, aren’t %rsi and %rdi supposed to be constant and not change? Why store them temporarily in %r12 and %r14 every recursion?

Good question, I think the compiler didn’t optimize this out, actually it could be optimized away.

Fourth Section (return value section)

.LBB0_7:
  ; If it came from the base case section, then %r14 is 0, set 0 to %rax as the return value
  ; If it came from the tail recursion section(.LBB0_3)
  ; If it came from the section above, then %r14 is the minimum of the two recursions plus 1
  movq    %r14, %rax
  ; The following lines are restoring the registers and stack values before this function was called
  addq    $8, %rsp ; The bottom 8 bytes stored the return value %rax before calling this function, you can verify this by looking back at the first code section for confirmation
  ; Why it was stored in the beginning I'm not clear as of now, maybe someday looking back can make it clear
  ; Restore all callee saved registers
  popq    %rbx
  popq    %r12
  popq    %r13
  popq    %r14
  popq    %r15
  popq    %rbp
  retq ; Return to the line of code that called this function

Fifth Section (Panic section)

This code deals with the error information displayed when a Rust panic occurs

You can see .L__unnamed_* stores the code name and some error info

This section doesn’t require a deep understanding, it’s a unique feature of Rust. But one can guess that frequent operations on %r15 are related to displaying error messages like “Array length is xxx, accessed xxx”, since at this point %r15 holds the previous end value

Remember, if the bound checking was optimized away through unsafe means earlier, this code segment will disappear

Thus, knowing how the code will be compiled, one can optimize the code significantly, reducing it by a whopping 27 lines (this segment has 23 lines, and the previous bound checking has 4 lines), and also slightly boosting performance by having the CPU execute fewer lines of code (though this performance boost is minimal, just imagine if you are accessing arrays many times, sometimes most of the program is just doing bound checking).

C language doesn’t have bound checking, so directly-written code will be faster

This also highlights Rust’s feature; Rust can be fast, but you must understand that to make it fast, you might have to compromise on safety.

Extrapolate - What Exactly is Inlining

At the beginning of learning code, one might wonder if giving only one line of code the subprogram’s function, wouldn’t the program constantly jump around, and allocate a lot of memory on the stack?

In this case, one typically gets the answer, “The compiler will inline them, essentially copying that line directly.”

So, from the assembly interpretation in this article, you should gain a deeper understanding of inlining.

Every function call causes registers to be stored in the stack (the number of registers depends on the function’s complexity, registers not needed by the function won’t be stored in the stack), and inlining is directly integrating the original code into the assembly code.

As seen in this code, the compiler does everything possible to reduce the number of function calls, using tail recursion to optimize function calls into loops, and using conditional move to handle the min() function in just one line of assembly.

Extrapolate Four - Why I Should Use Standard Library as Much as Possible

Imagine if you wanted to implement a min(x,y) function yourself, how would you handle it? I guess it would be something like this:

#[inline]
fn min<T: std::cmp::PartialOrd + Copy>(x: &T, y: &T) -> T{
	if x > y { y.clone() } else { x.clone() } //Note the clone here; it's easier to compare by direct pointers in C, but it's more cumbersome in Rust
  // But generally, types with the Copy Trait are quite small, like integers, so cloning one costs almost the same as a pointer
  // For larger types, you could use a pointer version, but this code segment is just for demonstration, not very rigorous
}

Now, using the standard library directly, it gets inlined into assembly, saving time copying, and allowing more optimization room for the compiler.

For example, in the code of this article, the compiler noticed that we wouldn’t need the larger value afterward, so it compacted it into a single line of conditional move.

Many standard library functions are more optimized than self-implemented ones, so studying a language’s standard library can greatly enhance code efficiency.

Now Let’s Look at Passing Value in Recursion

In a lateral comparison, in recursion with a passed value, an extra parameter is passed at the beginning, saved in the %r14 register

 6
 7
 8
 9
10
11


example::recurse_with_result:
	;...
	pushq   %rax
	movq    %r8, %r14
	cmpq    %rdx, %rcx
	;...

And the purpose of %r14 is the same as recursion without passing a value, storing the return value

The code resetting %r14 to zero at the beginning is absent, as 0 will be provided in the main program

*The absence of this code eliminates one jump instruction, but this jump is a jmp, a forced jump type; thus it’s predictable and has no impact on performance.

Even the tail recursion section is exactly the same

The only difference is in the recursion call section

.LBB0_5:
  addq    $1, %r14 ;As both recursions require ret+1, it's directly incremented here
  
  movq    example::recurse_with_result@GOTPCREL(%rip), %rbp
  ;Same passing as before
  movq    %r12, %rdi
  movq    %r15, %rsi
  movq    %rax, %rdx
  movq    %rbx, %rcx 
  movq    %r14, %r8 ;ret is passed to the next recursion
  callq   *%rbp ;Recursion
  
  ;The following two lines are somewhat mysterious, not sure why this is done
  movq    %rbp, %r9 ;First store the function pointer from %rbp to %r9
  movq    %rax, %rbp ;Then store the return value of the previous function in the just evacuated %rbp
  
  ;The next 5 lines are just as in recursion without passing, handling parameters for the next recursion
  addq    $-1, %rbx
  movq    %r12, %rdi
  movq    %r15, %rsi
  movq    %r13, %rdx
  movq    %rbx, %rcx
  movq    %r14, %r8 ;An additional step passing the return value
  callq   *%r9 ;Recursion
  
  ;Taking the minimum value
  cmpq    %rax, %rbp
  cmovaq  %rax, %rbp
  movq    %rbp, %r14
  ;Follows the function return code

Back to Basics - Which Approach is Better

In terms of time, there’s no significant difference; the extra or fewer lines of code hardly make any impact with today’s 4GHz processors, a mere difference of 10 ticks maximum.

So, how much difference in space?

Answer: 0

Because the space holding the ret passes through a register (%r8), in each recursion, %r8 is saved in %r14, and %r14 originally also stores the return value in the case of recursion without passing a value. Therefore, you can see with both methods, right at the beginning, 7 registers are stored in the stack. Space occupied on the stack altogether is 7*8 = 56 bytes. So the total stack space occupied will be (recursion depth-1) * 56 bytes.

Therefore, no substantial difference exists for the two methods, at least not in this context.

Summary

This article compared recursion with and without passing values, concluding that there’s virtually no difference, almost equivalent.

However, in actual programming, not every recursive algorithm can choose whether to pass the return value on the way down or on the way up. For example, some backtracking algorithms must carry parameters down during recursion.

So, don’t take this article as a basis for comparison. The purpose here is to help readers understand how code changes once compiled into assembly language and some common optimization methods.

Appendix - Machine Code Source

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


example::recurse_without_result:
        pushq   %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        pushq   %rax
        cmpq    %rdx, %rcx
        jbe     .LBB0_1