In my first summer of graduate school my code suddenly stopped working
because Fortran and Python (via `pow()`

in C) do exponentiation differently.
Once I debugged and understood the problem, I learned about the highly
optimized assembly code produced by Fortran for integer exponents.

To give a sample of the tool I wrote, here is an analysis of the assembly
generated by the `gfortran`

compiler for `b = a**5`

:

```
$ python ./describe_assembly.py --exponent 5
+-------------------------+------------+
| %xmm0[:64] | %xmm1[:64] |
+-------------------------+------------+
| a | | f2 0f 10 07 movsd [(%rdi), %xmm0 ]
| a | a | 66 0f 28 c8 movapd [ %xmm0, %xmm1 ]
| a | a * a | f2 0f 59 c8 mulsd [ %xmm0, %xmm1 ]
| (a * a) * a | a * a | f2 0f 59 c1 mulsd [ %xmm1, %xmm0 ]
| (a * a) * ((a * a) * a) | a * a | f2 0f 59 c1 mulsd [ %xmm1, %xmm0 ]
| b | | f2 0f 11 06 movsd [ %xmm0, (%rsi)]
+-------------------------+------------+
```

### Contents

### Motivation

In my first summer of graduate school, I had a very productive June. I was
working on a research project^{1} on optimization of triangular meshes for
compression and storage. A core part of this involved taking fourth powers of
floating point numbers^{2}. Once I
reached a point where my slow Python code worked well enough, I started porting
to Fortran 95. I foolishly expected bit-for-bit identical behavior but instead
the changes — which I later learned were exclusively from the fourth
power operation — were so dramatic I lost a **full week** of productivity
debugging the existing test cases.

As with most optimization problems, this involved applying
gradient descent^{3} to a number of
handpicked objective functions. Since this involved triangles and compression,
we were trying to modify the mesh^{4} by
"binning" triangles into similar shapes. For our early objective functions,
this could lead to flattened out triangles that looked more like lines than
triangles. Luckily there exist a number of ways to measure the "quality"
$q$ of a triangle as a number between $0$ and
$1$; a very low quality corresponds exactly
to this flatness. To penalize low-quality triangles we included functions of
$1 / q$ in our objective functions and several of the most
successful used a loop over all triangles $T$ in the mesh
$\mathcal{M}$

$\sum_{T \in \mathcal{M}} \frac{1}{q(T)^4}$

### What's the Difference?

When the exponent `n`

is an integer, Fortran `x**n`

behaves
much differently than Python. In Fortran, optimized assembly is generated to
minimize the number of multiplications, whereas Python calls out to `pow()`

from `math.h`

in C. Compare an explicit vs. implicit implementation in Python:

```
def fourth_explicit(value):
squared = value * value
return squared * squared
def fourth_pow(value):
return value ** 4
```

For well-behaved values, these two functions compute $1 / q^4$ without much difference

```
>>> 1 / fourth_explicit(0.25)
256.0
>>> 1 / fourth_pow(0.25)
256.0
>>> quality = 0.2689565746627065
>>> 1 / fourth_explicit(quality)
191.1046874202122
>>> 1 / fourth_pow(quality)
191.10468742021223
```

however when $q$ becomes small (the type of values we are penalizing) the magnitude of $1 / q^4$ means that small relative differences are still large absolute differences in our objective function:

```
>>> quality = 2.3824755061912883e-06
>>> 1 / fourth_explicit(quality)
3.103746353229757e+22
>>> 1 / fourth_pow(quality)
3.103746353229758e+22
>>>
>>> 1 / fourth_explicit(quality) - 1 / fourth_pow(quality)
-8388608.0
```

The difference is caused by the fact that `pow()`

treats everything like a
floating point number and uses `exp()`

, `log()`

and multiplication via

$x^y = e^{y \log(x)}.$

### Fortran's Nifty Algorithm

Breaking down the assembly for `b = a**4`

in Fortran we see the repeated
squaring

```
pow4:
movsd xmm0, QWORD PTR [rdi]
mulsd xmm0, xmm0
mulsd xmm0, xmm0
movsd QWORD PTR [rsi], xmm0
ret
```

as compared to a jump (`jmp`

) instruction in C

```
pow4:
movsd xmm1, QWORD PTR .LC0[rip]
jmp pow
.LC0:
.long 0
.long 1074790400
```

Similarly for `q**7`

the assembly shows only multiplications (vs. a `jmp`

)

```
pow7:
movsd xmm0, QWORD PTR [rdi]
movapd xmm1, xmm0
mulsd xmm1, xmm0
mulsd xmm0, xmm1
mulsd xmm1, xmm1
mulsd xmm0, xmm1
movsd QWORD PTR [rsi], xmm0
ret
```

### I Don't Speak Assembly

Like most professional software engineers (and science PhDs who write code) I
have no formal training, so reading assembly is never something I've done.
So I wrote a script to generate a Fortran subroutine that computes
`b = a**n`

for a fixed value of `n`

, compiles the code to an object file
and then disassembles it. For example:

```
$ gfortran -c -O3 ./pow7.f90 -o ./pow7.o -J ./
$ objdump --disassemble ./pow7.o
```

However, these instructions are still hard to visualize so I parsed the
disassembled instructions from the object file and tracked all of the
active registers (`%xmm{N}`

) as they were updated. In addition to the `xmm`

registers, the `%rsi`

(source) and `%rdi`

(destination) registers are used
for copying data **from** `a`

and **to** `b`

after the computation is done.

Being able to visualize this really helped! For `n = 4`

we can see the
generated assembly is efficient enough that it only requires one register

```
$ python ./describe_assembly.py --exponent 4
+-------------------+
| %xmm0[:64] |
+-------------------+
| a | f2 0f 10 07 movsd [(%rdi), %xmm0 ]
| a * a | f2 0f 59 c0 mulsd [ %xmm0, %xmm0 ]
| (a * a) * (a * a) | f2 0f 59 c0 mulsd [ %xmm0, %xmm0 ]
| b | f2 0f 11 06 movsd [ %xmm0, (%rsi)]
+-------------------+
```

and `n = 7`

can do all of its work in two registers

```
$ python ./describe_assembly.py --exponent 7
+-------------------------------------+-------------------+
| %xmm0[:64] | %xmm1[:64] |
+-------------------------------------+-------------------+
| a | | f2 0f 10 07 movsd [(%rdi), %xmm0 ]
| a | a | 66 0f 28 c8 movapd [ %xmm0, %xmm1 ]
| a | a * a | f2 0f 59 c8 mulsd [ %xmm0, %xmm1 ]
| (a * a) * a | a * a | f2 0f 59 c1 mulsd [ %xmm1, %xmm0 ]
| (a * a) * a | (a * a) * (a * a) | f2 0f 59 c9 mulsd [ %xmm1, %xmm1 ]
| ((a * a) * (a * a)) * ((a * a) * a) | (a * a) * (a * a) | f2 0f 59 c1 mulsd [ %xmm1, %xmm0 ]
| b | | f2 0f 11 06 movsd [ %xmm0, (%rsi)]
+-------------------------------------+-------------------+
```

PS: It's worth noting here that black box analysis of the generated assembly is not the only way to understand what`gfortran`

will do. I attempted to dive into the source for`gfortran`

to understand exactly how this is generated (in particular if it differs when the exponent`n`

is not known until runtime). However, I was not able to find any conclusive section of code responsible and ran out of time digging.

- The project unfortunately never led to results that could be published. ↩
- I.e., values of type
`double`

in C ↩ - and related algorithms ↩
- It's worth noting that for our purposes a trianglular mesh is just a collection of triangles that share edges, so if one triangle changes, the neighboring triangles must as well. ↩