Bossy Lobster

A blog by Danny Hermes; musing on tech, mathematics, etc.

Edit on GitHub

pow Confusion

In my first summer of graduate school my code suddenly stopped working because Fortran and Python (via pow() in C) do exponentiation differently. Once I debugged and understood the problem, I learned about the highly optimized assembly code produced by Fortran for integer exponents.

To give a sample of the tool I wrote, here is an analysis of the assembly generated by the gfortran compiler for b = a**5:

$ python ./describe_assembly.py --exponent 5
+-------------------------+------------+
|        %xmm0[:64]       | %xmm1[:64] |
+-------------------------+------------+
|            a            |            | f2 0f 10 07     movsd   [(%rdi), %xmm0 ]
|            a            |     a      | 66 0f 28 c8     movapd  [ %xmm0, %xmm1 ]
|            a            |   a * a    | f2 0f 59 c8     mulsd   [ %xmm0, %xmm1 ]
|       (a * a) * a       |   a * a    | f2 0f 59 c1     mulsd   [ %xmm1, %xmm0 ]
| (a * a) * ((a * a) * a) |   a * a    | f2 0f 59 c1     mulsd   [ %xmm1, %xmm0 ]
|            b            |            | f2 0f 11 06     movsd   [ %xmm0, (%rsi)]
+-------------------------+------------+

Contents

Motivation

In my first summer of graduate school, I had a very productive June. I was working on a research project1 on optimization of triangular meshes for compression and storage. A core part of this involved taking fourth powers of floating point numbers2. Once I reached a point where my slow Python code worked well enough, I started porting to Fortran 95. I foolishly expected bit-for-bit identical behavior but instead the changes — which I later learned were exclusively from the fourth power operation — were so dramatic I lost a full week of productivity debugging the existing test cases.

As with most optimization problems, this involved applying gradient descent3 to a number of handpicked objective functions. Since this involved triangles and compression, we were trying to modify the mesh4 by "binning" triangles into similar shapes. For our early objective functions, this could lead to flattened out triangles that looked more like lines than triangles. Luckily there exist a number of ways to measure the "quality" qq of a triangle as a number between 00 and 11; a very low quality corresponds exactly to this flatness. To penalize low-quality triangles we included functions of 1/q1 / q in our objective functions and several of the most successful used a loop over all triangles TT in the mesh M\mathcal{M}

TM1q(T)4\sum_{T \in \mathcal{M}} \frac{1}{q(T)^4}

What's the Difference?

When the exponent n is an integer, Fortran x**n behaves much differently than Python. In Fortran, optimized assembly is generated to minimize the number of multiplications, whereas Python calls out to pow() from math.h in C. Compare an explicit vs. implicit implementation in Python:

def fourth_explicit(value):
    squared = value * value
    return squared * squared


def fourth_pow(value):
    return value ** 4

For well-behaved values, these two functions compute 1/q41 / q^4 without much difference

>>> 1 / fourth_explicit(0.25)
256.0
>>> 1 / fourth_pow(0.25)
256.0
>>> quality = 0.2689565746627065
>>> 1 / fourth_explicit(quality)
191.1046874202122
>>> 1 / fourth_pow(quality)
191.10468742021223

however when qq becomes small (the type of values we are penalizing) the magnitude of 1/q41 / q^4 means that small relative differences are still large absolute differences in our objective function:

>>> quality = 2.3824755061912883e-06
>>> 1 / fourth_explicit(quality)
3.103746353229757e+22
>>> 1 / fourth_pow(quality)
3.103746353229758e+22
>>>
>>> 1 / fourth_explicit(quality) - 1 / fourth_pow(quality)
-8388608.0

The difference is caused by the fact that pow() treats everything like a floating point number and uses exp(), log() and multiplication via

xy=eylog(x).x^y = e^{y \log(x)}.

Fortran's Nifty Algorithm

Breaking down the assembly for b = a**4 in Fortran we see the repeated squaring

pow4:
  movsd xmm0, QWORD PTR [rdi]
  mulsd xmm0, xmm0
  mulsd xmm0, xmm0
  movsd QWORD PTR [rsi], xmm0
  ret

as compared to a jump (jmp) instruction in C

pow4:
  movsd xmm1, QWORD PTR .LC0[rip]
  jmp pow
.LC0:
  .long 0
  .long 1074790400

Similarly for q**7 the assembly shows only multiplications (vs. a jmp)

pow7:
  movsd xmm0, QWORD PTR [rdi]
  movapd xmm1, xmm0
  mulsd xmm1, xmm0
  mulsd xmm0, xmm1
  mulsd xmm1, xmm1
  mulsd xmm0, xmm1
  movsd QWORD PTR [rsi], xmm0
  ret

I Don't Speak Assembly

Like most professional software engineers (and science PhDs who write code) I have no formal training, so reading assembly is never something I've done. So I wrote a script to generate a Fortran subroutine that computes b = a**n for a fixed value of n, compiles the code to an object file and then disassembles it. For example:

$ gfortran -c -O3 ./pow7.f90 -o ./pow7.o -J ./
$ objdump --disassemble ./pow7.o

However, these instructions are still hard to visualize so I parsed the disassembled instructions from the object file and tracked all of the active registers (%xmm{N}) as they were updated. In addition to the xmm registers, the %rsi (source) and %rdi (destination) registers are used for copying data from a and to b after the computation is done.

Being able to visualize this really helped! For n = 4 we can see the generated assembly is efficient enough that it only requires one register

$ python ./describe_assembly.py --exponent 4
+-------------------+
|     %xmm0[:64]    |
+-------------------+
|         a         | f2 0f 10 07     movsd   [(%rdi), %xmm0 ]
|       a * a       | f2 0f 59 c0     mulsd   [ %xmm0, %xmm0 ]
| (a * a) * (a * a) | f2 0f 59 c0     mulsd   [ %xmm0, %xmm0 ]
|         b         | f2 0f 11 06     movsd   [ %xmm0, (%rsi)]
+-------------------+

and n = 7 can do all of its work in two registers

$ python ./describe_assembly.py --exponent 7
+-------------------------------------+-------------------+
|              %xmm0[:64]             |     %xmm1[:64]    |
+-------------------------------------+-------------------+
|                  a                  |                   | f2 0f 10 07     movsd   [(%rdi), %xmm0 ]
|                  a                  |         a         | 66 0f 28 c8     movapd  [ %xmm0, %xmm1 ]
|                  a                  |       a * a       | f2 0f 59 c8     mulsd   [ %xmm0, %xmm1 ]
|             (a * a) * a             |       a * a       | f2 0f 59 c1     mulsd   [ %xmm1, %xmm0 ]
|             (a * a) * a             | (a * a) * (a * a) | f2 0f 59 c9     mulsd   [ %xmm1, %xmm1 ]
| ((a * a) * (a * a)) * ((a * a) * a) | (a * a) * (a * a) | f2 0f 59 c1     mulsd   [ %xmm1, %xmm0 ]
|                  b                  |                   | f2 0f 11 06     movsd   [ %xmm0, (%rsi)]
+-------------------------------------+-------------------+

PS: It's worth noting here that black box analysis of the generated assembly is not the only way to understand what gfortran will do. I attempted to dive into the source for gfortran to understand exactly how this is generated (in particular if it differs when the exponent n is not known until runtime). However, I was not able to find any conclusive section of code responsible and ran out of time digging.

  1. The project unfortunately never led to results that could be published.
  2. I.e., values of type double in C
  3. and related algorithms
  4. It's worth noting that for our purposes a trianglular mesh is just a collection of triangles that share edges, so if one triangle changes, the neighboring triangles must as well.

Comments