Because so many people are complaining about formatting, I transcribed it into markdown. I hope that's more readable for people.

; Lines starting with ';' are comments. Transcribed in vim, so apologies if formatting is messed up subtly somewhere.

Begin
-----

The death of optimizing compilers

Daniel J. Bernstein

University of Illinois at Chicago & Technische Universiteit Eindhoven

-----

"Programmers waste enormous amounts of time thinking about,
or worrying about, the speed of noncritical parts of their
programs, and these attempts at efficiency actually have a strong
negative impact when debugging and maintenance are considered.
We should forget about small efficiencies, say about 97% of
the time; premature optimization is the root of all evil."
(Donald E. Knuth, "Structured programming with go to statements", 1974)

-----

The oversimplified story

Once upon a time:

- CPUs were painfully slow.
- Software speed mattered.
- Software was carefully hand-tuned in machine language.

Today:

- CPUs are so fast that software speed is irrelevant.
- "Unoptimized" is fast enough.
- Programmers have stopped thinking about performance.
- Compilers will do the same: easier to write, test, verify.

-----

The actual story:
Wait! It's not that simple.
Software speed still matters.
Users are often waiting for their computers.
To avoid unacceptably slow computations, users are often
limiting what they compute.

Example: In your favorite sword-fighting video game, are light reflections
affected realistically by sword vibration?

-----

; Random pictures showing things are approximated in graphics, rather than 100% accurate

-----

Old CPU displaying a file:

- 0ms: Start opening file
- 400ms: Start displaying contents
- 1200ms: Start cleaning up
- 1600ms: Finish.

; Due to CPU speedups, eventually progresses to:

- 0ms: Start opening file
- 200ms: Start displaying contents
- 600ms: Start cleaning up
- 800ms: Finish.

User displays bigger file:

- 0ms: Start opening file
- 200ms: Start displaying contents
- 1000ms: Start cleaning up
- 1200ms: Finish.

; Due to CPU speedups, progresses... etc.

-----

Cheaper computation => users process more data

- Performance issues disappear for most operations. (e.g. open file, clean up)
- Inside the top operations: Performance issues disappear for most subroutines.
- Performance remains important for occasional *hot spots*: small segments of
  code applied to tons of data.

-----

"Except, uh, a lot of people have applications whose profiles are mostly flat
because they've spent a lot of time optimizing them"

This view is obsolete.

Flat profiles are dying.

- Already dead for most programs.

Larger and larger fraction of code runs freezingly cold, while hot spots run
hotter.

Underlying phenomena:

- Optimization tends to converge.
- Data volume tends to diverge.

-----

Speed matters: an example

2015.02.23 CloudFlare blog post

"Do the ChaCha: better mobile performance with cryptography"

(boldface added): "Until today, Google services were the only major sites on the
Internet that supported this new algorithm. Now all sites on CloudFlare support
it, too. ChaCha20-Poly1305 is three times faster than AES-128-GCM on mobile devices.
Spending less time on decryption means faster page rendering and better battery
life... In order to support over a million HTTPS sites on our servers, we have
to make sure CPU usage is low. To help improve performance we are using an open
source assembly code version of ChaCha/Poly by CloudFlare engineer Vlad Krasnov
and others that has been optimized for our servers' Intel CPUs. This keeps the
cost of encrypting data with this new cipher to a minimum."

-----

Typical excerpt from inner loop of server code:

vpaddd $b, $a, $a
vpxor $a, $d, $d
vpshufb .rol16(%rip), $d, $d
vpaddd $d, $c, $c
vpxor $c, $b, $b
vpslld \$12, $b, $tmp
vpsrld \$20, $b, $b
vpxor $tmp, $b, $b

Mobile code similarly has heavy vectorization + asm.
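
; Transcriber's note, not in the talk: the excerpt above is the start of a ChaCha
; quarter-round, vectorized 8 ways with AVX2 (vpaddd = 32-bit add, vpxor = XOR,
; vpshufb / vpslld+vpsrld = rotations by 16 and 12). A rough scalar C sketch of the
; same operations, with names of my own choosing:

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t x, int n) {
    return (x << n) | (x >> (32 - n));
}

/* One ChaCha quarter-round on four state words; the assembly above
   corresponds to the first two lines, applied to 8 words per register. */
static void quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d) {
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}
```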

-----

Wikipedia: "By the late 1990s for even performance sensitive code, optimizing
compilers exceeded the performance of human experts."

The experts disagree, and hold the speed records.

Mike Pall, LuaJIT author, 2011:
"If you write an interpreter loop in assembler, you can do much better. There's
just no way you can reasonably expect even the most advanced C compilers to do
this on your behalf."

-----

"We come so close to optimal on most architectures that we can't do much more
without using NP complete algorithms instead of heuristics. We can only try to
get little niggles here and there where the heuristics get slightly wrong
answers."

"Which compiler is this which can, for instance, take Netlib LAPACK and run
serial Linpack as fast as OpenBLAS on recent x86-64? (Other common hotspots
are available.) Enquiring HPC minds want to know."

-----

; The algorithm designer's job

Context: What's the metric that we're trying to optimize?
> CS 101 view: "Time".

What exactly does this mean?
> Need to specify machine model in enough detail to analyze.

Simple defn of "RAM" model has pathologies: e.g., can factor integers in poly
"time".

With more work can build a more reasonable "RAM" model.

-----

Many other choices of metrics: space, cache utilization, etc.
Many physical metrics such as real time and energy defined by physical machines:
e.g.

- my smartphone
- my laptop
- a cluster
- a data center
- the entire Internet.

Many other abstract models.
e.g. Simplify: Turing machine.
e.g. Allow parallelism: PRAM.

-----

Output of algorithm design: an algorithm, i.e., a specification of instructions
for the machine.

Try to minimize cost of the algorithm in the specified metric (or combinations
of metrics).

Input to algorithm design: specification of function that we want to compute.
> Typically a simpler algorithm in a higher-level language: e.g., a mathematical
formula.

-----

Algorithm design is hard.

Massive research topic.

State of the art is extremely complicated.

Some general techniques with broad applicability (e.g., dynamic programming) but
most progress is heavily domain-specific:

- Karatsuba's algorithm
- Strassen's algorithm
- the Boyer–Moore algorithm
- the Ford–Fulkerson algorithm
- Shor's algorithm

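; Transcriber's note, not in the talk: to make "domain-specific" concrete, here is a
; minimal sketch of one level of Karatsuba's trick, using three 16x16-bit
; multiplications instead of four to multiply two 32-bit numbers. Real big-integer
; libraries apply the same recurrence to multi-word operands; all names here are mine.

```c
#include <stdint.h>
#include <stdio.h>

/* One level of Karatsuba: x*y via three half-width multiplications.
   Split x = x1*2^16 + x0 and y = y1*2^16 + y0; then
   x*y = z2*2^32 + z1*2^16 + z0 with z1 = (x1+x0)(y1+y0) - z2 - z0. */
uint64_t karatsuba32(uint32_t x, uint32_t y) {
    uint32_t x1 = x >> 16, x0 = x & 0xFFFF;
    uint32_t y1 = y >> 16, y0 = y & 0xFFFF;
    uint64_t z2 = (uint64_t)x1 * y1;                          /* high * high */
    uint64_t z0 = (uint64_t)x0 * y0;                          /* low * low   */
    uint64_t z1 = (uint64_t)(x1 + x0) * (y1 + y0) - z2 - z0;  /* cross terms */
    return (z2 << 32) + (z1 << 16) + z0;
}

int main(void) {
    uint32_t a = 123456789u, b = 987654321u;
    printf("%llu vs %llu\n",
           (unsigned long long)karatsuba32(a, b),
           (unsigned long long)((uint64_t)a * b));  /* both print the same product */
    return 0;
}
```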
-----

Wikipedia: "An optimizing compiler is a compiler that tries to minimize or
maximize some attributes of an executable computer program."

...So the algorithm designer (viewed as a machine) is an optimizing compiler?

Nonsense. Compiler designers have narrower focus. Example: "A compiler will not
change an implementation of bubble sort to use mergesort." — Why not?

-----

In fact, compiler designers take responsibility only for "machine-specific
optimization". Outside this bailiwick they freely blame algorithm designers:

--------------------------
| Function specification |
--------------------------
            | [Algorithm Designer]
            V
----------------------------------------------------------
| Source code with all machine-independent optimizations |
----------------------------------------------------------
            | [Optimizing compiler]
            V
---------------------------------------------------
| Object code with machine-specific optimizations |
---------------------------------------------------

-----

Output of optimizing compiler is algorithm for target machine.

Algorithm designer could have targeted this machine directly.

Why build a new designer as compiler composed with old designer?

Advantages of this composition:
1. save designer's time in handling complex machines
2. save designer's time in handling many machines.

Optimizing compiler is general-purpose, used by many designers.

-----

And the compiler designers say the results are great!

Remember the typical quote: "We come so close to optimal on most architectures...
We can only try to get little niggles here and there where the heuristics get
slightly wrong answers."

...But they're wrong.
Their results are becoming less and less satisfactory, despite clever compiler
research; more CPU time for compilation; extermination of many targets.

-----

How the code base is evolving:

Fastest code:
hot spots targeted directly by algorithm designers, using domain-specific tools.

Mediocre code:
output of optimizing compilers; hot spots not yet reached by algorithm designers.

Slowest code:
code with optimization turned off; so cold that optimization isn't worth the
costs.

-----

; Mediocre code slowly fades out, until...

Fastest code:
hot spots targeted directly by algorithm designers, using domain-specific tools.

Slowest code:
code with optimization turned off; so cold that optimization isn't worth the
costs.

-----

2013 Wang–Zhang–Zhang–Yi
"AUGEM: automatically generate high performance dense linear algebra kernels on
x86 CPUs":

"Many DLA kernels in ATLAS are manually implemented in assembly by domain
experts... Our template-based approach [allows] multiple machine-level
optimizations in a domain/application specific setting and allows the expert
knowledge of how best to optimize varying kernels to be seamlessly integrated
in the process."

-----

Why this is happening
> The actual machine is evolving farther and farther away from the source
machine.

Minor optimization challenges:

- Pipelining.
- Superscalar processing.

Major optimization challenges:

- Vectorization.
- Many threads; many cores.
- The memory hierarchy; the ring; the mesh.
- Larger-scale parallelism.
- Larger-scale networking.

-----

CPU Design in a nutshell:

; Please just see slides 77-94ish -- I'm not drawing all of this out in ASCII.
; He just goes over some foundations of modern processor design, incl.
; pipelining.

-----

"Vector" processing:
Expand each 32-bit integer into n-vector of 32-bit integers.

- ARM "NEON" has n = 4
- Intel "AVX2" has n = 8
- Intel "AVX-512" has n = 16
- GPUs have larger n.

n× speedup if n arithmetic circuits, n read/write circuits.
Benefit: amortizes insn circuits.

Huge effect on higher-level algorithms and data structures.
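
; Transcriber's note, not in the talk: a minimal C sketch of the n = 8 (Intel AVX2)
; case using compiler intrinsics, assuming an AVX2-capable CPU and -mavx2; one vpaddd
; adds eight 32-bit lanes at once, amortizing instruction fetch/decode over 8 results.
; Function names are mine.

```c
#include <immintrin.h>  /* AVX2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Scalar version: one 32-bit addition per loop iteration. */
void add_scalar(const int32_t *x, const int32_t *y, int32_t *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = x[i] + y[i];
}

/* Vector version: 8 additions per instruction (assumes n is a multiple of 8). */
void add_avx2(const int32_t *x, const int32_t *y, int32_t *out, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m256i vx = _mm256_loadu_si256((const __m256i *)(x + i));
        __m256i vy = _mm256_loadu_si256((const __m256i *)(y + i));
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_add_epi32(vx, vy));
    }
}
```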

-----

How expensive is sorting?
Input: array of n numbers.
Each number in {1, 2, ..., n^2}, represented in binary.

Output: array of n numbers, in increasing order, represented in binary;
*same multiset as input*.

Metric: seconds used by circuit of area n^(1+o(1)).
(For simplicity assume n = 4^k.)

-----

Spread array across square mesh of n small cells, each of area n^(o(1)), with
near-neighbor wiring:

; diagram of a massive mesh of cells

-----

Sort all n cells in n^(0.5+o(1)) seconds:

- Recursively sort quadrants in parallel, if n > 1.
- Sort each column in parallel.
- Sort each row in parallel.
- Sort each column in parallel.
- Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that
this sorts whole array.

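; Transcriber's note, not in the talk, and not the exact recursive schedule above: a
; sequential C simulation of the closely related shearsort, which also sorts an
; n-by-n mesh using nothing but alternating row and column sorts (about log2(n)+1
; rounds), leaving the data in boustrophedon ("snake") order. On a mesh-connected
; chip every row or column phase runs in parallel; here each phase is just a loop.
; All names are mine.

```c
#include <stdio.h>
#include <stdlib.h>

#define DIM 8                 /* mesh is DIM x DIM */

static int asc(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}
static int desc(const void *a, const void *b) { return asc(b, a); }

/* Shearsort: repeat ~log2(DIM)+1 times: sort rows in alternating directions,
   then sort every column ascending.  Result is sorted in snake order. */
static void shearsort(int grid[DIM][DIM]) {
    int rounds = 1;
    for (int d = DIM; d > 1; d >>= 1) rounds++;
    for (int r = 0; r < rounds; r++) {
        for (int i = 0; i < DIM; i++)             /* row phase */
            qsort(grid[i], DIM, sizeof(int), (i % 2 == 0) ? asc : desc);
        for (int j = 0; j < DIM; j++) {           /* column phase */
            int col[DIM];
            for (int i = 0; i < DIM; i++) col[i] = grid[i][j];
            qsort(col, DIM, sizeof(int), asc);
            for (int i = 0; i < DIM; i++) grid[i][j] = col[i];
        }
    }
}

int main(void) {
    int grid[DIM][DIM];
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            grid[i][j] = rand() % (DIM * DIM);
    shearsort(grid);
    /* read out in snake order: even rows left-to-right, odd rows right-to-left */
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            printf("%d ", grid[i][(i % 2 == 0) ? j : DIM - 1 - j]);
    printf("\n");
    return 0;
}
```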
; Magical examples showing the power of parallel sorting... [slides 102-107]

-----

Chips are in fact evolving towards having this much parallelism and
communication.

- GPUs: parallel + global RAM.
- Old Xeon Phi: parallel + ring.
- New Xeon Phi: parallel + mesh.

Algorithm designers don't even get the right exponent without taking this into
account.

Shock waves into high levels of domain-specific algorithm design: e.g., for
"NFS" factorization, replace "sieving" with "ECM".

-----

The future of compilers

At this point many readers will say, "But he should only write P, and an
optimizing compiler will produce Q." To this I say, "No, the optimizing compiler
would have to be so complicated (much more so than anything we have now) that
it will in fact be unreliable."

I have another alternative to propose, a new class of software which will be far
better.

-----

For 15 years or so I have been trying to think of how to write a compiler that
really produces top quality code. For example, most of the Mix programs in my
books are considerably more efficient than any of today's most visionary
compiling schemes would be able to produce. I've tried to study the various
techniques that a hand coder like myself uses, and to fit them into some
systematic and automatic system. A few years ago, several students and I looked
at a typical sample of FORTRAN programs [51], and we all tried hard to see how a
machine could produce code that would compete with our best hand-optimized object
programs. We found ourselves always running up against the same problem: the
compiler needs to be in a dialog with the programmer; it needs to know
properties of the data, and whether certain cases can arise, etc. And we
couldn't think of a good language in which to have such a dialog.

For some reason we all (especially me) had a mental block about optimization,
namely that we always regarded it as a behind-the-scenes activity, to be done
in the machine language, which the programmer isn't supposed to know. This veil
was first lifted from my eyes... when I ran across a remark by Hoare [42] that
ideally, a language should be designed so that an optimizing compiler can
describe its optimizations in the source language. Of course! ...The time is
clearly ripe for program-manipulation systems ...The programmer using such a
system will write his beautifully-structured, but possibly inefficient, program
P; then he will interactively specify transformations that make it efficient.
Such a system will be much more powerful and reliable than a completely
automatic one. ...As I say, this idea certainly isn't my own; it is so exciting
I hope that everyone soon becomes aware of its possibilities.

END
-----