- Because so many people are complaining about formatting, I transcribed it into markdown. I hope that's more readable for people.
- ; Lines starting with ';' are comments. Transcribed in vim, so apologies if formatting is messed up subtly somewhere.
- Begin
- -----
- The death of optimizing compilers
- Daniel J. Bernstein
- University of Illinois at Chicago & Technische Universiteit Eindhoven
- -----
- "Programmers waste enormous amounts of time thinking about
- or worrying about, the speed of noncritical parts of their
- programs, and these attempts at efficiency actually have a strong
- negative impact when debugging and maintenance are considered.
- We should forget about small efficiencies, say about 97% of
- the time; premature optimization is the root of all evil."
- (Donald E. Knuth, "Structured programming with go to statements", 1974)
- -----
- The oversimplified story
- Once upon a time:
- - CPUs were painfully slow.
- - Software speed mattered.
- Software was carefully hand-tuned in machine language.
- Today:
- - CPUs are so fast that software speed is irrelevant.
- - "Unoptimized" is fast enough.
- - Programmers have stopped thinking about performance.
- - Compilers will do the same: easier to write, test, verify.
- -----
- The actual story:
- Wait! It's not that simple.
- Software speed still matters.
- Users are often waiting for their computers.
- To avoid unacceptably slow computations, users are often
- limiting what they compute.
- Example: In your favorite sword-fighting video game, are light reflections
- affected realistically by sword vibration?
- -----
- ; Random pictures showing things are approximated in graphics, rather than 100% accurate
- -----
- Old CPU displaying a file:
- - 0ms: Start opening file
- - 400ms: Start displaying contents
- - 1200ms: Start cleaning up
- - 1600ms: Finish.
- ; Due to CPU speedups, eventually progresses to:
- - 0ms: Start opening file
- - 200ms: Start displaying contents
- - 600ms: Start cleaning up
- - 800ms: Finish.
- User displays bigger file:
- - 0ms: Start opening file
- - 200ms: Start displaying contents
- - 1000ms: Start cleaning up
- - 1200ms: Finish.
- ; Due to CPU speedups, progresses... etc.
- -----
- Cheaper computation => users process more data
- - Performance issues disappear for most operations. (e.g. open file, clean up)
- - Inside the top operations: Performance issues disappear for most subroutines.
- - Performance remains important for occasional *hot spots*: small segments of
- code applied to tons of data.
- -----
- "Except, uh, a lot of people have applications whose profiles are mostly flat
- because they've spent a lot of time optimizing them"
- This view is obsolete.
- Flat profiles are dying.
- - Already dead for most programs.
- Larger and larger fraction of code runs freezingly cold, while hot spots run
- hotter.
- Underlying phenomena:
- - Optimization tends to converge.
- - Data volume tends to diverge.
- -----
- Speed matters: an example
- 2015.02.23 CloudFlare blog post
- "Do the ChaCha: better mobile performance with cryptography"
- (boldface added): "Until today, Google services were the only major sites on the
- Internet that supported this new algorithm. Now all sites on CloudFlare support
- it, too. ChaCha20-Poly1305 is three times faster than AES-128-GCM on mobile devices.
- Spending less time on decryption means faster page rendering and better battery
- life... In order to support over a million HTTPS sites on our servers, we have
- to make sure CPU usage is low. To help improve performance we are using an open
- source assembly code version of ChaCha/Poly by CloudFlare engineer Vlad Krasnov
- and others that has been optimized for our servers' Intel CPUs. This keeps the
- cost of encrypting data with this new cipher to a minimum."
- -----
- Typical excerpt from inner loop of server code:
```
vpaddd  $b, $a, $a
vpxor   $a, $d, $d
vpshufb .rol16(%rip), $d, $d
vpaddd  $d, $c, $c
vpxor   $c, $b, $b
vpslld  $12, $b, $tmp
vpsrld  $20, $b, $b
vpxor   $tmp, $b, $b
```
- Mobile code similarly has heavy vectorization + asm.
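- For orientation, here is a minimal scalar C sketch (not CloudFlare's code) of the ChaCha quarter-round that the excerpt above vectorizes; the rotation constants 16, 12, 8, 7 come from the ChaCha specification:

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t x, int n) {
    return (x << n) | (x >> (32 - n));
}

/* One ChaCha quarter-round.  The assembly above is the vectorized form
   of the first two steps: vpshufb does the 16-bit rotate, and the
   vpslld/vpsrld/vpxor triple does the 12-bit rotate. */
static void quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d) {
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}
```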
- -----
- Wikipedia: "By the late 1990s for even performance sensitive code, optimizing
- compilers exceeded the performance of human experts."
- The experts disagree, and hold the speed records.
- Mike Pall, LuaJIT author, 2011:
- "If you write an interpreter loop in assembler, you can do much better. There's
- just no way you can reasonably expect even the most advanced C compilers to do
- this on your behalf."
- -----
- "We come so close to optimal on most architectures that we can't do much more
- without using NP complete algorithms instead of heuristics. We can only try to
- get little niggles here and there where the heuristics get slightly wrong
- answers."
- "Which compiler is this which can, for instance, take Netlib LAPACK and run
- serial Linpack as fast as OpenBLAS on recent x86-64? (Other common hotspots
- are available.) Enquiring HPC minds want to know."
- -----
- ; The algorithm designer's job
- Context: What's the metric that we're trying to optimize?
- > CS 101 view: "Time".
- What exactly does this mean?
- > Need to specify machine model in enough detail to analyze.
- Simple defn of "RAM" model has pathologies: e.g., can factor integers in poly
- "time".
- With more work can build a more reasonable "RAM" model.
- -----
- Many other choices of metrics: space, cache utilization, etc.
- Many physical metrics such as real time and energy defined by physical machines:
- e.g.
- - my smartphone
- - my laptop
- - a cluster
- - a data center
- - the entire Internet.
- Many other abstract models.
- e.g. Simplify: Turing machine.
- e.g. Allow parallelism: PRAM
- -----
- Output of algorithm design: an algorithm, i.e., a specification of instructions for the
- machine.
- Try to minimize cost of the algorithm in the specified metric (or combinations
- of metrics).
- Input to algorithm design: specification of function that we want to compute.
- > Typically a simpler algorithm in a higher-level language: e.g., a mathematical
- formula.
- -----
- Algorithm design is hard.
- Massive research topic.
- State of the art is extremely complicated.
- Some general techniques with broad applicability (e.g., dynamic programming), but
- most progress is heavily domain-specific (see the sketch after this list):
- - Karatsuba's algorithm
- - Strassen's algorithm
- - the Boyer–Moore algorithm
- - the Ford–Fulkerson algorithm
- - Shor's algorithm
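- As a concrete illustration (not from the talk), here is Karatsuba's trick at its smallest scale in C: three half-width multiplications replace the schoolbook four.

```c
#include <stdint.h>

/* Karatsuba at the smallest scale: multiply two 32-bit numbers using
   three 16x16-bit multiplications instead of the schoolbook four.
   Writing x = x1*2^16 + x0 and y = y1*2^16 + y0:
     x*y = hi*2^32 + ((x0+x1)*(y0+y1) - hi - lo)*2^16 + lo
   where hi = x1*y1 and lo = x0*y0. */
uint64_t karatsuba32(uint32_t x, uint32_t y) {
    uint64_t x0 = x & 0xFFFF, x1 = x >> 16;
    uint64_t y0 = y & 0xFFFF, y1 = y >> 16;
    uint64_t lo  = x0 * y0;
    uint64_t hi  = x1 * y1;
    uint64_t mid = (x0 + x1) * (y0 + y1) - hi - lo;  /* the saved multiply */
    return (hi << 32) + (mid << 16) + lo;
}
```

- Applied recursively to n-digit numbers this gives the O(n^1.585) bound; the point is that such tricks come from domain insight, not from a compiler.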
- -----
- Wikipedia: "An optimizing compiler is a compiler that tries to minimize or
- maximize some attributes of an executable computer program."
- ...So the algorithm designer (viewed as a machine) is an optimizing compiler?
- Nonsense. Compiler designers have narrower focus. Example: "A compiler will not
- change an implementation of bubble sort to use mergesort." — Why not?
- -----
- In fact, compiler designers take responsibility only for "machine-specific
- optimization". Outside this bailiwick they freely blame algorithm designers:
- --------------------------
- | Function specification |
- --------------------------
- | [Algorithm Designer]
- V
- ----------------------------------------------------------
- | Source code with all machine-independent optimizations |
- ----------------------------------------------------------
- | [Optimizing compiler]
- V
- ---------------------------------------------------
- | Object code with machine-specific optimizations |
- ---------------------------------------------------
- -----
- Output of optimizing compiler is algorithm for target machine.
- Algorithm designer could have targeted this machine directly.
- Why build a new designer as compiler composed with old designer?
- Advantages of this composition:
- 1. save designer's time in handling complex machines
- 2. save designer's time in handling many machines.
- The optimizing compiler is general-purpose, used by many designers.
- -----
- And the compiler designers say the results are great!
- Remember the typical quote: "We come so close to optimal on most architectures...
- We can only try to get little niggles here and there where the heuristics get
- slightly wrong answers."
- ...But they're wrong.
- Their results are becoming less and less satisfactory, despite clever compiler
- research; more CPU time for compilation; extermination of many targets.
- -----
- How the code base is evolving:
- Fastest code:
- hot spots targeted directly by algorithm designers, using domain-specific tools.
- Mediocre code:
- output of optimizing compilers; hot spots not yet reached by algorithm designers
- Slowest code:
- code with optimization turned off; so cold that optimization isn't worth the
- costs.
- -----
- ; Mediocre code slowly fades out, until...
- Fastest code:
- hot spots targeted directly by algorithm designers, using domain-specific tools.
- Slowest code:
- code with optimization turned off; so cold that optimization isn't worth the
- costs.
- -----
- 2013 Wang–Zhang–Zhang–Yi
- "AUGEM: automatically generate high performance dense linear algebra kernels on
- x86 CPUs":
- "Many DLA kernels in ATLAS are manually implemented in assembly by domain
- experts... Our template-based approach [allows] multiple machine-level
- optimizations in a domain/application-specific setting and allows the expert
- knowledge of how best to optimize varying kernels to be seamlessly integrated
- in the process."
- -----
- Why this is happening
- > The actual machine is evolving farther and farther away from the source
- machine.
- Minor optimization challenges:
- - Pipelining.
- - Superscalar processing.
- Major optimization challenges:
- - Vectorization.
- - Many threads; many cores.
- - The memory hierarchy; the ring; the mesh.
- - Larger-scale parallelism.
- - Larger-scale networking.
- -----
- CPU Design in a nutshell:
- ; Please just see slide 77-94ish -- I'm not drawing all of this out in ASCII.
- ; He just goes over some foundations of modern processor design, incl.
- ; pipelining
- -----
- "Vector" processing:
- Expand each 32-bit integer into an n-vector of 32-bit integers.
- - ARM "NEON" has n = 4
- - Intel "AVX2" has n = 8
- - Intel "AVX-512" has n = 16
- - GPUs have larger n.
- n× speedup if there are n arithmetic circuits and n read/write circuits.
- Benefit: amortizes instruction-decoding circuits.
- Huge effect on higher-level algorithms and data structures.
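- As a hedged sketch (not from the talk) of what n = 8 looks like from C, here is one Intel AVX2 instruction adding 8 pairs of 32-bit integers:

```c
#include <immintrin.h>  /* AVX2 intrinsics; compile with -mavx2 */
#include <stddef.h>
#include <stdint.h>

/* Add two arrays of 32-bit integers, 8 lanes per instruction
   (n = 8 in the slide's terms).  Assumes len is a multiple of 8. */
void vec_add(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t len)
{
    for (size_t i = 0; i < len; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi32(va, vb));
    }
}
```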
- -----
- How expensive is sorting?
- Input: array of n numbers.
- Each number in {1, 2, ..., n^2}, represented in binary.
- Output: array of n numbers, in increasing order, represented in binary
- *same multiset as input*
- Metric: seconds used by a circuit of area n^(1+o(1)).
- (For simplicity assume n = 4^k.)
- -----
- Spread the array across a square mesh of n small cells, each of area n^(o(1)), with
- near-neighbor wiring:
- ; diagram of a massive mesh of cells
- -----
- Sort all n cells in n^(0.5+o(1)) seconds:
- - Recursively sort quadrants in parallel, if n > 1.
- - Sort each column in parallel.
- - Sort each row in parallel.
- - Sort each column in parallel.
- - Sort each row in parallel.
- With proper choice of left-to-right/right-to-left for each row, can prove that
- this sorts whole array.
- ; Magical examples showing the power of parallel sorting... [slides 102-107]
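- A serial C sketch of this style of mesh sort, in the standard shearsort formulation (alternating serpentine row sorts and ascending column sorts on a k×k grid; on the mesh, each row/column sort runs in parallel). This is an illustration, not the talk's exact recursion:

```c
#include <stdlib.h>

/* qsort comparators: ascending and descending ints. */
static int asc(const void *a, const void *b)
{ int x = *(const int *)a, y = *(const int *)b; return (x > y) - (x < y); }
static int desc(const void *a, const void *b) { return asc(b, a); }

/* Shearsort a k x k grid (row-major) into snake order:
   ceil(log2 k)+1 rounds of row sorts in alternating directions,
   each followed by ascending column sorts. */
static void shearsort(int *g, int k)
{
    int rounds = 1;
    for (int t = 1; t < k; t <<= 1) rounds++;
    int *col = malloc((size_t)k * sizeof *col);
    for (int phase = 0; phase < rounds; phase++) {
        for (int r = 0; r < k; r++)   /* even rows ascend, odd rows descend */
            qsort(g + (size_t)r * k, (size_t)k, sizeof *g,
                  (r % 2 == 0) ? asc : desc);
        for (int c = 0; c < k; c++) { /* columns: top-to-bottom */
            for (int r = 0; r < k; r++) col[r] = g[r * k + c];
            qsort(col, (size_t)k, sizeof *col, asc);
            for (int r = 0; r < k; r++) g[r * k + c] = col[r];
        }
    }
    free(col);
}
```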
- -----
- Chips are in fact evolving towards having this much parallelism and
- communication.
- - GPUs: parallel + global RAM.
- - Old Xeon Phi: parallel + ring.
- - New Xeon Phi: parallel + mesh.
- Algorithm designers don't even get the right exponent without taking this into
- account.
- Shock waves into high levels of domain-specific algorithm design: e.g., for
- "NFS" factorization, replace "sieving" with "ECM"
- -----
- The future of compilers
- At this point many readers will say, "But he should only write P, and an
- optimizing compiler will produce Q." To this I say, "No, the optimizing compiler
- would have to be so complicated (much more so than anything we have now) that
- it will in fact be unreliable."
- I have another alternative to propose, a new class of software which will be far
- better.
- -----
- For 15 years or so I have been trying to think of how to write a compiler that
- really produces top quality code. For example, most of the Mix programs in my
- books are considerably more efficient than any of today's most visionary
- compiling schemes would be able to produce. I've tried to study the various
- techniques that a hand coder like myself uses, and to fit them into some
- systematic and automatic system. A few years ago, several students and I looked
- at a typical sample of FORTRAN programs [51], and we all tried hard to see how a
- machine could produce code that would compete with our best hand-optimized object
- programs. We found ourselves always running up against the same problem: the
- compiler needs to be in a dialog with the programmer; it needs to know
- properties of the data, and whether certain cases can arise, etc. And we
- couldn't think of a good language in which to have such a dialog.
- For some reason we all (especially me) had a mental block about optimization,
- namely that we always regarded it as a behind-the-scenes activity, to be done
- in the machine language, which the programmer isn't supposed to know. This veil
- was first lifted from my eyes... when I ran across a remark by Hoare [42] that
- ideally, a language should be designed so that an optimizing compiler can
- describe its optimizations in the source language. Of course! ...The time is
- clearly ripe for program-manipulation systems ...The programmer using such a
- system will write his beautifully-structured, but possibly inefficient, program
- P; then he will interactively specify transformations that make it efficient.
- Such a system will be much more powerful and reliable than a completely
- automatic one. ...As I say, this idea certainly isn't my own; it is so exciting
- I hope that everyone soon becomes aware of its possibilities.
- END
- -----