Because so many people are complaining about formatting, I transcribed it into markdown. I hope that's more readable for people.

; Lines starting with ';' are comments. Transcribed in vim, so apologies if formatting is messed up subtly somewhere.

Begin
-----

The death of optimizing compilers

Daniel J. Bernstein

University of Illinois at Chicago & Technische Universiteit Eindhoven

-----

"Programmers waste enormous amounts of time thinking about,
or worrying about, the speed of noncritical parts of their
programs, and these attempts at efficiency actually have a strong
negative impact when debugging and maintenance are considered.
We should forget about small efficiencies, say about 97% of
the time; premature optimization is the root of all evil."
(Donald E. Knuth, "Structured programming with go to statements", 1974)

-----

The oversimplified story

Once upon a time:

- CPUs were painfully slow.
- Software speed mattered.
- Software was carefully hand-tuned in machine language.

Today:

- CPUs are so fast that software speed is irrelevant.
- "Unoptimized" is fast enough.
- Programmers have stopped thinking about performance.
- Compilers will do the same: easier to write, test, verify.

-----

The actual story:
Wait! It's not that simple.
Software speed still matters.
Users are often waiting for their computers.
To avoid unacceptably slow computations, users are often
limiting what they compute.

Example: In your favorite sword-fighting video game, are light reflections
affected realistically by sword vibration?

-----

; Random pictures showing things are approximated in graphics, rather than 100% accurate

-----

Old CPU displaying a file:

- 0ms: Start opening file
- 400ms: Start displaying contents
- 1200ms: Start cleaning up
- 1600ms: Finish.

; Due to CPU speedups, eventually progresses to:

- 0ms: Start opening file
- 200ms: Start displaying contents
- 600ms: Start cleaning up
- 800ms: Finish.

User displays bigger file:

- 0ms: Start opening file
- 200ms: Start displaying contents
- 1000ms: Start cleaning up
- 1200ms: Finish.

; Due to CPU speedups, progresses... etc.

-----

Cheaper computation => users process more data

- Performance issues disappear for most operations. (e.g. open file, clean up)
- Inside the top operations: Performance issues disappear for most subroutines.
- Performance remains important for occasional *hot spots*: small segments of
  code applied to tons of data.

-----

"Except, uh, a lot of people have applications whose profiles are mostly flat
because they've spent a lot of time optimizing them"

This view is obsolete.

Flat profiles are dying.

- Already dead for most programs.

Larger and larger fraction of code runs freezingly cold, while hot spots run
hotter.

Underlying phenomena:

- Optimization tends to converge.
- Data volume tends to diverge.

-----

Speed matters: an example

2015.02.23 CloudFlare blog post

"Do the ChaCha: better mobile performance with cryptography"

(boldface added): "Until today, Google services were the only major sites on the
Internet that supported this new algorithm. Now all sites on CloudFlare support
it, too. ChaCha20-Poly1305 is three times faster than AES-128-GCM on mobile devices.
Spending less time on decryption means faster page rendering and better battery
life... In order to support over a million HTTPS sites on our servers, we have
to make sure CPU usage is low. To help improve performance we are using an open
source assembly code version of ChaCha/Poly by CloudFlare engineer Vlad Krasnov
and others that has been optimized for our servers' Intel CPUs. This keeps the
cost of encrypting data with this new cipher to a minimum."

-----

Typical excerpt from inner loop of server code:

vpaddd $b, $a, $a
vpxor $a, $d, $d
vpshufb .rol16(%rip), $d, $d
vpaddd $d, $c, $c
vpxor $c, $b, $b
vpslld \$12, $b, $tmp
vpsrld \$20, $b, $b
vpxor $tmp, $b, $b

Mobile code similarly has heavy vectorization + asm.
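
; Transcriber's note, not in the talk: the excerpt above is the start of a ChaCha
; quarter-round, vectorized 8 ways with AVX2 (vpaddd = 32-bit add, vpxor = XOR,
; vpshufb / vpslld+vpsrld = rotations by 16 and 12). A rough scalar C sketch of the
; same operations, with names of my own choosing:

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t x, int n) {
    return (x << n) | (x >> (32 - n));
}

/* One ChaCha quarter-round on four state words; the assembly above
   corresponds to the first two lines, applied to 8 words per register. */
static void quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d) {
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}
```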

-----

Wikipedia: "By the late 1990s for even performance sensitive code, optimizing
compilers exceeded the performance of human experts."

The experts disagree, and hold the speed records.

Mike Pall, LuaJIT author, 2011:
"If you write an interpreter loop in assembler, you can do much better. There's
just no way you can reasonably expect even the most advanced C compilers to do
this on your behalf."

-----

"We come so close to optimal on most architectures that we can't do much more
without using NP complete algorithms instead of heuristics. We can only try to
get little niggles here and there where the heuristics get slightly wrong
answers."

"Which compiler is this which can, for instance, take Netlib LAPACK and run
serial Linpack as fast as OpenBLAS on recent x86-64? (Other common hotspots
are available.) Enquiring HPC minds want to know."

-----

; The algorithm designer's job

Context: What's the metric that we're trying to optimize?
> CS 101 view: "Time".

What exactly does this mean?
> Need to specify machine model in enough detail to analyze.

Simple defn of "RAM" model has pathologies: e.g., can factor integers in poly
"time".

With more work can build a more reasonable "RAM" model.

-----

Many other choices of metrics: space, cache utilization, etc.
Many physical metrics such as real time and energy defined by physical machines:
e.g.

- my smartphone
- my laptop
- a cluster
- a data center
- the entire Internet.

Many other abstract models.
e.g. Simplify: Turing machine.
e.g. Allow parallelism: PRAM.

-----

Output of algorithm design: an algorithm, i.e., a specification of instructions
for the machine.

Try to minimize cost of the algorithm in the specified metric (or combinations
of metrics).

Input to algorithm design: specification of function that we want to compute.
> Typically a simpler algorithm in a higher-level language: e.g., a mathematical
formula.

-----

Algorithm design is hard.

Massive research topic.

State of the art is extremely complicated.

Some general techniques with broad applicability (e.g., dynamic programming) but
most progress is heavily domain-specific:

- Karatsuba's algorithm
- Strassen's algorithm
- the Boyer–Moore algorithm
- the Ford–Fulkerson algorithm
- Shor's algorithm

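; Transcriber's note, not in the talk: to make "domain-specific" concrete, here is a
; minimal sketch of one level of Karatsuba's trick, using three 16x16-bit
; multiplications instead of four to multiply two 32-bit numbers. Real big-integer
; libraries apply the same recurrence to multi-word operands; all names here are mine.

```c
#include <stdint.h>
#include <stdio.h>

/* One level of Karatsuba: x*y via three half-width multiplications.
   Split x = x1*2^16 + x0 and y = y1*2^16 + y0; then
   x*y = z2*2^32 + z1*2^16 + z0 with z1 = (x1+x0)(y1+y0) - z2 - z0. */
uint64_t karatsuba32(uint32_t x, uint32_t y) {
    uint32_t x1 = x >> 16, x0 = x & 0xFFFF;
    uint32_t y1 = y >> 16, y0 = y & 0xFFFF;
    uint64_t z2 = (uint64_t)x1 * y1;                          /* high * high */
    uint64_t z0 = (uint64_t)x0 * y0;                          /* low * low   */
    uint64_t z1 = (uint64_t)(x1 + x0) * (y1 + y0) - z2 - z0;  /* cross terms */
    return (z2 << 32) + (z1 << 16) + z0;
}

int main(void) {
    uint32_t a = 123456789u, b = 987654321u;
    printf("%llu vs %llu\n",
           (unsigned long long)karatsuba32(a, b),
           (unsigned long long)((uint64_t)a * b));  /* both print the same product */
    return 0;
}
```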
-----

Wikipedia: "An optimizing compiler is a compiler that tries to minimize or
maximize some attributes of an executable computer program."

...So the algorithm designer (viewed as a machine) is an optimizing compiler?

Nonsense. Compiler designers have narrower focus. Example: "A compiler will not
change an implementation of bubble sort to use mergesort." — Why not?

-----

In fact, compiler designers take responsibility only for "machine-specific
optimization". Outside this bailiwick they freely blame algorithm designers:

--------------------------
| Function specification |
--------------------------
            | [Algorithm Designer]
            V
----------------------------------------------------------
| Source code with all machine-independent optimizations |
----------------------------------------------------------
            | [Optimizing compiler]
            V
---------------------------------------------------
| Object code with machine-specific optimizations |
---------------------------------------------------

-----

Output of optimizing compiler is algorithm for target machine.

Algorithm designer could have targeted this machine directly.

Why build a new designer as compiler composed with old designer?

Advantages of this composition:
1. save designer's time in handling complex machines
2. save designer's time in handling many machines.

Optimizing compiler is general-purpose, used by many designers.

-----

And the compiler designers say the results are great!

Remember the typical quote: "We come so close to optimal on most architectures...
We can only try to get little niggles here and there where the heuristics get
slightly wrong answers."

...But they're wrong.
Their results are becoming less and less satisfactory, despite clever compiler
research; more CPU time for compilation; extermination of many targets.

-----

How the code base is evolving:

Fastest code:
hot spots targeted directly by algorithm designers, using domain-specific tools.

Mediocre code:
output of optimizing compilers; hot spots not yet reached by algorithm designers.

Slowest code:
code with optimization turned off; so cold that optimization isn't worth the
costs.

-----

; Mediocre code slowly fades out, until...

Fastest code:
hot spots targeted directly by algorithm designers, using domain-specific tools.

Slowest code:
code with optimization turned off; so cold that optimization isn't worth the
costs.

-----

2013 Wang–Zhang–Zhang–Yi
"AUGEM: automatically generate high performance dense linear algebra kernels on
x86 CPUs":

"Many DLA kernels in ATLAS are manually implemented in assembly by domain
experts... Our template-based approach [allows] multiple machine-level
optimizations in a domain/application specific setting and allows the expert
knowledge of how best to optimize varying kernels to be seamlessly integrated
in the process."

-----

Why this is happening
> The actual machine is evolving farther and farther away from the source
machine.

Minor optimization challenges:

- Pipelining.
- Superscalar processing.

Major optimization challenges:

- Vectorization.
- Many threads; many cores.
- The memory hierarchy; the ring; the mesh.
- Larger-scale parallelism.
- Larger-scale networking.

-----

CPU Design in a nutshell:

; Please just see slides 77-94ish -- I'm not drawing all of this out in ASCII.
; He just goes over some foundations of modern processor design, incl.
; pipelining.

-----

"Vector" processing:
Expand each 32-bit integer into n-vector of 32-bit integers.

- ARM "NEON" has n = 4
- Intel "AVX2" has n = 8
- Intel "AVX-512" has n = 16
- GPUs have larger n.

n× speedup if n arithmetic circuits, n read/write circuits.
Benefit: amortizes insn circuits.

Huge effect on higher-level algorithms and data structures.
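
; Transcriber's note, not in the talk: a minimal C sketch of the n = 8 (Intel AVX2)
; case using compiler intrinsics, assuming an AVX2-capable CPU and -mavx2; one vpaddd
; adds eight 32-bit lanes at once, amortizing instruction fetch/decode over 8 results.
; Function names are mine.

```c
#include <immintrin.h>  /* AVX2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Scalar version: one 32-bit addition per loop iteration. */
void add_scalar(const int32_t *x, const int32_t *y, int32_t *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = x[i] + y[i];
}

/* Vector version: 8 additions per instruction (assumes n is a multiple of 8). */
void add_avx2(const int32_t *x, const int32_t *y, int32_t *out, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m256i vx = _mm256_loadu_si256((const __m256i *)(x + i));
        __m256i vy = _mm256_loadu_si256((const __m256i *)(y + i));
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_add_epi32(vx, vy));
    }
}
```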

-----

How expensive is sorting?
Input: array of n numbers.
Each number in {1, 2, ..., n^2}, represented in binary.

Output: array of n numbers, in increasing order, represented in binary;
*same multiset as input*.

Metric: seconds used by circuit of area n^(1+o(1)).
(For simplicity assume n = 4^k.)

-----

Spread array across square mesh of n small cells, each of area n^(o(1)), with
near-neighbor wiring:

; diagram of a massive mesh of cells

-----

Sort all n cells in n^(0.5+o(1)) seconds:

- Recursively sort quadrants in parallel, if n > 1.
- Sort each column in parallel.
- Sort each row in parallel.
- Sort each column in parallel.
- Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that
this sorts whole array.

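; Transcriber's note, not in the talk, and not the exact recursive schedule above: a
; sequential C simulation of the closely related shearsort, which also sorts an
; n-by-n mesh using nothing but alternating row and column sorts (about log2(n)+1
; rounds), leaving the data in boustrophedon ("snake") order. On a mesh-connected
; chip every row or column phase runs in parallel; here each phase is just a loop.
; All names are mine.

```c
#include <stdio.h>
#include <stdlib.h>

#define DIM 8                 /* mesh is DIM x DIM */

static int asc(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}
static int desc(const void *a, const void *b) { return asc(b, a); }

/* Shearsort: repeat ~log2(DIM)+1 times: sort rows in alternating directions,
   then sort every column ascending.  Result is sorted in snake order. */
static void shearsort(int grid[DIM][DIM]) {
    int rounds = 1;
    for (int d = DIM; d > 1; d >>= 1) rounds++;
    for (int r = 0; r < rounds; r++) {
        for (int i = 0; i < DIM; i++)             /* row phase */
            qsort(grid[i], DIM, sizeof(int), (i % 2 == 0) ? asc : desc);
        for (int j = 0; j < DIM; j++) {           /* column phase */
            int col[DIM];
            for (int i = 0; i < DIM; i++) col[i] = grid[i][j];
            qsort(col, DIM, sizeof(int), asc);
            for (int i = 0; i < DIM; i++) grid[i][j] = col[i];
        }
    }
}

int main(void) {
    int grid[DIM][DIM];
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            grid[i][j] = rand() % (DIM * DIM);
    shearsort(grid);
    /* read out in snake order: even rows left-to-right, odd rows right-to-left */
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            printf("%d ", grid[i][(i % 2 == 0) ? j : DIM - 1 - j]);
    printf("\n");
    return 0;
}
```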
; Magical examples showing the power of parallel sorting... [slides 102-107]

-----

Chips are in fact evolving towards having this much parallelism and
communication.

- GPUs: parallel + global RAM.
- Old Xeon Phi: parallel + ring.
- New Xeon Phi: parallel + mesh.

Algorithm designers don't even get the right exponent without taking this into
account.

Shock waves into high levels of domain-specific algorithm design: e.g., for
"NFS" factorization, replace "sieving" with "ECM".

-----

The future of compilers

At this point many readers will say, "But he should only write P, and an
optimizing compiler will produce Q." To this I say, "No, the optimizing compiler
would have to be so complicated (much more so than anything we have now) that
it will in fact be unreliable."

I have another alternative to propose, a new class of software which will be far
better.

-----

For 15 years or so I have been trying to think of how to write a compiler that
really produces top quality code. For example, most of the Mix programs in my
books are considerably more efficient than any of today's most visionary
compiling schemes would be able to produce. I've tried to study the various
techniques that a hand coder like myself uses, and to fit them into some
systematic and automatic system. A few years ago, several students and I looked
at a typical sample of FORTRAN programs [51], and we all tried hard to see how a
machine could produce code that would compete with our best hand-optimized object
programs. We found ourselves always running up against the same problem: the
compiler needs to be in a dialog with the programmer; it needs to know
properties of the data, and whether certain cases can arise, etc. And we
couldn't think of a good language in which to have such a dialog.

For some reason we all (especially me) had a mental block about optimization,
namely that we always regarded it as a behind-the-scenes activity, to be done
in the machine language, which the programmer isn't supposed to know. This veil
was first lifted from my eyes... when I ran across a remark by Hoare [42] that
ideally, a language should be designed so that an optimizing compiler can
describe its optimizations in the source language. Of course! ...The time is
clearly ripe for program-manipulation systems ...The programmer using such a
system will write his beautifully-structured, but possibly inefficient, program
P; then he will interactively specify transformations that make it efficient.
Such a system will be much more powerful and reliable than a completely
automatic one. ...As I say, this idea certainly isn't my own; it is so exciting
I hope that everyone soon becomes aware of its possibilities.

END
-----