- Measured on a Sandy Bridge i7. All reported numbers are thread cycles (QueryThreadCycleTime),
- best of 7 runs over the given interval.
- All runs go over the full set of floats in the given range. This biases the results towards
- very small values (since there are more of them) and is not a realistic distribution, so
- keep that in mind. Reported values are the total number of cycles and the average number of
- cycles spent per 4-vector processed.
- On SNB i7s, approx is barely faster; SNB has enough execution units to handle multiple
- cases in parallel, so the extra ops don't cost that much, it's all about the critical
- path (which is roughly the same length). This should look different on older i7s, Cores
- or P4s.
- normalized values: all floats in [2^(-14),2^20-1]
- =================================================
- scalar fox tk: 942325268 cycles = 13.22 / vec
- "exact": 727730077 cycles = 10.21 / vec
- approx: 718301917 cycles = 10.07 / vec
- denormal values: all floats in [2^(-25),2^(-14)-1] - DAZ/FZ (denormals are zero/flush to zero) flags OFF
- ====================================================
- scalar fox tk: 307415404 cycles = 13.33 / vec
- "exact": 3832946063 cycles = 166.15 / vec
- approx: 3742647226 cycles = 162.24 / vec
- denormal values: all floats in [2^(-25),2^(-14)-1] - DAZ/FZ (denormals are zero/flush to zero) flags ON
- ====================================================
- scalar fox tk: 304416821 cycles = 13.20 / vec
- "exact": 247482147 cycles = 10.73 / vec
- approx: 234123906 cycles = 10.15 / vec
- large range: all floats in [2^(-25),2^20-1] - DAZ/FZ (denormals are zero/flush to zero) flags OFF
- =================================================
- scalar fox tk: 1251434493 cycles = 13.26 / vec
- "exact": 4642612090 cycles = 49.19 / vec
- approx: 4459117788 cycles = 47.25 / vec
- large range: all floats in [2^(-25),2^20-1] - DAZ/FZ (denormals are zero/flush to zero) flags ON
- =================================================
- scalar fox tk: 1249673516 cycles = 13.24 / vec
- "exact": 953292988 cycles = 10.10 / vec
- approx: 949879436 cycles = 10.07 / vec
- ----------------
- // Test code:
- // int start = (127 - 14) << 23, end = (127 + 20) << 23;
- // int start = (127 - 15 - 10) << 23, end = (127 - 14) << 23;
- int start = (127 - 15 - 10) << 23, end = (127 + 20) << 23;
- static __m128i output[1024];
- HANDLE hThread = GetCurrentThread();
- uint64 best = ~0ull;
- // comment out next line to get benchmark without FZ/DAZ
- _mm_setcsr(_mm_getcsr() | 0x8040); // set FZ/DAZ flags
- for (int runs=0; runs < 7; runs++)
- {
-     __m128i vals = _mm_set_epi32(start + 3, start + 2, start + 1, start + 0);
-     __m128i incr = _mm_set1_epi32(4);
-     uint64 tstart, tend;
-     QueryThreadCycleTime(hThread, &tstart);
-     for (int i=start; i < end; i += 4)
-     {
- #if 0 // scalar
-         __m128i *p = &output[i & 1023];
-         p->m128i_u32[0] = float_to_half_foxtk(i + 0);
-         p->m128i_u32[1] = float_to_half_foxtk(i + 1);
-         p->m128i_u32[2] = float_to_half_foxtk(i + 2);
-         p->m128i_u32[3] = float_to_half_foxtk(i + 3);
- #else // SSE: flip between float_to_half_SSE2 and approx_float_to_half_SSE2 here
-         __m128i out = approx_float_to_half_SSE2(_mm_castsi128_ps(vals));
-         _mm_store_si128(&output[i & 1023], out);
- #endif
-         vals = _mm_add_epi32(vals, incr);
-     }
-     QueryThreadCycleTime(hThread, &tend);
-     uint64 time = tend - tstart;
-     if (time < best)
-         best = time;
- }
- printf("best: %llu cycles = %.2f / vec\n", (unsigned long long)best, 4.0 * best / (end - start));
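The constant 0x8040 in the test code combines the two MXCSR control bits: flush-to-zero (bit 15, 0x8000) and denormals-are-zero (bit 6, 0x0040). The test code sets them and never restores the old value, which is fine for a one-off benchmark. As a sketch of how the same flags could be scoped in longer-lived code, the block below names the bits and wraps them in a guard; `with_fz_daz` and `ScopedFzDaz` are hypothetical helpers, not something the original code uses.

```cpp
#include <cstdint>
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  // _mm_getcsr / _mm_setcsr
#define SKETCH_HAVE_MXCSR 1
#endif

// MXCSR control bits on x86 SSE.
constexpr std::uint32_t kFlushToZero = 0x8000;       // FZ, bit 15
constexpr std::uint32_t kDenormalsAreZero = 0x0040;  // DAZ, bit 6

// Returns the given MXCSR value with both FZ and DAZ enabled.
constexpr std::uint32_t with_fz_daz(std::uint32_t csr) {
    return csr | kFlushToZero | kDenormalsAreZero;
}

static_assert(with_fz_daz(0) == 0x8040, "same constant as in the test code");

#ifdef SKETCH_HAVE_MXCSR
// Hypothetical guard: enables FZ/DAZ for the current thread while in scope
// and restores the previous MXCSR value on destruction.
struct ScopedFzDaz {
    unsigned int saved;
    ScopedFzDaz() : saved(_mm_getcsr()) { _mm_setcsr(with_fz_daz(saved)); }
    ~ScopedFzDaz() { _mm_setcsr(saved); }
    ScopedFzDaz(const ScopedFzDaz&) = delete;
    ScopedFzDaz& operator=(const ScopedFzDaz&) = delete;
};
#endif
```

Declaring a `ScopedFzDaz` at the top of the timed region would then take the place of the `_mm_setcsr` line, and commenting it out would give the run without FZ/DAZ.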