devguru letter

Thank you all for your feedback. I can't wait to try these out on the weekend.

Until then, TODO:
LeeHowes, vmiura: use an atomic in the while() loop

"Workgroups are scheduled in batches on the GPU. So, Workgroups in Batch-1 can be infinitely spinning waiting for other Batches to complete. The other batches don't get scheduled until Batch-1 completes and that's a classic deadlock."
Yeah, this is how I'm dealing with it:
- Simply don't let the number of WorkGroups go above 2*NumberOfCUs (though it's weird that it didn't crash at CU*2+1).
(On GCN I could use the s_sleep instruction to let other waves increment and poll that flag, and take the 'complete path' with the glc flag that drallan mentioned earlier.)
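
To make the deadlock argument concrete, here is a toy model in plain Python (not GPU code). It only imitates the situation from the quote: a fixed number of workgroups can be resident at once, each one bumps a shared counter on arrival and then spins until the counter reaches the total. The function name, the `resident_slots` parameter and the CU count in the usage lines are all made up for illustration, and the real number of resident workgroups per CU depends on the hardware and on the kernel's resource usage.

```python
def run_global_barrier(num_groups, resident_slots, max_steps=1000):
    """Toy model: return True if every workgroup gets past the global barrier."""
    pending = list(range(num_groups))  # workgroups not yet scheduled
    resident = []                      # workgroups currently occupying a slot
    arrived = 0                        # stands in for the global atomic counter
    finished = 0

    for _ in range(max_steps):
        # Scheduler: fill free slots with pending workgroups.
        while pending and len(resident) < resident_slots:
            resident.append(pending.pop(0))
            arrived += 1               # each group increments the counter once

        # Spin loop: resident groups may leave only when ALL groups have arrived.
        if arrived == num_groups:
            finished += len(resident)
            resident.clear()

        if finished == num_groups:
            return True

    return False  # still spinning after max_steps: treat it as a deadlock


# Hypothetical device with 20 CUs, assuming 2 resident workgroups per CU.
num_cus = 20
slots = 2 * num_cus
ok_at_cap = run_global_barrier(slots, slots)
ok_above_cap = run_global_barrier(slots + 1, slots)
```

In this model the barrier completes at the cap and hangs with one workgroup more. That a real run survived CU*2+1 suggests the device can actually keep more than two workgroups per CU resident for this kernel, so the 2*CU limit is a conservative guess rather than a hard boundary.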

This time the bottleneck is LDS memory rather than processing power, so I hope that if I don't use that many waves, there will be no deadlocks.

The program I want to make will simulate waves in the strings of a virtual piano. It's basically like a 2D elastic water-surface effect, but in 1D and at 192K frames per second. That's why LDS is needed, as a fast and somewhat randomly accessible memory.
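
For readers who haven't seen the "water surface in 1D" idea, a minimal sketch of one time step is below. It assumes the standard explicit finite-difference scheme for the 1D wave equation with both ends of the string fixed, which is not necessarily the exact update used in the real simulation. `step_string`, `courant_sq` and the 16-point string in the usage lines are hypothetical names and values. `courant_sq` is (c*dt/dx)^2, where dt would be 1/192000 s at the frame rate mentioned above, and it must stay at or below 1 for the scheme to be stable.

```python
def step_string(prev, curr, courant_sq):
    """Advance a 1D string by one time step (explicit leapfrog scheme).

    prev, curr : displacements at the previous and current frame
    courant_sq : (c * dt / dx) ** 2, must be <= 1.0 for stability
    """
    n = len(curr)
    nxt = [0.0] * n  # endpoints stay 0.0: the string is fixed at both ends
    for i in range(1, n - 1):
        # Discrete second derivative in space, i.e. the pull from both neighbours.
        laplacian = curr[i - 1] - 2.0 * curr[i] + curr[i + 1]
        nxt[i] = 2.0 * curr[i] - prev[i] + courant_sq * laplacian
    return nxt


# Pluck a 16-point string in the middle and run a few frames.
prev = [0.0] * 16
prev[8] = 1.0
curr = list(prev)  # same shape on both frames: zero initial velocity
for _ in range(10):
    prev, curr = curr, step_string(prev, curr, 0.5)
```

Each point only reads its two neighbours from the current frame and itself from the previous frame, which is the kind of small, scattered access pattern that fits well into LDS.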
The synchronization will be used to let all the piano strings exchange vibrations with each other when the sustain pedal is pressed (still at 192 kHz).
I already did a simulation for 3..6 strings (that's only 1-2 keys pressed) on a single 3 GHz Phenom II core with the help of SSE, but I want all 200+ strings running simultaneously on a 1..2 TFLOPS GPU.