Jan 17th, 2023
running unary ops on pure ANE.
https://pytorch.org/maskedtensor/main/unary.html

rn it takes about 0.002857s to initiate a "neural network",
which notably includes frankensteining pages into DART tables.
with that handle (iommu ctx) alive, a single pass takes only
0.0004s for simple ops & 0.0020s for trained neural networks.
needs work, but multiple handles are fully possible.
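a quick back-of-the-envelope on why keeping the handle alive matters, using the two figures above (the constants are copied from this post; the helper is just illustrative arithmetic, not the driver API):

```python
# One-time "neural network" init vs. per-pass cost, from the
# timings above: ~0.002857s to init, ~0.0004s per simple-op pass
# once the iommu-ctx handle is alive.
INIT_S = 0.002857
PASS_S = 0.0004

def total_time(passes, reuse_handle=True):
    """Total seconds for `passes` simple-op passes."""
    if reuse_handle:
        return INIT_S + passes * PASS_S       # init once, reuse handle
    return passes * (INIT_S + PASS_S)         # re-init on every pass

print(f"100 passes, handle reused: {total_time(100):.4f}s")
print(f"100 passes, re-init each:  {total_time(100, False):.4f}s")
```

so over 100 passes, reusing the handle costs ~0.043s total while re-initiating every pass would cost ~0.33s, i.e. the init dominates unless it's amortized.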


total time (s) to run 100 batches of 2560x1600 fp16 mats
CPU: batch-looped numpy equivalent.
NPU: time from handle received, incl. each memcpy to chan.

         CPU            NPU
sqrt     3.812597077    0.106508271
rsqrt    4.678077946    0.108868716
pow2     2.689440505    0.112532208
log      3.836632627    0.119430223
atan     7.953391293    0.202238374
exp      3.633519893    0.107100501
sigmoid  11.141025684   0.117191892
tanh     8.965773849    0.110090375
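the CPU column reads like a straight batch-looped numpy run; a minimal sketch of that baseline (shape from the post, batch count cut down so it finishes quickly; the exact harness used here is an assumption):

```python
import time
import numpy as np

def bench_cpu(op, batches=4, shape=(2560, 1600)):
    """Time `batches` applications of a unary op on one fp16 matrix."""
    x = np.ones(shape, dtype=np.float16)
    t0 = time.perf_counter()
    for _ in range(batches):
        op(x)
    return time.perf_counter() - t0

for name, op in [("sqrt", np.sqrt), ("exp", np.exp), ("tanh", np.tanh)]:
    print(f"{name}: {bench_cpu(op):.4f}s")
```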


for certain dim/kernel permutations (still working on this),
tiling/DMA interleave is nonexistent because it's "fully packed"
from the ANE's pov, so input to the ANE == one C-contiguous block.
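the C-contiguity point is easy to see from the numpy side: a freshly allocated fp16 array is already one flat block, while a transposed view is not and would need a copy before it could be handed over as packed input (illustrative, not the actual driver path):

```python
import numpy as np

# A fresh fp16 matrix is one C-contiguous block.
x = np.empty((2560, 1600), dtype=np.float16)
assert x.flags['C_CONTIGUOUS']

# A transpose is only a strided view, not a flat block.
view = x.T
assert not view.flags['C_CONTIGUOUS']

# ascontiguousarray copies it back into a single contiguous buffer.
packed = np.ascontiguousarray(view)
assert packed.flags['C_CONTIGUOUS']
```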

hopefully it gets some good use for general computes :)

- eileen