- running unary ops on pure ANE.
- https://pytorch.org/maskedtensor/main/unary.html
- right now it takes about 0.002857s to initialize a "neural network",
- which notably includes frankensteining pages into DART tables.
- with that handle (IOMMU ctx) alive, a single pass only takes
- 0.0004s for simple ops & 0.0020s for trained neural networks.
- multiple handles are fully possible, but that still needs work.
- total time (s) to run 100 batches of 2560x1600 fp16 matrices
- CPU: batch-looped numpy equivalent.
- NPU: time from when the handle is received, incl. each memcpy to the channel.
- sqrt
- CPU: 3.812597077
- NPU: 0.106508271
- rsqrt
- CPU: 4.678077946
- NPU: 0.108868716
- pow2
- CPU: 2.689440505
- NPU: 0.112532208
- log
- CPU: 3.836632627
- NPU: 0.119430223
- atan
- CPU: 7.953391293
- NPU: 0.202238374
- exp
- CPU: 3.633519893
- NPU: 0.107100501
- sigmoid
- CPU: 11.141025684
- NPU: 0.117191892
- tanh
- CPU: 8.965773849
- NPU: 0.110090375
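
For anyone wanting to reproduce the CPU column or read the numbers as ratios, here is a minimal sketch. The baseline code itself is not in this paste, so `CPU_OPS`, `time_cpu` and `speedups` are hypothetical names, the NumPy stand-in for each op is a guess (in particular, "pow2" is assumed to mean x**2), and the input range is arbitrary. Only the `REPORTED` totals come from the list above.

```python
import time
import numpy as np

# Hypothetical NumPy stand-ins for each op; the real baseline is not shown.
CPU_OPS = {
    "sqrt": np.sqrt,
    "rsqrt": lambda x: 1.0 / np.sqrt(x),
    "pow2": np.square,  # assumption: "pow2" means x**2 (it could also mean 2**x)
    "log": np.log,
    "atan": np.arctan,
    "exp": np.exp,
    "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
    "tanh": np.tanh,
}

# Totals in seconds copied from the list above: (CPU, NPU).
REPORTED = {
    "sqrt": (3.812597077, 0.106508271),
    "rsqrt": (4.678077946, 0.108868716),
    "pow2": (2.689440505, 0.112532208),
    "log": (3.836632627, 0.119430223),
    "atan": (7.953391293, 0.202238374),
    "exp": (3.633519893, 0.107100501),
    "sigmoid": (11.141025684, 0.117191892),
    "tanh": (8.965773849, 0.110090375),
}


def time_cpu(op, batches=100, shape=(2560, 1600), seed=0):
    """Total seconds to apply `op` to one fp16 matrix `batches` times."""
    rng = np.random.default_rng(seed)
    # Positive inputs keep sqrt/log well defined; the range is arbitrary.
    x = rng.uniform(0.5, 2.0, size=shape).astype(np.float16)
    start = time.perf_counter()
    for _ in range(batches):
        op(x)
    return time.perf_counter() - start


def speedups(reported=REPORTED):
    """CPU time divided by NPU time for each op."""
    return {name: cpu / npu for name, (cpu, npu) in reported.items()}


if __name__ == "__main__":
    # Small shape so the demo finishes quickly; use the defaults for a real run.
    for name, op in CPU_OPS.items():
        print(f"{name}: {time_cpu(op, batches=5, shape=(256, 160)):.6f}s")
    for name, ratio in speedups().items():
        print(f"{name}: {ratio:.1f}x")
```

On the reported totals the ratios run from roughly 24x (pow2) to roughly 95x (sigmoid). Absolute CPU times from this sketch will differ by machine and NumPy build.
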
- for certain dim/kernel permutations (still working on this),
- tiling/DMA interleave is nonexistent because the data is "fully packed"
- from the ANE's pov, so input to the ANE == a C-contiguous block.
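
On the host side, the "C-contiguous block" condition can be checked and enforced with plain NumPy before the memcpy. This is only a sketch of that idea, not the driver code: `as_packed_fp16` is a hypothetical helper, and nothing here touches the ANE.

```python
import numpy as np


def as_packed_fp16(x):
    """Return x as a C-contiguous float16 array, copying only when needed."""
    return np.ascontiguousarray(x, dtype=np.float16)


a = np.ones((2560, 1600), dtype=np.float16)
view = a.T  # transposed view: same memory, but no longer C-contiguous

packed = as_packed_fp16(view)
print(view.flags["C_CONTIGUOUS"], packed.flags["C_CONTIGUOUS"])
print(packed.nbytes)  # bytes in the packed block that would be copied
```

An array that is already packed fp16 passes through without a copy, so the helper is cheap to call unconditionally.
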
- hopefully it gets some good use for general computes :)
- - eileen