running unary ops on pure ANE.
https://pytorch.org/maskedtensor/main/unary.html

right now it takes about 0.002857s to set up a "neural network",
which notably includes frankensteining pages into DART tables.
with that handle (iommu ctx) alive, a single pass takes only
0.0004s for simple ops and 0.0020s for trained neural networks.
needs work, but multiple handles are fully possible.
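
to make the "keep the handle alive" point concrete, here's a tiny
back-of-envelope sketch. it assumes setup is paid once per handle and
every pass costs the same, which is a simplification and not something
measured separately. the constant and function names are made up for
illustration.

```python
# figures quoted above, in seconds
INIT_S = 0.002857         # one-time handle setup
PASS_SIMPLE_S = 0.0004    # one pass, simple op
PASS_TRAINED_S = 0.0020   # one pass, trained network


def amortized_per_pass(n_passes, pass_s, init_s=INIT_S):
    """average seconds per pass when one handle serves n_passes passes.

    assumption: setup cost is paid once, per-pass cost is constant.
    """
    if n_passes < 1:
        raise ValueError("n_passes must be >= 1")
    return (init_s + n_passes * pass_s) / n_passes
```

with a single pass the setup dominates; after a few hundred passes on
the same handle the average is close to the bare per-pass figure.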


total time (s) to run 100 batches of 2560x1600 fp16 matrices.
CPU: batch-looped numpy equivalent.
NPU: time from when the handle is received, incl. each memcpy to the channel.

op            CPU (s)      NPU (s)
sqrt      3.812597077  0.106508271
rsqrt     4.678077946  0.108868716
pow2      2.689440505  0.112532208
log       3.836632627  0.119430223
atan      7.953391293  0.202238374
exp       3.633519893  0.107100501
sigmoid  11.141025684  0.117191892
tanh      8.965773849  0.110090375
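
for reference, the CPU side could look roughly like the sketch below.
this is a guess at the shape of a "batch-looped numpy equivalent", not
the actual script: the op definitions, the input range, and reusing one
input matrix for every batch (to keep memory small) are all assumptions.
pow2 is left out because the note doesn't say whether it means x**2 or
2**x.

```python
import time

import numpy as np

# assumption: these numpy expressions stand in for the ops listed above;
# the original definitions may differ
UNARY_OPS = {
    "sqrt": np.sqrt,
    "rsqrt": lambda x: 1.0 / np.sqrt(x),
    "log": np.log,
    "atan": np.arctan,
    "exp": np.exp,
    "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
    "tanh": np.tanh,
}


def time_cpu(op, batches=100, shape=(2560, 1600), seed=0):
    """time `batches` sequential applications of `op` to one fp16 matrix.

    returns (elapsed_seconds, last_output).
    """
    rng = np.random.default_rng(seed)
    # positive inputs keep sqrt/log free of NaNs
    x = rng.uniform(0.5, 2.0, size=shape).astype(np.float16)
    out = x
    start = time.perf_counter()
    for _ in range(batches):
        out = op(x)
    return time.perf_counter() - start, out
```

calling `time_cpu(UNARY_OPS["tanh"])` runs the full 100 x 2560x1600 case;
absolute numbers will depend on the machine and numpy build.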


for certain dim/kernel permutations (still working on this),
there is no tiling/DMA interleave because the data is "fully packed"
from the ane's pov, so the input to the ane is just a C-contiguous block.
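
on the host side, "C-contiguous block" is easy to check with numpy.
the sketch below only covers that host-side packing step; the actual
memcpy to the channel isn't shown, and `pack_for_channel` is a made-up
name, not part of any real API.

```python
import numpy as np


def pack_for_channel(arr):
    """return `arr` as fp16 data laid out as one C-contiguous block.

    hypothetical helper: copies only if the input isn't already
    fp16 and C-contiguous.
    """
    packed = np.ascontiguousarray(arr, dtype=np.float16)
    assert packed.flags["C_CONTIGUOUS"]
    return packed


# a transposed view has the right shape but strided memory,
# so it needs packing before a flat memcpy makes sense
view = np.zeros((1600, 2560), dtype=np.float16).T
packed = pack_for_channel(view)
```

here `view.flags["C_CONTIGUOUS"]` is false while `packed` is one flat
block of 2560 * 1600 * 2 bytes.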

hopefully it gets some good use for general computes :)

- eileen