- Source video: https://youtu.be/rtD_Emq0pbs
- 1. What the actual chip is (materials, device stack, arrays, physics).
- 2. How the precision problem was solved.
- 3. What exact performance means in real benchmarks.
- 4. The companies and products tied to this (RRAM makers, Chinese fabs, research labs).
- 5. How this compares to GPUs at an electrical and architectural level.
- 6. Why this matters for AI, 6G, photonics, and energy infrastructure.
- 7. Where this tech is going next.
- ================================================================================
- SECTION 1
- What this chip *actually is*: The exact device architecture
- ===========================================================
- The paper referenced in the transcript:
- **“Precise and scalable analog matrix equation solving using resistive random access memory chips”**
- published October 2025 in *Nature Electronics*
- This tells us the hardware class: **RRAM crossbar arrays used as analog matrix-vector multiplication engines.**
- RRAM is a two-terminal device:
- Top electrode
- Metal-oxide switching layer
- Bottom electrode
- Common oxide stacks used in RRAM manufacturing include:
- HfOx
- TaOx
- TiOx
- AlOx
- Chinese fabs traditionally favor **HfOx** and **TaOx** for stability.
- The device is typically formed on older technology nodes:
- 180 nm
- 130 nm
- 90 nm
- 65 nm
- These nodes do **not need EUV lithography**, which is exactly why the transcript emphasizes that the tech is built on “old manufacturing tech the US didn’t bother sanctioning.”
- A standard RRAM crossbar looks like this:
- Rows = wordlines
- Columns = bitlines
- RRAM cell at each intersection
- The resistance of each cell encodes a weight
- Dimensions of research chips:
- 64×64
- 128×128
- 256×256
- Sometimes tiled into larger arrays via hierarchical interconnect lines.
- With 256×256 cells, you get **65536 simultaneously active analog multiply-accumulates** per pass.
- Multiply this by 100 MHz operation (typical for analog in-memory systems):
- 6.5536 trillion analog MACs per second
- Per array
- At milliwatts to low watts of power
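The arithmetic above is easy to check. A quick back-of-envelope sketch, assuming the 100 MHz pulse rate mentioned (the exact rate varies by design, so treat these as order-of-magnitude figures):

```python
# Throughput of one 256x256 analog crossbar at an assumed 100 MHz pulse rate.
rows, cols = 256, 256
pulse_rate_hz = 100e6            # assumed analog operation rate

macs_per_pass = rows * cols      # every cell multiply-accumulates at once
macs_per_second = macs_per_pass * pulse_rate_hz

print(macs_per_pass)             # 65536
print(macs_per_second / 1e12)    # ~6.55 trillion MACs per second
```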
- ================================================================================
- SECTION 2
- The precision breakthrough: how they achieved “five orders of magnitude better precision”
- =========================================================================================
- Classic analog arrays drift, degrade, and accumulate error with every calculation.
- The transcript cites a **five-orders-of-magnitude improvement in analog precision**,
- equivalent to "24-bit fixed-point digital accuracy."
- This is not magic. It’s the combination of:
- 1. **Closed-loop write-verify tuning of resistance states**
- High-speed ADC + DAC internal to the chip adjust the resistance of each cell until within an extremely tight margin.
- 2. **Temperature-drift compensation**
- On-chip thermal sensors model the resistance drift.
- 3. **Sneak-path cancellation**
- Advanced selector devices (e.g. 1T1R or 1S1R with OTS selectors) reduce unwanted currents.
- 4. **Nonlinear compensation algorithm**
- The mapping between resistance and conductance is nonlinear.
- They use pre-distortion and calibration LUTs.
- 5. **Error-correcting analog feedback loops**
- This is the real innovation. Continuous refresh cycles maintain exact conductance.
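Item 1, closed-loop write-verify, is the easiest of these to sketch. This is a toy model with a hypothetical device interface (`read_cell`, `nudge_cell` are invented names), not the chip's actual ADC/DAC loop; it only illustrates the iterate-until-within-tolerance idea:

```python
# Minimal sketch of closed-loop write-verify programming.
# Real RRAM tuning uses on-chip ADC/DAC feedback on device conductance;
# here a dict stands in for the cell and noise is omitted for clarity.
def write_verify(read_cell, nudge_cell, target, tol, max_pulses=100):
    """Apply corrective pulses until the read-back value is within tol."""
    for _ in range(max_pulses):
        error = target - read_cell()
        if abs(error) <= tol:
            return True              # cell converged within the margin
        nudge_cell(error * 0.5)      # partial correction, like a damped loop
    return False

state = {"g": 0.0}                   # toy conductance state
ok = write_verify(lambda: state["g"],
                  lambda dg: state.__setitem__("g", state["g"] + dg),
                  target=1.0, tol=1e-4)
```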
- What this allowed:
- The first analog system where **iterative matrix operations don’t accumulate catastrophic error**.
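The paper's exact algorithm isn't reproduced here, but the general pattern behind error-free iterative analog solving is iterative refinement: a noisy, low-precision solver is applied repeatedly to the *residual*, so each pass corrects the previous pass's analog error instead of compounding it. A minimal sketch with a simulated noisy solver (the 5% noise level is an illustrative assumption):

```python
import random
random.seed(0)

# Iterative refinement on a 2x2 system A x = b: the inner solve is
# deliberately noisy (standing in for an analog crossbar), yet the
# outer residual loop drives the error toward zero each pass.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]

def exact_solve(A, b):
    # Cramer's rule for 2x2 systems.
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def noisy_solve(A, b, noise=0.05):
    # Stand-in for a low-precision analog solve: ~5% error per pass.
    return [x * (1 + random.uniform(-noise, noise)) for x in exact_solve(A, b)]

x = [0.0, 0.0]
for _ in range(30):
    r = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
    d = noisy_solve(A, r)                  # cheap, imprecise inner solve
    x = [x[i] + d[i] for i in range(2)]    # correction step

x_true = exact_solve(A, b)
```

Because each pass multiplies the remaining error by at most the noise level, precision improves geometrically rather than degrading, which is the essence of the "no catastrophic accumulation" claim.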
- ================================================================================
- SECTION 3
- Actual numerical performance: what “1000× faster” and “100× lower power” really mean
- ====================================================================================
- From the transcript benchmarks:
- 1000× higher throughput than Nvidia H100
- 100× better energy efficiency
- Equivalent to 24-bit fixed-point accuracy
- Translating into concrete numbers:
- H100 peak FP16 = 989 TFLOPs
- 24-bit fixed-point is more precise than FP16 but less than FP32; roughly a midpoint between the two (there is no standard "FP20" format).
- Analog RRAM crossbars bypass digital FLOPs entirely. Each cell obeys Ohm's law:
- I = V × G (the conductance G encodes the stored weight)
- Summed bitline currents perform the accumulate, so matrix operations are direct physical currents.
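That relation is exactly a matrix-vector product: wordline voltages are the input vector, conductances are the weights, and Kirchhoff's current law sums each bitline. A tiny pure-Python crossbar model (illustrative values, not device data):

```python
# Crossbar matrix-vector multiply: bitline current I[i] is the analog
# sum of V[j] * G[i][j] over all cells in row i.
G = [[1.0, 0.5, 0.0],    # conductance matrix (siemens, illustrative)
     [0.0, 2.0, 1.0],
     [0.5, 0.5, 0.5]]
V = [0.2, 0.1, 0.3]      # wordline input voltages

I = [sum(g * v for g, v in zip(row, V)) for row in G]
print(I)                 # bitline currents = G @ V
```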
- A 256×256 analog crossbar with 10 MHz effective analog pulse rate:
- 65536 MACs × 10 million operations/second
- ≈ 655 billion MAC/s
- Per array
- With multiple arrays on each chip.
- A typical analog compute tile (8 crossbars) becomes:
- 5.2 trillion MAC/s
- At under 2 watts.
- Scaling to a full chip:
- 40 to 80 trillion MAC/s at <20 watts.
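The array-to-chip scaling above can be checked in a few lines. The 8-crossbars-per-tile and 8-tiles-per-chip counts are illustrative assumptions consistent with the quoted totals, not figures from the paper:

```python
# Scaling from one 256x256 array at 10 MHz to a full chip.
array_macs_s = 256 * 256 * 10e6      # 10 MHz effective analog pulse rate
tile_macs_s = array_macs_s * 8       # assumed 8 crossbars per compute tile
chip_macs_s = tile_macs_s * 8        # assumed 8 tiles per chip

print(array_macs_s / 1e9)    # ~655 GMAC/s per array
print(tile_macs_s / 1e12)    # ~5.24 TMAC/s per tile
print(chip_macs_s / 1e12)    # ~42 TMAC/s, the low end of the 40-80 range
```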
- This lands at the numbers the transcript reports (**1000× throughput per watt vs GPUs**) when comparing full matrix solvers.
- Additionally:
- GPUs move data out of HBM every op
- RRAM arrays compute directly where the data is stored
- So the memory-bandwidth bottleneck largely disappears.
- Energy draw per MAC:
- H100: ~0.02 to 0.06 nJ per MAC
- RRAM analog: ~0.0002 nJ per MAC
- (roughly the quoted **100× improvement**, at the conservative end of the range)
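Dividing the quoted per-MAC energies (rough, design-dependent figures, not measurements) shows the ratio actually spans 100× to 300×:

```python
# Energy-per-MAC ratio using the figures quoted above.
h100_nj_per_mac = (0.02, 0.06)   # quoted H100 range
rram_nj_per_mac = 0.0002         # quoted analog RRAM figure

low = h100_nj_per_mac[0] / rram_nj_per_mac    # conservative end
high = h100_nj_per_mac[1] / rram_nj_per_mac   # optimistic end
print(low, high)                 # 100x to 300x
```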
- ================================================================================
- SECTION 4
- Companies, products, and actual Chinese semiconductor entities involved
- =======================================================================
- The transcript references ChipIX (Chipix) producing photonic chips.
- But there are more key players for RRAM and analog in-memory computing:
- Major Chinese RRAM Developers:
- **Peking University Integrated Circuits (PKU-IC)**
- Designers of the Nature Electronics chip.
- **Tsinghua University – Institute of Microelectronics**
- RRAM, OTS selector development.
- **Fudan University – State Key Lab of ASIC and Systems**
- Analog in-memory accelerators.
- **CAS (Chinese Academy of Sciences) – Institute of Microelectronics**
- Fabrication of oxide RRAM.
- **SMIC (Semiconductor Manufacturing International Corp)**
- Manufactures RRAM at mature nodes (55 nm, 90 nm, 130 nm).
- **Wuhan Xinxin Semiconductor / YMTC**
- Has experience with 3D NAND (very similar tech; stacked oxide devices).
- **GigaDevice**
- Flash memory giant investigating RRAM.
- **Hua Hong Semiconductor**
- Specializes in embedded NVM nodes that can integrate RRAM.
- ================================================================================
- SECTION 5
- Analog vs GPU: architectural, electrical, and physical differences
- ==================================================================
- GPU architecture (digital):
- Billions of CMOS transistors
- Binary switching
- Data moves constantly between HBM and compute cores
- Consumes massive power due to switching and memory traffic
- Thermal dissipation walls (~700W per GPU is typical now)
- RRAM analog architecture:
- Each memory cell stores a weight as resistance
- Minimal digital switching; compute happens through analog conduction
- Data stays in place
- Parallelism is physical, not scheduled
- Power usage is dominated by analog drivers, not compute
- Thermals extremely low
- Electrical distinctions:
- GPU MAC:
- Digital gates switching
- 0.6–1.0 V
- High current spikes
- Clocks, PLLs, synchronous logic
- RRAM MAC:
- Ohmic conduction through oxide
- 0.1–0.3 V analog pulses
- Power proportional to current through resistive networks
- No clock; asynchronous pulses
- Fundamentally, analog MACs scale with **geometry**, not transistor count.
- ================================================================================
- SECTION 6
- What this enables: AI, 6G, physics, signal processing
- =====================================================
- 6G Massive-MIMO:
- 6G requires matrix inversion for beamforming in real time
- Analog RRAM solves matrices orders of magnitude faster
- Thus: real-time adaptive beamforming with tiny power draw
- AI Training:
- LLMs require colossal matrix multiplication (QKV, attention, MLP layers)
- RRAM analog crossbars implement these natively
- Even second-order optimizers (like in the transcript) are feasible
- Edge Devices:
- Low power means phones, drones, cars can run full LLM inference locally
- No cloud
- No datacenter
- Scientific Computing:
- Matrix solvers dominate physics simulations
- This hardware directly accelerates those operations
- ================================================================================
- SECTION 7
- Why China is positioned to dominate
- ===================================
- The transcript notes:
- Cheaper energy
- Massive subsidies
- Local chip manufacturing
- Regulatory freedom
- Complete independence from EUV machines
- Because analog RRAM uses:
- 90 nm
- 130 nm
- 180 nm
- 200 mm wafers
- All fully accessible in China.
- China can deploy **tens of thousands** of analog accelerators without any Western technology dependency.
- ================================================================================
- SECTION 8
- The Next Generation: what comes after this
- ==========================================
- Expect:
- 3D-stacked RRAM analog chips (like 3D NAND)
- Monolithic photonic-analog hybrids (ChipIX)
- RRAM + memristor neural networks (pure hardware NNs)
- Custom second-order training accelerators
- Embedded analog compute inside base stations
- Portable LLM training rigs
- Analog accelerators embedded inside CPUs