T239 (NX2 SoC) Detail Sheet - Revision 1.3

Confirmed Details

T239 General Overview
T239 (codenamed Drake) has many similarities to the Tegra T234 (Nvidia Orin). However, it is not the same chip, nor is it a cut-down version of Orin. It is a separate SoC custom-built for Nintendo. The following confirmed details come directly from the Nvidia hack in Q1 2022 which leaked the NVN2 API, or from Linux kernel updates that Nvidia failed to scrub from online repositories.

CPU:
1 cluster of 8x ARM Cortex-A78C cores
Either 32KB L1 data cache/32KB L1 instruction cache + 256KB L2 cache per core, with an optional 512KB-8MB of shared L3 cache for the cluster, OR 64KB L1 data cache/64KB L1 instruction cache + 512KB L2 cache per core, with 4 or 8MB of shared L3 cache
Each core has approximately the IPC of Zen 2 at iso-frequency, and roughly 3x the IPC of the A57 cores found in the Tegra X1 (Switch 1 SoC) at iso-frequency

GPU:
1 GPC/12 SM Nvidia Ampere (GA10F)
1536 CUDA Cores, 12 RT Cores, 48 Tensor Cores
DLSS 2 and Ray Tracing support (documented in NVN2 API)
Either 1MB or 4MB L2 cache (NVN2 has conflicting details across 2 separate documents)
660MHz SM frequency at Power Level 3 (4.2W power draw), 2 TFLOPs FP32
1.125GHz SM frequency at PL1 (9.3W power draw), 3.456 TFLOPs FP32
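The FP32 figures above follow directly from the CUDA core count and the leaked clocks; a minimal sketch, assuming the usual 2 FLOPs per CUDA core per clock (one fused multiply-add):

```python
# Sketch: reproducing the FP32 figures above from the confirmed CUDA core count and
# clocks, assuming 2 FLOPs per CUDA core per clock (one FMA).
CUDA_CORES = 1536

def fp32_tflops(clock_ghz: float, cores: int = CUDA_CORES) -> float:
    return cores * 2 * clock_ghz / 1000

print(f"PL3 (660 MHz):   {fp32_tflops(0.660):.2f} TFLOPs")   # ~2.03
print(f"PL1 (1.125 GHz): {fp32_tflops(1.125):.3f} TFLOPs")   # 3.456
```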

Memory Bus:
128-bit LPDDR5

Storage:
Internal: eUFS 3.1 or 3.0

Accelerators:
Optical Flow Accelerator
Same as the OFA in Orin, faster than desktop Ampere GPU’s OFA
No DLSS 3 support in NVN2, therefore Frame Generation is exceedingly unlikely
File Decompression Engine (FDE)
Dedicated storage decompression accelerator
Reduces (or potentially entirely eliminates) CPU cycles used for asset decompression
Increases effective speed of internal (and possibly external) storage
AV1 Encode/Decode


Speculative/Unconfirmed; High Confidence

CPU
64KB L1 data cache/64KB L1 instruction cache + 256KB L2 cache per core, 4MB of shared L3 cache
Supporting evidence: “L1 Cache: 64 KB L1 instruction cache (I-cache) + 64 KB L1 data cache (D-cache) per CPU core | L2 Cache: 256 KB per CPU core | L3 Cache: 2MB per CPU cluster” Nvidia Orin AGX Data Sheet https://developer.download.nvidia.com/assets/embedded/secure/jetson/agx_orin/Jetson_AGX_Orin_Series_Data_Sheet_DS-10662-001_v1.5.pdf?PXRFHDiThyoOIPZrUw5IJLSySAN4ixkTuphV9CvVufKhPi1eQZTjh-tR8gDW3QpW4776pGJF3S0TTfCKg4xPNT3nZJYximzOmsLf7JlTZ9pDEcA3LbeKbdQaXF81GDrrRsHS7AzER2yUY7BwCHFvORZ5xu00XV-lZoIyrVffJadrTb7AqavE9yqtlWAZmAmQT2zXYCgXJ4-7FS6LCQ==&t=eyJscyI6IndlYnNpdGUiLCJsc2QiOiJkZXZlbG9wZXIubnZpZGlhLmNvbS9lbWJlZGRlZC9kb3dubG9hZHMjP3NlYXJjaD1EYXRhJTIwU2hlZXRcdTAwMjZ0eD0kcHJvZHVjdCxqZXRzb25fYWd4X29yaW4samV0c29uX29yaW5fbngsamV0c29uX29yaW5fbmFubyJ9 (page 6)
Rationale: Each A78AE core in Orin has the above L1/L2 cache. Orin has 3 clusters of 4 Cortex-A78AE cores, with each cluster having 2MB of shared L3. For 8 cores in a single cluster (like the A78C cluster in T239), 4MB is the most likely L3 cache size
Clockspeed: Between 1.3 and 1.9 GHz. The most likely frequency range is 1.5-1.7 GHz
Supporting evidence:
A78 Power Draw:
@ 2.1GHz = 0.49W/core; 8x = 3.92 W
@ 1.9GHz = 0.40W/core; 8x = 3.20 W
@ 1.3GHz = 0.19W/core; 8x = 1.52 W

Switch 1 cluster of 4x A57 cores power draw = ~1.83W (for TX1 on TSMC 20nm)
Rationale: Holding the 8x Cortex-A78C cluster to roughly 2W, which is about the same power draw as the Tegra X1's 4x A57 cluster, lands the per-core frequency at around 1.5-1.7GHz.
Performance figures
Compared to the Switch 1: The Tegra X1 in the Nintendo Switch has 4x Cortex-A57 cores clocked at 1020MHz. The Switch-Next has double the CPU cores, each A78C core has 3x the IPC of an A57 core, and the CPU frequency has increased to ~1.6GHz (averaging the likely 1.5-1.7 GHz range). The Switch-Next therefore achieves roughly 10x the CPU performance of the original Switch
Compared to current gen home consoles (PS5, Xbox Series X/S): We will specifically compare against the Xbox Series S CPU; however, all of the aforementioned consoles have roughly equivalent CPU performance. The XSS has 8x AMD Zen 2 CPU cores clocked at 3.6 GHz without SMT (simultaneous multithreading), or 3.4 GHz with SMT. Since the A78C cores do not support SMT (as is the case with nearly all ARM cores), we will compare against the 3.6 GHz figure, which in any case offers roughly similar performance to the 3.4 GHz SMT figure. Each A78C core has slightly higher IPC than a Zen 2 core (about 5-10% more), and the core counts match (8 vs 8). However, our clock speed is slightly under half that of the XSS CPU. Given all other factors are equal or approximately the same, I would place the CPU in the Switch-Next at about ½ the performance of the current gen consoles at the high end, or, for a more conservative estimate, closer to ⅖ the CPU performance. Both scaling estimates are sketched below.
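A minimal sketch of the cores x IPC x clock scaling used in the two comparisons above (the IPC ratios are this document's estimates, not measured values):

```python
# Sketch: back-of-envelope CPU scaling = cores ratio x IPC ratio x clock ratio.
# IPC ratios are this document's estimates, not measurements.

def relative_perf(cores_ratio: float, ipc_ratio: float, clock_ratio: float) -> float:
    return cores_ratio * ipc_ratio * clock_ratio

# vs. Switch 1: 4x A57 @ 1.02 GHz -> 8x A78C @ ~1.6 GHz, ~3x IPC per core
print(f"vs. Switch 1: ~{relative_perf(8 / 4, 3.0, 1.6 / 1.02):.1f}x")   # ~9.4x ("roughly 10x")

# vs. Xbox Series S: 8x Zen 2 @ 3.6 GHz, A78C IPC ~1.00-1.10x Zen 2
high_end     = relative_perf(8 / 8, 1.10, 1.7 / 3.6)   # ~0.52 -> about 1/2
conservative = relative_perf(8 / 8, 1.00, 1.5 / 3.6)   # ~0.42 -> about 2/5
print(f"vs. XSS: ~{conservative:.2f}x to ~{high_end:.2f}x")
```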


GPU
PL3 = handheld mode. This gives us 2 TFLOPs of compute in handheld mode
PL1 = docked mode. This gives us ~3.5 TFLOPs of compute while the Switch-Next is docked
4MB L2 cache (instead of 1MB)
Supporting evidence:
T234's 16 SM GPU (GA10B) has 4MB of L2 cache
Rationale:
Lowers GPU power consumption by decreasing memory accesses. Boosts effective memory bandwidth, which is crucial for a relatively low-bandwidth (~100GB/s) LPDDR5 memory system. Improves RT and Tensor Core performance relative to a GPU with less L2 per SM. A die size at or under 100mm² can still easily be maintained with an extra 3MB of L2, and this is a much cheaper solution than utilizing LPDDR5X (both in the cost of the memory modules and in redesigning the ported Orin memory controllers to support LPDDR5X)


Memory
12GB-16GB of LPDDR5-6400, with 12 GB far more likely (2x 48Gb LPDDR5 modules, each 64 bits wide / 6GB)
102GB/s maximum memory bandwidth (see the sketch after this section)
Supporting evidence:
Nvidia Orin uses LPDDR5 6400. “The DRAM supports a max clock speed of 3200 MHz, with 6400 Gbps per pin” NVIDIA Jetson AGX Orin Series Technical Brief v1.2 (page 10)
https://www.nvidia.com/content/dam/en-zz/Solutions/gtcf21/jetson-orin/nvidia-jetson-agx-orin-technical-brief.pdf
Slower LPDDR5 5500 is overwhelmingly end-of-life, with Samsung being the only manufacturer of 48Gbit LPDDR5 5500 still in production
Rationale:
Reuses the same 64-bit memory controllers as Orin (which has a 256-bit bus built from 4x 64-bit controllers; T239 uses two of them). Reduces SoC development time and cost.
The cheapest memory modules (in cost per GB) are those produced in the highest volume, which in this case is 48Gb LPDDR5 6400 rather than 48Gb LPDDR5 5500. 5500MT/s modules could also not be dual-sourced from Micron, and it is unlikely that Nvidia would choose 5500MT/s-rated LPDDR5 considering Samsung is very likely to EOL it soon. For 32Gb modules (4GB each), the only parts still in production are x32 width; ALL x64-width 32Gb modules are EOL at both Samsung and Micron. T239 will be using x64-width modules to save PCB area (2 modules instead of 4) and to avoid a redesign of Orin's memory controllers, which support x64-width modules.
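For reference, a minimal sketch of the bandwidth and capacity arithmetic behind the figures above (128-bit bus, 6400 MT/s, 2x 48Gbit modules):

```python
# Sketch: peak bandwidth and capacity for the assumed 128-bit LPDDR5-6400 configuration.
BUS_WIDTH_BITS = 128
DATA_RATE_MTS  = 6400      # mega-transfers per second, per pin

bandwidth_gbs = BUS_WIDTH_BITS / 8 * DATA_RATE_MTS / 1000
print(f"Peak bandwidth: {bandwidth_gbs:.1f} GB/s")   # 102.4 GB/s

capacity_gb = 2 * 48 / 8                             # 2x 48Gbit (6 GB) x64 modules
print(f"Capacity: {capacity_gb:.0f} GB")             # 12 GB
```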


Storage
Game Cartridges: Macronix GAA-based 3D NAND. Up to 128GB capacity
Internal storage capacity: Between 128GB and 512GB, with 256GB being the most likely
External storage: UFS 2.1 reader + UHS-I microSD reader

SoC Node, Area, Price
Manufacturing Process
TSMC 4N
Supporting evidence:
“Hardware Test Engineer | April 2022 - Present | Manage execution to electrically characterize and validate I/O interfaces (IC2, SPI/QSPI) of Nvidia GeForce RTX 40-Series GPUs and T239 GPU/SOC/CPU” LinkedIn profile of an Nvidia engineer.

Not on Samsung 8N
Supporting evidence:
Against 8N: Nvidia Orin is 455mm². This includes a 16 SM GPU, 12 Cortex-A78AE cores, 4MB of System Level Cache, 2x DLAs, a PVA, 22x PCIe 4.0 lanes, various media encoders/decoders, a 256-bit LPDDR5 bus, and assorted IO. For T239, we have 4 fewer SMs and 4 fewer CPU cores, no DLAs, no PVA, fewer PCIe lanes, and ½ the bus width, but a dedicated decompression accelerator (the FDE) is added. For reference, the Tegra X1 on TSMC 20nm is 116mm², while the Tegra X1+ on TSMC 16nm FinFET is 100mm².
The more significant evidence against 8N is power consumption. If we use the Nvidia Orin Power Estimator https://jetson-tools.nvidia.com/powerestimator/ and set the drop-down to Orin NX 16GB, then configure the sliders/drop-downs as follows: 8x CPU, 1190.4 MHz max frequency, load level = low, DLAs off, 4x GPU TPCs, 408 MHz GPU max frequency, GPU load level = low, and all other settings off, we get a power draw of 9.7W.
Rationale:
If we assume that T239 is about 50% of Orin's area, we end up with a die size of ~227mm² on Samsung 8N. At this die size, yield would be a concern for Nintendo, since we know from NVN2 and the Linux kernel updates that developers expect 12 SMs and 8 A78C cores (meaning they have to be functional on die). Increasing the A78C core count for redundancy is impossible, as 8x is the maximum cluster size, and while adding additional SMs is technically possible, it increases the die area further and would likely result in only slightly higher yield, since the larger area raises the overall chance of defects, whilst increasing the SoC cost for Nvidia and Nintendo. A larger package and heatsink would also be required, increasing the size and weight of the console.
The configuration set in the Power Estimator is missing two entire TPCs and does not include any power consumption from the screen, WiFi, Bluetooth, Joy-Cons, storage, etc. In addition, at this wattage the load on the GPU and CPU is low. Essentially, a power consumption of 9.7W for our SoC provides very minimal gaming performance, removes features expected of a Switch console, and confirms that 8N is not an efficient enough process node for T239. Both the battery size and the form factor of the console would have to increase substantially to support this power draw, and battery life would drop to a level unacceptable for Nintendo to use an 8N version of T239 in a handheld.

Die Size: ~100mm²
IP/Logic Blocks Present on T239 (Drake): 12 Ampere SMs with 4MB L2 cache, 8x Cortex-A78C cluster with 4MB L3 cache, OFA, Encoders/Decoders: H.264, H.265, AV1; FDE, 128-bit LPDDR5 bus, TBC

SoC Cost per Die
TSMC N5 price per 300mm-diameter wafer = $12-14K. We will assume the conservative estimate of $14k for TSMC 4N. At an area of 100mm² (10x10mm) and a yield of 0.95, we get 640 dies per wafer, resulting in a cost per die of ~$23. If we use a more aggressive area of 85mm², a cost per wafer of $12k, and a yield of 0.98, the gross dies per wafer come to 759 and the cost per die is $16.13. If we use more realistic values of 92mm², a $13k wafer price, and a yield of 0.97, we get 698 dies per wafer and a cost per die of $19.20. Given Nvidia's typical margins, a reasonable maximum price paid to Nvidia by Nintendo per good die would be no more than $50.
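A minimal sketch of the dies-per-wafer and cost-per-die arithmetic above, using the standard gross-die approximation for a 300mm wafer (the wafer prices and yields are this document's assumptions, not known figures):

```python
import math

# Sketch: standard gross-dies-per-wafer approximation and cost-per-good-die math
# behind the figures above. Wafer prices and yields are this document's assumptions.

def dies_per_wafer(die_area_mm2: float, wafer_diameter_mm: float = 300) -> int:
    """Gross dies per wafer: pi*r^2/area minus an edge-loss term pi*d/sqrt(2*area)."""
    r = wafer_diameter_mm / 2
    return int(math.pi * r**2 / die_area_mm2
               - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

def cost_per_die(die_area_mm2: float, wafer_price_usd: float, yield_rate: float) -> float:
    return wafer_price_usd / (dies_per_wafer(die_area_mm2) * yield_rate)

for area, price, yld in [(100, 14_000, 0.95), (85, 12_000, 0.98), (92, 13_000, 0.97)]:
    print(f"{area} mm^2, ${price:,} wafer, yield {yld}: "
          f"{dies_per_wafer(area)} dies, ${cost_per_die(area, price, yld):.2f}/die")
# -> 640 dies / ~$23.0, 759 dies / ~$16.1, 698 dies / ~$19.2
```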


Features
7.9” 1080p LCD display manufactured by Innolux and Tianma
Between 5500-6500 mAh battery (battery energy density is higher now than in 2017 when the Switch launched; a battery with approximately the same dimensions as the original would come in at roughly 5700 mAh versus 4315 mAh. The range accounts for the possibility of a slightly smaller or larger battery, in both density and physical size)
Backwards compatibility via an emulation/translation layer (at the very least for digital games)



T239: Die Area Breakdown

To preface:
Personally, I think this estimate is actually a bit too high. This is due both to the scaling factor used for the logic-only sections and to the fact that many of the IP block area estimates come from AD102. Because the clock speeds targeted by Drake's GPU are much lower than those of the RTX 4090/RTX 6000 Ada, the transistors of GA10F can be packed closer together than on AD102, resulting in a smaller GPU area. Overall I'd estimate that these inaccuracies account for about an extra 3-5 mm² added to T239. However, there may be extra area I'm not accounting for; the most likely candidates are the data fabric/interconnect, or misc accelerators/IO that may turn out to also be present. If the SoC is on 4N, my best guess for the final die size is somewhere between 86 and 94 mm². Anyways:

Drake T239 (4N)

GA10F
1x GPC = 24.42 mm²
4MB L2$ = 3.53 mm²
GPU-wide Crossbar/Command Frontend = 1.76 mm²

Total GPU Area = 29.71 mm²

Memory
128-bit LPDDR5 PHY = 13.1 mm²
2x 64-bit LPDDR5 memory controllers = 3.29 mm²

Total Memory Area = 16.39 mm²

CPU
8x A78C cluster + 8MB L3 = 12.91 mm²

Total CPU Area = 12.91 mm²

IO, Encoders, Accelerators, Misc (Other)
PCIe + MPHYs = 3.71 mm²
NVENC/NVDEC + NVOFA + FDE = 12.38 mm²
IO Control = 8.16 mm²
Interconnect / Data Fabric = 7.63 mm²

Total Other Area: 31.88 mm²

Total Die Size: 90.89 mm²
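A quick tally of the breakdown above (all areas are this document's estimates):

```python
# Sketch: summing the estimated block areas above to the quoted total.
areas_mm2 = {
    "GPU (1 GPC + 4MB L2 + crossbar/frontend)":      24.42 + 3.53 + 1.76,          # 29.71
    "Memory (128-bit PHY + 2x 64-bit controllers)":  13.1 + 3.29,                  # 16.39
    "CPU (8x A78C cluster + L3)":                    12.91,
    "Other (PCIe, NVENC/NVDEC/OFA/FDE, IO, fabric)": 3.71 + 12.38 + 8.16 + 7.63,   # 31.88
}
print(f"Total die size: {sum(areas_mm2.values()):.2f} mm^2")   # 90.89 mm^2
```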



T239: 4N or N7?

The more I've thought about it, the more I find N7 unlikely. Here is my rationale:
1. Die Size
On N7, Nvidia was able to achieve a transistor density of 65.6 MTr/mm² for GA100 (the die used in the A100). Its successor, GH100 (Hopper/H100), manufactured on 4N, achieves a density of 98.3 MTr/mm². Dividing the two gives a reasonably accurate scaling factor for what Nvidia gets going from N7 to 4N: 0.6673. Taking T239 on 4N at an approximate die size of 91mm² and multiplying by the scaling factor's reciprocal (1.498), T239 manufactured on N7 would be about 136.2mm² (this scaling is worked through in the sketch after the yield discussion below). Now, this is only about a 50% increase in die size, which seems relatively insignificant... however, let's move on to how much this die would cost.


2. Die Cost
Using the silicon cost calculator on Adapteva with the following values:

136.2mm² die size (T239 on N7)
Price per 300mm-diameter N7 wafer = $8K
Yield: 0.95

We find that the cost per die is $18.227. This represents a $2.61 reduction from my prior estimate of $20.84 per T239 die on 4N. But there are a few more things we should consider.

Firstly, yield.
The yield figure I gave above is inaccurate for a couple of reasons. TSMC's N5 family is their best-yielding process in a long time; you would likely have to go all the way back to 28nm to find a node that yielded as well or better. N7, by comparison, while still having an incredibly low defect density, does not yield as exceptionally well as N5/4N. This is a problem because:
A) 0.95 is too high a baseline yield for N7, and
B) T239 on N7 is also ~50% larger in area than on 4N. Larger dies yield worse than smaller ones, as the likelihood of defects landing on a critical portion of the die is higher, potentially rendering it entirely nonfunctional.

These represent pretty big issues for Nintendo. We know from NVN2/L4T that 12 SMs must be present for GA10F, along with 8x A78C cores for the CPU cluster. If any of these areas have critical defects that would require disabling an SM or CPU core, then the die is a complete dud to Nintendo. It can't be cut down like desktop GPUs or CPUs and binned as a lower-tier SKU; all of these IP blocks have to be functional and hit the proper frequency targets at the given supply voltage. This problem can be designed around by adding redundant logic blocks and transistors, but that increases the die area, which increases the cost further and again increases the chance of critical defects appearing.

So overall, what does this mean for the yield of T239 on N7? Essentially, the approximate yield figure of 0.95 for 4N decreases to a best estimate of 0.90 for N7 (which is likely still too high, but let's overestimate so N7 gets a better shot), taking the aforementioned factors that hurt yield into account. Rerunning the silicon cost calculator with 0.90 instead gets us a cost per die of $19.24. That's only about $1 more expensive per die, and still about $1.50 cheaper per die than 4N. Let's move on to the next section, however, to see whether it remains cheaper once other expenses are considered.
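A minimal sketch of the density scaling and the N7 cost rerun above (the $8K wafer price and both yield figures are this document's assumptions; the simple gross-die formula lands within a few cents of the calculator's results):

```python
import math

# Sketch: scale the ~91 mm^2 4N die to N7 via the GA100 vs. GH100 densities above, then
# rerun the gross-dies / cost-per-die estimate. The $8K N7 wafer price and both yields
# are this document's assumptions, not known values.
area_n7 = 90.89 * 98.3 / 65.6                                   # ~136.2 mm^2

def cost_per_die(area_mm2, wafer_price_usd, yield_rate, wafer_d_mm=300):
    gross = (math.pi * (wafer_d_mm / 2) ** 2 / area_mm2
             - math.pi * wafer_d_mm / math.sqrt(2 * area_mm2))
    return wafer_price_usd / (int(gross) * yield_rate)

print(f"T239 on N7: ~{area_n7:.1f} mm^2")
print(f"N7, yield 0.95: ~${cost_per_die(area_n7, 8_000, 0.95):.2f}/die")   # ~$18.3
print(f"N7, yield 0.90: ~${cost_per_die(area_n7, 8_000, 0.90):.2f}/die")   # ~$19.3
# Both land within a few cents of the $18.227 / $19.24 calculator figures above, and
# remain slightly below the quoted $20.84 per die on 4N.
```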

3. R&D Costs
In terms of engineering resources, it is easier for Nvidia to create T239 for 4N than for N7. Why is this? It has to do with the logic blocks Nvidia has already created for 4N compared to N7. Let's examine 4N first:

On 4N, Nvidia already has the RTX 4000 series GPUs (Ada Lovelace). While Ampere (the architecture used for the GA10F GPU in T239) is not the same architecture as Ada, the two architectures resemble one another more than any two prior Nvidia architectures. Specifically, the Streaming Multiprocessors (SMs) are the same, aside from the updated Tensor and RT cores on Ada versus Ampere. Many IP blocks such as the CUDA cores, private/shared caches, ROPs, 3D FFs, TMUs, etc. are identical or essentially identical. Beyond the SMs, the overall GPC layout is basically the same across most Ada and Ampere SKUs (most, because there are a few exceptions, but they're not significant here).

Now let's examine what logic blocks Nvidia has on N7 that are translatable to T239 on the same node:

The closest analog to T239 here is Nvidia's GA100 die, which is also based on Ampere. So, case closed, it's the exact same architecture as GA10F, or at least has more similarities to it than Ada, right? Surprisingly, no. GA100 goes into Nvidia's A100 AI and HPC accelerator GPUs for the datacenter, and as such many of its logic blocks differ from those of desktop/laptop Ampere (GA102-GA107). Here are the notable differences between these two dies (GA100 and GA102), despite both being based on the Ampere architecture.

GA100:
Each of the 4 processing blocks (partitions) per SM has 8 FP64 cores, 16 CUDA cores that handle only FP32, and 16 CUDA cores that handle only INT32
Each partition has 8 load/store units
192KB L1 data cache/shared memory per SM
Each Tensor Core can handle 256 dense or 512 sparse FP16 FMA operations per clock, and is capable of FP64 operations
FP32 to FP16 (non-tensor) ratio is 1:4
GPC structure: 2 SMs per TPC, 8 TPCs per GPC, 16 SMs per GPC
No Raster Engine
No RT Cores
GA102:
Each partition has 8 CUDA cores for FP32 only, and 8 "hybrid" CUDA cores that can handle either FP32 or INT32. Only 2 FP64 units are present per SM, and they sit outside the partitions (unlike on GA100)
4 load/store units per partition
128KB L1 data cache/shared memory per SM
Weaker Tensor Cores, with 1/2 the FP16 FMA ops of GA100 and no FP64 FMA
FP32 to FP16 ratio is only 1:1
GPC structure: 2 SMs per TPC, 6 TPCs per GPC, 12 SMs per GPC
Raster Engine per GPC
1x 2nd-gen RT Core per SM
There are many more differences between the two dies, but hopefully the takeaway is that even though GA100 and GA102 both use Ampere, the functionality of the various logic blocks and the structure of the GPCs differ markedly. Let us now compare GA102 to AD102 (Ada) and note how these architectures contrast.

AD102:
3rd-gen RT Cores (many improvements over 2nd gen)
4th-gen Tensor Cores (addition of FP8; double the throughput of Ampere's 3rd-gen Tensor Cores for each shared data type)
As far as the GPC structure and the logic block versions within the SM are concerned, that's it. Ada also has a massive increase in L2 capacity and bandwidth, but that's not relevant here.

Finally, let's look at T234 (Nvidia Orin) on Samsung 8N to see what IP is likely to be ported to T239 (Drake), and what will be absent.

T234
Ported
-64-bit LPDDR5 memory controllers and PHYs
-Some of the IO control logic and connections (USB, SD/eMMC)
-Data fabric and interconnects (not identical, but they will use what's on Orin to help design Drake)

Removed
-2x DLA v2
-PVA v2
-HDR ISP
-VIC (Video Imaging Compositor) 2D engine
-At least 3x CSI (x4)
-10GbE
-Some other IO required for image processing and debugging
-Probably a few more (not entirely relevant)

Keep in mind that Samsung 8N, TSMC N7, and TSMC 4N are all not design-compatible. This means that regardless of whether T239 is on N7 or 4N, whatever is ported from Orin will have to be modified according to the design rules of TSMC's node. So with all of that out of the way, let's summarize the IP blocks already existing for T239 on 4N, and those that would need to be ported from either 8N or N7.

Present on 4N:
Logic blocks within the SM: PolyMorph Engine, CUDA cores, ROPs, TMUs, load/store units, SFUs, warp schedulers, dispatch units, registers, L0 instruction caches, L1 data cache/shared memory. Keep in mind these all have the same counts and structure as desktop Ampere
GPC structure: raster engine, 2 SMs per TPC, 6 TPCs per GPC
NVENC/NVDEC (H.264, H.265, AV1) [reduced stream counts vs. Ada]
Display control logic + HDMI PHYs
L2$ SRAM + control logic (2048KB per tile)
PCIe PHYs + control logic

Ported from 8N (desktop Ampere)
2nd-gen RT Cores, 3rd-gen Tensor Cores (1/2 the throughput of Orin and GA100)

Ported from 8N (Orin)
NVOFA
Various IO control/PHYs like USB, SD/eMMC
Some design logic from the data fabric and interconnect, and potentially logic controls/SRAM for the CPU caches

Present on N7 (A100)
SM logic blocks: FP32-only CUDA cores, FP64 CUDA cores, ROPs, TMUs, load/store units, SFUs, warp schedulers, dispatch units, registers, L0 instruction caches
GPC structure: 2 SMs per TPC
NVDEC only (minus AV1)
L2$ SRAM + control logic (512KB per tile)
PCIe PHYs + control logic

To wrap this massive section up, you can see from the final comparison that far more of the logic blocks applicable to T239 already exist on 4N than on N7.
The IP on 8N will need to be ported regardless of whether T239 is manufactured on 4N or N7, so this represents a "fixed" R&D cost for Nvidia. However, by going with 4N over N7, far more IP blocks are already present, reducing the overall engineering resources required, and thus the overall design costs.

Does this narrow the gap between the cost per die of T239 on N7 versus 4N, or even tilt the overall cost in 4N's favor?
Yes, but indirectly. It doesn't reduce the cost of the silicon itself, but it does reduce how much Nintendo either needed to pay Nvidia outright for the R&D, or the cost Nintendo pays Nvidia per functional die (a smaller margin markup), or both. It depends on how the cost structure between Nvidia and Nintendo was negotiated, but regardless, the overall amount Nintendo pays Nvidia will be reduced.


4. Wafer Supply

While it's true Nvidia has capacity on N7 for various datacenter products, they also have a massive 4N wafer allocation from TSMC. You may think that Nvidia is unable to divert wafers away from H100 and RTX 4000 to fulfill the massive order volume that Nintendo will need. However, there have been some important recent developments that change this calculus in my opinion.

Firstly, it has been heavily rumored that due to poorer-than-expected sales of RTX 4000 (Ada), Nvidia is reducing how many AD102, 103, 104, etc. dies they are producing. This is both to prevent an oversupply of RTX 4000 that would force down MSRPs (or at least actual retail pricing), and to free up wafers to allocate toward additional H100 manufacturing. While I agree with the former, the latter comes with a large caveat.
Currently, 4N wafer supply is not the bottleneck for H100 production; Nvidia has more than enough to allocate toward their high-margin AI accelerator. Instead, the bottleneck is actually CoWoS packaging, meaning the packaging of HBM (high bandwidth memory) side by side with the GH100 die on an interposer. With the AI boom in full swing, TSMC is unable to package dies together with HBM quickly enough to meet demand, despite HBM and GH100 supply being sufficient. TSMC is increasing CoWoS packaging capacity accordingly, but it will take time to build up this additional manufacturing.

Interestingly, Nvidia has also increased their 4N wafer supply, even though RTX 4000 is not meeting sales expectations and H100 output cannot scale with the additional wafers allocated. So what might all this extra capacity be used for? In my opinion, the beginning of high-volume manufacturing (HVM) for T239. We know that dev kits are in the hands of 3rd-party developers at this point, and that a 2H 2024 launch is likely for the Switch NG. If HVM for the SoC seems too early, remember that these dies need to be manufactured, packaged with memory, integrated onto a PCB, and wired to additional PCBs containing things like WiFi/Bluetooth modules, the gyroscope, NAND memory, etc. Then all of these components need to be assembled into the console along with the screen. Joy-Cons, docks, and other accessories also need to be manufactured, all the components need to be QA tested, and everything needs to be packaged together in a box, shipped across the world, and in the hands of retailers before launch. And Nintendo needs to have millions of these consoles ready to go. The timeline for this process lines up extraordinarily well with both a 2H 2024 launch and HVM for T239 beginning now or a few months earlier.

Let's compare gross dies per wafer of T239 and GH100, silicon and other costs for each, and the net margin of both products for Nvidia on 4N.

T239
Die size: 90.89 mm²
Yield: 0.95
Gross dies per 4N wafer: 686
Cost per die: $20.84

GH100
Die size: 814 mm²
Yield: 0.80
Gross dies per 4N wafer: 67
Cost per die: $277.78

A quick explanation of the 0.80 yield figure for GH100. It is an absolutely massive die at 814 mm², dwarfing T239's 90.89 mm² by 8.95x. Because of this almost nine-fold increase in die area, you might expect a much worse yield penalty than 0.15 compared to T239. However, unlike Drake, GH100 can be cut down to remove critical defects, and in fact is. GH100 has 144 SMs, 12x 512-bit HBM controllers, and 60MB of L2 for the full die. However, the top SKU (H100 SXM5) has only 132 SMs, 10 memory controllers, and 50MB of L2, and a further cut-down SKU also exists (H100 PCIe) with even fewer SMs (114).

So now let's compare Nvidia's margin for each die when sold in packaged form. We'll go with GH100 first.

H100 has an average sale price of about $30k, and Nvidia's margin on it is reported to be 1000%. Personally I think that figure is probably too low, but let's break down the costs of an assembled H100 anyway.
GH100 cost per die: $277.78
80GB HBM3 (at a reasonable estimate of $10 per GB) = $800
CoWoS interposer + packaging = ???
Power delivery, wiring, IO, PCB + packaging = ???
Heatsink/cooler = ???

If reports are to be believed, the costs of the components other than the die and HBM, as well as packaging, validation, and shipping, total about $1,922 (the arithmetic is sketched below)? This seems pretty absurd to me; maybe the yield of the CoWoS packaging step is absolutely atrocious, but I doubt it's bad enough to account for this huge discrepancy. In my opinion the margin is closer to 1500%. Anyway, moving back to the initial point: with 67 dies per wafer, each die is able to make Nvidia about $27,000, so each 4N wafer allocated to GH100 is worth about $1.809 million to Nvidia.

Let us now compare this to T239.

With 686 dies per wafer, a cost per die of about $20.84, and an estimated markup to Nintendo of 60%, we find that a 4N wafer of T239 would make Nvidia about $8,578. I don't think I did the math incorrectly, but regardless, a wafer of GH100 makes Nvidia about 210x more potential profit.
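A sketch of the per-wafer value comparison above. It follows this document's implicit reading of the "1000%" margin (assembled H100 cost of roughly one tenth of the $30k sale price) and, like the text, counts every gross die:

```python
# Sketch: per-wafer value comparison using this document's estimates. The "1000%"
# margin is read as assembled cost ~= sale price / 10; yield losses are ignored,
# as in the text above.

# GH100 / H100
h100_price        = 30_000
gh100_die_cost    = 277.78
hbm_cost          = 80 * 10                         # 80 GB of HBM3 at ~$10/GB
assembled_cost    = h100_price / 10
other_costs       = assembled_cost - gh100_die_cost - hbm_cost
profit_per_h100   = h100_price - assembled_cost
gh100_wafer_value = 67 * profit_per_h100

print(f"Residual (packaging, PCB, cooler, etc.): ~${other_costs:,.0f}")    # ~$1,922
print(f"GH100 wafer value: ~${gh100_wafer_value / 1e6:.3f}M")              # ~$1.809M

# T239
t239_die_cost    = 20.84
t239_markup      = 0.60                             # assumed markup to Nintendo
t239_wafer_value = 686 * t239_die_cost * t239_markup

print(f"T239 wafer value: ~${t239_wafer_value:,.0f}")                      # ~$8,578
print(f"Ratio: ~{gh100_wafer_value / t239_wafer_value:.0f}x")              # ~211x
```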

So why did I go on this tangent about margins and product costs, yada yada yada, if it just concludes that Nvidia would make vastly more money by not making T239 on 4N? Because all of that potential profit remains just that, potential, if all your GH100 dies are sitting in a TSMC warehouse, waiting months and months or even years for CoWoS capacity to finally catch up. Nvidia aren't stupid; they will make as many GH100 as they need to both fill demand and prevent a huge backlog of unpackaged H100s piling up. By the end of 2023, Nvidia is predicted to have sold 550,000 H100s. However, this doesn't mean 550,000 delivered to customers; it means some will be in customers' hands, and a huge portion of those customers will have prepaid to get their H100s when they're actually fully complete.

Let's say that of those H100s, half are actually delivered to customers by the end of the year. Going back to our 67 dies per wafer, a die yield of 0.80, and a realistic yield estimate of 0.9 for the CoWoS packaging step, we find that Nvidia would need about 5,700 4N wafers to produce the requisite dies for 275,000 H100s. If they make extra GH100 dies (let's say 400,000), they'd need about 8,300 4N wafers. The last available estimate for TSMC's total N5-family capacity was 150,000 wafers per month. However, that was in April of 2022; it's likely they are at around 200,000 per month now. For 8,300 4N wafers (GH100 only) over the course of 2023, Nvidia would need to allocate about 700 wafers per month of their total 4N wafer allocation. I can assure you they're allocated quite a bit more than that per month. To further counter the argument that 4N capacity isn't enough, let's look at TSMC's CoWoS packaging capacity. It is estimated that TSMC is able to package 8,000-9,000 per month, and that's dies per month, not entire silicon wafers full of dies. In May, Nvidia reportedly sought to increase their CoWoS allocation by 10,000 over the remainder of 2023. Let's split that into 2,000 for each of the last 5 months of 2023, and assume that prior to that they had about 60% of overall CoWoS capacity per month. This results in a total of about 75,000 packaged dies in 2023 for Nvidia. That's a whole lot less than 550,000 H100s actually delivered. Essentially, we can conclude that 4N supply to GH100 places no constraint on T239 production. But why stop there; maybe Nvidia still wouldn't have enough 4N supply?
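A sketch of the H100 wafer-demand and CoWoS arithmetic above, using this document's assumptions (67 gross GH100 dies per wafer, 0.80 die yield, 0.9 CoWoS packaging yield, roughly 9,000 CoWoS dies per month of TSMC capacity with Nvidia at about 60% plus the +10,000 bump):

```python
# Sketch: H100 wafer demand and CoWoS-limited output, per this document's assumptions.
GROSS_DIES_GH100 = 67
DIE_YIELD        = 0.80
COWOS_YIELD      = 0.90

def wafers_for_h100s(units: int) -> float:
    return units / (GROSS_DIES_GH100 * DIE_YIELD * COWOS_YIELD)

print(f"275,000 H100s: ~{wafers_for_h100s(275_000):,.0f} wafers")             # ~5,700
print(f"400,000 H100s: ~{wafers_for_h100s(400_000):,.0f} wafers")             # ~8,300
print(f"Monthly, spread over 2023: ~{wafers_for_h100s(400_000) / 12:,.0f}")   # ~700

# CoWoS-limited packaged-die output for Nvidia in 2023
packaged_2023 = 9_000 * 0.60 * 12 + 10_000
print(f"CoWoS-limited 2023 output: ~{packaged_2023:,.0f} packaged dies")      # ~75,000
```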

How many wafers would it take to fulfill demand for T239? Let's assume that Nintendo wants to go big and have 10 million units available at launch and through the holidays.
At 686 dies per wafer and a yield of 0.9, Nintendo would need roughly 16,000 wafers. Maybe the lead time from T239 production to ready-to-sell consoles is about 6 months. If HVM of T239 started in July 2023 and Nintendo launches the Switch NG in October 2024, they would have about 10 months of SoC production to cover 10 million consoles available immediately at launch. Per month, this represents roughly 1,600 4N wafers Nvidia would need to allocate to Nintendo out of their total wafer supply from TSMC. With Nvidia being a large customer of TSMC, and TSMC producing 200,000 N5-family wafers per month, let's just say that yes, roughly 2,300 wafers per month for GH100 and T239 combined, plus an additional few thousand for RTX 4000, is certainly within Nvidia's monthly 4N allocation.
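A sketch of the T239 wafer-demand math, using this document's assumptions (686 gross dies per wafer, 0.9 yield, 10 million consoles, roughly 10 months of production before launch):

```python
# Sketch: T239 wafer demand, per this document's assumptions.
wafers_total = 10_000_000 / (686 * 0.9)
print(f"Total 4N wafers: ~{wafers_total:,.0f}")        # ~16,200
print(f"Per month (over ~10 months): ~{wafers_total / 10:,.0f}")   # ~1,600
```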

Going back to GH100 and T239, let's finally circle around to revenue.
For 550,000 GH100s sold, Nvidia makes $16.5 billion. For 10 million T239s sold to Nintendo, Nvidia generates about $333 million in revenue. However, that T239 figure is revenue on fully delivered product (B2B). With the CoWoS capacity constraints laid out earlier, Nvidia will only be able to actually deliver about 75,000 H100s in 2023, bringing the revenue from those GH100 dies down to only $2.25 billion. That is still about a 7-fold difference in revenue, but while CoWoS capacity is constraining H100 supply, T239 doesn't use CoWoS and has no such supply constraint stopping Nvidia from making money. And if you have the node capacity, why not make more money?
