Nintendo Switch Bootrom Dumping Pseudo-writeup

(* The transcript below is from earlier today, when we were fielding a lot of questions and decided it was best to go ahead and give an explainer on the way bootrom dumping on the Switch took place. It's not quite a writeup or an in-depth technical document, but hopefully it can serve as a basic explanation of how we did what we did and why. )

RobocopBOT - Today at 6:27 PM
:lock: Channel locked down. Only staff members may speak. Do not bring the topic to other channels or risk disciplinary actions.

hedgeberg - Today at 6:28 PM
Ok so, lock is temporary just for the purposes of this explainer, lets get started.
First off, lets explain what exactly bootrom /is/ and why it cant be obtained by other means

ktemkin - Today at 6:28 PM
Suggestion: keep this in a nice condensed message as much as possible so we can pin it when you're done?

hedgeberg - Today at 6:29 PM
Oh god no this is not going to fit in one message >.<
Lets pastebin the whole deal afterwards and post the link.

ktemkin - Today at 6:29 PM
okay, sure-- that works too
d'you want me to proofread you and refine what you're saying, or do you just want to go?

hedgeberg - Today at 6:29 PM
Let me know if i make any glaring mistakes or if i missed something, absolutely
So, let's talk Tegra: when it comes to the Tegra, all execution starts with the BPMP, the Boot and Power Management Processor (an ARM7TDMI). For those of you who have read the writeup on jamais vu, that should sound familiar: it's the same thing @SciresM pwned for TZ exec on 1.0. Basically, the BPMP is responsible for setting up everything on the TX1 so that things are safe and ready to execute when it comes time for the OS to start running on the AP (Application Processor). The AP is a 4-core aarch64 processor where almost all of the OS runs, and for the rest of this we won't really be talking about it because everything of interest that we're about to discuss takes place before that is even active.

So, when we talk about bootrom, we're talking about the very first code to execute on the BPMP. It's a set of instructions burnt into the silicon on the physical layer using whats called a "maskrom", basically a set of metal wires that are used to encode 1's or 0's. In addition, theres a set of "eFuses", i.e. fuses that are used to permanently and irreversibly store information at the factory before the device is shipped. The eFuses contain patches for the bootrom so that NVIDIA can fix bootrom bugs before shipping the SoC without fabricating a whole new set of masks and redoing the whole production of the chip. In addition, the Secure Boot Key (aka SBK) and other important console information are stored in eFuses

All that being said, i think we're good to dive in:
The bootrom is usually a big mess of spaghetti code and it is responsible for a /lot/ of important stuff. For example, the bootrom reads the BCT (boot config table) from the NAND Flash and uses that to figure out if whats in the NAND is safe to boot. In the Switch, only BCT entries that are signed by Nintendo can be loaded, because the TX1 verifies the BCT entry using RSA crypto. This is high-responsibility and complex code, it's usually where a lot of neat vulnerabilities end up, and manufacturers know this, so they lock out access to it as part of the bootrom before they pivot to execute what we'd call the "stage 1 bootloader" or Package1 on the switch. The idea is that if hackers dont have access to the code, they can't find vulnerabilities.

That leaves us 2 options: 1, somehow find an exploit in the bootrom that allows us to dump the bootrom before lockout, and do this without the actual code (basically, fuzzing all possible inputs from any possible source), which is basically shooting blind, or 2, somehow prevent the bootrom lockout from actually taking place and/or "trick" the processor into taking a path or a set of instructions it never would have if operated correctly. (There is also a 3rd option, decapsulation or "decapping" and imaging, but that is an absurdly expensive process that is far beyond the reach of a team like ours.)

1 is, well, possible, but outright terrible. The TX1 is a really, really complex SoC with thousands of different potential fuzzing routes. As such, 2 is the ideal. But the question becomes then, how do we actually cause 1 of the 2 situations listed under option 2 to happen? Ideally, we'd reach a point where when our code begins to execute, the bootrom is just not locked out, and we can slurp it right out when our code begins to execute. That means we need 2 things: the actual lockout skip, and code executing that can read things after the lockout skip.

So, for the second part, our life is made a lot easier by not using the Switch, since for code to execute on the switch, it needs to be signed by Nintendo. We instead use the Jetson TX1, which is a dev board for the same SoC as used in the switch, and (conveniently!) uses the same bootrom and patch set, too. (*edit: When writing this, I had forgotten that there are a couple more patches on the Switch than the Jetson, meaning that the bootrom is the same but the patch set is slightly different. Sorry, guys!) That takes care of code execution on the BPMP, we can run whatever we want there after the bootrom, but as for the bootrom lockout, we're still not well-off.

Actually rq, to make sure im not forgetting anything, @SciresM im not wrong in thinking bootrom lockout is done by the bootrom and not package1 on the switch, right?

(2 minutes go by)

Ok, I'm going to assume I'm remembering correctly and that it hasn't been too long, and that the bootrom lockout is in fact done in the bootrom itself on the Jetson. The basic idea is that we need to "skip" an instruction, or corrupt its operation so completely that it doesn't produce the desired result. In this case, that would be the store operation in the BPMP that's used to write to the bootrom lockout register. If we skip that instruction, that register is never written to, and when we get execution we can read from the space the bootrom is in just like we'd read from any other region of memory.

TuxSH - Today at 6:56 PM
@hedgeberg I'm not SciresM but I can confirm that the bootrom in fact disables itself by writing 0x1C (0b11100) to SB_CSR_0 just before jumping to the entrypoint (bit 3-2: reserved, bit4: "PIROM_DISABLE: Protected iROM Disable", see page 619 of the TRM)
...ah, hm, I may be wrong about the value written, it's actually 0x10. So yeah, SB_CSR_0 = PIROM_DISABLE
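
(* Not from the chat: a minimal Python sketch of the bit arithmetic TuxSH describes. The field layout comes only from the message above (bit 4 is PIROM_DISABLE, bits 3-2 are reserved); the constant and function names are made up for illustration. )

```python
# Bit positions as described in the chat; the names below are illustrative.
PIROM_DISABLE = 1 << 4       # bit 4: "Protected iROM Disable"
RESERVED_3_2 = 0b11 << 2     # bits 3-2: reserved


def decode_sb_csr(value):
    """Split a value written to SB_CSR_0 into the two fields mentioned above."""
    return {
        "pirom_disable": bool(value & PIROM_DISABLE),
        "reserved_3_2": (value & RESERVED_3_2) >> 2,
    }


first_guess = decode_sb_csr(0x1C)   # 0b11100: lockout bit plus both reserved bits
corrected = decode_sb_csr(0x10)     # 0b10000: lockout bit only
print(first_guess)
print(corrected)
```

Both values set the lockout bit; they differ only in the reserved bits, which is why the correction doesn't change the conclusion.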

hedgeberg - Today at 6:58 PM
thanks @TuxSH wanted to make sure i wasnt being a dumbass

SciresM - Today at 6:58 PM
Correct, bootrom locks itself out.

hedgeberg - Today at 6:59 PM
Ok, so, the idea is that if we can somehow hop over that instruction while letting the bootrom continue to execute, when we land in the BPMP firmware that we control on the Jetson, we'll be able to read the bootrom right out of memory.
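
(* Not from the chat: a toy Python model of the idea above, to show why skipping one store is enough. Everything here is invented for illustration: the fake "bootrom" bytes, the boot steps, and the assumption that a locked-out bootrom reads back as zeros. It does not model the real BPMP. )

```python
# Toy model only: the "bootrom" bytes and the boot steps are invented.
IROM = bytes(range(1, 17))  # stand-in for the real bootrom contents


class ToyBpmp:
    def __init__(self):
        self.pirom_disabled = False

    def read_irom(self, offset):
        # Assumption: once locked out, the bootrom region reads back as zeros.
        return 0 if self.pirom_disabled else IROM[offset]


def early_boot(cpu):
    pass  # stands in for everything the bootrom does before the lockout


def lock_out(cpu):
    cpu.pirom_disabled = True  # models the store to the lockout register


def payload(cpu):
    # Our own code: just try to read the whole bootrom region.
    return bytes(cpu.read_irom(i) for i in range(len(IROM)))


def boot(skip_step=None):
    cpu = ToyBpmp()
    for index, step in enumerate([early_boot, lock_out]):
        if index != skip_step:  # a successful glitch "skips" exactly one step
            step(cpu)
    return payload(cpu)


print(boot() == bytes(len(IROM)))   # normal boot: payload sees nothing useful
print(boot(skip_step=1) == IROM)    # lockout store skipped: payload sees the bootrom
```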

hedgeberg - Today at 7:00 PM
This is where "glitching" comes in. Glitching is basically the process of injecting some sort of controlled noise (yes, thats a misnomer) on the power supply to corrupt the operation of the processor. Processors are designed to work within a certain range of operations, with a certain set of environmental constants. So, for example, they usually expect to work in a range of -25C to 100C or some such range, and outside of that range their behavior can become unpredictable and erratic.

...which would be nice for our purposes if it was unpredictable in the way we wanted, but that kind of temperature-based operating corruption isn't very controllable or predictable, and it isn't precise. However, with an FPGA (field programmable gate array) and a large high-speed MOSFET, we can very quickly and precisely change the power supply voltage that's made available to the BPMP. This kind of glitching is called a "brownout" or "crowbar" glitch, the idea being that you're forcing the rail to the value you want for a short period of time.

If the BPMP doesnt have a high enough voltage that it can draw the power needed to operate correctly, data inside of it will corrupt. Pull the voltage down for too long, and you'll shut down the processor like how an engine would stall if you cut off the gas supply. Pull the voltage down for too little time, and the internal capacitances that exist on the gates of the transistors inside the SoC wont have time to actually change. But, if you hit a sweet spot in terms of timing, you can create a situation where just a few bits might change in different places.

Let's say, for example, you corrupted the Program Counter register just as it was incrementing at just the right time. You could cause it to over-increment, skipping over the instruction. That event isnt likely to actually happen, compared to other things that could, but its an easy example to conceptualize. The thing about modern processors is that they're extremely complex. Executing one instruction takes long pipelines of pre-processing, and different stages in those pipelines can have different effects if browned-out.

However, since we dont have the bootrom, we dont know where exactly the instruction we'd want to skip even is timewise. So, for that specific glitch, once you have the process of modifying the board to be ready for the attack, and you have your payload ready to run on the bpmp, it basically comes down to fuzzing. You need to find the right width of the glitch and the right position in time to actually cause the desired corruption, so you want something to automate that process.
For our attack on the Jetson, we used the ChipWhisperer, which is a board by a hardware hacker named Colin O'Flynn designed to make the automation process easier.

From there, it was just a question of poking around at different times randomly until, on one reboot, the payload reported that it could in fact read from the bootrom space. After that it was as simple as dumping it out over UART.
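
(* Not from the chat: a rough Python sketch of what "poking around at different times randomly" looks like as a search loop. The target is simulated: try_glitch is a hypothetical stand-in for "reboot the board, fire one glitch, see what the payload reports", and the hidden window, the ranges, and the units are all made up. No real ChipWhisperer calls are used. )

```python
import random

# Hypothetical hidden "sweet spot"; on real hardware this is what you don't know.
SECRET_OFFSETS = range(1200, 1204)
SECRET_WIDTHS = range(11, 14)


def try_glitch(offset, width):
    """Simulate one reboot with one glitch and report what happened."""
    if width > 20:
        return "crash"             # rail held down too long, target stalls
    if offset in SECRET_OFFSETS and width in SECRET_WIDTHS:
        return "bootrom_readable"  # payload reports it can read the bootrom
    return "normal"                # glitch had no visible effect


def search(seed=0):
    """Try every (offset, width) pair once, in random order, until one works."""
    candidates = [(o, w) for o in range(2000) for w in range(1, 30)]
    random.Random(seed).shuffle(candidates)
    for attempt, (offset, width) in enumerate(candidates, start=1):
        if try_glitch(offset, width) == "bootrom_readable":
            return attempt, offset, width
    return None


result = search()
print(result)
```

On the real board each attempt is a full reboot, so the size of the search space matters far more than it does in this simulation.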

This isn't the only thing we ended up glitching, we also worked on glitching to get code execution on the Switch BPMP to get things like keyblobs, as did Derrek, but that's a much more complex process. Instead of targeting a register write, we targeted a branch that checked whether or not the RSA key stored in a BCT entry was correct, tricking the processor into accepting a BCT that we signed instead of Nintendo.

Ok, im assuming there are questions, especially about some parts i glossed over, so post 'em in #hack-n-all and ill try to respond to em
@ktemkin while i do that, did i miss anything?

ktemkin - Today at 7:16 PM
just a sec, parsing

	( * user @MrSlick asks: "when you get around to questions i have one. how does what you glitched out differ from the nvtboot_recovery dump you have been reversing on stream")

hedgeberg - Today at 7:17 PM
kk, so the first question I saw was from @MrSlick, asking about nvtboot_recovery

ktemkin - Today at 7:17 PM
Okay, worth noting: the most electron movement happens in places where transistors are switching-- so areas with transistors switching are significantly more likely to be affected by a voltage glitch.
The more you have switching, the more likely an area is to be affected.

hedgeberg - Today at 7:18 PM
nvtboot_recovery is more like package1 on the switch. its not the bootrom, its more like a firmware binary

ktemkin - Today at 7:18 PM
This means that voltage glitches generally are best for corrupting the output of more significant computations.

hedgeberg - Today at 7:18 PM
we're analyzing that on the stream because its similar, but not the same
yes, thx @ktemkin, that is a good point, voltage glitching is better for attacking things like hashes and crypto operations etc, things that use a lot of power

ktemkin - Today at 7:19 PM
Or for affecting things like next-address computations

hedgeberg - Today at 7:19 PM
Yeah, I just didnt want to get too much into the physics so much as give an overview.

	( * user @Mak asks: "@hedgeberg could you explain more about the 3rd option? and why it's expensive?" (this was much earlier in the post, referring to the 3 options for getting bootrom))

Ok, next question is from @Mak who asked about the imaging/decap solution.
So, the thing about modern SoCs is that even the largest objects on any layer are /tiny/. The MOSFETs, for example, the building blocks of logic gates, are built on a 20 nm process in the Tegra, iirc. 20 nm is far smaller than your average bacterium, and 1 nm is only a handful of carbon atoms across (unless I'm drastically misremembering). To look at that, we need really heavy-duty equipment, not even Scanning Electron Microscopes are enough. We generally need something like a Transmission Electron Microscope (TEM) or Atomic Force Microscope (AFM) to do that kind of imaging, and those tools are not cheap, even to rent time on.

In addition, there are usually around 10 layers of metal wiring above the transistors and substrate, and to analyze the chip we need to analyze each layer, meaning we need extremely precise systems for scraping off each layer. Plus, the chip is bonded to a PCB using epoxy that can only be dissolved using dangerous acids, like very-high-concentration (aka "Fuming") nitric acid, so its a safety hazard, a time commitment, and a financial investment, all rolled into one. People do do that kind of reverse engineering, but those systems are expensive so you need a big budget or friends in high places.

	( * user @FineTralfazz asks: "when you say a "large" mosfet, what do you mean?")

Ok @FineTralfazz asks what I meant by a "large" MOSFET. For glitching, we usually use a MOSFET with really low Rds(on), meaning the effective resistance between the drain and the source when the transistor is switched on. We want to be able to pull power away from the node as quickly as possible, so smaller Rds(on) is better. In addition, it needs to be fast and able to handle a lot of power, so RF MOSFETs intended for use in stuff like amplifiers work well.
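
(* Not from the chat: a back-of-the-envelope Python sketch of why a low on-resistance matters. It treats the supply's output resistance and the MOSFET as a plain resistive divider and ignores the load and decoupling capacitors, which is a big simplification. All component values are made-up assumptions, not measurements from the Jetson. )

```python
def crowbar_rail_voltage(v_supply, r_source, rds_on):
    """Steady-state rail voltage while the MOSFET shorts the rail to ground.

    Simple divider: the supply's source resistance on top, the MOSFET's
    on-resistance on the bottom. Load current and capacitance are ignored.
    """
    return v_supply * rds_on / (r_source + rds_on)


V_SUPPLY = 1.0   # volts, assumed rail voltage
R_SOURCE = 0.1   # ohms, assumed supply output resistance

low_rds = crowbar_rail_voltage(V_SUPPLY, R_SOURCE, rds_on=0.01)
high_rds = crowbar_rail_voltage(V_SUPPLY, R_SOURCE, rds_on=1.0)
print(f"low Rds(on):  rail pulled to about {low_rds:.2f} V")
print(f"high Rds(on): rail pulled to about {high_rds:.2f} V")
```

In this simplified picture, the smaller the on-resistance is relative to the supply's own resistance, the closer the rail gets dragged to ground during the glitch.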

	( * user @roblabla asks: "how expensive would it be to replicate the setup - Assuming all we have is a Jetson - in terms of material ?")

@roblabla so, to set up the system, you need:

	-Jetson: $600
	-ChipWhisperer: ~$250
	-FPGA: ~$100
	-Transistors/gate drivers/etc: ~$50
	-(recommend adding a logic analyzer and an oscilloscope as this would not be possible without them imo, but those are really expensive)

So, total, including the target board, is about $1k. That doesnt mean there arent cheaper ways to do it, but its definitely cost-prohibitive.
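
(* Not from the chat: the sum behind the "about $1k" figure, using only the prices listed above. The logic analyzer and oscilloscope are left out because no prices were given for them. )

```python
# Prices exactly as listed in the chat, in US dollars.
costs = {
    "Jetson": 600,
    "ChipWhisperer": 250,
    "FPGA": 100,
    "Transistors/gate drivers/etc": 50,
}

total = sum(costs.values())
print(f"total: ${total}")  # total: $1000
```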

	( * user @merry asks: "Essentially, summary, fuzz glitch parameters with success condition being your BCT being successfully executed? I had thought it would be more complex.")

@merry you asked about BCT, and the bootrom glitch doesnt need BCT, its not terribly complex
the BCT glitch is a lot more complex because the timing window is way larger. For that we synchronized using our handy-dandy FPGA to process eMMC commands. When we saw the end of the BCT read via the eMMC i/o, we could use that as a trigger for the chipwhisperer. The most efficient way to do the attack though, in reality, would be to fill up the BCT with entries containing our bogus key and trigger on each attempt (the TX1 reads 64 entries before just giving up).
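
(* Not from the chat: a small Python sketch of why filling the BCT with bogus entries helps. If each entry gives one glitch attempt, one boot gives up to 64 attempts instead of one. The per-attempt success rate used below is a made-up number, and the sketch assumes attempts are independent, which may not hold on real hardware. )

```python
def chance_of_any_hit(p_single, attempts=64):
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts


P_SINGLE = 0.001  # assumed chance that one glitch attempt works

one_entry = chance_of_any_hit(P_SINGLE, attempts=1)
full_table = chance_of_any_hit(P_SINGLE, attempts=64)
print(f"per boot with 1 attempt:   {one_entry:.4f}")
print(f"per boot with 64 attempts: {full_table:.4f}")
```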

Pinning down the glitch in that is really tricky, you basically have to analyze the time between the end of one read and the start of another to try and figure out what codepaths are being executed. It sounds easy from a high level because overall the concept is pretty simple, admittedly, but getting the system to a point where it was even arguably "working" took me about 200 hrs at least. There's a lot of moving parts that you run into, which is why I tell people its not simple. A background in CPU architecture, PCB design, and Analog Electronics helps /a lot/.
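
(* Not from the chat: a minimal Python sketch of the "time between the end of one read and the start of another" analysis. The timestamps are invented; in practice they would come from a logic analyzer or FPGA capture of the eMMC bus. The idea is only that an unusually long gap hints that different code ran in between. )

```python
def gaps_between_reads(reads):
    """Given (start, end) times sorted by start, return the idle time between reads."""
    return [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(reads, reads[1:])
    ]


# Invented capture, in microseconds: (start, end) of each eMMC read.
reads = [(0, 120), (150, 270), (300, 420), (900, 1020)]

gaps = gaps_between_reads(reads)
print(gaps)  # [30, 30, 480]

# Flag gaps much longer than the shortest one as worth a closer look.
interesting = [i for i, gap in enumerate(gaps) if gap > 4 * min(gaps)]
print(interesting)  # [2]
```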