DiceCTF 2024 Quals: RISC-V Privileged Execution and a Verilator 0-Day (I Guess)
DiceCTF Quals 2024
Another year, more CTFs. DiceCTF is a qualifier this year, I played with r3kap1g + Project Sekai merger and placed third. One set of challenges involved a RISC-V processor described in Verilog. Coincidentally, I'm auditing UIUC's ECE 411, so this kind of just gave me a solution to the final programming assignment as a CTF challenge (fun).
Hardware-like challenges are always fun and obscure, so for education purposes I'll be explaining a lot of ground up concepts with respect to RISC-V and processor design. If you just want the exploit idea + script, skip to the TLDR.
What is RISC-V
RISC-V is an Instruction Set Architecture, like x86, which defines the specification for how a processor should interact with memory and handle instructions. Like x86, there are a set of instructions a processor must support, but RISC stands for "Reduced Instruction Set Computer", and thus brings two aspects that make it much simpler than x86. First, there are explicit instructions for load and store, arithmetic instructions cannot directly interface with memory. Second, instructions are fixed-width, so always 32 bits in this case.
RISC-V is an open source ISA, so anyone is able to design and produce a processor that is built upon RISC-V, and you'll often see it in some IoT boards and other embedded hardware CPUs. Important resources for us to use is an instruction decoder, and the RISC-V specification.
Designing a Processor
The important basics to a processor is considered the "5 stage processor", where each stage performs a specific task to parse and execute one instruction. The basics here are that the RISC-V processor has a register file of 32 registers and a program counter register, which holds the next address from program memory.
The 5 stages are:
-
Instruction Fetch: Determining the next program counter address. By default, it fetches the address plus 4, however, may also load a new address if a branch instruction is being executed.
-
Instruction Decode: This stage decodes the instruction into important components, such as register addresses, immediate values, function codes, and the opcode itself. It sets up control words to affect the rest of the processor state. In this stage, it also retrieves the values addressed by the instruction from the register file.
-
Execute: This stage takes the resulting register values and control words and performs the operation, which may be doing arithmetic with the Arithmetic Logic Unit (ALU), computing branch address, memory access address, and more.
-
Data Memory: This stage,accesses the memory available to the processor and writes or retrieves value from some computed address from the Execute stage. Note that not every instruction makes use of this stage.
-
Write Back: This stage is where instructions are "retired", i.e. completed. New register values are written back to the register file and the instruction is completed.
Introductory processor design will pipeline these stages. A large register (flip flop logic) will store relevant information between each stage, and pass it on. This allows you to start processing multiple instructions at the same time. There's a fair bit of care to deal with flushing the pipeline for a branch, stalling when memory takes too long and so on, but this generally allows you to increase the speed of your processor significantly.
Out of Order Execution
Modern processors follow an out-of-order execution design. The processor will have a queue and a reservation station to hold instructions and their current state, and each stage can execute independently. This allows you to compute multiple instructions at once, not have instructions go through stages it doesn't need, and compute instructions out of order when there aren't strong dependencies.
The technical details of this are quite complex, and the processor we get is an OOO processor. Luckily, these behaviors are not an integral part of the challenge, but it's important to understand this design paradigm when looking at the Verilog code. A basic description of instruction scheduling can be viewed on Wikipedia.
HDL Basics
The first thing to do when given an HDL of a processor is find opcode definitions. This allows us to note any extensions or custom instructions implemented in the processor.
//lists/ops.v
`INSTR(ARITH, 7'b0110011, R, ALU)
`INSTR(ARITH_IMM, 7'b0010011, I, ALU)
`INSTR(LUI, 7'b0110111, U, ALU)
//...
`INSTR(ECALL, 7'b1110011, I, BU)
`INSTR(ERET, 7'b1110010, J, BU)
`INSTR(SETPRIV, 7'b0110100, R, PU)
`INSTR(FLAG, 7'b1110000, U, IOU)
`INSTR(WELCOME, 7'b1110001, U, IOU)
The ops.v
file implements all the basic instructions, as well as ECALL/ERET
and custom instructions SETPRIV,FLAG,WELCOME
. The letter tells us the type of instruction, which can be viewed on the spec. R type instructions take in 2 source registers and a destination register, whereas U type instructions take 1 destination register and a large immediate value. I type instructions take a source register and an immediate register, and J type are like U type, although construct the immediate differently.
ECALL
is an environment call. Think of it like syscall
on x86. It allows for privileged execution on the kernel from user code.
//core.sv
`RESERVATION_STATION(alu_rs, ALU, alu_done, alu_curr_op, alu_rop1, alu_rop2, alu_op1, alu_op2)
alu alu (
.lhs ( alu_op1 ),
.rhs ( alu_op2 ),
.lhs_valid ( alu_rop1 ),
.rhs_valid ( alu_rop2 ),
.op_spec ( alu_curr_op ),
.result ( alu_result ),
.result_valid ( alu_done )
);
In core.sv, we can see an example of an individual unit, in this case the alu, being hooked up to the reservation station for out of order execution.
//io_unit.sv
always_ff @ (posedge clk) if (!rst) begin
if (retire_i) begin
if (cpl_i != SUPERVISOR)
$fatal(1, "permission denied");
else
case (`OPC(op_spec_i.insn))
`OPC_FLAG : begin
string flag;
int fd = $fopen("flag.txt","r");
$fscanf(fd, "%s", flag);
$display("%s", flag);
end
`OPC_WELCOME : $display("Welcome to CORCPU!");
default: $fatal(1, "invalid op for io unit");
endcase
$fflush();
end
end
The io unit provides the processor (usually non-synthesizable) functions to print to the terminal output. In this case, if the cpl
or current privilege level is supervisor, then it can print the flag file or a welcome message.
//cpu.sv
int c;
initial begin
for (int i = 0; i < 64'h10000; i++)
cache.imem[i] = 0;
$readmemh("/tmp/user_rom.rom", cache.imem, 0, 4096);
$readmemh("system_rom.rom", cache.imem, 16384);
$display("loaded roms"); $fflush();
clk = 0;
The cpu file provides the initialization of the cores and rom. In this case, we can see that our rom is loaded between 0x0000-0x1000
and the system rom is loaded into 0x4000
.
//bu.sv
`OPC_ECALL : begin
result_pc = `PRIV_ROUTINE_START;
result_pc_valid = 1'b1;
done_o = 1'b1;
end
`OPC_ERET : begin
result_pc = sys_return_addr;
result_pc_valid = 1'b1;
done_o = 1'b1;
end
default : begin end
endcase
if (result_pc_valid && result_pc >= `PRIV_ROUTINE_START
&& !(`OPC(op_spec.insn) == `OPC_ECALL || cpl_i == SUPERVISOR)) begin
$fatal(1, "illegal jump");
end
In the branch unit, we can see that the only way to jump to or within system rom is to use ECALL
, or be a supervisor. If we aren't the processor will throw an error. The ECALL
is very simple, it just jumps to the start of privileged memory 0x4000
.
At this point, we should take a look at the privilege unit.
The Privilege Unit
The privilege unit manages the current privilege level wire cpl
within the processor.
In the first challenge, C(OOO)RCPU
:
always_ff @ (posedge clk) if (!rst) begin
if (lhs_valid) done_o <= 1;
if (retire_i) begin
case (lhs)
1'b0 : cpl_o <= USER;
1'b1 : cpl_o <= SUPERVISOR;
default: $fatal(1, "invalid cpl for privilege unit");
endcase
done_o <= 0;
end
end
As we can see, if the SETPRIV
instruction is called, it checked for the first bit of the first register value to set the level. We can call SETPRIV
with a set register, and then call FLAG
to win. This challenge was quite simple.
C(OOOO)RCPU
was more involved. The privilege unit had one extra protection, we must be in system rom to set the privilege level.
always_ff @ (posedge clk) if (!rst) begin
if (lhs_valid) done_o <= 1;
if (retire_i) begin
case (lhs)
1'b0 : cpl_o <= USER;
1'b1 : if (curr_fetch_pc_i >= `PRIV_ROUTINE_START)
cpl_o <= SUPERVISOR;
else
$fatal(1, "pu: permission denied");
default: $warning(1, "invalid cpl for privilege unit");
endcase
done_o <= 0;
end
end
always_ff @ (posedge clk) if (rst) begin
cpl_o <= USER;
done_o <= 0;
end
This meant that we had to call ECALL
first, which had preloaded code, if we wanted to change the privilege level. Not great. However, there's one aspect about this unit that I noticed. It uses the current fetched program counter to determine level. This value is not passed with an instruction, rather more like a global state of the processor.
This meant that if we could get the fetched PC to be in the system region, and execute a SETPRIV,FLAG
instructions with that fetched PC, then the privilege unit wouldn't care that it came from user rom and execute FLAG
with our new privilege level.
Fetching Instructions
The instruction fetch unit has one critical piece of code:
//ifu.sv
always_ff @(posedge clk) if (!rst) begin
if (imem_load_rdy_i) begin
if (`OPC(imem_load_insn_i) == `OPC_BRANCH || `OPC(imem_load_insn_i) == `OPC_JAL || `OPC(imem_load_insn_i) == `OPC_JALR || `OPC(imem_load_insn_i) == `OPC_ECALL|| `OPC(imem_load_insn_i) == `OPC_ERET)
branch_in_pipeline <= 1; //ow branc
else
insn_load_pc <= insn_load_pc + 'h4; //get next pc
end
Every clock cycle that the instruction memory is ready, if there wasn't some type of branch, the ins_load_pc
was incremented. This ins_load_pc
value was directly wired to the same curr_fetch_pc_i
in the privilege unit.
This gives us an idea, if we run a slow instruction, that's not a branch, then the ifu will fetch multiple times, maybe before we execute an instruction like SETPRIV
, then we can bypass the restriction on the privilege unit.
Exploit
I wanted to learn more about that readmemh
function Verilog uses to read out code. This is because the `system_rom`` file was a series of bytes, written as ASCII hex:
13
01
01
fe
...
These bytes correspond to instructions in little endian (which makes it particularly annoying to parse). For example, this first instruction was:
addi x2, x2, -32
A smart version of me would've disassembled the system rom and found a much easier and intended solve, but instead I found this obscure documentation file about readmemh
.
In it, it states:
To read data from a file and store it in memory, use the functions: $readmemb and $readmemh. The $readmemb task reads binary data and $readmemh reads hexadecimal data. Data has to exist in a text file. White space is allowed to improve readability, as well as comments in both single line and block. The numbers have to be stored as binary or hexadecimal values. The basic form of a memory file contains numbers separated by new line characters that will be loaded into the memory.
This explains why its written as hex, but then it goes on to say:
When a function is invoked without starting and finishing addresses, it loads data into memory starting from the first cell. To load data only into a specific part of memory, start and finish addresses have to be used. The address can be explicit, given in the file with the @ (at) character and is followed by a hexadecimal address with data separated by a space. It is very important to remember the range (start and finish addresses) given in the file, the argument in function calls have to match each other, otherwise an error message will be displayed and the loading process will be terminated.
This is corroborated with the IEEE standard:
Loading shall continue to follow this direction even after an address specification in the data file.
When addressing information is specified both in the system task and in the data file, the addresses in the data file shall be within the address range specified by the system task arguments; otherwise, an error message is issued, and the load operation is terminated.
This @
syntax is interesting, it lets us place code in a specific place. Although it seems like we can't place code outside of the range given to us from readmemh
(0x1000
). However, the challenge uses Verilator to compile and simulate the processor. Verilator, while good, is not a 1:1 for spec, and can run into issues.
Let's see what gets called when we construct our cpu with Verilator:
VL_READMEM_N(true, 8, 65536, 0, VL_CVT_PACK_STR_NW(5, __Vtemp_2)
, &(vlSelf->cpu_entry__DOT__cache__DOT__imem)
, 0U, 0x1000U);
This VL_READMEM_N
has the following check in the Verilator source:
void VL_READMEM_N(bool hex, // Hex format, else binary
int bits, // M_Bits of each array row
QData depth, // Number of rows
int array_lsb, // Index of first row. Valid row addresses
// // range from array_lsb up to (array_lsb + depth - 1)
const std::string& filename, // Input file name
void* memp, // Array state
QData start, // First array row address to read
QData end // Last row address to read
) VL_MT_SAFE {
//...
VlReadMem rmem{hex, bits, filename, start, end};
//...
if (rmem.get(addr /*ref*/, value /*ref*/)) {
if (VL_UNLIKELY(addr < static_cast<QData>(array_lsb)
|| addr >= static_cast<QData>(array_lsb + depth))) {
VL_FATAL_MT(filename.c_str(), rmem.linenum(), "",
"$readmem file address beyond bounds of array");
}
The VlReadMem.get
function returns its internal m_addr
, which can load @
addresses:
if (reading_addr) {
// Decode @ addresses
m_addr = (m_addr << 4) + value;
}
Note that it crashes when m_addr
is outside the array_lsb+depth
, which in our case is 0x10000
, it NEVER checks whether a loaded address with @
is outside of QData end
!
When testing, we loaded memory at an address outside of the range, so our rom looked like
13 05 d5 3f
13 15 45 00
67 00 05 00
@00003fd0
73 00 00 00
...and it didn't crash. So Verilator does not check @
being outside of end adresses.
So we can write a small program like:
addi x10, x10, 1021
slli x10, x10, 4
jalr x0, 0(x10)
@00003fd0
nop
nop
nop
nop
nop
sw x1, 1000(x2)
lw x1, 1000(x2)
sw x1, 800(x2)
addi x5, x0, 1
setpriv x5
flag
and with a small python wrapper:
code = """13 05 d5 3f
...
""" #omitted for brevity.
code = base64.b64encode(code.encode())
r = remote("mc.ax", 31442)
r.sendline(code)
r.interactive()
We get our flag.
[C(OOOO)RCPU]
loaded roms
dice{d0nt_f0rget_t0_sta11}
The intended solution was to disassemble the system_rom file, see that there is a SETPRIV, ERET
gadget, reachable from a branch instruction based on x2
and x4
being set, as ECALL
and the system_rom itself does not clear extraneous registers before executing. Oh well.
TLDR
The privilege unit uses the fetched PC to set privilege level, and due to a bug in Verilator we can arbitrarily place code anywhere less than 0x4000
. Thus, we run slow commands to cause the instruction fetch unit to fetch past 0x4000
, which satisfies the pu's condition, and run the flag opcode.
from pwn import *
ECALL = "73 00 00 00"
FLAG = "70 00 00 00"
NOP = "13 00 00 00"
code = f"""13 05 d5 3f
13 15 45 00
67 00 05 00
@00003fd0
{NOP}
{NOP}
{NOP}
{NOP}
{NOP}
23 24 11 3e
83 20 81 3e
23 20 11 32
93 02 10 00
34 80 02 00
{FLAG}
"""
code = base64.b64encode(code.encode())
r = remote("mc.ax", 31442)
r.sendline(code)
r.interactive()