About This Project

This project contains all of the SystemVerilog HDL, scripts, tools, software support, and programs (C) for my 2024 CPU Design/Computer Architecture Independent Study project. My goal for this project was to implement a fully functional RISC-V rv32im processor that could be synthesized for an ICE40 FPGA on my Upduino board. I've ended up doing a bit more than just that (a pipelined version for an ECP5 FPGA, many peripherals, and a software simulator!).

See Project Structure for more details.

A big thanks to all of the folks who helped create the OSS CAD Suite, which has been essential for all of my FPGA and HDL tinkering, including this project.

Project Structure

This repository contains the implementation of all of the components required for hardware synthesis, simulation, verification, and software support of the full system on a chip (SoC). The SoC can be used and tested either as a software simulation or synthesized to run on three different FPGA boards.

This project contains two separate implementations of the RISC-V rv32im architecture: a simple, minimal multi-cycle design and a performance-optimized pipelined design.

CPU component architecture.

Each CPU is built from the following major components:

  • The processor itself (control logic, instruction decoding, etc.)
  • The arithmetic logic unit (ALU) used by the CPU, which contains the implementation of all arithmetic instructions
  • A bus, which allows the CPU to read instructions from a memory device (non-pipelined version) and interface with other miscellaneous IO/peripheral devices through a bus hub.
  • More Implementation Details Here

Bus Architecture

This system on a chip is designed around a central bus hub which serves as the interface between the CPU and multiple bus devices, including peripherals and memory. More information can be found in the Bus Architecture section.

FPGA-Oriented Peripherals

This project contains four specialized bus devices (peripherals) designed for use on FPGAs.

  • A minimal parallel port implementation is provided
    • 32 bits of input and 32 bits of output (separate wires)
    • Controlled with a single 32-bit word (configurable address).
    • Synchronous: updates its output bits on the positive edge of the clock when write enable is asserted and the device is active.
  • SPI (Serial Peripheral Interface) Host Controller
    • A more efficient alternative to bit banging SPI with the parallel port.
    • Configurable clock reduction with 30-bit resolution
      • Why is this useful? No idea. You could make chiptune music with it, I guess.
    • Automated SPI transactions through an internal finite state machine.
    • Three words of control and data registers (see the sketch after this list)
      • BASEADDR + 0: Status register
        • Bit 0: Transaction finished
        • Bit 1: Busy
      • BASEADDR + 4: Control register
        • Bit 0: Start transaction
        • Bit 1: Hardware chip-select level
        • Bits 2-31: 30-bit clock divider (sclk_freq = clock_in_freq/clock_divider)
  • GPIO Bank (for ICE40 devices)
    • Bidirectional IO pins
    • Configurable number of IO (up to 32)
  • HUB75 LED Matrix Display Driver
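
As a concrete illustration of the SPI register layout listed above, here is a small, hypothetical SystemVerilog sketch of how the control word could be assembled. The base address and all names here are assumptions for illustration only, not the actual register file implementation.

// Hypothetical sketch: assembling the SPI control word described above.
// SPI_BASE and all names are assumptions for illustration only.
localparam logic [31:0] SPI_BASE    = 32'h8000_0000; // assumed base address
localparam logic [31:0] SPI_STATUS  = SPI_BASE + 0;  // bit 0: finished, bit 1: busy
localparam logic [31:0] SPI_CONTROL = SPI_BASE + 4;  // start, CS level, clock divider

// Bit 0 = start transaction, bit 1 = chip-select level,
// bits 31:2 = 30-bit clock divider (sclk_freq = clock_in_freq / clock_divider).
function automatic logic [31:0] spi_control_word(
    input logic        start,
    input logic        cs_level,
    input logic [29:0] clock_divider
);
    return {clock_divider, cs_level, start};
endfunction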

Bus Design

All peripherals and processors in this project share a consistent bus interface. This bus interface is custom and unique to this project. Why not use Wishbone or some other pre-defined and standardized bus? I chose to create my own bus architecture because designing every component in this project from scratch has allowed me to gain a more intuitive understanding of computer architecture in general. If I chose to use something like Wishbone in this project, I would be building on something that many experienced people have already put a lot of thought into rather than doing so myself. In creating my own bus architecture, I made these design decisions myself and gained an understanding of why things like Wishbone exist, and why they are designed the way they are.

Interface

The bus hub has a consistent interface for devices (peripherals, memory, etc.) as well as for the host (CPU).

Device interface:

// Device ports (one set per device)
output wire [31:0] device_address, // Address requested by the HOST
output wire [31:0] device_data_write, // Write word
output wire [3:0] device_write_mask, // Byte mask for write
output wire [0:0] device_ren, // Read-enable
output wire [0:0] device_wen, // Write-enable
input wire [0:0] device_ready, // Device ready (finished)
input wire [31:0] device_data_read, // Read word
input wire [0:0] device_active // Device-self-select (see explanation)

Host interface:

input wire [31:0] host_address, // Bus transaction address
input wire [31:0] host_data_write, // Write word
input wire [3:0] host_write_mask, // Write byte mask
input wire host_ren, // Read-enable
input wire host_wen, // Write-enable
output reg [31:0] host_data_read, // Read word
output reg host_ready // Transaction finished

On the device side, the bus hub provides wires for a 32-bit address, a 32-bit write word, a 32-bit read word, a 4-bit write mask (indicating which of the 4 separate bytes in the 32-bit word to write), a read-enable bit, a write-enable bit, and a ready bit (controlled by the device).

Demultiplexing

The bus hub incorporates an "active" (self-select) input from each separate bus device and selects which device has access to the host based on these active bits. The active bits of all devices connected to the bus hub are fed into a priority encoder; the highest-priority device that is asserting its active signal receives the read-enable, write-enable, and data signals from the host. The host in turn receives the ready signal and read word from this selected device.

If no devices assert the active signal on the bus hub, the host will read back all zeros and will receive a ready signal of zero.
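
As a rough illustration of this selection behavior, here is a minimal sketch for a hub with two devices. All signal names are assumptions; the real hubs in this project are generated rather than hand-written (see hubgen.py below).

// Minimal sketch of the hub's device-selection logic for two devices,
// based on the self-select ("active") scheme described above.
always_comb begin
    if (device0_active) begin            // device 0 has the highest priority
        host_data_read = device0_data_read;
        host_ready     = device0_ready;
    end else if (device1_active) begin
        host_data_read = device1_data_read;
        host_ready     = device1_ready;
    end else begin
        host_data_read = 32'b0;          // no device selected: read back zeros
        host_ready     = 1'b0;           // and the host never sees ready
    end
end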

This bus architecture allows for devices to manage their own memory space with respect to addressing on the bus, which requires less configuration of the bus hub and allows for more specialized logic inside of each device for addressing. Offloading address-management onto the peripherals allows for more flexibility in designing peripherals (allows me to mess around with things more), but is not a good design choice for systems with many devices, or systems with peripherals consisting of external IP that have unknown addressing behavior.
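
For example, a device that owns a small, contiguous, word-aligned region might compute its active signal as sketched below. The parameter names and region size are assumptions for illustration, not taken from any particular peripheral in this project.

// Sketch of a device's self-select logic: assert "active" whenever the
// host's address falls inside this device's (assumed) address region.
localparam logic [31:0] BASE_ADDRESS = 32'h0000_1000; // assumed base address
localparam int unsigned NUM_WORDS    = 4;             // assumed region size

assign device_active =
    (device_address >= BASE_ADDRESS) &&
    (device_address <  BASE_ADDRESS + 32'(NUM_WORDS * 4));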

A Python script, hubgen.py, is provided for generating bus hubs with an arbitrary number of devices, due to my difficulties with SystemVerilog's generate features.

Timing Diagrams

  • Write Timing Diagram
  • Read Timing Diagram
  • Read-Write Timing Diagram
  • Multiple Read Timing Diagram

Memory Implementations

Two memory implementations are currently provided for use on the bus:

  • Simple
    • Implemented as a SystemVerilog unpacked array
    • Can be configured for any arbitrary number of words
    • Arbitrary start address
    • Can contain initialization data provided as a binary or hex file.
    • Intended for use on any FPGA that supports memory (BRAM) inference from unpacked SystemVerilog arrays, as well as for simulation (see the sketch after this list).
  • ICE40 SPRAM
    • Uses the four 256-kilobit SPRAM memories of ICE40 UltraPlus-series FPGAs
    • 32K words of random access memory
    • Does not allow initialization data.
    • Arbitrary start/base address
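
The core of the simple memory is just an unpacked array with a byte-masked write, roughly as sketched below. This is an illustrative fragment only; clk, word_index, and NUM_WORDS are assumed to be provided by the surrounding module, and word_index would be derived from device_address and the configured base address.

// Unpacked array storage: infers BRAM on FPGAs that support it.
logic [31:0] mem [0:NUM_WORDS-1];

always_ff @(posedge clk) begin
    if (device_active && device_wen) begin
        // Byte-masked write: only the bytes selected by the mask are updated.
        if (device_write_mask[0]) mem[word_index][7:0]   <= device_data_write[7:0];
        if (device_write_mask[1]) mem[word_index][15:8]  <= device_data_write[15:8];
        if (device_write_mask[2]) mem[word_index][23:16] <= device_data_write[23:16];
        if (device_write_mask[3]) mem[word_index][31:24] <= device_data_write[31:24];
    end else if (device_active && device_ren) begin
        device_data_read <= mem[word_index];
    end
end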

CPU Implementation

This project contains two separate, fully functional processor implementations.

  • The first, a simple multi-cycle, non-pipelined design, is built with logic size in mind and has been optimized to fit on small, inexpensive FPGAs (ICE40 UP or HX).
  • The second is a more complex pipelined design, built for performance rather than logic size; at the moment it cannot fit on any of my small, inexpensive FPGAs, and it additionally requires dual-port random access memory for a Von Neumann architecture.
    • With dual-port memory, this pipelined processor can fuse the program and data (bus access) ports so that they operate on the same memory space.
    • With single-port RAM, data and program memory must be split (Harvard architecture).

ALU

Both of these CPUs rely on a shared arithmetic logic unit implementation, which handles all numerical and arithmetic instructions in the RV32IM instruction set (including those of the multiplication and division extension). The arithmetic logic unit is a compromise between speed and logic size.

  • A few logic size reduction techniques are used (see the sketch after this list):
    • A single barrel shifter for both left and right shifts, rather than two (result: fewer LUTs, increased critical path length)
    • Sharing a single multiplier for the various multiplication instructions with input and output multiplexers (result: fewer hardware multipliers/LUTs, increased critical path length)
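
The shared-shifter trick can be sketched as below. This is an illustrative fragment with assumed signal names, and it only covers logical shifts; arithmetic right shifts would need additional sign handling.

// One right barrel shifter serves both directions: for a left shift the
// operand is bit-reversed before and after the (right) shift.
logic        is_left_shift;
logic [4:0]  shift_amount;
logic [31:0] shift_in, shift_in_rev, shifted, shifted_rev, shift_result;

assign shift_in_rev = {<<{shift_in}};   // bit-reversed operand
assign shifted      = (is_left_shift ? shift_in_rev : shift_in) >> shift_amount;
assign shifted_rev  = {<<{shifted}};    // bit-reverse the shifted result
assign shift_result = is_left_shift ? shifted_rev : shifted;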

The ALU takes input of:

  • both operands for the instruction (immediates/registers are multiplexed in the CPU's control logic)
  • a flag indicating whether the instruction contains an immediate value or not (affects decoding)
  • a ready flag (start signal)
  • funct3 and funct7 fields of the decoded instruction

From these, the ALU produces a 32-bit output as well as a done flag, which is always asserted unless a division instruction is requested, in which case the done flag stays low until the division procedure has completed.

The ALU implements a 32-cycle division process, designed with logic size and simplicity in mind rather than performance.
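
In the spirit of that divider, here is a minimal sketch of 32-cycle unsigned restoring division, producing one quotient bit per clock. All names are illustrative assumptions (clk, start_div, etc. would come from the surrounding ALU), and the sign handling needed for the signed RISC-V division instructions is omitted.

// One quotient bit per cycle; "quotient" doubles as the shifting dividend.
logic [31:0] dividend, divisor, quotient;
logic [32:0] remainder;
logic [5:0]  cycles_left;
logic        busy, done;

always_ff @(posedge clk) begin
    if (start_div) begin
        remainder   <= '0;
        quotient    <= dividend;
        cycles_left <= 6'd32;
        busy        <= 1'b1;
        done        <= 1'b0;
    end else if (busy) begin
        // Shift in the next dividend bit, subtract the divisor if it fits.
        logic [32:0] next_rem;
        next_rem = {remainder[31:0], quotient[31]};
        if (next_rem >= {1'b0, divisor}) begin
            remainder <= next_rem - {1'b0, divisor};
            quotient  <= {quotient[30:0], 1'b1};
        end else begin
            remainder <= next_rem;
            quotient  <= {quotient[30:0], 1'b0};
        end
        cycles_left <= cycles_left - 6'd1;
        if (cycles_left == 6'd1) begin
            busy <= 1'b0;
            done <= 1'b1; // quotient and remainder[31:0] are now valid
        end
    end
end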

Non-Pipelined CPU (V1)

The non-pipelined CPU is a state machine that progresses through four separate stages to execute a single instruction.

  • Note: some instructions only use three of the four stages.

1. Fetch

The instruction fetch stage uses the CPU's bus host interface to read a 32-bit instruction from the current program counter and decodes various immediate values and instruction parameters. The fetch stage also loads the values of any operand registers if applicable.

2. Decode

In the decode stage:

  • Operands are loaded into the ALU and ALU control signals are set.
  • Branch conditions are calculated and the load store address is calculated if applicable.
  • The ALU is signaled to start its operation.
  • The control logic of the CPU decides whether to progress to the execute or write back stage.
    • Only multi-cycle ALU instructions, load instructions, and store instructions require the execute stage.
    • If the execute stage is not required, the state machine skips it and goes directly to write-back.

3. Execute

The execute stage's sole function is to wait for completion of either an ALU operation or a memory transaction which was started in the decode stage. The CPU will simply stay in execute until said transaction completes and will then progress to the write back stage.

4. Write-Back

In the write back stage:

  • The program counter is updated to the next calculated program counter (whether that is the current program counter + 4 or a specific jump or branch target, which is calculated combinationally)
  • The ALU is signaled to flush any data currently stored
  • The calculated result is written to a destination register if applicable.
    • This write back value is multiplexed from the ALU, memory loads, and immediates.

In the write back stage, the processor also asserts an "instruction sync" signal which is used only internally for verification of the processor. (synchronization with the simulator)
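
Putting the four stages together, the control flow can be sketched as a small state machine like the one below. This is illustrative only; the signal names (bus_ready, needs_execute, operation_done) are assumptions rather than the actual control signals.

// Sketch of the four-stage control state machine. Instructions that need
// no execute stage skip directly from DECODE to WRITEBACK.
typedef enum logic [1:0] {FETCH, DECODE, EXECUTE, WRITEBACK} cpu_state_t;
cpu_state_t state;

always_ff @(posedge clk) begin
    unique case (state)
        FETCH:     if (bus_ready) state <= DECODE;           // instruction word arrived
        DECODE:    state <= needs_execute ? EXECUTE : WRITEBACK;
        EXECUTE:   if (operation_done) state <= WRITEBACK;   // ALU or memory finished
        WRITEBACK: state <= FETCH;
    endcase
end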

This processor can achieve ~4.7 MIPS (million instructions per second) running the Dhrystone benchmark at 25 MHz on a Lattice ICE40 UP5K FPGA (Upduino board), and it meets timing at that clock speed.

Pipelined CPU (V2)

The second version of the CPU in this project is a pipelined version of my original processor. It is a complete rewrite, optimized for speed and simplicity rather than logic size.

Understanding of Pipelining

While I was writing the first version of the processor, my goal was to minimize logic size and be able to run it on a small FPGA rather than to have a high-performing core. During my research, I was intrigued by the idea of pipelining and being able to get so much more performance out of the same architecture by using a different structure and implementation inside of the CPU. I thought that it would be an arduous task to implement a pipelined processor, but once I made sense of what pipelining really was, it was quite intuitive.

The way I think about it, pipelining is like doing your laundry.

  • My non-pipelined CPU is like putting dirty clothes in the washer, washing them, taking them out, putting them in the dryer, drying them, taking them out of the dryer, and hanging them up on your clothes rack before going and putting the next load in the washer.
  • Pipelining is doing laundry the way you normally do laundry.
    • Take dirty clothes and put them in the washer.
    • When that's done, take whatever is in the dryer and set it aside.
    • Then take what was in the washer, put it in the dryer, put a new load in the washer, and start both at the same time.

Doing laundry is a simple real-world example of a two-stage pipeline that helped me develop a lot of the intuition necessary for putting together this pipelined implementation.

Implementation

The way I've implemented my pipeline processor is with four separate stages:

  • Fetch
  • Decode
  • Execute
  • Write-Back

Wait, those are the same stages as in the non-pipelined version? How can that be? In this version, the stages are not part of a state machine, but rather separate entities that work in parallel and move data through every clock cycle.

  • Each stage has registers to maintain the state of that stage
  • Every clock cycle, each stage of the pipeline checks if the next stage is open, and if so, shifts its data forward one step in the pipeline.
  • Stages can block (stall) to prevent data from being shifted in when that stage's work is not done.

The most important things necessary for proper operation of the pipelined processor are the status flags in each stage that signal whether that stage holds valid data that is being processed, and whether that stage will be "open" at the next clock cycle for new data to be shifted in. These status flags are either registered (for indicating valid data) or combinationally determined (in the case of whether the stage is open for new data). Some stages, such as the execute stage, can block for multiple cycles (when an operation such as division is happening in the ALU).
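
A minimal sketch of those per-stage flags might look like the following, with illustrative names (not the actual implementation): a registered valid flag plus a combinational open flag that tells the previous stage whether it may shift data in.

logic stage_valid;      // registered: this stage currently holds real work
logic stage_busy;       // e.g. waiting on a multi-cycle ALU op or the bus
logic next_stage_open;  // "open" flag coming from the following stage
logic stage_open;

// Open if empty, or if the current work will move on this clock edge.
assign stage_open = !stage_valid || (!stage_busy && next_stage_open);

always_ff @(posedge clk) begin
    if (stage_open)
        stage_valid <= prev_stage_valid; // shift data (and its valid bit) in
end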

Pipelined Memory Model

One of the more unique aspects of my pipelined processor is the memory architecture. I didn't really like the idea of having a (modified) Harvard architecture with separate program and data memory, which is typical for pipelined processors. (Modified Harvard uses shared program/data main memory but split program/data caches, and adds more complexity than I would like for this project.) For this project, and for consistency between the pipelined and non-pipelined versions of the CPU, I would much rather have a Von Neumann architecture where program and data memory are one and the same.

To accomplish this while still being synthesizable for the target FPGA (ECP5, for this pipelined version), I had to use a few tricks.

  • The memory for this processor (and SoC) is contained within the CPU logic.
  • Rather than having a bus-host interface for the memory, this processor has both a host port (peripherals/memory) and a device port that uses one of the read/write ports on the dual-port RAM used for program and data.
  • The other port of the dual-port RAM is used internally by the CPU for reading program data (see the sketch after this list).
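
Roughly, the arrangement is the one sketched below. The names are assumptions for illustration, and byte masking on the data port is omitted: port A serves instruction fetch inside the CPU, while port B is exposed through the processor's bus device port.

logic [31:0] mem [0:NUM_WORDS-1];

// Port A: internal to the CPU, read by the fetch stage.
always_ff @(posedge clk)
    fetch_instruction <= mem[fetch_word_index];

// Port B: exposed as the processor's bus device port for loads and stores.
always_ff @(posedge clk) begin
    if (bus_wen)
        mem[bus_word_index] <= bus_data_write;
    bus_data_read <= mem[bus_word_index];
end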

Fetch

In the fetch stage of my processor, the next program counter is determined: either simply the next program counter in sequence, or a branch/jump target if there is a branching instruction in the write-back stage.

  • A single word, the instruction, is read directly from the memory contained within the CPU and passed into the decode stage if the decode stage is open.
  • Several checks must pass before the fetch stage actually puts data into the decode stage:
    • The decode stage needs to be open for data (that is, none of the pipeline stages after it are stalled).
    • The fetch stage does not put any data in the decode stage if the instruction has a data dependency (it requires a register value that has not yet been written by the pipeline).
      • Exception: register forwarding (we'll get to that later).
    • The fetch stage will not place data in the decode stage if "unsafe execution" is occurring.

"Unsafe execution" is defined in this implementation as there being an instruction in the decode, execute, or write back stage that will affect the program counter of the next executed instruction (a branch or a jump instruction). This processor does not implement any sort of branch prediction for simplicity, so all branches and jump instructions are treated as possibly changing the program counter, and thus, unsafe execution.

Decode

Once an instruction reaches the decode stage, the opcode, the various immediates, and the register values are decoded and registered into the execute stage. Although most of these values could be determined combinationally, I have found that registering them in the decode stage improves the timing characteristics by putting more flip-flops in the data path (which reduces the critical path length for these signals).

Note: in the decode, execute, and write-back stages, flags are raised (combinationally determined) indicating whether the instruction in each stage has a data dependency or is unsafe. These flags are used by the fetch stage to determine whether instructions should be flowing into the pipeline or not.

Execute

Once the instruction reaches the execute stage, it is further decoded into the ALU control signals. Register operands and immediates are multiplexed into the ALU inputs and bus interface. The execute stage is the only stage in the pipeline that can stall, and it will do so if the ALU takes multiple cycles to execute the instruction (division) or if the bus is held (not finished) for multiple cycles during a read or write.

Note: if using the dual-port memory built into the CPU, the execute stage will never take more than one cycle for loads and stores.

Write-Back

When the execute stage is finished, data from the execute stage is written into the write back stage. Note: The write back stage is always open and never blocks.

The write-back stage's task is to:

  • write back the currently calculated result into a destination register (if applicable)
  • forward any possible program counter changes back to the fetch stage so execution can resume after a conditional branch or jump.

Register Forwarding

One performance enhancing feature that I have added is register forwarding.

If a data dependency is detected, instructions will still progress from the fetch stage to the decode stage. Rather than the fetch stage waiting until the dependency is resolved, the decode stage will wait (only one cycle) and then load the data not from the register file but directly from the result of the execute stage. This optimization saves one clock cycle in the case of consecutive instructions with a data dependency and significantly increases the performance of this implementation without adding significant complexity.

Forwarding is accomplished through a number of simple checks on the indices of the registers present in the execute and decode stages of the processor, and by multiplexing forwarded values with register reads in the decode stage.
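
Those checks amount to something like the following sketch for one source register, with illustrative names (x0 is excluded because it always reads as zero in RISC-V).

// Forward the execute-stage result if it targets the register that the
// instruction in decode is about to read.
logic forward_rs1;

assign forward_rs1 = execute_valid
                  && execute_writes_rd
                  && (execute_rd_index != 5'd0)
                  && (execute_rd_index == decode_rs1_index);

assign rs1_value = forward_rs1 ? execute_result : regfile_rs1_value;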

Note:

  • The memory inside of the pipelined version of the processor is initializable when used on certain FPGA targets (ECP5).
  • Currently this processor only supports the Lattice ECP5 FPGA, because it is the only FPGA I could get my hands on (cheaply enough) that is large enough to synthesize and run an SoC built around this processor.
  • With no peripherals, the base configuration of this pipelined processor does not require a bus hub.
    • The bus host port of the processor can simply be connected to its bus device port (that is, the memory access port), and it will run as an entirely self-contained system.
  • To add peripherals, a bus hub is inserted between the host and device (memory) ports of the processor, and any number of peripherals can then be connected using the exact same bus architecture as the non-pipelined version of the processor.