7.5-9× lower tail latency for batch-1 attention microkernels: a CVXIF coprocessor that extends RISC-V with custom INT8 MAC instructions, optimized for transformer inference in on-SoC deployments.
Garuda is a CVXIF coprocessor that extends RISC-V with custom INT8 multiply-accumulate (MAC) instructions, optimized for batch-1 tail latency (p99). Ideal for real-time transformer inference, voice assistants, and local LLM attention workloads.
Key Achievement: 7.5-9× latency reduction vs. modeled baseline for attention microkernels (p99: 307→34 cycles).
git clone https://github.com/certainly-param/garuda-accelerator.git
cd garuda-accelerator
git submodule update --init --recursive
# Run simulation
iverilog -g2012 -o sim_test.vvp garuda/tb/tb_attention_microkernel_latency.sv garuda/rtl/attention_microkernel_engine.sv
vvp sim_test.vvp

Workload: Q·K dot product (K=128 INT8 elements)
| Metric | Baseline | Garuda | Improvement |
|---|---|---|---|
| p50 latency | 256 cycles | 34 cycles | 7.5× |
| p95 latency | 291 cycles | 34 cycles | 8.6× |
| p99 latency | 307 cycles | 34 cycles | 9.0× |
Measured via tb_attention_microkernel_latency.sv (1000 trials). The baseline models a CPU-style SIMD_DOT loop with dispatch jitter.
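For reference, the benchmarked workload is equivalent to this scalar INT8 dot product (a minimal model for illustration; the function name is not from the repo):

```c
#include <stdint.h>

// Scalar reference for the benchmarked kernel: a Q·K dot product over
// K=128 signed 8-bit elements, accumulated in 32 bits.
static int32_t qk_dot_ref(const int8_t q[128], const int8_t k[128]) {
    int32_t acc = 0;
    for (int i = 0; i < 128; i++) {
        acc += (int32_t)q[i] * (int32_t)k[i];
    }
    return acc;
}
```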
| Operation | Standard RISC-V | With Garuda | Speedup |
|---|---|---|---|
| Single MAC | 2 instructions | 1 instruction | 2× |
| 4-elem dot product | 16 instructions | 1 instruction | 16× |
| MAC latency | 5-8 cycles | 3-4 cycles | 1.6-2× |
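The 16-to-1 reduction for the 4-element dot product reflects SIMD_DOT consuming two packed words per issue. A C model of the semantics implied by the table (lane order is assumed little-endian; verify against the RTL):

```c
#include <stdint.h>

// Assumed SIMD_DOT semantics: four INT8 lanes per 32-bit word,
// multiplied pairwise and added to a 32-bit accumulator.
static int32_t simd_dot_ref(int32_t acc, uint32_t a_packed, uint32_t b_packed) {
    for (int lane = 0; lane < 4; lane++) {
        int8_t a = (int8_t)(a_packed >> (8 * lane));
        int8_t b = (int8_t)(b_packed >> (8 * lane));
        acc += (int32_t)a * (int32_t)b;
    }
    return acc;
}
```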
| Instruction | Opcode | Description | Latency |
|---|---|---|---|
| MAC8 | 0x0001 | INT8 MAC, 8-bit accumulator | 3-4 cycles |
| MAC8.ACC | 0x0002 | INT8 MAC, 32-bit accumulator | 3-4 cycles |
| MUL8 | 0x0003 | INT8 multiply | 2-3 cycles |
| CLIP8 | 0x0004 | Saturate to INT8 range | 1 cycle |
| SIMD_DOT | 0x0005 | 4-element SIMD dot product | 3-4 cycles |
| ATT_DOT_SETUP | 0x0008 | Configure attention microkernel | 1 cycle |
| ATT_DOT_RUN | 0x0009 | Stage & execute dot product | Variable |
| ATT_DOT_RUN_SCALE | 0x000A | Run with scaling | Variable |
| ATT_DOT_RUN_CLIP | 0x000B | Run with scaling + clipping | Variable |
All instructions use RISC-V custom-3 opcode (0x7B).
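On a stock GCC/binutils toolchain these instructions can be emitted without assembler patches via the `.insn` directive. A sketch for SIMD_DOT, assuming an R-type encoding; the funct3/funct7 values here are placeholders, so check the decoder RTL for the actual field assignment:

```c
#include <stdint.h>

// Hypothetical wrapper emitting an R-type instruction on the custom-3
// opcode (0x7B). funct3=0 and funct7=0x05 are guesses, not the spec.
static inline int32_t garuda_simd_dot(int32_t acc, uint32_t a, uint32_t b) {
    int32_t rd = acc;  // accumulator in, result out (tied register)
    asm volatile (
        ".insn r 0x7B, 0x0, 0x05, %0, %1, %2"
        : "+r"(rd)
        : "r"(a), "r"(b)
    );
    return rd;
}
```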
- CVXIF Interface: Standard coprocessor protocol (no CPU changes required)
- Attention Microkernel Engine: deterministic internal loop execution, eliminating CPU dispatch overhead
- Multi-Issue Support: Register rename table enables 4-wide instruction issue
- INT8 Quantization: 4× memory reduction vs. FP32, lower power consumption
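The 4× figure comes from storing one byte per value instead of four. A minimal symmetric-quantization sketch showing the FP32-to-INT8 conversion (the scale policy is illustrative, not the project's):

```c
#include <math.h>
#include <stdint.h>

// Symmetric per-tensor INT8 quantization: map FP32 values onto
// [-127, 127] with a single scale, shrinking storage 4x.
static void quantize_int8(const float *x, int8_t *q, int n, float *scale) {
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(x[i]);
        if (a > max_abs) max_abs = a;
    }
    *scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (int i = 0; i < n; i++) {
        float v = roundf(x[i] / *scale);
        q[i] = (int8_t)fminf(fmaxf(v, -127.0f), 127.0f);
    }
}
```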
Key Modules:
- int8_mac_unit.sv: Core MAC execution unit
- attention_microkernel_engine.sv: Latency-optimized attention engine
- int8_mac_decoder.sv: CVXIF instruction decoder
- register_rename_table.sv: Multi-issue rename infrastructure
- Simulator: Icarus Verilog, Verilator, QuestaSim, or VCS
- RISC-V Toolchain: For software development
- Python 3.7+: For Cocotb verification tests
Icarus Verilog:
iverilog -g2012 -o sim_rr.vvp garuda/tb/tb_register_rename_table.sv garuda/rtl/register_rename_table.sv
vvp sim_rr.vvp

Verilator (all testbenches):
bash ci/run_verilator_sims.sh

Cocotb tests:
cd garuda/dv && make

The integration/ directory contains a full system testbench integrating Garuda with the CVA6 RISC-V CPU:
cd integration
make SIM=verilator compile-debug
make SIM=verilator run

Supported Simulators:
- Verilator (recommended)
- QuestaSim
- VCS
- Icarus Verilog
Files:
- system_top.sv: Top-level module wiring CVA6 + Garuda + Memory
- tb_system_top.sv: System-level testbench
- memory_model.sv: AXI memory model
- Makefile.commercial: Multi-simulator build automation
- extract_cva6_files.py: CVA6 RTL file extraction script
Architecture:
┌─────────────────────────────────────────────┐
│ tb_system_top.sv │
│ (Testbench) │
└──────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ system_top.sv │
│ │
│ ┌──────────┐ ┌──────────┐ │
│ │ CVA6 │◄────CVXIF────►│ Garuda │ │
│ │ CPU │ │Coprocessor│ │
│ └────┬─────┘ └──────────┘ │
│ │ │
│ │ NoC (AXI) │
│ │ │
│ ▼ │
│ ┌──────────┐ │
│ │ Memory │ │
│ │ Model │ │
│ └──────────┘ │
└─────────────────────────────────────────────┘
Note: CVA6 is included as a git submodule. Run git submodule update --init --recursive before building.
SIMD Dot Product (C with inline assembly):
static inline int32_t simd_dot(int32_t acc, uint32_t a_packed, uint32_t b_packed) {
int32_t result;
asm volatile (
"simd_dot %0, %1, %2"
: "=r" (result)
: "r" (a_packed), "r" (b_packed), "0" (acc)
);
return result;
}
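A possible call pattern for the intrinsic above, packing four consecutive INT8 elements per word (hypothetical helper; memcpy sidesteps strict-aliasing and alignment concerns):

```c
#include <stdint.h>
#include <string.h>

// Computes a K-element INT8 dot product with one SIMD_DOT per packed
// word pair. Assumes K is a multiple of 4.
static int32_t dot_int8(const int8_t *q, const int8_t *k, int K) {
    int32_t acc = 0;
    for (int i = 0; i < K; i += 4) {
        uint32_t qw, kw;
        memcpy(&qw, &q[i], 4);  // pack 4 INT8 lanes into one word
        memcpy(&kw, &k[i], 4);
        acc = simd_dot(acc, qw, kw);
    }
    return acc;
}
```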
Attention Microkernel:
// Configure engine
att_dot_setup(k_elements, shift, scale); // Q8.8 format
// Stage operands and execute (one instruction per word pair)
int32_t result = 0;
for (int i = 0; i < k_elements / 4; i++) {
    uint32_t q_word = *(uint32_t *)&q[i * 4];
    uint32_t k_word = *(uint32_t *)&k[i * 4];
    result = att_dot_run_scale(q_word, k_word);
}
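For attention, the natural scale factor is 1/sqrt(d_k). How att_dot_setup interprets its scale operand is not documented here; assuming the Q8.8 format means value × 256, the encoding would be:

```c
#include <math.h>
#include <stdint.h>

// Hypothetical helper: encode 1/sqrt(d_k) as a Q8.8 fixed-point scale
// for att_dot_setup. Example: d_k = 128 -> 256/sqrt(128) ≈ 23.
static uint16_t attention_scale_q88(int d_k) {
    return (uint16_t)lrintf(256.0f / sqrtf((float)d_k));
}
```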
Testbenches (5 passing):
- tb_int8_mac_unit.sv: Basic MAC operations
- tb_attention_microkernel_engine.sv: Attention microkernel
- tb_attention_microkernel_latency.sv: Latency measurement (1000 trials)
- tb_register_rename_table.sv: Multi-issue rename logic
- tb_attention_microkernel_cvxif.sv: CVXIF integration test
CI: Automated Icarus Verilog and Verilator tests on every push/PR. View CI results
garuda-accelerator/
├── garuda/
│ ├── rtl/ # RTL source files
│ ├── tb/ # Testbenches
│ ├── dv/ # Cocotb verification
│ └── synth/ # Synthesis scripts
├── integration/ # CVA6 integration testbench
│ ├── system_top.sv # Top-level system (CVA6 + Garuda + Memory)
│ ├── tb_system_top.sv # System testbench
│ ├── memory_model.sv # AXI memory model
│ ├── Makefile.commercial # Multi-simulator build system
│ └── extract_cva6_files.py # CVA6 file extraction
├── ci/ # CI helper scripts
└── .github/workflows/ # CI workflows
- Interface: CVXIF (Core-V eXtension Interface)
- Data Width: 32-bit (XLEN=32)
- MAC Latency: 3-4 cycles
- SIMD_DOT Latency: 3-4 cycles (4 INT8 MACs)
- Attention Dot Latency: K/4 + post-op cycles, deterministic (worked check below)
- Max Dot Product Length: 256 INT8 elements (64 words)
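As a sanity check on this latency model: the K=128 benchmark stages 128/4 = 32 words, so the measured 34-cycle latency implies roughly 2 post-op cycles. That split is inferred from the published numbers, not stated in the spec.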
- Transformer Attention: Q·K^T dot products for attention scores (7.5-9× latency reduction)
- Real-Time Voice Assistants: Low-latency inference with deterministic execution
- Local LLM Inference: Batch-1 queries optimized for tail latency
- Edge AI: Low power, predictable performance for embedded systems
Why Batch-1? Edge devices have limited memory, power constraints, and real-time requirements. Batch-1 processing enables an immediate response without waiting for a batch to fill, matching event-driven embedded workloads.
Completed:
- INT8 MAC unit with all basic operations
- SIMD_DOT instruction (4× speedup)
- Attention microkernel engine
- CVXIF interface integration
- CVA6 CPU connection
- 5 passing testbenches
Future Work:
- FPGA/ASIC implementation and benchmarking
- Power consumption analysis
- Extended instruction set
Contributions are welcome! See CONTRIBUTING.md for guidelines.
- Garuda RTL: Apache License 2.0
- CVA6: Solderpad Hardware License v0.51
- Documentation: Creative Commons BY 4.0
If you find Garuda useful or interesting, please consider giving it a ⭐ star on GitHub! It helps others discover the project and shows your support.
Made with ❤️