Extending RISC-V with SCAIE-V: Portable, Efficient, and Scalable Custom Instruction Integration
By Tammo Mürmann, Technical University of Darmstadt
Download the handout here for in-depth information and project links.
Systems that require low power consumption or high performance often employ in-hardware accelerators to meet these demands. While complex algorithms can be efficiently offloaded to memory-mapped accelerators, small computational kernels are typically unsuitable due to overhead and latency. For these cases, custom instructions provide a more efficient acceleration path. However, custom instructions introduce new challenges, including the need for tight integration into the processor pipeline, which can reduce portability.
Although tools for Instruction Set Architecture Extension (ISAX) integration exist, many are commercial, limited to specific core families (e.g., RoCC), support only a subset of instruction types (e.g., no memory access in CV-X-IF), or are still work-in-progress (e.g. RISC-V CX).
SCAIE-V overcomes these limitations by providing a flexible and efficient interface that integrates directly into the processor pipeline. Its logic generation framework supports portability across cores, simultaneous integration of multiple ISAXes, arbitrary instruction encodings, and stateful instruction behavior.
To support a wide range of use cases, SCAIE-V offers three instruction execution modes:
- Tightly-coupled instructions execute in lockstep with the processor pipeline.
- Semi-coupled instructions also execute in lockstep but may span multiple internal stages per pipeline stage, enabling more complex computations.
- Decoupled instructions execute independently and may write back at any time, with hazards automatically managed by SCAIE-V.
In addition, SCAIE-V supports instruction-independent behavior, enabling operations that are decoupled from specific instruction execution, like zero-overhead loops.
SCAIE-V is compatible with several well-known RISC-V cores of varying complexity, such as PicoRV32, VexRiscv, CVA5, and CVA6. Implementation in a modern 22nm ASIC process demonstrates minimal overhead and significant performance gains.

