A Case Study in Rule-Based Modeling
Mieszko Lis - Bluespec, Inc. - Waltham, MA USA
Abstract—We present a case study in employing rule-based high-level synthesis to implement a parameterizable general purpose processor. We contrast a generic implementation in Bluespec SystemVerilog to reference implementations in VHDL and SystemVerilog, and discuss the impact on the development process, and resulting code and hardware quality. We also show how this methodology positively affects the verification process.
I. INTRODUCTION
As commercially produced digital circuits grow in complexity, the cost of new development necessitates increased reuse. In ASIC development, reuse typically takes one of two forms: use of purchased IP, or adaptation of a component developed for a previous project.
Reused in-house IP rarely fits the desired purpose precisely. More frequently, the component was designed for a specific architecture, and must be altered to fit new requirements. The clock frequency often increases, and the design must be re-pipelined and re-optimized; changing interface details such as bus widths, number of storage elements, etc., add to the complexity of the task and heighten the resulting risk.
The adapted design must then be verified. Most often, the testbenches used to verify the original were tied to its microarchitectural details, so an adapted design requires corresponding changes in the testbench. This, of course, makes the redesign even more costly and the risk even more severe.
Traditionally, reusability has meant employing polymorphism or macro substitution to parametrize the design along such axes as bus widths or queue lengths. The most useful kinds of parametrization, however, are seldom obvious when the block is first designed, and an effective component will inevitably be employed outside of its original design considerations. To make matters worse, making a design generic along all conceivable axes takes additional time, and schedule pressures often consign parametrization to a “cleanup” stage after the design has taped out.
A reusable-design methodology must therefore acknowledge that adaptation is inevitable, and focus on making changes easy and safe.
II. PAPER ORGANISATION
In the remainder of the paper, we examine the design of a simple Princeton-architecture processor using rule-based synthesis and compare the results to reference RTL designs. We start, in Section III, by briefly reviewing existing approaches to reusable design and high-level synthesis, and, in Section IV, sketch the architecture of the reference model employed in our study. In Sections V and VI, we offer an outline of rule-based design and tool features pertinent to our study, and examine how the flow differs from traditional RTL synthesis. We present our code comparison and synthesis results in Section VII.
III. RELATED WORK
Strategies for reusable design have focused on making models as generic as possible [8][11]; generator and preprocessor techniques are in common industry use. A higher level of description in existing languages [2][3][9] reduces coding effort, but does not dramatically enhance robustness to change. Traditional high-level synthesis [14] has been employed to similar effect.
Research in rule-based synthesis [6][7] has shown suitability for architectural exploration [1]; the technology has been successfully employed in a variety of highly parametrized designs [5][15].
High-level domain-specific languages concentrating on reusability and modularity have been used to model microprocessors [12][13], but have not concentrated on synthesis.
IV. THE MAX ARCHITECTURE
MAX is a simple general-purpose processor implemented as a Princeton architecture (shared instruction and data memory).
The width of data words stored in registers and in memory is parameterizable, subject to a minimum of twelve bits required to store an instruction (six opcode bits and three two-bit register addresses).
ALU, memory, and I/O instructions access four general purpose registers. Loads and stores use direct, indirect, and doubly-indirect addressing modes. Direct jumps may be unconditional or contingent on one of the overflow, carry, negative, and zero flags. Memory reads are asynchronous and writes are synchronous; external I/O shares the memory data bus.
Each instruction takes one or two words in memory. ALU instructions are encoded in one word, while memory, I/O and jump instructions are followed by an argument indicating the target address.
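The twelve-bit minimum follows directly from the field widths given above. As a minimal sketch in Python (not part of the original design), the following packs and unpacks an instruction word. The field order (opcode in the upper bits, then the two sources, then the destination) and the names `encode` and `decode` are assumptions made for illustration; the paper does not specify the bit layout.

```python
OPCODE_BITS = 6   # six opcode bits, per the text
REG_BITS = 2      # two-bit register address (four general-purpose registers)
INSTR_BITS = OPCODE_BITS + 3 * REG_BITS  # minimum data-word width: 12


def encode(opcode, src1, src2, dest, width=INSTR_BITS):
    """Pack the four fields into the low 12 bits of a `width`-bit word."""
    if width < INSTR_BITS:
        raise ValueError("data word too narrow to hold an instruction")
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode out of range")
    for reg in (src1, src2, dest):
        if not 0 <= reg < (1 << REG_BITS):
            raise ValueError("register address out of range")
    # Assumed layout: [opcode | src1 | src2 | dest], dest in the lowest bits.
    word = opcode
    word = (word << REG_BITS) | src1
    word = (word << REG_BITS) | src2
    word = (word << REG_BITS) | dest
    return word


def decode(word):
    """Unpack a word produced by encode() into (opcode, src1, src2, dest)."""
    reg_mask = (1 << REG_BITS) - 1
    dest = word & reg_mask
    src2 = (word >> REG_BITS) & reg_mask
    src1 = (word >> (2 * REG_BITS)) & reg_mask
    opcode = (word >> (3 * REG_BITS)) & ((1 << OPCODE_BITS) - 1)
    return opcode, src1, src2, dest
```

A wider data word simply leaves the upper bits of an instruction word unused in this sketch.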
Figure 1. State diagram of MAX
Execution takes place sequentially, and instructions take several clock cycles, depending on instruction type. On account of shared instruction and data memory, instruction fetch is suspended while instruction arguments are read and while loads and stores are being executed; the next instruction is fetched, however, while a preceding ALU instruction executes. The corresponding state diagram is depicted in Figure 1.
A previously developed VHDL reference implementation of MAX was available during this study, as was a SystemVerilog version.
V. RULE-BASED DESIGN
Rule-based synthesis, as embodied in Bluespec SystemVerilog (BSV) [4], is based on the TRS synthesis technology developed by Hoe and Arvind [6][7]. It has been employed for architectural exploration [1] and for complex, parameterizable designs [5]. An overview of BSV features relevant to our study follows.
A. State
In BSV, state is never inferred, and must be instantiated explicitly. The program counter, for example, is a register containing an Address, and a zero on reset:
Reg#(Address) pc <- mkReg(0);
(The left arrow is syntactic sugar for module instantiation.) More complex state elements and purely combinational modules like the ALU are similarly instantiated:
ALU myALU <- mkALU();
B. Rules
In a rule-based approach, the designer focuses on operations, rather than on data path and control logic separately. In our processor, these operations correspond to instruction fetch and execution stages of each class of instructions. Instruction fetch, for example, is a single rule, which can execute unless some other instruction has disabled it (via register block_if):
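rule fetch (!block_if);            // rule name assumed; guard taken from the description above
   Address tmp_pc = pc;            // fetch address defaults to the pc
   case (i_reg.opcode)
      JC: if (carry)
             tmp_pc = addr;
      // ... other conditional jumps ...
   endcase
   i_reg <= dat2instr(mem.sub(tmp_pc));
   pc <= tmp_pc + 1;
   block_if <= True;
endrule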
During the execution of this rule, the fetch address is determined (either the pc or a previously computed jump target), the instruction register i_reg receives the instruction decoded from the relevant memory location, the next pc is computed, and the instruction fetch is disabled until execution completes. The variable tmp_pc is temporary, and does not imply storage.
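The semantics of such a rule, a guard plus a set of state updates that take effect together, can be mimicked in a few lines of Python. This is a behavioral sketch only, not BSV and not the authors' code: the state dictionary, the function name `fetch_rule`, and the use of plain opcode strings in place of decoded instructions are all assumptions made for illustration.

```python
def fetch_rule(state, mem):
    """Model of the fetch rule.

    Returns the next state, or None when the guard is false and the rule
    cannot fire. The input state is never modified, so every update is
    computed from the values at the start of the cycle.
    """
    if state["block_if"]:                      # guard: fetch is disabled
        return None

    tmp_pc = state["pc"]                       # temporary value, not stored state
    if state["i_reg"] == "JC" and state["carry"]:
        tmp_pc = state["addr"]                 # previously computed jump target
    # ... other conditional jumps would be handled here ...

    nxt = dict(state)
    nxt["i_reg"] = mem[tmp_pc]                 # instruction decoding omitted
    nxt["pc"] = tmp_pc + 1
    nxt["block_if"] = True                     # block fetch until execution completes
    return nxt
```

Note that `tmp_pc` exists only inside the function, just as the temporary in the rule does not imply a register.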
C. Types
BSV permits the declaration of new types, and strictly enforces type conversion. The instruction type, for example, stored in i_reg, comprises the opcode and the source and destination registers:
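typedef struct {
   Ops  opcode;
   Src  src1;
   Src  src2;
   Dest dest;
} Instruction deriving (Bits);   // deriving clause assumed; needed to store the type in a register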
The opcode itself is an enumerated type; as we have seen in the fetch rule, the opcodes are symbolic:
typedef enum { NOP, JMP, JC, JZ, … } Ops;
Types must match when any value is used; for example, assignment to an Instruction from a bit vector, even of the same bit length, would produce an error, and must be explicitly typecast. Bit vectors of different length are different types, and bit vectors which do not match the length expected by an operator or assignment must be explicitly extended or truncated.
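The effect of treating the width as part of the type can be illustrated with a small Python class. This is a sketch under stated assumptions, not BSV: the class `Bits` and its methods `zero_extend` and `truncate` are hypothetical names chosen to loosely mirror the explicit extension and truncation described above, and the check happens at run time here, whereas BSV rejects the mismatch at compile time.

```python
class Bits:
    """Fixed-width unsigned bit vector; the width is part of the type."""

    def __init__(self, width, value):
        if width <= 0:
            raise ValueError("width must be positive")
        self.width = width
        self.value = value & ((1 << width) - 1)   # wrap to the declared width

    def __add__(self, other):
        # Mixed widths are rejected instead of being silently resized.
        if not isinstance(other, Bits) or other.width != self.width:
            raise TypeError("operand widths differ; extend or truncate explicitly")
        return Bits(self.width, self.value + other.value)

    def zero_extend(self, width):
        if width < self.width:
            raise TypeError("zero_extend cannot narrow a value")
        return Bits(width, self.value)

    def truncate(self, width):
        if width > self.width:
            raise TypeError("truncate cannot widen a value")
        return Bits(width, self.value)
```

Adding a 12-bit and an 8-bit value therefore fails until the narrower operand is extended, which makes every width change visible in the source.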
D. Organization
Rule-based design differs significantly from the design approach in our reference implementations (VHDL/SV). The BSV design was organized around operations performed by the processor: instruction fetch, jump execute, etc. Rules specified both the data path and the logic which controls it.
Our VHDL/SV model, in contrast, was organized around state modules: the pc, the register file, and the memory were all separate modules with data input and output buses and control inputs; a master FSM module contained all control logic. A top-level module instantiated and connected the state modules and the FSM module.
E. Design Refinement
A rule-based approach lends itself to a design style where refinements are made to an existing design. Our processor model was implemented incrementally, by adding instructions one at a time, as follows:
- add the instruction opcode to the Ops enum type;
- add new rules or change existing rules to support the new instruction (several lines of new code);
- possibly instantiate a new interlock register (one line of new code).
The refined source was then compiled and debugged until the testsuite passed.
In our reference VHDL and SV descriptions, on the other hand, adding a new instruction required more complex steps:
- add the instruction opcode to the operation type;
- add instruction to the decoding unit (an additional element in the look-up table);
- add ports, depending on the type of the instruction (this leads to a change in all hierarchy levels);
- modify the control unit, and instantiate additional state;
- potentially add a port for another control signal;
- in the worst case, implement heavy datapath changes.
As above, the design was then recompiled, simulated, and debugged, and potential errors were corrected. Verification effort was higher, however, because each change in the VHDL/SV flow touched a larger part of the design.
VI. TOOL FLOW
The main difference from a traditional VHDL or SV flow is the additional step of compiling the rule-based description into synthesizable RTL. The design can then be simulated or synthesized using traditional tools. While traditional debugging tools are used, the debugging process differs somewhat, as we discuss below.
A. Synthesis into RTL
BSV source code is compiled by the Bluespec Compiler (BSC), which produces synthesizable Verilog RTL. Compilation reports errors and warnings in the source. Type errors and some out-of-bounds errors abort compilation, and are comparable to those reported by a linter tool.
The main difference between a classical HDL design and the Bluespec approach is the focus of each description. A classical HDL design is described in a register-centric way. The functionality is split into states, which are implemented with clocked processes, and transition logic, which is described with combinational processes. Modeling concurrent behavior with shared resources requires a great deal of multiplexing to specify priorities among the accesses, which is error-prone and hard to trace. Specification changes that affect these areas are especially likely to introduce errors when the design is adapted.
The Bluespec compiler has the strongest impact in that context. Its rule scheduler formally analyzes all rules to determine subsets which can fire simultaneously and allocates resources used by these rules. Shared resources with insufficient ports are arbitrated, guided by user-specified directives. In the absence of user guidance, the scheduler chooses a prioritization, issues a warning, and reports the conflicting rules.
Classical HDLs allow prioritizations to be specified implicitly, using an if-like structure, for example when a resource is never accessed simultaneously. But concurrency assumptions, even when explicit (e.g., using “unique” in SV), must be verified in simulation and are not robust to design modifications.
The following pseudocode, showing an implicit prioritization of write-accesses to the registers of the evaluation model, exemplifies this issue:
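A minimal Python model of such an implicitly prioritized write multiplexer might read as follows. It is an illustration with hypothetical signal names (`exec_req`, `fetch_req`, and the corresponding data arguments), not code taken from the reference model.

```python
def regfile_write_mux(exec_req, exec_data, fetch_req, fetch_data):
    """Select the register-file write for the current cycle.

    The if/elif chain silently gives the EXEC FSM priority: when both
    requests are active, the FETCH FSM's write is dropped without notice.
    """
    if exec_req:
        return exec_data
    elif fetch_req:
        return fetch_data
    return None  # no write this cycle
```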
The code represents a multiplexer which switches between write accesses from the EXEC FSM and the FETCH FSM. If both state machines issue a write request simultaneously, only the EXEC FSM will get access. If the designer has overlooked the possibility of two simultaneous accesses (one from each state machine), the error can only be found by simulating a test case which triggers exactly this behavior. The Bluespec scheduler, on the other hand, discovers these kinds of errors at compile time: consider, for example, the rules implementing the execution of a LoadConstant (LDC) and an ADD(C)/SUB(C) operation:
rule Add(                              // rule name assumed
      is_arith(i_reg_alu.opcode) &&
      fetch_state==PF);
   rf.upd(dest,myAlu.compute(…));
   …
endrule
rule Ldc(fetch_state==PF &&
         i_reg_opcode==LDC);
   rf.upd(dest,mem.sub(…));
   …
endrule
Whenever an LDC is executed after an ADD(C)/SUB(C), a simultaneous access to the register file occurs. The compiler will prioritize ADD over LDC and issue a warning that ADD and LDC are conflicting, necessitating a stall (depicted in Figure 1 as the loop-back transition in the PF phase). The stall can be implemented by adding a further rule for the LDC operation and changing the enabling conditions as follows:
exec_state == EX2);
rule Ldc_no_stall(fetch_state==PF &&
exec_state != EX2);
rf.upd(dest,mem.sub(…));
The example shows that in this case the scheduler acts as if verifying a compile-time assertion, which a designer in a traditional flow would add manually.
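The kind of check involved can be sketched in Python: two rules conflict if they write the same resource and their guards can hold at the same time. The sketch below decides the second condition by brute-force enumeration over a small, assumed state space. It is an illustration only and is not a description of how the Bluespec compiler works internally; the rule table, the state variables (`exec_arith` stands in for the execution-stage condition), and the function name `find_conflicts` are all hypothetical.

```python
from itertools import combinations, product

# Hypothetical rule table: name -> (guard over a state dict, resources written).
RULES = {
    "Add": (lambda s: s["exec_arith"] and s["fetch_state"] == "PF", {"rf"}),
    "Ldc": (lambda s: s["fetch_state"] == "PF" and s["opcode"] == "LDC", {"rf"}),
    "Jmp": (lambda s: s["fetch_state"] == "PF" and s["opcode"] == "JMP", {"pc"}),
}

# Small, assumed domain for each state variable that appears in a guard.
DOMAINS = {
    "exec_arith": [False, True],
    "fetch_state": ["IF", "PF"],
    "opcode": ["LDC", "JMP", "NOP"],
}


def find_conflicts(rules, domains):
    """Return (rule_a, rule_b, shared_resources) for every conflicting pair."""
    names = list(domains)
    conflicts = []
    for a, b in combinations(sorted(rules), 2):
        guard_a, writes_a = rules[a]
        guard_b, writes_b = rules[b]
        shared = writes_a & writes_b
        if not shared:
            continue  # disjoint resources: the rules may fire together
        for values in product(*(domains[n] for n in names)):
            state = dict(zip(names, values))
            if guard_a(state) and guard_b(state):
                conflicts.append((a, b, sorted(shared)))
                break  # one witness state is enough
    return conflicts


print(find_conflicts(RULES, DOMAINS))  # [('Add', 'Ldc', ['rf'])]
```

Strengthening the guard of `Ldc` so that it excludes the states in which `Add` can fire, in the spirit of the no-stall rule above, makes the reported conflict disappear.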
The advantages of the scheduler compared to any HDL can be summarized as follows:
- Safe modeling of resource-sharing in a concurrent environment
- Checks active throughout all modifications of an initial design
- Rule conflict analysis can statically reveal genuine bottlenecks (e.g., insufficient ports on a register file or memory) that need micro-architectural fixes.
A further major advantage of rule-based design is the ease of code modification. Since all functionality is encapsulated in rules, the behavior of a design can be altered by adding or removing rules. The built-in scheduler guides the user by ensuring that the behavior of the original design does not change more than intended.
Another significant advantage of a rule-based design methodology is that the designer is freed from specifying clock edges or reset behavior. Modifications to the reset behavior in particular can require considerable effort with classical HDLs, since such changes often force parts of the design to be rewritten from scratch.
Due to the higher level of abstraction, the functionality encapsulated in a rule can be mapped more easily to a requirements specification. The higher level of abstraction is achieved by freeing the designer from specifying multiplexers and further control logic.
The higher level of abstraction also lessens the time required for planning a design. In combination with the scheduler, it reduces both design and verification time by lowering the likelihood of introducing a bug in the first place.
B. Simulation and Waveforms
We used ModelSim to simulate the generated Verilog and examine the resulting waveforms.
In addition to the traditionally available signals, corresponding to state and wires in the design, the waveforms show a WILL_FIRE signal for each rule in the simulated design. These signals indicate whether each rule is firing in a given clock cycle. Because the designer knows when each rule is expected to fire, this makes diagnosis relatively easy.
Figure 2 shows a trace of all WILL_FIRE signals for MAX running a short assembler program. The arrows denote the sequence of firing rules.
The waveform at the bottom of Figure 2 shows the WILL_FIRE signal for the instruction fetch rule. The trace shows the sequence of rules fired for the following assembly instructions:
LDC R2 8
ADD R3 R1 R2
STOP
Note that the WILL_FIRE signals show the firing sequence. Bugs which affect the correct instruction or firing sequence can be quickly identified.
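Reading a firing sequence off such traces is mechanical, as the following Python sketch suggests. It assumes the WILL_FIRE signals have been sampled once per clock cycle into lists of 0/1 values; the function name `firing_sequence` and the rule names in the usage example are hypothetical and do not reproduce the trace in Figure 2.

```python
def firing_sequence(traces):
    """Return [(cycle, rule), ...] for every cycle in which a rule's WILL_FIRE is 1.

    `traces` maps a rule name to its per-cycle list of sampled WILL_FIRE values.
    """
    sequence = []
    n_cycles = max((len(bits) for bits in traces.values()), default=0)
    for cycle in range(n_cycles):
        for rule in sorted(traces):        # fixed order for rules firing together
            bits = traces[rule]
            if cycle < len(bits) and bits[cycle]:
                sequence.append((cycle, rule))
    return sequence


# Made-up traces for three rules over four cycles.
example = {
    "fetch": [1, 0, 1, 0],
    "ldc":   [0, 1, 0, 0],
    "add":   [0, 0, 0, 1],
}
print(firing_sequence(example))
# [(0, 'fetch'), (1, 'ldc'), (2, 'fetch'), (3, 'add')]
```

Comparing the extracted sequence with the expected one for a test program gives a quick check of instruction ordering.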
Figure 2. Trace of WILL-FIRE signals
C. Debugging
The novel aspects of the debugging process are perhaps best illustrated by an example. While testing the implementation of this processor, we found that programs containing memory references were executing incorrectly.
By following the WILL_FIRE signals in a waveform viewer we were able to observe the sequence of instructions executed, and discovered that, following each store, the processor was executing an unexpected instruction not found in the original program.
Examining the instruction register confirmed that this instruction was indeed being fetched from memory, and the program counter showed that the correct location was being read. A brief glance at the values on the memory bus revealed that the previously executed store instruction had overwritten its successor with its payload.
Indeed, the source confirmed that the memory was being written to at the location stored in pc:
block_if <= False;
and, sure enough, changing pc to addr (the pre-computed storage location) corrected the error:
block_if <= False;
VII. RESULTS
A. Code Comparison
At the beginning of this case study we had a reference description of the MAX CPU in both VHDL and SystemVerilog. Table I shows the measured lines of code.
TABLE I. COMPARISON OF LINES OF CODE
Structure | VHDL | SV | BSV |
Hierarchical | 1535 | 1149 | 412 |
Dual-process | 475 | 452 | — |
The Bluespec source has fewer lines of code than the hierarchical models. Since the BSV model mixes control and data path, we also developed a dual-process version in VHDL and SystemVerilog. The BSV source, however, should be compared to the hierarchical structure, since this is the most common way of writing complex designs. The difference in code length is mainly due to the overhead of implementing the hierarchy in VHDL and SystemVerilog.
Compared with the dual-process solutions, the table shows that removing the hierarchy alone saves a large number of lines of code; the advantages of the rule-based approach described in this paper therefore cannot be measured by line counts alone. The main advantages that BSV offers over the dual-process solutions are maintainability, extensibility, and readability, which stem from the encapsulation of functionality in rules.
B. Synthesis
We synthesized a 12-bit variant of the processor using the Magma Blast RTL flow with an HP 0.13 µm 8-layer-metal library. For comparison, we also synthesized the VHDL dual-process solution. The synthesis results are summarized in Table II.
TABLE II. SYNTHESIS RESULTS
Period/Frequency | VHDL Area (µm²) | VHDL Gates | BSV-generated Verilog Area (µm²) | BSV-generated Verilog Gates |
1.5ns/666 MHz | 17857 | 3571 | 13269 | 2654 |
1.65ns/606 MHz | 15574 | 3115 | 11318 | 2264 |
2.0ns/500 MHz | 11940 | 2388 | 10339 | 2067 |
3.0ns/333 MHz | 10814 | 2162 | 9550 | 1910 |
4.0ns/250 MHz | 9850 | 1970 | 9470 | 1895 |
At moderate clock speeds, the generated Verilog code results in slightly less area and fewer gates than the VHDL code. As the target frequency rises, this difference grows, because the layout tool has to use stronger and bigger cells to meet the required clock speed.
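The relative difference can be computed directly from the area columns of Table II. The short Python sketch below does only that arithmetic; the function name `area_savings` is ours, and the figures are copied from the table.

```python
# (clock period in ns, VHDL area, BSV-generated Verilog area), from Table II
TABLE_II = [
    (1.5, 17857, 13269),
    (1.65, 15574, 11318),
    (2.0, 11940, 10339),
    (3.0, 10814, 9550),
    (4.0, 9850, 9470),
]


def area_savings(rows):
    """Percentage by which the BSV area undercuts the VHDL area, per row."""
    return [(period, 100.0 * (vhdl - bsv) / vhdl) for period, vhdl, bsv in rows]


for period, pct in area_savings(TABLE_II):
    print(f"{period} ns: {pct:.1f}% smaller")
```

With these figures the saving ranges from roughly 4% at 4.0 ns to more than 25% at the two shortest clock periods.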
C. Impact on the design process
Implementing MAX in BSV took approximately two man-days. This compares very favorably to the approximately four man-days originally required for the VHDL/SV reference model.
While it is impossible to attribute the speedup to specific features, we believe that several factors contributed to this figure. The high level of abstraction allowed by the rule-based design was certainly a source; we found the ability to combine control logic and data path especially convenient. The many race conditions detected by the compiler saved a good deal of verification and debugging time. Debugging itself was likely more efficient because we were able to observe which rules (and thus instructions) were active at a given time; in our reference design, this would have required decoding the FSM states.
We found using rules in development natural and intuitive. The operation-centric nature of rule-based design lends itself well to implementing the design in stages: implement a few instructions, test them, and continue until the design is complete. Doing the same with our VHDL design would have required us to redesign the control FSM at each stage, and would have been impractical.
By the same token, architectural changes, such as adding instructions, are significantly easier in a rule-based approach: we found adding, changing, and removing single operations easier and safer than redesigning a monolithic FSM, and were spared many possible bugs by the compiler’s detection of race conditions.
VIII. CONCLUSIONS
Our case study of rule-based synthesis reveals a good method for safe parametrized design and architectural exploration. While we worked on one example only, our experiences make a convincing case for BSV as a vehicle for reducing the error probability during design.
While permitting adequate parameterization, BSV allows the designer to focus on the functionality, not the detailed implementation, of the design. The ease and safety of modification, along with advanced features like race-condition detection, not only supports architectural exploration but also encourages a more evolutionary approach to development. For similar reasons, rule-based methodology assists in debugging, as the designer can focus on the operations rather than the details of state machines and other implementation logic. We also saw that the rule-based method produces adequate synthesis results and works well with control-heavy designs like MAX.
Our study did not exercise every feature of BSV: we did not take advantage of the plug-and-play interface semantics, for example, and did not have the opportunity to employ much structural abstraction. We expect further research to focus on a fully pipelined design with a rich hierarchy of complex components and intricate interactions. We also plan further research on the applicability of BSV to system-level descriptions, including a thorough comparison with the well-accepted system-level description language SystemC.
REFERENCES
[1] Arvind, R. S. Nikhil, D. L. Rosenband, and N. Dave. High-level Synthesis: An Essential Ingredient for Designing Complex ASICs. In Proceedings of ICCAD’04, San Diego, CA, 2004.
[2] M. Hohenauer et al. Compiler-in-loop Architecture Exploration for Efficient Application Specific Embedded Processor Design. In Design & Elektronik, February 2004, Germany.
[3] A. Schliebusch et al. RTL Processor Synthesis for Architecture Exploration and Implementation. DATE’04.
[4] Bluespec, Inc., Waltham, MA. Bluespec SystemVerilog Version 3.8 Reference Guide. November 2004.
[5] N. Dave. Designing a Reorder Buffer in Bluespec. In Proceedings of MEMOCODE’04, San Diego, CA, 2004.
[6] J. C. Hoe and Arvind. Synthesis of Operation-Centric Hardware Descriptions. In Proceedings of ICCAD’00, pages 511–518, San Jose, CA, 2000.
[7] J. C. Hoe and Arvind. Operation-Centric Hardware Description and Synthesis. IEEE TRANSACTIONS on Computer-Aided Design of Integrated Circuits and Systems, 23(9), September 2004.
[8] V. Preis. An Approach to Complex and Self-Generating VHDL Models for Simulation and Synthesis. In Proceedings of the Spring’94 Meeting of the VHDL-Forum for CAD in Europe, pp. 39ff, Tremezzo, April 1994.
[9] H. King et al. Behavioral Synthesis and Component Reuse With VHDL. Kluwer Academic Publishers, November 1996.
[10] S. Regimbal, J. F. Lemire, G. Bois, E. M. Aboulhamid, and A. Baron. Aspect Partitioning for Hardware Verification Reuse. In Proceedings of HDLCON 2002, San Jose, USA, March 2002.
[11] E. Girczyc and S. Carlson. Increasing Design Quality and Engineering Productivity through Design Reuse. In Proceedings of the 30th Design Automation Conference, Dallas, USA, 1993.
[12] B. Cook, J. Launchbury, and J. Matthews. Specifying Superscalar Microprocessors in Hawk. In Formal Techniques for Hardware and Hardware-like Systems, 1998.
[13] D. Burger and T. Austin. The SimpleScalar Tool Set, Version 2.0. Technical Report 1342, Computer Sciences Department, University of Wisconsin, 1997.
[14] D. Gajski et al. High-Level Synthesis: Introduction to Chip and System Design.
[15] N. Dave, M. C. Ng, and Arvind. Automatic Synthesis of Cache-Coherence Protocol Processors Using Bluespec. In Proceedings of MEMOCODE’05, Verona, Italy, 2005.