



# Quantifying the Efficiency of High-Level Synthesis for Machine Learning Inference

Caroline Johnson, Scott Hauck, Shih-Chieh Hsu, Waiz Khan, Matthew Bavier, Oleh Kondratyuk, Trinh Nguyen, Stephany Ayala-Cerna, Anatoliy Martynyuk, Aidan Short, Jan Silva, and Geoff Jones

### **Our Process**

- Goal: quantify the losses and gains from using the HLS4ML platform
- Compare resources and performance



### **Our Analysis**

- Results are reported in terms of maximum resource usage
- HLS results are dashed, SV results are solid



Each comparison is summarized for SV relative to HLS in the form: Resources: Ax more, Latency: Bx worse, Clock Period: Cx better

### **Benchmark 1**

#### **One-Layer Model**





## **Initial Approach**

- Heavy pipelining
- Constant folding, II = 1
- Neural-network specific DSP Optimizations

### **Overall goal:**

Match HLS4ML's accuracy with better performance and resource usage



## **Multiplier Packing into DSPs**

### Virtex 7 supports 25x18 bit multiplication

- Bitwidths <=8 can be combined via the DSP pre-adder
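The packing idea can be checked with a small software model. The sketch below is only an illustration of the arithmetic: it assumes two signed 8-bit inputs that share one signed 8-bit weight and a packing offset of 16 bits, which keeps the packed operand within the 25-bit port. The function name `packed_multiply` and the `shift` value are assumptions, not details taken from the slides.

```python
def packed_multiply(a, b, w, shift=16):
    """Model computing a*w and b*w with one wide multiplication.

    a, b, w are signed 8-bit integers; `shift` is an assumed packing offset.
    """
    for value in (a, b, w):
        if not -128 <= value <= 127:
            raise ValueError("operands must fit in signed 8 bits")

    # Pre-adder step: place `a` in the upper field and add `b` below it.
    packed = (a << shift) + b
    assert -(1 << 24) <= packed < (1 << 24)  # fits a signed 25-bit port

    # One multiplication produces both products in separate bit fields.
    product = packed * w

    # Low field: sign-extend the lowest `shift` bits to recover b*w.
    low = product & ((1 << shift) - 1)
    if low >= 1 << (shift - 1):
        low -= 1 << shift

    # High field: subtract the low product, then shift down to recover a*w.
    high = (product - low) >> shift
    return high, low
```

For example, `packed_multiply(-7, 5, 3)` returns the pair `(a*w, b*w)` for those inputs. The subtraction before the final shift is what corrects the borrow that a negative low product would otherwise leave in the upper field.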



# **Multiplier (DSP) Packing**

Dashed: SV with packing. Solid: SV without packing.



- Apart from latency, all metrics are roughly equal
- At bitwidths 7 and 8, packing reduces the LUT cost to 0.68x and 0.70x
- The tradeoff in DSP usage is not beneficial in our DSP-limited design

### **One Layer - Initial Results**

Dashed: HLS. Solid: SV.



Resource: 1.28x more, Latency: 1.7x worse, Period: 1.46x better

HLS4ML is outperforming on almost all metrics.



Hint from HLS4ML: DSP usage decreases as bitwidth goes down.

*output = input\*6* could instead be *output = (input<<2)+(input<<1)*

Shift-add, resource usage for the same weight:

| Implementation | WEIGHT    | LUTs | FFs | DSPs |
|----------------|-----------|------|-----|------|
| HLS            | 20'h01010 | 18   | 0   | 0    |
| SystemVerilog  | 20'h01010 | 0    | 0   | 7    |
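The rewrite of a constant multiplication into shifts and adds can be sketched in software. The helper names `shift_amounts` and `shift_add_multiply` are hypothetical, and the sketch handles only non-negative weights by reading off the set bits of the constant.

```python
def shift_amounts(weight):
    """Return the bit positions that are set in a non-negative constant weight."""
    if weight < 0:
        raise ValueError("this sketch handles non-negative weights only")
    return [i for i in range(weight.bit_length()) if (weight >> i) & 1]


def shift_add_multiply(value, weight):
    """Multiply `value` by `weight` using only shifts and additions."""
    return sum(value << s for s in shift_amounts(weight))
```

With this model, a weight of 6 decomposes into shifts by 1 and 2, and the weight `20'h01010` from the table decomposes into shifts by 4 and 12, so it needs only two shifted terms and one addition.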

### **Shift-Add Capabilities**

- **Vivado HLS:** ±(input<<c1) ± (input<<c2) for any c1 or c2
- **Vivado:** ±(input<<c1) ± (input<<c2) where c1 and c2 must be less than 3, or ±(input<<c1) for any c1
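To see which constant weights fall under each rule, the two patterns can be enumerated by brute force. This is a rough sketch: the function names, the 20-bit default width, and the strict reading of "less than 3" as c1, c2 in {0, 1, 2} are assumptions made for illustration, and the result says nothing about what either tool does beyond the rules stated above.

```python
from itertools import product


def single_shift(weight, width):
    """True if weight == +/-(1 << c1) for some 0 <= c1 < width."""
    return any(abs(weight) == 1 << c for c in range(width))


def two_shifts(weight, shift_limit):
    """True if weight == +/-(1 << c1) +/- (1 << c2) with c1, c2 < shift_limit."""
    return any(
        s1 * (1 << c1) + s2 * (1 << c2) == weight
        for s1, s2 in product((1, -1), repeat=2)
        for c1, c2 in product(range(shift_limit), repeat=2)
    )


def maps_to_shift_add(weight, width=20):
    """Apply the two stated rules to one nonzero constant weight."""
    return {
        "vivado_hls": two_shifts(weight, width) or single_shift(weight, width),
        "vivado": two_shifts(weight, 3) or single_shift(weight, width),
    }
```

Under these assumptions the weight `0x01010` satisfies the Vivado HLS pattern but not the Vivado one, which is consistent with the resource difference reported for that weight.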

### **Shift-Add Module**

Implemented a module that allows for DEPTH powers of 2 to be added
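A software model of a DEPTH-limited shift-add might look like the following. The slides do not say how the module picks its powers of 2, so this sketch substitutes the non-adjacent form (a standard signed-digit recoding that minimizes the number of nonzero terms). The names `signed_power_terms` and `shift_add`, and the `None` return used to signal a fallback to a regular multiplier, are all assumptions.

```python
def signed_power_terms(weight):
    """Non-adjacent form: (sign, shift) pairs whose sum of sign * 2**shift is weight."""
    terms = []
    shift = 0
    n = weight
    while n != 0:
        if n & 1:
            digit = 2 - (n & 3)  # +1 when n % 4 == 1, -1 when n % 4 == 3
            terms.append((digit, shift))
            n -= digit
        n >>= 1
        shift += 1
    return terms


def shift_add(value, weight, depth):
    """Multiply using at most `depth` shifted terms, or return None if more are needed."""
    terms = signed_power_terms(weight)
    if len(terms) > depth:
        return None  # in hardware, this case would fall back to a multiplier
    return sum(sign * (value << shift) for sign, shift in terms)
```

For instance, a weight of 7 recodes as 8 - 1 and fits DEPTH = 2, while a weight of 11 needs three terms and only fits once DEPTH is raised to 3.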



### **Updated One Layer Results**

Dashed: HLS. Solid: SV.



Resource: 0.97x less, Latency: 1.17x worse, Period: 1.54x better

With DEPTH = 2, DSP usage is identical for bitwidths < 24. For bitwidths > 24, a multiplication no longer fits into one DSP.

### **DSPs at Bitwidths > 24**

#### HLS4ML "Magic Multiplier" Subroutine

```
/* Wrapper for multiplication module */
module mult_op_wrap (
    clk,
    reset,
    ce,
    din,
    dweight,
    dout
);
parameter din_WIDTH     = 32'd1;
parameter dweight_WIDTH = 32'd1;
parameter dout_WIDTH    = 32'd1;

input clk;
input reset;
input ce;
input  [din_WIDTH-1:0]     din;
input  [dweight_WIDTH-1:0] dweight;
output [dout_WIDTH-1:0]    dout;

// Forward the wrapper ports and widths to the internal multiplier
mult_op #(
    .din_WIDTH     ( din_WIDTH ),
    .dweight_WIDTH ( dweight_WIDTH ),
    .dout_WIDTH    ( dout_WIDTH )
) internal_operation (
    .clk ( clk ),
    .ce  ( ce ),
    .a   ( din ),
    .b   ( dweight ),
    .p   ( dout ));

endmodule

/* Internal Multiplication module */
module mult_op (clk, ce, a, b, p);
parameter din_WIDTH     = 32'd1;
parameter dweight_WIDTH = 32'd1;
parameter dout_WIDTH    = 32'd1;

input clk;
input ce;
input  [din_WIDTH-1 : 0]     a;
input  [dweight_WIDTH-1 : 0] b;
output [dout_WIDTH-1 : 0]    p;

reg  signed [din_WIDTH-1 : 0]     a_reg0;
reg  signed [dweight_WIDTH-1 : 0] b_reg0;
wire signed [dout_WIDTH-1 : 0]    tmp_product;
reg  signed [dout_WIDTH-1 : 0]    buff0;

assign p = buff0;
assign tmp_product = a_reg0 * b_reg0;

// Register the inputs and the product when the clock enable is high
always @ (posedge clk) begin
    if (ce) begin
        a_reg0 <= a;
        b_reg0 <= b;
        buff0 <= tmp_product;
    end
end
endmodule
```



#### HLS4ML "Magic Multiplier" Subroutine

Dashed: HLS. Solid: SV.



Resource: 1.03x more, Latency: 1.11x worse, Period: 1.44x better







Resource: 0.49x less, Latency: 1.12x worse, Period: 1.49x better

- The shift-add DEPTH parameter is tuned per bitwidth, based on which setting gives the best results

### **Major Takeaways from One-Layer Model**

- DSP packing is not beneficial for multiplication-heavy algorithms such as these ML ones
- HLS4ML handles DSPs better than the tools normally allow for
- The HLS4ML multiplier subroutine allows for DSP usage at higher bitwidths



#### **CNN Model**





### **Convolution Streaming Method**



Line-buffer approach: the shift-register elements (red and blue) are shifted by one index. The input window buffer (orange) is updated with the concatenation (green) of the popped pixels, **b** and **c**, and the input **a**.
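The line-buffer scheme can be modeled in software to make the data movement concrete. The sketch below is not the authors' hardware: it assumes a row-major pixel stream, a square k x k window with no padding, and one line buffer per previous row. The name `stream_windows` and its parameters are hypothetical.

```python
from collections import deque


def stream_windows(pixels, width, k=3):
    """Yield (row, col, window) for each valid k x k window of a row-major stream."""
    # k - 1 shift registers, each holding exactly one image row of pixels.
    line_buffers = [deque([0] * width, maxlen=width) for _ in range(k - 1)]
    window = [[0] * k for _ in range(k)]

    for index, a in enumerate(pixels):
        row, col = divmod(index, width)

        # Pop the oldest pixel of each line buffer (b, then c) and push the
        # newer pixel in behind it, so every buffer shifts by one index.
        column = [a]
        carry = a
        for buf in line_buffers:
            popped = buf[0]
            buf.append(carry)  # maxlen drops the oldest entry automatically
            carry = popped
            column.append(popped)

        # The new window column is the concatenation [c, b, a], top to bottom.
        column.reverse()
        for r in range(k):
            window[r] = window[r][1:] + [column[r]]

        # The window is valid once k rows and k columns have been seen.
        if row >= k - 1 and col >= k - 1:
            yield row - (k - 1), col - (k - 1), [list(r) for r in window]
```

Each incoming pixel touches every buffer exactly once, which is why this structure suits a design that accepts one new pixel per cycle.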



Dashed: HLS. Solid: SV.



Resource: 0.82x less, Latency: 1.67x worse, Period: 1.13x worse

#### HLS4ML outperforms in all resources except DFFs



Resource: 1.23x more, Latency: 1.19x better, Period: 1.21x better

### DFFs become the limiting factor. With a shift-add DEPTH of 3, the design now uses 0.58x the DSPs



- HLS4ML is leveraging the power of Vivado HLS in ways that normal optimizations do not
- To achieve the same resource usage, we had to mimic HLS results

Should we ever hand code again? Depends on the application. HLS4ML does these specific models very well, but does it scale?



Using our two models, build larger and more applicable models to see how our results scale.

- **Encoder model:** convolution with stride of 2; reuse of 3 and 9
- **Jet Tagger:** introduces more complex layers (batch normalization)







Values in parentheses are normalized to the lowest value in each column within the same model group (1Layer or CNN).

| Model       | LUTs         | DSPs       | FFs          | Max Usage    | Latency (ns) | Period (ns) |
|-------------|--------------|------------|--------------|--------------|--------------|-------------|
| 1Layer HLS  | 9265 (1.0)   | 241 (4.23) | 7693 (1.0)   | 7.07% (2.57) | 52.6 (1.0)   | 4.19 (1.39) |
| 1Layer Base | 18845 (2.03) | 254 (4.56) | 13540 (1.76) | 8.66% (3.15) | 89.2 (1.70)  | 3.23 (1.09) |
| 1Layer Opt. | 18207 (1.96) | 57 (1.0)   | 20669 (2.69) | 2.75% (1.0)  | 59.0 (1.12)  | 2.95 (1.0)  |
| CNN HLS     | 18901 (1.03) | 288 (3.24) | 12833 (1.82) | 26.4% (1.0)  | 541 (1.62)   | 4.34 (1.21) |
| CNN Base    | 18423 (1.0)  | 411 (4.62) | 7058 (1.0)   | 32.8% (1.24) | 453 (1.36)   | 4.92 (1.37) |
| CNN Opt.    | 23176 (1.26) | 89 (1.0)   | 16615 (2.35) | 33.0% (1.25) | 334 (1.0)    | 3.59 (1.0)  |
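The parenthesized values in the table appear to be each entry divided by the smallest entry in the same column for that model. The following sketch recomputes them for the three CNN rows under that assumption. The dictionary layout and the name `normalize` are illustrative only; the numbers are copied from the table.

```python
CNN_ROWS = {
    "CNN HLS": {"LUTs": 18901, "DSPs": 288, "FFs": 12833,
                "Max Usage (%)": 26.4, "Latency (ns)": 541, "Period (ns)": 4.34},
    "CNN Base": {"LUTs": 18423, "DSPs": 411, "FFs": 7058,
                 "Max Usage (%)": 32.8, "Latency (ns)": 453, "Period (ns)": 4.92},
    "CNN Opt.": {"LUTs": 23176, "DSPs": 89, "FFs": 16615,
                 "Max Usage (%)": 33.0, "Latency (ns)": 334, "Period (ns)": 3.59},
}


def normalize(rows):
    """Divide every entry by the smallest entry in its column, rounded to 2 places."""
    columns = list(next(iter(rows.values())).keys())
    best = {c: min(r[c] for r in rows.values()) for c in columns}
    return {
        name: {c: round(r[c] / best[c], 2) for c in columns}
        for name, r in rows.items()
    }
```

Calling `normalize(CNN_ROWS)` gives one ratio per metric and design, which makes it easy to see which implementation is best on each axis (ratio 1.0) and by how much the others trail.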