Internal representation
The hls4ml library will parse models from Keras, PyTorch or ONNX into an internal execution graph. This model graph is represented by the ModelGraph class. The nodes in this graph, loosely corresponding to the layers and operations of the input model, are represented by classes derived from the Layer base class.
Layers are required to define their inputs and outputs, which determine how they are connected in the graph and the shape of their output. All information about a layer's state and configuration is stored in its attributes. All weights, variables and data types are attributes, and mapping views are provided to sort through them. Layers can declare their expected attributes, which can be used to verify a layer's correctness or to produce a list of configurable attributes that the user can tweak. The complete list of attributes can be found on the Attributes page.
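A minimal sketch of how this graph can be inspected from Python is shown below; the tiny Keras model is only an illustration, and the exact set of attribute names printed will vary by layer type and hls4ml version.

```python
import hls4ml
from tensorflow import keras

# A toy Keras model used only for illustration.
keras_model = keras.Sequential([keras.Input(shape=(16,)), keras.layers.Dense(8, activation='relu')])

config = hls4ml.utils.config_from_keras_model(keras_model, granularity='name')
model_graph = hls4ml.converters.convert_from_keras_model(keras_model, hls_config=config)

# Walk the nodes of the internal graph; each node is an instance of a Layer subclass.
for layer in model_graph.get_layers():
    print(layer.name, layer.class_name)
    print('  attributes:', list(layer.attributes.keys()))
```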
Layers
The backends of hls4ml are independent from each other and free to implement features in any suitable way. Nevertheless, most implementations share common concepts, which we describe here.
Dense Layers
One-dimensional Dense Layers
Dense layers over one-dimensional data perform a matrix-vector multiplication followed by elementwise addition of the bias tensor. This routine underlies many other layers as well and is reused as much as possible. It exists in several implementations across different backends, for different io_type settings and strategies.
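As a reference point, the computation itself is simply the following (a NumPy sketch, not the generated HLS code):

```python
import numpy as np

W = np.random.rand(8, 16)   # weight matrix (n_out x n_in)
x = np.random.rand(16)      # input vector
b = np.random.rand(8)       # bias vector
y = W @ x + b               # matrix-vector product plus elementwise bias addition
```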
io_parallel
All the backends have a Resource implementation, which divides the computation into a loop of reuse_factor iterations, with each iteration accessing a different part of the weight array partitioned in BRAM. There are different implementations depending on whether the reuse factor is smaller or bigger than the input size. The two Xilinx backends and Catapult also provide a Latency implementation, which uses the reuse factor to control the amount of pipelining/unrolling of the whole function while the weight array is fully partitioned into registers.
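The loop structure of the Resource strategy can be conveyed with the following NumPy sketch; it only illustrates the idea of splitting the multiplications into reuse_factor sequential steps and is not the actual HLS code, which additionally distinguishes the cases of the reuse factor being above or below the input size.

```python
import numpy as np

def dense_resource_sketch(x, W, b, reuse_factor):
    """Illustration only: the n_in*n_out multiplications are spread over
    reuse_factor iterations, so roughly n_in*n_out/reuse_factor multipliers
    are active at a time and the partitioned weights are read in slices."""
    n_out, n_in = W.shape
    assert n_in % reuse_factor == 0, 'sketch assumes n_in divisible by reuse_factor'
    block = n_in // reuse_factor
    acc = b.astype(float).copy()
    for r in range(reuse_factor):               # one pass per reuse iteration
        cols = slice(r * block, (r + 1) * block)
        acc += W[:, cols] @ x[cols]             # partial matrix-vector product
    return acc

y = dense_resource_sketch(np.random.rand(16), np.random.rand(8, 16), np.random.rand(8), reuse_factor=4)
```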
io_stream
The io_stream implementation only wraps the io_parallel implementation with streams or pipes for communication. Internally, data is still accessed in parallel as an array.
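The io_type, strategy and reuse factor are all chosen at conversion time; a typical configuration sketch (using a toy model purely for illustration) looks like this:

```python
import hls4ml
from tensorflow import keras

keras_model = keras.Sequential([keras.Input(shape=(16,)), keras.layers.Dense(8, activation='relu')])

config = hls4ml.utils.config_from_keras_model(keras_model, granularity='name')
config['Model']['Strategy'] = 'Resource'   # or 'Latency'
config['Model']['ReuseFactor'] = 4

hls_model = hls4ml.converters.convert_from_keras_model(
    keras_model,
    hls_config=config,
    io_type='io_stream',                   # or 'io_parallel'
    backend='Vitis',
)
```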
Multi-dimensional Dense Layers
Multi-dimensional Dense layers are converted to pointwise convolutions, and do not directly use the above implementation.
Convolution Layers
Standard convolution
By standard convolution we refer to the operation represented by the Conv1D/2D layer in Keras (Conv1d/2d in PyTorch). Depending on the io_type option used, there are two classes of implementations in hls4ml.
io_parallel
Parallel IO is applicable to small models that require a low-latency implementation; larger models quickly run into synthesizability limits.
In the Vivado/Vitis backends, parallel convolution relies on the im2col transformation of the input, which turns convolution into a matrix-multiplication task. This task is then implemented as a sequence of matrix-vector multiplications using the routine mentioned above. The Latency and Resource strategies refer to the function used for the matrix-vector multiplication routine, with Resource allowing slightly larger models to be synthesized. Parallelism can be further controlled via the ParallelizationFactor. The Catapult backend, in turn, uses a direct implementation of convolution via nested loops. The Quartus, oneAPI, and Catapult backends also implement a Winograd algorithm, selectable by setting the implementation to Winograd or combination. The Winograd implementation is available for only a handful of filter-size configurations, and it is less strict about bit accuracy and overflow. Under certain conditions it can be faster.
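Both knobs are set through the layer-level configuration. The sketch below uses a hypothetical layer name conv1, and the exact key for selecting the Winograd kernel is an assumption that may differ between versions and backends.

```python
import hls4ml
from tensorflow import keras

keras_model = keras.Sequential(
    [keras.Input(shape=(8, 8, 1)), keras.layers.Conv2D(4, (3, 3), name='conv1')]
)

config = hls4ml.utils.config_from_keras_model(keras_model, granularity='name')
# Unroll the convolution over more output pixels (Vivado/Vitis io_parallel).
config['LayerName']['conv1']['ParallelizationFactor'] = 4
# On Quartus/oneAPI/Catapult the Winograd kernel is selected via the layer's
# implementation setting; the key name below is an assumption:
# config['LayerName']['conv1']['Implementation'] = 'Winograd'  # or 'combination'
```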
io_stream
There are two main classes of io_stream implementations, LineBuffer and Encoded. LineBuffer is the default and generally produces marginally better results, while Catapult and Vivado also implement Encoded, selectable with the ConvImplementation configuration option. In all cases, the data is processed serially, one pixel at a time, with each pixel represented as an array of all its channel values.
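A short, hedged sketch of selecting the streaming implementation; whether the ConvImplementation key is set at the model level (as here) or per layer may depend on the hls4ml version.

```python
import hls4ml
from tensorflow import keras

keras_model = keras.Sequential(
    [keras.Input(shape=(8, 8, 1)), keras.layers.Conv2D(4, (3, 3), name='conv1')]
)

config = hls4ml.utils.config_from_keras_model(keras_model, granularity='name')
config['Model']['ConvImplementation'] = 'Encoded'   # 'LineBuffer' is the default

hls_model = hls4ml.converters.convert_from_keras_model(
    keras_model, hls_config=config, io_type='io_stream', backend='Vivado'
)
```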
Depthwise convolution
The depthwise implementation substitutes the matrix-vector multiplication in the kernel with an elementwise multiplication. The only available implementation is based on the Latency strategy and is used by both io_parallel and io_stream.
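The distinction from standard convolution can be illustrated with a small NumPy sketch (illustration only, not the hls4ml kernel): each channel is convolved with its own filter, with no summation across channels.

```python
import numpy as np

def depthwise_conv1d_sketch(x, w, b):
    """x: (width, channels), w: (kernel_size, channels), b: (channels,)."""
    kernel_size, channels = w.shape
    out_width = x.shape[0] - kernel_size + 1
    y = np.zeros((out_width, channels))
    for i in range(out_width):
        # Elementwise multiply-accumulate per channel; no cross-channel sum.
        y[i] = np.sum(x[i:i + kernel_size] * w, axis=0) + b
    return y

y = depthwise_conv1d_sketch(np.random.rand(10, 3), np.random.rand(3, 3), np.zeros(3))
```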
Pointwise convolution
Pointwise convolutions are a special case of convolution where the filter size is 1 for 1D or 1x1 for 2D.
For the Vivado/Vitis backends, there is a dedicated io_parallel/Latency strategy implementation of 1D pointwise convolutional layers, originally developed for arXiv:2402.01876.
The reuse factor (RF) is used to split the layer execution and reuse the existing module RF times. The RF also limits the number of multipliers in each module.
The initiation interval scales with the RF. One limitation is that it assumes in_width is divisible by the RF.
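A small worked example of the constraint described above (the numbers are purely illustrative):

```python
in_width = 32
reuse_factor = 4        # RF

# The dedicated 1D pointwise implementation assumes this holds:
assert in_width % reuse_factor == 0

# A single module is reused RF times, each invocation covering
# in_width // RF positions; the initiation interval grows roughly
# in proportion to RF.
positions_per_invocation = in_width // reuse_factor
print(positions_per_invocation)   # 8
```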
Activations
Most activations without extra parameters are represented with the Activation layer, and those with a single parameter (leaky ReLU, thresholded ReLU, ELU) with ParametrizedActivation. PReLU has its own class because it has a parameter matrix (stored as a weight). The hard (piecewise linear) sigmoid and tanh functions are implemented in a HardActivation layer, and Softmax has its own layer class.
Backends have four softmax implementations that the user can choose from by setting the implementation parameter:
latency: Good latency, but somewhat high resource usage. It does not work well if there are many output classes.
stable: Slower but with better accuracy, useful in scenarios where higher accuracy is needed.
legacy: An older implementation with poor accuracy, but good performance. Usually the latency implementation is preferred.
argmax: If you don’t care about normalized outputs and only care about which one has the highest value, using argmax saves a lot of resources. This sets the highest value to 1, the others to 0.
The Vivado/Vitis backends additionally support skipping the softmax activation entirely and returning the raw outputs.
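A hedged sketch of how these options are typically selected; the layer name softmax_out is hypothetical, and the exact config keys (in particular for skipping the activation) are assumptions that may differ between hls4ml versions.

```python
import hls4ml
from tensorflow import keras

keras_model = keras.Sequential([
    keras.Input(shape=(16,)),
    keras.layers.Dense(10),
    keras.layers.Activation('softmax', name='softmax_out'),
])

config = hls4ml.utils.config_from_keras_model(keras_model, granularity='name')
# Pick one of 'latency', 'stable', 'legacy', 'argmax' (key name assumed):
config['LayerName']['softmax_out']['Implementation'] = 'stable'
# On Vivado/Vitis the softmax can also be skipped entirely (key name assumed):
# config['LayerName']['softmax_out']['Skip'] = True
```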