Part 2: Advanced Configuration#
from tensorflow.keras.utils import to_categorical
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import plotting
import os
# Make Vivado available on the PATH; XILINX_VIVADO must point at your Vivado installation
os.environ['PATH'] = os.environ['XILINX_VIVADO'] + '/bin:' + os.environ['PATH']
Load the dataset#
X_train_val = np.load('X_train_val.npy')
X_test = np.ascontiguousarray(np.load('X_test.npy'))
y_train_val = np.load('y_train_val.npy')
y_test = np.load('y_test.npy', allow_pickle=True)
classes = np.load('classes.npy', allow_pickle=True)
Load the model#
Load the model trained in ‘part1_getting_started’. Make sure you’ve run through that walkthrough first!
from tensorflow.keras.models import load_model
model = load_model('model_1/KERAS_check_best_model.h5')
y_keras = model.predict(X_test)
Make an hls4ml config & model#
This time, we'll create a config with finer granularity. When we print the config dictionary, you'll notice that an entry is created for each named Layer of the model. See, for example, the entry for the first layer:
fc1:
    Precision:
        weight: ap_fixed<16,6>
        bias: ap_fixed<16,6>
        result: ap_fixed<16,6>
    ReuseFactor: 1
Taken 'out of the box', this config will set all the parameters to the same settings as in part 1, but we can use it as a template to start modifying things.
import hls4ml
config = hls4ml.utils.config_from_keras_model(model, granularity='name')
print("-----------------------------------")
plotting.print_dict(config)
print("-----------------------------------")
Profiling#
As you can see, we can choose the precision of everything in our Neural Network. This is a powerful way to tune the performance, but it's also complicated. The tools in hls4ml.model.profiling can help you choose the right precision for your model. (That said, training your model with quantization built in can get around this problem, and that is introduced in Part 4. So, don't go too far down the rabbit hole of tuning your data types without first trying out quantization-aware training with QKeras.)
The first thing to try is to numerically profile your model. This method plots the distribution of the weights (and biases) as a box and whisker plot. The grey boxes show the values which can be represented with the data types used in the hls_model. Generally, you need the box to overlap completely with the whisker 'to the right' (large values), otherwise you'll get saturation and wrap-around issues. It can be okay for the box not to overlap completely 'to the left' (small values), but finding how small you can go is a matter of trial-and-error.
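To get a feel for what the grey boxes represent, here is a minimal back-of-envelope sketch (not part of the hls4ml profiling API) that compares the range of the fc1 weights in the Keras model with the values a signed ap_fixed<16,6> can represent:
# Rough sketch: range and resolution of a signed ap_fixed<W,I>
# (in the Vivado HLS convention, I counts the integer bits including the sign bit)
def ap_fixed_range(W, I):
    step = 2.0 ** -(W - I)
    return -(2.0 ** (I - 1)), 2.0 ** (I - 1) - step, step

w = model.get_layer('fc1').get_weights()[0]  # fc1 kernel from the Keras model loaded above
lo, hi, step = ap_fixed_range(16, 6)
print(f"fc1 weights span [{w.min():.4f}, {w.max():.4f}]")
print(f"ap_fixed<16,6> covers [{lo:.4f}, {hi:.4f}] with resolution {step:.6f}")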
Providing data, in this case just using the first 1000 examples for speed, will show the same distributions captured at the output of each layer.
%matplotlib inline
for layer in config['LayerName'].keys():
config['LayerName'][layer]['Trace'] = True
hls_model = hls4ml.converters.convert_from_keras_model(
model, hls_config=config, output_dir='model_1/hls4ml_prj_2', part='xcu250-figd2104-2L-e'
)
hls4ml.model.profiling.numerical(model=model, hls_model=hls_model, X=X_test[:1000])
Customize#
Let's just try setting the precision of the first layer weights to something narrower than 16 bits. Using fewer bits can save resources in the FPGA. After inspecting the profiling plot above, let's try 8 bits in total with 2 integer bits, i.e. ap_fixed<8,2> (the integer width includes the sign bit, leaving 6 fractional bits).
Then create a new HLSModel with the updated config and display the profiling again. This time, show just the weight profile by not providing any data X. Finally, display the architecture: notice that the box around the weights of the first layer reflects the different precision.
config['LayerName']['fc1']['Precision']['weight'] = 'ap_fixed<8,2>'
hls_model = hls4ml.converters.convert_from_keras_model(
model, hls_config=config, output_dir='model_1/hls4ml_prj_2', part='xcu250-figd2104-2L-e'
)
hls4ml.model.profiling.numerical(model=model, hls_model=hls_model)
hls4ml.utils.plot_model(hls_model, show_shapes=True, show_precision=True, to_file=None)
Trace#
When we start using customised precision throughout the model, it can be useful to collect the output from each layer to find out when things have gone wrong. We enable this trace collection by setting Trace = True for each layer whose output we want to collect.
for layer in config['LayerName'].keys():
config['LayerName'][layer]['Trace'] = True
hls_model = hls4ml.converters.convert_from_keras_model(
model, hls_config=config, output_dir='model_1/hls4ml_prj_2', part='xcu250-figd2104-2L-e'
)
Compile, trace, predict#
Now we need to check that the model performance is still good after reducing the precision. We compile the hls_model, then use the hls_model.trace method to collect the model output, as well as the output of every layer we enabled tracing for. This returns a dictionary with keys corresponding to the layer names of the model; stored at each key is the array of values output by that layer, sampled from the provided data. A helper function get_ymodel_keras will return the same dictionary for the Keras model. We'll run the trace for just the first 1000 examples, since it takes longer and uses more memory than running predict.
hls_model.compile()
hls4ml_pred, hls4ml_trace = hls_model.trace(X_test[:1000])
keras_trace = hls4ml.model.profiling.get_ymodel_keras(model, X_test[:1000])
y_hls = hls_model.predict(X_test)
Inspect#
Now we can print out, make plots, or do any other more detailed analysis on the output of each layer to make sure we haven’t made the performance worse. And if we have, we can quickly find out where. Let’s just print the output of the first layer, for the first sample, for both the Keras and hls4ml models.
print("Keras layer 'fc1', first sample:")
print(keras_trace['fc1'][0])
print("hls4ml layer 'fc1', first sample:")
print(hls4ml_trace['fc1'][0])
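To look beyond a single layer, here is a minimal sketch (assuming the traced layer names appear in both dictionaries) that scans every traced layer for the largest deviation between the Keras and hls4ml outputs:
# Hypothetical quick check: largest absolute difference per traced layer
for name in keras_trace:
    if name in hls4ml_trace:
        diff = np.max(np.abs(keras_trace[name] - hls4ml_trace[name]))
        print(f"{name}: max |Keras - hls4ml| = {diff:.4f}")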
Compare#
Let's check whether we lost performance by using 8 bits for the first layer's weights, inspecting the accuracy and ROC curve.
print("Keras Accuracy: {}".format(accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_keras, axis=1))))
print("hls4ml Accuracy: {}".format(accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_hls, axis=1))))
fig, ax = plt.subplots(figsize=(9, 9))
_ = plotting.makeRoc(y_test, y_keras, classes)
plt.gca().set_prop_cycle(None) # reset the colors
_ = plotting.makeRoc(y_test, y_hls, classes, linestyle='--')
from matplotlib.lines import Line2D
lines = [Line2D([0], [0], ls='-'), Line2D([0], [0], ls='--')]
from matplotlib.legend import Legend
leg = Legend(ax, lines, labels=['keras', 'hls4ml'], loc='lower right', frameon=False)
ax.add_artist(leg)
Profiling & Trace Summary#
We lost a small amount of accuracy compared to when we used ap_fixed<16,6>, but in many cases this difference will be small enough to be worth the resource saving. You can choose how aggressive to go with quantization, but it's always sensible to make the profiling plots even with the default configuration. Layer-level trace is very useful for finding when you reduced the bitwidth too far, or when the default configuration is no good for your model.
With this 'post-training quantization', around 8 bits of width generally seems to be the limit before suffering significant performance loss. In Part 4, we'll look at using 'quantization-aware training' with QKeras to go much lower without losing much performance.
ReuseFactor#
Now let's look at the other configuration parameter: ReuseFactor. Recall that ReuseFactor is our mechanism for tuning the parallelism: with ReuseFactor = 1 each multiplication in a layer gets its own multiplier (fully parallel), while a ReuseFactor of R reuses each multiplier R times, reducing resource usage at the cost of latency and throughput.
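As a rough sanity check for the exercise later on (layer names and sizes here are assumed from the part 1 jet-tagging model, 16 -> 64 -> 32 -> 32 -> 5), the number of multiplications in each Dense layer divided by the ReuseFactor gives a back-of-envelope DSP estimate:
# Hypothetical back-of-envelope DSP estimate: multiplications per Dense layer / ReuseFactor
layer_shapes = {'fc1': (16, 64), 'fc2': (64, 32), 'fc3': (32, 32), 'output': (32, 5)}  # assumed from part 1
reuse_factor = 2
total = 0
for name, (n_in, n_out) in layer_shapes.items():
    mults = n_in * n_out
    dsps = mults // reuse_factor
    total += dsps
    print(f"{name}: {mults} multiplications -> ~{dsps} DSPs at ReuseFactor={reuse_factor}")
print(f"Total: ~{total} DSPs")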
So now let's make a new configuration for this model, set the ReuseFactor to 2 for every layer, compile the model, and then evaluate its performance. (Note that by creating a new config with granularity='Model', we're implicitly resetting the precision to ap_fixed<16,6> throughout.) Changing the ReuseFactor should not change the classification results, but let's verify that by inspecting the accuracy and ROC curve again!
Then we’ll build the model.
config = hls4ml.utils.config_from_keras_model(model, granularity='Model')
print("-----------------------------------")
print(config)
print("-----------------------------------")
# Set the ReuseFactor to 2 throughout
config['Model']['ReuseFactor'] = 2
hls_model = hls4ml.converters.convert_from_keras_model(
model, hls_config=config, output_dir='model_1/hls4ml_prj_2', part='xcu250-figd2104-2L-e'
)
hls_model.compile()
y_hls = hls_model.predict(X_test)
print("Keras Accuracy: {}".format(accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_keras, axis=1))))
print("hls4ml Accuracy: {}".format(accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_hls, axis=1))))
plt.figure(figsize=(9, 9))
_ = plotting.makeRoc(y_test, y_keras, classes)
plt.gca().set_prop_cycle(None) # reset the colors
_ = plotting.makeRoc(y_test, y_hls, classes, linestyle='--')
Now build the model. This can take several minutes. While the C synthesis is running, we can monitor the progress by looking at the log file: open a terminal from the notebook home and execute:
tail -f model_1/hls4ml_prj_2/vivado_hls.log
hls_model.build(csim=False)
And now print the report, and compare it to the report from the part 1 project:
hls4ml.report.read_vivado_report('model_1/hls4ml_prj_2')
hls4ml.report.read_vivado_report('model_1/hls4ml_prj')
Exercise#
Recall the outcome of the exercise in part 1, where we estimated how many DSPs our network should use. How does this change now that we've used ReuseFactor = 2 for the network? Does the expectation match the report this time?