I've been studying quantization using TensorFlow's TFLite. As far as I understand, it is possible to quantize my model weights (so that they are stored using 4x less memory), but that doesn't necessarily imply that the model won't convert them back to floats to run. I've also understood that to run my model using only integers I need to set the following parameters:
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
I'd like to know what the difference is in tf.lite.Interpreter between a loaded model in which those parameters were set and one in which they weren't. I tried to investigate .get_tensor_details() for that, but I didn't notice any difference.
Depending on your requirements (performance, memory and runtime), post-training quantization can be done in two ways.
Approach #1: Post-training weight quantization (quantizes weights only)
In this case only the weights are quantized to int8; the activations stay in floating point, and the inference input and output are floating-point as well.
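import pathlib
import tensorflow as tf
# Assumption: `model` is a trained Keras model; the directory name below is only illustrative.
tflite_models_dir = pathlib.Path("tflite_models")
tflite_models_dir.mkdir(exist_ok=True, parents=True)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.experimental_new_converter = True
# Post-training quantization (OPTIMIZE_FOR_SIZE is a deprecated alias that behaves like DEFAULT)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
tflite_model_quant_file = tflite_models_dir/"lstm_model_quant.tflite"
tflite_model_quant_file.write_bytes(tflite_quant_model)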
Approach #2: Full integer quantization (quantizes weights and activations)
In this case both weights and activations are quantized to int8. First set converter.optimizations as in approach #1, then add the following settings. Note that the converter also needs a representative dataset (converter.representative_dataset) to calibrate the activation ranges; without one, a conversion restricted to TFLITE_BUILTINS_INT8 fails. The resulting model takes integer input and produces integer output, which makes it compatible with more accelerators, such as the Coral Edge TPU.
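# Assumption: representative_data_gen is a user-defined generator that yields sample inputs for calibration
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model_quant = converter.convert()
tflite_model_quant_file = tflite_models_dir/"lstm_model_quant_io.tflite"
tflite_model_quant_file.write_bytes(tflite_model_quant)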
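As for seeing the difference from the interpreter side: the most visible change is usually in get_input_details() and get_output_details() rather than in the full tensor list. Each entry is a dict with a "dtype" and a "quantization" pair (scale, zero_point). Below is a rough sketch of a helper that summarizes those entries. The name summarize_io is made up for illustration; it only assumes the dict keys that the interpreter returns.

```python
def summarize_io(details):
    """Summarize the dicts returned by get_input_details() / get_output_details().

    Hypothetical helper, not part of TensorFlow.
    """
    summary = []
    for d in details:
        dtype = d["dtype"]
        # The interpreter reports NumPy types such as numpy.uint8; use their name.
        dtype_name = getattr(dtype, "__name__", str(dtype))
        # "quantization" is a (scale, zero_point) pair; float tensors report (0.0, 0).
        scale, zero_point = d.get("quantization", (0.0, 0))
        summary.append({
            "name": d.get("name", ""),
            "dtype": dtype_name,
            "scale": float(scale),
            "zero_point": int(zero_point),
            "integer_io": dtype_name in ("uint8", "int8"),
        })
    return summary


# Usage sketch (assumes TensorFlow is installed and the model file from above exists):
# import tensorflow as tf
# interpreter = tf.lite.Interpreter(model_path="tflite_models/lstm_model_quant_io.tflite")
# interpreter.allocate_tensors()
# print(summarize_io(interpreter.get_input_details()))
```

With the weight-only model I would expect dtype float32 and a quantization of (0.0, 0) for the input; with the full integer model, uint8 and a non-zero scale.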
More details on weight quantization are here and you can find more details on full integer quantization here.
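One practical consequence of integer input and output is that you have to convert your float data yourself before calling the interpreter, and convert the result back afterwards. TFLite uses the affine mapping real_value = (quantized_value - zero_point) * scale. Here is a minimal pure-Python sketch of that arithmetic; the function names are my own, and scale and zero_point are assumed to come from the "quantization" field of the interpreter's input or output details. With NumPy arrays the same formula applies elementwise.

```python
def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map real values to integers: q = round(x / scale) + zero_point, clamped to [qmin, qmax].

    The default range matches uint8; use qmin=-128, qmax=127 for int8.
    """
    return [max(qmin, min(qmax, int(round(x / scale)) + zero_point)) for x in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to real values: x = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in q_values]


# Example with illustrative parameters scale=0.5, zero_point=128:
print(quantize([1.0, -1.0, 0.0, 1000.0], 0.5, 128))  # [130, 126, 128, 255]
print(dequantize([130, 126], 0.5, 128))              # [1.0, -1.0]
```

Values outside the representable range are clamped (1000.0 becomes 255 above), so the round trip is only approximate.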