Hi! In this post I want to get into the details of model compression techniques. This matters because in deep learning you still get better metrics by increasing model complexity, whether measured in computation (FLOPs) or in the number of parameters, which translates to model capacity: how much information the model can hold.

Model compression is very helpful when deploying a model, whether on a server, a mobile phone, or an embedded device: you can get a significant speed boost or a larger batch size at inference time.

This information may become outdated in a couple of years, but I expect the techniques below will only evolve, so the basics are worth knowing. I should also note that the biggest improvements come from applying these techniques to image models and transformers.

I’ve split the post into 3 parts because it was getting too large 🙂

### Quantization

I will start with a very simple and effective technique. Basically, quantization means representing every model parameter with fewer bits, trading a little precision for a lot of memory.

Right now, almost every deep learning framework uses float32 numbers for model parameters by default. This means a model with 100M parameters takes about 400 MB (4 bytes per parameter). That’s hard to fit on a small device.
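
The arithmetic is easy to check yourself: each float32 parameter takes 4 bytes, so 100M parameters is roughly 400 MB. A quick sketch (the toy layer sizes are made up for illustration):

```python
import torch.nn as nn

# a toy model; real networks simply stack many more layers like these
model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 10))

n_params = sum(p.numel() for p in model.parameters())
# element_size() is 4 bytes for float32 tensors
n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(n_params, n_bytes / 1e6, 'MB')
```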

A good starting point is to convert all float32 values to float16. It halves the size of the model with almost no drop in metrics! There are exceptions, however, such as generative models (like GANs) or RNNs.

### FP16

Conversion to half precision in PyTorch is as easy as calling `.half()` on an `nn.Module` instance:

```
import torch
from torchvision.models import resnet18
model = resnet18(pretrained=True).cuda().half().eval()
torch.save(model.state_dict(), 'model_fp16.pth')
```

We can compare the results (label and probability) of the float32 and float16 models.

1. read the model and the image

```
import torch
import torchvision
import torchvision.transforms as T
from torchvision.models import resnet18

model = resnet18(pretrained=True).cuda().eval()
preprocessing = T.Compose([
    T.ToPILImage(),
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
im = preprocessing(torchvision.io.read_image('dog.jpg')).unsqueeze(0)
prob1, class_idx1 = torch.nn.functional.softmax(model(im.cuda()).detach(), -1).max(-1)
```

2. get the output of the half-precision model

```
model_half = resnet18(pretrained=True).cuda().half().eval()
prob2, class_idx2 = torch.nn.functional.softmax(model_half(im.cuda().half()).detach(), -1).max(-1)
print(class_idx1, prob1.item())
print(class_idx2, prob2.item())
```

which gets us

```
tensor([215], device='cuda:0') 0.9550936818122864
tensor([215], device='cuda:0') 0.955078125
```

So there is no drop in performance: the probabilities only start to differ at the 5th decimal digit.

Note that I used `.cuda()` to put the model and image on the GPU. If you try to run fp16 inference on the CPU, you will get an error:

```
RuntimeError: "unfolded2d_copy" not implemented for 'Half'
```

which means that convolution is not implemented for fp16 on the CPU. PyTorch uses different kernels for different dtypes, and this one is simply not available yet. So fp16 inference is CUDA-only; alternatively, you can save the model in fp16 and convert it back to float32 when no GPU is available, which still saves disk space.
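
To illustrate the save-in-fp16, run-in-fp32 trick (the file name and the toy layer here are arbitrary):

```python
import torch
from torch import nn

model = nn.Linear(10, 10)
# save the weights in half precision: half the disk space
torch.save(model.half().state_dict(), 'model_fp16.pth')

# on a CPU-only machine: load the fp16 weights and cast back to float32
state = {k: v.float() for k, v in torch.load('model_fp16.pth').items()}
model_cpu = nn.Linear(10, 10)
model_cpu.load_state_dict(state)
out = model_cpu(torch.randn(1, 10))  # runs fine on the cpu in float32
```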

There is one additional perk of fp16 on modern GPUs. Starting from the V100 and the Nvidia 20 series (RTX 2060 through RTX 2080 Ti), GPUs have tensor cores optimized for reduced precision (in 2017-2018 it was fp16 only, but newer generations accelerate other precisions as well, see https://www.nvidia.com/en-us/data-center/tensor-cores/). You can expect roughly a 1.5-2x speedup from float16.

*Side note*: DO NOT train your model in fp16 using a plain `.half()` conversion. Your activations and loss will most likely explode or vanish very fast: in training, those extra bits of float32 really do matter. Use mixed precision training instead, where loss values are dynamically scaled before the backward pass. See https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html or https://github.com/NVIDIA/apex (Apex is still my favorite), or use PyTorch Lightning, which has this feature built in.
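
The recipe can be sketched as a toy training step with `torch.cuda.amp` (loss scaling is enabled only when a GPU is available, so the snippet also runs on a CPU as plain fp32):

```python
import torch
from torch import nn

use_amp = torch.cuda.is_available()
device = 'cuda' if use_amp else 'cpu'

model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x, y = torch.randn(32, 10, device=device), torch.randn(32, 1, device=device)
for _ in range(3):
    optimizer.zero_grad()
    # the forward pass runs in fp16 where it is safe, fp32 where it is not
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = nn.functional.mse_loss(model(x), y)
    # scale the loss before backward so small fp16 gradients don't underflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```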

There are also float16 variants that allow training without mixed precision. bfloat16 from Google Brain solves this problem, but currently only Google TPU pods and the Nvidia A100 support this data type.
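
The difference is easy to see in the interpreter: bfloat16 keeps float32’s 8 exponent bits, so values that overflow float16 survive in bfloat16 (at the cost of precision):

```python
import torch

x = torch.tensor(70000.0)  # above the float16 maximum (~65504)
print(x.half())            # overflows to inf in fp16
print(x.bfloat16())        # stays finite: bf16 has the dynamic range of fp32
```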

### INT8

We can go further and reduce the size even more, for example with 8-bit integers.

Converting floats to ints is not so trivial. We cannot just cut off the decimal digits, like 13.2324 -> 13, because parameters are usually small and roughly normally distributed. Instead, we rescale the parameters and remember the scale value for each layer (or each channel in a layer). We also need to remember a zero point, chosen so that the value zero is represented exactly, which is critical for some layers and for padding.

We choose the scale so that our values land in the range [0..255]. Note the round operator here, which introduces a rounding error.
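
The scheme can be sketched in a few lines (per-tensor asymmetric quantization into the uint8 range [0..255]):

```python
import torch

w = torch.randn(64) * 0.1  # small, roughly normally distributed "weights"

# affine quantization: q = round(w / scale) + zero_point
scale = (w.max() - w.min()) / 255
zero_point = (-w.min() / scale).round().int()
q = ((w / scale).round().int() + zero_point).clamp(0, 255)

# dequantize and look at the rounding error
w_hat = (q - zero_point).float() * scale
print((w - w_hat).abs().max())  # on the order of scale / 2
```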

Quantization in PyTorch works like this: quant -> quantized inference -> dequant. Every layer in the network is replaced by its quantized counterpart; for example, `nn.Linear` is replaced by a quantized Linear with `torch.qint8` weights plus `scale` and `zero_point` attributes. The model input is also quantized, using parameters obtained from a calibration process (you feed the model some inputs and record the activations, both in float32 and int8, to compute those values).
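
PyTorch exposes the same quant/dequant primitives directly, which is handy for poking around (the scale and zero point here are picked by hand, not obtained from calibration):

```python
import torch

x = torch.randn(4)
xq = torch.quantize_per_tensor(x, scale=0.05, zero_point=128, dtype=torch.quint8)
print(xq.int_repr())    # the raw uint8 storage
print(xq.dequantize())  # back to float32, now with rounding error
print(xq.q_scale(), xq.q_zero_point())
```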

Thankfully, much of the work has already been done, so you can get a quantized model (even one pretrained on ImageNet) in 2 lines:

```
# quantized resnet18 class
model_quantized = torchvision.models.quantization.resnet18(pretrained=True, quantize=True)
prob2, class_idx2 = torch.nn.functional.softmax(model_quantized(im).detach(), -1).max(-1)
print(class_idx1, prob1.item())
print(class_idx2, prob2.item())
```

and it gets us

```
tensor([215]) 0.9551088213920593
tensor([215]) 0.9463070631027222
```

So the probability became lower, but if your model is robust enough, this will mostly not affect your accuracy score (the hyperplane separating the classes will still split them well enough).

Okay, that was for a pretrained model where we already had a model class with quantized operators. How do you convert your own model?

Starting from PyTorch 1.8, converting to int8 became easier. Previously you had to replace operators with quantized ones manually; now it can be done automatically with the symbolic tracing feature in torch.fx.

torch.fx is a Python-to-Python transformation toolkit: you can replace or remove layers of a model, change attributes, and so on. Read more at https://pytorch.org/docs/stable/fx.html
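
A quick look at what symbolic tracing produces (the toy module is just for illustration):

```python
import torch
from torch import nn
from torch.fx import symbolic_trace

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))

traced = symbolic_trace(Toy())  # a GraphModule with an editable graph
print(traced.graph)             # placeholder -> call_module fc -> call_function relu -> output
print(traced.code)              # the Python regenerated from that graph
```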

```
import torch.quantization.quantize_fx as quantize_fx
from torch.fx import symbolic_trace

qconfig_dict = {"": torch.quantization.get_default_qconfig('fbgemm')}
model_for_quantization = symbolic_trace(model.cpu())  # quantized inference runs on the cpu
model_prepared = quantize_fx.prepare_fx(model_for_quantization, qconfig_dict)
# perform calibration on one sample (usually you should forward many more inputs)
model_prepared(im)
model_quantized = quantize_fx.convert_fx(model_prepared)
prob2, class_idx2 = torch.nn.functional.softmax(model_quantized(im).detach(), -1).max(-1)
print(class_idx1, prob1.item())
print(class_idx2, prob2.item())
```

output:

```
tensor([215]) 0.9551088213920593
tensor([215]) 0.9552981853485107
```

Don’t forget to perform calibration (even on a single sample or random noise). Without it you will not get the `scale` and `zero_point` values, and quantization will not work properly.

*Side note*: there is a good chance you will get stuck in the quantization process because some operators are not supported, or even with fp16 because of inconsistent results. Always check the model outputs before deploying a compressed model!

Right now quantization in PyTorch works only on the CPU, but other tools exist; for example, TensorRT supports int8 GPU inference (and you can get an extra inference boost from tensor cores on the Nvidia A100).

If you want to dive deeper, there is more: you can quantize to int4 or even a single bit (binary networks); the OpenVINO toolkit supports such conversions. PyTorch also has more tooling, like dynamic quantization and quantization-aware training (for when the rounding error is still high and you want to fine-tune the model rather than only calibrate it on outputs).
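
Dynamic quantization, for example, is a one-liner: weights are stored in int8 and activations are quantized on the fly at runtime, which works well for Linear/LSTM-heavy models (the toy model here is arbitrary):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

model_int8 = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # quantize only the Linear layers
)
out = model_int8(torch.randn(1, 128))
print(type(model_int8[0]))  # a dynamically quantized Linear
```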

Good luck!