Practical ways to test the correctness of Vision Models on Mobile?

Testing the correctness of a vision model (or any other model) can be hard. I’d like to focus this discussion on vision models, since models in other verticals have their own quirks and idiosyncrasies that are extremely domain specific. I’ll go over the specific details that matter and the impact they have on model correctness.

I’d love to hear the experience of the community in dealing with these issues, strategies you have tried, and other factors that I may have missed.

Numeric Consistency

When a vision model is run on a specific platform (e.g. Android, x86, iOS), the kernels could use a different code path (preferring hardware-specific acceleration), which can affect the model’s numeric accuracy. This in turn could result in very slight changes to the floating point numbers in the intermediate activations and/or final output.

Hence, comparing the raw output of a model across platforms (when fed the same input) could show slightly different numeric results, and those differences may not be practically meaningful. If the output of the model is a Tensor (or list of tensors), the outputs can be compared using torch.allclose, which checks that every pair of elements agrees to within a relative and an absolute tolerance (rtol and atol) instead of requiring exact equality.
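As a rough illustration of what a tolerance-based comparison does, here is a minimal sketch of the elementwise criterion that torch.allclose documents, |actual - expected| <= atol + rtol * |expected|. It is written in plain Python so it runs without PyTorch installed. The function name outputs_close and the sample values are made up for this example and are not measurements from a real model.

```python
def outputs_close(actual, expected, rtol=1e-5, atol=1e-8):
    """Return True if every element satisfies |a - e| <= atol + rtol * |e|."""
    if len(actual) != len(expected):
        return False
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


# Made-up values standing in for the same model output on two platforms.
reference_output = [0.1234567, 2.5, -7.0]
device_output = [0.1234569, 2.50001, -7.00002]

# Default tolerances: the small differences are accepted.
print(outputs_close(device_output, reference_output))  # True

# Much stricter tolerances: the same differences are now rejected.
print(outputs_close(device_output, reference_output, rtol=1e-7, atol=0.0))  # False
```

With real tensors the equivalent call would be torch.allclose(device_output, reference_output, rtol=..., atol=...). Which tolerances are appropriate depends on the model and on how its output is consumed, so they are worth choosing deliberately rather than relying on the defaults.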

Quantized Models

For quantized models, PyTorch supports multiple quantization backends, namely FBGEMM (x86 only) and QNNPACK (runs on all hardware, but with optimized execution on ARM).

There are minor (but meaningful) differences between the two, which could affect intermediate activations. These differences could get amplified by subsequent layers, which could then impact the overall model’s correctness and/or accuracy.

For example:

  1. “Inconsistent result from FBGEMM and QNNPACK quantization backends” discusses how saturation in these backends affects correctness.
  2. This SO post talks about using the same backend during training AND inference. Specifically, this blog post says:

Since these libraries are architecture-dependent, static quantization must be performed on a machine with the same architecture as your deployment target. If you are using FBGEMM, you must perform the calibration pass on an x86 CPU (usually not a problem); if you are using QNNPACK, calibration needs to happen on an ARM CPU (this is quite a bit harder).
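To make the saturation point concrete, here is a small plain-Python sketch of how a saturating intermediate accumulator can change an integer dot product. It assumes a kernel that multiplies unsigned 8-bit activations by signed 8-bit weights and adds adjacent pairs into a saturating 16-bit value, which is a commonly cited behaviour of some x86 integer instructions. It is not taken from FBGEMM or QNNPACK source, and all function names and values are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate_int16(value):
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, value))


def dot_exact(activations, weights):
    """Dot product with no intermediate overflow (wide accumulator)."""
    return sum(a * w for a, w in zip(activations, weights))


def dot_saturating_pairs(activations, weights):
    """Dot product where each adjacent pair of products is summed into a
    saturating int16 before being added to a wide accumulator.
    Assumes an even number of elements."""
    total = 0
    for i in range(0, len(activations), 2):
        pair = activations[i] * weights[i] + activations[i + 1] * weights[i + 1]
        total += saturate_int16(pair)
    return total


weights = [127, 127, -3, 4]

# Full 8-bit activations (0..255): the first pair overflows int16 and saturates.
acts_8bit = [255, 255, 10, 20]
print(dot_exact(acts_8bit, weights), dot_saturating_pairs(acts_8bit, weights))
# 64820 32817

# Activations restricted to 7 bits (0..127): no pair can exceed int16.
acts_7bit = [127, 127, 5, 10]
print(dot_exact(acts_7bit, weights), dot_saturating_pairs(acts_7bit, weights))
# 32283 32283
```

Under these assumptions, the same quantized weights produce different sums depending on whether the kernel saturates, which is the kind of backend-specific difference that later layers can amplify.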

The Impact Of Resizing Algorithms On Input Images

Ross Wightman says in this Twitter thread:

Did you know that image models are sensitive to the image interpolation they are trained with? timm’s default is to randomly switch between Pillow bilinear and bicubic, it results in weights that are more robust to interpolation differences.

Most other pretrained models out there are trained with one interpolation… tf bilinear, cv2 bicubic, Pillow bicubic, etc. Using a diff interpolation at inference time drops your accuracy… often up to .2-.5% of your top-1 (ImageNet-1k val). Stronger aug helps reduce this too.

Due to differences across image libraries, there are also accuracy drops with the same interpolation type (due to impl diff, downsample anti-alias filtering, etc). I don’t cover that, but training on more than one Pillow interp appears to help a bit there based on measurements.

timm’s had this for over 2 years now, it’s harder to determine the train interp of any timm weights than torchvision, tensorflow models, etc. One of the biggest drops in accuracy when I port weights from tensorflow impl is typically due to TF vs Pillow image proc differences.

This suggests that if you want your vision model to have consistent accuracy across platforms, you need to train it with the resizing algorithms that those platforms will use at inference time. These algorithms are platform specific, and even implementations of the same algorithm may have their own quirks.
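To see why the choice of interpolation matters at all, here is a toy sketch that downsamples one row of pixels with two different methods. Nearest-neighbour and linear interpolation stand in for the bilinear vs. bicubic difference discussed above, since they are easy to write in plain Python. These are simplified 1-D versions using half-pixel centres, not the Pillow, cv2, or TensorFlow implementations, and the function names are made up for this example.

```python
import math


def resize_nearest(row, out_len):
    """Downsample a 1-D row by picking the nearest source pixel."""
    scale = len(row) / out_len
    return [row[min(int((j + 0.5) * scale), len(row) - 1)] for j in range(out_len)]


def resize_linear(row, out_len):
    """Downsample a 1-D row by linear interpolation between the two
    neighbouring source pixels (no anti-aliasing filter)."""
    scale = len(row) / out_len
    out = []
    for j in range(out_len):
        x = (j + 0.5) * scale - 0.5          # source coordinate of output pixel j
        x = max(0.0, min(x, len(row) - 1))   # clamp to the valid range
        i0 = int(math.floor(x))
        i1 = min(i0 + 1, len(row) - 1)
        t = x - i0
        out.append(row[i0] * (1 - t) + row[i1] * t)
    return out


# A high-frequency row (alternating dark and bright pixels).
row = [0, 100, 0, 100, 0, 100, 0, 100]

nearest = resize_nearest(row, 4)
linear = resize_linear(row, 4)
print(nearest)  # [100, 100, 100, 100]
print(linear)   # [50.0, 50.0, 50.0, 50.0]
print(max(abs(a - b) for a, b in zip(nearest, linear)))  # 50.0
```

The same source pixels yield very different model inputs depending on the method. Real images are less extreme than this alternating pattern, but the model still sees a shifted input distribution if inference uses a different resize than training did.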

Why do you need to resize the input image in the first place? Typically vision models accept images that are at least or at most a certain size in a specific dimension, and camera output image sizes vary across devices, platforms, and manufacturers. Yes, there’s a lot of heterogeneity to consider when dealing with mobile inference!

One could consider resizing to be a type of image augmentation step that is performed on input images. Hence, the same consideration applies to any other input preparation step, such as input normalization.

Can I verify semantic correctness?

Given all the constraints above, it can be fairly hard to guarantee semantic correctness for vision models across platforms. It’s not clear to me if there’s any way to bound the accuracy error either. The way to go seems to be to test and validate models on all the platforms that you’re targeting and ensure that the accuracy is within an acceptable range for the set of targeted use-cases.
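One way to structure such a check is sketched below. It assumes you have already collected top-1 predictions for the same labelled validation set on each target platform, and it flags any platform whose accuracy falls more than a chosen margin below a reference platform. The platform names, predictions, labels, and the max_drop threshold are all invented for illustration.

```python
def top1_accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def check_platforms(predictions_by_platform, labels, reference, max_drop):
    """Return {platform: (accuracy, within_tolerance)} relative to `reference`."""
    ref_acc = top1_accuracy(predictions_by_platform[reference], labels)
    report = {}
    for platform, preds in predictions_by_platform.items():
        acc = top1_accuracy(preds, labels)
        report[platform] = (acc, ref_acc - acc <= max_drop)
    return report


# Toy data: 10 labelled samples and made-up predictions per platform.
labels = [0, 1, 2, 1, 0, 2, 1, 0, 2, 1]
predictions_by_platform = {
    "x86_reference": [0, 1, 2, 1, 0, 2, 1, 0, 2, 0],
    "android": [0, 1, 2, 1, 0, 2, 1, 0, 1, 0],
    "ios": [0, 1, 2, 1, 0, 2, 1, 0, 2, 0],
}

report = check_platforms(
    predictions_by_platform, labels, reference="x86_reference", max_drop=0.05
)
for platform, (acc, ok) in report.items():
    print(f"{platform}: accuracy={acc:.2f} {'OK' if ok else 'OUT OF RANGE'}")
```

A check like this does not bound the error in any theoretical sense. It only turns "within an acceptable range" into an explicit, repeatable test that can run for each platform and each model release.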
