Author
Listed:
- Anton Trusov
(Department of Mathematical Software for Computer Science, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, 119333 Moscow, Russia
Smart Engines Service LLC, 117312 Moscow, Russia
Phystech School of Applied Mathematics and Informatics, Moscow Institute of Physics and Technology, 141701 Dolgoprudny, Russia)
- Elena Limonova
(Department of Mathematical Software for Computer Science, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, 119333 Moscow, Russia
Smart Engines Service LLC, 117312 Moscow, Russia)
- Dmitry Nikolaev
(Smart Engines Service LLC, 117312 Moscow, Russia
Vision Systems Laboratory, Institute for Information Transmission Problems of Russian Academy of Sciences, 127051 Moscow, Russia)
- Vladimir V. Arlazarov
(Department of Mathematical Software for Computer Science, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, 119333 Moscow, Russia
Smart Engines Service LLC, 117312 Moscow, Russia)
Abstract
Quantization is a widespread method for reducing the inference time of neural networks on mobile Central Processing Units (CPUs). Eight-bit quantized networks achieve quality close to that of full-precision models and fit the hardware architecture well, with one-byte coefficients and thirty-two-bit dot-product accumulators. Lower-precision quantizations usually suffer from noticeable quality loss and require specialized computational algorithms to outperform eight-bit quantization. In this paper, we propose a novel 4.6-bit quantization scheme that allows for more efficient use of CPU resources. This scheme has more quantization bins than four-bit quantization and is more accurate, while preserving the computational efficiency of the latter (it runs only 4% slower). Our multiplication uses a combination of 16- and 32-bit accumulators and avoids the multiplication depth limitation of the previous 4-bit multiplication algorithm. Experiments with different convolutional neural networks on the CIFAR-10 and ImageNet datasets show that 4.6-bit quantized networks are 1.5–1.6 times faster than eight-bit networks on an ARMv8 CPU. In terms of quality, the results of a 4.6-bit quantized network are close to the mean of the four-bit and eight-bit networks of the same architecture. Therefore, 4.6-bit quantization may serve as an intermediate solution between fast but inaccurate low-bit network quantizations and accurate but relatively slow eight-bit ones.
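A minimal scalar sketch of the mixed-accumulator idea the abstract describes. The abstract does not give the exact bin count or the authors' ARMv8 SIMD kernels, so this example assumes unsigned values in [0, 23] (24 bins, since log2 24 ≈ 4.58 is one plausible reading of "4.6-bit") and an arbitrary flush period of 64 products; it illustrates accumulating small products in 16-bit partial sums and periodically widening them into a 32-bit accumulator, and is not the paper's implementation.

```cpp
// Illustrative sketch, not the authors' kernel. Assumption: low-bit
// unsigned values in [0, 23], so each product is at most 23 * 23 = 529
// and a 16-bit accumulator can safely hold a block of 64 products
// (64 * 529 = 33,856 < 65,535) before it must be flushed.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Dot product of two low-bit vectors of arbitrary length n.
std::int32_t dot_lowbit(const std::uint8_t* a, const std::uint8_t* b,
                        std::size_t n) {
    constexpr std::size_t kBlock = 64;  // products per 16-bit partial sum
    std::int32_t acc32 = 0;
    std::size_t i = 0;
    while (i < n) {
        std::uint16_t acc16 = 0;
        const std::size_t end = std::min(i + kBlock, n);
        for (; i < end; ++i) {
            // Each product fits in 16 bits; the block sum cannot overflow.
            acc16 += static_cast<std::uint16_t>(a[i]) * b[i];
        }
        acc32 += acc16;  // widen to 32 bits once per block
    }
    return acc32;
}

int main() {
    // Worst-case inputs under the assumed 24-bin range.
    std::vector<std::uint8_t> a(1000, 23), b(1000, 23);
    std::cout << dot_lowbit(a.data(), b.data(), a.size()) << "\n";  // 529000
    return 0;
}
```

The periodic flush is the point of the sketch: because the 16-bit partial sum is emptied into a 32-bit accumulator every block, the dot-product length is unbounded, which is one way to read the abstract's claim of avoiding the multiplication depth limitation of the earlier 4-bit algorithm.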
Suggested Citation
Anton Trusov & Elena Limonova & Dmitry Nikolaev & Vladimir V. Arlazarov, 2024.
"4.6-Bit Quantization for Fast and Accurate Neural Network Inference on CPUs,"
Mathematics, MDPI, vol. 12(5), pages 1-22, February.
Handle:
RePEc:gam:jmathe:v:12:y:2024:i:5:p:651-:d:1344481