Authors
Listed:
- Xin Cheng
(Graduate School of Science and Engineering, Hosei University, Tokyo 184-8584, Japan)
- Zhiqiang Zhang
(School of Science and Technology, Southwest University of Science and Technology, Mianyang 621010, China)
- Wei Weng
(Institute of Liberal Arts and Science, Kanazawa University, Kanazawa City 920-1192, Japan)
- Wenxin Yu
(School of Science and Technology, Southwest University of Science and Technology, Mianyang 621010, China)
- Jinjia Zhou
(Graduate School of Science and Engineering, Hosei University, Tokyo 184-8584, Japan)
Abstract
The complexity of deep neural networks (DNNs) severely limits their deployment on devices with limited computing and storage resources. Knowledge distillation (KD) is an attractive model compression technique that can effectively alleviate this problem. Multi-teacher knowledge distillation (MKD) aims to leverage the valuable and diverse knowledge distilled from multiple teacher networks to improve the performance of the student network. Existing approaches typically fuse the distilled knowledge by simply averaging the prediction logits or by using sub-optimal weighting strategies. Such techniques cannot fully reflect the relative importance of the teachers and may even mislead the student's learning. To address this issue, we propose a novel approach, Decoupled Multi-Teacher Knowledge Distillation based on Entropy (DE-MKD). DE-MKD decouples the vanilla knowledge distillation loss and assigns each teacher an adaptive weight, derived from the entropy of its predictions, to reflect its importance. Furthermore, we extend the approach to distill the intermediate features of multiple powerful but cumbersome teachers, further improving the performance of the lightweight student network. Extensive experiments on the publicly available CIFAR-100 image classification benchmark with various teacher-student network pairs demonstrate the effectiveness and flexibility of our approach. For instance, the VGG8|ShuffleNetV2 student trained with DE-MKD reaches 75.25%|78.86% top-1 accuracy when VGG13|WRN40-2 is chosen as the teacher, setting new performance records. In addition, surprisingly, the distilled student outperforms its teacher in both teacher-student pairs.
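The core mechanism described in the abstract, weighting each teacher by the entropy of its predictions, can be illustrated with a short sketch. The code below is an assumption-laden illustration rather than the authors' implementation: it weights per-teacher KL distillation terms by a softmax over negative prediction entropies (so that, under this assumed mapping, a more confident teacher receives a larger weight), and it omits both the decoupling of the KD loss and the feature-level extension described in the paper. The function name entropy_weighted_kd_loss and the temperature T are hypothetical.

# Illustrative sketch (not the authors' code): entropy-based weighting of
# multiple teachers for logit distillation.
import torch
import torch.nn.functional as F


def entropy_weighted_kd_loss(student_logits, teacher_logits_list, T=4.0):
    """Combine per-teacher KL distillation losses using entropy-based weights.

    student_logits:      [batch, classes]
    teacher_logits_list: list of [batch, classes] tensors, one per teacher
    T:                   softmax temperature (assumed hyperparameter)
    """
    log_p_student = F.log_softmax(student_logits / T, dim=1)

    per_teacher_losses, entropies = [], []
    for t_logits in teacher_logits_list:
        p_teacher = F.softmax(t_logits / T, dim=1)
        # Shannon entropy of the teacher's softened prediction, averaged over the batch.
        entropy = -(p_teacher * torch.log(p_teacher + 1e-8)).sum(dim=1).mean()
        entropies.append(entropy)
        # Standard KD term: KL(teacher || student), scaled by T^2.
        kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)
        per_teacher_losses.append(kd)

    # Assumed mapping for this sketch: lower entropy -> larger weight
    # (softmax over negative entropies).
    weights = F.softmax(-torch.stack(entropies), dim=0)
    return sum(w * l for w, l in zip(weights, per_teacher_losses))

In practice, such a distillation term would be combined with the standard cross-entropy loss on the ground-truth labels; how exactly the entropy is mapped to a weight, and how the KD loss is decoupled, are design choices specified in the paper itself.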
Suggested Citation
Xin Cheng & Zhiqiang Zhang & Wei Weng & Wenxin Yu & Jinjia Zhou, 2024.
"DE-MKD: Decoupled Multi-Teacher Knowledge Distillation Based on Entropy,"
Mathematics, MDPI, vol. 12(11), pages 1-10, May.
Handle:
RePEc:gam:jmathe:v:12:y:2024:i:11:p:1672-:d:1403167