Efficient Neural Network Inference via Structured Pruning on Edge Devices
Abstract
We present a structured pruning pipeline that reduces model size by 70% while retaining 95% of baseline accuracy on ImageNet. Our approach combines magnitude-based filter pruning with knowledge distillation, enabling real-time inference on Raspberry Pi 4 and Jetson Nano platforms. We benchmark latency, memory footprint, and accuracy across five model architectures (ResNet-50, MobileNetV3, EfficientNet-B0, ViT-Small, and DeiT-Tiny) and demonstrate that structured pruning consistently outperforms unstructured methods for edge deployment.
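As a rough illustration of the magnitude-based filter pruning the abstract describes, the sketch below ranks convolutional filters by their L1 norm and drops the lowest-scoring fraction. This is a minimal, framework-free sketch under our own assumptions; the function names and the flat-list weight representation are hypothetical and do not come from the paper's pipeline.

```python
# Hedged sketch of magnitude-based filter pruning: score each filter by
# its L1 norm (sum of absolute weights) and remove the weakest fraction.
# Representation is a hypothetical flat list of weights per filter.

def l1_norm(filter_weights):
    """L1 norm of one filter: sum of absolute weight values."""
    return sum(abs(w) for w in filter_weights)

def prune_filters(filters, prune_ratio):
    """Return sorted indices of filters to KEEP after removing the
    lowest-L1-norm fraction given by prune_ratio."""
    # Rank filter indices from smallest to largest L1 norm.
    ranked = sorted(range(len(filters)), key=lambda i: l1_norm(filters[i]))
    n_prune = int(len(filters) * prune_ratio)
    # Keep everything except the n_prune weakest filters.
    return sorted(ranked[n_prune:])

filters = [
    [0.9, -0.8, 0.7],     # strong filter (L1 = 2.4)
    [0.01, 0.02, -0.01],  # near-zero filter (L1 = 0.04), pruned first
    [0.5, 0.4, -0.6],     # strong filter (L1 = 1.5)
    [0.05, -0.03, 0.02],  # near-zero filter (L1 = 0.10), pruned second
]
kept = prune_filters(filters, prune_ratio=0.5)
print(kept)  # -> [0, 2]
```

Because whole filters are removed (rather than individual weights zeroed, as in unstructured pruning), the resulting layer is genuinely smaller and needs no sparse-kernel support, which is what makes this style of pruning attractive on edge hardware.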