We are pleased to release new results on approximation of deep Convolutional Neural Networks (CNN). We used Ristretto to approximate trained 32-bit floating point CNNs. The key results can be summarized as follows:
- 8-bit dynamic fixed point is enough to approximate three ImageNet networks.
- 32-bit multiplications can be replaced by bit-shifts for small networks.
Continue reading “On Resource-Efficient Inference using Trained Convolutional Neural Networks”
This extended abstract describes our Ristretto framework for CNN compression. Last week Philipp presented the Ristretto Poster at ICLR’16 in San Juan.
High computational complexity hinders the widespread usage of Convolutional Neural Networks (CNNs), especially in mobile devices. Hardware accelerators are arguably the most promising approach for reducing both execution time and power consumption. One of the most important steps in accelerator development is hardware-oriented model approximation. In this paper we present Ristretto, a model approximation framework that analyzes a given CNN with respect to numerical resolution used in representing weights and outputs of convolutional and fully connected layers. Ristretto can condense models by using fixed point arithmetic and representation instead of floating point. Moreover, Ristretto fine-tunes the resulting fixed point network. Given a maximum error tolerance of 1%, Ristretto can successfully condense CaffeNet and SqueezeNet to 8-bit. The code for Ristretto is available.
Deep Convolutional Neural Networks (DCNN) have proven to be very effective in many pattern recognition applications, such as image classification and speech recognition. Due to their computational complexity, DCNNs demand implementations that utilize custom hardware accelerators to meet performance and energy-efficiency constraints. In this paper, we propose an FPGA-based accelerator architecture which leverages all sources of parallelism in DCNNs. We develop analytical feasibility and performance estimation models that take into account various design and platform parameters. We also present a design space exploration algorithm for obtaining the implementation with the highest performance on a given platform. Simulation results with a real-life DCNN demonstrate that our accelerator outperforms other competing approaches, which disregard some sources of parallelism in the application. Most notably, our accelerator runs 1.9X faster than the state-of-the-art DCNN accelerator on the same FPGA device.
Local Copy Download [PDF]
Presentation Download [PDF]
Link on IEEE Xplore