Machine Intelligence on Resource-Constrained IoT Devices: The Case of Thread Granularity Optimization for CNN Inference

Despite their remarkable performance in various machine intelligence tasks, the computational intensity of
Convolutional Neural Networks (CNNs) has hindered their widespread utilization in resource-constrained
embedded and IoT systems. To address this problem, we present a framework for synthesis of efficient CNN
inference software targeting mobile SoC platforms. We argue that thread granularity can substantially impact the performance and energy dissipation of the synthesized inference software, and demonstrate that
launching the maximum number of logical threads, often promoted as a guiding principle by GPGPU practitioners, does not result in an efficient implementation for mobile SoCs. We hypothesize that the runtime of a
CNN layer on a particular SoC platform can be accurately estimated as a linear function of its computational
complexity, which may seem counter-intuitive, as modern mobile SoCs utilize a plethora of heterogeneous
architectural features and dynamic resource management policies. Consequently, we develop a principled
approach and a data-driven analytical model to optimize granularity of threads during CNN software synthesis. Experimental results with several modern CNNs mapped to a commodity Android smartphone with a
Snapdragon SoC show up to 2.37X speedup in application runtime, and up to 1.9X improvement in its energy
dissipation compared to existing approaches.
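The linear-runtime hypothesis above can be sketched as an ordinary least-squares fit of measured per-layer runtimes against each layer's computational complexity (e.g., its multiply-accumulate count). The layer profile below is purely illustrative, not data from the paper:

```python
import numpy as np

# Hypothetical per-layer profile of a CNN on a mobile SoC.
# macs:        computational complexity per layer (multiply-accumulate count)
# runtime_ms:  measured per-layer runtime in milliseconds
# Both arrays are made-up illustrative numbers, not measurements from the paper.
macs = np.array([70e6, 110e6, 150e6, 220e6, 300e6])
runtime_ms = np.array([3.1, 4.6, 6.0, 8.7, 11.5])

# Fit the linear model  runtime ~= a * macs + b  via ordinary least squares.
A = np.vstack([macs, np.ones_like(macs)]).T
(a, b), *_ = np.linalg.lstsq(A, runtime_ms, rcond=None)

# With the coefficients in hand, the runtime of an unprofiled layer can be
# predicted from its MAC count alone.
predicted = a * macs + b
print(f"slope: {a:.3e} ms/MAC, intercept: {b:.3f} ms")
```

Such a model, once calibrated for a target SoC, lets a synthesis framework estimate layer runtimes analytically instead of exhaustively profiling every thread-granularity configuration.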