WEDNESDAY, October 18, 1:00pm-2:30pm  Crystal 2
EVENT TYPE: SPECIAL SESSION
SESSION CO8B
CODES+ISSS: Approximate Computing for Scalable and Energy-Efficient Embedded Systems
Chair:
Terrence Mak  Univ. of Southampton
Organizers:
Davide Patti  Univ. of Catania
Terrence Mak  Univ. of Southampton
Maurizio Palesi  Univ. of Catania
Energy-Efficient Image Processing using Significance-Driven Adaptive Approximate Computing
With increasing resolutions, the volume of data generated by image processing applications is escalating dramatically. As such, when coupled with real-time performance requirements, reducing energy consumption is proving highly challenging. In this paper, we propose a novel approach for image processing applications using significance-driven approximate computing. Core to our approach is the fundamental tenet that image data should be processed intelligently based on their informational value, i.e. significance. For the first time, we define the concept of significance in the context of image processing. We show how the complexity of data processing tasks can be drastically reduced when computing decisions are synergistically adapted to significance learning principles. Using these principles, more significant data are processed at higher precision with higher operating frequencies, while those with less significance are processed at reduced precision with lower operating frequencies, all while maintaining a given quality requirement. Two concrete case studies are used to evaluate the effectiveness of our approach: an application-specific hardware-based adaptive approximate image filter, and a software-based variable-kernel parallel convolution filter running on an Odroid XU4 platform. We demonstrate that our approach reduces energy by up to 40% for a real-time performance requirement of 15 fps, compared with existing approaches that are agnostic of significance and quality/energy trade-offs.
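The precision-adaptation idea in this abstract can be illustrated with a minimal sketch. The paper's actual significance metric and thresholds are not given here, so the `significance` function (local gradient magnitude), block size, and threshold below are illustrative assumptions: blocks judged significant are kept at higher precision, the rest are quantized more aggressively.

```python
import numpy as np

def quantize(tile, bits):
    """Quantize a [0, 1) image tile to the given bit width
    (a simple stand-in for reduced-precision processing)."""
    levels = 2 ** bits
    return np.floor(tile * levels) / levels

def significance(tile):
    """Hypothetical significance metric: mean local gradient magnitude."""
    gy, gx = np.gradient(tile)
    return float(np.mean(np.abs(gx) + np.abs(gy)))

def process_adaptive(img, block=8, threshold=0.05):
    """Process significant blocks at 8 bits, the rest at 4 bits."""
    out = img.copy()
    for y in range(0, img.shape[0], block):
        for x in range(0, img.shape[1], block):
            tile = img[y:y+block, x:x+block]
            bits = 8 if significance(tile) > threshold else 4
            out[y:y+block, x:x+block] = quantize(tile, bits)
    return out

rng = np.random.default_rng(0)
img = rng.random((32, 32))
out = process_adaptive(img)
```

In a hardware realization, the per-block precision decision would also select an operating frequency, as the abstract describes; the sketch only models the precision side of that trade-off.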
Embedded Abundant-Data Computing Enabled by Amalgamation, Acceleration and Approximation
The world's appetite for abundant-data computing such as deep learning has increased dramatically. The computational demands of these applications far exceed the capabilities of today's systems, especially for energy-constrained embedded systems. These demands cannot be met by isolated improvements in transistor technologies, memories, or integrated circuit (IC) architectures alone. Transformative nanosystems, which leverage the unique properties of emerging nanotechnologies to create new IC architectures, are required to deliver unprecedented functionality, performance, and energy efficiency. Our new nanosystems approach overcomes these challenges through recent advances across the computing stack: (a) highly energy-efficient logic and memory nanotechnologies; (b) ultra-dense (e.g., monolithic) three-dimensional integration with fine-grained connectivity, which enables new architectures for computation immersed in memory; (c) programmable accelerators that improve domain-specific computing energy efficiency; and (d) approximation techniques for energy efficiency and error resilience. Compared to conventional approaches, our approach promises to improve the energy efficiency of computing systems by several orders of magnitude, thereby paving a path toward embedded abundant-data computing (e.g., deep learning, both training and inference, on mobile devices and IoT nodes).
Navigating Accuracy-Energy Tradeoffs for Hardware Acceleration
Hardware acceleration has emerged as a method of choice to improve the efficiency of systems that must perform intensive data processing under stringent energy constraints, particularly for IoT and embedded systems. A critical component of maximizing the efficiency of hardware accelerators is ensuring that energy is not wasted producing overly accurate results. For instance, in most sensory or signal processing applications, using double-precision floating point would certainly be overkill. This paper presents QAPPA, a Quality Autotuner for Precision-Programmable Accelerators. QAPPA analyzes applications written in C++ and derives the precision requirements of each compute and arithmetic operation in the program. It utilizes a library of hardware models to predict energy and memory bandwidth savings at different application quality levels. We demonstrate the utility of QAPPA over 14 PERFECT kernels and show that QAPPA can derive minimal quantization settings from user-defined quality constraints to significantly improve the energy efficiency of fixed-function hardware accelerators.
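The kind of analysis QAPPA automates can be sketched in miniature: sweep fixed-point bit widths for a small kernel until a user-defined quality constraint is met. The fixed-point format, the RMSE quality metric, and the linear search below are illustrative assumptions, not QAPPA's actual algorithm.

```python
import numpy as np

def to_fixed(x, bits, frac):
    """Round values in [-1, 1) to signed fixed point with `frac`
    fractional bits, saturating at the representable range."""
    scale = 2 ** frac
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(x * scale), lo, hi) / scale

def min_bits_for_quality(signal, kernel, max_rmse):
    """Smallest bit width whose fixed-point convolution stays within
    max_rmse of the double-precision reference (an autotuning loop
    in miniature)."""
    ref = np.convolve(signal, kernel, mode="same")
    for bits in range(2, 17):
        approx = np.convolve(to_fixed(signal, bits, bits - 1),
                             to_fixed(kernel, bits, bits - 1), mode="same")
        rmse = float(np.sqrt(np.mean((approx - ref) ** 2)))
        if rmse <= max_rmse:
            return bits, rmse
    return 16, rmse

rng = np.random.default_rng(1)
sig = rng.uniform(-1, 1, 64)
bits, rmse = min_bits_for_quality(sig, np.array([0.25, 0.5, 0.25]), 0.01)
```

Relaxing `max_rmse` lets the search settle on narrower datapaths, which is where the energy and bandwidth savings the abstract describes would come from.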
An Efficient Hardware Design for Cerebellar Models using Approximate Circuits
The superior controllability of the primate cerebellum has motivated extensive interest in the development of computational cerebellar models. Many models have been applied to motor control and image stabilization in robots. Cerebellar models are usually computationally complex, so they have rarely been implemented in dedicated hardware. Instead, a cerebellar model is often implemented in a system using a central processing unit (CPU) or a graphics processing unit (GPU), with high energy consumption and long latency. To overcome these drawbacks, we propose a cerebellar model implemented in approximate computing circuits with low hardware overhead and high speed, leveraging the inherent error tolerance of the cerebellum. As the basic arithmetic elements in a cerebellar model, approximate adders and multipliers are carefully evaluated for implementation in an adaptive filter to achieve the best trade-off between accuracy and hardware overhead. A saccade system, whose vestibulo-ocular reflex (VOR) is controlled by the cerebellum, is simulated to show the applicability and effectiveness of the cerebellar model implemented in approximate circuits.
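As an illustration of the kind of approximate adder evaluated in such designs, one well-known published circuit, the lower-part-OR adder (LOA), replaces the carry chain of the k least-significant bits with a bitwise OR. Whether this particular design is the one used in the paper is an assumption; the sketch below is only a behavioral model.

```python
def loa_add(a, b, k, width=16):
    """Behavioral model of a lower-part-OR adder (LOA): the k low bits
    are approximated with a bitwise OR (no carry propagation); the
    remaining high bits use an exact adder with no carry-in."""
    mask = (1 << k) - 1
    low = (a & mask) | (b & mask)        # approximate lower part
    high = ((a >> k) + (b >> k)) << k    # exact upper part
    return (high + low) & ((1 << width) - 1)

# With k = 0 the adder is exact; larger k shortens the carry chain
# (less area and delay) at the cost of accuracy. The error equals the
# dropped lower-bit carries, so it is non-negative and below 2**k.
print(loa_add(3, 5, 0), loa_add(3, 5, 2))  # 8 7
```

The bounded, always-non-negative error of such adders is what makes them a good fit for inherently error-tolerant workloads like the cerebellar adaptive filter described above.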
8B.1  Significance-Driven Adaptive Approximate Computing for Energy-Efficient Image Processing Applications
Speaker:  Dave Burke  Newcastle Univ. 

Authors:  Dave Burke (Newcastle Univ.), Dainius Jenkus (Newcastle Univ.), Issa Qiqieh (Newcastle Univ.), Rishad Shafik (Newcastle Univ.), Shidhartha Das (Arm Ltd.), Alex Yakovlev (Newcastle Univ.)

8B.2  3D Nanosystems Enable Embedded Abundant-Data Computing
Speaker:  Mohamed M. Sabry Aly  Stanford Univ. 

Authors:  William Hwang, Mohamed M. Sabry Aly, Yash H. Malviya, Mingyu Gao, Tony F. Wu, Christos Kozyrakis, H.-S. Philip Wong, Subhasish Mitra (all Stanford Univ.)

8B.3  Exploiting Quality-Energy Tradeoffs with Arbitrary Quantization
Speaker:  Thierry Moreau  Univ. of Washington 

Authors:  Thierry Moreau (Univ. of Washington), Felipe Augusto (Univ. of Campinas), Patrick Howe (Univ. of Washington), Armin Alaghi (Univ. of Washington), Luis Ceze (Univ. of Washington)

8B.4  An Efficient Hardware Design for Cerebellar Models using Approximate Circuits  
Speaker:  Jie Han  Univ. of Alberta 

Author:  Jie Han  Univ. of Alberta 