While the mobile platform capabilities range widely, long battery life and reliability are common design concerns that are crucial to remain competitive.
Consequently, state-of-the-art mobile platforms have become highly heterogeneous by combining a powerful CPUs with GPUs to accelerate the computation of deep neural networks (DNNs), which are the most common structures to perform ML operations.
But traditional von Neumann architectures are not optimized for the high memory bandwidth and massively parallel computation demands required by DNNs.
Hence, propelling research into non-von Neumann architectures to support the demands of DNNs.
The re-imagining of computer architectures to perform efficient DNN computations requires focusing on the prohibitive demands presented by DNNs and alleviating them. The two central challenges for efficient computation are (1) large memory storage and movement due to weights of the DNN and (2) massively parallel multiplications to compute the DNN output.
Introducing sparsity into the DNNs, where certain percentage of either the weights or the outputs of the DNN are zero, greatly helps with both challenges. This along with algorithm-hardware co-design to compress the DNNs is demonstrated to provide efficient solutions to greatly reduce the power consumption of hardware that compute DNNs. Additionally, exploring emerging technologies such as non-volatile memories and 3-D stacking of silicon in conjunction with algorithm-hardware co-design architectures will pave the way for the next generation of mobile devices.
Towards the objectives stated above, our specific contributions include (a) an architecture based on resistive crosspoint array that can update all values stored and compute matrix vector multiplication in parallel within a single cycle, (b) a framework of training DNNs with a block-wise sparsity to drastically reduce memory storage and total number of computations required to compute the output of DNNs, (c) the exploration of hardware implementations of sparse DNNs and architectural guidelines to reduce power consumption for the implementations in monolithic 3D integrated circuits, and (d) a prototype chip in 65nm CMOS accelerator for long-short term memory networks trained with the proposed block-wise sparsity scheme.
At architectural level, one promising approach is to populate the system with hardware accelerators each optimized for a specific task. One drawback of hardware accelerators is that they are not programmable. Therefore, their utilization can be low as they perform one specific function. Using software programmable accelerators is an alternative approach to achieve high energy-efficiency and programmability. Due to intrinsic characteristics of software accelerators, they can exploit both instruction level parallelism and data level parallelism.
Coarse-Grained Reconfigurable Architecture (CGRA) is a software programmable accelerator consists of a number of word-level functional units. Motivated by promising characteristics of software programmable accelerators, the potentials of CGRAs in future computing platforms is studied and an end-to-end CGRA research framework is developed. This framework consists of three different aspects: CGRA architectural design, integration in a computing system, and CGRA compiler. First, the design and implementation of a CGRA and its instruction set is presented. This design is then modeled in a cycle accurate system simulator. The simulation platform enables us to investigate several problems associated with a CGRA when it is deployed as an accelerator in a computing system. Next, the problem of mapping a compute intensive region of a program to CGRAs is formulated. From this formulation, several efficient algorithms are developed which effectively utilize CGRA scarce resources very well to minimize the running time of input applications. Finally, these mapping algorithms are integrated in a compiler framework to construct a compiler for CGRA
alongside it, our very approach must evolve. While the recent trend has been to centralize our
computing resources in the cloud, it now looks beneficial to push more computing power
towards the “edge” with so called edge computing, reducing the immense strain on cloud
servers and the latency experienced by IoT devices. A new computing paradigm also brings
new opportunities for innovation, and one such innovation could be the use of FPGAs as edge
servers. In this research project, I learn the design flow for developing OpenCL kernels and
custom FPGA BSPs. Using these tools, I investigate the viability of using FPGAs as standalone
edge computing devices. Concluding that—although the technology is a great fit—the current
necessity of dynamically reprogrammable FPGAs to be closely coupled with a host CPU is
holding them back from this purpose. I propose a modification to the architecture of the Intel
Arria 10 GX that would allow it to be decoupled from its host CPU, allowing it to truly serve as a
viable edge computing solution.
In this thesis, I discuss the development of a novel physical design flow introducing standard-cell neurons for ASIC design. Standard-cell neurons are implemented on silicon as a circuit that realizes a threshold function. Each cell contains flash transistors, the threshold voltages of which correspond to the weights of the threshold function. Since the threshold voltages are programmed after fabrication, any sequential logic containing a standard-cell neuron is a logical black box upon delivery to the foundry. Additionally, previous research has shown significant reductions in delay, power, and area with the utilization of these flash transistor (FTL) cells. This paper aims to reinforce this prior research by demonstrating the first automatically synthesized, placed, and routed secure RISC-V core.