Matching Items (16)

Description
Microprocessors are the processing heart of any digital system and are central to the technological advancements of the age, including space exploration and monitoring. The demands of space exploration require a special class of microprocessors, called radiation-hardened microprocessors, that are less susceptible to the radiation present outside the earth's atmosphere; in other words, their functioning is not disrupted even in the presence of disruptive radiation. The presence of these radiation particles forces designers to develop circuit- and chip-level design techniques to alleviate the errors that can occur in the functioning of microprocessors. Microprocessor evolution has been very rapid in terms of performance, but the same cannot be said of its rad-hard counterpart. With overall data processing capability increasing rapidly, the lack of processor performance manifests as a bottleneck in any processing system. To design high-performance rad-hard microprocessors, designers have to overcome difficult problems at various design stages, i.e., architecture, synthesis, floorplanning, optimization, routing, and analysis, all while maintaining circuit radiation hardness. The reference design ‘HERMES’ targets the IBM 90 nm G process and is expected to reach 500 MHz, which is twice as fast as any processor currently available. Chapter 1 describes the mechanisms of radiation effects that cause upsets and degradation in the functioning of digital circuits. Chapter 2 gives a brief description of the components used in the design, which are part of the consistent efforts at the ASU VLSI lab culminating in this chip-level implementation. Chapter 3 explains the basic digital ASIC design flow and the changes made to it, leading to the rad-hard-specific ASIC flow used in implementing this chip. Chapter 4 covers the triple mode redundant (TMR) specific flow used in the block implementation, delineating the challenges faced and the solutions proposed to make the flow work. Chapter 5 explains the challenges faced and the solutions arrived at while using the top-level flow described in Chapter 3. Chapter 6 puts together the results and analyzes the design in terms of basic integrated circuit design constraints.
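
As an illustration of the triple mode redundant (TMR) approach referenced above, the minimal sketch below shows majority voting over three redundant copies of a value, the core mechanism by which a single radiation-induced upset is out-voted. This is a hypothetical illustration in Go, not the HERMES circuit implementation.

```go
package main

import "fmt"

// majority3 returns the bitwise majority of three redundant copies of a word.
// If one copy is corrupted by a radiation-induced upset, the other two copies
// out-vote the error and the correct value is restored.
func majority3(a, b, c uint32) uint32 {
	return (a & b) | (b & c) | (a & c)
}

func main() {
	golden := uint32(0xDEADBEEF)
	corrupted := golden ^ (1 << 7) // single-bit upset in one copy
	fmt.Printf("voted: %#x\n", majority3(golden, corrupted, golden))
}
```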
Contributors: Ramamurthy, Chandarasekaran (Author) / Clark, Lawrence T. (Thesis advisor) / Holbert, Keith E. (Committee member) / Barnaby, Hugh J. (Committee member) / Mayhew, David (Committee member) / Arizona State University (Publisher)
Created: 2013
Description
Graph theory is a critical component of computer science and software engineering, with algorithms for graph traversal and comprehension powering many of the largest problems in both industry and research. Engineers and researchers often have an accurate view of their target graph; however, they struggle to implement a correct and efficient search over that graph.

To facilitate rapid, correct, efficient, and intuitive development of graph-based solutions, we propose a new programming language construct: the search statement. Given a supra-root node, a procedure that determines the children of a given parent node, and optional definitions of the fail-fast acceptance or rejection of a solution, the search statement can conduct a search over any graph or network. Structurally, this statement is modelled after the common switch statement and is placed in a largely imperative/procedural context to allow for immediate and intuitive development by most programmers. The Go programming language has been used as a foundation and proof of concept for the search statement; a Go compiler is provided which implements this construct.
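
The search statement itself is new syntax, so it cannot be shown without the modified compiler; the plain-Go sketch below only approximates the semantics described above (a supra-root node, a children procedure, and optional fail-fast accept/reject predicates). All names here are hypothetical illustrations, not the thesis's actual syntax.

```go
package main

import "fmt"

// search walks the graph reachable from root via children, pruning any node
// that reject flags and returning the first node that accept flags. This plain
// function only illustrates the semantics the proposed search statement
// expresses as a language construct.
func search[N comparable](root N, children func(N) []N, accept, reject func(N) bool) (N, bool) {
	visited := map[N]bool{root: true}
	frontier := []N{root}
	for len(frontier) > 0 {
		n := frontier[0]
		frontier = frontier[1:]
		if reject != nil && reject(n) {
			continue // fail-fast rejection: prune this branch
		}
		if accept != nil && accept(n) {
			return n, true
		}
		for _, c := range children(n) {
			if !visited[c] {
				visited[c] = true
				frontier = append(frontier, c)
			}
		}
	}
	var zero N
	return zero, false
}

func main() {
	// Search the integers reachable from 0 via +1/+2 edges for the first
	// multiple of 7 greater than 20, rejecting anything above 100.
	children := func(n int) []int { return []int{n + 1, n + 2} }
	got, ok := search(0, children,
		func(n int) bool { return n > 20 && n%7 == 0 },
		func(n int) bool { return n > 100 })
	fmt.Println(got, ok)
}
```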
Contributors: Henderson, Christopher (Author) / Bansal, Ajay (Thesis advisor) / Lindquist, Timothy (Committee member) / Acuna, Ruben (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
Deep neural networks (DNNs) have shown tremendous success in various cognitive tasks, such as image classification and speech recognition. However, their usage on resource-constrained edge devices has been limited due to their high computation and large memory requirements.

To overcome these challenges, recent works have extensively investigated model compression techniques such as element-wise sparsity, structured sparsity, and quantization. While most of these works have applied compression techniques in isolation, very few studies have applied quantization and structured sparsity together to a DNN model.

This thesis co-optimizes structured sparsity and quantization constraints on DNN models during training. Specifically, it obtains an optimal setting of 2-bit weights and 2-bit activations coupled with 4X structured compression by performing a combined exploration of quantization and structured compression settings. The optimal DNN model achieves a 50X weight memory reduction compared to a floating-point, uncompressed DNN. This memory saving is significant, since applying only structured sparsity constraints achieves 2X memory savings and applying only quantization constraints achieves 16X memory savings. The algorithm has been validated on both high- and low-capacity DNNs and on wide-sparse and deep-sparse DNN models. Experiments demonstrated that deep-sparse DNNs outperform shallow-dense DNNs, with varying levels of memory savings depending on DNN precision and sparsity levels. This work further proposes a Pareto-optimal approach to systematically extract optimal DNN models from a huge set of sparse and dense DNN models. The resulting 11 optimal designs were further evaluated by considering overall DNN memory, which includes both activation memory and weight memory. It was found that there is only a small change in the memory footprint of the optimal designs corresponding to the low-sparsity DNNs; however, activation memory cannot be ignored for high-sparsity DNNs.
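
The weight-memory arithmetic above can be sketched directly, assuming a 32-bit floating-point uncompressed baseline: 2-bit weights alone give the 16X figure, and adding 4X structured compression gives a 64X upper bound, of which the reported 50X is the measured result once practical overheads are included. The layer size and the omission of index/metadata storage below are illustrative assumptions.

```go
package main

import "fmt"

// weightMemoryBits estimates DNN weight storage for a layer with `weights`
// parameters, stored at `bits` bits each, after structured compression that
// keeps 1/`compression` of the weights. Index/metadata overhead is ignored,
// which is one reason a measured 50X figure sits below the naive 16 x 4 = 64X.
func weightMemoryBits(weights, bits, compression int) int {
	return (weights / compression) * bits
}

func main() {
	const weights = 1_000_000 // hypothetical layer size

	baseline := weightMemoryBits(weights, 32, 1)  // FP32, uncompressed
	quantOnly := weightMemoryBits(weights, 2, 1)  // 2-bit weights only
	compressed := weightMemoryBits(weights, 2, 4) // 2-bit weights + 4X structured compression

	fmt.Printf("quantization alone: %dX\n", baseline/quantOnly)                  // 16X
	fmt.Printf("quantization + structured sparsity: %dX\n", baseline/compressed) // 64X upper bound
}
```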
Contributors: Srivastava, Gaurav (Author) / Seo, Jae-Sun (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Berisha, Visar (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
Hardware implementation of deep neural networks is gaining significant importance. Deep neural networks are mathematical models that use learning algorithms inspired by the brain. Numerous deep learning algorithms such as multi-layer perceptrons (MLPs) have demonstrated human-level recognition accuracy in image and speech classification tasks. These networks are built from multiple layers of processing elements called neurons, with many connections between them called synapses. Hence, they involve operations that exhibit a high level of parallelism, making them computationally and memory intensive. Constrained by computing resources and memory, most of these applications require a neural network that utilizes less energy. Energy-efficient implementation of these computationally intense algorithms on neuromorphic hardware demands many architectural optimizations. One such optimization is reducing the network size using compression, and several studies have investigated compression by introducing element-wise or row-/column-/block-wise sparsity via pruning and regularization. Additionally, numerous recent works have concentrated on reducing the precision of activations and weights, some reducing them to a single bit. However, combining various sparsity structures with binarized or very-low-precision (2-3 bit) neural networks has not been comprehensively explored. Output activations in these deep neural network algorithms are typically non-binary, making it difficult to exploit sparsity. On the other hand, biologically realistic models like spiking neural networks (SNNs) closely mimic the operations in biological nervous systems and explore new avenues for brain-like cognitive computing. These networks deal with binary spikes and can exploit input-dependent sparsity or redundancy to dynamically scale the amount of computation, in turn leading to energy-efficient hardware implementation. This work discusses a configurable spiking neuromorphic architecture that supports multiple hidden layers by exploiting hardware reuse. It also presents design techniques for minimum-area/-energy DNN hardware with minimal degradation in accuracy. Area, performance, and energy results of the DNN and SNN hardware are reported for the MNIST dataset. The neuromorphic hardware designed for the SNN algorithm in 28 nm CMOS demonstrates high classification accuracy (>98% on MNIST) and low energy (51.4-773 nJ per classification). The optimized DNN hardware designed in 40 nm CMOS, which combines 8X structured compression and 3-bit weight precision, shows 98.4% accuracy at 33 nJ per classification.
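
To make the input-dependent sparsity argument concrete, the sketch below shows a leaky integrate-and-fire neuron update in which only the synapses whose inputs actually spiked do any work. The leak, threshold, and weights are arbitrary illustrative values, not the hardware model described in the thesis.

```go
package main

import "fmt"

// lifStep performs one time step of a leaky integrate-and-fire neuron.
// Only synapses whose input spiked contribute, so sparse (mostly-zero) spike
// vectors translate directly into skipped work in hardware.
func lifStep(membrane float64, spikes []bool, weights []float64, leak, threshold float64) (float64, bool) {
	for i, s := range spikes {
		if s { // input-dependent sparsity: idle inputs cost nothing
			membrane += weights[i]
		}
	}
	membrane -= leak
	if membrane >= threshold {
		return 0, true // fire a binary output spike and reset
	}
	return membrane, false
}

func main() {
	weights := []float64{0.6, 0.2, 0.9, 0.4}
	spikes := []bool{true, false, true, false} // sparse binary input spikes
	v, fired := lifStep(0.0, spikes, weights, 0.1, 1.0)
	fmt.Println(v, fired) // 0 true: 0.6 + 0.9 - 0.1 = 1.4 >= 1.0
}
```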
Contributors: Kolala Venkataramanaiah, Shreyas (Author) / Seo, Jae-Sun (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
Digital systems are essential to the technological advancements in space exploration. Microprocessors and flash memory are essential parts of such digital systems. Space exploration requires a special class of radiation-hardened microprocessors and flash memories that are not functionally disrupted in the presence of radiation. The reference design ‘HERMES’ is a radiation-hardened microprocessor with performance comparable to commercially available designs. The reference design ‘eFlash’ is a prototype of soft-error-hardened flash memory for configuring Xilinx FPGAs. These designs are manufactured in a foundry bulk CMOS 90 nm low-standby-power (LP) process. This thesis presents the post-silicon validation results of these designs.
Contributors: Gogulamudi, Anudeep Reddy (Author) / Clark, Lawrence T. (Thesis advisor) / Holbert, Keith E. (Committee member) / Brunhaver, John (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
The last decade has witnessed a paradigm shift in computing platforms, from laptops and servers to mobile devices like smartphones and tablets. These devices host an immense variety of applications, many of which are computationally expensive and thus power hungry. As most of these mobile platforms are powered by batteries, energy efficiency has become one of the most critical aspects of such devices. Thus, the energy cost of the fundamental arithmetic operations executed in these applications has to be reduced. As voltage scaling has effectively ended, the energy efficiency of integrated circuits has ceased to improve across successive generations of transistors. This has resulted in the widespread use of application-specific integrated circuits (ASICs), which provide incredible energy efficiency. However, these are not flexible and have high non-recurring engineering (NRE) costs. Alternatively, field-programmable gate arrays (FPGAs) offer the flexibility to implement any application, but at the cost of higher area and energy compared to ASICs.

In this work, a spatially programmable architecture customized for image processing applications is proposed. The intent is to bridge the efficiency gap between ASICs and FPGAs by offering FPGA-like flexibility and ASIC-like energy efficiency. This architecture minimizes the energy overheads of FPGAs, which result from their fine-grained programming style and global interconnect. It is more flexible than an ASIC and can accommodate multiple applications.

The main contribution of the thesis is a feasibility analysis of the data path of this architecture, customized for image processing applications. The data path is implemented at the register-transfer level (RTL), and synthesis results are obtained with a 45 nm technology cell library from a leading foundry. The results for image-processing applications demonstrate that this architecture is within a factor of 10x of the energy and area efficiency of ASIC implementations.
Contributors: Satapathy, Saktiswarup (Author) / Brunhaver, John (Thesis advisor) / Clark, Lawrence T. (Committee member) / Ren, Fengbo (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
Soft errors are considered a key reliability challenge for sub-nanoscale transistors. An ideal solution for such a challenge should ultimately eliminate the effect of soft errors from the microprocessor. While forward recovery techniques achieve fast recovery from errors by simply voting out the wrong values, they incur the overhead of executing three copies. Backward recovery techniques only need two copies of execution, but suffer from checkpointing overhead.

In this work, I explored the efficiency of integrating checkpointing into the application and the effectiveness of the recovery that can be performed upon it. After evaluating the available fine-grained approaches to performing recovery, I introduce InCheck, an in-application recovery scheme that can be integrated into instruction-duplication-based techniques, thus providing fast error recovery. The proposed technique makes lightweight checkpoints at basic-block granularity and uses them for recovery purposes, as sketched below.

To evaluate the effectiveness of the proposed technique, 10,000 fault injection experiments were performed on different hardware components of a simulated modern in-order ARM processor. InCheck was able to recover from all detected errors by replaying about 20 instructions, whereas the state-of-the-art recovery scheme failed more than 200 times.
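
As a rough, hypothetical illustration of the idea, the sketch below checkpoints live state at a basic-block boundary, compares the results of duplicated execution, and replays the block from the checkpoint on a mismatch. InCheck itself operates at the compiler/instruction level; none of the names or mechanics here are taken from the thesis.

```go
package main

import "fmt"

// checkpoint captures the live state at a basic-block boundary so the block
// can be re-executed from a known-good point if an error is detected.
type checkpoint struct{ acc, i int }

// runBlock executes one "basic block" twice (instruction duplication) and
// compares the results; on a mismatch it discards the results and replays the
// block from the checkpoint, mimicking in-application recovery at basic-block
// granularity.
func runBlock(cp checkpoint, body func(checkpoint) int, faulty *bool) int {
	for {
		primary := body(cp)
		shadow := body(cp)
		if *faulty { // inject a one-off transient fault into the shadow result
			shadow ^= 1 << 4
			*faulty = false
		}
		if primary == shadow {
			return primary // duplicated results agree: commit and continue
		}
		// results disagree: restore from the checkpoint and replay the block
	}
}

func main() {
	fault := true
	cp := checkpoint{acc: 10, i: 3}
	result := runBlock(cp, func(c checkpoint) int { return c.acc + c.i*c.i }, &fault)
	fmt.Println(result) // 19, recovered by replaying the block once
}
```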
Contributors: Lokam, Sai Ram Dheeraj (Author) / Shrivastava, Aviral (Thesis advisor) / Clark, Lawrence T. (Committee member) / Mubayi, Anuj (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
Integrated circuits must be energy efficient. This efficiency affects all aspects of chip design, from the battery life of embedded devices to thermal heating on high-performance servers. As technology scaling slows, future generations of transistors will lack the energy-efficiency gains of previous generations. Therefore, other sources of energy efficiency will be much more important. Many computations have the potential to be executed with extreme energy efficiency but are not, because the platforms they run on are not optimized for efficient execution. ASICs improve energy efficiency by reducing flexibility and leveraging the properties of a specific computation. However, ASICs are fixed in function and therefore have an incredible opportunity cost. FPGAs offer a reconfigurable solution but are 25x less energy efficient than ASIC implementations. Spatially programmable architectures (SPAs) are similar in design and structure to ASICs and FPGAs but are able to bridge the ASIC-FPGA energy-efficiency gap by trading flexibility for efficiency. However, SPAs are difficult to program because they do not share the same programming model as normal architectures that execute in time. This work addresses compiler challenges for a coarse-grained, locally interconnected SPA for domain efficiency (SPADE). A novel SPADE topology, called the wave pipeline, is introduced; it is designed for the image signal processing domain and is both efficient and simple to compile to. A compiler for the wave pipeline is created that solves for maximum energy and area efficiency using low-complexity, greedy methods. The wave pipeline topology and compiler allow us to investigate and experiment with image signal processing applications to prove the feasibility of SPADE compilers.
Contributors: Mackay, Curtis (Author) / Brunhaver, John (Thesis advisor) / Karam, Lina J. (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
Modern-day integrated circuits are very capable, often containing more than a billion transistors. For example, the Intel Ivy Bridge 4C chip has about 1.2 billion transistors on a 160 mm² die. Designing such complex circuits requires automation; therefore, these designs are made with the help of computer-aided design (CAD) tools. A major part of this custom design flow for application-specific integrated circuits (ASICs) is the design of standard cell libraries. Standard cell libraries are collections of primitives from which automatic place and route (APR) tools can choose cells to implement the design being put together. To operate efficiently, the CAD tools require multiple views of each cell in the standard cell library. This data is obtained by characterizing the standard cell libraries and compiling the results in formats that the tools can easily understand and utilize.

My thesis focuses on the design and characterization of one such standard cell library in the ASAP7 7 nm predictive design kit (PDK). This thesis discusses the complete design flow: the choice of the cell architecture, the design of the cell layouts and the various decisions made in that process to obtain optimum results, and the characterization of those cells using the Liberate tool provided by Cadence Design Systems, Inc. The characterized library is then used in the APR of a few open-source register-transfer level (RTL) projects, and the efficiency of the library is demonstrated.
Contributors: Vangala, Manoj (Author) / Clark, Lawrence T. (Thesis advisor) / Brunhaver, John S. (Committee member) / Allee, David R. (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
The information era has brought about many technological advancements in the past few decades, and that has led to an exponential increase in the creation of digital images and videos. Digital images constantly go through image processing algorithms for various reasons such as compression, transmission, and storage. There is data loss during this process, which leaves us with a degraded image. Hence, to ensure minimal degradation of images, quality assessment has become mandatory. Image quality assessment (IQA) has been researched and developed over the last several decades to predict a quality score in a manner that agrees with human judgments of quality. Modern IQA algorithms are quite effective in prediction accuracy, but their development has not focused on improving computational performance. The existing serial implementation requires a relatively large run time, on the order of seconds for a single frame. Hardware acceleration using field-programmable gate arrays (FPGAs) provides a reconfigurable computing fabric that can be tailored for a broad range of applications. Usually, programming FPGAs has required expertise in hardware description languages (HDLs) or high-level synthesis (HLS) tools. OpenCL is an open standard for cross-platform, parallel programming of heterogeneous systems; together with the Altera OpenCL SDK, it enables developers to use an FPGA's potential without extensive hardware knowledge. Hence, this thesis focuses on accelerating the computationally intensive part of the most apparent distortion (MAD) algorithm on an FPGA using OpenCL. The results are compared with a CPU implementation to evaluate performance and efficiency gains.
Contributors: Gunavelu Mohan, Aswin (Author) / Sohoni, Sohum (Thesis advisor) / Ren, Fengbo (Thesis advisor) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created: 2017