Search Content

Stagioni: Temperature management to enable near-sensor processing for performance, fidelity, and energy-efficiency of vision and imaging workloads

Description

Vision processing on traditional architectures is inefficient due to energy-expensive off-chip data movements. Many researchers advocate pushing processing close to the sensor to substantially reduce data movements. However, continuous near-sensor processing raises the sensor temperature, impairing the fidelity of imaging/vision tasks.

The work characterizes the thermal implications of using 3D stacked…

Vision processing on traditional architectures is inefficient due to energy-expensive off-chip data movements. Many researchers advocate pushing processing close to the sensor to substantially reduce data movements. However, continuous near-sensor processing raises the sensor temperature, impairing the fidelity of imaging/vision tasks.

The work characterizes the thermal implications of using 3D stacked image sensors with near-sensor vision processing units. The characterization reveals that near-sensor processing reduces system power but degrades image quality. For reasonable image fidelity, the sensor temperature needs to stay below a threshold, situationally determined by application needs. Fortunately, the characterization also identifies opportunities -- unique to the needs of near-sensor processing -- to regulate temperature based on dynamic visual task requirements and rapidly increase capture quality on demand.

Based on the characterization, the work proposes and investigate two thermal management strategies -- stop-capture-go and seasonal migration -- for imaging-aware thermal management. The work present parameters that govern the policy decisions and explore the trade-offs between system power and policy overhead. The work's evaluation shows that the novel dynamic thermal management strategies can unlock the energy-efficiency potential of near-sensor processing with minimal performance impact, without compromising image fidelity.

ContributorsKodukula, Venkatesh (Author) / LiKamWa, Robert (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Brunhaver, John (Committee member) / Arizona State University (Publisher)

Created2019

Algorithm Architecture Co-design for Dense and Sparse Matrix Computations

Description

With the end of Dennard scaling and Moore's law, architects have moved towards

heterogeneous designs consisting of specialized cores to achieve higher performance

and energy efficiency for a target application domain. Applications of linear algebra

are ubiquitous in the field of scientific computing, machine learning, statistics,

etc. with matrix computations being fundamental to these…

With the end of Dennard scaling and Moore's law, architects have moved towards

heterogeneous designs consisting of specialized cores to achieve higher performance

and energy efficiency for a target application domain. Applications of linear algebra

are ubiquitous in the field of scientific computing, machine learning, statistics,

etc. with matrix computations being fundamental to these linear algebra based solutions.

Design of multiple dense (or sparse) matrix computation routines on the

same platform is quite challenging. Added to the complexity is the fact that dense

and sparse matrix computations have large differences in their storage and access

patterns and are difficult to optimize on the same architecture. This thesis addresses

this challenge and introduces a reconfigurable accelerator that supports both dense

and sparse matrix computations efficiently.

The reconfigurable architecture has been optimized to execute the following linear

algebra routines: GEMV (Dense General Matrix Vector Multiplication), GEMM

(Dense General Matrix Matrix Multiplication), TRSM (Triangular Matrix Solver),

LU Decomposition, Matrix Inverse, SpMV (Sparse Matrix Vector Multiplication),

SpMM (Sparse Matrix Matrix Multiplication). It is a multicore architecture where

each core consists of a 2D array of processing elements (PE).

The 2D array of PEs is of size 4x4 and is scheduled to perform 4x4 sized matrix

updates efficiently. A sequence of such updates is used to solve a larger problem inside

a core. A novel partitioned block compressed sparse data structure (PBCSC/PBCSR)

is used to perform sparse kernel updates. Scalable partitioning and mapping schemes

are presented that map input matrices of any given size to the multicore architecture.

Design trade-offs related to the PE array dimension, size of local memory inside a core

and the bandwidth between on-chip memories and the cores have been presented. An

optimal core configuration is developed from this analysis. Synthesis results using a 7nm PDK show that the proposed accelerator can achieve a performance of upto

32 GOPS using a single core.

ContributorsAnimesh, Saurabh (Author) / Chakrabarti, Chaitali (Thesis advisor) / Brunhaver, John (Committee member) / Ren, Fengbo (Committee member) / Arizona State University (Publisher)

Created2018

Post-silicon validation of radiation hardened microprocessor, embedded flash and test structures

Description

Digital systems are essential to the technological advancements in space exploration. Microprocessor and flash memory are the essential parts of such a digital system. Space exploration requires a special class of radiation hardened microprocessors and flash memories, which are not functionally disrupted in the presence of radiation. The reference design…

Digital systems are essential to the technological advancements in space exploration. Microprocessor and flash memory are the essential parts of such a digital system. Space exploration requires a special class of radiation hardened microprocessors and flash memories, which are not functionally disrupted in the presence of radiation. The reference design ‘HERMES’ is a radiation-hardened microprocessor with performance comparable to commercially available designs. The reference design ‘eFlash’ is a prototype of soft-error hardened flash memory for configuring Xilinx FPGAs. These designs are manufactured using a foundry bulk CMOS 90-nm low standby power (LP) process. This thesis presents the post-silicon validation results of these designs.

ContributorsGogulamudi, Anudeep Reddy (Author) / Clark, Lawrence T (Thesis advisor) / Holbert, Keith E. (Committee member) / Brunhaver, John (Committee member) / Arizona State University (Publisher)

Created2016

Data path implementation for a spatially programmable architecture customized for image processing applications

Description

The last decade has witnessed a paradigm shift in computing platforms, from laptops and servers to mobile devices like smartphones and tablets. These devices host an immense variety of applications many of which are computationally expensive and thus are power hungry. As most of these mobile platforms are powered by…

The last decade has witnessed a paradigm shift in computing platforms, from laptops and servers to mobile devices like smartphones and tablets. These devices host an immense variety of applications many of which are computationally expensive and thus are power hungry. As most of these mobile platforms are powered by batteries, energy efficiency has become one of the most critical aspects of such devices. Thus, the energy cost of the fundamental arithmetic operations executed in these applications has to be reduced. As voltage scaling has effectively ended, the energy efficiency of integrated circuits has ceased to improve within successive generations of transistors. This resulted in widespread use of Application Specific Integrated Circuits (ASIC), which provide incredible energy efficiency. However, these are not flexible and have high non-recurring engineering (NRE) cost. Alternatively, Field Programmable Gate Arrays (FPGA) offer flexibility to implement any application, but at the cost of higher area and energy compared to ASIC.

In this work, a spatially programmable architecture customized for image processing applications is proposed. The intent is to bridge the efficiency gap between ASICs and FPGAs, by offering FPGA-like flexibility and ASIC-like energy efficiency. This architecture minimizes the energy overheads in FPGAs, which result from the use of fine-grained programming style and global interconnect. It is flexible compared to an ASIC and can accommodate multiple applications.

The main contribution of the thesis is the feasibility analysis of the data path of this architecture, customized for image processing applications. The data path is implemented at the register transfer level (RTL), and the synthesis results are obtained in 45nm technology cell library from a leading foundry. The results of image-processing applications demonstrate that this architecture is within a factor of 10x of the energy and area efficiency of ASIC implementations.

ContributorsSatapathy, Saktiswarup (Author) / Brunhaver, John (Thesis advisor) / Clark, Lawrence T (Committee member) / Ren, Fengbo (Committee member) / Arizona State University (Publisher)

Created2016

The feasibility of domain specific compilation for spatially programmable architectures

Description

Integrated circuits must be energy efficient. This efficiency affects all aspects of chip design, from the battery life of embedded devices to thermal heating on high performance servers. As technology scaling slows, future generations of transistors will lack the energy efficiency gains as it has had in previous generations. Therefore,…

Integrated circuits must be energy efficient. This efficiency affects all aspects of chip design, from the battery life of embedded devices to thermal heating on high performance servers. As technology scaling slows, future generations of transistors will lack the energy efficiency gains as it has had in previous generations. Therefore, other sources of energy efficiency will be much more important. Many computations have the potential to be executed for extreme energy efficiency but are not instigated because the platforms they run on are not optimized for efficient execution. ASICs improve energy efficiency by reducing flexibility and leveraging the properties of a specific computation. However, ASICs are fixed in function and therefore have incredible opportunity cost. FPGAs offer a reconfigurable solution but are 25x less energy efficient than ASIC implementation. Spatially programmable architectures (SPAs) are similar in design and structure to ASICs and FPGAs but are able bridge the ASIC-FPGA energy efficiency gap by trading flexibility for efficiency. However, SPAs are difficult to program because they do not share the same programming model as normal architectures that execute in time. This work addresses compiler challenges for coarse grained, locally interconnected SPA for domain efficiency (SPADE). A novel SPADE topology, called the wave pipeline, is introduced that is designed for the image signal processing domain that is both efficient and simple to compile to. A compiler for the wave pipeline is created that solves for maximum energy and area efficiency using low complexity, greedy methods. The wave pipeline topology and compiler allow for us to investigate and experiment with image signal processing applications to prove the feasibility of SPADE compilers.

ContributorsMackay, Curtis (Author) / Brunhaver, John (Thesis advisor) / Karam, Lina J (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created2016

Electrostatic Analysis of Gate All Around (GAA) Nanowire over FinFET

Description

CMOS Technology has been scaled down to 7 nm with FinFET replacing planar MOSFET devices. Due to short channel effects, the FinFET structure was developed to provide better electrostatic control on subthreshold leakage and saturation current over planar MOSFETs while having the desired current drive. The FinFET structure has an…

CMOS Technology has been scaled down to 7 nm with FinFET replacing planar MOSFET devices. Due to short channel effects, the FinFET structure was developed to provide better electrostatic control on subthreshold leakage and saturation current over planar MOSFETs while having the desired current drive. The FinFET structure has an undoped or fully depleted fin, which supports immunity from random dopant fluctuations (RDF – a phenomenon which causes a reduction in the threshold voltage and is prominent at sub 50 nm tech nodes due to lesser dopant atoms) and thus causes threshold voltage (Vth) roll-off by reducing the Vth. However, as the advanced CMOS technologies are shrinking down to a 5 nm technology node, subthreshold leakage and drain-induced-barrier-lowering (DIBL) are driving the introduction of new metal-oxide-semiconductor field-effect transistor (MOSFET) structures to improve performance. GAA field effect transistors are shown to be the potential candidates for these advanced nodes. In nanowire devices, due to the presence of the gate on all sides of the channel, DIBL should be lower compared to the FinFETs.

A 3-D technology computer aided design (TCAD) device simulation is done to compare the performance of FinFET and GAA nanowire structures with vertically stacked horizontal nanowires. Subthreshold slope, DIBL & saturation current are measured and compared between these devices. The FinFET’s device performance has been matched with the ASAP7 compact model with the impact of tensile and compressive strain on NMOS & PMOS respectively. Metal work function is adjusted for the desired current drive. The nanowires have shown better electrostatic performance over FinFETs with excellent improvement in DIBL and subthreshold slope. This proves that horizontal nanowires can be the potential candidate for 5 nm technology node. A GAA nanowire structure for 5 nm tech node is characterized with a gate length of 15 nm. The structure is scaled down from 7 nm node to 5 nm by using a scaling factor of 0.7.

ContributorsRana, Parshant (Author) / Clark, Lawrence (Thesis advisor) / Ferry, David (Committee member) / Brunhaver, John (Committee member) / Arizona State University (Publisher)

Created2017

6T-SRAM 1Mb design with test structures and post silicon validation

Description

Static random-access memories (SRAM) are integral part of design systems as caches and data memories that and occupy one-third of design space. The work presents an embedded low power SRAM on a triple well process that allows body-biasing control. In addition to the normal mode operation, the design is embedded…

Static random-access memories (SRAM) are integral part of design systems as caches and data memories that and occupy one-third of design space. The work presents an embedded low power SRAM on a triple well process that allows body-biasing control. In addition to the normal mode operation, the design is embedded with Physical Unclonable Function (PUF) [Suh07] and Sense Amplifier Test (SA Test) mode. With PUF mode structures, the fabrication and environmental mismatches in bit cells are used to generate unique identification bits. These bits are fixed and known as preferred state of an SRAM bit cell. The direct access test structure is a measurement unit for offset voltage analysis of sense amplifiers. These designs are manufactured using a foundry bulk CMOS 55 nm low-power (LP) process. The details about SRAM bit-cell and peripheral circuit design is discussed in detail, for certain cases the circuit simulation analysis is performed with random variations embedded in SPICE models. Further, post-silicon testing results are discussed for normal operation of SRAMs and the special test modes. The silicon and circuit simulation results for various tests are presented.

ContributorsDosi, Ankita (Author) / Clark, Lawrence (Thesis advisor) / Seo, Jae-Sun (Committee member) / Brunhaver, John (Committee member) / Arizona State University (Publisher)

Created2017

A Scalable FPGA-based Multi-channel Data Acquisition System for Parallel Plate Ionization Chamber

Description

Proton beam therapy has been proven to be effective for cancer treatment. Protons allow for complete energy deposition to occur inside patients, rendering this a superior treatment compared to other types of radiotherapy based on photons or electrons. This same characteristic makes quality assurance critical driving the need for detectors…

Proton beam therapy has been proven to be effective for cancer treatment. Protons allow for complete energy deposition to occur inside patients, rendering this a superior treatment compared to other types of radiotherapy based on photons or electrons. This same characteristic makes quality assurance critical driving the need for detectors capable of direct beam positioning and fluence measurement. This work showcases a flexible and scalable data acquisition system for a multi-channel and segmented readout parallel plate ionization chamber instrument for proton beam fluence and positioning detection. Utilizing readily available, modern, off-the-shelf hardware components, including an FPGA with an embedded CPU in the same package, a data acquisition system for the detector was designed. The undemanding detector signal bandwidth allows the absence of ASICs and their associated costs and lead times in the system. The data acquisition system is showcased experimentally for a 96-readout channel detector demonstrating sub millisecond beam characteristics and beam reconstruction. The system demonstrated scalability up to 1064-readout channels, the limiting factor being FPGA I/O availability as well as amplification and sampling power consumption.

ContributorsAcuna Briceno, Rafael Andres (Author) / Barnaby, Hugh (Thesis advisor) / Brunhaver, John (Committee member) / Blyth, David (Committee member) / Arizona State University (Publisher)

Created2021

Monitoring for Reliable and Secure Power Management Integrated Circuits via Built-In Self-Test

Description

Power management circuits are employed in most electronic integrated systems, including applications for automotive, IoT, and smart wearables. Oftentimes, these power management circuits become a single point of system failure, and since they are present in most modern electronic devices, they become a target for hardware security attacks. Digital circuits…

Power management circuits are employed in most electronic integrated systems, including applications for automotive, IoT, and smart wearables. Oftentimes, these power management circuits become a single point of system failure, and since they are present in most modern electronic devices, they become a target for hardware security attacks. Digital circuits are typically more prone to security attacks compared to analog circuits, but malfunctions in digital circuitry can affect the analog performance/parameters of power management circuits. This research studies the effect that these hacks will have on the analog performance of power circuits, specifically linear and switching power regulators/converters. Apart from security attacks, these circuits suffer from performance degradations due to temperature, aging, and load stress. Power management circuits usually consist of regulators or converters that regulate the load’s voltage supply by employing a feedback loop, and the stability of the feedback loop is a critical parameter in the system design. Oftentimes, the passive components employed in these circuits shift in value over varying conditions and may cause instability within the power converter. Therefore, variations in the passive components, as well as malicious hardware security attacks, can degrade regulator performance and affect the system’s stability. The traditional ways of detecting phase margin, which indicates system stability, employ techniques that require the converter to be in open loop, and hence can’t be used while the system is deployed in-the-field under normal operation. Aging of components and security attacks may occur after the power management systems have completed post-production test and have been deployed, and they may not cause catastrophic failure of the system, hence making them difficult to detect. These two issues of component variations and security attacks can be detected during normal operation over the product lifetime, if the frequency response of the power converter can be monitored in-situ and in-field. This work presents a method to monitor the phase margin (stability) of a power converter without affecting its normal mode of operation by injecting a white noise/ pseudo random binary sequence (PRBS). Furthermore, this work investigates the analog performance parameters, including phase margin, that are affected by various digital hacks on the control circuitry associated with power converters. A case study of potential hardware attacks is completed for a linear low-dropout regulator (LDO).

ContributorsMalakar, Pragya Priya (Author) / Kitchen, Jennifer (Thesis advisor) / Ozev, Sule (Committee member) / Brunhaver, John (Committee member) / Arizona State University (Publisher)

Created2019

Pre-Silicon Analysis of a Single Event Transient Pulse Measurement Test Structure in a FinFET Process

Description

A Single Event Transient (SET) is a transient voltage pulse induced by an ionizing radiation particle striking a combinational logic node in a circuit. The probability of a storage element capturing the transient pulse depends on the width of the pulse. Measuring the rate of occurrence and the distribution of…

A Single Event Transient (SET) is a transient voltage pulse induced by an ionizing radiation particle striking a combinational logic node in a circuit. The probability of a storage element capturing the transient pulse depends on the width of the pulse. Measuring the rate of occurrence and the distribution of SET pulse widths is essential to understand the likelihood of soft errors and to develop cost-effective mitigation schemes. Existing research measures the pulse width of SETs in bulk Complementary Metal-Oxide-Semiconductor (CMOS) and Silicon On Insulator (SOI) technologies, but not on Fin Field-Effect Transistors (FinFETs). This thesis focuses on developing a test structure on the FinFET process to generate, propagate, and separate SETs and build a time-to-digital converter to measure the pulse width of SET.

The proposed SET test structure statistically separates SETs generated at NMOS and PMOS based on the difference in restoring current. It consists of N-collection devices to collect events at NMOS and P-collection devices to collect events at PMOS. The events that occur in PMOS of the N-collection device and NMOS of the P-collection device are false events. The logic gates of the collection devices are skewed to perform pulse expansion so that a minimally sustained SET propagates without getting suppressed by the contamination delay. A symmetric tree structure with an S-R latch event detector localizes the location of the SET. The Cartesian coordinates-based pulse injection structure injects external pulses at specific nodes to perform instrumentation and calibrate the measurement. A thermometer-encoded chain (vernier chain) with mismatched delay paths measures the width of the SET.

For low Linear Energy Transfer (LET) tests, the false events are entirely masked and do not propagate since the amount of charge that has to be deposited for successful event propagation is significantly high. In the case of high LET tests, the actual events and false events propagate, but they can be separated based on the SET location and the width of the output event. The vernier chain has a high measurement resolution of ~3.5ps, which aids in separating the events.

ContributorsShreedharan, Sanjay (Author) / Brunhaver, John (Thesis advisor) / Clark, Lawrence (Committee member) / Sanchez Esqueda, Ivan (Committee member) / Arizona State University (Publisher)

Created2020

Filtering by