Search Content

Stagioni: Temperature management to enable near-sensor processing for performance, fidelity, and energy-efficiency of vision and imaging workloads

Description

Vision processing on traditional architectures is inefficient due to energy-expensive off-chip data movements. Many researchers advocate pushing processing close to the sensor to substantially reduce data movements. However, continuous near-sensor processing raises the sensor temperature, impairing the fidelity of imaging/vision tasks.

The work characterizes the thermal implications of using 3D stacked…

Vision processing on traditional architectures is inefficient due to energy-expensive off-chip data movements. Many researchers advocate pushing processing close to the sensor to substantially reduce data movements. However, continuous near-sensor processing raises the sensor temperature, impairing the fidelity of imaging/vision tasks.

The work characterizes the thermal implications of using 3D stacked image sensors with near-sensor vision processing units. The characterization reveals that near-sensor processing reduces system power but degrades image quality. For reasonable image fidelity, the sensor temperature needs to stay below a threshold, situationally determined by application needs. Fortunately, the characterization also identifies opportunities -- unique to the needs of near-sensor processing -- to regulate temperature based on dynamic visual task requirements and rapidly increase capture quality on demand.

Based on the characterization, the work proposes and investigate two thermal management strategies -- stop-capture-go and seasonal migration -- for imaging-aware thermal management. The work present parameters that govern the policy decisions and explore the trade-offs between system power and policy overhead. The work's evaluation shows that the novel dynamic thermal management strategies can unlock the energy-efficiency potential of near-sensor processing with minimal performance impact, without compromising image fidelity.

ContributorsKodukula, Venkatesh (Author) / LiKamWa, Robert (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Brunhaver, John (Committee member) / Arizona State University (Publisher)

Created2019

Algorithm Architecture Co-design for Dense and Sparse Matrix Computations

Description

With the end of Dennard scaling and Moore's law, architects have moved towards

heterogeneous designs consisting of specialized cores to achieve higher performance

and energy efficiency for a target application domain. Applications of linear algebra

are ubiquitous in the field of scientific computing, machine learning, statistics,

etc. with matrix computations being fundamental to these…

With the end of Dennard scaling and Moore's law, architects have moved towards

heterogeneous designs consisting of specialized cores to achieve higher performance

and energy efficiency for a target application domain. Applications of linear algebra

are ubiquitous in the field of scientific computing, machine learning, statistics,

etc. with matrix computations being fundamental to these linear algebra based solutions.

Design of multiple dense (or sparse) matrix computation routines on the

same platform is quite challenging. Added to the complexity is the fact that dense

and sparse matrix computations have large differences in their storage and access

patterns and are difficult to optimize on the same architecture. This thesis addresses

this challenge and introduces a reconfigurable accelerator that supports both dense

and sparse matrix computations efficiently.

The reconfigurable architecture has been optimized to execute the following linear

algebra routines: GEMV (Dense General Matrix Vector Multiplication), GEMM

(Dense General Matrix Matrix Multiplication), TRSM (Triangular Matrix Solver),

LU Decomposition, Matrix Inverse, SpMV (Sparse Matrix Vector Multiplication),

SpMM (Sparse Matrix Matrix Multiplication). It is a multicore architecture where

each core consists of a 2D array of processing elements (PE).

The 2D array of PEs is of size 4x4 and is scheduled to perform 4x4 sized matrix

updates efficiently. A sequence of such updates is used to solve a larger problem inside

a core. A novel partitioned block compressed sparse data structure (PBCSC/PBCSR)

is used to perform sparse kernel updates. Scalable partitioning and mapping schemes

are presented that map input matrices of any given size to the multicore architecture.

Design trade-offs related to the PE array dimension, size of local memory inside a core

and the bandwidth between on-chip memories and the cores have been presented. An

optimal core configuration is developed from this analysis. Synthesis results using a 7nm PDK show that the proposed accelerator can achieve a performance of upto

32 GOPS using a single core.

ContributorsAnimesh, Saurabh (Author) / Chakrabarti, Chaitali (Thesis advisor) / Brunhaver, John (Committee member) / Ren, Fengbo (Committee member) / Arizona State University (Publisher)

Created2018

Post-silicon validation of radiation hardened microprocessor, embedded flash and test structures

Description

Digital systems are essential to the technological advancements in space exploration. Microprocessor and flash memory are the essential parts of such a digital system. Space exploration requires a special class of radiation hardened microprocessors and flash memories, which are not functionally disrupted in the presence of radiation. The reference design…

Digital systems are essential to the technological advancements in space exploration. Microprocessor and flash memory are the essential parts of such a digital system. Space exploration requires a special class of radiation hardened microprocessors and flash memories, which are not functionally disrupted in the presence of radiation. The reference design ‘HERMES’ is a radiation-hardened microprocessor with performance comparable to commercially available designs. The reference design ‘eFlash’ is a prototype of soft-error hardened flash memory for configuring Xilinx FPGAs. These designs are manufactured using a foundry bulk CMOS 90-nm low standby power (LP) process. This thesis presents the post-silicon validation results of these designs.

ContributorsGogulamudi, Anudeep Reddy (Author) / Clark, Lawrence T (Thesis advisor) / Holbert, Keith E. (Committee member) / Brunhaver, John (Committee member) / Arizona State University (Publisher)

Created2016

The feasibility of domain specific compilation for spatially programmable architectures

Description

Integrated circuits must be energy efficient. This efficiency affects all aspects of chip design, from the battery life of embedded devices to thermal heating on high performance servers. As technology scaling slows, future generations of transistors will lack the energy efficiency gains as it has had in previous generations. Therefore,…

Integrated circuits must be energy efficient. This efficiency affects all aspects of chip design, from the battery life of embedded devices to thermal heating on high performance servers. As technology scaling slows, future generations of transistors will lack the energy efficiency gains as it has had in previous generations. Therefore, other sources of energy efficiency will be much more important. Many computations have the potential to be executed for extreme energy efficiency but are not instigated because the platforms they run on are not optimized for efficient execution. ASICs improve energy efficiency by reducing flexibility and leveraging the properties of a specific computation. However, ASICs are fixed in function and therefore have incredible opportunity cost. FPGAs offer a reconfigurable solution but are 25x less energy efficient than ASIC implementation. Spatially programmable architectures (SPAs) are similar in design and structure to ASICs and FPGAs but are able bridge the ASIC-FPGA energy efficiency gap by trading flexibility for efficiency. However, SPAs are difficult to program because they do not share the same programming model as normal architectures that execute in time. This work addresses compiler challenges for coarse grained, locally interconnected SPA for domain efficiency (SPADE). A novel SPADE topology, called the wave pipeline, is introduced that is designed for the image signal processing domain that is both efficient and simple to compile to. A compiler for the wave pipeline is created that solves for maximum energy and area efficiency using low complexity, greedy methods. The wave pipeline topology and compiler allow for us to investigate and experiment with image signal processing applications to prove the feasibility of SPADE compilers.

ContributorsMackay, Curtis (Author) / Brunhaver, John (Thesis advisor) / Karam, Lina J (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created2016

Electrostatic Analysis of Gate All Around (GAA) Nanowire over FinFET

Description

CMOS Technology has been scaled down to 7 nm with FinFET replacing planar MOSFET devices. Due to short channel effects, the FinFET structure was developed to provide better electrostatic control on subthreshold leakage and saturation current over planar MOSFETs while having the desired current drive. The FinFET structure has an…

CMOS Technology has been scaled down to 7 nm with FinFET replacing planar MOSFET devices. Due to short channel effects, the FinFET structure was developed to provide better electrostatic control on subthreshold leakage and saturation current over planar MOSFETs while having the desired current drive. The FinFET structure has an undoped or fully depleted fin, which supports immunity from random dopant fluctuations (RDF – a phenomenon which causes a reduction in the threshold voltage and is prominent at sub 50 nm tech nodes due to lesser dopant atoms) and thus causes threshold voltage (Vth) roll-off by reducing the Vth. However, as the advanced CMOS technologies are shrinking down to a 5 nm technology node, subthreshold leakage and drain-induced-barrier-lowering (DIBL) are driving the introduction of new metal-oxide-semiconductor field-effect transistor (MOSFET) structures to improve performance. GAA field effect transistors are shown to be the potential candidates for these advanced nodes. In nanowire devices, due to the presence of the gate on all sides of the channel, DIBL should be lower compared to the FinFETs.

A 3-D technology computer aided design (TCAD) device simulation is done to compare the performance of FinFET and GAA nanowire structures with vertically stacked horizontal nanowires. Subthreshold slope, DIBL & saturation current are measured and compared between these devices. The FinFET’s device performance has been matched with the ASAP7 compact model with the impact of tensile and compressive strain on NMOS & PMOS respectively. Metal work function is adjusted for the desired current drive. The nanowires have shown better electrostatic performance over FinFETs with excellent improvement in DIBL and subthreshold slope. This proves that horizontal nanowires can be the potential candidate for 5 nm technology node. A GAA nanowire structure for 5 nm tech node is characterized with a gate length of 15 nm. The structure is scaled down from 7 nm node to 5 nm by using a scaling factor of 0.7.

ContributorsRana, Parshant (Author) / Clark, Lawrence (Thesis advisor) / Ferry, David (Committee member) / Brunhaver, John (Committee member) / Arizona State University (Publisher)

Created2017

A Scalable FPGA-based Multi-channel Data Acquisition System for Parallel Plate Ionization Chamber

Description

Proton beam therapy has been proven to be effective for cancer treatment. Protons allow for complete energy deposition to occur inside patients, rendering this a superior treatment compared to other types of radiotherapy based on photons or electrons. This same characteristic makes quality assurance critical driving the need for detectors…

Proton beam therapy has been proven to be effective for cancer treatment. Protons allow for complete energy deposition to occur inside patients, rendering this a superior treatment compared to other types of radiotherapy based on photons or electrons. This same characteristic makes quality assurance critical driving the need for detectors capable of direct beam positioning and fluence measurement. This work showcases a flexible and scalable data acquisition system for a multi-channel and segmented readout parallel plate ionization chamber instrument for proton beam fluence and positioning detection. Utilizing readily available, modern, off-the-shelf hardware components, including an FPGA with an embedded CPU in the same package, a data acquisition system for the detector was designed. The undemanding detector signal bandwidth allows the absence of ASICs and their associated costs and lead times in the system. The data acquisition system is showcased experimentally for a 96-readout channel detector demonstrating sub millisecond beam characteristics and beam reconstruction. The system demonstrated scalability up to 1064-readout channels, the limiting factor being FPGA I/O availability as well as amplification and sampling power consumption.

ContributorsAcuna Briceno, Rafael Andres (Author) / Barnaby, Hugh (Thesis advisor) / Brunhaver, John (Committee member) / Blyth, David (Committee member) / Arizona State University (Publisher)

Created2021

Pre-Silicon Analysis of a Single Event Transient Pulse Measurement Test Structure in a FinFET Process

Description

A Single Event Transient (SET) is a transient voltage pulse induced by an ionizing radiation particle striking a combinational logic node in a circuit. The probability of a storage element capturing the transient pulse depends on the width of the pulse. Measuring the rate of occurrence and the distribution of…

A Single Event Transient (SET) is a transient voltage pulse induced by an ionizing radiation particle striking a combinational logic node in a circuit. The probability of a storage element capturing the transient pulse depends on the width of the pulse. Measuring the rate of occurrence and the distribution of SET pulse widths is essential to understand the likelihood of soft errors and to develop cost-effective mitigation schemes. Existing research measures the pulse width of SETs in bulk Complementary Metal-Oxide-Semiconductor (CMOS) and Silicon On Insulator (SOI) technologies, but not on Fin Field-Effect Transistors (FinFETs). This thesis focuses on developing a test structure on the FinFET process to generate, propagate, and separate SETs and build a time-to-digital converter to measure the pulse width of SET.

The proposed SET test structure statistically separates SETs generated at NMOS and PMOS based on the difference in restoring current. It consists of N-collection devices to collect events at NMOS and P-collection devices to collect events at PMOS. The events that occur in PMOS of the N-collection device and NMOS of the P-collection device are false events. The logic gates of the collection devices are skewed to perform pulse expansion so that a minimally sustained SET propagates without getting suppressed by the contamination delay. A symmetric tree structure with an S-R latch event detector localizes the location of the SET. The Cartesian coordinates-based pulse injection structure injects external pulses at specific nodes to perform instrumentation and calibrate the measurement. A thermometer-encoded chain (vernier chain) with mismatched delay paths measures the width of the SET.

For low Linear Energy Transfer (LET) tests, the false events are entirely masked and do not propagate since the amount of charge that has to be deposited for successful event propagation is significantly high. In the case of high LET tests, the actual events and false events propagate, but they can be separated based on the SET location and the width of the output event. The vernier chain has a high measurement resolution of ~3.5ps, which aids in separating the events.

ContributorsShreedharan, Sanjay (Author) / Brunhaver, John (Thesis advisor) / Clark, Lawrence (Committee member) / Sanchez Esqueda, Ivan (Committee member) / Arizona State University (Publisher)

Created2020

Filtering by

Stagioni: Temperature management to enable near-sensor processing for performance, fidelity, and energy-efficiency of vision and imaging workloads

Algorithm Architecture Co-design for Dense and Sparse Matrix Computations

Post-silicon validation of radiation hardened microprocessor, embedded flash and test structures

The feasibility of domain specific compilation for spatially programmable architectures

Electrostatic Analysis of Gate All Around (GAA) Nanowire over FinFET

A Scalable FPGA-­based Multi­-channel Data Acquisition System for Parallel Plate Ionization Chamber

Pre-Silicon Analysis of a Single Event Transient Pulse Measurement Test Structure in a FinFET Process

A Scalable FPGA-based Multi-channel Data Acquisition System for Parallel Plate Ionization Chamber