Matching Items (9)
Filtering by

Clear all filters

151945-Thumbnail Image.png
Description
In recent years we have witnessed a shift towards multi-processor system-on-chips (MPSoCs) to address the demands of embedded devices (such as cell phones, GPS devices, luxury car features, etc.). Highly optimized MPSoCs are well-suited to tackle the complex application demands desired by the end user customer. These MPSoCs incorporate a

In recent years we have witnessed a shift towards multi-processor system-on-chips (MPSoCs) to address the demands of embedded devices (such as cell phones, GPS devices, luxury car features, etc.). Highly optimized MPSoCs are well-suited to tackle the complex application demands desired by the end user customer. These MPSoCs incorporate a constellation of heterogeneous processing elements (PEs) (general purpose PEs and application-specific integrated circuits (ASICS)). A typical MPSoC will be composed of a application processor, such as an ARM Coretex-A9 with cache coherent memory hierarchy, and several application sub-systems. Each of these sub-systems are composed of highly optimized instruction processors, graphics/DSP processors, and custom hardware accelerators. Typically, these sub-systems utilize scratchpad memories (SPM) rather than support cache coherency. The overall architecture is an integration of the various sub-systems through a high bandwidth system-level interconnect (such as a Network-on-Chip (NoC)). The shift to MPSoCs has been fueled by three major factors: demand for high performance, the use of component libraries, and short design turn around time. As customers continue to desire more and more complex applications on their embedded devices the performance demand for these devices continues to increase. Designers have turned to using MPSoCs to address this demand. By using pre-made IP libraries designers can quickly piece together a MPSoC that will meet the application demands of the end user with minimal time spent designing new hardware. Additionally, the use of MPSoCs allows designers to generate new devices very quickly and thus reducing the time to market. In this work, a complete MPSoC synthesis design flow is presented. We first present a technique \cite{leary1_intro} to address the synthesis of the interconnect architecture (particularly Network-on-Chip (NoC)). We then address the synthesis of the memory architecture of a MPSoC sub-system \cite{leary2_intro}. Lastly, we present a co-synthesis technique to generate the functional and memory architectures simultaneously. The validity and quality of each synthesis technique is demonstrated through extensive experimentation.
ContributorsLeary, Glenn (Author) / Chatha, Karamvir S (Thesis advisor) / Vrudhula, Sarma (Committee member) / Shrivastava, Aviral (Committee member) / Beraha, Rudy (Committee member) / Arizona State University (Publisher)
Created2013
152415-Thumbnail Image.png
Description
We are expecting hundreds of cores per chip in the near future. However, scaling the memory architecture in manycore architectures becomes a major challenge. Cache coherence provides a single image of memory at any time in execution to all the cores, yet coherent cache architectures are believed will not scale

We are expecting hundreds of cores per chip in the near future. However, scaling the memory architecture in manycore architectures becomes a major challenge. Cache coherence provides a single image of memory at any time in execution to all the cores, yet coherent cache architectures are believed will not scale to hundreds and thousands of cores. In addition, caches and coherence logic already take 20-50% of the total power consumption of the processor and 30-60% of die area. Therefore, a more scalable architecture is needed for manycore architectures. Software Managed Manycore (SMM) architectures emerge as a solution. They have scalable memory design in which each core has direct access to only its local scratchpad memory, and any data transfers to/from other memories must be done explicitly in the application using Direct Memory Access (DMA) commands. Lack of automatic memory management in the hardware makes such architectures extremely power-efficient, but they also become difficult to program. If the code/data of the task mapped onto a core cannot fit in the local scratchpad memory, then DMA calls must be added to bring in the code/data before it is required, and it may need to be evicted after its use. However, doing this adds a lot of complexity to the programmer's job. Now programmers must worry about data management, on top of worrying about the functional correctness of the program - which is already quite complex. This dissertation presents a comprehensive compiler and runtime integration to automatically manage the code and data of each task in the limited local memory of the core. We firstly developed a Complete Circular Stack Management. It manages stack frames between the local memory and the main memory, and addresses the stack pointer problem as well. Though it works, we found we could further optimize the management for most cases. Thus a Smart Stack Data Management (SSDM) is provided. In this work, we formulate the stack data management problem and propose a greedy algorithm for the same. Later on, we propose a general cost estimation algorithm, based on which CMSM heuristic for code mapping problem is developed. Finally, heap data is dynamic in nature and therefore it is hard to manage it. We provide two schemes to manage unlimited amount of heap data in constant sized region in the local memory. In addition to those separate schemes for different kinds of data, we also provide a memory partition methodology.
ContributorsBai, Ke (Author) / Shrivastava, Aviral (Thesis advisor) / Chatha, Karamvir (Committee member) / Xue, Guoliang (Committee member) / Chakrabarti, Chaitali (Committee member) / Arizona State University (Publisher)
Created2014
150743-Thumbnail Image.png
Description
Thanks to continuous technology scaling, intelligent, fast and smaller digital systems are now available at affordable costs. As a result, digital systems have found use in a wide range of application areas that were not even imagined before, including medical (e.g., MRI, remote or post-operative monitoring devices, etc.), automotive (e.g.,

Thanks to continuous technology scaling, intelligent, fast and smaller digital systems are now available at affordable costs. As a result, digital systems have found use in a wide range of application areas that were not even imagined before, including medical (e.g., MRI, remote or post-operative monitoring devices, etc.), automotive (e.g., adaptive cruise control, anti-lock brakes, etc.), security systems (e.g., residential security gateways, surveillance devices, etc.), and in- and out-of-body sensing (e.g., capsule swallowed by patients measuring digestive system pH, heart monitors, etc.). Such computing systems, which are completely embedded within the application, are called embedded systems, as opposed to general purpose computing systems. In the design of such embedded systems, power consumption and reliability are indispensable system requirements. In battery operated portable devices, the battery is the single largest factor contributing to device cost, weight, recharging time, frequency and ultimately its usability. For example, in the Apple iPhone 4 smart-phone, the battery is $40\%$ of the device weight, occupies $36\%$ of its volume and allows only $7$ hours (over 3G) of talk time. As embedded systems find use in a range of sensitive applications, from bio-medical applications to safety and security systems, the reliability of the computations performed becomes a crucial factor. At our current technology-node, portable embedded systems are prone to expect failures due to soft errors at the rate of once-per-year; but with aggressive technology scaling, the rate is predicted to increase exponentially to once-per-hour. Over the years, researchers have been successful in developing techniques, implemented at different layers of the design-spectrum, to improve system power efficiency and reliability. Among the layers of design abstraction, I observe that the interface between the compiler and processor micro-architecture possesses a unique potential for efficient design optimizations. A compiler designer is able to observe and analyze the application software at a finer granularity; while the processor architect analyzes the system output (power, performance, etc.) for each executed instruction. At the compiler micro-architecture interface, if the system knowledge at the two design layers can be integrated, design optimizations at the two layers can be modified to efficiently utilize available resources and thereby achieve appreciable system-level benefits. To this effect, the thesis statement is that, ``by merging system design information at the compiler and micro-architecture design layers, smart compilers can be developed, that achieve reliable and power-efficient embedded computing through: i) Pure compiler techniques, ii) Hybrid compiler micro-architecture techniques, and iii) Compiler-aware architectures''. In this dissertation demonstrates, through contributions in each of the three compiler-based techniques, the effectiveness of smart compilers in achieving power-efficiency and reliability in embedded systems.
ContributorsJeyapaul, Reiley (Author) / Shrivastava, Aviral (Thesis advisor) / Vrudhula, Sarma (Committee member) / Clark, Lawrence (Committee member) / Colbourn, Charles (Committee member) / Arizona State University (Publisher)
Created2012
156190-Thumbnail Image.png
Description
The quality and quantity of talented members of the US STEM workforce has

been a subject of great interest to policy and decision makers for the past 40 years.

Recent research indicates that while there exist specific shortages in specific disciplines

and areas of expertise in the private sector and the federal government,

The quality and quantity of talented members of the US STEM workforce has

been a subject of great interest to policy and decision makers for the past 40 years.

Recent research indicates that while there exist specific shortages in specific disciplines

and areas of expertise in the private sector and the federal government, there is no

noticeable shortage in any STEM academic discipline, but rather a surplus of PhDs

vying for increasingly scarce tenure track positions. Despite the seeming availability

of industry and private sector jobs, recent PhDs still struggle to find employment in

those areas. I argue that the decades old narrative suggesting a shortage of STEM

PhDs in the US poses a threat to the value of the natural science PhD, and that

this narrative contributes significantly to why so many PhDs struggle to find career

employment in their fields. This study aims to address the following question: what is

the value of a STEM PhD outside academia? I begin with a critical review of existing

literature, and then analyze programmatic documents for STEM PhD programs at

ASU, interviews with industry employers, and an examination the public face of value

for these degrees. I then uncover the nature of the value alignment, value disconnect,

and value erosion in the ecosystem which produces and then employs STEM PhDs,

concluding with specific areas which merit special consideration in an effort to increase

the value of these degrees for all stakeholders involved.
ContributorsGarbee, Elizabeth (Author) / Maynard, Andrew D. (Thesis advisor) / Wetmore, Jameson (Committee member) / Anderson, Derrick (Committee member) / Arizona State University (Publisher)
Created2018
155944-Thumbnail Image.png
Description
Caches have long been used to reduce memory access latency. However, the increased complexity of cache coherence brings significant challenges in processor design as the number of cores increases. While making caches scalable is still an important research problem, some researchers are exploring the possibility of a more power-efficient SRAM

Caches have long been used to reduce memory access latency. However, the increased complexity of cache coherence brings significant challenges in processor design as the number of cores increases. While making caches scalable is still an important research problem, some researchers are exploring the possibility of a more power-efficient SRAM called scratchpad memories or SPMs. SPMs consume significantly less area, and are more energy-efficient per access than caches, and therefore make the design of on-chip memories much simpler. Unlike caches, which fetch data from memories automatically, an SPM requires explicit instructions for data transfers. SPM-only architectures are thus named as software managed manycore (SMM), since the data movements of such architectures rely on software. SMM processors have been widely used in different areas, such as embedded computing, network processing, or even high performance computing. While SMM processors provide a low-power platform, the hardware alone does not guarantee power efficiency, if applications on such processors deliver low performance. Efficient software techniques are therefore required. A big body of management techniques for SMM architectures are compiler-directed, as inserting data movement operations by hand forces programmers to trace flow of data, which can be error-prone and sometimes difficult if not impossible. This thesis develops compiler-directed techniques to manage data transfers for embedded applications on SMMs efficiently. The techniques analyze and find out the proper program points and insert data movement instructions accordingly. The techniques manage code, stack and heap data of applications, and reduce execution time by 14%, 52% and 80% respectively compared to their predecessors on typical embedded applications. On top of managing local data, a technique is also developed for shared data in SMM architectures. Experimental results show it achieves more than 2X speedup than the previous technique on average.
ContributorsCai, Jian (Author) / Shrivastava, Aviral (Thesis advisor) / Wu, Carole (Committee member) / Ren, Fengbo (Committee member) / Dasgupta, Partha (Committee member) / Arizona State University (Publisher)
Created2017
155058-Thumbnail Image.png
Description
Coarse-grained Reconfigurable Arrays (CGRAs) are promising accelerators capable

of accelerating even non-parallel loops and loops with low trip-counts. One challenge

in compiling for CGRAs is to manage both recurring and nonrecurring variables in

the register file (RF) of the CGRA. Although prior works have managed recurring

variables via rotating RF, they access the nonrecurring

Coarse-grained Reconfigurable Arrays (CGRAs) are promising accelerators capable

of accelerating even non-parallel loops and loops with low trip-counts. One challenge

in compiling for CGRAs is to manage both recurring and nonrecurring variables in

the register file (RF) of the CGRA. Although prior works have managed recurring

variables via rotating RF, they access the nonrecurring variables through either a

global RF or from a constant memory. The former does not scale well, and the latter

degrades the mapping quality. This work proposes a hardware-software codesign

approach in order to manage all the variables in a local nonrotating RF. Hardware

provides modulo addition based indexing mechanism to enable correct addressing

of recurring variables in a nonrotating RF. The compiler determines the number of

registers required for each recurring variable and configures the boundary between the

registers used for recurring and nonrecurring variables. The compiler also pre-loads

the read-only variables and constants into the local registers in the prologue of the

schedule. Synthesis and place-and-route results of the previous and the proposed RF

design show that proposed solution achieves 17% better cycle time. Experiments of

mapping several important and performance-critical loops collected from MiBench

show proposed approach improves performance (through better mapping) by 18%,

compared to using constant memory.
ContributorsDave, Shail (Author) / Shrivastava, Aviral (Thesis advisor) / Ren, Fengbo (Committee member) / Ogras, Umit Y. (Committee member) / Arizona State University (Publisher)
Created2016
155240-Thumbnail Image.png
Description
Cyber-physical systems and hard real-time systems have strict timing constraints that specify deadlines until which tasks must finish their execution. Missing a deadline can cause unexpected outcome or endanger human lives in safety-critical applications, such as automotive or aeronautical systems. It is, therefore, of utmost importance to obtain and optimize

Cyber-physical systems and hard real-time systems have strict timing constraints that specify deadlines until which tasks must finish their execution. Missing a deadline can cause unexpected outcome or endanger human lives in safety-critical applications, such as automotive or aeronautical systems. It is, therefore, of utmost importance to obtain and optimize a safe upper bound of each task’s execution time or the worst-case execution time (WCET), to guarantee the absence of any missed deadline. Unfortunately, conventional microarchitectural components, such as caches and branch predictors, are only optimized for average-case performance and often make WCET analysis complicated and pessimistic. Caches especially have a large impact on the worst-case performance due to expensive off- chip memory accesses involved in cache miss handling. In this regard, software-controlled scratchpad memories (SPMs) have become a promising alternative to caches. An SPM is a raw SRAM, controlled only by executing data movement instructions explicitly at runtime, and such explicit control facilitates static analyses to obtain safe and tight upper bounds of WCETs. SPM management techniques, used in compilers targeting an SPM-based processor, determine how to use a given SPM space by deciding where to insert data movement instructions and what operations to perform at those program locations. This dissertation presents several management techniques for program code and stack data, which aim to optimize the WCETs of a given program. The proposed code management techniques include optimal allocation algorithms and a polynomial-time heuristic for allocating functions to the SPM space, with or without the use of abstraction of SPM regions, and a heuristic for splitting functions into smaller partitions. The proposed stack data management technique, on the other hand, finds an optimal set of program locations to evict and restore stack frames to avoid stack overflows, when the call stack resides in a size-limited SPM. In the evaluation, the WCETs of various benchmarks including real-world automotive applications are statically calculated for SPMs and caches in several different memory configurations.
ContributorsKim, Yooseong (Author) / Shrivastava, Aviral (Thesis advisor) / Broman, David (Committee member) / Fainekos, Georgios (Committee member) / Wu, Carole-Jean (Committee member) / Arizona State University (Publisher)
Created2017
155791-Thumbnail Image.png
Description
Caches pose a serious limitation in scaling many-core architectures since the demand of area and power for maintaining cache coherence increases rapidly with the number of cores. Scratch-Pad Memories (SPMs) provide a cheaper and lower power alternative that can be used to build a more scalable many-core architecture. The trade-off

Caches pose a serious limitation in scaling many-core architectures since the demand of area and power for maintaining cache coherence increases rapidly with the number of cores. Scratch-Pad Memories (SPMs) provide a cheaper and lower power alternative that can be used to build a more scalable many-core architecture. The trade-off of substituting SPMs for caches is however that the data must be explicitly managed in software. Heap management on SPM poses a major challenge due to the highly dynamic nature of of heap data access. Most existing heap management techniques implement a software caching scheme on SPM, emulating the behavior of hardware caches. The state-of-the-art heap management scheme implements a 4-way set-associative software cache on SPM for a single program running with one thread on one core. While the technique works correctly, it suffers from signifcant performance overhead. This paper presents a series of compiler-based efficient heap management approaches that reduces heap management overhead through several optimization techniques. Experimental results on benchmarks from MiBenchGuthaus et al. (2001) executed on an SMM processor modeled in gem5Binkert et al. (2011) demonstrate that our approach (implemented in llvm v3.8Lattner and Adve (2004)) can improve execution time by 80% on average compared to the previous state-of-the-art.
ContributorsLin, Jinn-Pean (Author) / Shrivastava, Aviral (Thesis advisor) / Ren, Fengbo (Committee member) / Ogras, Umit Y. (Committee member) / Arizona State University (Publisher)
Created2017
153968-Thumbnail Image.png
Description
The holy grail of computer hardware across all market segments has been to sustain performance improvement at the same pace as silicon technology scales. As the technology scales and the size of transistors shrinks, the power consumption and energy usage per transistor decrease. On the other hand, the transistor density

The holy grail of computer hardware across all market segments has been to sustain performance improvement at the same pace as silicon technology scales. As the technology scales and the size of transistors shrinks, the power consumption and energy usage per transistor decrease. On the other hand, the transistor density increases significantly by technology scaling. Due to technology factors, the reduction in power consumption per transistor is not sufficient to offset the increase in power consumption per unit area. Therefore, to improve performance, increasing energy-efficiency must be addressed at all design levels from circuit level to application and algorithm levels.

At architectural level, one promising approach is to populate the system with hardware accelerators each optimized for a specific task. One drawback of hardware accelerators is that they are not programmable. Therefore, their utilization can be low as they perform one specific function. Using software programmable accelerators is an alternative approach to achieve high energy-efficiency and programmability. Due to intrinsic characteristics of software accelerators, they can exploit both instruction level parallelism and data level parallelism.

Coarse-Grained Reconfigurable Architecture (CGRA) is a software programmable accelerator consists of a number of word-level functional units. Motivated by promising characteristics of software programmable accelerators, the potentials of CGRAs in future computing platforms is studied and an end-to-end CGRA research framework is developed. This framework consists of three different aspects: CGRA architectural design, integration in a computing system, and CGRA compiler. First, the design and implementation of a CGRA and its instruction set is presented. This design is then modeled in a cycle accurate system simulator. The simulation platform enables us to investigate several problems associated with a CGRA when it is deployed as an accelerator in a computing system. Next, the problem of mapping a compute intensive region of a program to CGRAs is formulated. From this formulation, several efficient algorithms are developed which effectively utilize CGRA scarce resources very well to minimize the running time of input applications. Finally, these mapping algorithms are integrated in a compiler framework to construct a compiler for CGRA
ContributorsHamzeh, Mahdi (Author) / Vrudhula, Sarma (Thesis advisor) / Gopalakrishnan, Kailash (Committee member) / Shrivastava, Aviral (Committee member) / Wu, Carole-Jean (Committee member) / Arizona State University (Publisher)
Created2015