Matching Items (119)
Description
With increasing transistor counts and shrinking feature sizes, reducing power consumption has become a major design constraint. This has given rise to aggressive architectural changes for on-chip power management and rapid development of energy-efficient hardware accelerators. Accordingly, the objective of this research is to enable software developers to leverage these hardware techniques and improve the energy efficiency of the system. To achieve this, I propose two solutions for the Linux kernel. First, optimal use of these architectural enhancements requires accurate modeling of processor power consumption. Although many models in the literature capture processor power consumption, few capture it at the task level. Task-level energy models are a requirement for an operating system (OS) to perform real-time power management, since the OS time-multiplexes tasks to enable sharing of hardware resources. I propose a detailed design methodology for constructing an architecture-agnostic task-level power model and incorporating it into a modern operating system to build an online task-level power profiler. The profiler is implemented inside the latest Linux kernel and validated on an Intel Sandy Bridge processor. It has a negligible overhead, consuming less than 1% of hardware resources, and its power predictions were demonstrated on application benchmarks from SPEC and PARSEC with less than 4% error. I also demonstrate the importance of the proposed profiler for emerging architectural techniques through use-case scenarios, including heterogeneous computing and fine-grained per-core DVFS.

Second, alongside architectural enhancements in general-purpose processors, hardware accelerators such as coarse-grained reconfigurable architectures (CGRAs) are gaining popularity. Unlike vector processors, which rely on data parallelism, a CGRA provides greater flexibility and compiler-level control, making it better suited to the present SoC environment. To provide a streamlined development environment for CGRAs, I propose a flexible Linux framework for CGRA design-space exploration. With accurate and flexible hardware models, fine-grained integration with an accurate architectural simulator, and Linux memory-management and DMA support, a user can carry out a wide range of experiments on a CGRA in a full-system environment.
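
The abstract does not give the form of the task-level model, but a common construction for counter-based power models is a linear fit over per-task performance-event rates. The sketch below illustrates that idea; the TaskPowerModel class, the event names, and the weights are all hypothetical, not the dissertation's model.

```python
# Sketch of a counter-based task-level power model (illustrative only;
# the actual model form and counter set are assumptions). Power is
# estimated as idle power plus a weighted sum of per-task event rates.

class TaskPowerModel:
    def __init__(self, weights, idle_power):
        self.weights = weights          # {event_name: watts per event/sec}
        self.idle_power = idle_power    # static/idle power in watts

    def estimate(self, counters, interval_s):
        """counters: {event_name: count accumulated by this task}."""
        dynamic = sum(self.weights[e] * (c / interval_s)
                      for e, c in counters.items())
        return self.idle_power + dynamic

# Hypothetical calibration for one core of a Sandy Bridge-class CPU.
model = TaskPowerModel(
    weights={"instructions": 2.0e-9, "llc_misses": 1.5e-7},
    idle_power=3.0)
print(model.estimate({"instructions": 4.0e9, "llc_misses": 2.0e6}, 1.0))
```

Because the counters are accumulated per task at context-switch boundaries, an estimate of this form can be attributed to individual tasks rather than to the whole core.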
Contributors: Desai, Digant Pareshkumar (Author) / Vrudhula, Sarma (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Wu, Carole-Jean (Committee member) / Arizona State University (Publisher)
Created: 2013
Description
In recent years we have witnessed a shift towards multi-processor system-on-chips (MPSoCs) to address the demands of embedded devices such as cell phones, GPS devices, and luxury car features. Highly optimized MPSoCs are well suited to tackle the complex application demands desired by the end user. These MPSoCs incorporate a constellation of heterogeneous processing elements (PEs): general-purpose PEs and application-specific integrated circuits (ASICs). A typical MPSoC is composed of an application processor, such as an ARM Cortex-A9 with a cache-coherent memory hierarchy, and several application sub-systems. Each of these sub-systems is composed of highly optimized instruction processors, graphics/DSP processors, and custom hardware accelerators. Typically, these sub-systems utilize scratchpad memories (SPMs) rather than support cache coherency. The overall architecture is an integration of the various sub-systems through a high-bandwidth system-level interconnect, such as a Network-on-Chip (NoC). The shift to MPSoCs has been fueled by three major factors: demand for high performance, the use of component libraries, and short design turnaround time. As customers continue to desire more and more complex applications on their embedded devices, the performance demand for these devices continues to increase, and designers have turned to MPSoCs to address it. By using pre-made IP libraries, designers can quickly piece together an MPSoC that meets the application demands of the end user with minimal time spent designing new hardware. Additionally, the use of MPSoCs allows designers to generate new devices very quickly, reducing the time to market.

In this work, a complete MPSoC synthesis design flow is presented. We first present a technique to address the synthesis of the interconnect architecture, particularly the NoC. We then address the synthesis of the memory architecture of an MPSoC sub-system. Lastly, we present a co-synthesis technique to generate the functional and memory architectures simultaneously. The validity and quality of each synthesis technique is demonstrated through extensive experimentation.
Contributors: Leary, Glenn (Author) / Chatha, Karamvir S (Thesis advisor) / Vrudhula, Sarma (Committee member) / Shrivastava, Aviral (Committee member) / Beraha, Rudy (Committee member) / Arizona State University (Publisher)
Created: 2013
Description
Multicore processors have proliferated in nearly all forms of computing, from servers and desktops to smartphones. The primary reason for this widespread adoption is their ability to overcome the power wall by providing higher performance at lower power consumption. With multicores there is an increased need for dynamic energy management (DEM), much more than for single-core processors: DEM for multicores is no longer merely a mechanism to ensure that a processor stays within specified temperature limits, but a set of techniques that manage various processor controls, such as dynamic voltage and frequency scaling (DVFS), task migration, and fan speed, to achieve a stated objective. The objectives span a wide range, from maximizing throughput, minimizing power consumption, reducing peak temperature, and maximizing energy efficiency to maximizing processor reliability, all under constraints on temperature, power, timing, and reliability. DEM can thus be very complex and challenging to achieve, and since many DEM techniques often operate together on a single processor, there is a need to unify them. This dissertation addresses that need.

In this work, a framework for DEM is proposed that provides a unifying processor model, including power, thermal, timing, and reliability models, and supports various DEM control mechanisms and many different objective functions along with equally diverse constraint specifications. Using the framework, a range of novel solutions is derived for instances of DEM problems, including maximizing processor performance or energy efficiency and minimizing power consumption or peak temperature, under constraints on maximum temperature, memory reliability, and task deadlines. Finally, a robust closed-loop controller that implements these solutions on a real processor platform with very low operational overhead is proposed. Along with the controller design, a model-identification methodology for obtaining the power and thermal models required by the controller is also discussed. The controller is architecture-independent and hence easily portable across many platforms. It has been successfully deployed on an Intel Sandy Bridge processor, where its use increased the energy efficiency of the processor by over 30%.
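
The abstract does not detail the controller itself, but the general shape of a temperature-capped DVFS feedback loop can be sketched as follows; the control_step function, the P-state table, and the thresholds are hypothetical, not the dissertation's identified models.

```python
# Minimal sketch of a closed-loop DVFS controller that throttles frequency
# to respect a temperature cap (illustrative assumptions throughout).

F_LEVELS = [1.2e9, 1.6e9, 2.0e9, 2.6e9, 3.4e9]  # hypothetical P-states (Hz)
T_MAX = 80.0                                     # temperature cap (deg C)

def control_step(current_temp, level):
    """Return the next frequency level given the latest temperature sample."""
    if current_temp > T_MAX and level > 0:
        return level - 1          # too hot: back off one P-state
    if current_temp < T_MAX - 5.0 and level < len(F_LEVELS) - 1:
        return level + 1          # comfortable margin: raise performance
    return level                  # hold

level = len(F_LEVELS) - 1
for temp in [70.0, 78.0, 83.0, 81.0, 74.0]:      # simulated sensor readings
    level = control_step(temp, level)
    print(f"temp={temp:.1f}C -> f={F_LEVELS[level]/1e9:.1f}GHz")
```

A production controller would replace the fixed step rule with decisions driven by the identified power and thermal models, which is where the model-identification methodology mentioned above comes in.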
Contributors: Hanumaiah, Vinay (Author) / Vrudhula, Sarma (Thesis advisor) / Chatha, Karamvir (Committee member) / Chakrabarti, Chaitali (Committee member) / Rodriguez, Armando (Committee member) / Askin, Ronald (Committee member) / Arizona State University (Publisher)
Created: 2013
Description
Rapid technology scaling, the main driver of the power and performance improvements of computing solutions, has also rendered our computing systems extremely susceptible to transient errors called soft errors. Among the arsenal of techniques to protect computation from soft errors, Control Flow Checking (CFC) based techniques have gained a reputation as an effective yet low-cost protection mechanism. The basic idea is that there is a high probability that a soft fault in program execution will eventually alter the control flow of the program; therefore, just by ensuring that the control flow of the program is correct, significant protection can be achieved. More than a dozen CFC techniques have been developed over the last several decades, ranging from hardware techniques to software techniques to hardware-software hybrids. Our analysis shows that existing CFC techniques are not only ineffective at protecting against soft errors but also incur additional power and performance overheads. For this analysis, we develop and validate a simulation-based experimental setup to accurately and quantitatively estimate the architectural vulnerability of a program execution on a processor micro-architecture. We model the protection achieved by various state-of-the-art CFC techniques in this quantitative vulnerability-estimation setup and find that software-only CFC protection schemes (CFCSS, CFCSS+NA, CEDA) increase system vulnerability by 18% to 21% with 17% to 38% performance overhead. Hybrid CFC protection (CFEDC) increases vulnerability by 5%, while vulnerability remains almost the same for hardware-only CFC protection (CFCET), notwithstanding the design cost, area, and power overheads incurred by the hardware modifications required for their implementations.
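
To make the basic idea of CFC concrete, here is a simplified sketch of signature-based checking in the spirit of CFCSS; it is not the published algorithm, and the block signatures and control-flow graph are invented for illustration.

```python
# Simplified illustration of signature-based control-flow checking.
# Each basic block gets a compile-time signature; a runtime register G is
# XOR-updated with a precomputed difference and compared on block entry.

sig = {"A": 0b001, "B": 0b010, "C": 0b100}    # static block signatures
pred = {"B": "A", "C": "B"}                   # designated legal predecessor
d = {b: sig[pred[b]] ^ sig[b] for b in pred}  # compile-time XOR differences

def enter(G, block):
    """Runtime check executed at the top of each basic block."""
    G ^= d[block]
    if G != sig[block]:
        raise RuntimeError(f"control-flow error entering {block}")
    return G

G = sig["A"]
for blk in ["B", "C"]:   # legal path A -> B -> C passes all checks
    G = enter(G, blk)
print("control flow OK")

# A fault diverting A -> C gives G = sig["A"] ^ d["C"] != sig["C"], which is
# caught; faults that corrupt data without altering control flow are not --
# one reason CFC alone gives limited soft-error coverage.
```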
Contributors: Rhisheekesan, Abhishek (Author) / Shrivastava, Aviral (Thesis advisor) / Colbourn, Charles Joseph (Committee member) / Wu, Carole-Jean (Committee member) / Arizona State University (Publisher)
Created: 2013
Description
We are expecting hundreds of cores per chip in the near future; however, scaling the memory architecture of manycore architectures becomes a major challenge. Cache coherence provides a single image of memory at any time in execution to all the cores, yet coherent cache architectures are believed not to scale to hundreds or thousands of cores. In addition, caches and coherence logic already account for 20-50% of a processor's total power consumption and 30-60% of its die area. A more scalable architecture is therefore needed for manycores. Software Managed Manycore (SMM) architectures emerge as a solution: they have a scalable memory design in which each core has direct access only to its local scratchpad memory, and any data transfers to or from other memories must be done explicitly in the application using Direct Memory Access (DMA) commands. The lack of automatic memory management in the hardware makes such architectures extremely power-efficient, but it also makes them difficult to program. If the code or data of the task mapped onto a core cannot fit in the local scratchpad memory, then DMA calls must be added to bring in the code or data before it is required, and to evict it after its use. Doing this adds considerable complexity to the programmer's job: programmers must now worry about data management on top of the already complex task of ensuring the functional correctness of the program.

This dissertation presents a comprehensive compiler and runtime integration to automatically manage the code and data of each task in the limited local memory of the core. We first developed Complete Circular Stack Management, which manages stack frames between the local memory and the main memory and also addresses the stack-pointer problem. Though it works, we found the management could be further optimized in most cases, so we propose Smart Stack Data Management (SSDM), formulating the stack data management problem and proposing a greedy algorithm for it. We then propose a general cost-estimation algorithm, on which the CMSM heuristic for the code-mapping problem is built. Finally, heap data is dynamic in nature and therefore hard to manage; we provide two schemes to manage an unlimited amount of heap data in a constant-sized region of the local memory. In addition to these separate schemes for different kinds of data, we also provide a memory-partition methodology.
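
A toy sketch of the circular stack idea follows, assuming frames are spilled oldest-first to main memory when the scratchpad region fills; the Python lists stand in for actual DMA transfers, and real schemes must also fix up stack pointers, which this omits.

```python
# Toy model of circular stack management on a software-managed scratchpad:
# when a new frame does not fit locally, the oldest frames are evicted to
# main memory ("DMA out"), and pulled back on demand ("DMA in").

LOCAL_SIZE = 1024            # bytes of scratchpad reserved for the stack

local, main_mem = [], []     # frames resident locally / evicted to DRAM

def push_frame(name, size):
    while sum(s for _, s in local) + size > LOCAL_SIZE:
        main_mem.append(local.pop(0))   # "DMA out" the oldest frame
    local.append((name, size))

def pop_frame():
    name, size = local.pop()
    if not local and main_mem:
        local.append(main_mem.pop())    # "DMA in" the caller's frame
    return name

for f, sz in [("main", 256), ("a", 512), ("b", 384), ("c", 256)]:
    push_frame(f, sz)
print([f for f, _ in local], [f for f, _ in main_mem])
# -> ['b', 'c'] ['main', 'a']: the two oldest frames were spilled to DRAM
```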
Contributors: Bai, Ke (Author) / Shrivastava, Aviral (Thesis advisor) / Chatha, Karamvir (Committee member) / Xue, Guoliang (Committee member) / Chakrabarti, Chaitali (Committee member) / Arizona State University (Publisher)
Created: 2014
Description
Threshold logic simplifies traditional Boolean logic to a single-level, multi-input function. The threshold logic latch (TLL), one implementation of threshold logic, is functionally equivalent to a multi-input function with an edge-triggered flip-flop; it improves area as well as dynamic and leakage power consumption, making it an attractive design alternative. Accordingly, a TLL standard-cell library is designed. Through technology mapping, a hybrid circuit is generated by absorbing the logic cone backward from each flip-flop to obtain the smallest remaining feeder. With the scan-test methodology adopted, design for testability (DFT) is proposed, including scan-element design and scan-chain insertion. A test synthesis flow is then introduced, based on the Cadence RTL Compiler tool. Test application is the process of applying vectors and analyzing responses, and centers on testbench design; a parameterized, generic, self-checking Verilog testbench is designed for static fault detection. Test development covers fault modeling and test generation. First, functional truth-table test generation for TLL cells is proposed. Before the truth-table test of the threshold function, the dependence on the sequence of applied vectors, i.e., the dependence of the current state on the previous state, must be eliminated. A transition test (dynamic pattern) on all weak inputs is shown to test the reset function, which is supposed to erase the history in the reset phase before every evaluation phase; the remaining truth-table vectors, excluding the weak inputs, are then applied statically (static patterns). Second, dynamic patterns for all weak inputs are proposed to detect the structural transistor-level faults analyzed in the TLL cell, under a single-fault assumption covering stuck-at, stuck-on, and stuck-open faults. With those patterns included, the functional test covers all testable structural faults inside the TLL. Third, at the scope of the whole hybrid netlist, a three-step test-generation procedure is proposed: scan-chain test; test of feeders and other scan elements except TLLs; and functional-pattern test of TLL cells. Implementation of this procedure is discussed in the automatic test pattern generation (ATPG) chapter.
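
For reference, a threshold function computes f(x) = 1 iff the weighted input sum meets the threshold, i.e. sum_i w_i x_i >= T. The sketch below enumerates such a truth table and flags the vectors whose weighted sum lies closest to T; reading those minimum-margin vectors as the "weak inputs" is my assumption, and the weights and threshold are invented.

```python
# Enumerate the truth table of a threshold function f(x) = [sum(w*x) >= T]
# and flag the input vectors nearest the threshold (illustrative reading
# of "weak inputs"; weights and threshold are made-up examples).

from itertools import product

def truth_table(weights, T):
    rows = []
    for x in product([0, 1], repeat=len(weights)):
        s = sum(w * xi for w, xi in zip(weights, x))
        rows.append((x, int(s >= T), s - T))   # (vector, output, margin)
    return rows

# w = [2, 1, 1], T = 2 realizes f = a OR (b AND c)
for x, out, margin in truth_table([2, 1, 1], 2):
    tag = "  <- near threshold" if margin in (-1, 0) else ""
    print(x, out, tag)
```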
Contributors: Hu, Yang (Author) / Vrudhula, Sarma (Thesis advisor) / Barnaby, Hugh (Committee member) / Yu, Shimeng (Committee member) / Arizona State University (Publisher)
Created: 2013
Description
Stream computing has emerged as an important model of computation for embedded system applications, particularly in the multimedia and network-processing domains. In the recent past, several programming languages and embedded multicore processors have been proposed for streaming applications. This thesis examines the execution and dynamic scheduling of stream programs on embedded multicore processors. It addresses the problem in the context of a multitasking environment with a time-varying allocation of processing elements for a particular streaming application. As a solution, the thesis proposes a two-step approach in which the stream program is first compiled to gather key application information and generate re-targetable code; a lightweight dynamic scheduler forms the second step. The dynamic scheduler utilizes the static information and the available resources to assign or partition the application across the multicore architecture. The objective of the dynamic scheduler is to maximize the throughput of the application, and it is sensitive to the resource constraints (processing elements, scratchpad memory, DMA bandwidth) imposed by the target architecture. We evaluate the proposed approach by compiling and scheduling benchmark stream programs on a representative embedded multicore processor, and present experimental results that evaluate the quality of the generated solutions through comparisons with existing techniques.
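
The scheduler's algorithm is not specified in the abstract; as one hypothetical illustration of throughput-oriented partitioning under a time-varying core allocation, a greedy longest-processing-time assignment looks like this (scratchpad and DMA constraints omitted).

```python
# Illustrative greedy partitioning of stream actors onto however many cores
# are currently allocated: heaviest actors first, each to the least-loaded
# core (LPT heuristic). Actor names and loads are invented.

def partition(actor_loads, num_cores):
    cores = [0.0] * num_cores
    placement = {}
    for actor, load in sorted(actor_loads.items(), key=lambda kv: -kv[1]):
        c = min(range(num_cores), key=lambda i: cores[i])
        cores[c] += load
        placement[actor] = c
    return placement, max(cores)    # bottleneck load bounds throughput

actors = {"src": 2.0, "fir": 8.0, "fft": 6.0, "sink": 1.0}
for n in (2, 3):                    # core allocation changes at runtime
    plan, bottleneck = partition(actors, n)
    print(n, plan, bottleneck)
```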
Contributors: Lee, Haeseung (Author) / Chatha, Karamvir (Thesis advisor) / Vrudhula, Sarma (Committee member) / Chakrabarti, Chaitali (Committee member) / Wu, Carole-Jean (Committee member) / Arizona State University (Publisher)
Created: 2013
Description
Vehicles powered by electricity and alternative fuels are becoming a more popular form of transportation, since they have less environmental impact than standard gasoline vehicles. Unfortunately, their success is currently inhibited by the sparseness of locations where the vehicles can refuel, as well as by the fact that many of the vehicles have a shorter range than gasoline-powered ones. Together these factors create "range anxiety": drivers worry about the utility of alternative-fuel and electric vehicles and become less likely to purchase them. For the new vehicle technologies to thrive, it is critical that range anxiety be minimized and performance increased as much as possible through proper routing and scheduling. For long-distance trips taken by individual vehicles, routes must be chosen so that the vehicles take the shortest paths while not running out of fuel along the way. When many vehicles are to be routed during the day and the refueling stations have limited capacity, care must be taken to avoid having too many vehicles arrive at a station at the same time; if the vehicles that will need routing in the future are unknown, the problem is stochastic. For fleets of vehicles serving scheduled operations, switching to alternative fuels requires ensuring the schedules do not cause the vehicles to run out of fuel, which is especially problematic because refueling locations are limited while the technology is new. This dissertation covers three related optimization problems: routing a single electric or alternative-fuel vehicle on a long-distance trip; routing many electric vehicles in a network where the stations have limited capacity and arrivals into the system are stochastic; and scheduling fleets of electric or alternative-fuel vehicles with limited refueling locations. Different algorithms, some exact and some heuristic, are proposed to solve each of the three problems. The algorithms are tested on both random data and data relating to the State of Arizona.
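
For the single-vehicle case, one standard formulation (assumed here, not necessarily the dissertation's algorithm) runs Dijkstra over (node, remaining-fuel) states, refueling at designated stations; the network, tank size, and instant-refuel-to-full model below are illustrative.

```python
# Sketch of single-vehicle routing with limited range: Dijkstra over
# (node, fuel) states; fuel is consumed one unit per unit distance.

import heapq

def route(graph, stations, start, goal, tank):
    """graph: {u: [(v, dist), ...]}; returns shortest feasible distance."""
    pq = [(0.0, start, float(tank))]          # (distance so far, node, fuel)
    best = {}
    while pq:
        d, u, f = heapq.heappop(pq)
        if u == goal:
            return d
        if best.get((u, f), float("inf")) <= d:
            continue
        best[(u, f)] = d
        if u in stations:
            f = float(tank)                   # refuel to full, cost-free
        for v, w in graph.get(u, []):
            if w <= f:                        # edge reachable on current fuel
                heapq.heappush(pq, (d + w, v, f - w))
    return None                               # goal unreachable with this tank

g = {"A": [("B", 4), ("S", 3)], "S": [("B", 3)], "B": [("C", 5)]}
print(route(g, stations={"S"}, start="A", goal="C", tank=8))
```

With a tank of 8, the direct route A-B-C (length 9) is infeasible, so the search detours through station S and returns 11, illustrating the shortest-path-without-running-dry trade-off described above.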
Contributors: Adler, Jonathan D (Author) / Mirchandani, Pitu B. (Thesis advisor) / Askin, Ronald (Committee member) / Gel, Esma (Committee member) / Xue, Guoliang (Committee member) / Zhang, Muhong (Committee member) / Arizona State University (Publisher)
Created: 2014
Description
Major advancements in biology and medicine have been realized during recent decades, including massively parallel sequencing, which allows researchers to collect millions or billions of short reads from a DNA or RNA sample. This capability opens the door to a renaissance in personalized medicine, if effectively deployed. Three projects that address major and necessary advancements in massively parallel sequencing are included in this dissertation. The first involves a pair of algorithms to verify patient identity based on single nucleotide polymorphisms (SNPs). In brief, we developed a method that allows de novo construction of sample relationships, determining which samples come from the same individual and which from different individuals, as well as a method to confirm the hypothesis that a tumor came from a known individual. The second study derives an algorithm to multiplex multiple polymerase chain reaction (PCR) reactions while minimizing interference between reactions that would compromise results. PCR is a powerful technique that amplifies pre-determined regions of DNA and is often used to selectively amplify DNA and RNA targets destined for sequencing. Multiplexing reactions is highly desirable, as it saves reagent and assay-setup costs and equalizes the effect of minor handling issues across gene targets. Our solution involves a binary integer program that minimizes events likely to cause interference between PCR reactions. The third study involves design and analysis methods required to analyze gene expression and copy-number results against a reference range in a clinical setting, to guide patient treatments. Our goal is to determine which events are present in a given tumor specimen; these events may be mutations, DNA copy-number changes, or RNA expression changes. All three techniques are being used for their intended purposes in major research and diagnostic projects at the time of writing. The SNP-matching solution has been selected by The Cancer Genome Atlas to determine sample identity; Paradigm Diagnostics, Viomics, and the International Genomics Consortium use the PCR multiplexing technique on multi-million-dollar projects; and the reference-range-based normalization method is used by Paradigm Diagnostics to analyze results from every patient.
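
A minimal sketch of the SNP-identity idea: two samples from the same individual should show high genotype concordance across a SNP panel. The genotype coding, panel, and threshold below are invented, and the dissertation's methods are more sophisticated (e.g., tolerating tumor-specific allele loss).

```python
# Toy SNP concordance check between two samples (illustrative only).

def concordance(g1, g2):
    """g1, g2: lists of genotypes coded 0/1/2 (alt-allele counts per SNP)."""
    matches = sum(a == b for a, b in zip(g1, g2))
    return matches / len(g1)

def same_individual(g1, g2, threshold=0.8):   # threshold is an assumption
    return concordance(g1, g2) >= threshold

blood = [0, 1, 2, 1, 0, 2, 1, 1]
tumor = [0, 1, 2, 0, 0, 2, 1, 1]   # one discordant site (e.g., allele loss)
other = [2, 0, 1, 2, 1, 0, 0, 2]

print(same_individual(blood, tumor))   # True: 7/8 sites concordant
print(same_individual(blood, other))   # False: unrelated genotypes
```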
Contributors: Morris, Scott (Author) / Gel, Esma S (Thesis advisor) / Runger, George C. (Thesis advisor) / Askin, Ronald (Committee member) / Paulauskis, Joseph (Committee member) / Arizona State University (Publisher)
Created: 2014
Description
In this thesis we deal with the problem of temporal logic robustness estimation. We present a dynamic programming algorithm for the robustness estimation problem of Metric Temporal Logic (MTL) formulas over a finite timed state sequence. The algorithm not only tests whether the MTL specification is satisfied by the given input, a finite system trajectory, but also quantifies to what extent the sequence satisfies or violates the specification. The algorithm is implemented as the DP-TALIRO toolbox for MATLAB and is currently used as the temporal-logic robustness computation engine of S-TALIRO, a MATLAB tool that searches for trajectories of minimal robustness in Simulink/Stateflow models. DP-TALIRO is expected to have near-linear running time and a constant memory requirement, depending on the structure of the MTL formula. The DP-TALIRO toolbox also integrates new features not supported in its ancestor FW-TALIRO, such as parameter replacement, most related iteration, and most related predicate. A derivative of DP-TALIRO, DP-T-TALIRO, is also addressed in this thesis; it applies the dynamic programming algorithm to time-robustness computation. We test the running time of DP-TALIRO and compare it with FW-TALIRO. Finally, we present an application in which DP-TALIRO is used as the robustness computation core of S-TALIRO for a parameter estimation problem.
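
As a flavor of discrete-time robustness computation (a simplified sketch, not DP-TALIRO's algorithm, which handles full MTL with timing intervals), the robustness of "eventually" and "always" over a finite trace reduces to a max or min over per-sample predicate robustness values.

```python
# Robustness of F(x > c) and G(x > c) over a finite trace: the predicate
# robustness at each sample is its signed distance to the threshold, and
# the temporal operators take max (eventually) or min (always) over time.

def robustness_pred(x, c):
    return x - c                  # signed satisfaction margin of x > c

def eventually(rob):              # F phi
    return max(rob)

def always(rob):                  # G phi
    return min(rob)

trace = [0.3, 1.2, 2.5, 1.9, 0.1]                 # signal samples
rob = [robustness_pred(x, 1.0) for x in trace]

print(eventually(rob))  #  1.5: the trace satisfies F(x > 1) with margin 1.5
print(always(rob))      # -0.9: the trace violates G(x > 1) by margin 0.9
```

The sign of the result encodes satisfaction or violation, and its magnitude says by how much, which is exactly the quantity S-TALIRO minimizes when searching for falsifying trajectories.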
Contributors: Yang, Hengyi (Author) / Fainekos, Georgios (Thesis advisor) / Sarjoughian, Hessam S. (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created: 2013