Matching Items (3)
Description
Stream processing has emerged as an important model of computation especially in the context of multimedia and communication sub-systems of embedded System-on-Chip (SoC) architectures. The dataflow nature of streaming applications allows them to be most naturally expressed as a set of kernels iteratively operating on continuous streams of data. The kernels are computationally intensive and are mainly characterized by real-time constraints that demand high throughput and data bandwidth with limited global data reuse. Conventional architectures fail to meet these demands due to their poorly matched execution models and the overheads associated with instruction and data movements.
This work presents StreamWorks, a multi-core embedded architecture for energy-efficient stream computing. The basic processing element in the StreamWorks architecture is the StreamEngine (SE), which is responsible for iteratively executing a stream kernel. The SE introduces an instruction locking mechanism that exploits the iterative nature of the kernels and enables fine-grain instruction reuse. Each instruction in an SE is locked to a Reservation Station (RS) and revitalizes itself after execution, thus never retiring from the RS. The entire kernel is hosted in RS Banks (RSBs) close to the functional units for energy-efficient instruction delivery. The dataflow semantics of stream kernels are captured by a context-aware dataflow execution mode that efficiently exploits the Instruction-level parallelism (ILP) and Data-level parallelism (DLP) within stream kernels.
Multiple SEs that communicate via a local interconnect are grouped together to form a StreamCluster (SC). A novel software FIFO virtualization technique with split-join functionality is proposed for efficient and scalable stream communication across SEs. The proposed communication mechanism exploits the Task-level parallelism (TLP) of the stream application. The performance and scalability of the communication mechanism are evaluated against the existing data movement schemes for scratchpad-based multi-core architectures. Further, overlay schemes and architectural support are proposed that allow hosting any number of kernels on the StreamWorks architecture. The proposed overlay schemes for code management support kernel (context) switching for the most common use cases and can be adapted for any multi-core architecture that uses software-managed local memories.
The performance and energy efficiency of the StreamWorks architecture are evaluated on stream kernel and application benchmarks by implementing the architecture in TSMC 45 nm technology and comparing it with a low-power RISC core and a contemporary accelerator.
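The instruction locking idea described above can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the StreamWorks design: the class `LockedEntry`, the function `run_kernel`, and the example kernel `y = a*x + b` are hypothetical names chosen for illustration. Each entry holds one instruction, fires when both operands have arrived, and then clears its operands instead of being retired, so the same entry is reused on every iteration of the kernel.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LockedEntry:
    """Toy reservation-station entry holding one locked instruction."""
    op: Callable[[float, float], float]
    operands: List[Optional[float]] = field(default_factory=lambda: [None, None])
    fires: int = 0  # how many times this entry has executed

    def deliver(self, slot: int, value: float) -> None:
        """Place an operand value into the given operand slot."""
        self.operands[slot] = value

    def ready(self) -> bool:
        """An entry may fire once every operand slot is filled."""
        return all(v is not None for v in self.operands)

    def fire(self) -> float:
        """Execute, then re-arm: operands are cleared, the instruction stays."""
        result = self.op(*self.operands)
        self.operands = [None, None]
        self.fires += 1
        return result


def run_kernel(xs, a, b):
    """Stream y = a*x + b through two locked entries (hypothetical kernel)."""
    mul = LockedEntry(lambda p, q: p * q)
    add = LockedEntry(lambda p, q: p + q)
    out = []
    for x in xs:
        mul.deliver(0, a)
        mul.deliver(1, x)
        if mul.ready():
            add.deliver(0, mul.fire())  # result forwarded to the consumer entry
        add.deliver(1, b)
        if add.ready():
            out.append(add.fire())
    return out, mul.fires, add.fires
```

In this model the two instructions are created once and each fires once per stream element, which mirrors the fine-grain reuse the abstract describes; energy and timing are not modeled.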
Contributors: Panda, Amrit (Author) / Chatha, Karam S. (Thesis advisor) / Wu, Carole-Jean (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created: 2014
Description
Coarse Grain Reconfigurable Arrays (CGRAs) are promising accelerators capable of
achieving high performance at low power consumption. While CGRAs can efficiently
accelerate loop kernels, accelerating loops with control flow (loops with if-then-else
structures) is quite challenging. Techniques that handle control flow execution in
CGRAs generally use predication. Such techniques execute both branches of an
if-then-else structure and select the outcome of one branch to commit based on the
result of the conditional. This results in poor utilization of the CGRA's computational
resources. The dual-issue scheme, which is the state-of-the-art technique for control flow,
fetches instructions from both paths of the branch and selects one to execute at
run time based on the result of the conditional. This technique has an overhead in
instruction fetch bandwidth. In this thesis, to improve the performance of control flow
execution in CGRAs, I propose a solution in which the result of the conditional
expression that decides the branch outcome is communicated to the instruction fetch
unit, which selectively issues instructions from the path taken by the branch at run time.
Experimental results show that my solution can achieve 34.6% better performance
and a 52.1% improvement in energy efficiency on average compared to the
state-of-the-art dual-issue scheme, without imposing any overhead in instruction fetch bandwidth.
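The difference between the three approaches named above can be sketched with a toy cost model for a single if-then-else. This is an illustrative simplification, not the thesis's evaluation method: the names `Cost`, `predication`, `dual_issue`, and `selective_issue` are hypothetical, and the model only counts instructions fetched and executed, ignoring scheduling, mapping, and timing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    fetched: int   # instructions brought in by the fetch unit
    executed: int  # instructions that occupy a functional unit


def predication(then_ops: int, else_ops: int, taken_then: bool) -> Cost:
    """Both branches are fetched and executed; one result is committed."""
    total = then_ops + else_ops
    return Cost(fetched=total, executed=total)


def dual_issue(then_ops: int, else_ops: int, taken_then: bool) -> Cost:
    """Both paths are fetched, but only the taken path is executed."""
    taken = then_ops if taken_then else else_ops
    return Cost(fetched=then_ops + else_ops, executed=taken)


def selective_issue(then_ops: int, else_ops: int, taken_then: bool) -> Cost:
    """The branch outcome steers fetch, so only the taken path is fetched."""
    taken = then_ops if taken_then else else_ops
    return Cost(fetched=taken, executed=taken)
```

For a branch with 3 operations on the then-path and 5 on the else-path, where the then-path is taken, this model charges predication 8 fetched and 8 executed, dual issue 8 fetched and 3 executed, and selective issue 3 fetched and 3 executed. The performance and energy figures reported in the abstract come from the author's experiments and cannot be derived from this sketch.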
Contributors: Rajendran Radhika, Shri Hari (Author) / Shrivastava, Aviral (Thesis advisor) / Christen, Jennifer Blain (Committee member) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created: 2014
Description
Modern computer processors contain embedded firmware known as microcode that controls the decoding and execution of x86 instructions. Although proprietary and relatively obscure, this microcode can be modified using updates released by hardware manufacturers to correct processor logic flaws (errata). At the same time, a malicious microcode update could compromise a processor by implementing new malicious instructions or altering the functionality of existing instructions, including processor-accelerated virtualization or cryptographic primitives. Not only is this attack vector capable of subverting all software-enforced security policies and access controls, but it also leaves behind no postmortem forensic evidence, since the write-only patch memory is cleared upon system reset. Although supervisor privileges (ring zero) are required to update processor microcode, this attack cannot be easily mitigated because the microcode update functionality is implemented within processor silicon. In this paper, we reveal the microarchitecture and mechanism of microcode updates, present a security analysis of this attack vector, and provide some mitigation suggestions.
Contributors: Chen, Daming Dominic (Author) / Ahn, Gail-Joon (Thesis director) / Lee, Joohyung (Committee member) / Barrett, The Honors College (Contributor) / Computer Science and Engineering Program (Contributor) / School of International Letters and Cultures (Contributor) / School of Mathematical and Statistical Sciences (Contributor)
Created: 2014-05