Matching Items (107)
Filtering by
- All Subjects: Computer Engineering
- Genre: Masters Thesis
- Resource Type: Text
Description
Historically, wireless communication devices have been developed to process one specific waveform. In contrast, a modern cellular phone supports multiple waveforms corresponding to the LTE, WCDMA (3G), and 2G standards. The selection of the network is controlled by software running on a general-purpose processor, not by the user. Now, instead of selecting from a set of complete radios as in software-controlled radio, what if the software could select the building blocks based on the user's needs? This is the new software-defined flexible radio, which would enable users to construct wireless systems that fit their needs rather than forcing them to choose from a small set of pre-existing protocols.
To develop and implement flexible protocols, flexible hardware very similar to a Software Defined Radio (SDR) is required. In this thesis, the Intel T2200 board is chosen as the SDR platform. It is a heterogeneous platform with ARM, CEVA DSP, and several accelerators. A wide range of protocols is mapped onto this platform and their performance evaluated. These include two OFDM-based protocols (WiFi-Lite-A, WiFi-Lite-B), one DFT-spread OFDM-based protocol (SCFDM-Lite), and one single-carrier-based protocol (SC-Lite). The transmitter and receiver blocks of the different protocols are first mapped onto ARM in the T2200 board. The timing results show that the IFFT, FFT, and Viterbi decoder blocks take most of the transmitter and receiver execution time, so in the next step these are mapped onto CEVA DSP. Mapping onto CEVA DSP resulted in significant execution time savings: 60% for WiFi-Lite-A, 64% for WiFi-Lite-B, and 71.5% for SCFDM-Lite. No savings are reported for SC-Lite since it was not mapped onto CEVA DSP.
A significant reduction in execution time is achieved for the WiFi-Lite-A and WiFi-Lite-B protocols by implementing the entire transmitter and receiver chains on CEVA DSP. For instance, for WiFi-Lite-A, the savings were as large as 90%. The savings are this large because the entire transmitter or receiver chain is implemented on CEVA, which completely eliminates the timing overhead of ARM-CEVA communication. Finally, over-the-air testing was done for the WiFi-Lite-A and WiFi-Lite-B protocols. Data was sent over the air using one Intel T2200 WBS board and received using another Intel T2200 WBS board. The received frames were decoded with no errors, thereby validating the over-the-air communication.
Contributors: Chagari, Vamsi Reddy (Author) / Chakrabarti, Chaitali (Thesis advisor) / Lee, Hyunseok (Committee member) / Ogras, Umit Y. (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
The last decade has witnessed a paradigm shift in computing platforms, from laptops and servers to mobile devices like smartphones and tablets. These devices host an immense variety of applications many of which are computationally expensive and thus are power hungry. As most of these mobile platforms are powered by batteries, energy efficiency has become one of the most critical aspects of such devices. Thus, the energy cost of the fundamental arithmetic operations executed in these applications has to be reduced. As voltage scaling has effectively ended, the energy efficiency of integrated circuits has ceased to improve within successive generations of transistors. This resulted in widespread use of Application Specific Integrated Circuits (ASIC), which provide incredible energy efficiency. However, these are not flexible and have high non-recurring engineering (NRE) cost. Alternatively, Field Programmable Gate Arrays (FPGA) offer flexibility to implement any application, but at the cost of higher area and energy compared to ASIC.
In this work, a spatially programmable architecture customized for image processing applications is proposed. The intent is to bridge the efficiency gap between ASICs and FPGAs, by offering FPGA-like flexibility and ASIC-like energy efficiency. This architecture minimizes the energy overheads in FPGAs, which result from the use of fine-grained programming style and global interconnect. It is flexible compared to an ASIC and can accommodate multiple applications.
The main contribution of the thesis is the feasibility analysis of the data path of this architecture, customized for image processing applications. The data path is implemented at the register transfer level (RTL), and the synthesis results are obtained using a 45 nm technology cell library from a leading foundry. The results for image-processing applications demonstrate that this architecture is within a factor of 10 of the energy and area efficiency of ASIC implementations.
Contributors: Satapathy, Saktiswarup (Author) / Brunhaver, John (Thesis advisor) / Clark, Lawrence T (Committee member) / Ren, Fengbo (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
Social media has become popular in the past decade. Facebook, for example, has 1.59 billion monthly active users. With such massive social networks generating a lot of data, everyone is constantly looking for ways to leverage the knowledge from social networks to make their systems more personalized for their end users. With the rapid increase in the use of mobile phones and wearables, social media data is also being tied to spatial networks. This research document proposes an efficient technique for answering Social k-Nearest Neighbors queries with a Spatial Range Filter. The proposed approach performs a joint search on both the social and spatial domains, which radically improves performance compared to straightforward solutions. The research document proposes a novel index that combines social and spatial indexes. In other words, graph data is stored in an organized manner so that it can be filtered by spatial constraints (region of interest) and social constraints (top-k closest vertices) at query time. This leads to pruning unnecessary paths during the social graph traversal and returning only the top-k socially close venues. The research document then shows experimentally that the proposed approach outperforms existing baseline approaches by at least three times, and compares how each of the algorithms performs under various conditions on a real geo-social dataset extracted from Yelp.
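As a rough illustration of the query itself, the following Python sketch answers a social k-nearest-neighbor query with a spatial range filter by plain breadth-first traversal. It is not the combined index proposed in the thesis; it is a naive reference version under stated assumptions. The function name `social_knn_in_range`, the adjacency-dict graph layout, hop count as the social distance, and the rectangular region are all hypothetical choices made for this example.

```python
from collections import deque


def social_knn_in_range(graph, locations, source, k, region):
    """Return up to k vertices inside `region`, nearest to `source` by hop count.

    graph:     dict mapping vertex -> iterable of neighbor vertices (assumed layout)
    locations: dict mapping vertex -> (x, y); vertices without a location are skipped
    region:    (xmin, ymin, xmax, ymax), an axis-aligned region of interest
    """
    xmin, ymin, xmax, ymax = region

    def in_region(vertex):
        if vertex not in locations:
            return False
        x, y = locations[vertex]
        return xmin <= x <= xmax and ymin <= y <= ymax

    results = []
    visited = {source}
    queue = deque([(source, 0)])
    # BFS dequeues vertices in nondecreasing hop distance, so the first k
    # in-region vertices dequeued are the k socially closest ones.
    while queue and len(results) < k:
        vertex, hops = queue.popleft()
        if vertex != source and in_region(vertex):
            results.append((vertex, hops))
        for neighbor in graph.get(vertex, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, hops + 1))
    return results
```

This version checks the spatial constraint only after reaching a vertex, so it may traverse many vertices that lie outside the region. A joint social-spatial index of the kind the abstract describes aims to avoid that work by pruning during traversal.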
Contributors: Pasumarthy, Nitin (Author) / Sarwat, Mohamed (Thesis advisor) / Papotti, Paolo (Committee member) / Sen, Arunabha (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
The availability of a wide range of general-purpose as well as accelerator cores on modern smartphones means that a significant number of applications can be executed on a smartphone simultaneously, resulting in an ever-increasing demand on the memory subsystem. While the increased computation capability is intended to improve user experience, memory requests from each concurrent application exhibit unique memory access patterns as well as specific timing constraints. If not considered, this could lead to significant memory contention and a degraded user experience.
This work first analyzes the impact of interference at the memory system on a broad range of commonly used smartphone applications. The real-system characterization results show that smartphone applications such as web browsing and media playback suffer significant performance degradation. This is caused by shared resource contention at the application processor's last-level cache, the communication fabric, and the main memory.
Based on the detailed characterization results, the rest of this thesis focuses on the design of an effective memory interference mitigation technique. Since web browsing is one of the most commonly used smartphone applications and represents many HTML-based smartphone applications, my thesis focuses on meeting the performance requirement of a web browser on a smartphone in the presence of background processes and co-scheduled applications. My thesis proposes a light-weight user-space frequency governor that mitigates the degradation caused by interfering applications by predicting the performance and power consumption of web browsing. The governor periodically selects an optimal energy-efficient frequency setting by using statically trained performance and power models with dynamically varying architecture and system conditions, such as the memory access intensity of background processes and/or co-scheduled applications and the temperature of the cores. The governor has been extensively evaluated on a Nexus 5 smartphone over a diverse range of mobile workloads. By operating at the most energy-efficient frequency setting in the presence of interference, energy efficiency is improved by as much as 35%, and by 18% on average, compared to the existing interactive governor, while maintaining satisfactory web page load times of under 3 seconds.
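The selection step of such a governor can be sketched as follows: among the available frequencies, choose the one with the lowest predicted energy whose predicted page load time stays under the deadline. This is a minimal Python sketch under assumptions, not the thesis's governor. The function `pick_frequency`, the callables `predict_load_time_s` and `predict_power_w`, and the toy models in the usage example are hypothetical stand-ins for the trained performance and power models the abstract mentions.

```python
def pick_frequency(freqs_mhz, predict_load_time_s, predict_power_w, deadline_s=3.0):
    """Pick the frequency with the lowest predicted energy that meets the deadline.

    predict_load_time_s(f) and predict_power_w(f) are hypothetical model callables;
    in a real governor they would also depend on interference and temperature.
    Falls back to the highest frequency when no setting meets the deadline.
    """
    best_freq = None
    best_energy = None
    for freq in freqs_mhz:
        load_time = predict_load_time_s(freq)
        if load_time >= deadline_s:
            continue  # too slow: page load would not finish under the deadline
        energy = predict_power_w(freq) * load_time  # energy = power x time
        if best_energy is None or energy < best_energy:
            best_freq, best_energy = freq, energy
    return best_freq if best_freq is not None else max(freqs_mhz)


# Toy models (assumptions for illustration only): a fixed memory-stall term that
# does not scale with frequency, and power that grows with the cube of frequency.
def toy_load_time(freq_mhz):
    return 1800.0 / freq_mhz + 1.0


def toy_power(freq_mhz):
    return 0.5 + (freq_mhz / 1000.0) ** 3


chosen = pick_frequency([500, 1000, 1500, 2000], toy_load_time, toy_power)
```

With these toy models the lowest frequency misses the deadline, and the higher ones meet it at a much larger energy cost, so the sketch settles on the slowest setting that still loads the page in time. A periodic governor would rerun this selection as the measured conditions change.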
Contributors: Shingari, Davesh (Author) / Wu, Carole-Jean (Thesis advisor) / Vrudhula, Sarma (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
Soft errors are considered a key reliability challenge for sub-nano scale transistors. An ideal solution for such a challenge should ultimately eliminate the effect of soft errors from the microprocessor. While forward recovery techniques achieve fast recovery from errors by simply voting out the wrong values, they incur the overhead of three copies of execution. Backward recovery techniques need only two copies of execution, but suffer from check-pointing overhead.
In this work I explored the efficiency of integrating check-pointing into the application and the effectiveness of the recovery that can be performed upon it. After evaluating the available fine-grained approaches to recovery, I introduce InCheck, an in-application recovery scheme that can be integrated into instruction-duplication-based techniques, thus providing fast error recovery. The proposed technique makes light-weight checkpoints at basic-block granularity and uses them for recovery.
To evaluate the effectiveness of the proposed technique, 10,000 fault injection experiments were performed on different hardware components of a simulated modern in-order ARM processor. InCheck was able to recover from all detected errors by replaying about 20 instructions, whereas the state-of-the-art recovery scheme failed more than 200 times.
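The general idea of backward recovery at basic-block granularity can be modeled in a few lines of Python. This is a toy model under assumptions, not InCheck itself: basic blocks are represented as pure functions on a state dictionary, duplication is modeled by running each block twice, and the names `run_with_recovery`, `fault_hook`, and `max_replays` are hypothetical.

```python
import copy


def run_with_recovery(blocks, state, fault_hook=None, max_replays=3):
    """Run basic blocks with duplicated execution and per-block rollback.

    blocks:     list of functions, each mapping a state dict to a new state dict
    fault_hook: optional callable (block_index, attempt, result) -> result,
                used here only to inject a fault into the primary copy
    Returns the final state and the number of block replays performed.
    """
    replays = 0
    for index, block in enumerate(blocks):
        checkpoint = copy.deepcopy(state)  # checkpoint taken at basic-block entry
        for attempt in range(max_replays + 1):
            primary = block(copy.deepcopy(checkpoint))
            shadow = block(copy.deepcopy(checkpoint))
            if fault_hook is not None:
                primary = fault_hook(index, attempt, primary)
            if primary == shadow:
                state = primary  # both copies agree: commit and move on
                break
            replays += 1  # mismatch detected: replay this block from its checkpoint
        else:
            raise RuntimeError(f"block {index} did not recover after {max_replays} replays")
    return state, replays
```

Because the checkpoint covers only one basic block, a detected error costs a replay of that block alone, which matches the abstract's point that recovery replays only a small number of instructions.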
Contributors: Lokam, Sai Ram Dheeraj (Author) / Shrivastava, Aviral (Thesis advisor) / Clark, Lawrence T (Committee member) / Mubayi, Anuj (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
Coarse-grained Reconfigurable Arrays (CGRAs) are promising accelerators capable of accelerating even non-parallel loops and loops with low trip counts. One challenge in compiling for CGRAs is managing both recurring and nonrecurring variables in the register file (RF) of the CGRA. Although prior works have managed recurring variables via a rotating RF, they access the nonrecurring variables through either a global RF or a constant memory. The former does not scale well, and the latter degrades the mapping quality. This work proposes a hardware-software codesign approach to manage all the variables in a local nonrotating RF. The hardware provides a modulo-addition-based indexing mechanism to enable correct addressing of recurring variables in a nonrotating RF. The compiler determines the number of registers required for each recurring variable and configures the boundary between the registers used for recurring and nonrecurring variables. The compiler also pre-loads the read-only variables and constants into the local registers in the prologue of the schedule. Synthesis and place-and-route results for the previous and the proposed RF designs show that the proposed solution achieves 17% better cycle time. Experiments mapping several important and performance-critical loops collected from MiBench show that the proposed approach improves performance (through better mapping) by 18% compared to using constant memory.
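The register-file scheme described above can be illustrated with a small Python sketch. It shows one plausible reading of modulo-addition-based indexing, not the thesis's actual hardware: each recurring variable owns a contiguous group of registers and the iteration count selects one of them modulo the group size, while nonrecurring variables sit at fixed indices past a compiler-configured boundary. The function names `physical_register` and `build_layout` and the layout order are assumptions made for this example.

```python
def physical_register(base, num_regs, iteration, distance=0):
    """Map a recurring variable's value to a physical register index.

    base:      first register of the group allotted to this variable
    num_regs:  number of registers the compiler allotted to it
    iteration: loop iteration performing the access
    distance:  how many iterations earlier the value was produced (0 for a write)
    """
    if num_regs <= 0:
        raise ValueError("num_regs must be positive")
    # Modulo addition keeps the index inside the variable's own register group.
    return base + (iteration - distance) % num_regs


def build_layout(recurring, nonrecurring):
    """Lay out recurring groups first, then one fixed register per nonrecurring name.

    recurring:    dict mapping name -> number of registers required
    nonrecurring: list of names (read-only variables and constants)
    Returns (groups, boundary, fixed), where boundary is the first nonrecurring index.
    """
    groups, next_free = {}, 0
    for name, count in recurring.items():
        groups[name] = (next_free, count)
        next_free += count
    boundary = next_free
    fixed = {name: boundary + i for i, name in enumerate(nonrecurring)}
    return groups, boundary, fixed


groups, boundary, fixed = build_layout({"acc": 2, "ptr": 3}, ["N", "stride"])
written = physical_register(*groups["ptr"], iteration=4)
read_back = physical_register(*groups["ptr"], iteration=5, distance=1)
```

Here `written` and `read_back` resolve to the same register: a value produced in iteration 4 is found by a consumer in iteration 5 that asks for the value from one iteration earlier, without any register rotation. The nonrecurring entries in `fixed` are addressed directly and never move.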
Contributors: Dave, Shail (Author) / Shrivastava, Aviral (Thesis advisor) / Ren, Fengbo (Committee member) / Ogras, Umit Y. (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
Integrated circuits must be energy efficient. This efficiency affects all aspects of chip design, from the battery life of embedded devices to thermal heating on high-performance servers. As technology scaling slows, future generations of transistors will lack the energy-efficiency gains of previous generations. Therefore, other sources of energy efficiency will be much more important. Many computations have the potential to be executed for extreme energy efficiency but are not instigated because the platforms they run on are not optimized for efficient execution. ASICs improve energy efficiency by reducing flexibility and leveraging the properties of a specific computation. However, ASICs are fixed in function and therefore have an incredible opportunity cost. FPGAs offer a reconfigurable solution but are 25x less energy efficient than an ASIC implementation. Spatially programmable architectures (SPAs) are similar in design and structure to ASICs and FPGAs but are able to bridge the ASIC-FPGA energy-efficiency gap by trading flexibility for efficiency. However, SPAs are difficult to program because they do not share the same programming model as normal architectures that execute in time. This work addresses compiler challenges for coarse-grained, locally interconnected SPAs for domain efficiency (SPADE). A novel SPADE topology, called the wave pipeline, is introduced; it is designed for the image signal processing domain and is both efficient and simple to compile to. A compiler for the wave pipeline is created that solves for maximum energy and area efficiency using low-complexity, greedy methods. The wave pipeline topology and compiler allow us to investigate and experiment with image signal processing applications to prove the feasibility of SPADE compilers.
Contributors: Mackay, Curtis (Author) / Brunhaver, John (Thesis advisor) / Karam, Lina J (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
The technological advances of the past few decades have made possible the creation and consumption of digital visual content at an explosive rate. Consequently, there is a need for efficient quality monitoring systems to ensure minimal degradation of images and videos during processing operations such as compression, transmission, and storage. Objective Image Quality Assessment (IQA) algorithms have been developed that predict quality scores which match well with human subjective quality assessment. However, a lot of research remains to be done before IQA algorithms can be deployed in real-world systems. The long runtime needed to process a single image frame is a major hurdle. Graphics Processing Units (GPUs), equipped with a massive number of computational cores, provide an opportunity to accelerate IQA algorithms by performing computations in parallel. Indeed, General Purpose Graphics Processing Unit (GPGPU) techniques have been applied to a few Full Reference IQA algorithms which fall under the. We present a GPGPU implementation of Blind Image Integrity Notator using DCT Statistics (BLIINDS-II), which falls under the No Reference IQA algorithm paradigm. We have been able to achieve a speedup of over 30x over the previous CPU version of this algorithm. We test our implementation using various distorted images from the CSIQ database and present the performance trends observed. We achieve a very consistent performance of around 9 milliseconds per distorted image, which makes it possible to process over 100 images per second (100 fps).
Contributors: Yadav, Aman (Author) / Sohoni, Sohum (Thesis advisor) / Aukes, Daniel (Committee member) / Redkar, Sangram (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
High-level inference tasks in video applications such as recognition, video retrieval, and zero-shot classification have become an active research area in recent years. One fundamental requirement for such applications is to extract high-quality features that maintain high-level information in the videos.
Many video feature extraction algorithms have been proposed, such as STIP, HOG3D, and Dense Trajectories. These algorithms are often referred to as "handcrafted" features, as they were deliberately designed based on some reasonable considerations. However, these algorithms may fail when dealing with high-level tasks or complex scene videos. Due to the success of using deep convolutional neural networks (CNNs) to extract global representations for static images, researchers have been using similar techniques to tackle video content. Typical techniques first extract spatial features by processing raw images using deep convolutional architectures designed for static image classification. Then simple averaging, concatenation, or classifier-based fusion/pooling methods are applied to the extracted features. I argue that features extracted in such ways do not acquire enough representative information, since videos, unlike images, should be characterized as a temporal sequence of semantically coherent visual content and thus need to be represented in a manner that considers both semantic and spatio-temporal information.
In this thesis, I propose a novel architecture to learn a semantic spatio-temporal embedding for videos to support high-level video analysis. The proposed method encodes video spatial and temporal information separately by employing a deep architecture consisting of two channels of convolutional neural networks (capturing appearance and local motion), followed by their corresponding Fully Connected Gated Recurrent Unit (FC-GRU) encoders for capturing the longer-term temporal structure of the CNN features. The resultant spatio-temporal representation (a vector) is used to learn a mapping via a Fully Connected Multilayer Perceptron (FC-MLP) to the word2vec semantic embedding space, leading to a semantic interpretation of the video vector that supports high-level analysis. I evaluate the usefulness and effectiveness of this new video representation by conducting experiments on action recognition, zero-shot video classification, and semantic (word-to-video) video retrieval, using the UCF101 action recognition dataset.
Contributors: Hu, Sheng-Hung (Author) / Li, Baoxin (Thesis advisor) / Turaga, Pavan (Committee member) / Liang, Jianming (Committee member) / Tong, Hanghang (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
Traditional methods for detecting the status of traffic lights used in autonomous vehicles may be susceptible to errors, which is troublesome in a safety-critical environment. In the case of vision-based recognition methods, failures may arise due to disturbances in the environment such as occluded views or poor lighting conditions. Some methods also depend on high-precision meta-data which is not always available. This thesis proposes a complementary detection approach based on an entirely new source of information: the movement patterns of other nearby vehicles. This approach is robust to traditional sources of error, and may serve as a viable supplemental detection method. Several different classification models are presented for inferring traffic light status based on these patterns. Their performance is evaluated over real-world and simulation data sets, resulting in up to 97% accuracy in each set.
Contributors: Campbell, Joseph (Author) / Fainekos, Georgios (Thesis advisor) / Ben Amor, Heni (Committee member) / Artemiadis, Panagiotis (Committee member) / Arizona State University (Publisher)
Created: 2016