FPGA-based implementation of QR decomposition

Description

This thesis report introduces the background of QR decomposition and its applications. QR decomposition using Givens rotations is an efficient way to avoid direct matrix inversion when solving least-squares minimization problems, which is a typical approach for weight calculation in adaptive beamforming. Furthermore, this thesis introduces the Givens rotations algorithm and two general VLSI (very-large-scale integration) architectures for numerical QR decomposition, namely the triangular systolic array and the linear systolic array. To fulfill the goal, a 4-input-channel triangular systolic array with a 16-bit fixed-point format and a 5-input-channel linear systolic array are implemented on an FPGA (field-programmable gate array). The final results show that estimated clock frequencies of 65 MHz and 135 MHz in the post-place-and-route static timing report can be achieved using a Xilinx Virtex-6 xc6vlx240t chip. Meanwhile, this report proposes a new method to test the dynamic range of QR-D. The dynamic range of both architectures is around 110 dB.
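As a rough illustration of why Givens rotations avoid an explicit matrix inverse, the sketch below triangularizes an augmented matrix [A | b] one rotation at a time and then solves by back substitution. It is a floating-point software sketch only, not the thesis's fixed-point systolic-array design; the function name `givens_least_squares` and the sample data are assumptions made for illustration.

```python
import math

def givens_least_squares(A, b):
    """Return x minimizing ||A x - b||_2 for a full-column-rank m x n matrix A."""
    m, n = len(A), len(A[0])
    # Augment A with b so that each rotation updates both at once.
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for j in range(n):                  # column to clear below the diagonal
        for i in range(j + 1, m):       # row whose entry M[i][j] is zeroed
            r = math.hypot(M[j][j], M[i][j])
            if r == 0.0:
                continue                # both entries are already zero
            c, s = M[j][j] / r, M[i][j] / r
            for k in range(j, n + 1):   # rotate rows j and i
                upper, lower = M[j][k], M[i][k]
                M[j][k] = c * upper + s * lower
                M[i][k] = -s * upper + c * lower
    # The first n rows now hold [R | Q^T b]; solve R x = Q^T b
    # by back substitution, with no matrix inverse formed.
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        acc = sum(M[k][l] * x[l] for l in range(k + 1, n))
        x[k] = (M[k][n] - acc) / M[k][k]
    return x

# Hypothetical example: fit a line to three points.
x = givens_least_squares([[1, 1], [1, 2], [1, 3]], [1, 2, 2])
print(x)  # close to [0.6667, 0.5]
```

Each rotation touches only two rows, which is the property that lets the computation be distributed over an array of simple processing cells.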

Contributors: Yu, Hanguang (Author) / Bliss, Daniel W (Thesis advisor) / Ying, Lei (Committee member) / Chakrabarti, Chaitali (Committee member) / Arizona State University (Publisher)

Created: 2014

The Architecture Design and Hardware Implementation of Communications and High-Precision Positioning System

Description

In the near future, a vast demand for autonomous vehicular techniques can be forecast on both aviation and ground platforms, including autonomous driving, automatic landing, and air traffic management. These techniques usually rely on the positioning system and the communication system independently, which can cause spectrum congestion. Inspired by the spectrum sharing technique, the Communications and High-Precision Positioning (CHP2) system was invented to provide a high-precision positioning service (precision ~1 cm) while simultaneously performing the communication task in the same spectrum. The CHP2 system is implemented on a consumer-off-the-shelf (COTS) software-defined radio (SDR) platform with customized hardware. Taking advantage of the SDR platform, the complete baseband processing chain, time-of-arrival (ToA) estimation, and time-of-flight (ToF) estimation are mathematically modeled and then implemented on the system-on-chip (SoC). Owing to its compact size and low cost, the CHP2 system can be installed on different aerial or ground platforms, enabling a highly mobile and reconfigurable network.

In this dissertation report, the implementation procedure of the CHP2 system is discussed in detail. It mainly focuses on the system construction on the Xilinx UltraScale+ SoC platform. The CHP2 waveform design, ToA solution, and timing exchange algorithms are also introduced. Finally, several in-lab tests and over-the-air demonstrations are conducted. The demonstrations show that the best ranging performance achieves a ~1 cm standard deviation and a 10 Hz estimation refresh rate using a 10 MHz narrow-band signal over a 915 MHz (US ISM) or 783 MHz (EU licensed) carrier frequency.

Contributors: Yu, Hanguang (Author) / Bliss, Daniel (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Alkhateeb, Ahmed (Committee member) / Ogras, Umit Y. (Committee member) / Arizona State University (Publisher)

Created: 2020

Novel Learning-Based Task Schedulers for Domain-Specific SoCs

Description

This Master’s thesis covers the design, on-chip integration, and evaluation of a set of imitation learning (IL)-based scheduling policies: deep neural network (DNN) and decision tree (DT). We first developed IL-based scheduling policies for heterogeneous systems-on-chip (SoCs). Then, we tested these policies using a system-level domain-specific system-on-chip simulation framework [11]. Finally, we transformed them into efficient code using a cloud engine [1] and implemented them on a user-space emulation framework [61] on a Unix-based SoC. IL is an area of machine learning (ML) and a useful method for training artificial intelligence (AI) models by imitating the decisions of an expert, or Oracle, that knows the optimal solution. The primary focus of this thesis is to adapt an ML model to work on-chip and optimize resource allocation for a set of domain-specific wireless and radar applications. Evaluation results with four streaming applications from the wireless communications and radar domains show that the proposed IL-based scheduler approximates an offline Oracle expert with more than 97% accuracy and 1.20× faster execution time. The models have been implemented as an add-on, making them easy to port to other SoCs.

Contributors: Holt, Conrad Mestres (Author) / Ogras, Umit Y. (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Akoglu, Ali (Committee member) / Arizona State University (Publisher)

Created: 2020

Efficient and Secure Deep Learning Inference System: A Software and Hardware Co-design Perspective

Description

Recent advances in Deep Learning (DL) have demonstrated its great potential to approach or surpass human-level performance across multiple domains. Consequently, there is a rising demand to deploy state-of-the-art DL algorithms, e.g., Deep Neural Networks (DNNs), in real-world applications to relieve people of repetitive work. On the one hand, the impressive performance achieved by DNNs is normally accompanied by intensive memory and power usage due to enormous model size and high computation workload, which significantly hampers their deployment on resource-limited cyber-physical systems or edge devices. Thus, the urgent demand for enhancing the inference efficiency of DNNs has attracted great research interest across various communities. On the other hand, scientists and engineers still have insufficient knowledge about the principles of DNNs, so they are mostly treated as black boxes. Under such circumstances, a DNN is like "the sword of Damocles": its security and fault tolerance are essential concerns that cannot be circumvented.

Motivated by the aforementioned concerns, this dissertation comprehensively investigates the emerging efficiency and security issues of DNNs from both software and hardware design perspectives. From the efficiency perspective, model compression via quantization, the foundational technique for efficient inference of a target DNN, is elaborated. To maximize the inference performance boost, the deployment of quantized DNNs on a Computing-in-Memory based neural accelerator is presented in a cross-layer (device/circuit/system) fashion. From the security perspective, the well-known adversarial attack is investigated, spanning from its original input attack form (a.k.a. adversarial example generation) to its parameter attack variant.
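The abstract does not say which quantization scheme the dissertation uses. As a minimal sketch of the general idea only, the code below applies generic uniform symmetric quantization to a list of weights; the function names, the 8-bit default, and the sample weights are all assumptions made for illustration.

```python
def quantize_symmetric(weights, num_bits=8):
    """Map float weights to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integer levels."""
    return [qi * scale for qi in q]

# Hypothetical weights, not taken from any model in the dissertation.
weights = [-1.0, -0.2, 0.0, 0.37, 1.0]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
worst = max(abs(w - r) for w, r in zip(weights, restored))
print(q, worst <= scale / 2 + 1e-12)
```

Storing the integer levels plus a single scale factor is what shrinks the model; the cost is a per-weight rounding error of at most half a quantization step.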

Contributors: He, Zhezhi (Author) / Fan, Deliang (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Cao, Yu (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created: 2020

Communications and High-Precision Positioning (CHP2) System: Enabling Distributed Coherence and Precise Positioning for Resource-Limited Air Transport Systems

Description

Unmanned aerial systems (UASs) have recently enabled novel applications such as passenger transport and package delivery, but are increasingly vulnerable to cyberattack and therefore difficult to certify. Legacy systems such as GPS provide these capabilities extremely well, but are sensitive to spoofing and hijacking. An alternative intelligent transport system (ITS), termed the Communications and High-Precision Positioning (CHP2) system, was developed that provides highly secure communications, positioning, and timing synchronization services to networks of cooperative RF users. This technology was implemented on consumer-off-the-shelf (COTS) hardware, and it offers rapid (<100 ms) and precise (<5 cm) positioning capabilities in over-the-air experiments using flexible ground stations and UAS platforms with limited bandwidth (10 MHz). In this study, CHP2 is considered in the context of safety-critical and resource-limited transport applications and urban air mobility. The two-way ranging (TWR) protocol over a joint positioning-communications waveform enables distributed coherence and time-of-flight (ToF) estimation. In a multi-antenna setup, the cross-platform ranging on participating nodes in the network translates to precise target location and orientation. In its current form, CHP2 necessitates a cooperative timing exchange at regular intervals. Dynamic resource management supports higher user densities by constantly renegotiating spectral access depending on need and opportunity. With these novel contributions to the field of integrated positioning and communications, CHP2 is a suitable candidate to provide both communications, navigation, and surveillance (CNS) and alternative positioning, navigation, and timing (APNT) services for high-density safety-critical transport applications on a variety of vehicular platforms.
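To make the two-way ranging idea concrete, the sketch below uses the textbook single-exchange TWR formula: node A measures the round-trip time on its own clock, node B measures its turnaround time on its own clock, and the ToF is half the difference. This is not the CHP2 estimator itself, which the abstract does not specify; the function names and timestamps are hypothetical, and clock drift between the nodes is ignored.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s

def two_way_tof(t_tx_a, t_rx_b, t_tx_b, t_rx_a):
    """Estimate one-way time of flight from a single A -> B -> A exchange.

    t_tx_a and t_rx_a are read on A's clock; t_rx_b and t_tx_b on B's clock.
    A constant offset between the two clocks cancels out.
    """
    round_trip = t_rx_a - t_tx_a    # measured entirely on A's clock
    turnaround = t_tx_b - t_rx_b    # measured entirely on B's clock
    return (round_trip - turnaround) / 2.0

def tof_to_distance_m(tof_s):
    """Convert a one-way time of flight in seconds to a range in meters."""
    return C * tof_s

# Hypothetical exchange: true ToF 100 ns, B replies after 1 ms,
# and B's clock is 0.25 s ahead of A's.
offset, tof, reply = 0.25, 100e-9, 1e-3
t_tx_a = 0.0
t_rx_b = t_tx_a + tof + offset
t_tx_b = t_rx_b + reply
t_rx_a = t_tx_a + tof + reply + tof
estimate = two_way_tof(t_tx_a, t_rx_b, t_tx_b, t_rx_a)
print(estimate, tof_to_distance_m(estimate))
```

Because each interval is measured on a single clock, the nodes do not need a shared time base for this estimate, only a timing exchange that reports the turnaround interval.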

Contributors: Srinivas, Sharanya (Author) / Bliss, Daniel W. (Thesis advisor) / Richmond, Christ D. (Committee member) / Chakrabarti, Chaitali (Committee member) / Alkhateeb, Ahmed (Committee member) / Arizona State University (Publisher)

Created: 2020

Experimental Evaluation of the Feasibility of Wearable Piezoelectric Energy Harvesting

Description

Technological advances in low-power wearable electronics and energy optimization techniques make motion energy harvesting a viable energy source. However, it has not been widely adopted due to bulky energy harvester designs that are uncomfortable to wear. This work addresses this problem by analyzing the feasibility of powering low-power wearable devices using piezoelectric energy generated at the human knee. We start with a novel mathematical model for estimating the power generated from human knee joint movements. This thesis’s major contribution is to analyze the feasibility of human motion energy harvesting and to validate this analytical model using a commercially available piezoelectric module. To this end, we implemented an experimental setup that replicates a human knee. Then, we performed experiments at different excitation frequencies and amplitudes with two commercially available Macro Fiber Composite (MFC) modules. These experimental results are used to validate the analytical model and predict the energy harvested as a function of the number of steps taken in a day. The model estimates that 13 μW can be generated on average while walking, with a 4.8% modeling error. The obtained results show that piezoelectricity is indeed a viable approach for powering low-power wearable devices.

Contributors: Bandyopadhyay, Shiva (Author) / Ogras, Umit Y. (Thesis advisor) / Fan, Deliang (Committee member) / Trichopoulos, Georgios (Committee member) / Arizona State University (Publisher)

Created: 2020

Algorithm and Hardware Design for High Volume Rate 3-D Medical Ultrasound Imaging

Description

Ultrasound B-mode imaging is an increasingly significant medical imaging modality for clinical applications. Compared to other imaging modalities like computed tomography (CT) or magnetic resonance imaging (MRI), ultrasound imaging has the advantage of being safe, inexpensive, and portable. While two-dimensional (2-D) ultrasound imaging is very popular, three-dimensional (3-D) ultrasound imaging provides distinct advantages over its 2-D counterpart by providing volumetric imaging, which leads to more accurate analysis of tumors and cysts. However, the amount of data received at the front end of a 3-D system is extremely large, making it impractical for power-constrained portable systems.

In this thesis, algorithm and hardware design techniques to support a hand-held 3-D ultrasound imaging system are proposed. Synthetic aperture sequential beamforming (SASB) is chosen since its computations can be split into two stages, where the output of Stage 1 is significantly smaller than its input. This characteristic enables Stage 1 to be done in the front end while the data for Stage 2 can be sent out to be processed elsewhere.

The contributions of this thesis are as follows. First, 2-D SASB is extended to 3-D. Techniques to increase the volume rate of 3-D SASB through a new multi-line firing scheme and the use of a linear chirp as the excitation waveform are presented. A new sparse array design is proposed that not only reduces the number of active transducers but also avoids the imaging degradation caused by grating lobes. A combination of these techniques increases the volume rate of 3-D SASB by 4× without introducing extra computations at the front end.

Next, algorithmic techniques to further reduce the Stage 1 computations in the front end are presented. These include reducing the number of distinct apodization coefficients and operating with narrow-bit-width fixed-point data. A 3-D die-stacked architecture is designed for the front end. This highly parallel architecture enables the signals received by 961 active transducers to be digitized, routed by a network-on-chip, and processed in parallel. The processed data are accumulated through a bus-based structure. This architecture is synthesized using the TSMC 28 nm technology node, and the estimated power consumption of the front end is less than 2 W.

Finally, the Stage 2 computations are mapped onto a reconfigurable multi-core architecture, TRANSFORMER, which supports different types of on-chip memory banks and run-time reconfigurable connections between general processing elements and memory banks. The matched filtering step and the beamforming step in Stage 2 are mapped onto TRANSFORMER with different memory configurations. Gem5 simulations show that the private cache mode yields a shorter execution time and higher computation efficiency than the other cache modes. The overall execution time for Stage 2 is 14.73 ms. The average power consumption and the average giga-operations per second per watt in the 14 nm technology node are 0.14 W and 103.84, respectively.

Contributors: Zhou, Jian (Author) / Chakrabarti, Chaitali (Thesis advisor) / Papandreou-Suppappola, Antonia (Committee member) / Wenisch, Thomas F. (Committee member) / Ogras, Umit Y. (Committee member) / Arizona State University (Publisher)

Created: 2019

Predictive Process Design Kits for the 7 nm and 5 nm Technology Nodes

Description

Recent years have seen fin field-effect transistors (finFETs) dominate modern complementary metal-oxide-semiconductor (CMOS) processes [1][2], e.g., at the sub-20 nm technology nodes, as they alleviate short-channel effects, provide lower leakage, and enable some continued VDD scaling. However, a realistic finFET-based predictive process design kit (PDK) that supports investigation into both circuit and physical design, encompassing all aspects of digital design, has been unavailable for academic use. While the finFET-based FreePDK15 was supplemented with a standard cell library, it lacked full physical verification (LVS) and parasitic extraction at the time [3][4]. Consequently, the only available sub-45 nm educational PDKs are the planar CMOS-based Synopsys 32/28 nm and FreePDK45 (45 nm PDK) [5][6]. The cell libraries available for those processes are not realistic since they use large cell heights, in contrast to recent industry trends. Additionally, the SRAM rules and cells provided by these PDKs are not realistic. Because finFETs have a 3D structure, which affects transistor density, using planar libraries scaled to sub-22 nm dimensions for research is likely to give poor accuracy.

Commercial libraries and PDKs, especially for advanced nodes, are often difficult to obtain for academic use, and access to the actual physical layouts is even more restricted. Furthermore, the necessary non-disclosure agreements (NDAs) are unmanageable for large university classes, and the plethora of design rules can distract from the key points. NDAs also make it difficult to publish physical designs, as these may disclose proprietary design rules and structures.

This work focuses on the development of realistic PDKs for academic use that overcome these limitations. These PDKs were developed for the N7 and N5 nodes before 7 nm and 5 nm processes were available in industry, and are thus predictive. The predictions have been based on publications of the continually improving lithography, as well as estimates of what would be available at N7 and N5. For the most part, these assumptions have been accurate with regard to N7, except for the expectation that extreme ultraviolet (EUV) lithography would be widely available, which has turned out to be optimistic.

Contributors: Vashishtha, Vinay (Author) / Clark, Lawrence T. (Thesis advisor) / Allee, David R. (Committee member) / Ogras, Umit Y. (Committee member) / Seo, Jae-Sun (Committee member) / Arizona State University (Publisher)

Created: 2019

Software defined pulse-doppler radar for over-the-air applications: the joint radar-communications experiment

Description

In this paper, the Software Defined Radio (SDR) platform is considered for building a pseudo-monostatic, 100 MHz pulse-Doppler radar. The SDR platform has many benefits for experimental communications systems, as it offers relatively cheap, parametrically dynamic, off-the-shelf access to the radio frequency (RF) spectrum. For this application, the Universal Software Radio Peripheral (USRP) X310 hardware package is used with GNURadio for interfacing to the device and Matlab for signal post-processing. Pulse-Doppler radar processing is used to ascertain the range and velocity of a target considered in simulation and in real, over-the-air (OTA) experiments. The USRP platform offers a scalable and dynamic hardware package that can, with relatively low overhead, be incorporated into other experimental systems. This radar system will be considered for implementation into existing over-the-air Joint Radar-Communications (JRC) spectrum sharing experiments. The JRC system considers a co-designed architecture in which a communications user and a radar user share the same spectral allocation. Where the two systems would traditionally consider one another a source of interference, the receiver is able to decode communications information and discern target information via pulse-Doppler radar simultaneously.
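The relations a pulse-Doppler radar uses to turn measurements into range and velocity can be sketched with the standard monostatic formulas below. These are textbook relations rather than the paper's processing chain; the function names and sample inputs are hypothetical, and reading the "100 MHz" figure as the signal bandwidth is an assumption that the abstract does not confirm.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s

def target_range_m(round_trip_delay_s):
    """Range from echo delay: the pulse travels out and back, hence the factor 2."""
    return C * round_trip_delay_s / 2.0

def radial_velocity_mps(doppler_shift_hz, carrier_hz):
    """Radial velocity from Doppler shift: v = f_d * wavelength / 2."""
    wavelength_m = C / carrier_hz
    return doppler_shift_hz * wavelength_m / 2.0

def range_resolution_m(bandwidth_hz):
    """Smallest separable range difference for a given signal bandwidth."""
    return C / (2.0 * bandwidth_hz)

# Hypothetical measurements, not results from the paper.
print(target_range_m(1e-6))              # echo delayed by 1 microsecond
print(radial_velocity_mps(1000.0, 1e9))  # 1 kHz shift on a 1 GHz carrier
print(range_resolution_m(100e6))         # assuming 100 MHz is the bandwidth
```

Under that bandwidth assumption the range resolution comes out to roughly 1.5 m, which gives a sense of the scale at which such a system could separate targets.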

Contributors: Gubash, Gerard (Author) / Bliss, Daniel W (Thesis advisor) / Richmond, Christ (Committee member) / Chakrabarti, Chaitali (Committee member) / Arizona State University (Publisher)

Created: 2019

Power, Performance, and Energy Management of Heterogeneous Architectures

Description

Modern many-core multiprocessor systems-on-chip offer tremendous power and performance optimization opportunities by tuning thousands of potential voltage, frequency, and core configurations. Applications running on these architectures are becoming increasingly complex. As the basic building blocks that make up the application change during runtime, different configurations may become optimal with respect to power, performance, or other metrics. Identifying the optimal configuration at runtime is a daunting task due to the large number of workloads and configurations. Therefore, there is a strong need to evaluate the metrics of interest as a function of the supported configurations.

This thesis focuses on two different types of modern multiprocessor systems-on-chip (SoCs): mobile heterogeneous systems and the tile-based Intel Xeon Phi architecture.

For mobile heterogeneous systems, this thesis presents a novel methodology that can accurately instrument different types of applications with specific performance monitoring calls. These calls provide a rich set of performance statistics at the basic block level while the application runs on the target platform. The target architecture used for this work (Odroid XU3) is capable of running at 4940 different frequency and core combinations. With the help of the instrumented applications, a vast amount of characterization data is collected that provides details about performance, power, and CPU state at every instrumented basic block across 19 different types of applications. The data collected has enabled two runtime schemes. The first work provides a methodology to find optimal configurations in heterogeneous architectures using classifiers and demonstrates an average increase of 93%, 81%, and 6% in performance per watt (PPW) compared to the interactive, ondemand, and powersave governors, respectively. The second work, using the same data, presents a novel imitation learning framework for dynamically controlling the type, number, and frequencies of active cores to achieve an average PPW improvement of 109% compared to the default governors.

This work also presents how to accurately profile the tile-based Intel Xeon Phi architecture while training different types of neural networks using an open image dataset on a deep learning framework. The data collected allows deep exploratory analysis. It also showcases how different hardware parameters affect the performance of the Xeon Phi.

Contributors: Patil, Chetan Arvind (Author) / Ogras, Umit Y. (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Shrivastava, Aviral (Committee member) / Arizona State University (Publisher)

Created: 2019