Matching Items (35)

Filtering by

Clear all filters

152360-Thumbnail Image.png

Image processing using approximate data-path units

Description

In this work, we present approximate adders and multipliers to reduce data-path complexity of specialized hardware for various image processing systems. These approximate circuits have a lower area, latency and power consumption compared to their accurate counterparts and produce fairly

In this work, we present approximate adders and multipliers to reduce data-path complexity of specialized hardware for various image processing systems. These approximate circuits have a lower area, latency and power consumption compared to their accurate counterparts and produce fairly accurate results. We build upon the work on approximate adders and multipliers presented in [23] and [24]. First, we show how choice of algorithm and parallel adder design can be used to implement 2D Discrete Cosine Transform (DCT) algorithm with good performance but low area. Our implementation of the 2D DCT has comparable PSNR performance with respect to the algorithm presented in [23] with ~35-50% reduction in area. Next, we use the approximate 2x2 multiplier presented in [24] to implement parallel approximate multipliers. We demonstrate that if some of the 2x2 multipliers in the design of the parallel multiplier are accurate, the accuracy of the multiplier improves significantly, especially when two large numbers are multiplied. We choose Gaussian FIR Filter and Fast Fourier Transform (FFT) algorithms to illustrate the efficacy of our proposed approximate multiplier. We show that application of the proposed approximate multiplier improves the PSNR performance of 32x32 FFT implementation by 4.7 dB compared to the implementation using the approximate multiplier described in [24]. We also implement a state-of-the-art image enlargement algorithm, namely Segment Adaptive Gradient Angle (SAGA) [29], in hardware. The algorithm is mapped to pipelined hardware blocks and we synthesized the design using 90 nm technology. We show that a 64x64 image can be processed in 496.48 µs when clocked at 100 MHz. The average PSNR performance of our implementation using accurate parallel adders and multipliers is 31.33 dB and that using approximate parallel adders and multipliers is 30.86 dB, when evaluated against the original image. The PSNR performance of both designs is comparable to the performance of the double precision floating point MATLAB implementation of the algorithm.

Contributors

Agent

Created

Date Created
2013

152527-Thumbnail Image.png

FPGA-based implementation of QR decomposition

Description

This thesis report aims at introducing the background of QR decomposition and its application. QR decomposition using Givens rotations is a efficient method to prevent directly matrix inverse in solving least square minimization problem, which is a typical approach for

This thesis report aims at introducing the background of QR decomposition and its application. QR decomposition using Givens rotations is a efficient method to prevent directly matrix inverse in solving least square minimization problem, which is a typical approach for weight calculation in adaptive beamforming. Furthermore, this thesis introduces Givens rotations algorithm and two general VLSI (very large scale integrated circuit) architectures namely triangular systolic array and linear systolic array for numerically QR decomposition. To fulfill the goal, a 4 input channels triangular systolic array with 16 bits fixed-point format and a 5 input channels linear systolic array are implemented on FPGA (Field programmable gate array). The final result shows that the estimated clock frequencies of 65 MHz and 135 MHz on post-place and route static timing report could be achieved using Xilinx Virtex 6 xc6vlx240t chip. Meanwhile, this report proposes a new method to test the dynamic range of QR-D. The dynamic range of the both architectures can be achieved around 110dB.

Contributors

Agent

Created

Date Created
2014

153890-Thumbnail Image.png

RRAM-based PUF: design and applications in cryptography

Description

The recent flurry of security breaches have raised serious concerns about the security of data communication and storage. A promising way to enhance the security of the system is through physical root of trust, such as, through use of physical

The recent flurry of security breaches have raised serious concerns about the security of data communication and storage. A promising way to enhance the security of the system is through physical root of trust, such as, through use of physical unclonable functions (PUF). PUF leverages the inherent randomness in physical systems to provide device specific authentication and encryption.

In this thesis, first the design of a highly reliable resistive random access memory (RRAM) PUF is presented. Compared to existing 1 cell/bit RRAM, here the sum of the read-out currents of multiple RRAM cells are used for generating one response bit. This method statistically minimizes any early-lifetime failure due to RRAM retention degradation at high temperature or under voltage stress. Using a device model that was calibrated using IMEC HfOx RRAM experimental data, it was shown that an 8 cells/bit architecture achieves 99.9999% reliability for a lifetime >10 years at 125℃ . Also, the hardware area overhead of the proposed 8 cells/bit RRAM PUF architecture was smaller than 1 cell/bit RRAM PUF that requires error correction coding to achieve the same reliability.

Next, a basic security primitive is presented, where the RRAM PUF is embedded in the cryptographic module, SHA-256. This architecture is referred to as Embedded PUF or EPUF. EPUF has a security advantage over SHA-256 as it never exposes the PUF response to the outside world. Instead, in each round, the PUF response is used to change a few bits of the message word to produce a unique message digest for each IC. The use of EPUF as a key generation module for AES is also shown. The hardware area requirement for SHA-256 and AES-128 is then analyzed using synthesis results based on TSMC 65nm library. It is shown that the area overhead of 8 cells/bit RRAM PUF is only 1.08% of the SHA-256 module and 0.04% of the AES-128 module. The security analysis of the PUF based systems is also presented. It is shown that the EPUF-based systems are resistant towards standard attacks on PUFs, and that the security of the cryptographic modules is not compromised.

Contributors

Agent

Created

Date Created
2015

151941-Thumbnail Image.png

Towards energy efficient computing with Linux: enabling task level power awareness and support for energy efficient accelerator

Description

With increasing transistor volume and reducing feature size, it has become a major design constraint to reduce power consumption also. This has given rise to aggressive architectural changes for on-chip power management and rapid development to energy efficient hardware accelerators.

With increasing transistor volume and reducing feature size, it has become a major design constraint to reduce power consumption also. This has given rise to aggressive architectural changes for on-chip power management and rapid development to energy efficient hardware accelerators. Accordingly, the objective of this research work is to facilitate software developers to leverage these hardware techniques and improve energy efficiency of the system. To achieve this, I propose two solutions for Linux kernel: Optimal use of these architectural enhancements to achieve greater energy efficiency requires accurate modeling of processor power consumption. Though there are many models available in literature to model processor power consumption, there is a lack of such models to capture power consumption at the task-level. Task-level energy models are a requirement for an operating system (OS) to perform real-time power management as OS time multiplexes tasks to enable sharing of hardware resources. I propose a detailed design methodology for constructing an architecture agnostic task-level power model and incorporating it into a modern operating system to build an online task-level power profiler. The profiler is implemented inside the latest Linux kernel and validated for Intel Sandy Bridge processor. It has a negligible overhead of less than 1\% hardware resource consumption. The profiler power prediction was demonstrated for various application benchmarks from SPEC to PARSEC with less than 4\% error. I also demonstrate the importance of the proposed profiler for emerging architectural techniques through use case scenarios, which include heterogeneous computing and fine grained per-core DVFS. Along with architectural enhancement in general purpose processors to improve energy efficiency, hardware accelerators like Coarse Grain reconfigurable architecture (CGRA) are gaining popularity. Unlike vector processors, which rely on data parallelism, CGRA can provide greater flexibility and compiler level control making it more suitable for present SoC environment. To provide streamline development environment for CGRA, I propose a flexible framework in Linux to do design space exploration for CGRA. With accurate and flexible hardware models, fine grained integration with accurate architectural simulator, and Linux memory management and DMA support, a user can carry out limitless experiments on CGRA in full system environment.

Contributors

Agent

Created

Date Created
2013

149992-Thumbnail Image.png

An analytical approach to efficient circuit variability analysis in scaled CMOS design

Description

Process variations have become increasingly important for scaled technologies starting at 45nm. The increased variations are primarily due to random dopant fluctuations, line-edge roughness and oxide thickness fluctuation. These variations greatly impact all aspects of circuit performance and pose a

Process variations have become increasingly important for scaled technologies starting at 45nm. The increased variations are primarily due to random dopant fluctuations, line-edge roughness and oxide thickness fluctuation. These variations greatly impact all aspects of circuit performance and pose a grand challenge to future robust IC design. To improve robustness, efficient methodology is required that considers effect of variations in the design flow. Analyzing timing variability of complex circuits with HSPICE simulations is very time consuming. This thesis proposes an analytical model to predict variability in CMOS circuits that is quick and accurate. There are several analytical models to estimate nominal delay performance but very little work has been done to accurately model delay variability. The proposed model is comprehensive and estimates nominal delay and variability as a function of transistor width, load capacitance and transition time. First, models are developed for library gates and the accuracy of the models is verified with HSPICE simulations for 45nm and 32nm technology nodes. The difference between predicted and simulated σ/μ for the library gates is less than 1%. Next, the accuracy of the model for nominal delay is verified for larger circuits including ISCAS'85 benchmark circuits. The model predicted results are within 4% error of HSPICE simulated results and take a small fraction of the time, for 45nm technology. Delay variability is analyzed for various paths and it is observed that non-critical paths can become critical because of Vth variation. Variability on shortest paths show that rate of hold violations increase enormously with increasing Vth variation.

Contributors

Agent

Created

Date Created
2011

150167-Thumbnail Image.png

Energy-efficient DSP system design based on the Redundant Binary number system

Description

Redundant Binary (RBR) number representations have been extensively used in the past for high-throughput Digital Signal Processing (DSP) systems. Data-path components based on this number system have smaller critical path delay but larger area compared to conventional two's complement systems.

Redundant Binary (RBR) number representations have been extensively used in the past for high-throughput Digital Signal Processing (DSP) systems. Data-path components based on this number system have smaller critical path delay but larger area compared to conventional two's complement systems. This work explores the use of RBR number representation for implementing high-throughput DSP systems that are also energy-efficient. Data-path components such as adders and multipliers are evaluated with respect to critical path delay, energy and Energy-Delay Product (EDP). A new design for a RBR adder with very good EDP performance has been proposed. The corresponding RBR parallel adder has a much lower critical path delay and EDP compared to two's complement carry select and carry look-ahead adder implementations. Next, several RBR multiplier architectures are investigated and their performance compared to two's complement systems. These include two new multiplier architectures: a purely RBR multiplier where both the operands are in RBR form, and a hybrid multiplier where the multiplicand is in RBR form and the other operand is represented in conventional two's complement form. Both the RBR and hybrid designs are demonstrated to have better EDP performance compared to conventional two's complement multipliers. The hybrid multiplier is also shown to have a superior EDP performance compared to the RBR multiplier, with much lower implementation area. Analysis on the effect of bit-precision is also performed, and it is shown that the performance gain of RBR systems improves for higher bit precision. Next, in order to demonstrate the efficacy of the RBR representation at the system-level, the performance of RBR and hybrid implementations of some common DSP kernels such as Discrete Cosine Transform, edge detection using Sobel operator, complex multiplication, Lifting-based Discrete Wavelet Transform (9, 7) filter, and FIR filter, is compared with two's complement systems. It is shown that for relatively large computation modules, the RBR to two's complement conversion overhead gets amortized. In case of systems with high complexity, for iso-throughput, both the hybrid and RBR implementations are demonstrated to be superior with lower average energy consumption. For low complexity systems, the conversion overhead is significant, and overpowers the EDP performance gain obtained from the RBR computation operation.

Contributors

Agent

Created

Date Created
2011

150423-Thumbnail Image.png

Dynamic waveform design for track-before-detect algorithms in radar

Description

In this thesis, an adaptive waveform selection technique for dynamic target tracking under low signal-to-noise ratio (SNR) conditions is investigated. The approach is integrated with a track-before-detect (TBD) algorithm and uses delay-Doppler matched filter (MF) outputs as raw measurements without

In this thesis, an adaptive waveform selection technique for dynamic target tracking under low signal-to-noise ratio (SNR) conditions is investigated. The approach is integrated with a track-before-detect (TBD) algorithm and uses delay-Doppler matched filter (MF) outputs as raw measurements without setting any threshold for extracting delay-Doppler estimates. The particle filter (PF) Bayesian sequential estimation approach is used with the TBD algorithm (PF-TBD) to estimate the dynamic target state. A waveform-agile TBD technique is proposed that integrates the PF-TBD with a waveform selection technique. The new approach predicts the waveform to transmit at the next time step by minimizing the predicted mean-squared error (MSE). As a result, the radar parameters are adaptively and optimally selected for superior performance. Based on previous work, this thesis highlights the applicability of the predicted covariance matrix to the lower SNR waveform-agile tracking problem. The adaptive waveform selection algorithm's MSE performance was compared against fixed waveforms using Monte Carlo simulations. It was found that the adaptive approach performed at least as well as the best fixed waveform when focusing on estimating only position or only velocity. When these estimates were weighted by different amounts, then the adaptive performance exceeded all fixed waveforms. This improvement in performance demonstrates the utility of the predicted covariance in waveform design, at low SNR conditions that are poorly handled with more traditional tracking algorithms.

Contributors

Agent

Created

Date Created
2011

150830-Thumbnail Image.png

Scheduling neural sensors to estimate brain activity

Description

Research on developing new algorithms to improve information on brain functionality and structure is ongoing. Studying neural activity through dipole source localization with electroencephalography (EEG) and magnetoencephalography (MEG) sensor measurements can lead to diagnosis and treatment of a brain disorder

Research on developing new algorithms to improve information on brain functionality and structure is ongoing. Studying neural activity through dipole source localization with electroencephalography (EEG) and magnetoencephalography (MEG) sensor measurements can lead to diagnosis and treatment of a brain disorder and can also identify the area of the brain from where the disorder has originated. Designing advanced localization algorithms that can adapt to environmental changes is considered a significant shift from manual diagnosis which is based on the knowledge and observation of the doctor, to an adaptive and improved brain disorder diagnosis as these algorithms can track activities that might not be noticed by the human eye. An important consideration of these localization algorithms, however, is to try and minimize the overall power consumption in order to improve the study and treatment of brain disorders. This thesis considers the problem of estimating dynamic parameters of neural dipole sources while minimizing the system's overall power consumption; this is achieved by minimizing the number of EEG/MEG measurements sensors without a loss in estimation performance accuracy. As the EEG/MEG measurements models are related non-linearity to the dipole source locations and moments, these dynamic parameters can be estimated using sequential Monte Carlo methods such as particle filtering. Due to the large number of sensors required to record EEG/MEG Measurements for use in the particle filter, over long period recordings, a large amounts of power is required for storage and transmission. In order to reduce the overall power consumption, two methods are proposed. The first method used the predicted mean square estimation error as the performance metric under the constraint of a maximum power consumption. The performance metric of the second method uses the distance between the location of the sensors and the location estimate of the dipole source at the previous time step; this sensor scheduling scheme results in maximizing the overall signal-to-noise ratio. The performance of both methods is demonstrated using simulated data, and both methods show that they can provide good estimation results with significant reduction in the number of activated sensors at each time step.

Contributors

Agent

Created

Date Created
2012

149422-Thumbnail Image.png

3D modeling using multi-view images

Description

There is a growing interest in the creation of three-dimensional (3D) images and videos due to the growing demand for 3D visual media in commercial markets. A possible solution to produce 3D media files is to convert existing 2D images

There is a growing interest in the creation of three-dimensional (3D) images and videos due to the growing demand for 3D visual media in commercial markets. A possible solution to produce 3D media files is to convert existing 2D images and videos to 3D. The 2D to 3D conversion methods that estimate the depth map from 2D scenes for 3D reconstruction present an efficient approach to save on the cost of the coding, transmission and storage of 3D visual media in practical applications. Various 2D to 3D conversion methods based on depth maps have been developed using existing image and video processing techniques. The depth maps can be estimated either from a single 2D view or from multiple 2D views. This thesis presents a MATLAB-based 2D to 3D conversion system from multiple views based on the computation of a sparse depth map. The 2D to 3D conversion system is able to deal with the multiple views obtained from uncalibrated hand-held cameras without knowledge of the prior camera parameters or scene geometry. The implemented system consists of techniques for image feature detection and registration, two-view geometry estimation, projective 3D scene reconstruction and metric upgrade to reconstruct the 3D structures by means of a metric transformation. The implemented 2D to 3D conversion system is tested using different multi-view image sets. The obtained experimental results of reconstructed sparse depth maps of feature points in 3D scenes provide relative depth information of the objects. Sample ground-truth depth data points are used to calculate a scale factor in order to estimate the true depth by scaling the obtained relative depth information using the estimated scale factor. It was found out that the obtained reconstructed depth map is consistent with the ground-truth depth data.

Contributors

Agent

Created

Date Created
2010

152892-Thumbnail Image.png

Constrained energy optimization in heterogeneous platforms using generalized scaling models

Description

Mobile platforms are becoming highly heterogeneous by combining a powerful multiprocessor system-on-chip (MpSoC) with numerous resources including display, memory, power management IC (PMIC), battery and wireless modems into a compact package. Furthermore, the MpSoC itself is a heterogeneous resource that

Mobile platforms are becoming highly heterogeneous by combining a powerful multiprocessor system-on-chip (MpSoC) with numerous resources including display, memory, power management IC (PMIC), battery and wireless modems into a compact package. Furthermore, the MpSoC itself is a heterogeneous resource that integrates many processing elements such as CPU cores, GPU, video, image, and audio processors. As a result, optimization approaches targeting mobile computing needs to consider the platform at various levels of granularity.

Platform energy consumption and responsiveness are two major considerations for mobile systems since they determine the battery life and user satisfaction, respectively. In this work, the models for power consumption, response time, and energy consumption of heterogeneous mobile platforms are presented. Then, these models are used to optimize the energy consumption of baseline platforms under power, response time, and temperature constraints with and without introducing new resources. It is shown, the optimal design choices depend on dynamic power management algorithm, and adding new resources is more energy efficient than scaling existing resources alone. The framework is verified through actual experiments on Qualcomm Snapdragon 800 based tablet MDP/T. Furthermore, usage of the framework at both design and runtime optimization is also presented.

Contributors

Agent

Created

Date Created
2014