Unsupervised Bayesian data cleaning techniques for structured data

Description

Recent efforts in data cleaning have focused mostly on problems like data deduplication, record matching, and data standardization; few of these focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like CFDs (which have to be provided by domain experts, or learned from a clean sample of the database). In this thesis, I provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. I thus avoid the necessity for a domain expert or master data. I also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. A Map-Reduce architecture to perform this computation in a distributed manner is also shown. I evaluate these methods over both synthetic and real data.
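
The core idea of correcting a value with a generative model plus an error model can be sketched as a posterior maximization: pick the candidate c that maximizes P(c) * P(observed | c). The sketch below is only an illustration under stated assumptions, not the method of the thesis: the prior is taken from value frequencies in the noisy column itself, and the error model (exponential decay in edit distance, with a hypothetical `noise` parameter) is invented for the example.

```python
from collections import Counter
import math

def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]

def most_likely_clean(observed, column_values, noise=1.0):
    """Return the candidate c maximizing prior(c) * likelihood(observed | c)."""
    prior = Counter(column_values)  # unnormalized prior, learned from the noisy data
    def score(c):
        # Assumed error model: typos become exponentially less likely with distance.
        likelihood = math.exp(-noise * edit_distance(observed, c))
        return prior[c] * likelihood
    return max(prior, key=score)

column = ["Honda"] * 8 + ["Toyota"] * 6 + ["Hondq"]
print(most_likely_clean("Hondq", column))  # prints "Honda"
```

Because the frequent value "Honda" is one edit away from "Hondq", its higher prior outweighs the exact match of the rare misspelling.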

Contributors: De, Sushovan (Author) / Kambhampati, Subbarao (Thesis advisor) / Chen, Yi (Committee member) / Candan, K. Selcuk (Committee member) / Liu, Huan (Committee member) / Arizona State University (Publisher)

Created: 2014

Simultaneous variable and feature group selection in heterogeneous learning: optimization and applications

Description

Advances in data collection technologies have made it cost-effective to obtain heterogeneous data from multiple data sources. Very often, the data are of very high dimension and feature selection is preferred in order to reduce noise, save computational cost and learn interpretable models. Due to the multi-modal nature of heterogeneous data, it is interesting to design efficient machine learning models that are capable of performing variable selection and feature group (data source) selection simultaneously (a.k.a. bi-level selection). In this thesis, I carry out research along this direction with a particular focus on designing efficient optimization algorithms. I start with a unified bi-level learning model that contains several existing feature selection models as special cases. The proposed model is then extended to handle block-wise missing data, one of the major challenges in the diagnosis of Alzheimer's Disease (AD). Moreover, I propose a novel interpretable sparse group feature selection model that greatly facilitates parameter tuning and model selection. Last but not least, I show that by solving the sparse group hard thresholding problem directly, the sparse group feature selection model can be further improved in terms of both algorithmic complexity and efficiency. Promising results are demonstrated in an extensive evaluation on multiple real-world data sets.
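
To make "bi-level selection" concrete, the sketch below uses the proximal operator of the well-known convex sparse group lasso penalty (an l1 term plus a per-group l2 term) with non-overlapping groups. It is an illustration only and is not claimed to be the model or the hard thresholding algorithm of the thesis; the weights, groups, and the parameters `lam1` and `lam2` are assumptions made up for the example.

```python
import math

def soft_threshold(x, t):
    """Shrink x toward zero by t (proximal operator of t * |x|)."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def sparse_group_prox(w, groups, lam1, lam2):
    """Prox of lam1 * ||w||_1 + lam2 * sum_g ||w_g||_2 for non-overlapping groups.

    Step 1 zeroes individual variables; step 2 can zero a whole group.
    """
    out = [0.0] * len(w)
    for idx in groups:
        u = [soft_threshold(w[i], lam1) for i in idx]      # variable-level selection
        norm = math.sqrt(sum(v * v for v in u))
        scale = max(0.0, 1.0 - lam2 / norm) if norm > 0 else 0.0  # group-level selection
        for i, v in zip(idx, u):
            out[i] = scale * v
    return out

w = [3.0, -0.2, 0.1, 0.3, -0.1, 0.2]
groups = [[0, 1, 2], [3, 4, 5]]   # e.g., two hypothetical data sources
result = sparse_group_prox(w, groups, lam1=0.15, lam2=0.5)
print(result)
```

In this example the second group is removed entirely, while within the first group only the weakest variable is set to zero, which is the two-level behavior the abstract describes.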

Contributors: Xiang, Shuo (Author) / Ye, Jieping (Thesis advisor) / Mittelmann, Hans D (Committee member) / Davulcu, Hasan (Committee member) / He, Jingrui (Committee member) / Arizona State University (Publisher)

Created: 2014

Test algebra for concurrent combinatorial testing

Description

A new algebraic system, Test Algebra (TA), is proposed for identifying faults in combinatorial testing for SaaS (Software-as-a-Service) applications. In the context of cloud computing, SaaS is a new software delivery model, in which mission-critical applications are composed, deployed, and executed on cloud platforms. Testing SaaS applications is challenging because new applications need to be tested once they are composed, and prior to their deployment. A composition of components providing services yields a configuration providing a SaaS application. While individual components in the configuration may have been thoroughly tested, faults still arise due to interactions among the components composed, making the configuration faulty. When there are k components, combinatorial testing algorithms can be used to identify faulty interactions for t or fewer components, for some threshold 2 <= t <= k on the size of interactions considered. In general these methods do not identify specific faults, but rather indicate the presence or absence of some fault. To identify specific faults, an adaptive testing regime repeatedly constructs and tests configurations in order to determine, for each interaction of interest, whether it is faulty or not. In order to perform such testing in a loosely coupled distributed environment such as the cloud, it is imperative that testing results can be combined from many different servers. The TA defines rules to permit results to be combined, and to identify the faulty interactions. Using the TA, configurations can be tested concurrently on different servers and in any order. The results, using the TA, remain the same.
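
The order-independence property can be illustrated with a merge operation that is commutative, associative, and idempotent. The abstract does not state the TA rules, so the three statuses and their priority ranking below are hypothetical stand-ins chosen only to show why such a merge gives the same result regardless of the order in which server reports arrive.

```python
from functools import reduce
from itertools import permutations

# Hypothetical statuses and priorities; not the actual TA rule set.
RANK = {"untested": 0, "passed": 1, "faulty": 2}

def merge_status(a, b):
    """Keep the higher-priority verdict; max over a total order is order-independent."""
    return a if RANK[a] >= RANK[b] else b

def merge_reports(x, y):
    """Combine two per-interaction reports produced by different servers."""
    keys = set(x) | set(y)
    return {k: merge_status(x.get(k, "untested"), y.get(k, "untested")) for k in keys}

reports = [
    {("A", "B"): "passed", ("A", "C"): "untested"},
    {("A", "C"): "faulty"},
    {("A", "B"): "passed", ("B", "C"): "passed"},
]

# Merge the three reports in every possible arrival order.
results = [reduce(merge_reports, order, {}) for order in permutations(reports)]
combined = results[0]
print(all(r == combined for r in results))  # prints True
```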

Contributors: Qi, Guanqiu (Author) / Tsai, Wei-Tek (Thesis advisor) / Davulcu, Hasan (Committee member) / Sarjoughian, Hessam S. (Committee member) / Yu, Hongyu (Committee member) / Arizona State University (Publisher)

Created: 2014

Personalized POI recommendation on location-based social networks

Description

Rapid urban expansion has greatly extended the physical boundary of our living area, and a large number of POIs (points of interest) have been developed along with it. A POI is a specific location (e.g., hotel, restaurant, theater, mall) that a user may find useful or interesting. For people exploring a city or neighborhood, the growing number of POIs enriches daily life by offering more choices than before, but it also brings a "curse of choices": it becomes difficult for a user to decide efficiently and satisfactorily "where to go". Personalized POI recommendation aims to help users filter out uninteresting POIs and reduce decision-making time, which could also benefit virtual marketing.

Developing POI recommender systems requires observation of human mobility w.r.t. real-world POIs, which is infeasible with traditional mobile data. However, the recent development of location-based social networks (LBSNs) provides such observation. Typical location-based social networking sites allow users to "check in" at POIs with smartphones, leave tips and share that experience with their online friends. The increasing number of LBSN users has generated large amounts of LBSN data, providing an unprecedented opportunity to study human mobility for personalized POI recommendation in spatial, temporal, social, and content aspects.

Unlike recommender systems in other categories, e.g., movie recommendation on Netflix, friend recommendation on dating websites, and item recommendation on online shopping sites, personalized POI recommendation on LBSNs faces unique challenges due to the stochastic nature of human mobility and the mobile behavior indications provided by the LBSN information layout. The strong correlations between geographical POI information and other LBSN information result in three major human mobility properties, i.e., geo-social correlations, geo-temporal patterns, and geo-content indications, which are neither observed in other recommender systems nor exploited in current POI recommendation. In this dissertation, we investigate these properties on LBSNs and propose personalized POI recommendation models accordingly. The performance evaluated on real-world LBSN datasets validates the power of these properties in capturing user mobility and demonstrates the ability of our models to provide personalized POI recommendation.

Contributors: Gao, Huiji (Author) / Liu, Huan (Thesis advisor) / Xue, Guoliang (Committee member) / Ye, Jieping (Committee member) / Caverlee, James (Committee member) / Arizona State University (Publisher)

Created: 2014

Establishing the software-defined networking based defensive system in clouds

Description

Cloud computing is regarded as one of the most revolutionary technologies of the past decades. It provides scalable, flexible and secure resource provisioning services, which is why users prefer to migrate their locally processed workloads onto remote clouds. Besides commercial cloud systems (e.g., Amazon EC2), ProtoGENI and PlanetLab have further improved the current Internet-based resource provisioning system by allowing end users to construct a virtual networking environment. To achieve a similar goal with more flexibility and efficiency, I present the design and implementation of MobiCloud, a geo-distributed mobile cloud computing platform, and G-PLaNE, which focuses on constructing the virtual networking environment on top of a self-designed resource provisioning system consisting of multiple geo-distributed clusters. Furthermore, I conduct a comprehensive study to lay out existing Mobile Cloud Computing (MCC) service models and the corresponding representative related work. A new user-centric mobile cloud computing service model is proposed to advance existing mobile cloud computing research.

After building MobiCloud and G-PLaNE and studying the MCC model, I have been using Software Defined Networking (SDN) approaches to enhance system security in the cloud virtual networking environment. I present an OpenFlow-based IPS solution called SDNIPS that includes a new IPS architecture based on Open vSwitch (OVS) in the cloud software-based networking environment. It provides elastic service provisioning and Network Reconfiguration (NR) features based on the POX controller. A thorough evaluation demonstrates the feasibility of SDNIPS and shows that it is more efficient than traditional approaches.

Finally, I propose an OpenFlow-based defensive module composition framework called CloudArmour that can perform query, aggregation, analysis, and control functions over distributed OpenFlow-enabled devices. I propose several modules and use the DDoS attack as an example to illustrate how to compose a comprehensive defensive solution based on the CloudArmour framework. I introduce a total of 20 Python-based CloudArmour APIs. Evaluation results demonstrate the feasibility and efficiency of the CloudArmour framework.

Contributors: Xing, Tianyi (Author) / Huang, Dijiang (Thesis advisor) / Xue, Guoliang (Committee member) / Sen, Arunabha (Committee member) / Medhi, Deepankar (Committee member) / Arizona State University (Publisher)

Created: 2014

Discovering and using patterns for countering security challenges

Description

Most existing security decisions for both defending and attacking are based on deterministic approaches that only give binary answers. Even though these approaches can achieve a low false positive rate, they have high false negative rates because they do not accommodate new attack methods and defense techniques. In this dissertation, I study how to discover and use patterns with uncertainty and randomness to counter security challenges. By extracting and modeling patterns in security events, I am able to handle previously unknown security events with quantified confidence, rather than simply making binary decisions. In particular, I address the following four real-world security challenges with pattern-based modeling and analysis: 1) How to detect and attribute previously unknown shellcode? I propose instruction sequence abstraction, which extracts coarse-grained patterns from an instruction sequence, and use a Markov chain-based model and support vector machines to detect and attribute shellcode; 2) How to safely mitigate routing attacks in mobile ad hoc networks? I identify routing table change patterns caused by attacks, propose an extended Dempster-Shafer theory to measure the risk of such changes, and use a risk-aware response mechanism to mitigate routing attacks; 3) How to model, understand, and guess human-chosen picture passwords? I analyze collected human-chosen picture passwords, propose a selection function that models patterns in password selection, and design two algorithms to optimize password guessing paths; and 4) How to identify influential figures and events in underground social networks? I analyze collected underground social network data, identify user interaction patterns, and propose a suite of measures for systematically discovering and mining adversarial evidence. By solving these four problems, I demonstrate that discovering and using patterns could help deal with challenges in computer security, network security, human-computer interaction security, and social network security.

Contributors: Zhao, Ziming (Author) / Ahn, Gail-Joon (Thesis advisor) / Yau, Stephen S. (Committee member) / Huang, Dijiang (Committee member) / Santanam, Raghu (Committee member) / Arizona State University (Publisher)

Created: 2014

Connectivity control for quad-dominant meshes

Description

Quad-dominant (QD) meshes, i.e., three-dimensional, 2-manifold polygonal meshes comprising mostly four-sided faces (i.e., quads), are a popular choice for many applications such as polygonal shape modeling, computer animation, base meshes for spline and subdivision surface, simulation, and architectural design. This thesis investigates the topic of connectivity control, i.e., exploring different choices of mesh connectivity to represent the same 3D shape or surface. One key concept of QD mesh connectivity is the distinction between regular and irregular elements: a vertex with valence 4 is regular; otherwise, it is irregular. In a similar sense, a face with four sides is regular; otherwise, it is irregular. For QD meshes, the placement of irregular elements is especially important since it largely determines the achievable geometric quality of the final mesh.
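
The regular/irregular distinction defined above can be checked directly from a face list. The following is a minimal sketch, assuming a closed 2-manifold mesh given as tuples of vertex indices; the function names and the cube example are illustrative and do not come from the thesis.

```python
from collections import defaultdict

def vertex_valences(faces):
    """Valence of each vertex = number of distinct edge-adjacent neighbours."""
    neighbours = defaultdict(set)
    for face in faces:
        for i, v in enumerate(face):
            w = face[(i + 1) % len(face)]   # next vertex around the face
            neighbours[v].add(w)
            neighbours[w].add(v)
    return {v: len(ns) for v, ns in neighbours.items()}

def irregular_elements(faces):
    """Return (irregular vertex ids, irregular face indices) for a closed mesh."""
    valence = vertex_valences(faces)
    irregular_vertices = sorted(v for v, k in valence.items() if k != 4)
    irregular_faces = [i for i, f in enumerate(faces) if len(f) != 4]
    return irregular_vertices, irregular_faces

# A cube: 6 quad faces, 8 vertices, each vertex touching 3 edges.
cube = [(0, 1, 2, 3), (4, 5, 6, 7), (0, 1, 5, 4),
        (1, 2, 6, 5), (2, 3, 7, 6), (3, 0, 4, 7)]
print(irregular_elements(cube))  # prints ([0, 1, 2, 3, 4, 5, 6, 7], [])
```

The cube is a pure quad mesh (no irregular faces), yet all eight of its vertices have valence 3 and are therefore irregular.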

Traditionally, the research on QD meshes focuses on the automatic generation of pure quadrilateral or QD meshes from a given surface. Explicit control of the placement of irregular elements can only be achieved indirectly. To fill this gap, in this thesis, we make the following contributions. First, we formulate the theoretical background about the fundamental combinatorial properties of irregular elements in QD meshes. Second, we develop algorithms for the explicit control of irregular elements and the exhaustive enumeration of QD mesh connectivities. Finally, we demonstrate the importance of connectivity control for QD meshes in a wide range of applications.

Contributors: Peng, Chi-Han (Author) / Wonka, Peter (Thesis advisor) / Maciejewski, Ross (Committee member) / Farin, Gerald (Committee member) / Razdan, Anshuman (Committee member) / Arizona State University (Publisher)

Created: 2014

Sustainable cloud computing

Description

Energy consumption of data centers worldwide is growing rapidly, fueled by ever-increasing demand for Cloud computing applications ranging from social networking to e-commerce. Understandably, ensuring energy efficiency and sustainability of Cloud data centers without compromising performance is important for both economic and environmental reasons. This dissertation develops a cyber-physical multi-tier server and workload management architecture which operates at the local and the global (geo-distributed) data center level. We devise optimization frameworks for each tier to optimize energy consumption, energy cost and carbon footprint of the data centers. The proposed solutions are aware of various energy management tradeoffs that arise from the cyber-physical interactions in data centers, while providing provable guarantees on the solutions' computation efficiency and energy/cost efficiency. The local data center level energy management takes into account the impact of server consolidation on the cooling energy, avoids the cooling-computing power tradeoff, and optimizes the total energy (computing and cooling energy) considering the data centers' technology trends (servers' power proportionality and cooling system power efficiency). The global data center level cost management exploits the diversity of the data centers to minimize the utility cost while satisfying the carbon cap requirement of the Cloud and coping with prediction errors in the data center parameters. Finally, the synergy of the local and the global data center energy and cost optimization is shown to help achieve carbon neutrality (net-zero) in a cost-efficient manner.

Contributors: Abbasi, Zahra (Author) / Gupta, Sandeep K. S. (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Shrivastava, Aviral (Committee member) / Wu, Carole-Jean (Committee member) / Arizona State University (Publisher)

Created: 2014

Efficient processing of skyline queries on static data sources, data streams and incomplete datasets

Description

Skyline queries extract interesting points that are non-dominated and help paint the bigger picture of the data in question. They are valuable in many multi-criteria decision applications and are becoming a staple of decision support systems.
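
As a small illustration of what "non-dominated" means, the sketch below computes a skyline with the textbook quadratic-time definition, assuming smaller values are better in every dimension. It is not one of the algorithms proposed in the dissertation, and the hotel data are made up for the example.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Keep the points that no other point dominates (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical hotels as (price, distance to the beach in km).
hotels = [(50, 8), (80, 2), (60, 9), (120, 1), (90, 3)]
print(skyline(hotels))  # prints [(50, 8), (80, 2), (120, 1)]
```

Here (60, 9) is dominated by (50, 8) and (90, 3) by (80, 2), so neither belongs to the skyline.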

An assumption commonly made by many skyline algorithms is that a skyline query is applied to a single static data source or data stream. Unfortunately, this assumption does not hold in many applications in which a skyline query may involve attributes belonging to multiple data sources and requires a join operation to be performed before the skyline can be produced. Recently, various skyline-join algorithms have been proposed to address this problem in the context of static data sources. However, these algorithms suffer from several drawbacks: they often need to scan the data sources exhaustively to obtain the skyline-join results; moreover, the pruning techniques employed to eliminate tuples are largely based on expensive tuple-to-tuple comparisons. On the other hand, most data stream techniques focus on single stream skyline queries, thus rendering them unsuitable for skyline-join queries.

Another assumption typically made by most of the earlier skyline algorithms is that the data is complete and all skyline attribute values are available. Due to this constraint, these algorithms cannot be applied to incomplete data sources in which some of the attribute values are missing and are represented by NULL values. There exists a definition of dominance for incomplete data, but it leads to undesirable consequences such as non-transitive and cyclic dominance relations, both of which are detrimental to skyline processing.
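
The cyclic dominance problem can be reproduced with three points. The sketch below assumes the commonly used convention that points are compared only on the dimensions where both have known values (smaller is better, `None` standing in for NULL); the abstract does not spell out the definition it refers to, so this convention is an assumption.

```python
def dominates_incomplete(p, q):
    """Compare p and q only on dimensions where both values are known."""
    common = [(a, b) for a, b in zip(p, q) if a is not None and b is not None]
    return (bool(common)
            and all(a <= b for a, b in common)
            and any(a < b for a, b in common))

a = (1, None, 3)
b = (None, 2, 1)
c = (2, 1, None)

print(dominates_incomplete(a, c))  # True: compared on dimension 0 only
print(dominates_incomplete(c, b))  # True: compared on dimension 1 only
print(dominates_incomplete(b, a))  # True: compared on dimension 2 only
```

Since a dominates c, c dominates b, and b dominates a, the relation is cyclic and not transitive, so a naive filter that discards every dominated point would return an empty skyline.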

Based on the aforementioned observations, the main goal of the research described in this dissertation is the design and development of a framework of skyline operators that effectively handles three distinct types of skyline queries: 1) skyline-join queries on static data sources, 2) skyline-window-join queries over data streams, and 3) strata-skyline queries on incomplete datasets. This dissertation presents the unique challenges posed by these skyline queries and addresses the shortcomings of current skyline techniques by proposing efficient methods to tackle the added overhead in processing skyline queries on static data sources, data streams, and incomplete datasets.

Contributors: Nagendra, Mithila (Author) / Candan, Kasim Selcuk (Thesis advisor) / Chen, Yi (Committee member) / Davulcu, Hasan (Committee member) / Silva, Yasin N. (Committee member) / Sundaram, Hari (Committee member) / Arizona State University (Publisher)

Created: 2014

Graph-based sparse learning: models, algorithms, and applications

Description

Sparse learning is a powerful tool for generating highly interpretable models of high-dimensional data, and it has many important applications in areas such as bioinformatics, medical image processing, and computer vision. Recently, a priori structural information has been shown to be powerful for improving the performance of sparse learning models. A graph is a fundamental way to represent structural information of features. This dissertation focuses on graph-based sparse learning. The first part of this dissertation aims to integrate a graph into sparse learning to improve performance. Specifically, the problem of feature grouping and selection over a given undirected graph is considered. Three models are proposed along with efficient solvers to achieve simultaneous feature grouping and selection, enhancing estimation accuracy. One major challenge is that solving large-scale graph-based sparse learning problems remains computationally expensive. An efficient, scalable, and parallel algorithm for one widely used graph-based sparse learning approach, called anisotropic total variation regularization, is therefore proposed by explicitly exploiting the structure of a graph. The second part of this dissertation focuses on uncovering the graph structure from the data. Two issues in graphical modeling are considered. One is the joint estimation of multiple graphical models using a fused lasso penalty, and the other is the estimation of hierarchical graphical models. The key technical contribution is to establish the necessary and sufficient condition for the graphs to be decomposable. Based on this key property, a simple screening rule is presented, which reduces the size of the optimization problem and dramatically reduces the computational cost.

Contributors: Yang, Sen (Author) / Ye, Jieping (Thesis advisor) / Wonka, Peter (Thesis advisor) / Wang, Yalin (Committee member) / Li, Jing (Committee member) / Arizona State University (Publisher)

Created: 2014