Mining Data with Feature Interactions

Description

Models using feature interactions have been applied successfully in many areas, such as biomedical analysis and recommender systems. The popularity of feature interactions stems mainly from two properties: (1) they capture nonlinearity in the data that linear effects cannot, and (2) they are highly interpretable. In this thesis, I propose a series of formulations using feature interactions for real-world problems and develop efficient algorithms for solving them.

Specifically, I first propose to directly solve the non-convex formulation of the weak hierarchical Lasso, which imposes weak hierarchy on individual features and interactions but has only been solved approximately, through a convex relaxation, in existing studies. I further propose to use the non-convex weak hierarchical Lasso formulation for hypothesis testing on interaction features under hierarchical assumptions. Second, I propose a type of bi-linear model that takes advantage of feature interactions for drug discovery problems where specific drug-drug or drug-disease pairs are of interest. These models are learned by maximizing the number of positive data pairs that rank above the average score of unlabeled data pairs. I then generalize the method to use the top-ranked unlabeled data pairs for representative construction and derive an efficient algorithm for the extended formulation. Finally, motivated by a special form of bi-linear models, I propose a framework that simultaneously subgroups data points and builds specific models on the subgroups for learning on massive and heterogeneous datasets. Experiments on synthetic and real datasets are conducted to demonstrate the effectiveness or efficiency of the proposed methods.

Contributors: Liu, Yashu (Author) / Ye, Jieping (Thesis advisor) / Xue, Guoliang (Thesis advisor) / Liu, Huan (Committee member) / Mittelmann, Hans D (Committee member) / Arizona State University (Publisher)

Created: 2018

Comparing a commercial and an SDN-based load balancer in a campus network

Description

Commercial load balancers are often in use, and the production network at Arizona State University (ASU) is no exception. However, because the load balancer uses IP addresses, the solution does not apply to all applications. One such application is Rsyslog. This software processes syslog packets and stores them in files. The loss rate of incoming log packets is high due to the incoming rate of the data: the Rsyslog servers are overwhelmed by the continuous data stream. To solve this problem, a software defined networking (SDN) based load balancer is designed to perform transport-level load balancing over the incoming load to the Rsyslog servers. In this solution the load is forwarded to one Rsyslog server at a time, according to a Round-Robin, Random, or Load-Based policy. This gives the other servers time to process the data they have received and prevents them from being overwhelmed. The evaluation of the proposed solution is conducted on a physical testbed with the same data feed as the commercial solution. The results suggest that the SDN-based load balancer is competitive with the commercial load balancer. Replacing the software OpenFlow switch with a hardware switch is likely to further improve the results.
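
The three forwarding policies named in this abstract can be illustrated with a minimal sketch. This is not the thesis's implementation, which runs on an OpenFlow-based SDN testbed; the class name, the method names, the server labels, and the use of a per-server number as the "load" metric are all assumptions made here for illustration.

```python
import itertools
import random


class PolicySelector:
    """Hypothetical helper that picks one backend server per decision."""

    def __init__(self, servers, seed=None):
        if not servers:
            raise ValueError("at least one server is required")
        self.servers = list(servers)
        # Endless iterator over the servers, in their given order.
        self._cycle = itertools.cycle(self.servers)
        # Private random generator so results can be reproduced with a seed.
        self._rng = random.Random(seed)

    def round_robin(self):
        """Return the next server in fixed rotation."""
        return next(self._cycle)

    def random_choice(self):
        """Return a server chosen uniformly at random."""
        return self._rng.choice(self.servers)

    def load_based(self, load):
        """Return the server with the smallest reported load.

        `load` maps server -> current load (an assumed metric, e.g. the
        number of messages still waiting to be processed).
        """
        return min(self.servers, key=lambda server: load[server])


if __name__ == "__main__":
    selector = PolicySelector(["rsyslog-1", "rsyslog-2", "rsyslog-3"], seed=0)
    print([selector.round_robin() for _ in range(4)])
    print(selector.load_based({"rsyslog-1": 120, "rsyslog-2": 15, "rsyslog-3": 60}))
```

In a real deployment the chosen server would be installed as a forwarding rule on the switch rather than returned from a function; the sketch only shows how the three selection rules differ.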

Contributors: Ghaffarinejad, Ashkan (Author) / Syrotiuk, Violet R. (Thesis advisor) / Xue, Guoliang (Committee member) / Huang, Dijiang (Committee member) / Arizona State University (Publisher)

Created: 2015

Making thin data thick: user behavior analysis with minimum information

Description

With the rise of social media, user-generated content has become available at an unprecedented scale. On Twitter, 1 billion tweets are posted every 5 days, and on Facebook, 20 million links are shared every 20 minutes. These massive collections of user-generated content have given rise to the big data of human behavior.

This big data has brought about countless opportunities for analyzing human behavior at scale. However, is this data enough? Unfortunately, the data available at the individual level is limited for most users. This limited individual-level data is often referred to as thin data. Hence, researchers face a big-data paradox, in which big data is a large collection of mostly limited individual-level information. Researchers are often constrained to derive meaningful insights about online user behavior from this limited information. Simply put, they have to make thin data thick.

This dissertation investigates how the thin data of human behavior can be made thick. Its chief objective is to demonstrate how traces of human behavior can be efficiently gleaned from the often limited individual-level information, thereby introducing an all-inclusive user behavior analysis methodology that considers social media users with different levels of information availability. To that end, the absolute minimum information, in terms of link or content data, that is available for any social media user is determined. Using only minimum information in social media applications such as prediction or recommendation tasks allows for solutions that are (1) generalizable to all social media users and (2) easy to implement. However, are applications that employ only minimum information as effective as, or comparable to, applications that use more information?

In this dissertation, it is shown that common research challenges such as detecting malicious users or friend recommendation (i.e., link prediction) can be effectively performed using only minimum information. More importantly, it is demonstrated that unique user identification can be achieved using minimum information. Theoretical boundaries of unique user identification are obtained by introducing social signatures. Social signatures allow for user identification in any large-scale network on social media. The results on single-site user identification are generalized to multiple sites and it is shown how the same user can be uniquely identified across multiple sites using only minimum link or content information.

The findings in this dissertation allow the same user to be found across multiple sites, which in turn has multiple implications. In particular, by identifying the same users across sites, (1) patterns that users exhibit across sites are identified, (2) how user behavior varies across sites is determined, and (3) activities that are observed only across sites are identified and studied.

Contributors: Zafarani, Reza, 1983- (Author) / Liu, Huan (Thesis advisor) / Kambhampati, Subbarao (Committee member) / Xue, Guoliang (Committee member) / Leskovec, Jure (Committee member) / Arizona State University (Publisher)

Created: 2015

Mobile cloud application framework and offloading strategies

Description

Mobile cloud computing has shown its capability to support mobile devices by provisioning computing, storage, and communication resources. A distributed mobile cloud service system called "POEM" is presented to manage mobile cloud resources and compose mobile cloud applications. POEM considers resource management not only between mobile devices and clouds, but also among mobile devices. It implements both computation offloading and service composition features. The proposed POEM solution is demonstrated using OSGi and XMPP techniques.

Offloading is one major type of collaboration between mobile devices and the cloud, used to reduce execution time and energy consumption. Offloading decisions for mobile cloud collaboration involve many decision factors. One important decision factor is network unavailability. This report presents an offloading decision model that takes network unavailability into consideration. The application execution time and energy consumption are analyzed both in an ideal network and in a network with some unavailability. Based on the presented theoretical model, an application partition algorithm and a decision module are presented to produce an offloading decision that is resistant to network unavailability.

Existing offloading models mainly focus on the one-to-one offloading relation. To address multi-factor and multi-site offloading scenarios in mobile cloud applications, a multi-factor multi-site risk-based offloading model is presented, which abstracts the offloading impact factors as offloading benefit and offloading risk. The offloading decision is made based on a comprehensive offloading risk evaluation. The presented model is generic and expandable. Four offloading impact factors are presented to show the construction and operation of the model, which can be easily extended to incorporate more factors to make the offloading decision more comprehensive. The overall offloading benefits and risks are aggregated based on the mobile cloud users' preferences.

The offloading topology may change over the life of the application. A set of algorithms is presented to address the service topology reconfiguration problem in several representative mobile cloud application scenarios: finite horizon scenarios, infinite horizon scenarios, and large state space scenarios, which represent ad hoc, long-term, and large-scale mobile cloud service composition scenarios, respectively.

Contributors: Wu, Huijun (Author) / Huang, Dijiang (Thesis advisor) / Xue, Guoliang (Committee member) / Dasgupta, Partha (Committee member) / Mirchandani, Pitu (Committee member) / Arizona State University (Publisher)

Created: 2016

Toward customizable multi-tenant SaaS applications

Description

Computing is now so pervasive that it has indeed become the fifth utility (after water, electricity, gas, and telephony), as Leonard Kleinrock once envisioned. Evolved from utility computing, cloud computing has emerged as a computing infrastructure that enables rapid delivery of computing resources as a utility in a dynamically scalable, virtualized manner. However, current industrial cloud computing implementations promote segregation among different cloud providers, which leads to user lockdown because of prohibitive migration costs. On the other hand, Service-Oriented Computing (SOC), including service-oriented architecture (SOA) and Web Services (WS), promotes standardization and openness through its enabling standards and communication protocols. This thesis proposes a Service-Oriented Cloud Computing Architecture (SOCCA) that combines the best attributes of the two paradigms to promote an open, interoperable environment for cloud computing development. Multi-tenancy SaaS applications built on top of SOCCA have more flexibility and are not locked down by a particular platform. Each tenant of a multi-tenant application appears to be the sole owner of the application and is not aware of the existence of others. A multi-tenant SaaS application accommodates each tenant’s unique requirements by allowing tenant-level customization. A complex SaaS application that supports hundreds or even thousands of tenants could have hundreds of customization points, each providing multiple options, and this could result in a huge number of ways to customize the application. This dissertation also proposes innovative customization approaches, which study similar tenants’ customization choices and each individual user's behaviors, and then provide a guided, semi-automated customization process for future tenants. A semi-automated customization process could enable tenants to quickly implement the customization that best suits their business needs.

Contributors: Sun, Xin (Author) / Tsai, Wei-Tek (Thesis advisor) / Xue, Guoliang (Committee member) / Davulcu, Hasan (Committee member) / Sarjoughian, Hessam S. (Committee member) / Arizona State University (Publisher)

Created: 2016

Optimal resource allocation in social and critical infrastructure networks

Description

We live in a networked world with a multitude of networks, such as communication networks, the electric power grid, transportation networks, and water distribution networks, all around us. In addition to such physical (infrastructure) networks, recent years have seen a tremendous proliferation of social networks, such as Facebook, Twitter, LinkedIn, Instagram, Google+, and others. These powerful social networks are not only used for harnessing revenue from the infrastructure networks, but are also increasingly being used as “non-conventional sensors” for monitoring the infrastructure networks. Accordingly, analyses of social and infrastructure networks now go hand in hand. This dissertation studies resource allocation problems encountered in this set of diverse, heterogeneous, and interdependent networks. Three of the problems studied are encountered in the physical network domain, while the other three are encountered in the social network domain.

The first problem from the infrastructure network domain relates to a distributed file storage scheme whose goal is to enhance the robustness of data storage by making it tolerant of large-scale, geographically correlated failures. The second problem relates to the placement of relay nodes in a deployment area with multiple sensor nodes, with the goal of augmenting the connectivity of the resulting network while staying within a budget that specifies the maximum number of relay nodes that can be deployed. The third problem relates to the complex interdependencies that exist between infrastructure networks, such as the power grid and the communication network. The progressive recovery problem in an interdependent network is studied, whose goal is to maximize system utility over the period during which failed entities are recovered sequentially.

The three problems studied from the social network domain relate to influence propagation in an adversarial environment and to political sentiment assessment across the states of a country, with the goal of creating a “political heat map” of the country. In the first influence propagation problem, the goal of the second player is to restrict the influence of the first player, while in the second problem the goal of the second player is to gain a larger market share with the least initial investment.

Contributors: Mazumder, Anisha (Author) / Sen, Arunabha (Thesis advisor) / Richa, Andrea (Committee member) / Xue, Guoliang (Committee member) / Reisslein, Martin (Committee member) / Arizona State University (Publisher)

Created: 2016

An investigation of machine learning for password evaluation

Description

Passwords are ubiquitous and are poised to stay that way due to their relative usability, security and deployability when compared with alternative authentication schemes. Unfortunately, humans struggle with some of the assumptions or requirements that are necessary for truly strong passwords. As administrators try to push users towards password complexity and diversity, users still end up using predictable mangling patterns on old passwords and reusing the same passwords across services; users even inadvertently converge on the same patterns to a surprising degree, making an attacker’s job easier. This work explores using machine learning techniques to pick out strong passwords from weak ones, from a dataset of 10 million passwords, based on how structurally similar they were to the rest of the set.

Contributors: Todd, Margaret Nicole (Author) / Xue, Guoliang (Thesis advisor) / Ahn, Gail-Joon (Committee member) / Huang, Dijiang (Committee member) / Arizona State University (Publisher)

Created: 2016

Scaling Up Large-scale Sparse Learning and Its Application to Medical Imaging

Description

Large-scale $\ell_1$-regularized loss minimization problems arise in high-dimensional applications such as compressed sensing and high-dimensional supervised learning, including classification and regression problems. In many applications, it remains challenging to apply the sparse learning model to large-scale problems that have massive data samples with high-dimensional features. One popular and promising strategy is to scale up the optimization problem in parallel. Parallel solvers run on multiple cores in a shared memory system or a distributed environment to speed up the computation, but their practical usage is limited by the huge dimension of the feature space and by synchronization problems.
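
For reference, the problem class named in this abstract is commonly written in the generic form below. The symbols ($w$ for the model vector, $(x_i, y_i)$ for the $n$ samples, $\ell$ for the loss, $\lambda > 0$ for the regularization weight, $p$ for the feature dimension) are notation chosen here for illustration and are not taken from the dissertation.

```latex
\min_{w \in \mathbb{R}^{p}} \; \sum_{i=1}^{n} \ell\bigl(w;\, x_i, y_i\bigr) \;+\; \lambda \,\lVert w \rVert_1,
\qquad
\lVert w \rVert_1 = \sum_{j=1}^{p} \lvert w_j \rvert
```

The $\ell_1$ penalty drives many coordinates of $w$ to exactly zero, which is what makes the learned model sparse.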

In this dissertation, I carry out research in this direction, focusing on scaling up the optimization of sparse learning for supervised and unsupervised learning problems. For supervised learning, I first propose an asynchronous parallel solver to optimize the large-scale sparse learning model in a multithreading environment. Moreover, I propose a distributed framework to conduct the learning process when the dataset is stored in a distributed manner across different machines. The proposed model is then further extended to the study of genetic risk factors for Alzheimer's Disease (AD) across different research institutions, integrating a group feature selection framework to rank the top risk SNPs for AD. For the unsupervised learning problem, I propose a highly efficient solver, termed Stochastic Coordinate Coding (SCC), that scales up the optimization of dictionary learning and sparse coding problems. A common consideration in medical imaging research is that longitudinal features of patients at different time points are beneficial to study together. To further improve the dictionary learning model, I propose a multi-task dictionary learning method that learns the different tasks simultaneously and uses shared and individual dictionaries to encode both consistent and changing imaging features.

Contributors: Li, Qingyang (Author) / Ye, Jieping (Thesis advisor) / Xue, Guoliang (Thesis advisor) / He, Jingrui (Committee member) / Wang, Yalin (Committee member) / Li, Jing (Committee member) / Arizona State University (Publisher)

Created: 2017

Vulnerability and Protection Analysis of Critical Infrastructure Systems

Description

The power and communication networks are highly interdependent and form a part of the critical infrastructure of a country. Similarly, dependencies exist within each network itself. Owing to cascading failures, interdependent and intradependent networks are extremely susceptible to widespread vulnerabilities. In recent times the research community has shown significant interest in modeling these dependencies. However, many of the proposed models are simplistic in nature, which limits their applicability to real-world systems. This dissertation presents a Boolean logic based model, termed the Implicative Interdependency Model (IIM), to capture the complex dependencies and cascading failures resulting from an initial failure of one or more entities of either network.
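
The idea of a Boolean dependency model with cascading failures can be sketched as follows. The abstract does not give the IIM's exact rules, so the representation used here is an assumption for illustration only: each entity lists alternative groups of supporting entities and stays operational while at least one group is fully operational. The function name and the entity labels are hypothetical.

```python
def cascade(dependencies, initial_failures):
    """Return the set of failed entities once the cascade has stabilized.

    dependencies: dict mapping an entity to a list of groups (sets) of
        supporting entities. The entity stays operational while at least one
        group has all of its members operational (an OR of ANDs). Entities
        without an entry fail only if they are in initial_failures.
    initial_failures: iterable of entities that fail at the start.
    """
    failed = set(initial_failures)
    changed = True
    while changed:  # repeat until a full pass adds no new failure
        changed = False
        for entity, groups in dependencies.items():
            if entity in failed or not groups:
                continue
            # A group is broken if any of its members has failed; the entity
            # fails once every one of its groups is broken.
            if all(group & failed for group in groups):
                failed.add(entity)
                changed = True
    return failed


if __name__ == "__main__":
    deps = {
        "a1": [{"b1", "b2"}, {"b3"}],  # a1 needs (b1 and b2) or b3
        "b1": [{"a2"}],                # b1 needs a2
        "b3": [{"a2"}],                # b3 needs a2
    }
    print(sorted(cascade(deps, {"a2"})))
```

In this toy network, the failure of a2 takes down b1 and b3, which in turn leaves a1 without any intact group of supporters, while b2 survives because it has no dependencies.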

Utilizing the IIM, four pertinent problems encompassing the vulnerability and protection of critical infrastructures are formulated and solved. For protection analysis, the Entity Hardening Problem, the Targeted Entity Hardening Problem, and the Auxiliary Entity Allocation Problem are formulated. Qualitatively, under a resource budget, these problems maximize the number of entities protected from failure following an initial failure of a set of entities. Additionally, the model is used to derive a metric for analyzing the robustness of critical infrastructure systems. All of these problems are NP-complete. Accordingly, Integer Linear Program solutions (to obtain the optimal solution) and polynomial-time sub-optimal heuristic solutions are proposed for these problems. To analyze the efficacy of the heuristic solutions, comparative studies are performed on real-world and test system data.

Contributors: Banerjee, Joydeep (Author) / Sen, Arunabha (Thesis advisor) / Dasgupta, Partha (Committee member) / Xue, Guoliang (Committee member) / Raravi, Gurulingesh (Committee member) / Arizona State University (Publisher)

Created: 2017

Modeling, analysis, and efficient resource allocation in cyber-physical systems and critical infrastructure networks

Description

The critical infrastructures of the nation are a large and complex network of human, physical and cyber-physical systems. In recent times, it has become increasingly apparent that individual critical infrastructures, such as the power and communication networks, do not operate in isolation, but instead are part of a complex interdependent ecosystem where a failure involving a small set of network entities can trigger a cascading event resulting in the failure of a much larger set of entities through the failure propagation process.

Recognizing the need for a deeper understanding of the interdependent relationships between such critical infrastructures, several models have been proposed and analyzed in the last few years. However, most of these models are over-simplified and fail to capture the complex interdependencies that may exist between critical infrastructures. To overcome the limitations of existing models, this dissertation presents a new model, the Implicative Interdependency Model (IIM), that is able to capture such complex interdependency relations. As the potential for a failure cascade in critical interdependent networks poses several risks that can jeopardize the nation, this dissertation explores relevant research problems in the interdependent power and communication networks using the proposed IIM and lays the foundations for further study using this model.

Apart from exploring problems in interdependent critical infrastructures, this dissertation also explores resource allocation techniques for environments enabled with cyber-physical systems. Specifically, the problem of efficient path planning for data collection using mobile cyber-physical systems is explored. Two such environments are considered: a Radio-Frequency IDentification (RFID) environment with mobile “Tags” and “Readers”, and a sensor data collection environment where both the sensors and the data mules (data collectors) are mobile.

Finally, from an applied research perspective, this dissertation presents Raptor, an advanced network planning and management tool for mitigating the impact of spatially correlated, or region-based, faults on infrastructure networks. Raptor consolidates a wide range of studies conducted in the last few years on region-based faults, and provides an interface for network planners, designers, and operators to use the results of these studies to design robust and resilient networks in the presence of spatially correlated faults.

Contributors: Das, Arun (Author) / Sen, Arunabha (Thesis advisor) / Xue, Guoliang (Committee member) / Fainekos, Georgios (Committee member) / Qiao, Chunming (Committee member) / Arizona State University (Publisher)

Created: 2016
