Search Content

Data Poisoning Attacks on Linked Data with Graph Regularization

Description

Social media has become the norm of everyone for communication. The usage of social media has increased exponentially in the last decade. The myriads of Social media services such as Facebook, Twitter, Snapchat, and Instagram etc allow people to connect with their friends, and followers freely. The attackers who try…

Social media has become the norm of everyone for communication. The usage of social media has increased exponentially in the last decade. The myriads of Social media services such as Facebook, Twitter, Snapchat, and Instagram etc allow people to connect with their friends, and followers freely. The attackers who try to take advantage of this situation has also increased at an exponential rate. Every social media service has its own recommender systems and user profiling algorithms. These algorithms use users current information to make different recommendations. Often the data that is formed from social media services is Linked data as each item/user is usually linked with other users/items. Recommender systems due to their ubiquitous and prominent nature are prone to several forms of attacks. One of the major form of attacks is poisoning the training set data. As recommender systems use current user/item information as the training set to make recommendations, the attacker tries to modify the training set in such a way that the recommender system would benefit the attacker or give incorrect recommendations and hence failing in its basic functionality. Most existing training set attack algorithms work with ``flat" attribute-value data which is typically assumed to be independent and identically distributed (i.i.d.). However, the i.i.d. assumption does not hold for social media data since it is inherently linked as described above. Usage of user-similarity with Graph Regularizer in morphing the training data produces best results to attacker. This thesis proves the same by demonstrating with experiments on Collaborative Filtering with multiple datasets.

ContributorsMagham, Venkatesh (Author) / Liu, Huan (Thesis advisor) / Wu, Liang (Committee member) / Amor, Hani Ben (Committee member) / Arizona State University (Publisher)

Created2019

Misinformation Detection in Social Media

Description

The pervasive use of social media gives it a crucial role in helping the public perceive reliable information. Meanwhile, the openness and timeliness of social networking sites also allow for the rapid creation and dissemination of misinformation. It becomes increasingly difficult for online users to find accurate and trustworthy information.…

The pervasive use of social media gives it a crucial role in helping the public perceive reliable information. Meanwhile, the openness and timeliness of social networking sites also allow for the rapid creation and dissemination of misinformation. It becomes increasingly difficult for online users to find accurate and trustworthy information. As witnessed in recent incidents of misinformation, it escalates quickly and can impact social media users with undesirable consequences and wreak havoc instantaneously. Different from some existing research in psychology and social sciences about misinformation, social media platforms pose unprecedented challenges for misinformation detection. First, intentional spreaders of misinformation will actively disguise themselves. Second, content of misinformation may be manipulated to avoid being detected, while abundant contextual information may play a vital role in detecting it. Third, not only accuracy, earliness of a detection method is also important in containing misinformation from being viral. Fourth, social media platforms have been used as a fundamental data source for various disciplines, and these research may have been conducted in the presence of misinformation. To tackle the challenges, we focus on developing machine learning algorithms that are robust to adversarial manipulation and data scarcity.

The main objective of this dissertation is to provide a systematic study of misinformation detection in social media. To tackle the challenges of adversarial attacks, I propose adaptive detection algorithms to deal with the active manipulations of misinformation spreaders via content and networks. To facilitate content-based approaches, I analyze the contextual data of misinformation and propose to incorporate the specific contextual patterns of misinformation into a principled detection framework. Considering its rapidly growing nature, I study how misinformation can be detected at an early stage. In particular, I focus on the challenge of data scarcity and propose a novel framework to enable historical data to be utilized for emerging incidents that are seemingly irrelevant. With misinformation being viral, applications that rely on social media data face the challenge of corrupted data. To this end, I present robust statistical relational learning and personalization algorithms to minimize the negative effect of misinformation.

ContributorsWu, Liang (Author) / Liu, Huan (Thesis advisor) / Tong, Hanghang (Committee member) / Doupe, Adam (Committee member) / Davison, Brian D. (Committee member) / Arizona State University (Publisher)

Created2019

Mining Data with Feature Interactions

Description

Models using feature interactions have been applied successfully in many areas such as biomedical analysis, recommender systems. The popularity of using feature interactions mainly lies in (1) they are able to capture the nonlinearity of the data compared with linear effects and (2) they enjoy great interpretability. In this thesis,…

Models using feature interactions have been applied successfully in many areas such as biomedical analysis, recommender systems. The popularity of using feature interactions mainly lies in (1) they are able to capture the nonlinearity of the data compared with linear effects and (2) they enjoy great interpretability. In this thesis, I propose a series of formulations using feature interactions for real world problems and develop efficient algorithms for solving them.

Specifically, I first propose to directly solve the non-convex formulation of the weak hierarchical Lasso which imposes weak hierarchy on individual features and interactions but can only be approximately solved by a convex relaxation in existing studies. I further propose to use the non-convex weak hierarchical Lasso formulation for hypothesis testing on the interaction features with hierarchical assumptions. Secondly, I propose a type of bi-linear models that take advantage of interactions of features for drug discovery problems where specific drug-drug pairs or drug-disease pairs are of interest. These models are learned by maximizing the number of positive data pairs that rank above the average score of unlabeled data pairs. Then I generalize the method to the case of using the top-ranked unlabeled data pairs for representative construction and derive an efficient algorithm for the extended formulation. Last but not least, motivated by a special form of bi-linear models, I propose a framework that enables simultaneously subgrouping data points and building specific models on the subgroups for learning on massive and heterogeneous datasets. Experiments on synthetic and real datasets are conducted to demonstrate the effectiveness or efficiency of the proposed methods.

ContributorsLiu, Yashu (Author) / Ye, Jieping (Thesis advisor) / Xue, Guoliang (Thesis advisor) / Liu, Huan (Committee member) / Mittelmann, Hans D (Committee member) / Arizona State University (Publisher)

Created2018

System Identification, State Estimation, And Control Approaches to Gestational Weight Gain Interventions

Description

Excessive weight gain during pregnancy is a significant public health concern and has been the recent focus of novel, control systems-based interventions. Healthy Mom Zone (HMZ) is an intervention study that aims to develop and validate an individually tailored and intensively adaptive intervention to manage weight gain for overweight or…

Excessive weight gain during pregnancy is a significant public health concern and has been the recent focus of novel, control systems-based interventions. Healthy Mom Zone (HMZ) is an intervention study that aims to develop and validate an individually tailored and intensively adaptive intervention to manage weight gain for overweight or obese pregnant women using control engineering approaches. Motivated by the needs of the HMZ, this dissertation presents how to use system identification and state estimation techniques to assist in dynamical systems modeling and further enhance the performance of the closed-loop control system for interventions.

Underreporting of energy intake (EI) has been found to be an important consideration that interferes with accurate weight control assessment and the effective use of energy balance (EB) models in an intervention setting. To better understand underreporting, a variety of estimation approaches are developed; these include back-calculating energy intake from a closed-form of the EB model, a Kalman-filter based algorithm for recursive estimation from randomly intermittent measurements in real time, and two semi-physical identification approaches that can parameterize the extent of systematic underreporting with global/local modeling techniques. Each approach is analyzed with intervention participant data and demonstrates potential of promoting the success of weight control.

In addition, substantial efforts have been devoted to develop participant-validated models and incorporate into the Hybrid Model Predictive Control (HMPC) framework for closed-loop interventions. System identification analyses from Phase I led to modifications of the measurement protocols for Phase II, from which longer and more informative data sets were collected. Participant-validated models obtained from Phase II data significantly increase predictive ability for individual behaviors and provide reliable open-loop dynamic information for HMPC implementation. The HMPC algorithm that assigns optimized dosages in response to participant real time intervention outcomes relies on a Mixed Logical Dynamical framework which can address the categorical nature of dosage components, and translates sequential decision rules and other clinical considerations into mixed-integer linear constraints. The performance of the HMPC decision algorithm was tested with participant-validated models, with the results indicating that HMPC is superior to "IF-THEN" decision rules.

ContributorsGuo, Penghong (Author) / Rivera, Daniel E. (Thesis advisor) / Peet, Matthew M. (Committee member) / Forzani, Erica (Committee member) / Deng, Shuguang (Committee member) / Pavlic, Theodore P. (Committee member) / Arizona State University (Publisher)

Created2018

Toward small community discovery in social networks

Description

A community in a social network can be viewed as a structure formed by individuals who share similar interests. Not all communities are explicit; some may be hidden in a large network. Therefore, discovering these hidden communities becomes an interesting problem. Researchers from a number of fields have developed algorithms…

A community in a social network can be viewed as a structure formed by individuals who share similar interests. Not all communities are explicit; some may be hidden in a large network. Therefore, discovering these hidden communities becomes an interesting problem. Researchers from a number of fields have developed algorithms to tackle this problem.

Besides the common feature above, communities within a social network have two unique characteristics: communities are mostly small and overlapping. Unfortunately, many traditional algorithms have difficulty recognizing these small communities (often called the resolution limit problem) as well as overlapping communities.

In this work, two enhanced community detection techniques are proposed for re-working existing community detection algorithms to find small communities in social networks. One method is to modify the modularity measure within the framework of the traditional Newman-Girvan algorithm so that more small communities can be detected. The second method is to incorporate a preprocessing step into existing algorithms by changing edge weights inside communities. Both methods help improve community detection performance while maintaining or improving computational efficiency.

ContributorsWang, Ran (Author) / Liu, Huan (Thesis advisor) / Sen, Arunabha (Committee member) / Colbourn, Charles (Committee member) / Arizona State University (Publisher)

Created2015

Improved, scalable, and personalized context recovery system: E-TweetSense

Description

Browsing Twitter users, or browsers, often find it increasingly cumbersome to attach meaning to tweets that are displayed on their timeline as they follow more and more users or pages. The tweets being browsed are created by Twitter users called originators, and are of some significance to the browser who…

Browsing Twitter users, or browsers, often find it increasingly cumbersome to attach meaning to tweets that are displayed on their timeline as they follow more and more users or pages. The tweets being browsed are created by Twitter users called originators, and are of some significance to the browser who has chosen to subscribe to the tweets from the originator by following the originator. Although, hashtags are used to tag tweets in an effort to attach context to the tweets, many tweets do not have a hashtag. Such tweets are called orphan tweets and they adversely affect the experience of a browser.

A hashtag is a type of label or meta-data tag used in social networks and micro-blogging services which makes it easier for users to find messages with a specific theme or content. The context of a tweet can be defined as a set of one or more hashtags. Users often do not use hashtags to tag their tweets. This leads to the problem of missing context for tweets. To address the problem of missing hashtags, a statistical method was proposed which predicts most likely hashtags based on the social circle of an originator.

In this thesis, we propose to improve on the existing context recovery system by selectively limiting the candidate set of hashtags to be derived from the intimate circle of the originator rather than from every user in the social network of the originator. This helps in reducing the computation, increasing speed of prediction, scaling the system to originators with large social networks while still preserving most of the accuracy of the predictions. We also propose to not only derive the candidate hashtags from the social network of the originator but also derive the candidate hashtags based on the content of the tweet. We further propose to learn personalized statistical models according to the adoption patterns of different originators. This helps in not only identifying the personalized candidate set of hashtags based on the social circle and content of the tweets but also in customizing the hashtag adoption pattern to the originator of the tweet.

ContributorsMallapura Umamaheshwar, Tejas (Author) / Kambhampati, Subbarao (Thesis advisor) / Liu, Huan (Committee member) / Davulcu, Hasan (Committee member) / Arizona State University (Publisher)

Created2015

Making thin data thick: user behavior analysis with minimum information

Description

With the rise of social media, user-generated content has become available at an unprecedented scale. On Twitter, 1 billion tweets are posted every 5 days and on Facebook, 20 million links are shared every 20 minutes. These massive collections of user-generated content have introduced the human behavior's big-data.

This big data…

With the rise of social media, user-generated content has become available at an unprecedented scale. On Twitter, 1 billion tweets are posted every 5 days and on Facebook, 20 million links are shared every 20 minutes. These massive collections of user-generated content have introduced the human behavior's big-data.

This big data has brought about countless opportunities for analyzing human behavior at scale. However, is this data enough? Unfortunately, the data available at the individual-level is limited for most users. This limited individual-level data is often referred to as thin data. Hence, researchers face a big-data paradox, where this big-data is a large collection of mostly limited individual-level information. Researchers are often constrained to derive meaningful insights regarding online user behavior with this limited information. Simply put, they have to make thin data thick.

In this dissertation, how human behavior's thin data can be made thick is investigated. The chief objective of this dissertation is to demonstrate how traces of human behavior can be efficiently gleaned from the, often limited, individual-level information; hence, introducing an all-inclusive user behavior analysis methodology that considers social media users with different levels of information availability. To that end, the absolute minimum information in terms of both link or content data that is available for any social media user is determined. Utilizing only minimum information in different applications on social media such as prediction or recommendation tasks allows for solutions that are (1) generalizable to all social media users and that are (2) easy to implement. However, are applications that employ only minimum information as effective or comparable to applications that use more information?

In this dissertation, it is shown that common research challenges such as detecting malicious users or friend recommendation (i.e., link prediction) can be effectively performed using only minimum information. More importantly, it is demonstrated that unique user identification can be achieved using minimum information. Theoretical boundaries of unique user identification are obtained by introducing social signatures. Social signatures allow for user identification in any large-scale network on social media. The results on single-site user identification are generalized to multiple sites and it is shown how the same user can be uniquely identified across multiple sites using only minimum link or content information.

The findings in this dissertation allows finding the same user across multiple sites, which in turn has multiple implications. In particular, by identifying the same users across sites, (1) patterns that users exhibit across sites are identified, (2) how user behavior varies across sites is determined, and (3) activities that are observed only across sites are identified and studied.

ContributorsZafarani, Reza, 1983- (Author) / Liu, Huan (Thesis advisor) / Kambhampati, Subbarao (Committee member) / Xue, Guoliang (Committee member) / Leskovec, Jure (Committee member) / Arizona State University (Publisher)

Created2015

Visual analytics for spatiotemporal cluster analysis

Description

Traditionally, visualization is one of the most important and commonly used methods of generating insight into large scale data. Particularly for spatiotemporal data, the translation of such data into a visual form allows users to quickly see patterns, explore summaries and relate domain knowledge about underlying geographical phenomena that would…

Traditionally, visualization is one of the most important and commonly used methods of generating insight into large scale data. Particularly for spatiotemporal data, the translation of such data into a visual form allows users to quickly see patterns, explore summaries and relate domain knowledge about underlying geographical phenomena that would not be apparent in tabular form. However, several critical challenges arise when visualizing and exploring these large spatiotemporal datasets. While, the underlying geographical component of the data lends itself well to univariate visualization in the form of traditional cartographic representations (e.g., choropleth, isopleth, dasymetric maps), as the data becomes multivariate, cartographic representations become more complex. To simplify the visual representations, analytical methods such as clustering and feature extraction are often applied as part of the classification phase. The automatic classification can then be rendered onto a map; however, one common issue in data classification is that items near a classification boundary are often mislabeled.

This thesis explores methods to augment the automated spatial classification by utilizing interactive machine learning as part of the cluster creation step. First, this thesis explores the design space for spatiotemporal analysis through the development of a comprehensive data wrangling and exploratory data analysis platform. Second, this system is augmented with a novel method for evaluating the visual impact of edge cases for multivariate geographic projections. Finally, system features and functionality are demonstrated through a series of case studies, with key features including similarity analysis, multivariate clustering, and novel visual support for cluster comparison.

ContributorsZhang, Yifan (Author) / Maciejewski, Ross (Thesis advisor) / Mack, Elizabeth (Committee member) / Liu, Huan (Committee member) / Davulcu, Hasan (Committee member) / Arizona State University (Publisher)

Created2016

Web Intelligence for Scaling Discourse of Organizations

Description

Internet and social media devices created a new public space for debate on political

and social topics (Papacharissi 2002; Himelboim 2010). Hotly debated issues

span all spheres of human activity; from liberal vs. conservative politics, to radical

vs. counter-radical religious debate, to climate change debate in scientific community,

to globalization debate in economics, and…

Internet and social media devices created a new public space for debate on political

and social topics (Papacharissi 2002; Himelboim 2010). Hotly debated issues

span all spheres of human activity; from liberal vs. conservative politics, to radical

vs. counter-radical religious debate, to climate change debate in scientific community,

to globalization debate in economics, and to nuclear disarmament debate in

security. Many prominent ’camps’ have emerged within Internet debate rhetoric and

practice (Dahlberg, n.d.).

In this research I utilized feature extraction and model fitting techniques to process

the rhetoric found in the web sites of 23 Indonesian Islamic religious organizations,

later with 26 similar organizations from the United Kingdom to profile their

ideology and activity patterns along a hypothesized radical/counter-radical scale, and

presented an end-to-end system that is able to help researchers to visualize the data

in an interactive fashion on a time line. The subject data of this study is the articles

downloaded from the web sites of these organizations dating from 2001 to 2011,

and in 2013. I developed algorithms to rank these organizations by assigning them

to probable positions on the scale. I showed that the developed Rasch model fits

the data using Andersen’s LR-test (likelihood ratio). I created a gold standard of

the ranking of these organizations through an expertise elicitation tool. Then using

my system I computed expert-to-expert agreements, and then presented experimental

results comparing the performance of three baseline methods to show that the

Rasch model not only outperforms the baseline methods, but it was also the only

system that performs at expert-level accuracy.

I developed an end-to-end system that receives list of organizations from experts,

mines their web corpus, prepare discourse topic lists with expert support, and then

ranks them on scales with partial expert interaction, and finally presents them on an

easy to use web based analytic system.

ContributorsTikves, Sukru (Author) / Davulcu, Hasan (Thesis advisor) / Sen, Arunabha (Committee member) / Liu, Huan (Committee member) / Woodward, Mark (Committee member) / Arizona State University (Publisher)

Created2016

Patient-centered and experience-aware mining for effective information discovery in health forums

Description

Online health forums provide a convenient channel for patients, caregivers, and medical professionals to share their experience, support and encourage each other, and form health communities. The fast growing content in health forums provides a large repository for people to seek valuable information. A forum user can issue a keyword…

Online health forums provide a convenient channel for patients, caregivers, and medical professionals to share their experience, support and encourage each other, and form health communities. The fast growing content in health forums provides a large repository for people to seek valuable information. A forum user can issue a keyword query to search health forums regarding to some specific questions, e.g., what treatments are effective for a disease symptom? A medical researcher can discover medical knowledge in a timely and large-scale fashion by automatically aggregating the latest evidences emerging in health forums.

This dissertation studies how to effectively discover information in health forums. Several challenges have been identified. First, the existing work relies on the syntactic information unit, such as a sentence, a post, or a thread, to bind different pieces of information in a forum. However, most of information discovery tasks should be based on the semantic information unit, a patient. For instance, given a keyword query that involves the relationship between a treatment and side effects, it is expected that the matched keywords refer to the same patient. In this work, patient-centered mining is proposed to mine patient semantic information units. In a patient information unit, the health information, such as diseases, symptoms, treatments, effects, and etc., is connected by the corresponding patient.

Second, the information published in health forums has varying degree of quality. Some information includes patient-reported personal health experience, while others can be hearsay. In this work, a context-aware experience extraction framework is proposed to mine patient-reported personal health experience, which can be used for evidence-based knowledge discovery or finding patients with similar experience.

At last, the proposed patient-centered and experience-aware mining framework is used to build a patient health information database for effectively discovering adverse drug reactions (ADRs) from health forums. ADRs have become a serious health problem and even a leading cause of death in the United States. Health forums provide valuable evidences in a large scale and in a timely fashion through the active participation of patients, caregivers, and doctors. Empirical evaluation shows the effectiveness of the proposed approach.

ContributorsLiu, Yunzhong (Author) / Chen, Yi (Thesis advisor) / Liu, Huan (Thesis advisor) / Li, Baoxin (Committee member) / Davulcu, Hasan (Committee member) / Arizona State University (Publisher)

Created2016

Filtering by