Matching Items (8)

Filtering by

Clear all filters

137174-Thumbnail Image.png

Analysis of Twitter's Effect on Stock Prices

Description

Twitter has become a very popular social media site that is used daily by many people and organizations. This paper will focus on the financial aspect of Twitter, as a process will be shown to be able to mine data

Twitter has become a very popular social media site that is used daily by many people and organizations. This paper will focus on the financial aspect of Twitter, as a process will be shown to be able to mine data about specific companies' stock prices. This was done by writing a program to grab tweets about the stocks of the thirty companies in the Dow Jones.

Contributors

Agent

Created

Date Created
2014-05

150158-Thumbnail Image.png

Multi-label dimensionality reduction

Description

Multi-label learning, which deals with data associated with multiple labels simultaneously, is ubiquitous in real-world applications. To overcome the curse of dimensionality in multi-label learning, in this thesis I study multi-label dimensionality reduction, which extracts a small number of features

Multi-label learning, which deals with data associated with multiple labels simultaneously, is ubiquitous in real-world applications. To overcome the curse of dimensionality in multi-label learning, in this thesis I study multi-label dimensionality reduction, which extracts a small number of features by removing the irrelevant, redundant, and noisy information while considering the correlation among different labels in multi-label learning. Specifically, I propose Hypergraph Spectral Learning (HSL) to perform dimensionality reduction for multi-label data by exploiting correlations among different labels using a hypergraph. The regularization effect on the classical dimensionality reduction algorithm known as Canonical Correlation Analysis (CCA) is elucidated in this thesis. The relationship between CCA and Orthonormalized Partial Least Squares (OPLS) is also investigated. To perform dimensionality reduction efficiently for large-scale problems, two efficient implementations are proposed for a class of dimensionality reduction algorithms, including canonical correlation analysis, orthonormalized partial least squares, linear discriminant analysis, and hypergraph spectral learning. The first approach is a direct least squares approach which allows the use of different regularization penalties, but is applicable under a certain assumption; the second one is a two-stage approach which can be applied in the regularization setting without any assumption. Furthermore, an online implementation for the same class of dimensionality reduction algorithms is proposed when the data comes sequentially. A Matlab toolbox for multi-label dimensionality reduction has been developed and released. The proposed algorithms have been applied successfully in the Drosophila gene expression pattern image annotation. The experimental results on some benchmark data sets in multi-label learning also demonstrate the effectiveness and efficiency of the proposed algorithms.

Contributors

Agent

Created

Date Created
2011

150190-Thumbnail Image.png

Sparse learning package with stability selection and application to alzheimer's disease

Description

Sparse learning is a technique in machine learning for feature selection and dimensionality reduction, to find a sparse set of the most relevant features. In any machine learning problem, there is a considerable amount of irrelevant information, and separating relevant

Sparse learning is a technique in machine learning for feature selection and dimensionality reduction, to find a sparse set of the most relevant features. In any machine learning problem, there is a considerable amount of irrelevant information, and separating relevant information from the irrelevant information has been a topic of focus. In supervised learning like regression, the data consists of many features and only a subset of the features may be responsible for the result. Also, the features might require special structural requirements, which introduces additional complexity for feature selection. The sparse learning package, provides a set of algorithms for learning a sparse set of the most relevant features for both regression and classification problems. Structural dependencies among features which introduce additional requirements are also provided as part of the package. The features may be grouped together, and there may exist hierarchies and over- lapping groups among these, and there may be requirements for selecting the most relevant groups among them. In spite of getting sparse solutions, the solutions are not guaranteed to be robust. For the selection to be robust, there are certain techniques which provide theoretical justification of why certain features are selected. The stability selection, is a method for feature selection which allows the use of existing sparse learning methods to select the stable set of features for a given training sample. This is done by assigning probabilities for the features: by sub-sampling the training data and using a specific sparse learning technique to learn the relevant features, and repeating this a large number of times, and counting the probability as the number of times a feature is selected. Cross-validation which is used to determine the best parameter value over a range of values, further allows to select the best parameter value. This is done by selecting the parameter value which gives the maximum accuracy score. With such a combination of algorithms, with good convergence guarantees, stable feature selection properties and the inclusion of various structural dependencies among features, the sparse learning package will be a powerful tool for machine learning research. Modular structure, C implementation, ATLAS integration for fast linear algebraic subroutines, make it one of the best tool for a large sparse setting. The varied collection of algorithms, support for group sparsity, batch algorithms, are a few of the notable functionality of the SLEP package, and these features can be used in a variety of fields to infer relevant elements. The Alzheimer Disease(AD) is a neurodegenerative disease, which gradually leads to dementia. The SLEP package is used for feature selection for getting the most relevant biomarkers from the available AD dataset, and the results show that, indeed, only a subset of the features are required to gain valuable insights.

Contributors

Agent

Created

Date Created
2011

152768-Thumbnail Image.png

Surgical instrument reprocessing in a hospital setting analyzed with statistical process control and data mining techniques

Description

In a healthcare setting, the Sterile Processing Department (SPD) provides ancillary services to the Operating Room (OR), Emergency Room, Labor & Delivery, and off-site clinics. SPD's function is to reprocess reusable surgical instruments and return them to their home departments.

In a healthcare setting, the Sterile Processing Department (SPD) provides ancillary services to the Operating Room (OR), Emergency Room, Labor & Delivery, and off-site clinics. SPD's function is to reprocess reusable surgical instruments and return them to their home departments. The management of surgical instruments and medical devices can impact patient safety and hospital revenue. Any time instrumentation or devices are not available or are not fit for use, patient safety and revenue can be negatively impacted. One step of the instrument reprocessing cycle is sterilization. Steam sterilization is the sterilization method used for the majority of surgical instruments and is preferred to immediate use steam sterilization (IUSS) because terminally sterilized items can be stored until needed. IUSS Items must be used promptly and cannot be stored for later use. IUSS is intended for emergency situations and not as regular course of action. Unfortunately, IUSS is used to compensate for inadequate inventory levels, scheduling conflicts, and miscommunications. If IUSS is viewed as an adverse event, then monitoring IUSS incidences can help healthcare organizations meet patient safety goals and financial goals along with aiding in process improvement efforts. This work recommends statistical process control methods to IUSS incidents and illustrates the use of control charts for IUSS occurrences through a case study and analysis of the control charts for data from a health care provider. Furthermore, this work considers the application of data mining methods to IUSS occurrences and presents a representative example of data mining to the IUSS occurrences. This extends the application of statistical process control and data mining in healthcare applications.

Contributors

Agent

Created

Date Created
2014

151587-Thumbnail Image.png

On feature selection stability: a data perspective

Description

The rapid growth in the high-throughput technologies last few decades makes the manual processing of the generated data to be impracticable. Even worse, the machine learning and data mining techniques seemed to be paralyzed against these massive datasets. High-dimensionality is

The rapid growth in the high-throughput technologies last few decades makes the manual processing of the generated data to be impracticable. Even worse, the machine learning and data mining techniques seemed to be paralyzed against these massive datasets. High-dimensionality is one of the most common challenges for machine learning and data mining tasks. Feature selection aims to reduce dimensionality by selecting a small subset of the features that perform at least as good as the full feature set. Generally, the learning performance, e.g. classification accuracy, and algorithm complexity are used to measure the quality of the algorithm. Recently, the stability of feature selection algorithms has gained an increasing attention as a new indicator due to the necessity to select similar subsets of features each time when the algorithm is run on the same dataset even in the presence of a small amount of perturbation. In order to cure the selection stability issue, we should understand the cause of instability first. In this dissertation, we will investigate the causes of instability in high-dimensional datasets using well-known feature selection algorithms. As a result, we found that the stability mostly data-dependent. According to these findings, we propose a framework to improve selection stability by solving these main causes. In particular, we found that data noise greatly impacts the stability and the learning performance as well. So, we proposed to reduce it in order to improve both selection stability and learning performance. However, current noise reduction approaches are not able to distinguish between data noise and variation in samples from different classes. For this reason, we overcome this limitation by using Supervised noise reduction via Low Rank Matrix Approximation, SLRMA for short. The proposed framework has proved to be successful on different types of datasets with high-dimensionality, such as microarrays and images datasets. However, this framework cannot handle unlabeled, hence, we propose Local SVD to overcome this limitation.

Contributors

Agent

Created

Date Created
2013

153428-Thumbnail Image.png

Mining content and relations for social spammer detection

Description

Social networking services have emerged as an important platform for large-scale information sharing and communication. With the growing popularity of social media, spamming has become rampant in the platforms. Complex network interactions and evolving content present great challenges for social

Social networking services have emerged as an important platform for large-scale information sharing and communication. With the growing popularity of social media, spamming has become rampant in the platforms. Complex network interactions and evolving content present great challenges for social spammer detection. Different from some existing well-studied platforms, distinct characteristics of newly emerged social media data present new challenges for social spammer detection. First, texts in social media are short and potentially linked with each other via user connections. Second, it is observed that abundant contextual information may play an important role in distinguishing social spammers and normal users. Third, not only the content information but also the social connections in social media evolve very fast. Fourth, it is easy to amass vast quantities of unlabeled data in social media, but would be costly to obtain labels, which are essential for many supervised algorithms. To tackle those challenges raise in social media data, I focused on developing effective and efficient machine learning algorithms for social spammer detection.

I provide a novel and systematic study of social spammer detection in the dissertation. By analyzing the properties of social network and content information, I propose a unified framework for social spammer detection by collectively using the two types of information in social media. Motivated by psychological findings in physical world, I investigate whether sentiment analysis can help spammer detection in online social media. In particular, I conduct an exploratory study to analyze the sentiment differences between spammers and normal users; and present a novel method to incorporate sentiment information into social spammer detection framework. Given the rapidly evolving nature, I propose a novel framework to efficiently reflect the effect of newly emerging social spammers. To tackle the problem of lack of labeling data in social media, I study how to incorporate network information into text content modeling, and design strategies to select the most representative and informative instances from social media for labeling. Motivated by publicly available label information from other media platforms, I propose to make use of knowledge learned from cross-media to help spammer detection on social media.

Contributors

Agent

Created

Date Created
2015

153374-Thumbnail Image.png

Managing a user's vulnerability on a social networking site

Description

Users often join an online social networking (OSN) site, like Facebook, to remain social, by either staying connected with friends or expanding social networks. On an OSN site, users generally share variety of personal information which is often expected

Users often join an online social networking (OSN) site, like Facebook, to remain social, by either staying connected with friends or expanding social networks. On an OSN site, users generally share variety of personal information which is often expected to be visible to their friends, but sometimes vulnerable to unwarranted access from others. The recent study suggests that many personal attributes, including religious and political affiliations, sexual orientation, relationship status, age, and gender, are predictable using users' personal data from an OSN site. The majority of users want to remain socially active, and protect their personal data at the same time. This tension leads to a user's vulnerability, allowing privacy attacks which can cause physical and emotional distress to a user, sometimes with dire consequences. For example, stalkers can make use of personal information available on an OSN site to their personal gain. This dissertation aims to systematically study a user vulnerability against such privacy attacks.

A user vulnerability can be managed in three steps: (1) identifying, (2) measuring and (3) reducing a user vulnerability. Researchers have long been identifying vulnerabilities arising from user's personal data, including user names, demographic attributes, lists of friends, wall posts and associated interactions, multimedia data such as photos, audios and videos, and tagging of friends. Hence, this research first proposes a way to measure and reduce a user vulnerability to protect such personal data. This dissertation also proposes an algorithm to minimize a user's vulnerability while maximizing their social utility values.

To address these vulnerability concerns, social networking sites like Facebook usually let their users to adjust their profile settings so as to make some of their data invisible. However, users sometimes interact with others using unprotected posts (e.g., posts from a ``Facebook page\footnote{The term ''Facebook page`` refers to the page which are commonly dedicated for businesses, brands and organizations to share their stories and connect with people.}''). Such interactions help users to become more social and are publicly accessible to everyone. Thus, visibilities of these interactions are beyond the control of their profile settings. I explore such unprotected interactions so that users' are well aware of these new vulnerabilities and adopt measures to mitigate them further. In particular, {\em are users' personal attributes predictable using only the unprotected interactions}? To answer this question, I address a novel problem of predictability of users' personal attributes with unprotected interactions. The extreme sparsity patterns in users' unprotected interactions pose a serious challenge. Therefore, I approach to mitigating the data sparsity challenge by designing a novel attribute prediction framework using only the unprotected interactions. Experimental results on Facebook dataset demonstrates that the proposed framework can predict users' personal attributes.

Contributors

Agent

Created

Date Created
2015

154269-Thumbnail Image.png

Effective gene expression annotation approaches for mouse brain images

Description

Understanding the complexity of temporal and spatial characteristics of gene expression over brain development is one of the crucial research topics in neuroscience. An accurate description of the locations and expression status of relative genes requires extensive experiment resources. The

Understanding the complexity of temporal and spatial characteristics of gene expression over brain development is one of the crucial research topics in neuroscience. An accurate description of the locations and expression status of relative genes requires extensive experiment resources. The Allen Developing Mouse Brain Atlas provides a large number of in situ hybridization (ISH) images of gene expression over seven different mouse brain developmental stages. Studying mouse brain models helps us understand the gene expressions in human brains. This atlas collects about thousands of genes and now they are manually annotated by biologists. Due to the high labor cost of manual annotation, investigating an efficient approach to perform automated gene expression annotation on mouse brain images becomes necessary. In this thesis, a novel efficient approach based on machine learning framework is proposed. Features are extracted from raw brain images, and both binary classification and multi-class classification models are built with some supervised learning methods. To generate features, one of the most adopted methods in current research effort is to apply the bag-of-words (BoW) algorithm. However, both the efficiency and the accuracy of BoW are not outstanding when dealing with large-scale data. Thus, an augmented sparse coding method, which is called Stochastic Coordinate Coding, is adopted to generate high-level features in this thesis. In addition, a new multi-label classification model is proposed in this thesis. Label hierarchy is built based on the given brain ontology structure. Experiments have been conducted on the atlas and the results show that this approach is efficient and classifies the images with a relatively higher accuracy.

Contributors

Agent

Created

Date Created
2016