Matching Items (50)
Description

Many existing applications of machine learning (ML) to cybersecurity focus on detecting malicious activity already present in an enterprise. However, recent high-profile cyberattacks have shown that certain threats could have been avoided. The speed of contemporary attacks, along with the high costs of remediation, incentivizes avoidance over response. Yet avoidance implies the ability to predict, a notoriously difficult task due to high rates of false positives, the difficulty of finding data that is indicative of future events, and the often unexplainable results of machine learning algorithms.

In this dissertation, these challenges are addressed by presenting three artificial intelligence (AI) approaches to support the prioritization of defense measures. The first two approaches leverage ML on cyberthreat intelligence data to predict whether exploits will be used in the wild. The first focuses on the data feeds generated after vulnerability disclosures; the developed ML models outperform the current industry-standard method, more than doubling the F1 score. An approach is then developed to derive features describing who generated those data feeds; adding these features increases recall by over 19% while maintaining precision. Finally, frequent itemset mining is combined with a variant of a probabilistic temporal logic framework to predict when attacks are likely to occur. In this approach, rules correlating malicious activity on hacking-community platforms with real-world cyberattacks are mined and then used in a deductive reasoning approach to generate predictions. The developed approach predicted unseen real-world attacks with an average F1 score over 45% higher than that of a baseline approach.
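
As a loose illustration of the first kind of approach described above, the sketch below trains a classifier on invented vulnerability-disclosure features and reports precision, recall, and F1. The feature names, data, and model choice are assumptions for illustration only and are not the dissertation's actual features or models.

```python
# Hypothetical sketch: predict exploitation in the wild from features derived
# from vulnerability disclosures. Features, data, and model are invented for
# illustration and are not those used in the dissertation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.integers(0, 11, n),   # e.g., a CVSS-like severity score
    rng.integers(0, 50, n),   # e.g., number of threat-intel feeds mentioning the CVE
    rng.integers(0, 2, n),    # e.g., proof-of-concept code observed (0/1)
])
# Synthetic label: exploited in the wild (rare, as in practice).
y = (rng.random(n) < 0.05 + 0.25 * X[:, 2] * (X[:, 0] / 10)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
p, r, f1, _ = precision_recall_fscore_support(
    y_te, clf.predict(X_te), average="binary", zero_division=0)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```
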
Contributors: Almukaynizi, Mohammed (Author) / Shakarian, Paulo (Thesis advisor) / Huang, Dijiang (Committee member) / Maciejewski, Ross (Committee member) / Simari, Gerardo I. (Committee member) / Arizona State University (Publisher)
Created: 2019
Description

Social media has become a primary platform for real-time information sharing among users. News on social media spreads faster than through traditional outlets, and millions of users turn to this platform to receive the latest updates on major events, especially disasters. Social media bridges the gap between the people who are affected by disasters, volunteers who offer contributions, and first responders. On the other hand, social media is fertile ground for malicious users who purposefully disturb the relief processes facilitated on social media. These malicious users take advantage of social bots to overrun social media posts with fake images, rumors, and false information. This process causes distress and prevents actionable information from reaching the affected people. Social bots are automated accounts controlled by a malicious user, and they have become prevalent on social media in recent years.

In spite of existing efforts toward understanding and removing bots on social media, there are at least two drawbacks associated with current bot detection algorithms: general-purpose bot detection methods are designed to be conservative, not labeling a user as a bot unless the algorithm is highly confident, and they overlook the effect of users who are manipulated by bots and (unintentionally) spread their content. This study is threefold. First, I design a machine learning model that uses the content and context of social media posts to detect actionable ones among them; it specifically focuses on tweets in which people ask for help after major disasters. Second, I focus on bots, which can facilitate the spread of malicious content during disasters; I propose two methods for detecting bots on social media with a focus on the recall of the detection. Third, I study the characteristics of users who spread the content of malicious actors. These features have the potential to improve methods that detect malicious content such as fake news.
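
A minimal, content-only sketch of the first task described above, classifying help-request tweets. The example tweets, labels, and model choice are invented for illustration and are not the study's data or model.

```python
# Hypothetical sketch: classify tweets as actionable help requests using text
# content only. Tweets, labels, and the model choice are invented.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tweets = [
    "We are trapped on the roof, need rescue at 5th and Main",
    "Thoughts and prayers for everyone affected",
    "Need insulin for my mother, shelters are full",
    "Amazing sunset tonight despite everything",
]
labels = [1, 0, 1, 0]  # 1 = actionable request for help

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(tweets, labels)
print(model.predict(["Please send water to the shelter on Oak St"]))
```
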
Contributors: Hossein Nazer, Tahora (Author) / Liu, Huan (Thesis advisor) / Davulcu, Hasan (Committee member) / Maciejewski, Ross (Committee member) / Akoglu, Leman (Committee member) / Arizona State University (Publisher)
Created: 2019
Description

Live streaming has risen to significant popularity in the recent past, and it is largely a feature of existing social networks like Facebook, Instagram, and Snapchat. However, there exists at least one social network entirely devoted to live streaming, and specifically to the live streaming of video games: Twitch. This social network is unique for a number of reasons, not least its hyper-focus on live content, and this uniqueness poses challenges for social media researchers.

Despite this uniqueness, almost no scientific work has been performed on this public social network. Thus, it is unclear which user interaction features present on other social networks exist on Twitch. Investigating the interactions between users and identifying which, if any, of the common user behaviors on social networks exist on Twitch is an important step in understanding how Twitch fits into the social media ecosystem. For example, there are users who have large followings on Twitch and amass large numbers of viewers, but do those users exert influence over the behavior of other users the way that popular users on Twitter do?

This task, however, is not trivial. The same hyper-focus on live content that makes Twitch unique in the social network space invalidates many of the traditional approaches to social network analysis. Thus, new algorithms and techniques must be developed in order to tap this data source. In this thesis, a novel algorithm for finding games whose releases have made a significant impact on the network is described, as well as a novel algorithm for detecting and identifying influential players of games. In addition, the Twitch network is described in detail, along with the data that was collected in order to power the two previously described algorithms.
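
The release-impact idea can be loosely illustrated with a before/after comparison of daily streaming activity around a release date. This is a hypothetical sketch with invented counts and a simple threshold, not the algorithm developed in the thesis.

```python
# Hypothetical sketch: flag a game release as impactful if mean daily activity
# after the release exceeds the pre-release mean by a simple z-style margin.
# The counts and the threshold are invented for illustration.
from statistics import mean, pstdev

daily_streams = [110, 95, 102, 99, 105, 98, 101,     # week before release
                 240, 310, 290, 275, 260, 300, 280]  # week after release
before, after = daily_streams[:7], daily_streams[7:]

baseline, spread = mean(before), pstdev(before)
impactful = mean(after) > baseline + 3 * spread
print("release impact detected:", impactful)
```
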
Contributors: Jones, Isaac (Author) / Liu, Huan (Thesis advisor) / Maciejewski, Ross (Committee member) / Shakarian, Paulo (Committee member) / Agarwal, Nitin (Committee member) / Arizona State University (Publisher)
Created: 2019
Description

Graphs are commonly used visualization tools in a variety of fields. Algorithms have been proposed that claim to improve the readability of graphs by reducing edge crossings, adjusting edge lengths, or other means. However, little research has been done to determine which of these algorithms best suit human perception for particular graph properties. This thesis explores four different graph properties: average local clustering coefficient (ALCC), global clustering coefficient (GCC), number of triangles (NT), and diameter. For each of these properties, three different graph layouts are applied to represent three different approaches to graph visualization: multidimensional scaling (MDS), force-directed (FD), and tsNET. In a series of studies conducted through the crowdsourcing platform Amazon Mechanical Turk, participants are tasked with discriminating between two graphs in order to determine their just noticeable differences (JNDs) for the four graph properties and three layout algorithms. The results are analyzed using previously established methods presented by Rensink et al. and by Kay and Heer. The average JNDs are analyzed using a linear model that determines whether the property-layout pair appears to follow Weber's law, and the individual JNDs are run through a log-linear model to determine whether it is possible to model the individual variance of the participants' JNDs. The models are evaluated using the R² score to determine whether they adequately explain the data and are compared using the pairwise Mann-Whitney U-test to determine whether the layout has a significant effect on the perception of the graph property. These tests indicate that the data collected in the studies cannot always be modeled well with either the linear or the log-linear model, which suggests that some properties may not follow Weber's law. Additionally, the layout algorithm is not found to have a significant impact on the perception of some of these properties.
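
A compact sketch of the two analysis steps named above: a linear (Weber's law) fit of average JNDs against the base property value, and a Mann-Whitney U comparison of two layouts. All numbers are invented, and this is not the study's code or data.

```python
# Hypothetical sketch of the JND analysis: a linear fit (Weber's law predicts
# JND proportional to the base value) and a Mann-Whitney U test comparing two
# layouts. All numbers are invented for illustration.
import numpy as np
from scipy import stats

base_values = np.array([0.1, 0.2, 0.3, 0.4, 0.5])           # base property levels
avg_jnd_mds = np.array([0.012, 0.021, 0.033, 0.041, 0.052])  # per-level average JNDs
avg_jnd_fd = np.array([0.018, 0.030, 0.044, 0.058, 0.071])

# Weber's law: JND ~ k * base value, so a linear model should explain the data.
slope, intercept, r, p, se = stats.linregress(base_values, avg_jnd_mds)
print(f"Weber fraction k ~ {slope:.3f}, R^2 = {r**2:.3f}")

# Does the layout change the precision of judgment?
u, p_val = stats.mannwhitneyu(avg_jnd_mds, avg_jnd_fd, alternative="two-sided")
print(f"Mann-Whitney U = {u:.1f}, p = {p_val:.3f}")
```
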
Contributors: Clayton, Benjamin (Author) / Maciejewski, Ross (Thesis advisor) / Kobourov, Stephen (Committee member) / Sefair, Jorge (Committee member) / Arizona State University (Publisher)
Created: 2019
Description

Component simulation models, such as agent-based models, may depend on spatial data associated with geographic locations. Composition of such models can be achieved using a Geographic Knowledge Interchange Broker (GeoKIB) enabled with spatial-temporal data transformation functions, each of which is responsible for a set of interactions between two independent models. The use of autonomous interaction models allows model composition without alteration of the composed component models. An interaction model must handle differences in the spatial resolutions between models, in addition to differences in their temporal input/output data types and resolutions.

A generalized GeoKIB was designed that regulates unidirectional, spatially based interactions between composed models. Different input and output data types are used for the interaction model, depending on whether data transfer should be passive or active. Synchronization of time-tagged input/output values is made possible through dependence on a discrete simulation clock. An algorithm supporting spatial conversion is developed to transform any two-dimensional geographic data map between different region specifications. Maps belonging to the composed models can have different regions, map cell sizes, or boundaries. The GeoKIB can be extended based on the model specifications to be composed and the target application domain.

Two separate, simplistic models were created to demonstrate model composition via the GeoKIB. An interaction model was created for each of the two directions the composed models interact. This exemplar is developed to demonstrate composition and simulation of geographic-based component models.
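
The spatial-conversion step mentioned above amounts to resampling a two-dimensional geographic grid from one region and cell-size specification to another. The sketch below shows one simple way to do this, block-averaging a fine grid onto a coarser grid; it is an assumption-laden illustration, not the GeoKIB algorithm.

```python
# Hypothetical sketch: resample a fine 2D geographic grid onto a coarser grid
# by averaging the source cells that fall inside each target cell. This only
# illustrates the kind of conversion an interaction model must perform.
import numpy as np

def regrid_mean(src, factor):
    """Aggregate a 2D array into blocks of size factor x factor by averaging."""
    rows, cols = src.shape
    rows -= rows % factor   # drop any ragged edge for simplicity
    cols -= cols % factor
    trimmed = src[:rows, :cols]
    return trimmed.reshape(rows // factor, factor,
                           cols // factor, factor).mean(axis=(1, 3))

fine_grid = np.arange(36, dtype=float).reshape(6, 6)  # fine-resolution map
coarse_grid = regrid_mean(fine_grid, 3)               # 2 x 2 coarse map
print(coarse_grid)
```
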
Contributors: Boyd, William Angelo (Author) / Sarjoughian, Hessam S. (Thesis advisor) / Maciejewski, Ross (Committee member) / Sarwat, Mohamed (Committee member) / Arizona State University (Publisher)
Created: 2019
Description

Social media has become an important means of user-centered information sharing and communication in a gamut of domains, including news consumption, entertainment, marketing, public relations, and many more. The low cost, easy access, and rapid dissemination of information on social media draw a large audience but also exacerbate the wide propagation of disinformation, including fake news, i.e., news with intentionally false information. Disinformation on social media is growing fast in volume and can have detrimental societal effects. Despite the importance of this problem, our understanding of disinformation on social media is still limited. Recent advances in computational approaches to detecting disinformation and fake news have shown some early promising results. Novel challenges remain abundant due to disinformation's complexity, diversity, dynamics, and multi-modality, and the costs of fact-checking or annotation.

Social media data opens the door to interdisciplinary research and allows one to collectively study large-scale human behaviors that would otherwise be impossible to observe. For example, user engagements with information such as news articles, including posting about, commenting on, or recommending the news on social media, contain abundant rich information. Because social media data is big, incomplete, noisy, and unstructured, with abundant social relations, relying solely on user engagements can be sensitive to noisy user feedback. To alleviate the problem of limited labeled data, it is important to combine content with this new (but weak) type of information as supervision signals, i.e., weak social supervision, to advance fake news detection.

The goal of this dissertation is to understand disinformation by proposing and exploiting weak social supervision for learning with little labeled data and to effectively detect disinformation via innovative research and novel computational methods. In particular, I investigate learning with weak social supervision for understanding disinformation through the following computational tasks: bringing in heterogeneous social context as auxiliary information for effective fake news detection; discovering explanations of fake news from social media for explainable fake news detection; modeling multiple sources of weak social supervision for early fake news detection; and transferring knowledge across domains with adversarial machine learning for cross-domain fake news detection. The findings of the dissertation significantly expand the boundaries of disinformation research and establish a novel paradigm of learning with weak social supervision that has important implications for broad applications in social media.
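
In spirit, weak social supervision augments scarce clean labels with noisy signals derived from user engagements. The toy sketch below turns invented engagement statistics into weak labels and combines them with content features; the signal names, weights, and data are assumptions for illustration, not the dissertation's method.

```python
# Hypothetical sketch: derive weak labels for unlabeled news pieces from simple
# engagement signals, then train a content classifier on both the few clean
# labels and the weak ones. Signals, weights, and data are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled = [("scientists confirm vaccine efficacy in new trial", 0),
           ("shocking cure doctors don't want you to know", 1)]
unlabeled = [
    ("miracle pill melts fat overnight, share before it is deleted",
     {"bot_share_ratio": 0.8, "avg_user_credibility": 0.2}),
    ("city council approves new transit budget",
     {"bot_share_ratio": 0.1, "avg_user_credibility": 0.9}),
]

def weak_label(signals):
    # Weak social supervision: a heuristic score from engagement signals.
    score = (0.6 * signals["bot_share_ratio"]
             + 0.4 * (1 - signals["avg_user_credibility"]))
    return 1 if score > 0.5 else 0  # 1 = likely fake

texts = [t for t, _ in labeled] + [t for t, _ in unlabeled]
y = [lab for _, lab in labeled] + [weak_label(s) for _, s in unlabeled]

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), y)
print(clf.predict(vec.transform(["free money trick banks hate"])))
```
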
Contributors: Shu, Kai (Author) / Liu, Huan (Thesis advisor) / Bernard, H. Russell (Committee member) / Maciejewski, Ross (Committee member) / Xue, Guoliang (Committee member) / Arizona State University (Publisher)
Created: 2020
Description

Multi-view learning, a subfield of machine learning that aims to improve model performance by training on multiple views of the data, has been studied extensively over the past decades. It is typically applied in contexts where the input features naturally form multiple groups or views. An example of a naturally multi-view context is a data set of websites, where each website is described not only by the text on the page but also by the text of hyperlinks pointing to the page. More recently, various studies have demonstrated the initial success of applying multi-view learning to single-view data with multiple artificially constructed views. However, a systematic study of the effectiveness of such artificially constructed views has been lacking. To bridge this gap, this thesis begins by providing a high-level overview of multi-view learning with the co-training algorithm. Co-training is a classic semi-supervised learning algorithm that takes advantage of both labeled and unlabeled examples in the data set for training. Then, the thesis presents a web-based tool developed in Python that allows users to experiment with and compare the performance of multiple view construction approaches on various data sets. The view construction approaches supported in the web-based tool include subsampling, Optimal Feature Set Partitioning, and the genetic algorithm. Finally, the thesis presents an empirical comparison of the performance of these approaches, not only against one another, but also against traditional single-view models. The findings show that a simple subsampling approach combined with co-training often outperforms both the other view construction approaches and traditional single-view methods.
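
A compact sketch of co-training with two views constructed by subsampling the feature set (one of the view-construction approaches named above). The data are synthetic, and the details (classifier, number of rounds, confidence rule) are assumptions rather than the thesis tool.

```python
# Hypothetical sketch of co-training with two artificially constructed views
# obtained by randomly subsampling features. Each view's classifier adds a
# pseudo-labeled example it is confident about to the shared labeled pool.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
rng = np.random.default_rng(0)
view1 = rng.choice(20, size=10, replace=False)   # feature subsets as "views"
view2 = rng.choice(20, size=10, replace=False)

lab_idx = list(range(30))        # small initial labeled pool
y_lab = list(y[:30])             # the only true labels the learner sees
unlab = list(range(30, 300))

for _ in range(20):              # a few co-training rounds
    clf1 = GaussianNB().fit(X[lab_idx][:, view1], y_lab)
    clf2 = GaussianNB().fit(X[lab_idx][:, view2], y_lab)
    for clf, view in ((clf1, view1), (clf2, view2)):
        if not unlab:
            break
        probs = clf.predict_proba(X[unlab][:, view])
        best = int(np.argmax(probs.max(axis=1)))      # most confident example
        idx = unlab.pop(best)
        lab_idx.append(idx)
        y_lab.append(int(clf.predict(X[[idx]][:, view])[0]))  # pseudo-label

final = GaussianNB().fit(X[lab_idx][:, view1], y_lab)
print(f"view-1 accuracy after co-training: {final.score(X[:, view1], y):.2f}")
```
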
Contributors: Aksoy, Kaan (Author) / Maciejewski, Ross (Thesis director) / He, Jingrui (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)
Created: 2019-12
Description

Automatic detection of extruded features such as rooftops and trees in aerial images is a very active area of research. Elevated features identified from aerial imagery have potential applications in urban planning and in identifying cover for military or flight training. Detecting such features using commonly available geospatial data like orthographic aerial imagery is very challenging because rooftop and tree textures are often camouflaged by similar-looking features such as roads, ground, and grass. Therefore, additional data such as LIDAR, multispectral imagery, and multiple viewpoints are often exploited for more accurate detection. However, such data are often unavailable, improperly registered, or inaccurate. In this thesis, we discuss a novel framework that uses only orthographic images for the detection and modeling of rooftops. A segmentation scheme is proposed that initializes by assigning either foreground (rooftop) or background labels to certain pixels in the image based on shadows; it then employs GrabCut to assign one of those two labels to the remaining pixels based on the initial labeling. Parametric model fitting is performed on the segmented results in order to create a 3D scene and to facilitate roof-shape and height estimation. The framework can also benefit from additional geospatial data such as street maps and LIDAR, if available.
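
The shadow-seeded GrabCut step can be illustrated with OpenCV's mask-initialized GrabCut. In the sketch below the seed pixels are simply hard-coded on a synthetic image, whereas the thesis derives them from shadow analysis.

```python
# Hypothetical sketch: GrabCut segmentation seeded from a few pre-labeled
# foreground/background pixels on a synthetic image. The seeds are hard-coded
# here; the thesis derives the initialization from shadows.
import numpy as np
import cv2

rng = np.random.default_rng(0)
img = rng.integers(30, 60, (100, 100, 3)).astype(np.uint8)                 # dark "ground"
img[30:70, 30:70] = rng.integers(160, 200, (40, 40, 3)).astype(np.uint8)   # bright "roof"

mask = np.full(img.shape[:2], cv2.GC_PR_BGD, dtype=np.uint8)  # unknown: probable background
mask[45:55, 45:55] = cv2.GC_FGD   # seed: definite rooftop pixels
mask[0:10, 0:10] = cv2.GC_BGD     # seed: definite background pixels

bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)
cv2.grabCut(img, mask, None, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_MASK)

rooftop = np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))
print("rooftop pixels found:", int(rooftop.sum()))
```
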
Contributors: Khanna, Kunal (Author) / Femiani, John (Thesis advisor) / Wonka, Peter (Thesis advisor) / Razdan, Anshuman (Committee member) / Maciejewski, Ross (Committee member) / Arizona State University (Publisher)
Created: 2013
Description

Critical infrastructures in healthcare, power systems, and web services incorporate cyber-physical systems (CPSes), in which software-controlled computing systems interact with the physical environment through actuation and monitoring. Ensuring software safety in CPSes, to avoid hazards to property and human life as a result of uncontrolled interactions, is essential and challenging. The principal hurdle in this regard is the characterization of the context-driven interactions between software and the physical environment (cyber-physical interactions), which introduce multi-dimensional dynamics in space and time, complex non-linearities, and non-trivial aggregation of interactions in the case of networked operations. Traditionally, CPS software is tested for safety either through experimental trials, which can be expensive, incomprehensive, and hazardous, or through static analysis of code, which ignores the cyber-physical interactions. This thesis considers model-based engineering, a paradigm widely used in different disciplines of engineering, for safety verification of CPS software and contributes to three fundamental phases: a) modeling, building abstractions or models that characterize cyber-physical interactions in a mathematical framework; b) analysis, reasoning about safety based on properties of the model; and c) synthesis, implementing models on standard testbeds for performing preliminary experimental trials. In this regard, CPS modeling techniques are proposed that can accurately capture context-driven, spatio-temporal, aggregate cyber-physical interactions. Different levels of abstraction are considered, which result in high-level architectural models or more detailed formal behavioral models of CPSes. The outcomes include a well-defined architectural specification framework called CPS-DAS and a novel spatio-temporal formal model called Spatio-Temporal Hybrid Automata (STHA) for CPSes. Model analysis techniques are proposed for the CPS models, which can simulate the effects of dynamic context changes on non-linear spatio-temporal cyber-physical interactions and characterize aggregate effects. The outcomes include tractable algorithms for simulation analysis and for theoretically proving safety properties of CPS software. Lastly, a software synthesis technique is proposed that can automatically convert high-level architectural models of CPSes in the healthcare domain into implementations in high-level programming languages. The outcome is a tool called Health-Dev that can synthesize software implementations of CPS models in healthcare for experimental verification of safety properties.
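
As a generic, toy illustration of simulating a behavioral model and checking a safety property on its trajectory (not CPS-DAS, STHA, or Health-Dev themselves), the sketch below runs a two-mode hybrid automaton in which a discrete controller reacts to a continuous physical variable.

```python
# Hypothetical sketch: simulate a two-mode hybrid automaton (ON/OFF control of
# a continuous variable) and check a safety envelope along the trajectory.
# A generic illustration; not the models or tools from the dissertation.
def simulate(t_end=60.0, dt=0.01, value=25.0, mode="OFF"):
    trace = []
    t = 0.0
    while t < t_end:
        # Continuous dynamics differ per discrete mode (the "physical" part).
        value += (0.8 if mode == "ON" else -0.3) * dt
        # Guarded mode switches (the "cyber" part reacting to measurements).
        if mode == "OFF" and value <= 24.0:
            mode = "ON"
        elif mode == "ON" and value >= 27.0:
            mode = "OFF"
        trace.append((t, mode, value))
        t += dt
    return trace

trace = simulate()
safe = all(20.0 <= v <= 30.0 for _, _, v in trace)  # safety property on the trace
print("safety property holds on this trajectory:", safe)
```
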
Contributors: Banerjee, Ayan (Author) / Gupta, Sandeep K.S. (Thesis advisor) / Poovendran, Radha (Committee member) / Fainekos, Georgios (Committee member) / Maciejewski, Ross (Committee member) / Arizona State University (Publisher)
Created: 2012
Description

When looking at drawings of graphs, questions about graph density, community structures, local clustering, and other graph properties may be of critical importance for analysis. While graph layout algorithms have focused on layout properties such as edge crossings and symmetry, little is known about how these algorithms relate to a user's ability to perceive graph properties for a given layout. This study applies previously established methodologies for perceptual analysis to identify which graph drawing layout will help the user best perceive a particular graph property. A large-scale (n = 588) crowdsourced experiment is conducted to investigate whether the perception of two graph properties (graph density and average local clustering coefficient) can be modeled using Weber's law. Three graph layout algorithms from three representative classes (force-directed (FD), circular, and multidimensional scaling (MDS)) are studied, and the results of this experiment establish the precision of judgment for these graph layouts and properties. The findings demonstrate that the perception of graph density can be modeled with Weber's law. Furthermore, the perception of the average clustering coefficient can be modeled as an inverse of Weber's law, and the MDS layout showed a significantly different precision of judgment than the FD layout.
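
For reference, Weber's law, which the perception studies above fit to their JND data, can be written as follows. The inverse form shown second is only a hedged guess at what "an inverse of Weber's law" might look like, not a formula taken from the thesis.

```latex
% Weber's law: the just noticeable difference (JND) is a constant fraction k
% of the base level I of the stimulus (here, the graph property value).
\mathrm{JND}(I) = k \, I \qquad\Longleftrightarrow\qquad \frac{\Delta I}{I} = k

% A possible "inverse of Weber's law" form, with the JND shrinking as the base
% level grows (illustrative assumption only):
\mathrm{JND}(I) = \frac{k}{I}
```
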
Contributors: Soni, Utkarsh (Author) / Maciejewski, Ross (Thesis advisor) / Kobourov, Stephen (Committee member) / Sefair, Jorge (Committee member) / Arizona State University (Publisher)
Created: 2018