Matching Items (406)
Contributors: Ward, Geoffrey Harris (Performer) / ASU Library. Music Library (Publisher)
Created: 2018-03-18
Description
As the size and scope of valuable datasets have exploded across many industries and fields of research in recent years, an increasingly diverse audience has sought out effective tools for their large-scale data analytics needs. Over this period, machine learning researchers have also been very prolific in designing improved algorithms which are capable of finding the hidden structure within these datasets. As consumers of popular Big Data frameworks have sought to apply and benefit from these improved learning algorithms, the problems encountered with the frameworks have motivated a new generation of Big Data tools to address the shortcomings of the previous generation. One important example of this is the newer tools' improved performance on the large class of machine learning algorithms which are highly iterative in nature. In this thesis project, I set out to implement a low-rank matrix completion algorithm (as an example of a highly iterative algorithm) within a popular Big Data framework, and to evaluate its performance processing the Netflix Prize dataset. I begin by describing several approaches which I attempted, but which did not perform adequately. These include an implementation of the Singular Value Thresholding (SVT) algorithm within the Apache Mahout framework, which runs on top of the Apache Hadoop MapReduce engine. I then describe an approach which uses the Divide-Factor-Combine (DFC) algorithmic framework to parallelize the state-of-the-art low-rank completion algorithm Orthogonal Rank-One Matrix Pursuit (OR1MP) within the Apache Spark engine. I describe the results of a series of tests running this implementation with the Netflix dataset on clusters of various sizes, with various degrees of parallelism. For these experiments, I utilized the Amazon Elastic Compute Cloud (EC2) web service. In the final analysis, I conclude that the Spark DFC + OR1MP implementation does indeed produce competitive results, in both accuracy and performance.
In particular, the Spark implementation performs nearly as well as the MATLAB implementation of OR1MP without any parallelism, and improves performance to a significant degree as the parallelism increases. In addition, the experience demonstrates how Spark's flexible programming model makes it straightforward to implement this parallel and iterative machine learning algorithm.
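As a point of reference, the Divide-Factor-Combine pattern described in this abstract can be sketched in a few lines of NumPy. This is a schematic illustration only, not the thesis implementation: truncated SVD stands in for the OR1MP base solver, and plain arrays stand in for Spark RDDs.

```python
import numpy as np

def factor_block(block, rank):
    # Base low-rank solver; the thesis uses OR1MP here, and truncated
    # SVD is a stand-in that keeps this sketch self-contained.
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank, :]

def dfc_complete(M, rank, n_blocks):
    # Divide: split the columns into submatrices (Spark partitions in the thesis).
    blocks = np.array_split(M, n_blocks, axis=1)
    # Factor: solve each subproblem independently (in parallel on a cluster).
    factored = [factor_block(b, rank) for b in blocks]
    # Combine: project every block estimate onto the column space of the first.
    C = factored[0][0]
    P = C @ np.linalg.pinv(C)
    return np.hstack([P @ (L @ R) for L, R in factored])

rng = np.random.default_rng(0)
M = rng.standard_normal((40, 5)) @ rng.standard_normal((5, 60))  # rank-5 matrix
M_hat = dfc_complete(M, rank=5, n_blocks=3)
```

On an exactly rank-5 test matrix the combined estimate recovers the input almost exactly, which is the property DFC exploits when each block is solved in parallel.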
Contributors: Krouse, Brian (Author) / Ye, Jieping (Thesis advisor) / Liu, Huan (Committee member) / Davulcu, Hasan (Committee member) / Arizona State University (Publisher)
Created: 2014
Description
The processing of large volumes of RDF data requires an efficient storage and query processing engine that can scale well with the volume of data. The initial attempts to address this issue focused on optimizing native RDF stores as well as conventional relational database management systems. But as the volume of RDF data grew to exponential proportions, the limitations of these systems became apparent and researchers began to focus on using big data analysis tools, most notably Hadoop, to process RDF data. Various studies and benchmarks that evaluate these tools for RDF data processing have been published. In the past two and a half years, however, heavy users of big data systems, like Facebook, noted limitations with the query performance of these big data systems and began to develop new distributed query engines for big data that do not rely on map-reduce. Facebook's Presto is one such example.

This thesis deals with evaluating the performance of Presto in processing big RDF data against Apache Hive. A comparative analysis was also conducted against 4store, a native RDF store. To evaluate the performance of Presto for big RDF data processing, a map-reduce program and a compiler, based on Flex and Bison, were implemented. The map-reduce program loads RDF data into HDFS while the compiler translates SPARQL queries into a subset of SQL that Presto (and Hive) can understand. The evaluation was done on four- and eight-node Linux clusters installed on the Microsoft Windows Azure platform with RDF datasets of size 10, 20, and 30 million triples. The results of the experiment show that Presto has much higher performance than Hive and can be used to process big RDF data. The thesis also proposes an architecture based on Presto, Presto-RDF, that can be used to process big RDF data.
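The heart of such a translation can be illustrated with a small sketch, far simpler than the Flex/Bison compiler described above and purely hypothetical in its details: each triple pattern of a SPARQL basic graph pattern becomes one self-join over a three-column triples(s, p, o) table, which is plain SQL that both Presto and Hive accept.

```python
def bgp_to_sql(patterns):
    """Translate a SPARQL basic graph pattern (a list of (s, p, o)
    triple patterns, variables prefixed with '?') into SQL over a
    triples(s, p, o) table, one self-join per pattern."""
    select, where, seen = [], [], {}
    for i, (s, p, o) in enumerate(patterns):
        t = f"t{i}"
        for col, term in (("s", s), ("p", p), ("o", o)):
            if term.startswith("?"):
                if term in seen:
                    # Repeated variable: emit a join condition.
                    where.append(f"{t}.{col} = {seen[term]}")
                else:
                    seen[term] = f"{t}.{col}"
                    select.append(f"{seen[term]} AS {term[1:]}")
            else:
                # Constant term: emit a filter.
                where.append(f"{t}.{col} = '{term}'")
    froms = ", ".join(f"triples t{i}" for i in range(len(patterns)))
    return f"SELECT {', '.join(select)} FROM {froms} WHERE {' AND '.join(where)}"
```

For example, the pattern `?x foaf:knows ?y . ?y foaf:name ?n` becomes a two-way self-join with the shared variable `?y` expressed as `t1.s = t0.o`.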
Contributors: Mammo, Mulugeta (Author) / Bansal, Srividya (Thesis advisor) / Bansal, Ajay (Committee member) / Lindquist, Timothy (Committee member) / Arizona State University (Publisher)
Created: 2014
Description
As urban populations become increasingly dense, massive amounts of new 'big' data that characterize human activity are being made available, and these data tend to have a large volume of observations, to be produced in real-time or near real-time, and to include a diverse variety of information. In particular, spatial interaction (SI) data - a collection of human interactions across a set of origin and destination locations - present unique challenges for distilling big data into insight. Therefore, this dissertation identifies some of the potential and pitfalls associated with new sources of big SI data. It also evaluates methods for modeling SI to investigate the relationships that drive SI processes, in order to focus on human behavior rather than data description.

A critical review of the existing SI modeling paradigms is first presented, which also highlights features of big data that are particular to SI data. Next, a simulation experiment is carried out to evaluate three different statistical modeling frameworks for SI data that are supported by different underlying conceptual frameworks. Then, two approaches are taken to identify the potential and pitfalls associated with two newer sources of data from New York City - bike-share cycling trips and taxi trips. The first approach builds a model of commuting behavior using a traditional census data set and then compares the results for the same model when it is applied to these newer data sources. The second approach examines how the increased temporal resolution of big SI data may be incorporated into SI models.

Several important results are obtained through this research. First, it is demonstrated that different SI models account for different types of spatial effects and that the Competing Destinations framework seems to be the most robust for capturing spatial structure effects. Second, newer sources of big SI data are shown to be very useful for complementing traditional sources of data, though they are not sufficient substitutes. Finally, it is demonstrated that the increased temporal resolution of new data sources may usher in a new era of SI modeling that allows us to better understand the dynamics of human behavior.
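For readers unfamiliar with SI models, the unconstrained gravity model, the simplest member of the family evaluated above, can be written in a few lines. This is a textbook sketch with assumed parameter names, not the dissertation's model specification:

```python
import numpy as np

def gravity_flows(O, D, dist, alpha=1.0, beta=2.0):
    """Unconstrained gravity model: T_ij is proportional to
    O_i * D_j**alpha * d_ij**(-beta), where O holds origin outflows,
    D destination attractiveness, and dist the distance matrix."""
    T = np.outer(O, D ** alpha) * dist ** (-beta)
    # Rescale so total predicted flow matches total observed outflow.
    return T * (O.sum() / T.sum())

# Two origins, two destinations; each origin is near one destination.
O = np.array([100.0, 50.0])
D = np.array([80.0, 120.0])
dist = np.array([[1.0, 4.0],
                 [4.0, 1.0]])
T = gravity_flows(O, D, dist)
```

Even in this toy case, distance decay dominates: the first origin sends most of its flow to the nearby, less attractive destination, which is exactly the kind of spatial effect the competing-destinations and other SI frameworks elaborate on.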
Contributors: Oshan, Taylor Matthew (Author) / Fotheringham, A. S. (Thesis advisor) / Farmer, Carson J.Q. (Committee member) / Rey, Sergio S.J. (Committee member) / Nelson, Trisalyn (Committee member) / Arizona State University (Publisher)
Created: 2017
Contributors: Bolari, John (Performer) / ASU Library. Music Library (Publisher)
Created: 2018-10-04
Description
This work challenges the conventional perceptions surrounding the utility and use of the CMS Open Payments data. Following an exploratory analysis of the 2014 research dataset, I suggest previously unconsidered methodologies for extracting meaningful information from these data that, in turn, enhance their value as a public good. This dataset is favored for analysis over the general payments dataset because generating transparency in the pharmaceutical and medical device R&D process is believed to offer the greatest benefit to public health. The research dataset has been largely ignored by analysts, and this may be one of the few works to have accomplished a comprehensive exploratory analysis of these data. If we are to extract valuable information from this dataset, we must alter our approach and refocus the questions that we ask. Adopting the theoretical framework of complex systems serves as the foundation for our interpretation of the research dataset. This framework, in conjunction with a methodological toolkit for network analysis, may set a precedent for the development of alternative perspectives that allow for novel interpretations of the information that big data attempts to convey. By proposing a novel perspective for interpreting the information this dataset contains, it is possible to gain insight into the emergent dynamics of the collaborative relationships established during the pharmaceutical and medical device R&D process.
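One way to make the network framing above concrete is to treat research payments as a bipartite company-recipient graph and project it onto companies, so that two companies are linked when they fund the same recipient. The sketch below uses hypothetical field names, not the actual CMS Open Payments schema:

```python
from collections import defaultdict
from itertools import combinations

def company_projection(payments):
    """payments: iterable of (company, recipient, amount) rows.
    Returns a weighted edge list over companies, where the weight of
    (a, b) is the number of recipients both companies fund."""
    by_recipient = defaultdict(set)
    for company, recipient, _amount in payments:
        by_recipient[recipient].add(company)
    links = defaultdict(int)
    for companies in by_recipient.values():
        for a, b in combinations(sorted(companies), 2):
            links[(a, b)] += 1  # shared recipients as collaboration weight
    return dict(links)
```

Degree and edge weights in this projection are simple starting points for the kind of emergent collaborative structure the work describes.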
Created: 2016-05
Description
Twitter, the microblogging platform, has grown in prominence to the point that the topics that trend on the network are often the subject of the news and other traditional media. By predicting trends on Twitter, it could be possible to predict the next major topic of interest to the public. With this motivation, this paper develops a model for trends that leverages previous work with k-nearest-neighbors and dynamic time warping. The development of this model provides insight into the length and features of trends, and the model successfully generalizes to identify 74.3% of trends in the time period of interest. The model developed in this work provides understanding of why particular words trend on Twitter.
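The approach described, k-nearest-neighbors classification under a dynamic-time-warping distance, can be sketched as follows. This is an illustrative re-implementation with invented inputs, not the paper's code:

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping distance between two 1-D activity series,
    allowing the series to stretch and compress in time."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def knn_predict(series, references, labels, k=3):
    """Label a new series by majority vote over its k DTW-nearest
    labeled reference series."""
    dists = [dtw(series, r) for r in references]
    nearest = np.argsort(dists)[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```

Given reference series labeled as trending or non-trending topic activity, a new topic's recent activity curve is classified by which labeled curves it warps to most cheaply.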
Contributors: Marshall, Grant A (Author) / Liu, Huan (Thesis director) / Morstatter, Fred (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor) / School of Mathematical and Statistical Sciences (Contributor)
Created: 2015-05
Description
Investment real estate is unique among similar financial instruments by nature of each property's internal complexities and interaction with the external economy. Where a majority of tradable assets are static goods within a dynamic market, real estate investments are dynamic goods within a dynamic market. Furthermore, investment real estate, particularly commercial property, not only interacts with the surrounding economy, it reflects it. Alive with tenancy, each and every commercial investment property provides a microeconomic view of the businesses that make up the local economy. Management of commercial investment real estate captures this economic snapshot in a unique abundance of untapped statistical data. While analysis of such data is undeniably valuable, the efforts involved in this process are time consuming. Given this unutilized potential, our team has developed proprietary software to analyze this data and communicate the results automatically through an easy-to-use interface. We have worked with a local real estate property management and ownership firm, Reliance Management, to develop this system through the use of their current, historical, and future data. Our team has also built a relationship with the executives of Reliance Management to review the functionality and pertinence of the system we have dubbed the Reliance Dashboard.
Contributors: Burton, Daryl (Co-author) / Workman, Jack (Co-author) / LePine, Marcie (Thesis director) / Atkinson, Robert (Committee member) / Barrett, The Honors College (Contributor) / Department of Finance (Contributor) / Department of Management (Contributor) / Computer Science and Engineering Program (Contributor)
Created: 2015-05
Description
In the era of big data, the impact of information technologies in improving organizational performance is growing as unstructured data becomes increasingly important to business intelligence. Daily data gives businesses opportunities to respond to changing markets. As a result, many companies invest heavily in big data in order to obtain valuable outcomes. In particular, analysis of commercial websites may reveal relations among different parties in digital markets that pose great value to businesses. However, complex e-commerce sites present significant challenges for novice web analysts. While some resources and tutorials on web analysis are available for study, some learners, especially entry-level analysts, still struggle to get satisfying results. Thus, I am interested in developing a computer program in the Python programming language for investigating the relation between sellers' listings and their seller levels in a darknet market. To investigate the relation, I couple web data retrieval techniques with doc2vec, a machine learning algorithm. This approach not only allows me to analyze the potential relation between sellers' listings and reputations in the context of darknet markets, but also assists other users of business intelligence with similar analysis of online markets. I present several conclusions found through the analysis. Key findings suggest that no relation exists between the similarities of different sellers' listings and their seller levels in rsClub Market. This study can serve as a unique example of web analysis and create potential value for modern enterprises.
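The listing-similarity step can be illustrated with a lightweight stand-in. The thesis uses doc2vec, which learns dense document embeddings; the sketch below substitutes bag-of-words cosine similarity so the example stays self-contained, and the listing texts are invented:

```python
from collections import Counter
from math import sqrt

def listing_vector(text):
    # Term-frequency vector; doc2vec would instead learn a dense
    # embedding that also captures word order and semantics.
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0
```

With vectors for every seller's listings in hand, one can compare average pairwise similarity across seller levels, the relation the thesis examines.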
Contributors: Wang, Zhen (Author) / Benjamin, Victor (Thesis director) / Santanam, Raghu (Committee member) / Barrett, The Honors College (Contributor)
Created: 2018-05
Description
As the use of Big Data gains momentum and transitions into mainstream adoption, marketers are racing to generate valuable insights that can inform strategic business decisions. The retail market is a fiercely competitive industry, and the rapid adoption of smartphones and tablets has led e-commerce rivals to grow at an unbelievable rate. Retailers are able to collect and analyze data from both their physical stores and e-commerce platforms, placing them in a unique position to fully capitalize on the power of Big Data. This thesis is an examination of Big Data and how marketers can use it to create better experiences for consumers. Insights generated from the use of Big Data can result in increased customer engagement, loyalty, and retention for an organization. Businesses of all sizes, whether enterprise, small-to-midsize, or even solely e-commerce organizations, have successfully implemented Big Data technology. However, there are challenges, as well as ethical and legal concerns, that need to be addressed as the world continues to adopt the use of Big Data analytics and insights. With the abundance of data collected in today's digital world, marketers must take advantage of available resources to improve the overall customer experience.
Contributors: Haghgoo, Sam (Author) / Ostrom, Amy (Thesis director) / Giles, Bret (Committee member) / Barrett, The Honors College (Contributor) / Department of Marketing (Contributor) / W. P. Carey School of Business (Contributor) / Department of Management (Contributor)
Created: 2014-05