This thesis evaluates the performance of Presto in processing big RDF data against Apache Hive. A comparative analysis was also conducted against 4store, a native RDF store. To evaluate the performance of Presto for big RDF data processing, a map-reduce program and a compiler, based on Flex and Bison, were implemented. The map-reduce program loads RDF data into HDFS, while the compiler translates SPARQL queries into a subset of SQL that Presto (and Hive) can understand. The evaluation was done on four- and eight-node Linux clusters on the Microsoft Windows Azure platform, with RDF datasets of 10, 20, and 30 million triples. The results of the experiments show that Presto has much higher performance than Hive and can be used to process big RDF data. The thesis also proposes Presto-RDF, an architecture based on Presto for processing big RDF data.
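To illustrate the kind of translation such a compiler performs, the sketch below maps a SPARQL basic graph pattern to a self-join SQL query over a single triples(s, p, o) table. The table layout, function name, and example query are illustrative assumptions, not the thesis's Flex/Bison implementation.

    # Minimal sketch: translate a SPARQL basic graph pattern into a self-join
    # SQL query over a hypothetical triples(s, p, o) table.

    def bgp_to_sql(patterns):
        """patterns: list of (s, p, o) strings; tokens starting with '?' are variables."""
        selects, froms, wheres = [], [], []
        var_bindings = {}  # variable name -> first column that binds it
        for i, (s, p, o) in enumerate(patterns):
            alias = f"t{i}"
            froms.append(f"triples {alias}")
            for col, term in (("s", s), ("p", p), ("o", o)):
                ref = f"{alias}.{col}"
                if term.startswith("?"):       # variable: join or project
                    if term in var_bindings:
                        wheres.append(f"{ref} = {var_bindings[term]}")
                    else:
                        var_bindings[term] = ref
                        selects.append(f"{ref} AS {term[1:]}")
                else:                          # constant: filter
                    wheres.append(f"{ref} = '{term}'")
        sql = f"SELECT {', '.join(selects)} FROM {', '.join(froms)}"
        if wheres:
            sql += " WHERE " + " AND ".join(wheres)
        return sql

    # Example: find each ?person who knows someone named "Alice"
    print(bgp_to_sql([("?person", "foaf:knows", "?x"),
                      ("?x", "foaf:name", "Alice")]))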
A critical review of the existing SI modeling paradigms is first presented, which also highlights features of big data that are particular to SI data. Next, a simulation experiment is carried out to evaluate three different statistical modeling frameworks for SI data that are supported by different underlying conceptual frameworks. Then, two approaches are taken to identify the potential and pitfalls associated with two newer sources of data from New York City: bike-share cycling trips and taxi trips. The first approach builds a model of commuting behavior using a traditional census data set and then compares the results for the same model when it is applied to these newer data sources. The second approach examines how the increased temporal resolution of big SI data may be incorporated into SI models.
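For context, the workhorse of SI modeling is the gravity model, in which flows are log-linear in origin mass, destination mass, and distance, and are commonly fit as a Poisson regression. A minimal sketch on synthetic data follows (statsmodels; the variable names and parameter values are illustrative, not from the dissertation).

    # Illustrative gravity-style spatial interaction model: flows modeled as a
    # Poisson GLM, log-linear in origin mass, destination mass, and distance.
    # Synthetic data; the dissertation's frameworks and data differ.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 500
    origin_mass = rng.uniform(1e3, 1e5, n)
    dest_mass = rng.uniform(1e3, 1e5, n)
    distance = rng.uniform(1, 50, n)

    # True process: T_ij ~ Poisson(k * O_i^a * D_j^b * d_ij^-c)
    mu = 1e-4 * origin_mass**0.8 * dest_mass**0.7 * distance**-1.5
    flows = rng.poisson(mu)

    X = sm.add_constant(np.column_stack([np.log(origin_mass),
                                         np.log(dest_mass),
                                         np.log(distance)]))
    fit = sm.GLM(flows, X, family=sm.families.Poisson()).fit()
    print(fit.params)  # estimates of log k, a, b, -c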
Several important results are obtained through this research. First, it is demonstrated that different SI models account for different types of spatial effects and that the Competing Destinations framework appears to be the most robust for capturing spatial structure effects. Second, newer sources of big SI data are shown to be very useful for complementing traditional sources of data, though they are not sufficient substitutes. Finally, it is demonstrated that the increased temporal resolution of new data sources may usher in a new era of SI modeling that allows us to better understand the dynamics of human behavior.
The first part of the dissertation considers a new method for modeling social networks via probabilistic graphical models. Specifically, this new method employs the t-cherry junction tree, a recent advancement in probabilistic graphical models, to develop a compact representation and good approximation of an otherwise intractable probabilistic model. This approach has a number of advantages: 1) the best approximation possible via junction trees belongs to the class of t-cherry junction trees; 2) constructing a t-cherry junction tree can be largely parallelized; and 3) inference can be performed using distributed computation. To improve the quality of approximation, an algorithm is developed to build a higher-order tree gracefully from an existing one, without constructing it from scratch. This approach is applied to Twitter data containing 100,000 nodes to study the problem of recommending connections to new users.
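For intuition, the order-2 special case of a t-cherry junction tree coincides with the classic Chow-Liu dependence tree: a maximum-weight spanning tree over pairwise mutual information. The sketch below builds one from binary samples (numpy and networkx; illustrative only, not the thesis's parallel higher-order construction).

    # Chow-Liu tree: the order-2 special case of a t-cherry junction tree.
    # Builds a maximum-weight spanning tree over pairwise mutual information
    # estimated from binary samples.
    import numpy as np
    import networkx as nx

    def mutual_information(x, y):
        """Empirical MI (in nats) between two binary arrays."""
        mi = 0.0
        for a in (0, 1):
            for b in (0, 1):
                pxy = np.mean((x == a) & (y == b))
                px, py = np.mean(x == a), np.mean(y == b)
                if pxy > 0:
                    mi += pxy * np.log(pxy / (px * py))
        return mi

    rng = np.random.default_rng(1)
    samples = rng.integers(0, 2, size=(1000, 5))  # 1000 samples, 5 binary variables
    samples[:, 1] = samples[:, 0] ^ (rng.random(1000) < 0.1)  # X1 depends on X0

    G = nx.Graph()
    d = samples.shape[1]
    for i in range(d):
        for j in range(i + 1, d):
            G.add_edge(i, j, weight=mutual_information(samples[:, i], samples[:, j]))

    tree = nx.maximum_spanning_tree(G)
    print(sorted(tree.edges()))  # edges of the best tree-structured approximation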
Next, the t-cherry junction tree framework is extended by considering the impact of estimating the involved distributions from a training data set. Understanding this impact is vital to real-world applications, where distributions are not known perfectly but are instead estimated from training data. First, the loss in fidelity of the t-cherry junction tree approximation due to this estimation is quantified. Then the scaling behavior, in terms of the size of the t-cherry junction tree, is approximated to show that higher-order t-cherry junction trees, which are higher-fidelity approximations under perfect information, may actually result in decreased fidelity because of the difficulty of accurately estimating higher-dimensional distributions. Finally, this part concludes by demonstrating these findings in a distributed detection setting in which the sensors' measurements are correlated.
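The estimation difficulty has a simple empirical face: with a fixed training set, the total error of an empirical k-way marginal grows with k. A toy demonstration on uniform binary data follows (the uniform ground truth and sample size are illustrative assumptions; the thesis quantifies this scaling analytically).

    # Why higher-order t-cherry junction trees can lose fidelity in practice:
    # with a fixed training set, the L1 error of an empirical k-way marginal
    # grows with the order k.
    import numpy as np
    from itertools import product

    rng = np.random.default_rng(2)
    n_train = 500

    for k in range(1, 7):
        data = rng.integers(0, 2, size=(n_train, k))
        true_p = 1.0 / 2**k                      # uniform ground truth
        l1 = 0.0
        for cell in product((0, 1), repeat=k):
            emp = np.mean(np.all(data == cell, axis=1))
            l1 += abs(emp - true_p)
        print(f"order {k}: L1 estimation error = {l1:.3f}")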
Having developed a framework to model social structure in online social networks, the study then highlights two approaches for utilizing this social network data in existing wireless communication networks. The first approach is a novel application: using social networks to enhance device-to-device wireless communication. It is well known that wireless communication can be significantly improved by utilizing relays to aid in transmission. Rather than deploying dedicated relays, a system is designed in which users relay traffic for other users if there is a shared social trust between them, e.g., they are ``friends'' on Facebook; for users that do not share social trust, a coalitional game framework motivates users to relay traffic for each other. This framework guarantees that all users improve their throughput via relaying, while ensuring that each user functions as a relay only if there is a social trust relationship or, absent social trust, a cycle of reciprocity is established in which a set of users agree to relay for each other. This new system shows significant throughput gain in simulated networks that utilize real-world social network traces.
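A cycle of reciprocity can be pictured as a directed cycle in a "who benefits from whom" graph. The toy sketch below finds such cycles with networkx; the edge semantics and data are invented for illustration and greatly simplify the thesis's coalitional game.

    # Toy illustration of a "cycle of reciprocity": edge u -> v means "u gains
    # throughput if v relays for u" (hypothetical data). A directed cycle among
    # users without pairwise social trust is a set that can agree to relay for
    # one another.
    import networkx as nx

    benefit = nx.DiGraph()
    benefit.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])

    cycles = list(nx.simple_cycles(benefit))
    print(cycles)  # e.g., [['a', 'b', 'c']]: a, b, c can form a reciprocal coalition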
The second application of social structure to wireless communication is an approach to reducing congestion in cellular networks during peak times. This is achieved by two means: preloading and offloading. Preloading refers to the process of using social network data to predict user demand and serve some users early, before the cellular network traffic peaks. Offloading allows users that have already obtained a copy of the content to opportunistically serve other users via device-to-device communication, thus eliminating the need for some cellular traffic. These two methods work especially well in tandem, as preloading creates a base of users that can serve later users via offloading. Under ideal conditions, these two processes can greatly reduce peak cellular traffic; for a more realistic setting, the impact of uncertainty in human mobility and in the social network structure is analyzed. Even with the randomness inherent in these processes, both preloading and offloading offer substantial improvement. Finally, potential difficulties in preloading multiple pieces of content simultaneously are highlighted, and a heuristic method to solve these challenges is developed.
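A back-of-the-envelope calculation shows why the two mechanisms compound. All quantities below are made-up illustrative parameters, not results from the dissertation.

    # Toy peak-load calculation: preloading serves some requests before the
    # peak, and each preloaded user can opportunistically offload nearby
    # requests during the peak via device-to-device links.
    n_users = 1000          # users requesting the content at peak
    preload_frac = 0.2      # fraction served early from demand prediction
    offload_per_seed = 2    # peak requests each preloaded user can serve via D2D

    preloaded = int(n_users * preload_frac)
    offloaded = min(preloaded * offload_per_seed, n_users - preloaded)
    peak_cellular = n_users - preloaded - offloaded
    print(f"peak cellular requests: {peak_cellular} of {n_users}")  # 400 of 1000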
The first part characterizes differential privacy, identifiability, and mutual-information privacy through their privacy--distortion functions, each giving the optimal achievable privacy level as a function of the maximum allowable distortion. The results show that these notions are fundamentally related and exhibit a certain consistency: (1) the gap between the privacy--distortion functions of identifiability and differential privacy is upper bounded by a constant determined by the prior; (2) identifiability and mutual-information privacy share the same optimal mechanism; (3) the mutual-information optimal mechanism satisfies differential privacy with a level at most a constant away from the optimal level.
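In symbols, a privacy--distortion function can be formalized as below (an illustrative formalization; the exact privacy functional $\epsilon(\cdot)$ differs across the three notions studied).

    % Privacy--distortion function: the best privacy level achievable by any
    % release mechanism p_{\hat X | X} with expected distortion at most D.
    \[
      \epsilon^{*}(D) \;=\;
      \inf_{p_{\hat X \mid X}\,:\; \mathbb{E}\left[d(X,\hat X)\right] \le D}
      \epsilon\bigl(p_{\hat X \mid X}\bigr)
    \]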
The second part studies a market model of trading private data, where a data collector purchases private data from strategic data subjects (individuals) through an incentive mechanism. The value of epsilon units of privacy is measured by the minimum payment such that an individual's equilibrium strategy is to report data in an epsilon-differentially private manner. For the setting with binary private data that represents individuals' knowledge about a common underlying state, asymptotically tight lower and upper bounds on the value of privacy are established as the number of individuals becomes large, and the payment--accuracy tradeoff for learning the state is obtained. The lower bound establishes the impossibility of buying epsilon units of privacy with a lower payment, and the upper bound is achieved by a designed reward mechanism. When the individuals' valuations of privacy are unknown to the data collector, mechanisms with possibly negative payments (aiming to penalize individuals with "unacceptably" high privacy valuations) are designed to fulfill the accuracy goal and drive the total payment to zero. For the setting with binary private data following a general joint probability distribution with some symmetry, asymptotically optimal mechanisms are designed in the high data quality regime.
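The canonical way to report a binary value in an epsilon-differentially private manner is randomized response, the reporting strategy the payments above are designed to induce. A minimal sketch (not the thesis's mechanism design):

    # Randomized response: report the true bit with probability
    # e^eps / (1 + e^eps), which satisfies epsilon-differential privacy.
    import math
    import random

    def randomized_response(bit, epsilon):
        p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
        return bit if random.random() < p_truth else 1 - bit

    # With epsilon = 0.5, each report is only mildly informative about the truth.
    reports = [randomized_response(1, 0.5) for _ in range(10000)]
    print(sum(reports) / len(reports))  # approx e^0.5 / (1 + e^0.5) = 0.62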
First, while an increasing amount of investigation has been done in this important area, most existing work concentrates on efficiency instead of search quality and may fail to deliver high-quality results from a semantic perspective. The majority of existing work generates minimal subgraph results that are oblivious to the entity and relationship semantics embedded in the data and in the user query. There are also studies that define results to be subtrees or subgraphs that contain all query keywords but are not necessarily ``minimal''. However, such result-construction methods suffer from the same problem of semantic misalignment between data and user query. This work studies the semantics of how to {\em define} results that capture users' search intention, and then the generation of search-intention-aware results.
Second, most existing research is incapable of handling large-scale structured data. As data volumes have grown rapidly in recent years, the problem of how to efficiently process keyword queries on large-scale structured data has become important. MapReduce is widely acknowledged as an effective programming model for processing big data. For keyword query processing on a data graph, graph algorithms are first proposed that efficiently return query results consistent with users' search intention; these algorithms are then migrated to MapReduce to support big data. For keyword query processing on a schema graph, a keyword query is first transformed into multiple SQL queries, and all generated SQL queries are then run on the structured data. It is therefore crucial to find the optimal way to execute a SQL query using MapReduce, minimizing the processing time. In this work, a system called SOSQL is developed which generates the optimal MapReduce query execution plan for a SQL query $Q$ in time $O(n^2)$, where $n$ is the number of input tables of $Q$.
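To give a feel for plan search over $n$ input tables, the sketch below greedily orders pairwise joins by an estimated intermediate-result size, scanning all remaining tables at each of n steps (O(n^2) comparisons). The cost model, uniform selectivity, and table names are invented for illustration; SOSQL's actual optimizer differs.

    # Toy left-deep join ordering: at each step, greedily pick the next table
    # that minimizes the estimated intermediate result (n steps x n candidates
    # = O(n^2)). A simplified stand-in for SOSQL's plan search.
    def order_joins(table_sizes, selectivity=0.01):
        remaining = dict(table_sizes)
        current = min(remaining, key=remaining.get)   # start from smallest table
        plan, size = [current], remaining.pop(current)
        while remaining:
            nxt = min(remaining, key=lambda t: size * remaining[t] * selectivity)
            size = size * remaining.pop(nxt) * selectivity
            plan.append(nxt)
        return plan, size

    plan, est = order_joins({"users": 10_000, "orders": 1_000_000, "items": 5_000})
    print(plan, f"estimated final size = {est:,.0f} rows")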
Embedded within the regression framework, local models can estimate conditioned relationships between observed spatial phenomena and hypothesized explanatory variables and help infer the intangible spatial processes that contribute to the observed spatial patterns. Rather than investigating averaged characteristics of processes over space, as global models do, these models estimate a surface of spatially varying parameters with a value for each location. Additionally, some models, such as variants within the Geographically Weighted Regression (GWR) framework, also estimate a parameter representing the spatial scale across which the processes vary, capturing the inherent heterogeneity of the estimated surfaces. Since different processes tend to operate at unique spatial scales, some extensions to local models, such as Multiscale GWR (MGWR), estimate unique scales of association for each predictor in a model and generate significantly more information on the nature of geographic processes than their predecessors. However, developments within the realm of local models are fairly nascent, and hence an understanding of their correct application, as well as of their true potential for exploring fundamental spatial science issues, is under-developed. The techniques within these frameworks are also currently limited, restricting the kinds of data that can be analyzed with these models. The goal of this dissertation is therefore to advance techniques within local multiscale modeling, specifically by developing new diagnostics, exploring their novel application to long-standing issues concerning spatial scale, and expanding the tool base to allow their use in wider empirical applications. This goal is realized through three distinct research objectives over four chapters, followed by a discussion on the future of developments within local multiscale modeling. A correct understanding of the capability and promise of local multiscale models, and an expansion of the fields where they can be employed, will not only enhance geographical research by strengthening intuition about the nature of geographic processes, but will also exemplify the importance of such tools, bringing quantitative spatial science to the fore.
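The core of GWR is a weighted least-squares fit at each calibration location, with weights from a distance-decay kernel whose bandwidth is the estimated scale of the process (MGWR assigns each covariate its own bandwidth). A minimal single-location sketch on synthetic data follows; for real analyses see packages such as mgwr.

    # Single-location GWR fit: weighted least squares with Gaussian kernel
    # weights; the bandwidth controls the spatial scale of the local estimate.
    import numpy as np

    rng = np.random.default_rng(3)
    n = 200
    coords = rng.uniform(0, 10, size=(n, 2))
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    beta1 = 0.5 + 0.2 * coords[:, 0]          # spatially varying slope (east-west)
    y = X[:, 0] * 1.0 + X[:, 1] * beta1 + rng.normal(scale=0.1, size=n)

    def gwr_at(point, bandwidth):
        d = np.linalg.norm(coords - point, axis=1)
        w = np.exp(-0.5 * (d / bandwidth) ** 2)   # Gaussian kernel weights
        W = np.diag(w)
        return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

    print(gwr_at(np.array([1.0, 5.0]), bandwidth=2.0))  # local intercept, slope (west)
    print(gwr_at(np.array([9.0, 5.0]), bandwidth=2.0))  # slope larger in the east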
In the first part, nonlinear regression models were embedded into a multistage workflow to predict the spatial abundance of reef fish species in the Gulf of Mexico. There were two challenges: zero-inflated data and out-of-sample prediction. The methods and models in the workflow could effectively handle the zero-inflated sampling data without strong assumptions, and three strategies were proposed to solve the out-of-sample prediction problem. The results and discussion showed that the nonlinear predictions had high accuracy and low bias and performed well across multiple resolutions.
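A common way to handle zero inflation without strong distributional assumptions is a two-part (hurdle-style) workflow: classify presence/absence, regress abundance on the positive records, and combine the two for prediction. The sketch below uses random forests on synthetic data; the covariates and models are illustrative, not the thesis's workflow.

    # Two-part (hurdle-style) workflow for zero-inflated abundance data.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

    rng = np.random.default_rng(4)
    n = 2000
    X = rng.normal(size=(n, 3))                      # e.g., depth, temperature, habitat
    presence = (X[:, 0] + rng.normal(size=n)) > 0.5  # many true zeros
    abundance = np.where(presence,
                         np.exp(1 + 0.8 * X[:, 1] + rng.normal(0, 0.3, n)), 0.0)

    clf = RandomForestClassifier(random_state=0).fit(X, presence)
    pos = abundance > 0
    reg = RandomForestRegressor(random_state=0).fit(X[pos], np.log(abundance[pos]))

    # Expected abundance = P(presence) * E[abundance | presence]
    # (ignoring retransformation bias from the log scale)
    X_new = rng.normal(size=(5, 3))
    pred = clf.predict_proba(X_new)[:, 1] * np.exp(reg.predict(X_new))
    print(pred)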
In the second part, a two-stage spatial regression model was proposed for analyzing soil carbon stock (SOC) data. The first stage used a spatial linear mixed model to capture the linear and stationary effects; the second stage used a generalized additive model to explain the nonlinear and nonstationary effects. The results illustrated that the two-stage model had good interpretability for understanding the effects of covariates while maintaining prediction accuracy competitive with popular machine learning models such as random forest, XGBoost, and support vector machines.
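The two-stage idea can be pictured with a one-dimensional toy: fit the linear part first, then smooth the residuals to recover the nonlinear component. A univariate spline stands in here for the spatial mixed model and GAM of the thesis; the data are synthetic.

    # Sketch of a two-stage fit: stage 1 captures the linear effect, stage 2
    # models the remaining nonlinear signal in the residuals.
    import numpy as np
    from scipy.interpolate import UnivariateSpline

    rng = np.random.default_rng(5)
    x = np.sort(rng.uniform(0, 10, 300))
    y = 2.0 + 0.5 * x + np.sin(x) + rng.normal(0, 0.2, 300)  # linear + nonlinear

    # Stage 1: linear fit
    coef = np.polyfit(x, y, deg=1)
    resid = y - np.polyval(coef, x)

    # Stage 2: smooth the residuals to recover the nonlinear component
    spline = UnivariateSpline(x, resid, s=10)
    y_hat = np.polyval(coef, x) + spline(x)
    print(f"RMSE: {np.sqrt(np.mean((y - y_hat) ** 2)):.3f}")  # near noise level 0.2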
A new nonlinear regression model, Gaussian process BART (Bayesian additive regression trees), was proposed in the third part. Combining the advantages of both BART and the Gaussian process, the model can capture the nonlinear effects of both observed and latent covariates. To develop the model, the traditional BART was first generalized to accommodate correlated errors. Then the failure of likelihood-based Markov chain Monte Carlo (MCMC) in parameter estimation was discussed. Based on the idea of analysis of variation, two remedies, back comparing and tuning range, were proposed to tackle this failure. Finally, the effectiveness of the new model was examined through experiments on both simulated and real data.
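The decomposition behind such a model can be illustrated in two stages: a tree ensemble captures nonlinear covariate effects, and a Gaussian process captures the spatially correlated residual. Gradient boosting stands in for BART below, and the fit is sequential rather than the joint MCMC of the thesis; all data are synthetic.

    # Illustrative decomposition: trees for nonlinear covariate effects, a GP
    # for spatially correlated residuals.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(6)
    n = 300
    X = rng.normal(size=(n, 2))
    coords = rng.uniform(0, 10, size=(n, 2))

    # latent spatial field drawn from a GP with an RBF kernel
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    K = np.exp(-0.5 * (d / 1.5) ** 2)
    spatial = rng.multivariate_normal(np.zeros(n), K + 1e-6 * np.eye(n))
    y = np.sin(X[:, 0]) * X[:, 1] + spatial + rng.normal(0, 0.1, n)

    trees = GradientBoostingRegressor(random_state=0).fit(X, y)
    resid = y - trees.predict(X)

    # GP smoothing of residuals at the training locations (noise variance 0.1^2)
    alpha = np.linalg.solve(K + 0.01 * np.eye(n), resid)
    gp_part = K @ alpha
    print(f"residual RMSE after GP: {np.sqrt(np.mean((resid - gp_part) ** 2)):.3f}")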