A measure of shape compactness is a numerical quantity representing the degree to which a shape is compact. Accurate compactness measures have received considerable attention because of their application in a broad range of GIS problems, such as detecting clustering patterns from remote-sensing images, understanding urban sprawl, and redrawing electoral districts to avoid gerrymandering. In this article, we propose an effective and efficient approach to computing shape compactness based on the moment of inertia (MI), a well-known concept in physics. The mathematical framework and the computer implementation for both raster and vector models are discussed in detail. In addition to computing compactness for a single shape, we propose a computational method that calculates the variation in compactness as a shape grows or shrinks, a typical requirement in regionalization problems. We conducted a number of experiments that demonstrate the superiority of the MI over the popular isoperimetric quotient approach in terms of (1) computational efficiency; (2) tolerance of positional uncertainty and irregular boundaries; (3) ability to handle shapes with holes and multiple parts; and (4) applicability and efficacy in districting/zonation/regionalization problems.
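To make the comparison between the two measures concrete, the following is a minimal sketch for the vector (polygon) case, not the implementation described in the article. It assumes a simple polygon without holes, and it assumes the normalization A^2 / (2 * pi * I), where I is the moment of inertia about the centroid, so that a circle scores 1. The function names (`polygon_moments`, `mi_compactness`, `isoperimetric_quotient`) are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_moments(ring: List[Point]) -> Tuple[float, float, float, float]:
    """Return (area, cx, cy, moment of inertia about the centroid).

    Assumes `ring` is a simple polygon given as an open list of vertices
    (the first vertex is not repeated at the end), in either orientation.
    """
    a2 = sx = sy = ixx = iyy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        a2 += cross
        sx += (x0 + x1) * cross
        sy += (y0 + y1) * cross
        ixx += (y0 * y0 + y0 * y1 + y1 * y1) * cross
        iyy += (x0 * x0 + x0 * x1 + x1 * x1) * cross
    area = a2 / 2.0                      # signed area
    cx, cy = sx / (6.0 * area), sy / (6.0 * area)
    j_origin = (ixx + iyy) / 12.0        # signed polar moment about the origin
    # Parallel-axis theorem: shift the moment from the origin to the centroid.
    j_centroid = j_origin - area * (cx * cx + cy * cy)
    return abs(area), cx, cy, abs(j_centroid)


def mi_compactness(ring: List[Point]) -> float:
    """Assumed normalization: A^2 / (2*pi*I); equals 1 for a circle."""
    area, _, _, j = polygon_moments(ring)
    return area * area / (2.0 * math.pi * j)


def isoperimetric_quotient(ring: List[Point]) -> float:
    """4*pi*A / P^2; equals 1 for a circle."""
    area = polygon_moments(ring)[0]
    n = len(ring)
    perimeter = sum(math.dist(ring[i], ring[(i + 1) % n]) for i in range(n))
    return 4.0 * math.pi * area / (perimeter * perimeter)


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
strip = [(0.0, 0.0), (4.0, 0.0), (4.0, 1.0), (0.0, 1.0)]
for name, ring in (("square", square), ("strip", strip)):
    print(name, mi_compactness(ring), isoperimetric_quotient(ring))
```

Because the moment of inertia is an area integral, the contributions of separate parts can be summed and those of holes subtracted, whereas the isoperimetric quotient depends on perimeter length, which is sensitive to boundary detail. The sketch does not implement that extension.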
In this dissertation, I focus on designing efficient data systems and data indexing mechanisms to bolster scalable and interactive analytics on large-scale geospatial data. I first propose GeoSpark, a cluster computing system that extends the core engine of Apache Spark and Spark SQL to support spatial data types, indexes, and geometrical operations at scale. To reduce the indexing overhead, I propose Hippo, a fast yet scalable sparse database indexing approach. In contrast to existing tree index structures, Hippo stores disk page ranges (each acting as a pointer to one or more pages) instead of tuple pointers in the indexed table, which reduces the storage space occupied by the index. Moreover, I present Tabula, a middleware framework that sits between a SQL data system and a spatial visualization dashboard to make the user experience with the dashboard more seamless and interactive. Tabula adopts a materialized sampling cube approach, which pre-materializes samples, not for the entire table as in the SampleFirst approach, but for the results of potentially unforeseen queries (each represented by an OLAP cube cell).
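The idea of indexing page ranges rather than individual tuples can be illustrated with a small sketch. This is not Hippo's actual design: the abstract does not say what summary is kept for each page range, so the sketch substitutes a plain min/max summary per range as an assumption. All names (`RangeEntry`, `paginate`, `build_index`, `range_query`) and the tiny page size are hypothetical.

```python
from dataclasses import dataclass
from typing import List

PAGE_SIZE = 4  # tuples per page; hypothetical and deliberately tiny


@dataclass
class RangeEntry:
    first_page: int  # first page covered by this entry
    last_page: int   # last page covered (inclusive)
    lo: int          # smallest key stored on those pages (assumed summary)
    hi: int          # largest key stored on those pages (assumed summary)


def paginate(keys: List[int], page_size: int = PAGE_SIZE) -> List[List[int]]:
    """Split a table's key column into fixed-size pages."""
    return [keys[i:i + page_size] for i in range(0, len(keys), page_size)]


def build_index(pages: List[List[int]], pages_per_entry: int = 2) -> List[RangeEntry]:
    """Create one index entry per run of pages instead of one per tuple."""
    entries = []
    for start in range(0, len(pages), pages_per_entry):
        chunk = pages[start:start + pages_per_entry]
        values = [k for page in chunk for k in page]
        entries.append(RangeEntry(start, start + len(chunk) - 1, min(values), max(values)))
    return entries


def range_query(pages: List[List[int]], index: List[RangeEntry], lo: int, hi: int) -> List[int]:
    """Return keys in [lo, hi], inspecting only page ranges that may match."""
    hits = []
    for entry in index:
        if entry.hi < lo or entry.lo > hi:
            continue  # the whole page range is skipped without reading it
        for page_no in range(entry.first_page, entry.last_page + 1):
            hits.extend(k for k in pages[page_no] if lo <= k <= hi)
    return hits


table = [3, 1, 4, 2, 8, 6, 7, 5, 20, 22, 21, 23, 30, 31, 29, 28]
pages = paginate(table)
index = build_index(pages)
print(len(table), "tuples,", len(index), "index entries")
print(sorted(range_query(pages, index, 5, 21)))
```

The index holds one entry per page range, so its size grows with the number of ranges rather than the number of tuples. The price is that every page in a candidate range must be scanned and filtered at query time.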
model spatially non-stationary relationships. Classic GWR is considered a single-scale model because it is based on one bandwidth parameter, which controls the amount of distance-decay in weighting neighboring data around each location. The single bandwidth in GWR assumes that all processes (relationships between the response variable and the predictor variables) operate at the same scale. However, this poses a limitation in modeling the potentially multi-scale processes that are more often seen in the real world. For example, the measured ambient temperature of a location is affected by the built environment, regional weather and global warming, all of which operate at different scales. A recent advancement to GWR termed Multiscale GWR (MGWR) removes the single-bandwidth assumption and allows the bandwidth to vary for each covariate. As a result, each parameter surface can have a different degree of spatial variation, reflecting variation across covariate-specific processes. In this way, MGWR can differentiate local, regional and global processes by using varying bandwidths for covariates. Additionally, bandwidths in MGWR become explicit indicators of the scale at which various processes operate. The proposed dissertation covers three perspectives centering on MGWR: computation, inference, and application. The first component focuses on addressing computational issues in MGWR to allow MGWR models to be calibrated more efficiently and to be applied to large datasets. The second component aims to statistically differentiate the spatial scales at which different processes operate by quantifying the uncertainty associated with each bandwidth obtained from MGWR. In the third component, an empirical study will be conducted to model the changing relationships between county-level socio-economic factors and voter preferences in the 2008-2016 United States presidential elections using MGWR.
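The role of the bandwidth can be illustrated with a minimal single-bandwidth sketch: a local weighted least-squares fit of one predictor plus an intercept at each location. It assumes a Gaussian distance-decay kernel and a fixed (non-adaptive) bandwidth, neither of which is specified in the abstract, and it does not implement MGWR, which calibrates a separate bandwidth for each covariate. The function names (`gaussian_weight`, `local_fit`, `gwr`) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float]


def gaussian_weight(distance: float, bandwidth: float) -> float:
    """Assumed kernel: weights decay smoothly with distance; a larger
    bandwidth means slower decay, i.e. a broader spatial scale."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def local_fit(coords: Sequence[Coord], x: Sequence[float], y: Sequence[float],
              site: Coord, bandwidth: float) -> Tuple[float, float]:
    """Weighted least squares for y = b0 + b1 * x, centered on `site`.

    Returns (intercept, slope) estimated from observations weighted by
    their distance to `site`.
    """
    w = [gaussian_weight(math.dist(c, site), bandwidth) for c in coords]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def gwr(coords: Sequence[Coord], x: Sequence[float], y: Sequence[float],
        bandwidth: float) -> List[Tuple[float, float]]:
    """One local (intercept, slope) pair per observation location."""
    return [local_fit(coords, x, y, site, bandwidth) for site in coords]


# Synthetic transect: the true slope is 1 on the left half and 3 on the right.
coords = [(float(i), 0.0) for i in range(100)]
x = [float(i % 7 + 1) for i in range(100)]
y = [(1.0 if i < 50 else 3.0) * xi for i, xi in enumerate(x)]

local = gwr(coords, x, y, bandwidth=5.0)     # small bandwidth: local estimates
broad = gwr(coords, x, y, bandwidth=1e6)     # huge bandwidth: near-global fit
print(local[0][1], local[99][1], broad[0][1])
```

With a small bandwidth the estimated slope follows the local relationship at each end of the transect. With a very large bandwidth every location receives nearly the same estimate, which is the sense in which a bandwidth indicates the scale of a process.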
Meanwhile, the emerging cyberinfrastructure rapidly increases our capacity for handling such massive data with regard to data collection and management, data integration and interoperability, data transmission and visualization, high-performance computing, etc. Cyberinfrastructure (CI) consists of computing systems, data storage systems, advanced instruments and data repositories, visualization environments, and people, all linked together by software and high-performance networks to improve research productivity and enable breakthroughs that are not otherwise possible.
The Geospatial CI (GCI, or CyberGIS), as the synthesis of CI and GIScience, has inherent advantages in enabling computationally intensive spatial analysis and modeling (SAM) and collaborative geospatial problem solving and decision making.
This dissertation is dedicated to addressing several critical issues and improving the performance of existing methodologies and systems in the field of CyberGIS. It comprises three parts. The first part develops methodologies that help public researchers efficiently and effectively find appropriate open geospatial datasets among millions of records provided by thousands of organizations scattered around the world. Machine learning and semantic search methods will be utilized in this research. The second part develops an interoperable and replicable geoprocessing service by synthesizing the high-performance computing (HPC) environment, the core spatial statistics and analysis algorithms from the widely adopted open-source Python package, the Python Spatial Analysis Library (PySAL), and the rich datasets acquired in the first part. The third part studies optimization strategies for feature data transmission and visualization. It is intended to solve the performance issues that arise when large feature datasets are transmitted over the Internet and visualized on the client (browser) side.
Taken together, the three parts constitute an endeavor towards the methodological improvement and implementation practice of the data-driven, high-performance and intelligent CI to advance spatial sciences.