The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents

Description

In the digital humanities, there is a constant need to turn images and PDF files into plain text to apply analyses such as topic modelling, named entity recognition, and other techniques. However, although there exist different solutions to extract text embedded in PDF files or run OCR on images, they typically require additional training (for example, scholars have to learn how to use the command line) or are difficult to automate without programming skills. The Giles Ecosystem is a distributed system based on Apache Kafka that allows users to upload documents for text and image extraction. The system components are implemented using Java and the Spring Framework and are available under an Open Source license on GitHub (https://github.com/diging/).

Contributors: Lessios-Damerow, Julia (Contributor) / Peirson, Erick (Contributor) / Laubichler, Manfred (Contributor) / ASU-SFI Center for Biosocial Complex Systems (Contributor)

Created: 2017-09-28

Challenges and Opportunities in Coding the Commons: Problems, Procedures, and Potential Solutions in Large-N Comparative Case Studies

Description

Ongoing efforts to understand the dynamics of coupled social-ecological (or more broadly, coupled infrastructure) systems and common pool resources have led to the generation of numerous datasets based on a large number of case studies. These data have facilitated the identification of important factors and fundamental principles which increase our understanding of such complex systems. However, the data at our disposal are often not easily comparable, have limited scope and scale, and are based on disparate underlying frameworks, inhibiting synthesis, meta-analysis, and the validation of findings. Research efforts are further hampered when case inclusion criteria, variable definitions, coding schema, and inter-coder reliability testing are not made explicit in the presentation of research and shared among the research community. This paper first outlines challenges experienced by researchers engaged in a large-scale coding project; then highlights valuable lessons learned; and finally discusses opportunities for further research on comparative case study analysis focusing on social-ecological systems and common pool resources. Includes supplemental materials and appendices published in the International Journal of the Commons 2016 Special Issue. Volume 10 - Issue 2 - 2016.

Contributors: Ratajczyk, Elicia (Author) / Brady, Ute (Author) / Baggio, Jacopo (Author) / Barnett, Allain J. (Author) / Perez Ibarra, Irene (Author) / Rollins, Nathan (Author) / Rubinos, Cathy (Author) / Shin, Hoon Cheol (Author) / Yu, David (Author) / Aggarwal, Rimjhim (Author) / Anderies, John (Author) / Janssen, Marco (Author) / ASU-SFI Center for Biosocial Complex Systems (Contributor)

Created: 2016-09-09

The Tragedy of the Unexamined Cat: Why K-12 and University Education Are Still in the Dark Ages and How Citizen Science Allows for a Renaissance

Description

At the end of the dark ages, anatomy was taught as though everything that could be known was known. Scholars learned about what had been discovered rather than how to make discoveries. This was true even though the body (and the rest of biology) was very poorly understood. The renaissance eventually brought a revolution in how scholars (and graduate students) were trained and worked. This revolution never occurred in K-12 or university education, such that we now teach young students in much the way that scholars were taught in the dark ages: we teach them what is already known rather than the process of knowing. Citizen science offers a way to change K-12 and university education and, in doing so, complete the renaissance. Here we offer an example of such an approach and call for change in the way students are taught science, change that is more possible than it has ever been and is, nonetheless, five hundred years delayed.

Contributors: Dunn, Robert R. (Author) / Urban, Julie (Author) / Cavalier, Darlene (Author) / Cooper, Caren B. (Author) / Consortium for Science, Policy and Outcomes (Contributor) / ASU-SFI Center for Biosocial Complex Systems (Contributor)

Created: 2016-03-01

A Microbial Survey of the International Space Station (ISS)

Description

Background: Modern advances in sequencing technology have enabled the census of microbial members of many natural ecosystems. Recently, attention is increasingly being paid to the microbial residents of human-made, built ecosystems, both private (homes) and public (subways, office buildings, and hospitals). Here, we report results of the characterization of the microbial ecology of a singular built environment, the International Space Station (ISS). This ISS sampling involved the collection and microbial analysis (via 16S rRNA gene PCR) of 15 surfaces sampled by swabs onboard the ISS. This sampling was a component of Project MERCCURI (Microbial Ecology Research Combining Citizen and University Researchers on ISS). Learning more about the microbial inhabitants of the “buildings” in which we travel through space will take on increasing importance, as plans for human exploration continue, with the possibility of colonization of other planets and moons.

Results: Sterile swabs were used to sample 15 surfaces onboard the ISS. The sites sampled were designed to be analogous to samples collected for (1) the Wildlife of Our Homes project and (2) a study of cell phones and shoes that were concurrently being collected for another component of Project MERCCURI. Sequencing of the 16S rRNA genes amplified from DNA extracted from each swab was used to produce a census of the microbes present on each surface sampled. We compared the microbes found on the ISS swabs to those from both homes on Earth and data from the Human Microbiome Project.

Conclusions: While significantly different from homes on Earth and the Human Microbiome Project samples analyzed here, the microbial community composition on the ISS was more similar to home surfaces than to the human microbiome samples. The ISS surfaces are OTU-rich, with 1,036–4,294 operational taxonomic units (OTUs) per sample. There was no discernible biogeography of microbes on the 15 ISS surfaces, although this may be a reflection of the small sample size we were able to obtain.

Contributors: Lang, Jenna M. (Author) / Coil, David A. (Author) / Neches, Russell Y. (Author) / Brown, Wendy E. (Author) / Cavalier, Darlene (Author) / Severance, Mark (Author) / Hampton-Marcell, Jarrad T. (Author) / Gilbert, Jack A. (Author) / Eisen, Jonathan A. (Author) / ASU-SFI Center for Biosocial Complex Systems (Contributor)

Created: 2017-12-05

A Multidisciplinary Biospecimen Bank of Renal Cell Carcinomas Compatible With Discovery Platforms at Mayo Clinic, Scottsdale, Arizona

Description

To address the need to study frozen clinical specimens using next-generation RNA, DNA, chromatin immunoprecipitation (ChIP) sequencing and protein analyses, we developed a biobank work flow to prospectively collect biospecimens from patients with renal cell carcinoma (RCC). We describe our standard operating procedures and work flow to annotate pathologic results and clinical outcomes. We report quality control outcomes and nucleic acid yields of our RCC submissions (N=16) to The Cancer Genome Atlas (TCGA) project, as well as newer discovery platforms, by describing mass spectrometry analysis of albumin oxidation in plasma and 6 ChIP sequencing libraries generated from nephrectomy specimens after histone H3 lysine 36 trimethylation (H3K36me3) immunoprecipitation. From June 1, 2010, through January 1, 2013, we enrolled 328 patients with RCC. Our mean (SD) TCGA RNA integrity numbers (RINs) were 8.1 (0.8) for papillary RCC, with a 12.5% overall rate of sample disqualification for RIN <7. Banked plasma had significantly less albumin oxidation (by mass spectrometry analysis) than plasma kept at 25°C (P<.001). For ChIP sequencing, the FastQC score for average read quality was at least 30 for 91% to 95% of paired-end reads. In parallel, we analyzed frozen tissue by RNA sequencing; after genome alignment, only 0.2% to 0.4% of total reads failed the default quality check steps of Bowtie2, which was comparable to the disqualification ratio (0.1%) of the 786-O RCC cell line that was prepared under optimal RNA isolation conditions. The overall correlation coefficients for gene expression between Mayo Clinic vs TCGA tissues ranged from 0.75 to 0.82. These data support the generation of high-quality nucleic acids for genomic analyses from banked RCC. Importantly, the protocol does not interfere with routine clinical care. Collections over defined time points during disease treatment further enhance collaborative efforts to integrate genomic information with outcomes.

Contributors: Ho, Thai H. (Author) / Nunez Nateras, Rafael (Author) / Yan, Huihuang (Author) / Park, Jin (Author) / Jensen, Sally (Author) / Borges, Chad (Author) / Lee, Jeong Heon (Author) / Champion, Mia D. (Author) / Tibes, Raoul (Author) / Bryce, Alan H. (Author) / Carballido, Estrella M. (Author) / Todd, Mark A. (Author) / Joseph, Richard W. (Author) / Wong, William W. (Author) / Parker, Alexander S. (Author) / Stanton, Melissa L. (Author) / Castle, Erik P. (Author) / Biodesign Institute (Contributor)

Created: 2015-07-16

Parallel Workflow for High-Throughput (>1,000 Samples/Day) Quantitative Analysis of Human Insulin-Like Growth Factor 1 Using Mass Spectrometric Immunoassay

Description

Insulin-like growth factor 1 (IGF1) is an important biomarker for the management of growth hormone disorders. Recently there has been rising interest in deploying mass spectrometric (MS) methods of detection for measuring IGF1. However, widespread clinical adoption of any MS-based IGF1 assay will require increased throughput and speed to justify the costs of analyses, and robust industrial platforms that are reproducible across laboratories. Presented here is an MS-based quantitative IGF1 assay with performance rating of >1,000 samples/day, and a capability of quantifying IGF1 point mutations and posttranslational modifications. The throughput of the IGF1 mass spectrometric immunoassay (MSIA) benefited from a simplified sample preparation step, IGF1 immunocapture in a tip format, and high-throughput MALDI-TOF MS analysis. The Limit of Detection and Limit of Quantification of the resulting assay were 1.5 μg/L and 5 μg/L, respectively, with intra- and inter-assay precision CVs of less than 10%, and good linearity and recovery characteristics. The IGF1 MSIA was benchmarked against commercially available IGF1 ELISA via Bland-Altman method comparison test, resulting in a slight positive bias of 16%. The IGF1 MSIA was employed in an optimized parallel workflow utilizing two pipetting robots and MALDI-TOF-MS instruments synced into one-hour phases of sample preparation, extraction and MSIA pipette tip elution, MS data collection, and data processing. Using this workflow, high-throughput IGF1 quantification of 1,054 human samples was achieved in approximately 9 hours. This rate of assaying is a significant improvement over existing MS-based IGF1 assays, and is on par with that of the enzyme-based immunoassays. Furthermore, a mutation was detected in ∼1% of the samples (SNP: rs17884626, creating an A→T substitution at position 67 of the IGF1), demonstrating the capability of IGF1 MSIA to detect point mutations and posttranslational modifications.
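The throughput figures quoted in this abstract can be checked with simple arithmetic. The sketch below is only an illustration, not part of the published workflow: the function name `throughput` is hypothetical, and it assumes the "approximately 9 hours" figure is exactly 9.0 hours and that the run rate could be sustained over a full 24-hour day.

```python
def throughput(samples: int, hours: float) -> tuple[float, float]:
    """Return (samples per hour, samples per 24-hour day) for a single run."""
    per_hour = samples / hours
    per_day = per_hour * 24  # assumption: the rate is sustained for 24 hours
    return per_hour, per_day


# Figures taken from the abstract: 1,054 samples in approximately 9 hours.
per_hour, per_day = throughput(1054, 9.0)
print(f"{per_hour:.1f} samples/hour, {per_day:.0f} samples per 24 h")
```

Under these assumptions, a single 9-hour run already exceeds the >1,000 samples/day rating stated in the title, so the claim does not depend on round-the-clock operation.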

Contributors: Oran, Paul (Author) / Trenchevska, Olgica (Author) / Nedelkov, Dobrin (Author) / Borges, Chad (Author) / Schaab, Matthew (Author) / Rehder, Douglas (Author) / Jarvis, Jason (Author) / Sherma, Nisha (Author) / Shen, Luhui (Author) / Krastins, Bryan (Author) / Lopez, Mary F. (Author) / Schwenke, Dawn (Author) / Reaven, Peter D. (Author) / Nelson, Randall (Author) / Biodesign Institute (Contributor)

Created: 2014-03-24

The Role of Diverse Strategies in Sustainable Knowledge Production

Description

Online communities are becoming increasingly important as platforms for large-scale human cooperation. These communities allow users seeking and sharing professional skills to solve problems collaboratively. To investigate how users cooperate to complete a large number of knowledge-producing tasks, we analyze Stack Exchange, one of the largest question and answer systems in the world. We construct attention networks to model the growth of 110 communities in the Stack Exchange system and quantify individual answering strategies using the linking dynamics on attention networks. We identify two answering strategies. Strategy A aims at performing maintenance by doing simple tasks, whereas strategy B aims at investing time in doing challenging tasks. Both strategies are important: empirical evidence shows that strategy A decreases the median waiting time for answers and strategy B increases the acceptance rate of answers. In investigating the strategic persistence of users, we find that users tend to stick to the same strategy over time in a community, but switch from one strategy to the other across communities. This finding reveals the different sets of knowledge and skills between users. A balance between the populations of users taking strategies A and B that approximates 2:1 is found to be optimal for the sustainable growth of communities.

Contributors: Wu, Lingfei (Author) / Baggio, Jacopo (Author) / Janssen, Marco (Author) / ASU-SFI Center for Biosocial Complex Systems (Contributor)

Created: 2016-03-02

Serum Amyloid A Truncations in Type 2 Diabetes Mellitus

Description

Serum Amyloid A (SAA) is an acute phase protein complex consisting of several abundant isoforms. The N-terminus of SAA is critical to its function in amyloid formation. SAA is frequently truncated, either missing an arginine or an arginine-serine dipeptide, resulting in isoforms that may influence the capacity to form amyloid. However, the relative abundance of truncated SAA in diabetes and chronic kidney disease is not known.

Methods: Using mass spectrometric immunoassay, the abundance of SAA truncations relative to the native variants was examined in plasma of 91 participants with type 2 diabetes and chronic kidney disease and 69 participants without diabetes.

Results: The ratio of SAA 1.1 (missing N-terminal arginine) to native SAA 1.1 was lower in diabetics compared to non-diabetics (p = 0.004), and in males compared to females (p<0.001). This ratio was negatively correlated with glycated hemoglobin (r = −0.32, p<0.001) and triglyceride concentrations (r = −0.37, p<0.001), and positively correlated with HDL cholesterol concentrations (r = 0.32, p<0.001).

Conclusion: The relative abundance of the N-terminal arginine truncation of SAA1.1 is significantly decreased in diabetes and negatively correlates with measures of glycemic and lipid control.

Contributors: Yassine, Hussein N. (Author) / Trenchevska, Olgica (Author) / He, Huijuan (Author) / Borges, Chad (Author) / Nedelkov, Dobrin (Author) / Mack, Wendy (Author) / Kono, Naoko (Author) / Koska, Juraj (Author) / Reaven, Peter D. (Author) / Nelson, Randall (Author) / Biodesign Institute (Contributor)

Created: 2015-01-21

Next-Generation Sequencing Methylation Profiling of Subjects With Obesity Identifies Novel Gene Changes

Description

Background: Obesity is a metabolic disease caused by environmental and genetic factors. However, the epigenetic mechanisms of obesity are incompletely understood. The aim of our study was to investigate the role of skeletal muscle DNA methylation in combination with transcriptomic changes in obesity.

Results: Muscle biopsies were obtained basally from lean (n = 12; BMI = 23.4 ± 0.7 kg/m²) and obese (n = 10; BMI = 32.9 ± 0.7 kg/m²) participants in combination with euglycemic-hyperinsulinemic clamps to assess insulin sensitivity. We performed reduced representation bisulfite sequencing (RRBS) next-generation methylation and microarray analyses on DNA and RNA isolated from vastus lateralis muscle biopsies. There were 13,130 differentially methylated cytosines (DMC; uncorrected P < 0.05) that were altered in the promoter and untranslated (5' and 3'UTR) regions in the obese versus lean analysis. Microarray analysis revealed 99 probes that were significantly (corrected P < 0.05) altered. Of these, 12 genes (encompassing 22 methylation sites) demonstrated a negative relationship between gene expression and DNA methylation. Specifically, sorbin and SH3 domain containing 3 (SORBS3), which codes for the adapter protein vinexin, was significantly decreased in gene expression (fold change −1.9) and had nine DMCs that were significantly increased in methylation in obesity (methylation differences ranged from 5.0 to 24.4%). Moreover, differentially methylated region (DMR) analysis identified a region in the 5'UTR (Chr.8:22,423,530–22,423,569) of SORBS3 that was increased in methylation by 11.2% in the obese group. The negative relationship observed between DNA methylation and gene expression for SORBS3 was validated by a site-specific sequencing approach, pyrosequencing, and qRT-PCR. Additionally, we performed transcription factor binding analysis and identified a number of transcription factors whose binding to the differentially methylated sites or region may contribute to obesity.

Conclusions: These results demonstrate that obesity alters the epigenome through DNA methylation and highlight novel transcriptomic changes in SORBS3 in skeletal muscle.

Contributors: Day, Samantha (Author) / Coletta, Rich (Author) / Kim, Joon Young (Author) / Campbell, Latoya (Author) / Benjamin, Tonya R. (Author) / Roust, Lori R. (Author) / De Filippis, Elena A. (Author) / Dinu, Valentin (Author) / Shaibi, Gabriel (Author) / Mandarino, Lawrence J. (Author) / Coletta, Dawn (Author) / College of Liberal Arts and Sciences (Contributor)

Created: 2016-07-18

Possibilities and Pitfalls in Quantifying the Extent of Cysteine Sulfenic Acid Modification of Specific Proteins Within Complex Biofluids

Description

Background: Cysteine sulfenic acid (Cys-SOH) plays important roles in the redox regulation of numerous proteins. As a relatively unstable posttranslational protein modification, it is difficult to quantify the degree to which any particular protein is modified by Cys-SOH within a complex biological environment. The goal of these studies was to move a step beyond detection and into the relative quantification of Cys-SOH within specific proteins found in a complex biological setting, namely human plasma.

Results: This report describes the possibilities and limitations of performing such analyses based on the use of thionitrobenzoic acid and dimedone-based probes, which are commonly employed to trap Cys-SOH. Results obtained by electrospray ionization-based mass spectrometric immunoassay reveal the optimal type of probe for such analyses as well as the reproducible relative quantification of Cys-SOH within albumin and transthyretin extracted from human plasma, the latter being a protein previously unknown to be modified by Cys-SOH.

Conclusions: The relative quantification of Cys-SOH within specific proteins in a complex biological setting can be accomplished, but several analytical precautions related to trapping, detecting, and quantifying Cys-SOH must be taken into account prior to pursuing its study in such matrices.

Contributors: Rehder, Douglas (Author) / Borges, Chad (Author) / Biodesign Institute (Contributor)

Created: 2010-07-01

ASU Scholarship Showcase
