The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents

Description

In the digital humanities, there is a constant need to turn images and PDF files into plain text so that analyses such as topic modelling and named entity recognition can be applied. Although several solutions exist for extracting text embedded in PDF files or running OCR on images, they typically require additional training (for example, scholars have to learn how to use the command line) or are difficult to automate without programming skills. The Giles Ecosystem is a distributed system based on Apache Kafka that allows users to upload documents for text and image extraction. The system components are implemented in Java with the Spring Framework and are available under an open-source license on GitHub (https://github.com/diging/).

Contributors: Lessios-Damerow, Julia (Contributor) / Peirson, Erick (Contributor) / Laubichler, Manfred (Contributor) / ASU-SFI Center for Biosocial Complex Systems (Contributor)

Created: 2017-09-28

Data management planning before supercomputing and A collaborative approach in research data sharing at ASU

Description

A two-part presentation from the ASU Library and Knowledge Enterprise Research Data Management Office. Presented at the 2023 Rocky Mountain Advanced Computing Consortium (RMACC).

Session 1: Data management planning is an integral step in the research data life cycle, and the large amounts of data and lengthy code that accompany supercomputing runs are no exception. Planning before analysis benefits both the research and the researcher by providing a clear strategy for collecting, storing, analyzing, and sharing the data at the end of the research cycle. Supercomputing can require significant storage beyond scratch space, but researchers often need guidance on which tools are appropriate and available. Framed within the planning phase of the life cycle, this presentation introduces ASU’s Storage Selector, a quick and easy tool for finding the most appropriate storage resources provided by the university, which helps researchers choose a suitable storage and management solution for their research data at the right time in their project. We will also explore the DMP Tool, developed by the California Digital Library, which provides a resource-rich platform for writing data management plans, including institution-specific guidance, feedback requests, and public plans that can be used as guides.

Session 2: This presentation gives an overview of the ongoing working relationship between the ASU Library Open Science and Scholarly Communication division, the Research Data Management Office, and Research Computing. We will explore these teams’ interdisciplinary relationships and interdependence as the institution increasingly supports open science practices and initiatives. We will include case studies covering the decision-making process, data-sharing decisions, and the opportunities and challenges of transferring research data from a high-performance computing environment to the ASU Research Data Repository. Finally, we will share lessons learned as we intentionally shepherd research data from active project management and storage to final publication and preservation.

Contributors: Harp, Matthew (Author) / Claypool, Kathryn (Author)

Created: 2023-05-17