Description

In the digital humanities, there is a constant need to turn images and PDF files into plain text so that analyses such as topic modelling, named entity recognition, and other techniques can be applied. Although various tools exist to extract text embedded in PDF files or to run OCR on images, they typically require additional training (for example, scholars have to learn how to use the command line) or are difficult to automate without programming skills.

    Details

    • Date Created: 2017-09-28
    • Resource Type: Text
    • Digital object identifier: 10.5334/jors.164
    • International standard serial number: 2049-9647

    Citation and reuse

    Cite this item

    This is a suggested citation. Consult the appropriate style guide for specific citation guidelines.

    Damerow, J., Peirson, B. R., & Laubichler, M. D. (2017). The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents. Journal of Open Research Software, 5. doi:10.5334/jors.164
