Mining Art History: Bulk Converting Nonstandard PDFs to Text to Determine the Frequency of Citations and Key Terms in Humanities Articles
Affiliation: Stockholm University, SE
Chapter from the book: Petersson, S. 2021. Digital Human Sciences: New Objects – New Approaches.
Text mining in art history scholarship can tell us about the discipline itself, as well as artistic concerns at any given moment. The aim of this study is to develop and test a strategy for text mining from PDFs of journal articles that have nonstandard formatting and/or use notes rather than full bibliographies for references. While articles in the natural and social sciences typically adhere to standard formats, art history journals employ a variety of formatting styles that make bulk capture of citation and other textual data from the articles challenging. This study outlines a method by which researchers can extract data from journal articles, using a sample set from art history. Once extracted, the data from PDFs can be used to compare frequently used terms across samples and to determine which scholars are most cited in either the bibliographies or the main body text of articles. If the structure and layout of individual journals are carefully considered and the data are properly cleaned, citations and key terms can give a clear picture of the disciplinary influences and dependencies of the scholarship.
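As a rough illustration of the counting step described above, the following Python sketch tallies term frequencies and mentions of scholars' surnames in text that has already been converted from PDF. It is not the chapter's own implementation: the function names, the stopword list, the surnames, and the sample string are all hypothetical, and the PDF-to-text conversion and cleaning are assumed to have happened beforehand.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(text, stopwords=frozenset()):
    """Count how often each non-stopword token occurs in the text."""
    return Counter(t for t in tokenize(text) if t not in stopwords)


def name_mentions(text, surnames):
    """Count whole-word, case-sensitive mentions of each surname."""
    return {
        name: len(re.findall(r"\b" + re.escape(name) + r"\b", text))
        for name in surnames
    }


# Hypothetical sample: body text followed by note-style references,
# standing in for text already extracted from one article PDF.
sample = (
    "Smith argues that iconography matters. "
    "1. A. Smith, On Iconography. "
    "2. B. Jones, Images. "
    "3. Smith, On Iconography, 12."
)

# Hypothetical stopword list; a real study would use a fuller one.
STOPWORDS = frozenset({"a", "b", "on", "that"})

freqs = term_frequencies(sample, STOPWORDS)
mentions = name_mentions(sample, ["Smith", "Jones"])

print(freqs.most_common(5))
print(mentions)
```

Matching surnames as whole words is a deliberately simple choice. It counts mentions in notes and body text alike, but it cannot tell apart two scholars who share a surname, so in practice the results would still need manual checking.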