Working with very large corpora: Building your worksets in the HathiTrust

Series

Digital Humanities at Oxford Summer School

Kevin Page, Iain Emsley and David Weigl talk about using the HathiTrust Digital Library to conduct research in this interstice workshop.

Within the Andrew W. Mellon funded ‘Workset Creation for Scholarly Analysis (WCSA)’ project, the University of Oxford e-Research Centre have developed new tools and approaches to facilitate study of the HathiTrust Digital Library. This workshop will inform participants of the latest developments from the project, and provide attendees with the opportunity to work with project researchers to explore how they might undertake their own investigations.

The HathiTrust is best described as “a partnership of major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future”. Its Digital Library comprises the digitized representations of 14.7 million volumes, 7.44 million book titles, 405,345 serial titles, and 5.2 billion pages. For many scholars the size of the HT corpus is both attractive and daunting.

The first half of this workshop introduces the concept of ‘worksets’, showing how they can be used to effectively investigate large corpora such as the HathiTrust, and demonstrating digital methods to refine and interrogate the data within them. These will be illustrated through existing worksets, including examples focussed on early English printed texts.

In the second, interactive, half of the workshop, attendees will work with project researchers to ‘paper prototype’ potential worksets relating to their own fields of study. Participants will be apprised of existing methods by which they can create HathiTrust worksets for their context; discovery of new motivations and strategies for workset creation is welcomed and will inform the next generation of HathiTrust workset tooling.