Automagical Repository Harvesting

Over the last couple of years, FSU Libraries has dedicated librarians and staff to the in-house development of an institutional repository platform that is open-source, flexible, and modular. I was recently hired as the full-time repository specialist for the Office of Digital Research and Scholarship, and I quickly realized the strategic importance of the institutional repository concept: its purposes, benefits, and potential future impact intersect with the key issues surrounding libraries, technology, scholarly communications, and digital scholarship today.

One of my early tasks focused on automating metadata harvesting from other repositories. Figuring out a time- and cost-efficient way to track and deposit new publications is a key challenge in the field of scholarly communication today. Aside from the time it takes per scholarly object, manual tracking and deposit lends itself to human error and, as a result for researchers, decreased discoverability, accessibility, and validity of scholarship, which can be in tension with the overall goals and purposes of an institutional repository. Publicly accessible APIs provided by public repositories offer the chance to eliminate or greatly reduce both the time it takes to process a deposit and the risk that bibliographic information will be inaccurately transferred from one system to another.

In response to this challenge, I have developed two tools to increase the efficiency of repository ingest. PMC Grabber is a PHP-based tool that uses PubMed Central’s APIs to programmatically search the PubMed Central database, pull metadata from the database, and transform the metadata for ingestion into FSU’s institutional repository. With this framework, the Libraries can run constructed searches every six or twelve months and stay on top of new publications from FSU researchers posted in PubMed Central without hassle. While the tool does not fully automate the ingestion workflow from harvest to deposit, it significantly reduces the time-intensive task of manually discovering articles and creating ingest records for each one.
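To make the search step more concrete, here is a minimal Python sketch of how a constructed search against PubMed Central might look using NCBI's public E-utilities ESearch endpoint. This is not PMC Grabber's code (which is written in PHP); the function names, the example affiliation query, and the date range are assumptions chosen for illustration only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities ESearch endpoint (public API).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term, start_date, end_date, retmax=100):
    """Build an ESearch URL limited to a publication-date window.

    `term` is whatever query string the repository staff construct;
    the function name and defaults are hypothetical.
    """
    params = {
        "db": "pmc",            # search PubMed Central
        "term": term,
        "datetype": "pdat",     # filter on publication date
        "mindate": start_date,  # format: YYYY/MM/DD
        "maxdate": end_date,
        "retmax": retmax,       # maximum number of IDs returned
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)


def extract_ids(response_text):
    """Pull the list of record IDs out of an ESearch JSON response."""
    data = json.loads(response_text)
    return data["esearchresult"]["idlist"]


def search_pmc(term, start_date, end_date):
    """Run the search over the network and return matching IDs."""
    url = build_search_url(term, start_date, end_date)
    with urlopen(url) as response:
        return extract_ids(response.read().decode("utf-8"))
```

Calling `search_pmc("Florida State University[Affiliation]", "2016/01/01", "2016/12/31")` would return a list of ID strings that a later step could use to fetch full metadata. The affiliation query shown here is an assumed example, not the search the Libraries actually run.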

PMC Grabber Workflow Diagram showing distinct steps, database table layout, and outcomes.

SQLite database management menu after using PMC Grabber.

SQLite database embargo table populated after a search using PMC Grabber.

The other tool, codenamed WOS (Web of Science) Grabber, combines the core concept of PMC Grabber with a workflow built on several other tools and applications. The goal is to capture all FSU-affiliated publications appearing in Web of Science with minimal participation necessary on the part of authors. Using a combination of Web of Science searches, Zotero, SHERPA/RoMEO API calls in Google Sheets, and OpenRefine, thousands of publications can be identified and staged for ingest. The end result of the workflow is a set of publications that can be filtered to discover different subsets of articles: (1) those that can be deposited into an institutional repository as publisher versions with no author intervention; (2) those that can be deposited into an institutional repository as accepted manuscripts/final drafts; and (3) those that only allow pre-print versions to be deposited into institutional repositories. Using WOS Grabber, I was able to quickly and easily identify over 2,000 FSU-affiliated articles published in 2016. Of these, 500 articles (a good 25% of all Web of Science-indexed scholarship from FSU!) were open access and were immediately added to our ingestion queue, and a little more than 1,500 were identified as allowing final draft deposit into a repository.
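The filtering step at the end of this workflow can be sketched in a few lines of Python. This is only an illustration of sorting staged records into the three subsets described above; the actual workflow uses Google Sheets and OpenRefine, and the field names here (`publisher_version_allowed`, `accepted_manuscript_allowed`, `preprint_allowed`) are hypothetical stand-ins for whatever policy data the SHERPA/RoMEO lookups return.

```python
from collections import defaultdict


def classify(policy):
    """Assign a record to the most permissive deposit subset it qualifies for.

    `policy` is a dict of hypothetical boolean flags derived from a
    publisher-policy lookup.
    """
    if policy.get("publisher_version_allowed"):
        return "publisher_version"      # subset (1): no author intervention
    if policy.get("accepted_manuscript_allowed"):
        return "accepted_manuscript"    # subset (2): final draft needed
    if policy.get("preprint_allowed"):
        return "preprint_only"          # subset (3): pre-print only
    return "no_deposit_identified"      # nothing usable found for this record


def group_by_subset(records):
    """Group record titles by the subset returned from classify()."""
    groups = defaultdict(list)
    for record in records:
        groups[classify(record["policy"])].append(record["title"])
    return dict(groups)


# Made-up sample records, for illustration only.
sample_records = [
    {"title": "Article A", "policy": {"publisher_version_allowed": True}},
    {"title": "Article B", "policy": {"accepted_manuscript_allowed": True}},
    {"title": "Article C", "policy": {"preprint_allowed": True}},
]
subsets = group_by_subset(sample_records)
```

Checking the most permissive option first means each article lands in exactly one subset, which makes it easy to see which records can go straight to the ingestion queue and which need a manuscript from the author.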

Overall, my involvement with these projects has been positive and signals a promising future for repository managers looking to leverage emerging technologies and centralized repositories. My experiences suggest that, through the use of new tools and technologies, what is still being described as an unmanageable goal is quickly becoming feasible for institutional repositories. Libraries with sufficient resources (in terms of skilled personnel and funding) should continue to push the envelope in this area and discover different ways to improve repository workflow efficiency and, ultimately, user access to scholarship. If my experiences are any indication, an investment in and a focus on this kind of work will have great returns for everyone involved.