Digital scholarship

Digitizing the Past: My Time as a Digital Cultural Heritage Intern

By Grace Robbins, Office of Digital Research and Scholarship Intern, Fall 2019

 

During this semester I have been working as the Digital Cultural Heritage Intern in the Office of Digital Research and Scholarship. You might be wondering: what in the world is digital cultural heritage? It sounds like a fancy title, but for what? Essentially, digital cultural heritage means preserving anything of cultural significance in a digital medium. Anything we preserve becomes part of a “heritage,” whether that of an individual person or a whole culture. I became interested in working at the intersection of digital humanities and archaeology after volunteering on the Cosa excavation in Italy, directed by FSU. Archaeology is such a material-driven, hands-on discipline (and science!), and it proves to be an effective tool for understanding–and interacting with–the past. However, it generates enormous amounts of data. Archaeologist Ethan Watrall writes, “The sheer volume and complexity of archaeological data is often difficult to communicate to non-archaeologists.” Furthermore, many excavated artifacts and structures remain inaccessible to most of the general public. Thus, the goals of my internship revolved around understanding how digital platforms affect access to these “heritages,” specifically in the contexts of archaeology and the humanities, so that scholarship is furthered not only for people in academia but also, hopefully, for the general public.

I did most of my work with the Digital Cosa Project, which included uploading data from the Cosa hard drive onto DigiNole, FSU’s Digital Repository. My tasks presented some organizational and technological challenges, however: the large amount of data to be ingested meant rethinking the best way to organize the digital collection. We ended up switching plans from organizing by excavation year to organizing by type of file (artifacts, plans, maps, stratigraphic unit sheets, etc.). I also practiced coding, something many historians and archaeologists may not be familiar with. This raises the question: should these disciplines incorporate more digital education in the future, and how useful would it be?

I also wanted to experiment with visual technology, including 3D applications such as 3D modeling and 3D printing.


3D model of a trench from the 2019 excavation season being built in Meshroom, a free and open-source photogrammetry program.

I especially enjoyed learning about 3D printing because it ran simultaneously with the FSU Archaeology Club’s “Printing the Past” exhibit in Dirac, which displays one important way digital humanities can further archaeological knowledge: hands-on learning! I never learned about ancient Rome better than when I was unearthing ancient material on the dig, and providing such artifacts as 3D prints for non-archaeologists to interact with carries a similar pedagogical significance.


Standing with the Cosa poster for the “Printing the Past” exhibit by the FSU Archaeology Club. I helped 3D model and 3D print the 3rd object from the left, an inscription found in the 2019 excavation season.

3D scanning was not as easy a task, since new technology always comes with a learning curve, but I want to continue working with this practice in the future.


3D scanning the Napoleon Bonaparte death mask in Special Collections.

In my time at the internship, I have broken down my understanding of digital humanities from a broad concept into a web of applications that further humanities and archaeological knowledge. I will continue in the internship next semester, where I hope to keep mastering 3D modeling and printing, and I am looking forward to developments in the Digital Cosa Project as we finalize our plans for the Cosa digital collection. Most importantly, I am eager to experiment with more creative ways these digital applications can be used in academia and by the general public to put us more “in touch” with the past.

 

Open Access at FSU Libraries: A Year in Review

Open access is a global movement to freely publish research in online repositories and open publications instead of the costly subscription-based publishing models that have dominated the scholarly publishing industry for decades. Paywalled research is only available to those who can afford to pay subscription costs, leaving many researchers and institutions around the world unable to access critical findings in their fields. Open access allows research to reach a wider, global audience and leads to greater readership, citation, and innovation. Authors can publish their work openly by archiving accepted manuscripts in institutional repositories like DigiNole, publishing completed drafts on preprint servers, or submitting to open access journals.

Open Initiatives at FSU

In 2016, the FSU Faculty Senate adopted an Open Access Policy that grants the Libraries permission to archive scholarly works created by FSU faculty. The policy is intended to increase the availability of research developed at FSU to readers and scholars around the world. The Libraries use mediated deposits and automated harvesting workflows to populate DigiNole, our institutional repository. A three-year review found steady growth in repository deposits since the Open Access Policy was implemented. This trend continued in 2019, which saw over 2,000 objects added to DigiNole. Departments with 100 or more DigiNole uploads this year include the Department of Psychology, the Department of Biological Sciences, and the National High Magnetic Field Lab.

Bar chart: Scholarly Articles in DigiNole by Year, showing the upward trend in article deposits from 2011 to 2018.

The Libraries also support open access through the Open Access Fund, which helps authors publish in open access journals. Some open access journals require authors to pay article processing charges (APCs) to finance the technical work that goes into preparing, publishing, and preserving web publications. APCs can cost upwards of $2,000 for some publishers. To help authors mitigate this expense, the Open Access Fund provides awards of up to $1,500 for qualified proposals. In 2019, the Libraries funded 38 open access articles with funding support from the College of Arts and Sciences, the Graduate School, and the Office of the Provost.

Open access is not limited to research articles. Textbook costs have increased 82% since 2002 (nearly three times the rate of inflation), and textbook affordability for students is a growing concern nationwide.¹ Instructors are turning to open educational resources to reduce textbook costs. In the 2017-2018 academic year, 43% of students in Florida reported spending over $300 per semester on textbooks, and 64.2% of students were unable to purchase a textbook due to high costs.² Florida Virtual Campus hosted an Open Educational Resources Summit in the spring of 2018 where librarians and educators from across the state came together to discuss challenges and opportunities for implementing OER on their campuses. Mallary Rawls represented FSU Libraries at the Summit and reported on the event in a March blog post.

Instructors at FSU have been adopting open course materials and using resources from the Libraries to decrease textbook expenses for students. Dr. Vanessa Dennen in the Department of Education created an open textbook in 2018 and reported on students’ perceptions of the open materials in a recent issue of Online Learning. The Libraries published two open access textbooks in DigiNole this year to fill subject gaps in existing open materials.

Cover art for FSU Open Textbooks

Dr. Giray Ökten and Dr. Arash Fahim received Alternative Textbook Grant awards that helped transform their lecture notes into open mathematics textbooks. First Semester in Numerical Analysis with Julia and Financial Mathematics: Concepts and Computational Methods support subjects that are not well-covered by traditional textbooks in the field. The Alternative Textbook Grants have saved students $333,356 since 2016. Instructors can visit the Alternative Textbook Grants webpage for more information.

In addition to cost savings, open course materials have the added benefit of perpetual access. Unlike access codes and textbook rentals that are only available for a limited time, open materials are freely available online or through the library without access restrictions. With open online course materials, instructors can easily update textbooks with new material, and students can be assured they are accessing the most current version of the information. Open educational resources offer greater flexibility for instructors to customize their course content and increase textbook affordability for students.

Open access is critical for advancing the global knowledge commons and scientific innovation, and open educational resources promote student success by increasing the accessibility of course materials. The Libraries are proud of the progress we have made this year in furthering open access, and we look forward to continuing this important work in the future.

 

  1. US GAO. (2013). College Textbooks: Students Have Greater Access to Textbook Information. https://www.gao.gov/products/GAO-13-368
  2. Florida Virtual Campus. (2018). 2018 Student Textbook and Course Materials Survey. Web.

A Library Intern’s Maiden Voyage through Digital Publication in the Antarctic

This post was authored by Suzanne Raybuck, Intern with the Office of Digital Research and Scholarship in the Fall of 2019. Suzanne recounts her experience working with Special Collections materials and creating a digital publication interface to display them online. The final version is not yet live, but this post contains previews of the interface.

A sturdy, metal oil lantern about a foot tall with grey metal and discolored glass protecting the wick.

A Hercules emergency oil lantern from Operation Deep Freeze.

When I was originally brought on as the Digital Publication Intern for the Office of Digital Research and Scholarship, I had virtually no concept of what I would be doing in my new position. But very early on, I knew that I wanted to work with the Robert E. Hancock Jr. Collection at FSU Special Collections. The collection “contains materials regarding military operations in the Antarctic, primarily focusing on the Operation Deep Freeze II mission.” Based on that description, it’s safe to assume it contains lots of important and scholarly documents and artifacts. However, it also contains various memorabilia from Robert E. Hancock Jr.’s time in Antarctica (including many, many tiny penguin figurines, a drawing of Mickey Mouse shaking a penguin’s hand, model navy destroyer ships, lumps of coal, emergency lanterns, and military rations). This wonderful collection of artifacts is endlessly fascinating because it provides a series of vignettes of life at the South Pole in the form of really fun, random objects.

A highly detailed model of a navy wind-class icebreaker ship. Below the water line the ship is painted red, while above it is a steel gray. The deck is littered with various apparatus such as guns, lifeboats, radar equipment, and cranes.

A model of the wind-class icebreaker USCGC Southwind, which participated in Operation Deep Freeze.

I found this collection while searching through Special Collections for a fun series of documents to use as guinea pigs for a new publication system we were testing. Essentially, I needed a bunch of documents in similar formats that we could transform into digital objects and then use to test out different publication tools. After spending maybe an hour with a variety of fun models and pictures, I found the Operation Deep Freeze Newsletters nestled into a box of other periodicals from Antarctica. The Newsletters were published by the military and sent to the families of servicemen stationed in Antarctica to let them know the news from the various bases. They were mostly written by incredibly bored servicemen just trying to pass the time in their freezing posts. This boredom resulted in the inaugural newsletter detailing the long and involved process of how a band of grizzled soldiers tried to hatch live chicks from commercial eggs for the upcoming Easter holiday. I had definitely found my guinea pig documents.

A very basically formatted newsletter titled OPERATION DEEP FREEZE NEWSLETTER. The text is somewhat faded and the first article details how several servicemen from Little America Station selected six eggs for incubation in a dental incubator to try to hatch live chicks.

The original front page of Volume 1 Issue 1 of the Operation Deep Freeze Newsletter.

After finding the newsletters, I was tasked by our Digital Humanities Librarian, Sarah Stanley, with first encoding them in a data-rich XML format defined by the Text Encoding Initiative, or TEI, and then figuring out how to publish them online. To accomplish this, we had to take three key factors into consideration: maintaining the format of the newsletters, providing good display functionality (e.g. tables of contents, hyperlinks, page view/scroll view), and ease of use. With these in mind, I started trying out different publication methods, such as eXist-db’s TEI Publisher, which proved to be a challenging introduction to digital publishing.

eXist-db is an XML database tool that can be used to build web applications. We used the TEI Publisher package to create a digital collection that would take our TEI data and present it in a clean and simple interface. The process of generating an application was intricate and required lots of specialized knowledge of both TEI files and their accompanying customization files. Additionally, we had no idea how the digital edition would look before we generated an application and viewed it, so if some small part of the display was off, we would have to delete the app, minimally adjust our code, and generate a new app from the very beginning. Once we did get a finalized version generated, the overall look and feel of the page was exactly what we had hoped for: very clean and easy to read. However, because we were using a program to generate the app for us, we had very limited capacity to tweak the website interface and design or add our own custom parts. Ultimately, the functionality of eXist-db did not quite meet our needs, and we tried to find a solution that would let us get a bit more hands-on with our edition.

A screenshot of a webpage titled TEI Publisher with the content of the first Deep Freeze Newsletter in it. The interface has a search bar and small arrows to navigate between pages, and all the text is displayed in an old-timey typewriter font to mimic the original newsletter.

A screencap of the eXist-db interface we created. It’s very clean and easy to navigate, but at least four iterations of the app went into getting this particular layout.

Another possible publication tool didn’t arrive until the next semester, when I was working on publishing a collection of poetry translations online. Sarah pointed me towards “ed,” a template for minimal editions built on Jekyll (a static website generator). After looking at the examples, the display was again very clean and easy to interact with, so we decided to give it a shot. After deploying some quick test sites, we found that it was incredibly easy to work with and consistently generated beautifully designed websites that intuitively displayed our editions. It also had built-in search and annotation functions, which we were looking for in our poetry project. The only problem was that we had to translate our TEI into Markdown, which caused us to lose huge amounts of metadata and information about textual styling that would be useful to other researchers. We made a judgement call and decided to keep looking for something that would preserve our format while giving us all the functionality and display options that we found with ed.

A screenshot of a webpage titled ‘A Wilfred Owen Collection: Just testing this out’ with the text of the first Operation Deep Freeze Newsletter below the title. The interface is very simple and appealing with dark red accents and a collapsible menu bar.

A screenshot of our test site for ed. Most of the sample texts we used were poems from Wilfred Owen, hence the name. Here you can see that the layout is slightly different since ed automatically creates larger title text. Unfortunately, we had to change all our TEI files to markdown, which got rid of most of our metadata.

The final option we looked at was a JavaScript library called CETEIcean, which takes TEI files and translates them directly into HTML. With a single script added to any existing HTML page, we could take our TEI files and easily publish them. Again, we started making some test pages and playing with the code, and quickly ran into a problem. Because CETEIcean is just a JavaScript library, it doesn’t automatically build websites for you the way eXist-db and ed do. If we used CETEIcean, we would have to make every single page on our website from scratch, repeating tons of HTML and JavaScript along the way. Sarah was enthusiastic about using CETEIcean since it arguably checked all our boxes, but I wanted to find a more efficient way.

In the end, we settled on using a combination of CETEIcean and ed, along with chunks of original code, to create our own web application, which we named Pilot: Publishing Interface for Literary Objects in TEI¹. We essentially combined the quick and intuitive page generation from ed with the JavaScript transformation of TEI from CETEIcean, all running on a Node.js server. Because we made Pilot from scratch, we can add all the functionality we want, such as annotation, interactivity, and variant readings of the base newsletters.

A very basic interface that has a navigation bar at the top labelled ‘Pilot’ with links to a homepage, collections page, about page and contact information. The contents of the Operation Deep Freeze newsletter are again reproduced here with very basic text styling and layout.

A screenshot of what the first draft of our Pilot interface looks like. This page was automatically generated by the server file after reading a folder of TEI files, transforming them to HTML, and finally running them through three templates to get the desired display.

Though this project was long and frustrating, it ended up teaching me one of the most important points of digital publishing: digital representation of texts adds to the work, rather than merely reproducing it. Digital publishing sits at a unique intersection where we have to negotiate the appearance of the facsimile, the functionality the editors want, and the demands of a digital medium. With all of these competing agendas, it’s easy to forget that a digital edition is a creative opportunity. With the vast array of tools offered by the web, developers can take advantage of things like interactive elements, user input, and different types of media to create editions that can only exist in digital spaces. In a way, digital editions represent a new kind of edition that acts more like an archive, where researchers can explore a digital space to find artifacts that are curated through organization and interface.

We plan for our iteration of the Newsletters in Pilot to allow for full-text searching, public annotation, variant readings, and interactive displays. With these new features, we hope that the Newsletters will be read and understood in entirely different ways than their paper counterparts, and that readers will be able to interact with such an engaging yet little-known collection.

Notes

¹ As an homage to CETEIcean (a pun on “cetacean,” which means “of or relating to whales”), we decided to keep with the whale theme and name our project after the pilot whale.

Alternative Textbook Grants for Instructors Aim to Reduce Financial Burden on Students

FSU Libraries are currently taking applications for new Alternative Textbook Grants. These grants support FSU instructors in replacing commercial textbooks with open alternatives that are available to students at no cost. Open textbooks are written by experts and peer-reviewed, just like commercial textbooks, but are published under open copyright licenses so that they can be downloaded, distributed, and adapted for free.

“These grants encourage faculty to relieve some of the financial burden on their students, advancing the University’s strategic goal of ensuring an affordable education for all students regardless of socioeconomic status,” said Gale Etschmaier, Dean of University Libraries. “Grant programs of this kind are having a big impact at elite institutions across the country, collectively saving students millions in textbook costs each year.”

The cost of college textbooks has risen 300% since 1978, with a 90% increase over the last decade alone. Due to high costs, many students decide not to purchase textbooks, a decision that has been shown to negatively impact student success. In a recent survey conducted by the Libraries, 72% of FSU students (n = 350) reported having not purchased a required textbook due to high cost. Instructors who participated in previous rounds of the Alternative Textbook Grants program are expected to save FSU students up to $270,000 by Summer 2019.

During the 2018-19 academic year, ten grants of $1,000 each will be available to FSU instructors who are interested in replacing commercial course materials with open textbooks, library-licensed electronic books or journal articles, or other zero-cost educational resources. Thanks to a partnership with International Programs, an additional ten grants of $1,000 will be available for faculty who teach at FSU’s international study centers.

Interested instructors are encouraged to review the grant requirements and submit an online application form by the following dates:

  • October 1st, 2018 (for spring and summer on-campus courses)
  • November 1st, 2018 (for courses taught at our international study centers)
  • February 1st, 2019 (for summer and fall courses)

Successful applicants will receive training and consultations to assist them in implementing their alternative textbook. For more information, and to apply for a grant, please visit lib.fsu.edu/alttextbooks or contact Devin Soper, Scholarly Communications Librarian, at dsoper@fsu.edu.

Florida State University Libraries’ mission is to drive academic excellence and success by fostering engagement through extensive collections, dynamic information resources, transformative collaborations, innovative services and supportive environments for FSU and the broader scholarly community.

Open Textbook Network Workshop for FSU Faculty


The Office of the Provost is sponsoring an open textbook workshop for FSU faculty from 10:00am-12:00pm on Thursday, October 25th. The workshop will be facilitated by two Open Textbook Network trainers, Dr. Abbey Dvorak and Josh Bolick from the University of Kansas. The purpose of the workshop is to introduce faculty to open textbooks and the benefits they can bring to student learning, faculty pedagogical practice, and social justice on campus.

Participating faculty will be invited to engage with an open textbook in their discipline by writing a brief review, for which they will be eligible to receive a $200 stipend.

What: Open Textbook Network Workshop

Where: Bradley Reading Room, Strozier Library

When: Thursday, October 25th, 10:00 AM – 12:00 PM

Interested faculty members are invited to apply by Friday, October 12th. Capacity is limited and open textbooks are not available for all subjects. Preference will be based on the availability of open textbooks in applicable subject areas.

If you have questions about this workshop or open textbooks, please contact Devin Soper, Scholarly Communications Librarian, at 850-645-2600 or dsoper@fsu.edu. You can also visit the Open & Affordable Textbook Initiative website for more information about our open education initiatives.

Gathering Publicly Available Information with an API

by Keno Catabay and Rachel Smart

This is a post for anyone who is interested in using web APIs to gather content, or who simply has questions about how to begin interacting with web APIs. Keno Catabay and I, Rachel Smart, both work in the Office of Digital Research and Scholarship on various content-gathering projects. Keno has been our Graduate Assistant since Fall 2017, pursuing data and digital pedagogy interests as well as teaching Python workshops. I am the manager of the Research Repository, DigiNole, and am responsible for content recruitment, gathering, and management, as well as all the offshoot projects.

Earlier this summer, we embarked on a project to assist FSU’s Office of Commercialization in archiving approved patent documents that university-affiliated researchers have filed with the United States Patent and Trademark Office (USPTO) since the 1980s. These patents are to be uploaded into DigiNole, our institutional repository, increasing their discoverability, given that the USPTO Patent Full-Text and Image Database (PatFT) is difficult to navigate and DigiNole is indexed by Google Scholar.

This project was accomplished, in part, using the Patent-Harvest software developed by Virginia Tech Libraries. The software contains a Java script that retrieves metadata and PDF files for the patent objects from PatFT through the new API the USPTO is developing for its database, currently in its beta stage. While the Virginia Tech Patent-Harvest was an excellent starting point–thank you, VTech!–we decided that communicating directly with the USPTO API would be more beneficial for our project long-term, as we could manipulate the metadata more freely. For now, though, we still rely on the VTech script to retrieve the PDF files.

If you are harvesting data from an API, you will have to familiarize yourself with the site’s unique API query language. The USPTO API query language is documented on its API Query Language page. We also had to make sure we were communicating with the correct endpoint, a URL that represents the objects we were looking to harvest. In our case, we were querying the Patents Endpoint.

Communicating with an API can be difficult for the uninitiated. If you have only a cursory understanding of IT and coding, you may run into roadblocks, specifically while attempting to communicate with the API directly from your computer’s command line/terminal. There are two main HTTP requests you can make to the server: GET requests and POST requests. GET requests appear to be the preferred standard, unless the parameters of your request exceed 2,000 characters, in which case you would make a POST request.


Snapshot of Postman’s interface during a query

Keno chose to use Postman, a free application, to send the HTTP requests without having to download packages from the command line. Depending on how much traffic is on the server, Postman can harvest the metadata for us in a few minutes.

Instructions for writing the parameters, or the data that we wanted from USPTO, are clearly provided by the API Query Language site, patentsview.org. In our case, we wanted our metadata to have specific fields, which are listed in the following GET request.

GET http://www.patentsview.org/api/patents/query?q={"assignee_organization":"Florida State University Research Foundation, Inc"}&f=["patent_number","patent_date","patent_num_cited_by_us_patents","app_date","patent_title","inventor_first_name","inventor_last_name","patent_abstract","patent_type","inventor_id","assignee_id"]&o={"per_page":350}

Note that the request defaults to 25 results, so o={"per_page":350} was inserted in the parameters, as we expected around 200 returned results from that particular assignee.
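
If you would rather work from a script than from Postman, the same request can be sent from R. The sketch below is not part of our actual workflow (we used Postman and a PHP script); it uses the httr and jsonlite packages, with the endpoint and parameters copied from the GET request above, and you would swap in your own fields and assignee.

library(httr)
library(jsonlite)

endpoint <- "http://www.patentsview.org/api/patents/query"

# Same query as above, passed as q (query), f (fields), and o (options)
resp <- GET(endpoint, query = list(
  q = '{"assignee_organization":"Florida State University Research Foundation, Inc"}',
  f = '["patent_number","patent_date","patent_title","inventor_last_name"]',
  o = '{"per_page":350}'
))
stop_for_status(resp)

# Parse the JSON response into an R list/data frame
patents <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

# If your parameters ever exceed the roughly 2,000 character limit for GET
# requests, the PatentsView documentation describes an equivalent POST request
# that takes the same q/f/o values in a JSON body (httr::POST can send that).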

USPTO returns the data in JSON, which is written in an easy-to-read key/value pair format. However, this data needs to be transformed into the XML-based MODS metadata format in order for the patent objects (paired metadata and PDF files) to be deposited into the research repository. A PHP script already being used to transform metadata for the repository was repurposed for this task, though significant changes needed to be made. Once the debugging process was complete, the PHP script was executed through the command line with the JSON file as an argument, and 465 new well-formed, valid MODS records were born!


Snippet of the JSON to MODS transformation script
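
To give a feel for what that transformation step involves without reading PHP, here is a rough sketch of the same idea in R using the jsonlite and xml2 packages (our production script is PHP, so this is an illustration rather than our actual code). It assumes the API response has been saved as a file called patents.json, and the handful of MODS elements shown are an illustrative mapping, not the full one the production script performs.

library(jsonlite)
library(xml2)

# Read the saved JSON response and pull out the data frame of patents
results <- fromJSON("patents.json")
patents <- results$patents

# Write one bare-bones MODS record per patent
for (i in seq_len(nrow(patents))) {
  mods <- read_xml("<mods/>")  # real records also declare the MODS namespace
  title <- xml_add_child(xml_add_child(mods, "titleInfo"), "title")
  xml_text(title) <- patents$patent_title[i]
  issued <- xml_add_child(xml_add_child(mods, "originInfo"), "dateIssued")
  xml_text(issued) <- patents$patent_date[i]
  id <- xml_add_child(mods, "identifier")
  xml_set_attr(id, "type", "patent number")
  xml_text(id) <- patents$patent_number[i]
  write_xml(mods, paste0("mods_", patents$patent_number[i], ".xml"))
}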

This project took about three weeks to complete. For those curious about what kinds of inventions researchers at FSU are patenting, these patents are housed in the Florida State University Patent collection. How frequently the collection will be updated with new patents is still undecided, but we currently intend to run the script twice a year to pick up recently approved patents.

It All Starts Here: Digital Scholarship @ FSU

This semester I took on the task of conducting an environmental scan of digital scholarship at FSU, focusing specifically on projects, faculty, and researchers incorporating various kinds of audio-visual media, tools, and platforms into their work. This project, which builds on my previous research into digital humanities initiatives using audio-visual media outside the University and on the growing interest in such projects in the DH field at large, attempts to identify new horizons and domains for DRS to explore.

The goals of this undertaking lie somewhere between generating a possible blueprint for preservation of and access to such projects (a goal traditionally sought by archives or media labs) and making new connections for FSU’s Office of Digital Research and Scholarship (DRS), a goal aligned with the emerging entity in academic libraries we are calling the digital scholarship center (Lippincott et al., 2014). Over the course of the semester, I’ve spoken with ethnomusicologists, new media artists, choreographers, digital humanities scholars, GIS experts, digital archivists, and web developers (just to name a few) in the hopes of finding common threads to weave into a shared infrastructure of AV media-focused resources for library collaborations. Although the task is daunting, the value of such an environmental scan has been concisely articulated by E. Leigh Bonds:

I was less interested in labeling [the research of faculty at Ohio State University] than I was in learning what researchers were doing or wanted to do, and what support they needed to do it. Ultimately, I viewed the environmental scan as the first step towards coordinating a community of researchers (2018).

Bonds’ mission of “coordinating a community” is especially apt considering the wide array of scholarship happening at Florida State University. Despite differences in disciplines, approaches, and aims, the use of digital technologies in working with AV media has become a ubiquitous necessity that requires distinct but often overlapping tools and skill sets. As Christina Kamposiori notes, the digital scholarship center, operating under a “hub and spoke” organizational model, can effectively serve as a networking node and a site of scholarly intersection and cross-pollination (2017).

Such an arrangement, eclipsing traditional conceptions of the library as simply a book repository or service center, better positions library faculty and staff to exercise their knowledge and expertise as technologist partners in scholarly projects working with digital AV content, while also enhancing the research ecosystem through shared resources. This setup, while dependent on many complex factors, is attainable if the digital scholarship center can effectively track the pulse of its community of researchers, identifying their areas of interest, needs, and prospective directions. For DRS, some observations drawn from my environmental scan seem like a good place to begin.

One genre of support DRS and other library units working with digital media can begin to cultivate is providing documentation, preservation, and data management frameworks for digital projects whose final form exists outside traditional “deliverables” of academic scholarship (i.e. print-based publications, and the like). These can be “new media” objects like e-publications and websites, or more complex outputs like performances and/or artworks incorporating many different layers of digital technologies. The work of Tim Glenn, Professor in the School of Dance, is a great example of this kind of intricate digital scholarship which blends choreographic craft and technical execution to create captivating performances. One piece in particular, Triptych (2012), relies on the coordinated interaction between dancers’ bodies, cameras, projectors, and pre-edited video to create what Glenn calls “a total theater experience.”

The amount of digital data and infrastructure that goes into such a project is a bit staggering when we consider the lattice of capture and projection video signals, theater AV technology, lighting control signals, and the video documentation of the performance space itself. Glenn’s website is a testament to his own stellar efforts to capture and document these features of the work, but as many archivists and conservators will attest, this level of artist-provided documentation is often not the case (Rinehart & Ippolito, Chapters 1-2, 2014). With this kind of complex digital scholarship, DRS can develop models along a spectrum: working directly with researchers to develop documentation plans and schemas from the ground up (see examples of such work from The Daniel Langlois Foundation and Matters in Media Art), or serving as a conduit for depositing these digital objects into FSU’s scholarship repository, DigiNole, to ensure their long-term accessibility.

Of course, the other side of the coin is the maintenance, compatibility, and sustainability of such platforms and repositories at the University. DigiNole, built on the Islandora open-source software framework, is the crown jewel of FSU’s digital collections. It serves as the access point to the digital collections of FSU Libraries as well as the University’s research repository and green OA platform for works created by faculty, staff, and students. An incredibly valuable and integral part of the library’s mission, DigiNole has the advantage of being built on an extensible, open-source platform that can be expanded to accommodate a wide variety of digital objects (not to mention that it is also maintained by talented and dedicated librarians, developers, and administrators).

As such, DigiNole can play an equally integral role in data management and documentation projects as a repository of complex, multifaceted digital objects. The challenge will be normalizing data into formats that retain the necessary information or “essence” of the original data while also ensuring compatibility with the Islandora framework. Based on my conversation with FSU’s Digital Archivist, Krystal Thomas, another, more long-term goal for enhancing the library’s digital preservation infrastructure will be implementing a local instance of Archivematica, an open-source software framework specifically designed to address the unique challenges of long-term digital preservation of complex media. Another step the University can potentially take in building this infrastructure across campus is to seek out trusted data repository certification. For those of us working in digital scholarship centers, these kinds of aspirations will always be moving targets, as is the nature of the technological landscape. But having the strongest possible grasp of the local needs and conditions of the scholarly community we work with will allow both librarians and administrators to channel resources and energy into the initiatives with the highest and most palpable impacts and benefits.

Ultimately, the kind of infrastructure DRS or any other academic unit wishes to build should respond to the needs of its scholars and foster solutions with cross-disciplinary applications and implications. Whether generating data management plans, developing scholarly interfaces, or building out our homegrown digital repositories, an R1 institution like Florida State University needs systems that account for the wide variety of scholarship happening both on campus and at its many satellite and auxiliary facilities. Looking towards the future, we can glimpse the kind of fruitful digital scholarship happening at FSU in the work of undergraduates like Suzanne Raybuck. Her contributions to Kris Harper and Ron Doel’s Exploring Greenland project and her fascinating personal research on the construction of digital narratives in video games represent promising digital scholarship that bridges archival, humanities, and pedagogical research. Hopefully DRS and its partner organizations can keep pace with such advancements and continue to improve their services and scope of partnerships.

Acknowledgments

Enormous thank you to the entire staff of FSU’s Office of Digital Research and Scholarship for allowing me the space to pursue this research over the past year, namely Sarah Stanley, Micah Vandegrift, Matt Hunter, Devin Soper, Rachel Smart, and Associate Dean Jean Phillips. Thanks to Professor Tim Glenn and Assistant Professor Hannah Schwadron in the School of Dance, Assistant Professors Rob Duarte and Clint Sleeper in the College of Fine Arts, Assistant Professor Sarah Eyerly in the College of Music, doctoral candidate Mark Sciuchetti in the Department of Geography, Krystal Thomas, Digital Archivist at Special Collections & Archives, and Presidential/UROP Scholar Suzanne Raybuck for your time, contributions, and conversations that helped shape this research.

WORKS CITED

Bonds, E. L. (2018) “First Things First: Conducting an Environmental Scan.” dh+lib, “Features.” Retrieved from: http://acrl.ala.org/dh/2018/01/31/first-things-first-conducting-an-environmental-scan/

Kamposiori, C. (2017) The role of Research Libraries in the creation, archiving, curation, and preservation of tools for the Digital Humanities. Research Libraries UK. Retrieved from http://www.rluk.ac.uk/news/rluk-report-the-role-of-research-libraries-in-the-creation-archiving-curation-and-preservation-of-tools-for-the-digital-humanities/

Lippincott, J., Hemmasi, H., & Lewis, V. (2014) “Trends in Digital Scholarship Centers.” EDUCAUSE Review. Retrieved from https://er.educause.edu/articles/2014/6/trends-in-digital-scholarship-centers

Rinehart, R. & Ippolito, J. (2014) Re-Collection: Art, New Media, and Social Memory. The MIT Press: Cambridge, Massachusetts.

Bringing Data Carpentry to FSU

My name is Rachel Smart and I’m a graduate assistant for Digital Research and Scholarship. I was adopted by DRS in mid-March when the Goldstein Library was stripped of its collection. It was devastating for the 2% of the campus who knew of its existence. Bitterness aside, I’m very grateful for the opportunity I’ve been given by the DRS staff, who warmly welcomed me to their basement lair; here I’m being swiftly enthralled by the Open Access battle cry. The collaborative atmosphere and constant stream of projects never fail to hold my interest. Which leads me to Data Carpentry…

In May of this year, I met with Micah Vandegrift (boss and King of Extroverts) regarding my progress and the future direction of my work with DRS. He presented me with the task of running a data workshop here in our newly renovated space. Having never organized something of this scale before, I was caught off guard. However, I understood the importance of and need for data literacy and management training here on campus, and I was excited by the prospect of contributing to establishing a Data Carpentry presence at FSU. Micah was kind enough to supply me with a pair of floaties before dropping me into the deep end. He initiated first contact with Deb Paul from iDigBio, a certified Data Carpentry instructor here on campus, and I joined the conversation from there.

It took a few weeks of phone calls and emails before we had a committed instructor line-up and were able to apply for a self-organized Data Carpentry workshop in April. Instructors Matthew Collins, Sergio Marconi, and Henry Senyondo from the University of Florida taught the introduction to R, R visualizations, and SQL portions of the workshop. I was informed that you aren’t a true academic librarian until you’ve had to wrestle with a Travel Authorization form, and I completed them for three different people, so I feel thoroughly showered in bureaucratic splendor. However, the most obstructive item on my multipart to-do list of 34+ tasks was finding the money to pay for food. DRS has an event budget with which we paid the self-hosting fee and our instructors’ traveling expenses, but we were not allowed to use it for food. This delayed the scheduling process, and if it weren’t for the generous assistance from iDigBio, we would have had some very hungry and far fewer attendees. If I were blessed with three magical freebies for the next potential Data Carpentry event, I would use the first to transform our current event budget into food-friendly money, and I would save the other two in case anything went wrong (e.g., a vendor never receiving an order). This may seem overly cautious, but just ask anyone who has had to organize anything. We are perfectly capable of completing these tasks on our own or with a team, but some freebies for the tasks that fall beyond our control would come in handy.


The event ran smoothly, and we had full attendance from the 20 registered attendees. As busy as I was in the background during the event, attendees came up to me to let me know how well the workshop was going. There were also comments indicating we could do some things a little differently during the lessons. Most of the issues that sprang up during the event were troubleshooting software errors and discrepancies in the instructions for some of the lessons; for example, the SQLite instructions were written for the desktop version of the program rather than the browser plugin everyone was using. The screen we used to display the lessons and programming demos was the largest we could find, but it was still difficult for some people to see. However, adjustments were made and attendees were able to continue participating.

The most rewarding element of the experience for me was the discussion among participants, both planned collaboration during lessons and unplanned collaboration during breaks and long lunch periods. The majority of our participants had backgrounds in the biological sciences, but as individuals they had different approaches to solving problems. These approaches frequently resulted in discussions about how their various backgrounds and research shaped their relationship with the tools and concepts they were learning at Data Carpentry. On both days of the event, participants came together in our conference room for lunch and rehashed what they had learned so far. They launched into engaging discussions with one another and with DRS staff about the nature of our work and how we can work together on future initiatives. This opportunity to freely exchange ideas sparked creative thinking about the Data Carpentry workshops themselves; on the second day, an increased number of participants brought their own project data to work with in the workshop exercises.

The future of Data Carpentry here at FSU looks bright, though whether I will be there for the next workshop is unknown. Thank you, Deb Paul, Micah Vandegrift, Emily Darrow, Kelly Grove, and Carolyn Moritz for helping me put this workshop together, and thank you to everyone who participated or contributed in any way.

Spring 2017: A User Experience Internship In Review

It’s my final semester in the iSchool program, and I made it. I had a long journey from the start, including a brief hiatus, and yet I returned to finish with a passion – I even received the F. William Summers Award for my academic success. But perfect GPA aside, I’m most proud of my personal and professional development while remotely interning for the Office of Digital Research and Scholarship. The highlight was visiting FSU for the first time this semester and working in the office for a full week. Through meetings, workshops, and events, I learned even more and enjoyed interacting with the team in person. It was a fun and informative visit that I’d recommend to any remote intern, if possible.

A beautiful Tallahassee day at Strozier Library.

My Spring semester objective was to learn more about user experience (UX) and apply it by compiling a report for the office’s website redesign. To prepare, I spent half of the semester reading journal articles, checking out books, and utilizing online sources such as LibUX, Usability.gov, and Lynda.com via FSU. The other half of the semester, I applied the UX principles I had learned to consult with the office on how to redesign its current website. With this project, I now have a foundation in UX and have demonstrated the process through quantitative research, user personas, and visual design. It will be exciting to see which recommendations are used and how they impact existing and new users.

Hitting the books on web and UX design to Depeche Mode.

Overall, the year-long internship was somewhat unconventional since I worked remotely, but I was still able to understand the parts that make up the whole of digital scholarship. At this point I better comprehend how technology is changing research support and the research process as well. Although my time as an intern has ended, I’m looking forward to seeing what more the Office of DRS has to offer in the future – the new website included. I am grateful to have been introduced to and involved with such a supportive and innovative community at FSU.

Thank you, Micah Vandegrift, for your leadership and mentorship, and the entire DRS team, for sharing your time and knowledge. With your guidance, I made it! 🎓

Using R on Early English Books Online

In order to follow along with this post you will need:

  1. Basic knowledge of the Text Encoding Initiative guidelines for marking up texts.
  2. Understanding of the structure of XML and the basics of XPath.
  3. Some experience with Regular Expressions is helpful, but not necessary.
  4. A willingness to learn R!

A few months ago, I started working through Matt Jockers’ Text Analysis with R for Students of Literature. I wanted to improve my text analysis skills, especially since I knew we would be acquiring the EEBO-TCP phase II texts, which contain text data for thousands of early modern English texts (if you are an FSU student or faculty member and you want access to these files, email me). To start, I decided to do some analysis on Holinshed’s Chronicles, which are famous for their impact on Shakespeare’s history plays. While I have been able to create a few basic analyses and visualizations with this data, I’m still learning and expanding my understanding of R. If you ever want to work through some of the ins-and-outs (or would prefer an in-person consultation on R), you should attend the Percolator from 3-5 on Wednesdays in Strozier or email me to schedule a consultation. We will also be holding a text analysis workshop from 10-11 on April 14.

I am going to be working from two of the EEBO TCP phase I texts, since these are currently open access. You can download the entire corpus for phase one in SGML format: https://umich.app.box.com/s/nfdp6hz228qtbl2hwhhb. I’ve used a stylesheet generated by the TEI council to transform the files into TEI P5-compliant XML files. You can get the example files on my GitHub page (along with the finalized code). Alternately, you can get all of the P5-compliant TEI files directly from the Text Creation Partnership Github.

If you want to follow along with this blog post, do the following:

Step 1. Get your texts. Go to my GitHub page and download holinshed-v1.xml and holinshed-v2.xml. Put them in a directory that you can easily find (I have mine on my desktop in a directory called “holinshed” within another directory called “eebo_r”).

Step 2. Download R and R Studio, as outlined in our Text Analysis libguide.

Step 3. Set your working directory. Open RStudio and type setwd(""), where the path to the folder you created is contained within the quotes. On a Mac, your path will likely look something like this:

setwd("~/Desktop/eebo_r")

And on Windows it will look something like:

setwd("C:/Users/scstanley/Desktop/eebo_r")

(Note that you shouldn’t use the “\” character in Windows filepaths, even though that is standard on Windows. Backslashes are escape characters in R, so use forward slashes instead.)

You can either type this into the script pane or into the console. My script pane is on the top-left, but yours may be somewhere else within your RStudio environment. To run a line from the script pane, hit ctrl+enter (cmd+enter on a Mac). Note: I am using the script pane to edit my code and hitting ctrl+enter to run it in the console. If you just want to run your code in the console without saving it as a script, you can type directly into the console.

Step 4. Install the XML and Text Mining packages. Go to Tools > Install Packages and type “XML” (all uppercase) into the Packages text field. Click “Install.” Do the same with “tm” (all lowercase). You could also enter install.packages("tm") and install.packages("XML") into your console with the same effect.

Step 5. Now that you have the XML and text mining package installed, you should call them into the session:

library(XML)
library(tm)

Again, hit ctrl+enter. 

Now you’re ready to get started working with R!

Remember from the beginning of this post that I created a directory within my working directory (“~/Desktop/eebo_r”) to store the files I want to analyze. I called this directory “holinshed”. I am going to create an object called `directory` that references that filepath. To do this, I’m going to use the assignment operator (`<-`). This gets used quite frequently in R to give a more complex or verbose object a shorter name. In this case, we will say:

directory <- "holinshed"

Now, we want to get all of the files within that directory: 

files <- dir(path=directory, pattern=".*xml")

This line of code creates another object called “files”, which looks inside the directory we stored in the “directory” object and finds all of the files whose names end in “.xml” (all of the XML files).

This is where things can get a little confusing if you don’t understand XML and XPath. For a basic overview, you can take a detour to my presentation on TEI from the Discover DH workshop series, which contains an overview of XML.

What you will need to know for this exercise is that XML structures are perfectly nested and hierarchical, and you can navigate up and down that hierarchy using a XPath. If XML is like a tree, XPath is your way of moving up and down branches to twigs, jumping to other branches, or going back to the trunk.

For the purposes of this assignment, I am interested in specific divisions within Holinshed’s Chronicles—specifically, the ones that are labelled “chapter” and “section” by the encoders of the EEBO-TCP texts. The way that I would navigate from the root of the document to these two types of divisions is with the following XPath:

/tei:TEI//tei:div[@type='chapter'] | /tei:TEI//tei:div[@type='section']

(find me all the divisions with a value of “chapter” on the type attribute AND find me all the divisions with the value of “section” on the type attribute.)

Out of the box, R cannot parse XPath, but the XML package that you installed at the beginning will allow you to select only those pieces from your documents.

Now we need to get the  XML content out of the two files in our “holinshed” directory. To do this, we will need to create a for loop. To start, create an empty list.

documents.l <- list()

This gives us a place to store the objects each time the for loop finishes and goes back to the beginning. Without the empty list, the content would just keep overwriting itself, so at the end you would only have the last object. For example, I made the mistake of not creating an empty list while writing my for loop, and I kept getting only the divisions from the second volume of Holinshed’s Chronicles, since the second volume was overwriting the first.

Our for loop is now going to take every file in the “holinshed” directory and do the same thing to it. We begin a for loop like this:

for(i in 1:length(files)){
   # the rest of the code goes here
}

This basically says: for every number from 1 to the length of the “files” object (in this case 2), do the following. Also, note that the pound sign indicates that a line is a comment and shouldn’t be processed as R code.

Now, within this for loop, we are going to specify what should be done to each file. We are going to create a document object using `xmlTreeParse` for each object within the “holinshed” directory.

document <- xmlTreeParse(file.path(directory, files[i]), useInternalNodes = TRUE) 

(If you find it hard to read long code on one line, you can add line breaks. Just make sure that the breaks happen at a logical place, like after a comma, so that R knows the expression continues, and indent the following line for readability. Unfortunately, WordPress isn’t allowing me to provide an example, but you can see how that would look in practice in the example R file provided in my eebo_r GitHub repository.)

The [i] in “files[i]” is where the numeric index is stored on each loop. So the first loop will use files[1] and the second files[2] (which correspond to holinshed-v1.xml and holinshed-v2.xml). If we had more than two XML files in this directory, the for loop would apply to all of them as well.

Next, you will use the empty list that you created. Define the element of documents.l that corresponds to files[1] or files[2] (holinshed-v1.xml and holinshed-v2.xml, respectively) as the node set returned by the XPath we created above. In other words, create a list of all of the divisions with a @type value of “chapter” or “section” within each document.

documents.l[[files[i]]] <- getNodeSet(document, "/tei:TEI//tei:div[@type='chapter'] | /tei:TEI//tei:div[@type='section']", namespaces = c(tei="http://www.tei-c.org/ns/1.0"))

Ignore namespaces for now. They are important to understanding XML, but as long as you don’t have documents that contain multiple XML languages, you won’t need to worry as much about it. I can discuss the function and importance of namespaces in another post.

So, in the end, your full for loop will look like this:

for(i in 1:length(files)){
   document <- xmlTreeParse(file.path(directory, files[i]), useInternalNodes = TRUE)
   documents.l[[files[i]]] <- getNodeSet(document, "/tei:TEI//tei:div[@type='chapter'] | /tei:TEI//tei:div[@type='section']", 
        namespaces = c(tei="http://www.tei-c.org/ns/1.0"))
}

If you want to run multiple lines of code, you can highlight the entire for loop and hit “ctrl+enter.” Alternately, you can put your cursor at the beginning of the for loop in the script pane and hit “option+command+E” on a Mac, or go to the menu and click Code > Run Region > Run From Line to End, to run from that line to the end of the script. This is also useful if you ever save an R script and want to come back to it later and start from where you left off. This way you don’t need to go back and run each line individually.

Now you should have a list with two items. Each item on this list is a node set (which is a specialized type of list). Rather than having documents.l be two nested lists, I want to give each document its own list. I did it with the following code; see if you can figure out exactly what is happening here:

holinshed1.l <- documents.l[[1]] 
holinshed2.l <- documents.l[[2]]

Now that I have two separate lists, one for each document, I want to concatenate them into a single list of divisions. In R, you use `c` to concatenate objects:

both.documents <- c(holinshed1.l, holinshed2.l)

Now, if you check `length(both.documents)`, you should get 359. Your console will look like this:

> length(both.documents)
359

Basically, what this means is that there are a total of 359 divisions in both documents that have a value on type of either “chapter” or “section.”
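As a quick sanity check, the lengths of the two per-volume lists should add up to that same number:

> length(holinshed1.l) + length(holinshed2.l)
[1] 359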

Now you are going to want to return all of the paragraphs that are children of these divisions.* To do this, we need to create another for loop. This time, instead of creating an empty list, we will create an empty vector. I’m going to call this vector paras.lower.

paras.lower <- vector()

I’m going to give you the full code for selecting the contents (text, basically) of all of the paragraphs, and then explain it point-by-point after.

for(i in 1:length(both.documents)){
   # select the <p> elements that are direct children of this division
   paras <- xmlElementsByTagName(both.documents[[i]], "p")
   # pull out the text of each paragraph and collapse it into one string
   paras.words.v <- paste(sapply(paras, xmlValue), collapse = " ")
   # lowercase that string and store it as the i-th element of paras.lower
   paras.lower[[i]] <- tolower(paras.words.v)
}

This says: for every number from 1 to the length of both.documents (which we determined holds 359 divisions), do the following:

Create an object called paras that selects all of the children of the current division (one item of both.documents) with the tag name p; each pass through the loop handles a single division.

Now create another object (this time a character vector) that takes the content of paras (the text within all of those <p> elements, with nested tags stripped) and collapses it into one string.

Finally, take the string you’ve created (all of the words from the paragraphs within that division), make the characters all lowercase, and store the result as the i-th element of paras.lower.

This process may seem slightly confusing at first, especially if you are unfamiliar with what each piece is doing. If you are ever confused, you can type ?term into the console, and you will find the documentation for that specific aspect of R. So, for example, if you typed ?sapply, you’d see that sapply applies a given function over a list or vector (so essentially the same thing happens to multiple objects within a vector or list, without you needing to explicitly state what happens to each item).
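If a toy example helps (this one is separate from the Holinshed workflow), here is sapply applying tolower to each element of a short character vector:

# sapply() applies tolower() to every element of the vector
sapply(c("READ", "Earth"), tolower)
# returns a character vector: "read" "earth" (named after the original elements)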

Now that you have your character vector with the content of all of the paragraphs, you can start cleaning the text. The one problem is that paras.lower contains 359 separate strings (one per division) that need to be combined into one. You can do this by using the paste() function we used in the last few lines.

holinshed.all <- paste(paras.lower, collapse=" ", sep="\n") 

Now, if we ask for the length of holinshed.all, we see that it returns 1, instead of 359.
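In the console, that check looks like this:

> length(holinshed.all)
[1] 1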

Now, we are going to use the tm package that we installed at the beginning. This package can facilitate a lot of types of analysis that we won’t cover in this post. We are going to simply use it to easily remove stopwords from our texts. Stopwords are commonly-occurring words that we may not want to include in our analysis, such as “the”, “a”, “when”, etc.
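If you are curious about what counts as a stopword, you can peek at the beginning of the list that tm bundles (the exact contents depend on your version of tm):

# show the first ten English stopwords that will be removed
stopwords("english")[1:10]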

To do this, you are first going to create a corpus from your holinshed.all vector:

holinshed.corpus <- Corpus(VectorSource(holinshed.all))

Now you will remove stopwords from this corpus. You can use the following code to remove all English stopwords:

holinshed.corpus <- tm_map(holinshed.corpus, removeWords, stopwords("english"))

However, with a corpus this big, R will run very slowly (it will likely take upwards of 10 minutes to remove all the stopwords from your corpus). If you want to let it run and take a break here, feel free to do so. But if you are impatient and would prefer to continue on right now, I have a premade text corpus in my R GitHub repository, which you can use instead of following the next step.

If you do want to remove the stopwords by yourself, run the above code, grab yourself a cup of coffee, work on some other writing projects for a bit, take a nap—whatever suits you best. Once the stopwords are removed, you will see a “>” once again in your console, and you can then type in

writeCorpus(holinshed.corpus, filenames ="holinshed.txt")

This will create a file that has all of the content of the paragraphs within the <div>s with the type value of “chapter” or “section” minus the stopwords.

**Impatient people who didn’t want to wait for the stopwords to get removed can start up again here**

Now that you have a text file with all of the relevant words from Holinshed’s Chronicles (holinshed.txt), we are going to analyze the frequencies of words within the corpus.

We are going to use the scan() function to get all of the characters in the Holinshed corpus.

holinshed <- scan("holinshed.txt", what="character", sep="\n")

This line of R will create an object called “holinshed” which contains all of the character data within holinshed.txt (the corpus you just created).

You will once again need to use the paste() function to collapse all of the lines into one (since scan(), with sep="\n", read the file in line by line, giving you a vector with one element per line).

holinshed <- paste(holinshed, collapse=" ")

Now you will split this very long line of characters at the word level:

holinshed.words <- strsplit(holinshed, "\\W") 

This splits the holinshed string wherever the regular expression “\\W” matches a non-word character, which effectively breaks it into words. If you attempt to show the first 10 items of holinshed.words (`holinshed.words[1:10]`), you will notice that it gives you a truncated version of the whole document followed by 9 NULLs. This is because strsplit returns a list, with the whole document’s worth of words stored as the first item on that list. Using unlist(), we can turn it into a plain character vector:

holinshed.words <- unlist(holinshed.words)
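If the list-versus-vector distinction feels abstract, here is a toy illustration of the same behavior, unrelated to the Holinshed data:

# strsplit() always returns a list, even for a single string
strsplit("read earth hath", "\\W")
# unlist() flattens it into a plain character vector: "read" "earth" "hath"
unlist(strsplit("read earth hath", "\\W"))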

Now, if you enter `holinshed.words[1:10]`, you will see that it returns the first 10 words… but not quite. You will notice that there are a number of blank entries, which are represented by quote marks with no content. In order to remove these, we can say:

holinshed.words <- holinshed.words[which(holinshed.words!="")]

Now, if you enter `holinshed.words[1:10]`, it will display the first 10 words:

[1] "read"     "earth"    "hath"     "beene"    "diuided"  "thrée"  
[7] "parts"    "euen"     "sithens"  "generall" 

In order to get the frequencies of the words within our corpus, we will need to create a table of holinshed.words. In R, this is incredibly simple:

holinshed.frequencies <- table(holinshed.words) 

Now, if you enter `length(holinshed.frequencies)`, R will return 37086. This means that there are 37,086 unique strings within Holinshed’s Chronicles. However, if you look at the first ten entries in this table (`holinshed.frequencies[1:10]`), you will see that they are not words at all! Instead, the table has also picked up numbers. Since I don’t care about numbers (you might, but you aren’t writing this exercise, are you?), I’m going to remove all of the numbers from my table. I determined that the actual alphabetic words start at position 895, so all you need to do is redefine holinshed.frequencies as running from position 895 to the end of the table.

holinshed.frequencies <- holinshed.frequencies[895:37086]
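If you would rather not hard-code position 895 (that number is specific to this corpus), one alternative sketch is to keep only the entries whose names contain at least one letter:

# alternative sketch: drop purely numeric strings instead of slicing by position
holinshed.frequencies <- holinshed.frequencies[grepl("[a-z]", names(holinshed.frequencies))]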

Now you can sort this frequency table so that the first values of the table are the most frequent words in the corpus:

holinshed.frequencies.sort <- sort(holinshed.frequencies, decreasing = TRUE)

Now, enter `holinshed.frequencies.sort[1:10]` to return the ten most frequently used words in our Holinshed corpus.

If you want a graphic representation of this list, you can plot the top twenty words (or 15 or 10):

plot(holinshed.frequencies.sort[1:20])

This graph should show up in the right pane of your RStudio environment (unless you have it configured in a different way), and will show you a visual representation of the raw frequencies of words within our corpus.
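If the default plot is hard to read, one alternative sketch is a bar plot with the word labels rotated so they don’t overlap:

# alternative sketch: bar plot of the 20 most frequent words, labels perpendicular to the axis
barplot(holinshed.frequencies.sort[1:20], las = 2)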

Try it on your own!

  1. We analyzed the top 20 words for the two combined volumes of Holinshed’s Chronicles, but what would our top 20 words look like if we analyzed each text individually?
  2. If you look closely at the XML, you will notice that our original XPath (/tei:TEI//tei:div[@type='chapter'] | /tei:TEI//tei:div[@type='section']) excludes a lot of content from the Chronicles. Specifically, it ignores any division without those type attributes. Further, using `xmlElementsByTagName` only selects the direct children of the node set, which excludes paragraphs that occur within divisions nested within chapters or sections (see, for example, `<div type="part">`, which occurs occasionally within `<div type="chapter">` in volume I). Write code that selects the contents of all paragraphs.
  3. Words in the top 20 list like “doo,” “haue,” and “hir” would presumably be picked up by a stopwords list if they had been spelled like their modern English equivalents. How could you get rid of a few of these nonstandard stopwords? (One possible starting point is sketched after this list.)
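For exercise 3, one possible starting point (a sketch, not the only answer) is to treat those early-modern spellings as extra stopwords and remove them with the same tm_map() call used above, before writing the corpus out:

# sketch: strip a few nonstandard spellings as additional stopwords
custom.stops <- c("doo", "haue", "hir")
holinshed.corpus <- tm_map(holinshed.corpus, removeWords, custom.stops)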

Check back to my eebo_r GitHub page for additional R exercises and tutorials using the EEBO-TCP corpus! And if you have any questions about this post or want to learn more about R, schedule a consultation with me.

Notes

* I specifically don’t say that you are looking for all the paragraphs within these divisions, because the code we are about to use only selects children, not descendants. Understanding the difference between these requires some knowledge of XPath and the structure of XML documents.