Author: Rachel Smart

Rachel is a Graduate Assistant in The Office of Digital Research and Scholarship for Florida State University.

Gathering Publicly Available Information with an API

by Keno Catabay and Rachel Smart

This is a post for anyone who is interested in using web APIs to gather content, or who simply has questions about how to begin interacting with web APIs. Keno Catabay and I, Rachel Smart, both work in the Office of Digital Research and Scholarship on various content-gathering projects. Keno has been our Graduate Assistant since Fall 2017, pursuing data and digital pedagogy interests as well as teaching Python workshops. I am the manager of the Research Repository, DigiNole, and am responsible for content recruitment, gathering, and management, plus all the offshoot projects.

Earlier this summer, we embarked on a project to assist FSU’s Office of Commercialization in archiving approved patent documents that university-affiliated researchers have filed with the United States Patent and Trademark Office (USPTO) since the 1980s. These patents are being uploaded into DigiNole, our institutional repository, to increase their discoverability: the USPTO Patent Full-Text and Image Database (PatFT) is difficult to navigate, while DigiNole is indexed by Google Scholar.

This project was accomplished, in part, using the Patent-Harvest software developed by the Virginia Tech libraries. The software includes a Java program that retrieves metadata and PDF files for the patent objects from PatFT through the new API the USPTO is developing for its database, currently in its beta stage. While the Virginia Tech Patent-Harvest was an excellent starting point (thank you, VTech!), we decided that communicating directly with the USPTO API would be more beneficial for our project long term, as we could manipulate the metadata more freely. For now, though, we still rely on the VTech script to retrieve the PDF files.

If you are harvesting data from an API, you will have to familiarize yourself with the site’s unique API query language. The USPTO API query language is documented on patentsview.org’s API Query Language page. We also had to make sure we were communicating with the correct endpoint, the URL that represents the objects we were looking to harvest; in our case, we were querying the Patents Endpoint.
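As a taste of what that query language looks like, criteria are written as JSON and can be combined with operators such as _and and _gte. The example below reflects our reading of the documentation; check patentsview.org for the current operator set.

q={"_and":[{"assignee_organization":"Florida State University Research Foundation, Inc"},{"_gte":{"patent_date":"2000-01-01"}}]}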

Communicating with an API can be difficult for the uninitiated. If you have only a cursory understanding of IT and coding, you may run into roadblocks, particularly when attempting to communicate with the API directly from your computer’s command line or terminal. There are two main HTTP requests you can make to the server: GET requests and POST requests. GET requests appear to be the preferred standard, unless the parameters of your request exceed roughly 2,000 characters, in which case you would make a POST request.
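If you would rather work from a script than type requests by hand, the difference between the two request types is easy to see in a few lines of Python, the language of our workshops. Treat this as an illustrative sketch only: the endpoint and field names come from patentsview.org, and we are assuming the Patents Endpoint accepts the same parameters in a JSON body on POST, as the documentation describes.

import json
import requests

# The Patents Endpoint; the small query and field list below are placeholders
# meant only to show the shape of a request.
endpoint = "http://www.patentsview.org/api/patents/query"
query = {"assignee_organization": "Florida State University Research Foundation, Inc"}
fields = ["patent_number", "patent_title"]

# GET: the q and f parameters ride along in the URL as JSON strings. Fine for
# short requests, but a long field list can push the URL past roughly 2,000 characters.
get_response = requests.get(endpoint, params={"q": json.dumps(query), "f": json.dumps(fields)})

# POST: the same parameters travel in the request body instead, so length no longer matters.
post_response = requests.post(endpoint, json={"q": query, "f": fields})

print(get_response.status_code, post_response.status_code)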

Snapshot of Postman’s interface during a query

Keno chose to use Postman, a free application, to send the HTTP requests without having to download packages from the command line. Depending on how much traffic the server is handling, Postman can harvest the metadata for us in a few minutes.

Instructions for writing the parameters, that is, specifying the data we wanted from the USPTO, are clearly provided by the API Query Language site, patentsview.org. In our case, we wanted our metadata to have specific fields, which are listed in the following GET request.

GET http://www.patentsview.org/api/patents/query?q={"assignee_organization":"Florida State University Research Foundation, Inc"}&f=["patent_number","patent_date","patent_num_cited_by_us_patents","app_date","patent_title","inventor_first_name","inventor_last_name","patent_abstract","patent_type","inventor_id","assignee_id"]&o={"per_page":350}

Note that the request defaults to 25 results, so o={"per_page":350} was inserted in the parameters, as we expected around 200 results from that particular assignee.
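We used Postman for this step, but the same request can easily be scripted. Here is a rough Python sketch that sends the query above and writes the response to a file for the transformation step described below; the output file name, and the assumption that the matching records come back under a "patents" key in the JSON, are ours for illustration.

import json
import requests

ENDPOINT = "http://www.patentsview.org/api/patents/query"

query = {"assignee_organization": "Florida State University Research Foundation, Inc"}
fields = [
    "patent_number", "patent_date", "patent_num_cited_by_us_patents",
    "app_date", "patent_title", "inventor_first_name", "inventor_last_name",
    "patent_abstract", "patent_type", "inventor_id", "assignee_id",
]
options = {"per_page": 350}  # the API defaults to 25 results per request

response = requests.get(
    ENDPOINT,
    params={"q": json.dumps(query), "f": json.dumps(fields), "o": json.dumps(options)},
)
response.raise_for_status()

data = response.json()
print(len(data.get("patents", [])), "patent records returned")

# Save the raw JSON so the transformation script can take it as an argument.
with open("fsu_patents.json", "w") as handle:
    json.dump(data, handle, indent=2)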

The USPTO returns the data in JSON, an easy-to-read key/value pair format. However, this data needs to be transformed into the XML-based MODS metadata format in order for the patent objects (paired metadata and PDF files) to be deposited into the research repository. A PHP script already being used to transform metadata for the repository was repurposed for this transformation task, but significant changes needed to be made. Once the debugging process was complete, the PHP script was executed from the command line with the JSON file as an argument, and 465 new well-formed, valid MODS records were born!
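Our production script is PHP and is tailored to DigiNole's MODS profile, but the core of the transformation is simple enough to sketch. The Python below is an illustration only: it assumes each patent in the JSON response nests its inventors under an "inventors" key, maps a handful of fields to minimal MODS elements, and makes no attempt to reproduce the full records we actually deposit.

import json
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

def patent_to_mods(patent):
    # Build a minimal MODS record from one patent entry in the JSON response.
    mods = ET.Element(f"{{{MODS_NS}}}mods")

    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = patent.get("patent_title")

    for inventor in patent.get("inventors", []):
        name = ET.SubElement(mods, f"{{{MODS_NS}}}name", type="personal")
        ET.SubElement(name, f"{{{MODS_NS}}}namePart").text = "{}, {}".format(
            inventor.get("inventor_last_name", ""),
            inventor.get("inventor_first_name", ""),
        )

    ET.SubElement(mods, f"{{{MODS_NS}}}abstract").text = patent.get("patent_abstract")

    origin = ET.SubElement(mods, f"{{{MODS_NS}}}originInfo")
    ET.SubElement(origin, f"{{{MODS_NS}}}dateIssued").text = patent.get("patent_date")

    ET.SubElement(mods, f"{{{MODS_NS}}}identifier", type="patent number").text = patent.get("patent_number")
    return mods

with open("fsu_patents.json") as handle:
    data = json.load(handle)

for patent in data.get("patents", []):
    tree = ET.ElementTree(patent_to_mods(patent))
    tree.write("mods_" + patent["patent_number"] + ".xml", encoding="utf-8", xml_declaration=True)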

Snippet of the JSON to MODS transformation script

This project took about three weeks to complete. For those curious about what kinds of inventions researchers at FSU are patenting, these patents are housed in the Florida State University Patent collection in DigiNole. How frequently the collection will be updated with new patents is still undecided, but we currently intend to run the script twice a year to capture recently approved patents.

Bringing Data Carpentry to FSU

My name is Rachel Smart and I’m a graduate assistant for Digital Research and Scholarship. I was adopted by DRS in mid-March when the Goldstein Library was stripped of its collection. It was devastating for the 2% of campus who knew of its existence. Bitterness aside, I’m very grateful for the opportunity I’ve been given by the DRS staff, who warmly welcomed me to their basement lair; here I’m being swiftly enthralled by the Open Access battle cry. The collaborative atmosphere and constant stream of projects never fail to hold my interest. Which leads me to Data Carpentry…

In May of this year, I met with Micah Vandegrift (boss and King of Extroverts) regarding my progress and the future direction of my work with DRS. He presented me with the task of running a data workshop here in our newly renovated space. Having never organized something of this scale before, I was caught off guard. However, I understood the importance of and need for data literacy and management training here on campus, and I was excited by the prospect of contributing to the establishment of a Data Carpentry presence at FSU. Micah was kind enough to supply me with a pair of floaties before dropping me into the deep end: he initiated first contact with Deb Paul from iDigBio, a certified Data Carpentry instructor here on campus, and I joined the conversation from there.

It took a few weeks of phone calls and emails before we had a committed instructor line-up and were able to apply for a self-organized Data Carpentry workshop in April. Instructors Matthew Collins, Sergio Marconi, and Henry Senyondo from the University of Florida taught the introduction to R, R visualizations, and SQL portions of the workshop. I was informed that you aren’t a true academic librarian until you’ve had to wrestle with a Travel Authorization form, and I completed them for three different people, so I feel thoroughly showered in bureaucratic splendor. However, the most obstructive item on my multipart to-do list of 34+ tasks was finding the money to pay for food. DRS has an event budget, with which we paid the self-hosting fee and our instructors’ traveling expenses, but we were not allowed to use it for food. This delayed the scheduling process, and if it weren’t for the generous assistance from iDigBio, we would have had some very hungry and far fewer attendees. If I were blessed with three magical freebies for the next potential Data Carpentry event, I would use the first to transform our current event budget into food-friendly money, and I would save the other two in case anything went wrong (e.g., a vendor never received an order). This may seem overly cautious, but just ask anyone who has had to organize anything. We are perfectly capable of completing these tasks on our own or with a team, but some freebies for the tasks that fall beyond our control would come in handy.

Panorama from the Data Carpentry workshop

The event ran smoothly, and we had full attendance from the 20 registered attendees. As busy as I was in the background during the event, attendees came up to me to let me know how well the workshop was going. There were also comments indicating we could do things a little differently during the lessons. Most of the issues that sprang up during the event were software troubleshooting errors and discrepancies in the instructions for some of the lessons; for example, the SQLite instructions were written for the desktop version of the program rather than the browser plugin everyone was using. The screen we used to display the lessons and programming demos was the largest we could find, but it was still difficult for some people to see. However, adjustments were made and attendees were able to continue participating.

The most rewarding element of the experience for me was the discussion among participants, both during planned collaboration in the lessons and during unplanned collaboration over breaks and long lunch periods. The majority of our participants had backgrounds in the biological sciences, but as individuals they had different approaches to solving problems. These approaches frequently resulted in discussions between participants about how their various backgrounds and research shaped their relationship with the tools and concepts they were learning at Data Carpentry. On both days of the event, participants came together in our conference room for lunch and rehashed what they had learned so far. They launched into engaging discussions with one another and with DRS staff about the nature of our work and how we could work together on future initiatives. This opportunity to freely exchange ideas sparked creative thinking about the Data Carpentry workshops themselves; on the second day, more participants brought their own project data to work with in the workshop exercises.

The future of Data Carpentry here at FSU looks bright, though whether I will be there for the next workshop is unknown. Thank you, Deb Paul, Micah Vandegrift, Emily Darrow, Kelly Grove, and Carolyn Moritz for helping me put this workshop together, and thank you to everyone who participated or contributed in any way.