16 October 2017 by pencechp
One of the lab’s main products is a suite of tools designed to enable users to perform textual analysis tasks in a user-friendly manner against a corpus of journal articles – a task not common in traditional digital humanities work, which tends to be focused on book-length material. The software is a bit of a challenge to get running, however, requiring some experience in system administration. This post is designed to offer interested researchers a “walkthrough” on how to get a quick server up, running, and loaded with data.
As a first aside, I should note that even if you decide that RLetters isn’t for you, some of these earlier steps will absolutely still be useful. Most of the work of a DH project such as this is tied up in data management and file processing, and the preparation I describe here will remain worthwhile even if you decide, for example, to carry out your final analyses in R.1
The first thing you’ll need is a set of journal articles. Here, unfortunately, I can offer little guidance, as you might (1) already have a manually curated set of articles you’re interested in, you might (2) negotiate access to download a large number of PDFs from a journal publisher, or you could (3) use a text mining API such as JSTOR’s Data for Research or Elsevier’s Text and Data Mining API. In short, what RLetters requires is a set of plain-text journal articles, with their corresponding metadata. If you’re getting articles from DFR, the Elsevier API, or a similar text-mining-focused source, this should be relatively easy – that is, all the relevant information will likely be packaged with the articles (e.g., in an XML format like that returned by the Elsevier API). For those who have to work with large collections of PDFs, however, let me say a bit about how we’ve processed those in the past.
If the PDFs are recent enough (if they were “born digital”), they will have perfect-fidelity plain text already embedded in them. Extracting this text is very easy and produces exceptionally high-quality results. I use pdftotext for this, which comes with the Poppler libraries on Linux (you almost certainly already have them installed) or Mac OS X (they can be obtained through the fantastic Homebrew project).2
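To batch-convert a whole directory of PDFs, a short script suffices. Here’s a minimal sketch in Python (the articles directory name is my invention; the -enc UTF-8 flag just keeps pdftotext’s output encoding predictable):

```python
import subprocess
from pathlib import Path

def pdftotext_command(pdf_path: Path) -> list:
    """Build the pdftotext invocation for one PDF, writing a .txt beside it."""
    txt_path = pdf_path.with_suffix(".txt")
    # -enc UTF-8 forces a predictable output encoding across platforms
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), str(txt_path)]

def extract_all(directory: str) -> None:
    """Convert every PDF in `directory` to plain text."""
    for pdf in sorted(Path(directory).glob("*.pdf")):
        subprocess.run(pdftotext_command(pdf), check=True)
```

Calling extract_all("articles") will leave a .txt file next to every .pdf in that folder.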
If you have older or scanned PDFs without a text layer, you’ll need to perform optical character recognition (OCR) on them to extract the text. That makes your life harder, because OCR is much more computationally expensive. (I have a machine in my office that I “borrowed” from the departmental storage room a year ago that does nothing but run OCR, 24/7/365.) There are a number of software packages that implement OCR, of varying quality. Tesseract, now maintained by Google, is free and very easy to automate, but unfortunately produces the lowest-quality OCR of any of the options I’ve tried. Adobe Acrobat Pro produces better results, but is exceedingly difficult to automate, and rather expensive. The same is true of the best-in-class ABBYY FineReader, which produces the nicest OCR text I know of, but is very expensive, and can’t be automated easily unless one buys the even more expensive Enterprise product.
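If you do go the Tesseract route, the pipeline is straightforward to automate: Tesseract doesn’t read PDFs directly, so you first rasterize each page (pdftoppm, also part of Poppler) and then OCR the resulting page images. A sketch of the two commands involved (the 300-dpi resolution and filenames are my choices, not requirements):

```python
from pathlib import Path

def rasterize_command(pdf: Path, out_prefix: Path) -> list:
    """pdftoppm renders each PDF page to <out_prefix>-N.png at 300 dpi."""
    return ["pdftoppm", "-r", "300", "-png", str(pdf), str(out_prefix)]

def ocr_command(page_png: Path) -> list:
    """tesseract writes its plain-text output to <base>.txt next to the image."""
    base = page_png.with_suffix("")
    return ["tesseract", str(page_png), str(base)]
```

Feed each command list to subprocess.run, and collect the per-page .txt files at the end.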
Now, let’s assume that you’re sitting on a .txt file for each of your journal PDFs. You need some metadata to go with them. How hard this is depends on whether or not your articles have a Digital Object Identifier (DOI) attached to them. If they do, you can simply query the metadata from Crossref, the central DOI repository. I’ve written a small script that walks through a directory looking for a series of .doi files (each of which just contains the DOI of an article) and downloads .ttl files (RDF-N3/Turtle format) containing their metadata. The fundamental trick here is that if you query the standard https://doi.org DOI resolver with a different HTTP Accept header, it will return not a redirect to the article, but metadata in your chosen format. The whole thing is described in a blog post here, if you’re interested.
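If you’d like to roll your own version of that script, the content-negotiation trick looks like this with nothing but Python’s standard library (text/turtle is one of several serializations doi.org can return):

```python
import urllib.parse
import urllib.request

def doi_metadata_request(doi: str, accept: str = "text/turtle") -> urllib.request.Request:
    """Build a request asking doi.org for metadata rather than a redirect."""
    # quote() leaves the '/' inside DOIs unescaped, which is what we want
    url = "https://doi.org/" + urllib.parse.quote(doi)
    return urllib.request.Request(url, headers={"Accept": accept})

def fetch_metadata(doi: str) -> str:
    """Download the article's metadata in the requested serialization."""
    with urllib.request.urlopen(doi_metadata_request(doi)) as response:
        return response.read().decode("utf-8")
```

Write the returned string to a .ttl file next to the corresponding .doi file and you have exactly the layout described above.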
If you don’t have DOIs, your life gets more difficult. You’ll likely have to write a scraper that visits the website for the journal (hopefully you still have URLs for the article pages online, or an easy way to get them) and scrapes out the bibliographic data. I strongly recommend RoboBrowser in Python as an efficient way to write small-scale, custom web scrapers. (NOTE: Some varieties of journal website scraping may be illegal in your jurisdiction. Please investigate the legality of what you’re doing before you do it.)
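If your target journal embeds the Google Scholar/Highwire-style citation_* meta tags in its article pages (many publishers do), even the standard library is enough to pull basic bibliographic data out of a downloaded page; here’s a sketch (the tag names follow that common convention, but check what your journal actually emits):

```python
from html.parser import HTMLParser

class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_*"> tags from a journal article page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name", "") or ""
        if name.startswith("citation_"):
            self.metadata[name] = attributes.get("content", "")

def parse_citation_meta(html: str) -> dict:
    """Return a dict of citation_* metadata found in the page source."""
    parser = CitationMetaParser()
    parser.feed(html)
    return parser.metadata
```

For anything messier than meta tags, RoboBrowser (as mentioned above) is the better tool.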
I’m now assuming that you have a folder of .txt files, along with bibliographic data in some format for each text file (which you presumably can parse somehow). You now need to get that data into the format that RLetters expects.
RLetters wants its inbound data in an exceptionally simple XML format. The best way to demonstrate it is simply to show an example:
<doc>
  <field name="data_source">Project Gutenberg</field>
  <field name="license">Public domain</field>
  <field name="license_url">http://www.gutenberg.org/license</field>
  <field name="uid">doi:10.5678/dickens</field>
  <field name="doi">10.5678/dickens</field>
  <field name="authors">C. Dickens</field>
  <field name="title">A Tale of Two Cities</field>
  <field name="year">1859</field>
  <field name="volume">1</field>
  <field name="number">1</field>
  <field name="pages">1</field>
  <field name="journal">Actually a Novel</field>
  <field name="fulltext">It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way-- in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.</field>
</doc>
Most of these fields are self-explanatory metadata for each article (the authors field should be a list of author names separated by commas). The uid field needs to be a unique identifier, but may be generated in any way you like; if DOIs are available, I usually use the format seen here (a doi: prefix followed by the article’s DOI). The fulltext field contains the article’s full text. Lastly, there are a few pieces of license and tracking data: the license and license_url fields allow you to clearly denote the terms under which this article is provided, and the data_source field is simply a way to indicate the original source of this article data to users. Each document should be wrapped in a doc element. Note that because these are XML documents, you must ensure that all content has XML entities escaped!
This is such basic XML that I often generate it in very simple ways, with things like shell scripts or sed commands, rather than using any sort of fancy XML processing system.3 If you do have documents that start out as XML, however (e.g., Elsevier API documents), you could almost certainly write an XSLT stylesheet that transforms them directly into RLetters’ format in a single stroke.
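As a middle ground between sed one-liners and full XSLT, Python’s ElementTree will build the doc element and handle the entity escaping for you; a minimal sketch (the field names here simply mirror the example above):

```python
import xml.etree.ElementTree as ET

def build_doc(fields: dict) -> str:
    """Serialize one article's fields into RLetters' <doc> format.

    ElementTree escapes XML entities (&, <, >) in field values automatically,
    so ampersands in titles or full text come out as proper entities.
    """
    doc = ET.Element("doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", attrib={"name": name})
        field.text = str(value)
    return ET.tostring(doc, encoding="unicode")
```

This removes the most common failure mode of hand-rolled generation: forgetting to escape an ampersand buried in an article title.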
Getting a Server Running
Okay! Now you have a corpus. You’ll soon find that this is the vast, vast majority of the work involved in any DH project… To get the server itself running, you’ll first need to install a few prerequisites:
- Python 2: Ansible currently still prefers Python 2.x, though there’s some tech-preview support for Python 3 in recent releases. Install this through your Linux distribution or Homebrew.
- Ruby: You’ll need a fairly recent (2.x) version of Ruby, with Bundler installed. If you aren’t a regular Ruby user, install this through your Linux distribution or Homebrew. If you are, I strongly recommend rbenv.
- VirtualBox: The server will run inside a virtual machine, which will be run by VirtualBox, a great free virtual machine system.
- Vagrant: Vagrant is a system for automating the deployment of virtual machines. It’ll be telling VirtualBox what to do.
Deployment of an RLetters instance is significantly aided by the fact that I’ve put together an Ansible playbook for firing up a basic server. The first thing you’ll want to do is install Ansible (and its Python dependencies), and clone the git repository here.
With that done, run bundle in the ansible-playbook directory (from the terminal/console) to download and install our Ruby dependencies. Inform the deployment scripts that you’ll be running a development instance (and so don’t want tight security) by editing the file deploy/group_vars/all and changing the appropriate value to true. With that, starting a server takes a single command.
This will not be a quick process. It will start by downloading an empty virtual machine image (running CentOS 7.2, for those interested), and will then install all of the dependencies and server components required to run RLetters on the virtual machine (using Ansible). If the process finishes successfully, you will be running an RLetters server.
To temporarily shut down the server, type vagrant suspend, which will freeze the server’s RAM and CPU state so that it can be restarted later with vagrant resume. To destroy the server, type rake down. Note that rake down will completely erase the server and all of its data; do not run this command if you still have analyses to perform.
Using the Server
Now you have a server running. How do you use the thing? First, you need to get your XML files loaded into the server’s Solr instance. Via port forwarding, Vagrant has configured your virtual machine such that http://localhost:9999 should point to the Solr server on the virtual machine. This means that the Solr instance that powers RLetters can be found at http://localhost:9999/solr.
Solr requires documents to be uploaded via an HTTP POST request to a particular URL, /solr/<core>/update. To upload a document called doc.xml to the server using curl, run:
curl http://localhost:9999/solr/collection1/update --data-binary @doc.xml -H 'Content-type:application/xml'
Note that for efficiency, you can batch multiple documents (each wrapped in its own <doc> element) into a single XML file and upload that. If that command finishes without errors, the documents should now be available on your Solr server; running a quick query against the local instance (e.g., visiting http://localhost:9999/solr/collection1/select?q=*:* in your browser) should list a few of those documents.
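Standard Solr XML update messages wrap batched documents in an <add> element; assuming that’s what your instance expects, combining and uploading per-article files might look like this (the collection1 URL matches the curl example above):

```python
import urllib.request
from pathlib import Path

def batch_payload(doc_files: list) -> bytes:
    """Concatenate single-<doc> XML files into one Solr <add> update message."""
    docs = "".join(Path(f).read_text(encoding="utf-8") for f in doc_files)
    return ("<add>" + docs + "</add>").encode("utf-8")

def upload(payload: bytes, url: str = "http://localhost:9999/solr/collection1/update") -> int:
    """POST the batch to Solr and return the HTTP status code."""
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/xml"}
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Batching a few hundred documents per request is dramatically faster than one curl invocation per article.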
Appendix: From Development to Production
If you’d like to run such a server publicly, obviously you’ll need to know a little more about server management, so I’ll only leave a few quick notes here:
- You’ll want to remove the iptables rule that allows inbound VM connections to the Solr server on port 8080 – you don’t want external traffic hitting that server directly.
- You must change the administrator username and password, accessible at http://localhost:8888/admin, which are set to insecure defaults.
- You can also edit the branding and logo in the administration panel.
- You should evaluate the PostgreSQL and Tomcat configurations to make sure they’re suitable for your setup.
I’m sure some of my readers will be running Windows. If you are… well… sigh. It’s a whole lot harder for me to help you, as almost all of the standard tools for DH data processing and management are old-fashioned terminal scripts or applications, the majority of which are really hard to set up on a Windows box. Not impossible, by any means – investigate the Cygwin or GnuWin32 projects – but not something that I’ve had to do in a long time, and fraught with a number of hard-to-detect bugs.