I’m excited to announce that I’m formally releasing NERtwork, a Shell and Python script-enabled workflow that takes a directory of plain text files and constructs a network of co-occurring named entities (people, locations, organizations) within that collection. The name is a horrible pun, and I will not be apologizing for it.
NERtwork is a modular workflow that consists of three scripts and a few manual or open-ended steps.
Identifying Named Entities
Batchner is a Shell script that uses the Stanford Named Entity Recognizer to identify names, locations, and organizations in a directory of text files, and extracts the document name, entity name, entity type, and occurrence count into a spreadsheet. This script alone may be of use to researchers, even if they aren’t interested in creating a network.
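The core extraction step can be sketched in a few lines of Python. This is not batchner itself (which is a Shell script); it's an illustrative stdlib-only sketch assuming Stanford NER's slash-tag output format (e.g. `Barack/PERSON Obama/PERSON visited/O`), where consecutive tokens with the same non-O tag are grouped into one entity and counted:

```python
from collections import Counter

def count_entities(tagged_text):
    """Group consecutive tokens sharing the same non-O tag into one
    entity and count occurrences: {(entity, type): count}."""
    counts = Counter()
    current_tokens, current_tag = [], None
    for token in tagged_text.split():
        word, _, tag = token.rpartition("/")
        if tag != current_tag:
            # Tag changed: flush the entity we were accumulating
            if current_tag and current_tag != "O":
                counts[(" ".join(current_tokens), current_tag)] += 1
            current_tokens, current_tag = [], tag
        current_tokens.append(word)
    if current_tag and current_tag != "O":  # flush the final entity
        counts[(" ".join(current_tokens), current_tag)] += 1
    return counts

tagged = "Barack/PERSON Obama/PERSON visited/O Chicago/LOCATION ./O"
print(count_entities(tagged))
# Counter({('Barack Obama', 'PERSON'): 1, ('Chicago', 'LOCATION'): 1})
```

Running this per file and prepending the document name gives rows in the document / entity / entity type / count shape that batchner produces.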
At present, it only works with Stanford’s English classifier, though it could likely be altered to work with other languages (if you’re interested, please get in touch!). It also currently only works with the 3-class model, and I don’t intend to change that (though if there’s a good use case for the additional classes, I would be interested in learning about it!).
Following batchner, the results are likely to be a little…messy. You may have first names with no context, street addresses (in which you may or may not be interested), various shortened or abbreviated names, and typos, misspellings, and OCR mistakes. The best way to resolve this is to use OpenRefine, which can summarize the contents of a column and even algorithmically identify likely variations. Particularly with larger datasets and/or lower quality OCR, this is a monumental task. Part of the benefit of the modularity of this workflow is that it’s pretty easy to come back to this step, make improvements, and generate new networks.
Merging name variants in OpenRefine creates duplicate rows, so there is a Python script (developed by Devin Higgins) that finds all rows where the document, entity, and entity type are the same and adds their counts together.
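The deduplication logic amounts to a group-and-sum over the (document, entity, type) key. A minimal sketch with hypothetical rows (the actual script's implementation may differ):

```python
from collections import defaultdict

# Hypothetical rows: after OpenRefine merges name variants, the same
# (document, entity, type) key can appear more than once.
rows = [
    ("letter_01.txt", "Fannie Lou Hamer", "PERSON", 2),
    ("letter_01.txt", "Fannie Lou Hamer", "PERSON", 3),  # duplicate key
    ("letter_01.txt", "Ruleville", "LOCATION", 1),
]

# Sum counts for identical (document, entity, type) keys
combined = defaultdict(int)
for doc, entity, etype, count in rows:
    combined[(doc, entity, etype)] += count

print(dict(combined))
# {('letter_01.txt', 'Fannie Lou Hamer', 'PERSON'): 5,
#  ('letter_01.txt', 'Ruleville', 'LOCATION'): 1}
```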
At this point, it may also be a good idea to run a join to bring in additional metadata, enabling you to subset your data based on the author, archival series, date, or any other information you may have about your documents. Because everyone’s metadata will be very different, NERtwork doesn’t include scripts to do this, but it does point to some examples to help users.
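Since every collection's metadata differs, NERtwork leaves the join to you; one simple pattern is a left join keyed on document name. All names and fields below are hypothetical:

```python
# Hypothetical metadata table, keyed by document name
metadata = {
    "letter_01.txt": {"author": "Fannie Lou Hamer", "year": 1964},
    "letter_02.txt": {"author": "Unknown", "year": 1965},
}

entity_rows = [
    ("letter_01.txt", "Ruleville", "LOCATION", 1),
    ("letter_02.txt", "MFDP", "ORGANIZATION", 4),
]

# Left join: attach metadata columns to each entity row
joined = [
    (doc, entity, etype, count,
     metadata.get(doc, {}).get("author"),
     metadata.get(doc, {}).get("year"))
    for doc, entity, etype, count in entity_rows
]

# Now you can subset on any metadata field, e.g. documents from 1964
subset = [row for row in joined if row[5] == 1964]
print(subset)
```

With larger tables, a pandas `merge` on the document-name column does the same thing in one call.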
The final script in the workflow creates a bipartite network projection from the entity counts using NetworkX, then saves node and edge lists in CSV format. In slightly plainer terms, it first creates a network graph with connections between every document and every entity, then connects entities to each other based on shared connections to the same document, then removes the documents from the graph. In even plainer terms, it connects entities based on how often they have appeared in the same document. Projection is something you can do in some visual network graph software, such as Gephi’s Multimode Networks Transformation Plugin, but even a rather small dataset can overwhelm a machine’s RAM and crash it.
The network creation script takes a few different options, allowing you to create several different subsets of the data. Using flags, you can elect to create a network of all three types of entities together, of solely person, location, or organization entities, or all four networks at once. You can also set a minimum weight, which completes the entire projection, then only saves edges (and their corresponding nodes) with a weight above the specified number.
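The projection and minimum-weight filter described above can be sketched in plain Python: two entities get an edge whose weight is the number of documents they share, and the filter is applied only after the full projection. The document–entity mapping here is hypothetical, and the actual script uses NetworkX (whose `bipartite.weighted_projected_graph` does the same job):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical bipartite data: document -> set of entities it contains
doc_entities = {
    "doc1.txt": {"Fannie Lou Hamer", "Ruleville", "MFDP"},
    "doc2.txt": {"Fannie Lou Hamer", "MFDP"},
    "doc3.txt": {"Fannie Lou Hamer", "Ruleville"},
}

def project(doc_entities, min_weight=1):
    """Project the document-entity graph onto entities: edge weight =
    number of shared documents. Edges below min_weight are dropped
    only after the full projection is computed."""
    weights = defaultdict(int)
    for entities in doc_entities.values():
        # Every pair of entities in the same document shares an edge
        for a, b in combinations(sorted(entities), 2):
            weights[(a, b)] += 1
    return {edge: w for edge, w in weights.items() if w >= min_weight}

edges = project(doc_entities, min_weight=2)
print(edges)
# {('Fannie Lou Hamer', 'MFDP'): 2, ('Fannie Lou Hamer', 'Ruleville'): 2}
```

Writing `edges` (and the entities they touch) out with the `csv` module gives the edge and node lists the script produces.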
The final step is to take the node and edge lists and put them into whatever software you wish to visualize and analyze them. I wanted a simple and universal output format for teaching and other beginners, but eventually I would love to be able to produce gexf, gml, or d3/html outputs.
During development, I did a lot of testing using US Presidents’ Inaugural Speeches (assembled by Alan Liu). It’s a great set because it’s small (fast processing time), the text is nearly 100% accurate, and the entities are precise (in other words, full names are always used, there are no abbreviations, etc.).
It’s not a great set for the networks, because it’s so small, the speeches are relatively short, and the topics are pretty homogeneous. Because of the small size of the dataset, you’re generally able to see entities used by many (state names, Roosevelt, Jefferson, Senate, Army, Navy, Capitol, etc.), and those used only by one president.
Above: Full NER graph from US inaugural speeches.
Below, clockwise from top left: a cluster mostly from Reagan’s ‘81 speech, referencing DC memorials and soldiers lost in notable battles in US history; many of the entities Obama used haven’t been used by anyone else, such as Seneca Falls, Afghanistan, Africa, and Medicare and Medicaid; William Henry Harrison went heavy on the deep historical allusions; Taft talking about the new territory gained from the Spanish-American War and US control of the Western Hemisphere.
The Fannie Lou Hamer Papers NER dataset is significantly larger, and I haven’t done much with the data beyond generating some basic graphs. If anyone is interested in digging into these any deeper, I would be happy to chat and help out however I can! Here are a few of the graphs, just to show what 640 documents (13 MB of txt files) looks like.
All person names in the entire collection with an edge weight above 3.
All location names in the entire collection with an edge weight above 5. Although it’s pretty hairball-y, the clusters (colored here by modularity) break nearly perfectly into Mississippi towns and street addresses, cities and states elsewhere in the US, and foreign countries.
All named entities from the Mississippians United To Elect Negro Candidates Series.
All named entities with edge weights above 5 from the Mississippi Freedom Democratic Party Series.
Inspiration and Acknowledgements
The genesis of this project came from the desire to create open data from texts and other datasets that are under copyright or otherwise unavailable. Several years ago, when Thomas Padilla and I were both working at Michigan State, the MSU Library acquired hard drives from vendors, and we talked quite a bit about how we might use them, given copyright and licensing restrictions on them. Robots Reading Vogue, a project from Lindsay King and Peter Leonard at Yale, was a major inspiration in thinking through research access for people who didn’t have access to the data itself.
I worked with Thomas, Devin Higgins, and later Megan Kudzia to get access to data from a couple of Gale CENGAGE’s Archives Unbound collections. The Fannie Lou Hamer Papers were of particular interest to me and to many students and professors in the MSU History Department. The incredibly low quality of the data from Gale CENGAGE put the project on hold for quite a while, as I discussed at the Shaping Humanities Data preconference workshop at DH 2017, but we were able to overcome that thanks to the generous funding of the MSU Department of History. Joe Karisny, Olivia Ramos, and Kellen Saxton (undergraduate student assistants in LEADR) transcribed all of the natural language materials (i.e. everything besides receipts, cancelled checks, membership lists, etc.). Jen Andrella, in her capacity as a graduate assistant in LEADR, refined the named entities in the Fannie Lou Hamer Papers dataset, making the results intelligible. The named entities portion of the Fannie Lou Hamer Papers open data project is available on GitHub.
Finally, I would be remiss if I did not mention that much of my personal inspiration to keep plugging away at this project comes from Fannie Lou Hamer herself. This workflow has always been entwined with her papers for me, and I had hoped to be able to use the text collection and a growing set of tools to help more students engage with and learn about her life’s work. Her fearlessness in fighting for voting rights and organizing sharecroppers and her sharp wit and tenacity that struck fear into LBJ and other politicians were truly remarkable.
Please get in touch with me if you have any questions about NERtwork. I would also love to know how people are using it!