Below is the text of a talk delivered at the Digital Antiquarian conference in May 2015. (The slides can be downloaded from the conference website). I am grateful to the conference organizers, Molly O’Hagan Hardy and Tom Augst, and the staff of the American Antiquarian Society, for the opportunity to present my work “under the dome.”
When it comes to the digital humanities, my most strongly held belief is that the field, in its most powerful instantiation, can perform a double function: facilitating new digital approaches to scholarly research, and, just as powerfully, calling attention to what knowledge, even with these new approaches, still remains out of reach. I will illustrate this double function through the example of the TOME project, a digital tool that I’ve been developing with my colleague at Georgia Tech, Jacob Eisenstein, and a team of several graduate and undergraduate students. Our tool employs topic modeling, a technique that derives from the field of machine learning, to support the interactive thematic exploration of digitized archival collections. (And more on that soon.)
But since our test archive consists of a set of abolitionist newspapers, including many held at the AAS, I thought I’d use this particular occasion to work through some of the things that our tool, and the process of its development, have taught us about nineteenth-century knowledge production, before considering how digital tools, more generally, do (and do not) help to bring that process of knowledge production to light.
To this end, I want to introduce two concepts that, to me, strongly resonate in both historical and contemporary contexts. These are carework and codework, as the title of this talk indicates, and I want to begin by briefly explaining what I mean by each.
“For those of us who do the work of editing in large part because we envision ourselves as careworkers for the commons: how do we articulate this work?”
– Sarah Blackwood, “Editing as Carework: The Gendered Labor of Public Intellectuals” (Avidly, 2014)
Carework, as we know, most commonly refers to the subset of feminized reproductive labor that is undertaken out of a sense of compassion for or responsibility to others, rather than with a goal of monetary gain. The concept is meant to be problematic, as Natalia Cecire observes: if you do it because you care, it’s not supposed to be work. And yet (ask anyone who’s ever cared for another) it is most certainly work! Sarah Blackwood, in the essay you see quoted above, has suggested that carework is central to the production of scholarship, especially public scholarship. Following from this line of reasoning, one of the main questions I want to ask today is how DH might be better served if we re-envisioned our work as carework, not just in the obvious cases, like project management or mentoring, but also in relation to practices like tool-building that are so often framed in masculinized terms, and that are so central to the field.
“Codework: The computer stirring into the text, and the text stirring the computer.”
– Alan Sondheim, “Introduction: Codework” (American Book Review, 2001)
The second concept I want to engage is that of codework. And it’s important to note that codework, at least as it was originally conceived, has nothing to do with labor. Alan Sondheim, the poet-theorist who coined the term, employed it to refer to the genre of electronic literature that mixes computer code with natural language. According to Sondheim, codework is characterized by “the computer stirring into the text, and the text stirring the computer.” In applying the concept of codework to DH, I want to explore how we might facilitate the creation of digital tools for stirring into the archive, while also allowing the archive to stir the tool. I will argue, moreover, that by thinking about digital tools in terms of codework, we can more fully account for the carework involved in their creation, as well as in the creation of the archives that, in designing our tools, we seek to more fully understand.
I’ve been fortunate enough to spend time at the AAS immersed in some of these archives, although it’s actually at the New York Public Library that you can read the correspondence of one individual responsible for this content: Lydia Maria Child, who, between 1841 and 1843, edited the official newspaper of the American Anti-Slavery Society, The National Anti-Slavery Standard. Child, as we know, was at that time most famous as a novelist, but she also wrote stories for children, and had published a bestselling cookbook. So when the Society was looking to broaden its reach, William Lloyd Garrison suggested that Child be appointed editor of the Standard. Garrison hoped she could “impart useful hints to the government as well as to the family circle,” thereby inviting women into the abolitionist fold. But did she? And if so, how effective, or how widely adopted, was this change in topic or tone?
Questions like these, about the evolution of issues and ideas, were what prompted our work on the TOME project. TOME, short for Interactive TOpic Modeling and MEtadata Visualization, is, as I mentioned before, a tool designed to support the exploratory thematic analysis of digitized archival collections. It rests upon the technique of topic modeling, a technique developed by computational linguists that, by automatically analyzing the words or phrases that tend to appear together in a collection of documents, helps to identify their thematic or stylistic patterns. Topic modeling, in other words, is a technique that stirs the archive. To give a specific example, if you wanted to find out whether Child’s editorial oversight influenced the contents of the Standard, you could run a topic model, and, by taking the additional step of incorporating metadata, such as date or editor, into the model, as we did, you could see which topics were most prevalent in the issues that she edited.
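To make the technique a little more concrete, here is a minimal sketch in Python of a topic model followed by a metadata comparison. It is not the TOME implementation. It assumes a tiny invented corpus, a hypothetical `editor` label for each article, and a bare-bones collapsed Gibbs sampler for LDA. It also averages topic shares by editor after the model is fit, which is a simpler stand-in for building metadata into the model itself.

```python
import random
from collections import defaultdict


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a tiny LDA topic model with collapsed Gibbs sampling.

    docs is a list of token lists. Returns (doc_topic, topic_word_counts):
    doc_topic[d][k] is the smoothed share of document d given to topic k,
    and topic_word_counts[k][w] is how often word w was assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [defaultdict(int) for _ in range(num_topics)]
    topic_totals = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every word token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        assignments.append(zs)
        for w, k in zip(doc, zs):
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][w] += 1
            topic_totals[k] += 1

    # Repeatedly resample each token's topic given all the other tokens.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][w] -= 1
                topic_totals[k] -= 1
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][w] + beta)
                    / (topic_totals[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][w] += 1
                topic_totals[k] += 1

    doc_topic = [
        [(c + alpha) / (len(doc) + alpha * num_topics) for c in counts]
        for doc, counts in zip(docs, doc_topic_counts)
    ]
    return doc_topic, topic_word_counts


def topic_share_by(metadata_values, doc_topic):
    """Average the per-document topic shares within each metadata value."""
    groups = defaultdict(list)
    for value, shares in zip(metadata_values, doc_topic):
        groups[value].append(shares)
    return {
        value: [sum(column) / len(rows) for column in zip(*rows)]
        for value, rows in groups.items()
    }


# Invented toy articles; the editor labels are hypothetical metadata.
articles = [
    ("Child", "soap soda acid gallons water soap boil".split()),
    ("Child", "flour butter sugar eggs bake flour oven".split()),
    ("other", "slavery abolition congress petition slavery freedom".split()),
    ("other", "emancipation slavery law congress freedom abolition".split()),
]
editors = [editor for editor, _ in articles]
docs = [tokens for _, tokens in articles]

doc_topic, topic_word_counts = fit_lda(docs, num_topics=2)
by_editor = topic_share_by(editors, doc_topic)
for editor, shares in by_editor.items():
    print(editor, [round(share, 2) for share in shares])
```

With a corpus this small the topics themselves are not meaningful. The point is only the shape of the workflow: fit a model, then compare topic shares across a metadata field such as editor or date.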
When you run a topic model, the output typically takes the form of lists of words and percentages, like the ones that you see at left, which I’ve reformatted from plain text, stripped of some statistical information, and, for ease of reference, given descriptive names. Here you see several of the most prominent topics in four distinct sets of newspapers: one consisting of all of the articles published in the Standard during the time when Child was editor; one of all of the articles published in the Standard over the course of its entire run (at least all that have been digitized to date); one of all of the abolitionist newspapers in our dataset; and then a set I find quite interesting, of only the black-owned and edited abolitionist newspapers that we have.
Within each topic, you see listed the words that appear together with the most statistical significance. So the word “soap” often appears in the same article as “acid,” as do “soda,” “gallons,” etc. So to begin to answer the question I just posed, about whether Child’s editorship coincided with a turn to more domestic issues: the preponderance of topics related to recipes and advice about the home (those I’ve labeled “cleaning,” “sewing,” and “baking”), as opposed to the topics dealing explicitly with slavery and its abolition that dominate the other sets of texts, suggests that yes, Child did have an impact on the Standard’s content and tone.
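The intuition behind words "appearing together" can be sketched with plain co-occurrence counting. This is a deliberately simpler technique than topic modeling, shown only for illustration. The four article snippets below are invented, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(articles):
    """Count, for every pair of words, how many articles contain both."""
    counts = Counter()
    for text in articles:
        # Use the set of words so each pair counts once per article.
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts


def companions(word, counts, top_n=3):
    """Return the words that most often share an article with `word`."""
    scores = Counter()
    for (first, second), n in counts.items():
        if first == word:
            scores[second] += n
        elif second == word:
            scores[first] += n
    return scores.most_common(top_n)


# Invented snippets standing in for digitized newspaper articles.
toy_articles = [
    "soap acid soda gallons",
    "soap soda water",
    "slavery congress petition",
    "soap acid",
]

pair_counts = cooccurrence_counts(toy_articles)
print(companions("soap", pair_counts, top_n=2))
```

Here “acid” and “soda” each share two articles with “soap,” so they surface as its closest companions. A topic model goes further by inferring groups of words probabilistically across the whole collection, which is also why its groupings still need interpretation.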
But in spite of the thematic associations that these topics so often suggest, there’s no inherent reason to believe that words grouped together on the basis of co-occurrence statistics really mean or prove anything. A topic model is, after all, a model. And for the model to be truly meaningful, domain experts—that’s us—must be able to probe the semantic associations that the model proposes, and seek out additional perspectives on the model, as well as on the archive itself. In other words, they need a tool that facilitates codework.
So what you see here, at left, is a screenshot from our visual interface, and now I’ll explain it a bit more fully. The interaction begins with your standard keyword search. So if you had a question like, “How did the discourse surrounding emancipation differ in white- vs. black-owned newspapers?” (and in fact, some sort of difference is suggested by the topics in the previous slide), you might type “emancipation” into the search box at the top of the page.
But rather than a list of links to specific documents, as you would see with a standard database search, what is displayed is a visualization of the topics that contain that keyword. We chose to display this information visually, rather than in list form, because we wanted to convey the degree of abstraction that a topic model necessarily entails. We wanted a dynamically generated visualization because, with the notion of codework in mind, we also wanted the archive to be able to stir the tool.
So that’s the rationale that underlies the colorful “race tracks” that you see. Each color represents a topic as it appears in the archive over time. The topics are ranked from top to bottom in terms of relevance to the initial search query—here, “emancipation.” (The height of each topic block corresponds to its overall prevalence in the corpus). So you can see how the various topics that deal with “emancipation” wax and wane in the archive over time. Using the drop-down menu at the top, you can select specific subsets of newspapers; and then you can observe how, for instance, topics related to “emancipation” feature quite prominently in white-owned newspapers, but are much less prevalent in black-owned newspapers, which are more concerned with the lived experience of slavery: how you’re perceived as a slave, how you might escape northward from slavery to freedom, what sorts of things (and people) you experience along the way, and the like.
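The data behind a view like this can be sketched as follows. The numbers and topic names are invented, and scoring relevance as the probability of the keyword within each topic is an assumption for illustration, not a description of how TOME computes it.

```python
from collections import defaultdict

# Hypothetical model output: the probability of each word under each topic.
topic_word = {
    "legislation": {"emancipation": 0.08, "congress": 0.10, "law": 0.07},
    "escape": {"north": 0.09, "freedom": 0.08, "emancipation": 0.01},
    "baking": {"flour": 0.12, "butter": 0.09},
}

# Hypothetical documents, each with a year and its topic shares.
documents = [
    {"year": 1841, "topics": {"legislation": 0.7, "escape": 0.2, "baking": 0.1}},
    {"year": 1841, "topics": {"legislation": 0.1, "escape": 0.1, "baking": 0.8}},
    {"year": 1842, "topics": {"legislation": 0.3, "escape": 0.6, "baking": 0.1}},
]


def rank_topics(query, topic_word):
    """Return the topics that contain the query word, most relevant first."""
    hits = [
        (probabilities[query], name)
        for name, probabilities in topic_word.items()
        if query in probabilities
    ]
    return [name for _, name in sorted(hits, reverse=True)]


def prevalence_by_year(topic, documents):
    """Return the mean share of `topic` among the documents of each year."""
    shares = defaultdict(list)
    for doc in documents:
        shares[doc["year"]].append(doc["topics"].get(topic, 0.0))
    return {year: sum(values) / len(values) for year, values in sorted(shares.items())}


for name in rank_topics("emancipation", topic_word):
    print(name, prevalence_by_year(name, documents))
```

Each ranked topic becomes one track, and its yearly prevalence gives the track its changing height. Restricting `documents` to a subset of newspapers before calling `prevalence_by_year` would correspond to the drop-down selection.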
In the live interface, clicking on a topic gets you more information about it: its keywords, its overall popularity in the dataset, its geographical distribution, and the documents in the dataset that best encapsulate its use. (There will also be a keyword-in-context view). With these features, which we’re hoping to complete in the coming academic year, the interface will facilitate a very useful new method of exploring archives according to the themes they contain.
But even still, there will be certain things that this tool will never be able to convey. Consider this evidence (at left) from another archive, the letters of Lydia Maria Child that are housed at the NYPL. Writing in March 1842, after a merger with the Pennsylvania Freeman required her to republish large amounts of its content in her paper, Child laments, “I cannot manage the paper at all as I would. Public documents of one kind or another crowd upon me, and since the union with the Freeman, I am flooded with communications, mostly of an ordinary character.” She admits to rewriting almost all of the content she receives in order to make more room for her own editorials, but even then, she can’t find enough space: “I fear to injure the interest of the cause and the paper by omission!”
So here you have an example of influence that operates in the negative; it could never be uncovered by our digital tool. Child’s own arguments—those that she believes will best advance the abolitionist cause—never make it into her paper, and therefore not into the archive. While we might infer, on the basis of her other writings, what Child might have argued, the “three editorials” she claims she’d have rather written remained uncomposed. What we do have in these letters, however, is a record of Child’s carework: her editing, of course, as well as the effort of arranging the articles on the page. We have evidence of her management of relationships—with the Freeman’s editors, with Garrison, with her good-for-nothing husband. (And that’s the subject for another talk entirely). How might we design an interface to exhume this work? Or, to phrase it in the terms of my talk today, is it possible to use codework to bring carework into view?
One of the most interesting aspects of Sondheim’s formulation of codework is how it describes a genre, as well as the effect of that genre on the reader. When a reader confronts computer code inserted into a poem or, as in this case (at left), presented as a poem, he or she isn’t supposed to be able to understand what the code actually does or means. The code is intended to prompt a figurative interpretation. And in fact, it’s this interpretive component of codework that has held the most resonance for digital humanities scholars over the years. Thinking about how we might incorporate interpretation into tools for archival exploration and discovery, and to do so with attention to carework, is one of the most exciting new directions for digital humanities scholarship, and in particular for the kind of scholarship that the Digital Antiquarian initiative is so well poised to be able to achieve.
Consider, for instance, this tool that the NYPL has developed for visualizing the metadata recorded in its finding aids. The tool employs a network model of visualization, in what I think is a very important acknowledgment of the distributed efforts that bind its holdings together. So what you see here is the network centered on the Lydia Maria Child Papers, which houses the letter I just had up on the screen. Clearly on view is how subject headings, such as “abolitionists,” span multiple archival collections. You can also see how letter-writers, such as Child and Loring, are linked to each other, as well as to others whose papers the Library contains. And from this image, I think, it’s possible to get a sense of the complexity of the relationships among the various abolitionists, even if it is impossible to make visible, let alone quantify, the carework they invested in maintaining these ties. Also implicit in this diagram, I think, is evidence of the work involved in the creation of archives—the bringing together of disparate collections in order to help constitute knowledge; the creation of metadata and other annotations that facilitate these collections’ discovery. This could be said to be a form of carework, too.
But even still, you don’t see evidence of the Freeman’s editors, the women whom Child characterized, in another letter to Loring, as “fussy [and] ignorant,” who sent her nothing but the “dullest communications,” filled with “bad grammar, and detestable spellings.” These women were not deemed important enough, by Child or by anyone else at the time who could have preserved their correspondence, for that correspondence to enter our archives today. And this is all to say nothing about the actual enslaved men and women, whose liberty was being argued about in these newspapers, but who were so rarely given the opportunity to speak for themselves, let alone to have that speech recorded in print.
This is a case where we might, yet again, recall the concept of codework—whether explicitly enabled by the tools we employ, or where I think, for the time being, it must lie: implicit in our interactions with each and every one of the digital tools that we, as scholars and as archivists, employ. If our tools do not prompt us, we must prompt ourselves to remain attentive to the “reading and revealing” of the codes that helped to constitute our sources—their original media format, of course, but also the social and political conditions of their making, and the contexts of their production, dissemination, and reception (Kirschenbaum 234). And then, of course, you have the various factors that contribute to their preservation and subsequent scholarly use.
What I hope this talk has allowed us to see is that codework, in its most capacious sense, should lead us to carework, because carework is the source of some of the most meaningful stories embedded in our archives. We in this room need no convincing. But this belief is what we “digital antiquarians” might bring to the digital humanities writ large. Our attention to the margins of texts, to the gaps in our archives—we must ensure that these features are acknowledged, if not always fully reconstituted, as we chart the shift from physical to digital archival form.