Frequencies, Concordances, Collocates

A Voyant analysis of several terms in 19th century texts in chronological order.

Presented with the tools for analyzing large amounts of text, I jumped at the chance to test whether my intuition could be corroborated exhaustively rather than partially, and I developed the following research question: Does 19th century U.S. fiction alert us to the waning of custom as an organizing principle for society and the rise of the law, particularly federal law, in the United States? As is clear from the above, I very much believed that such a dynamic or relationship did exist. Put into a more recognizable statistical formulation, I hypothesized (H1) that fiction (the custom sketch) shows the waning of custom and the rise of law as an organizing principle for society. My task was to reject the null hypothesis (H0): The examined texts do not show the waning of custom and the rise of the law as an organizing principle for society.

I composed a corpus of texts readily available on Project Gutenberg, made up of works by Washington Irving, Nathaniel Hawthorne, and Herman Melville, each of whom wrote custom sketches to one degree or another. I also added texts by James Fenimore Cooper, Edgar Allan Poe, and Harriet Beecher Stowe. The last and latest author was Charles Chesnutt, an African-American writer who wrote at the end of the 19th century but whose work often alludes to the custom sketch. I ran this corpus through two text mining tools, Voyant and AntConc, as well as MALLET, a machine learning application that applies statistical techniques to texts in order to generate topics (i.e., groups of words that seem to have a statistical relationship to each other across a text or set of texts). With Voyant and AntConc, I searched for words and terms that I thought might help me establish a relationship between custom and the law. Initially, I meant to use MALLET to see if it would generate topics around custom and/or the law. In a way, my intent was to “validate” my findings independently of my targeted searches. MALLET did not return topics that seemed to be related to custom and/or the law, but it helped me identify other terms that I could use in my targeted searches. While I searched for a good number of words that seem related to custom and/or the law, in this essay I focus primarily on the results I obtained for the words custom, law, manner, and character (the last two suggested by MALLET).

My results were decidedly mixed, and here I must acknowledge that, at first, I thought they were inconclusive or null. That’s because my first pass at these texts was, again, very naïve. For example, because Project Gutenberg provides good quality (.txt) files for many literary works in the public domain, I assumed that I did not need to clean the data, or at least to consider it more carefully. Specifically, while I was searching for the word “law” in the corpus, I did not think to remove the legal copyright information at the beginning of each text file. Not removing that information produced some false positives. Nor did I look carefully at every text file to confirm that the text(s) I assumed were in the file were actually there. Deep into my analysis, I discovered that I had mislabeled Herman Melville’s Omoo as The Piazza Tales and had not in fact downloaded The Piazza Tales. Though a simple error, it was a grave one, as many of the texts in The Piazza Tales are custom sketches. When I realized my mistake, I incorporated the collection of short stories and removed the separate “Bartleby, the Scrivener” text, since that short story is included in The Piazza Tales.

Moreover, in my first pass, I took for granted that my research question and hypothesis regarding the waning of custom and the rise of the law were fundamentally about a development through time (i.e., across the 19th century, from 1800 to 1900). When I analyzed the texts with the digital tools, I did not put them in chronological order, so while the corpus could still address the question of the prominence of the terms in the 19th century, there was no way to map the word frequencies across time. In effect, I did not optimize my dataset to answer the question I was asking. In my second pass, I put the texts in order by labeling them in the following format: “001_authtitle.txt,” where 001 corresponded to the earliest published text and 040 to the text published last.
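The two cleanup steps described above (removing the Project Gutenberg legal text and labeling files chronologically) can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used for this corpus. It assumes each file contains the “*** START OF … PROJECT GUTENBERG EBOOK …” and “*** END OF …” marker lines that Gutenberg plain-text files typically carry, though the exact wording varies between files. The function names and the slug format are hypothetical.

```python
import re

# Marker lines commonly found in Project Gutenberg .txt files. The wording
# varies ("THE" vs "THIS"), so this pattern is an assumption, not a guarantee.
START_RE = re.compile(
    r"^\*\*\*\s*START OF TH(?:E|IS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF TH(?:E|IS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return only the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    body_start = start.end() if start else 0
    end = END_RE.search(text, body_start)
    body_end = end.start() if end else len(text)
    return text[body_start:body_end].strip()


def chronological_name(position: int, author: str, title: str) -> str:
    """Build a file name such as '001_irvingsketchbook.txt'.

    `position` is the text's rank by publication date (1 = earliest).
    The slug format is a hypothetical reading of '001_authtitle.txt'.
    """
    slug = re.sub(r"[^a-z0-9]", "", (author + title).lower())
    return f"{position:03d}_{slug}.txt"


# Invented sample text, used only to show the behavior.
sample = (
    "The Project Gutenberg eBook of Example\n"
    "This eBook is subject to copyright law.\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "An old custom of the village.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License and trademark law notices.\n"
)
clean = strip_gutenberg_boilerplate(sample)
```

In the invented sample, both occurrences of “law” sit in the front and back matter, so they disappear from `clean`. A file that lacks the markers passes through unchanged, which is why it remains worth opening each file to confirm what it contains.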

Figure 1: This Voyant visualization shows findings for the terms “law,” “savage,” and “custom.” The texts on the x axis are listed alphabetically, not in chronological order.
Figure 2: This Voyant visualization shows findings for the terms “law” and “custom.” The texts in the corpus are now listed chronologically.

Comparing the two images above, it seems clear that while Voyant is correctly identifying the relative frequency of the terms, that’s all it’s doing (all it can do). Figure 1, unsurprisingly, looks like statistical noise. However, figure 2 should give us pause about wholly dismissing a trend: while there is variation in the frequency with which both “law” and “custom” appear in these texts, there seems to be an increase in the use of “law” in absolute terms and a decrease in the use of “custom.” To be sure, there seems to be a good deal of usage of “law” at the beginning of the 19th century, but most of that usage occurs in the texts of James Fenimore Cooper. One factor may be that Fenimore Cooper was particularly interested in the law, but putting that speculation aside, he was a prolific writer who published actively from the 1820s through the 1840s, while most of the other texts in the corpus were published between the 1840s and 1850s. The sheer amount of Fenimore Cooper’s writing, then, could be skewing the results for “law” so that it seems more prominent early on. It would be interesting to assign standardized weights by author, akin to the use of z scores, to see whether the graph would look the same or a more generalized trend would emerge.
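One possible reading of “standardized weights by author” is to compute each text’s relative frequency for a term and then express it as a z score against that author’s own mean and standard deviation, so that a prolific author does not dominate the picture. The sketch below illustrates that reading only; it was not part of the analysis. The tokenizer is a deliberately simple assumption and will not match Voyant’s counts exactly, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from statistics import mean, pstdev


def relative_frequency(text: str, term: str) -> float:
    """Occurrences of `term` per token, using a simple lowercase tokenizer.

    Tokens are runs of the letters a-z, so "law's" counts as "law" plus "s".
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return tokens.count(term.lower()) / len(tokens)


def zscores_within_author(rows):
    """Standardize (author, title, frequency) rows within each author.

    Returns {title: z}. An author with a single text, or with identical
    frequencies across texts, gets z = 0.0, since the standard deviation
    is zero and the z score is otherwise undefined.
    """
    by_author = defaultdict(list)
    for author, _title, freq in rows:
        by_author[author].append(freq)

    result = {}
    for author, title, freq in rows:
        values = by_author[author]
        spread = pstdev(values)  # population standard deviation
        result[title] = 0.0 if spread == 0 else (freq - mean(values)) / spread
    return result
```

A z score of this kind answers a narrower question than the raw graph: whether a given text uses “law” more or less than is usual for its own author. Plotting these values in chronological order would show whether any trend survives once each author’s baseline is removed.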

Figure 3: This Voyant visualization shows findings for the terms “law,” “manner,” and “custom.” The texts in the corpus are listed chronologically.

Adding “manner” to “custom” and “law,” as in figure 3, one could argue, via triangulation, that there does seem to be a general trend to support my hypothesis. That is because manners can be associated with custom insofar as good manners are an established form of custom or social norms. From figure 3 above, it’s clear that “manner” appears much more frequently than “custom” as an absolute value, but both terms display similar trends: commonly used through the 1840s and then declining afterwards. As with “custom,” the relative frequency of “law” outstrips that of “manner” over time.

But perhaps “law” overperforms in comparison to “custom” and “manner” in these texts while fitting a pattern we can observe in other terms. In other words, perhaps “law” belongs in another bucket of terms that exhibit the same possible trend. To test for this, I ran the terms “time” and “character,” suggested to me by MALLET’s topic modeling [link to MALLET output], and I am willing to say that we can rule this possibility out.

Figure 4: This Voyant visualization shows findings for the terms “law,” “time,” and “custom.” The texts in the corpus are listed chronologically.
Figure 5: This Voyant visualization shows findings for the terms “law,” “character,” and “custom.” The texts in the corpus are listed chronologically.

If we accept MALLET’s statistical analysis that both “character” and “time” are significant in these texts, Voyant shows that their relative frequency is prominent but does not exhibit a pattern, whether similar to that of “law” or of any other kind. Voyant seems to corroborate what MALLET infers from these words/topics: while their frequencies look like random statistical noise, we should focus on the steadfastness of their usage across the 19th century. That is, they matter. But insofar as “law” suggests a trend, it does seem to be exceptional, standing out as gaining strength while “manner” and “custom” lose strength.
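The contrast drawn here between a term that trends and terms that merely fluctuate is read off the graphs by eye. One way to put a number on it, offered only as a sketch and not as something this analysis did, is a Spearman rank correlation between each text’s chronological position (the 001 to 040 prefix) and the term’s relative frequency in that text. The function names below are hypothetical, and the calculation is written out with the standard library so that each step is visible.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5
```

Here `xs` would be the chronological positions and `ys` the relative frequencies of a single term. A value near +1 would indicate a term that rises steadily across the corpus, a value near -1 a steady decline, and a value near 0 the kind of patternless variation attributed above to “time” and “character.” With only forty texts, and several by the same author, any such coefficient would still be suggestive rather than conclusive.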

But we cannot take these results to be terribly conclusive, only suggestive. That is because even if we could more definitively show the rise of the law in fiction across the 19th century, we could not make a determination as to whether fiction was alerting us to the rise of the law independent of the historical context. In fact, we should not at all expect this kind of independence given the rise of the abolitionist cause particularly from the 1830s onward, the passing of the Fugitive Slave Act of 1850, and the culmination of events that led to the Civil War (1861-1865). The Civil War was about slavery as an economic system and the treatment of African Americans as chattel and subhuman in order to ideologically justify the theft of their labor and ill treatment, but the contestation of these issues in the legal arena was both a precursor and contemporary of the outbreak of war, never mind a successor to the war from Reconstruction onward. That is, the legality of slavery was a source of constant debate until and during the Civil War, and the status of formerly enslaved peoples under the law continued to be a matter of grave importance from Reconstruction onward. Fiction that would wholly ignore these material conditions does not seem likely. To me, that kind of fiction would probably entail an extreme level of abstraction that would suggest an attempt to abandon reality either to project a societal order or cohesion that does not exist or to escape the trauma of social anomie.

For my hypothesis to work, it is not necessary for fiction, custom, and law to be either dependent on or independent of each other. Indeed, interdependence between these “variables” seems most reflective of 19th century society. However, that does not resolve the issue of my research question and hypothesis. Given the results I have highlighted above, my hypothesis that over the course of the 19th century custom gave way to the law as an organizing principle for society cannot be proven, and I have failed to reject the null hypothesis: The examined texts do not show the waning of custom and the rise of the law as an organizing principle for society. But the results I’ve gleaned so far, while not sufficient to reject the null hypothesis, do suggest cause for further study. Indeed, perhaps what these digital tools show is a shift in usage and meaning operating on different registers, though as designed, this proof-of-concept project would not be able to detect that kind of granularity. To paraphrase Gertrude Stein, there is a there there, but what it is, and its shape, remains elusive.