In “Against Cleaning,” Katie Rawson and Trevor Muñoz make an important contribution to the question of methodology in the digital humanities, especially in relation to preparing and working with large datasets of humanities information. Drawing on their experience with the New York Public Library’s What’s on the Menu? public dataset, Rawson and Muñoz problematize the notions of “data cleaning” and “messy” data, noting that the traditional view of these data tasks as technical problems in need of technical solutions elides the fact that data standardization may often result in “computing away” difference. As humanists, Rawson and Muñoz are attentive to the fact that “modern humanities have invested mental and moral energy into, and reaped insights from, studying difference.” They are, therefore, interested in working deftly and technically with large datasets while at the same time not deleting the granularity, specificity, and at times oddity of humanities materials and data. To remove this information would either remove the complexities that humanities research focuses on and attempts to produce, create a false version or impression of the data (and, thus, of the world and context from which this data emerges), or both. In highlighting this methodological, theoretical, and ontological problem, Rawson and Muñoz restore the nuance and particularity that most humanities scholars associate with their work.
In what seems to be an adaptation and transformation of close reading, Rawson and Muñoz begin their essay with an astute point about “data cleaning,” noting that the term itself is a kind of empty signifier that: presents as one task a range of activities that can vary greatly from researcher to researcher; obscures detailed descriptions of preparing data for analysis; suggests that the process is straightforward enough to have no impact on the analysis and findings; and assumes that there is not much to learn from inquiring into the inner workings of the process(es) of data cleaning. From these insights, Rawson and Muñoz develop an argument not only about why talking about the data cleaning process matters (the cleaning process is a kind of curation that is in itself a methodological and intellectual choice that necessarily affects the subsequent analysis), but also about the theoretical and ontological implications of the cleaning (an interpretation of the data that reconstructs a world that is either far removed from its actual context, counter to the researcher’s theoretical and ontological approach to their research and the world, or both). They develop this argument by tracing their work “cleaning” the NYPL menu dataset, noting that this cleaning aided the work of a crowdsourced transcription process (of scanned menus) but was “insufficient for scholarly inquiry. To ask research questions, [they] needed to create [their] own dataset, which would work in context with the NYPL dataset.” In other words, they learned to differentiate alternate spellings and/or syntax of menu items from menu items that provided new information. Dealing with these new details, however, posed a new problem of scalability: if digital humanities methods allow researchers to deal with large datasets, leaving in details that would make the data “messy” would represent a powerful limitation on the analytic and explanatory power of these new methods.
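The danger Rawson and Muñoz identify can be made concrete with a small sketch. The following is a hypothetical illustration, not their actual pipeline, and the menu strings are invented: an aggressive normalization routine treats a mere spelling variant (an accented character) and genuinely new information (a preparation noted in parentheses) identically, collapsing both into one “clean” value.

```python
import unicodedata

# Illustrative menu-item strings (invented, not from the NYPL dataset).
items = [
    "Chicken à la King",
    "Chicken a la King",
    "Chicken à la King (en casserole)",  # arguably a different dish
]

def naive_clean(s: str) -> str:
    """Aggressively normalize: drop parentheticals, strip accents, lowercase."""
    s = s.split("(")[0]                      # discards the parenthetical detail
    s = unicodedata.normalize("NFKD", s)     # decompose accented characters
    s = "".join(c for c in s if not unicodedata.combining(c))
    return s.lower().strip()

cleaned = {naive_clean(s) for s in items}
# All three strings collapse to a single value: the spelling variant and the
# new information have been "computed away" indistinguishably.
print(cleaned)  # → {'chicken a la king'}
```

The point of the sketch is that the code cannot tell which differences were noise and which were signal; that judgment is the curatorial, interpretive act Rawson and Muñoz insist must be made visible.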
To work through this potential impasse, Rawson and Muñoz draw on the work of anthropologist Anna Tsing on nonscalability, making the connection that scalable methods, like working with “clean,” large datasets, are inextricably linked with “totalizing systems” while “nonscalable phenomena are enmeshed in multiple relationships, outside or in tension with the nesting frame.” In other words, nonscalable elements (read: “messy,” unique data points) represent working with “historical contingencies and encounters across difference.” Data points that add diversity or heterogeneity to large datasets, then, are the site of the local, of difference, of phenomena that open up an interpretive field that makes the development of new knowledge possible.
How, then, do Rawson and Muñoz deal with the tensions between the scalable and nonscalable, the global and the local? They elegantly reach into a tried-and-true technique and tool known to humanities researchers: indexes. For Rawson and Muñoz, “an index is an information structure designed to serve as a system of pointers between bodies of information, one of which is organized to provide access to concepts in the other…an array of other terms that people use alongside ‘cleaning’ (wrangling, munging, normalizing, casting) name other important parts of working with data, but indexing best captures the crucial interplay of scalability and diversity that we are trying to trace.”
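One way to picture this “system of pointers” in code is as a researcher-made concept layer that points into the original records without rewriting them. The sketch below is hypothetical (the record IDs and strings are invented, and this is not Rawson and Muñoz’s actual data model): the index supports queries at scale while every nonscalable variant remains intact and recoverable in the source records.

```python
from collections import defaultdict

# Original records stay untouched; their particularity is preserved.
# (Invented examples, not entries from the NYPL dataset.)
records = {
    101: "Chicken à la King",
    102: "Chicken a la King",
    103: "Chicken à la King (en casserole)",
}

# Concept term → set of record IDs: pointers between two bodies of
# information, one organized to provide access to concepts in the other.
index = defaultdict(set)
index["chicken a la king"].update({101, 102, 103})

def lookup(concept: str) -> dict:
    """Follow pointers from a concept back to the untouched source records."""
    return {rid: records[rid] for rid in index.get(concept, set())}

hits = lookup("chicken a la king")
# The query is scalable, yet each variant spelling survives in the results.
```

The design choice mirrors the essay’s argument: rather than overwriting “messy” values with a canonical form, the canonical form lives in a separate structure that merely points at the originals, so scalability and diversity coexist.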
In exploring the virtual black box of data cleaning, Rawson and Muñoz make a significant contribution that has important implications at the practical, methodological, theoretical, and ontological levels for digital humanists. The diversity of digital humanities research, in and of itself a good thing, means that digital humanists may not always be using the same frame of reference or standards when they refer to data cleaning. Moreover, the tasks of preparing data, which Rawson and Muñoz suggest constitute about 80% of the labor of working with large datasets, can be so onerous that digital humanists, understandably, may focus on how to do something without necessarily scrutinizing what it means to adopt a method from another discipline (e.g., the natural sciences) or, at the granular level, what standardizing a spelling or term across a dataset could mean at the interpretive and ontological levels. In “Against Cleaning,” Rawson and Muñoz very much try to foreground the “humanities” in digital humanities by reaffirming the humanities’ concern for the particular, for nuance, and for complexity. In some ways, Rawson and Muñoz go “back to basics” by “rediscovering” how the humanities have developed and used the information structures of indexes and indexing, which, as it turns out, is quite distinct from indexing optimized to produce better search tools.
By calling back to an “oldie but goodie” like indexes, Rawson and Muñoz also make a powerful political intervention within humanities disciplines. In their article, Rawson and Muñoz often refer to the suspicion some (more traditional) humanities researchers have for digital methods. In exploring data cleaning and the scalable/nonscalable (global/local, total/particular) problems inherent in these tasks, and in noting the utility and power of indexes to reveal obvious and potentially invisible relationships between and among data, Rawson and Muñoz show that digital humanities research involves the same level of methodological scrutiny, theoretical sophistication, and interpretive technique that “analog” humanities research does, so long as methods borrowed from the natural and social sciences are scrutinized for their ontological and epistemological underpinnings. Thus, the work is intellectually similar even if in practice it looks vastly different. Most importantly, what Rawson and Muñoz show in their article, which is particularly important for non-digital humanists to hear, is that digital humanities methods, if performed thoughtfully and rigorously, do not entail a collapse of interpretation and analysis in humanities work into mere counting and collation. Rawson and Muñoz show, instead, that digital humanities methods can parse large datasets and be used productively to complement and/or actuate the analytical and interpretive work of traditional humanities to produce new insights and knowledge.
Rawson, Katie, and Trevor Muñoz. “Against Cleaning.” Debates in the Digital Humanities 2019, https://dhdebates.gc.cuny.edu/read/untitled-f2acf72c-a469-49d8-be35-67f9ac1e3a60/section/07154de9-4903-428e-9c61-7a92a6f22e51#ch23. Accessed 1 Oct. 2021.