Data Cleaning with a Humanist Lens
Created by Anthony. Last updated on May 3, 2018.
Diving in (again)
As an initial appeal to the importance of consistency and standardization in data work (which I discuss at length later in this piece), I open by aligning my perspective on, and overall direction in, the Critical Comics project with that of Erin, the former digital humanities student whose data work I am, in a way, succeeding. In a talk given at the Harvard Purdue Data Management Symposium, Miriam Posner refers to each individual literature source as a “pool,” in that “you’re not extracting features in order to analyze them; you’re trying to dive into it, [...] and understand it from within.” Like Erin, I chose to frame my work around this quote, not only because it holds true for the expansive comic database I was given but also because it symbolizes my role in providing a “finished, refined data set” within a long-term digital content project. The first step, therefore, was to draw direct inspiration from my predecessor on the project. If I, as the data cleaner, held a different or skewed perception of how the data should end up looking and being used, my resulting choices could throw off the scope of the entire work. The objective is to make the data as accessible and open to analysis as possible. However obvious this notion may seem, it is a necessary one to keep in mind when one is solely responsible for the first task of a four-part, semester-long project.
I began with an Excel file listing over 5000 different comics from our expansive MSU Comic Collection, paired with hundreds of descriptive categories used to classify each one. It was a figurative ocean of comic book data: some meaningful, some utterly unintelligible without other (presumably library) technology or an extensive knowledge of the Library of Congress’ MARC 21 Format for Bibliographic Data. In order to properly understand the data and refine its contents, I had no choice but to dive in and begin sifting through (and, more importantly, deleting from) each of the 899 columns and 5399 rows.
Figure 1. – Our data in OpenRefine
In what little time I had to thoughtfully browse the rows (a result of the many failures I endured during the cleaning process), I could identify genres (superheroes, westerns, various Disney works, etc.) that we had learned about at the start of the class through Duncan, Smith and Levitz’s “The Power of Comics,” but I quickly came to the naïve conclusion that these entries could only do so much as classification before the comic itself was needed for analysis. The best example came when I wanted to see what a comic’s cover looked like and was forced to consult an outside resource (like Google Images). While one may be quick to want to “see” the comics, what I initially failed to realize is how differently one “reads” a file of this size compared with reading the individual comics it describes. Whether it be the first-ever monthly crime comic book Crime Does Not Pay or the notoriously family-friendly “talking animal” comics featuring figures like Thumper and Donald Duck, the entries fall short when it comes to providing the visual elements needed to read the comics themselves. It wasn’t until I began referring to the (generously available) cleaned data from the year prior that I realized that, in this format, comics are not being read in a traditional manner. It is here – at the intersection of comics as a one-dimensional entertainment tool and comics as quantitative data – that I saw the wealth in “capturing quantifiable info- or datafy[ing],” in the words of Viktor Mayer-Schönberger, in regard to the database as a whole. Now, each comic represented a point that weighed equally in any manner of calculation, visualization or analysis.
When one is able to synthesize this seemingly meaningless, quantifiable information across such a corpus, it allows for a practically unlimited number of analyses, relating to things as arbitrary as book size and as specialized as keywords or subject matter. While every physical comic is individually meaningful, once datafied, the segment of text each translates to pales in comparison to the size and possible utility of the database as a whole.
To be as useful for analysis as possible, however, the data needed far more than the deletion of all blank or useless columns. Several obstacles, like compounded cells and human error, remained in the way. Using OpenRefine, I split individual columns into anywhere from 2 to 4 parts, creating a more granular and easier to navigate dataset. This separated meaningful data from things like Library of Congress author records and other “tack-on” information within each column, which in turn makes it easier to build data visualizations, maps and timelines, and word analyses. These processes will be essential to uncovering notable trends, patterns and singularities in the data in later stages of the project. Ensuring that the contents of these columns were standardized, however, was a far greater burden, simply because of the great volume of human error inherent in these types of data sets. For the sake of our data, I established my own standards for things as simple as languages and as complex as publisher names. This ensures that when digital tools are used to create visualizations, the data will be presented in the most accurate, useful manner. While it took a lot of hand-correcting and skimming to ensure that things navigated and read cleanly, this standardization and attention to consistency is exactly what allows analysis to be carried out smoothly. Some columns, like the one that housed author, illustrator and other contributor information, were simply too inconsistent in their character makeup to be divided and organized (in an efficient way for over 5000 comics, at least). This column held a wealth of meaningful information for each comic (like foreword or adaptation information), so it was kept in the cleaning process, but it serves as a reminder that this data remains imperfect and open to improvement.
Figure 2. – Standardizing names of publishers
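The cleaning described above was done by hand in OpenRefine. Purely as an illustration of the same three steps (dropping blank columns, splitting compounded cells, and standardizing values), the logic can be sketched in Python. Everything named here is an assumption made for the example: the column names, the sample rows, and the table of publisher variants are invented and are not taken from the MSU data.

```python
import re

def drop_blank_columns(rows):
    """Remove columns whose value is empty in every row."""
    keep = [c for c in rows[0] if any((r.get(c) or "").strip() for r in rows)]
    return [{c: r.get(c, "") for c in keep} for r in rows]

def split_column(rows, column, sep, new_names):
    """Split a compounded cell into new columns, padding missing parts."""
    out = []
    for r in rows:
        parts = [p.strip() for p in (r.get(column) or "").split(sep, len(new_names) - 1)]
        parts += [""] * (len(new_names) - len(parts))
        new = {k: v for k, v in r.items() if k != column}
        new.update(zip(new_names, parts))
        out.append(new)
    return out

def standardize(rows, column, variants):
    """Map known variant spellings onto one chosen standard form."""
    def key(s):
        # Ignore case and punctuation when comparing spellings.
        return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    lookup = {key(k): v for k, v in variants.items()}
    return [
        {**r, column: lookup.get(key(r.get(column) or ""), (r.get(column) or "").strip())}
        for r in rows
    ]

# Hypothetical sample rows, made up for this sketch.
rows = [
    {"title": "Crime Does Not Pay", "imprint": "New York : Lev Gleason Pub.", "notes": ""},
    {"title": "Walt Disney's Comics and Stories", "imprint": "Poughkeepsie : Dell Publishing Co.", "notes": ""},
    {"title": "Four Color", "imprint": "New York : DELL PUB. CO", "notes": ""},
]

# Hypothetical hand-built table of variant spellings and the chosen standard.
variants = {
    "Dell Publishing Co.": "Dell",
    "Dell Pub. Co": "Dell",
    "Lev Gleason Pub.": "Lev Gleason",
}

cleaned = drop_blank_columns(rows)
cleaned = split_column(cleaned, "imprint", ":", ["place", "publisher"])
cleaned = standardize(cleaned, "publisher", variants)
```

The variant table is the part that still demands human judgment: someone has to decide which spellings count as the same publisher, which is the hand-correcting the paragraph above describes. Values with no entry in the table are left as they were, so nothing is silently rewritten.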
By leaving much of the “standardizing” of things like subject matter, authors and fictitious or legendary characters to the Library of Congress, we find ourselves in a difficult situation. On one hand, we are in a favorable position in that we are not responsible for having to datafy a person’s existence in order to give reference information for the various comic data entries. On the other hand, it puts our institution’s database at the Library of Congress’ mercy in depicting structures of power like gender or ethnicity (for example, a subject heading is only created for a specific Indian tribe if there is literature on that tribe in the LOC, leaving many without representation). As Miriam Posner writes, this creates a demand for “more nuanced data models” in order to express these naturally occurring forms of social organization. By using these standardizations, however problematic, we are not only endorsing them but also furthering their continued use by others. In this case, as in many others, the need for a highly consistent, widely used standardization system drove my decision to stick with the LOC, despite my aspiration to be as inclusive as possible with the data. I am but an English/DH student and had neither the time nor the intellectual capacity to even begin to represent each marginalized group to its liking. These constraints essentially necessitated relying on such an overseeing body for these classifications. Accepting that I would be unable to grapple with such a difficult, meta-digital problem, I concluded my data cleaning here until I was ready to take on the LOC myself.
Figure 3. – The finished, cleaned data set
My objective in this project was to immerse myself in, wrangle with and manipulate this data in a way that, to quote Posner one last time, lets us as a digital humanities class “reinvent [it] to try and tease out the things we think are really interesting.” Speaking to that, comics – a medium often discounted for its “disposable,” “merit-lacking” nature – are rightfully utilized here with the intention of uncovering a greater understanding of the people and times that created them. In creating a dataset capable of providing such understanding, I gained a new perspective on the difference between reading a database and reading a comic, learned how digital tools “view” data, and grappled with questions of social organization.