Diving into Data: Cleaning Data for 1,000 Comics
Created by Erin. Last updated on May 3, 2017.
What is Data Cleaning Anyway?
In Posner’s article “Humanities Data: A Necessary Contradiction,” she describes work with literature in this way, “[...] you’re not extracting features in order to analyze them; you’re trying to dive into it, like a pool, and understand it from within.” This is the framework by which I will approach the data cleaning portion of this project. Let’s dive in.
The backbone of all good (digital) research is clean data. By clean, I mean comprehensible- something our data was anything but when we first started working with it. Our data was taken from Michigan State University’s library catalog. The data was given to us as an excel file with MARC21 formating. MARC21 (MAchine Readable Cataloging) is the international standard for the dissemination of bibliographic data. Adopted in 1971, MARC21 “prevents duplication of work and allows libraries to better share bibliographic resources.” Data formatted in MARC21 was a great place to start. In order to draw conclusions from and create visualizations with our data, however, I needed to clean the data. Data cleaning is the process of identifying irrelevant parts of a set of data and modifying, replacing, or deleting them in order to best fit the context of your research.
Our data in OpenRefine
Following Posner’s analogy, my initial dive was into a sea of 999 rows of bibliographic data (Figure 1), and I felt like I was drowning. I had no idea how to start the data cleaning process, and that is precisely where I was wrong. In my mind, there was a roadmap for data cleaning and, if I didn’t start off the right way, the rest of the project would go haywire. Then, while rereading Posner’s “Humanities Data: A Necessary Contradiction,” I ran across the following quote. “When you call something data, you imply that it exists in discrete, fungible units; that it is computationally tractable; that its meaningful qualities can be enumerated in a finite list; that someone else performing the same operations on the same data will come up with the same results. This is not how humanists think of the material they work with.” That was it! The whole time I was stuck in a rut because I didn’t know the “right way” to start, I was simply missing the point.
As humanists, we study human produced data that is laced in human constructed standards, classifications, and biases. Therefore, there’s no “right” place to start. The work I did to clean this data is different from the work another scholar would do if he or she were assigned the same task. This paper will, therefore, serve as documentation and justification of the decisions I made while diving into our data pool.
Color Coding a Game Plan
An element unique to humanists data cleaning is the ability to make executive decisions about what information makes the cut. These decisions regarding relevant and irrelevant data are vital for conclusions to be drawn. In an effort to maintain my transparency and, therefore, make my work reproducible, I will outline this process below. To start, we decided to eliminate any data that was irrelevant in the context of our project. My classmates and I did this by reviewing MARC21 descriptions of each column of data and, consequently, labeling each column either “delete,” “keep,” “merge,” or “split” (Figure 2).
Deciding what to keep, merge, and delete.
To ensure that columns labeled “delete” were properly labeled so, I went back through each one and looked up the MARC21 description of its contents. After deleting said columns in OpenRefine, the data cleaning software I used, I color coded the remaining columns in the spreadsheet to help me visualize what was left to be done. I assigned green to columns that were to be kept, yellow to columns that I wanted to merge, and beige to columns that needed to be split. When I finished with a column, I changed the top cell to orange. If I was stuck on a column, I changed the top cell to red (Figure 3).
Refining the MARC21 data.
Split. Merge. Repeat.
As I dove deeper into the data, I noticed that some columns contained various types of data points, while others seemed to repeat information that already existed in other columns. This was an issue that I would have to tackle in order for my classmates and me to analyze our data as distinct parts later on.
To combat the repeated data points in our dataset, I used OpenRefine’s merging function. This is where things got a bit tricky. Columns with identical labels did not always contain identical data points, while columns with different labels contained data points that were too similar to keep separate. When trying to merge all columns labeled “Uniform Title,” for instance, I realized that the contents in each column, however ironic, were not uniform. One of the “Uniform Title” columns contained a few titles, another housed titles mixed with links and location data, and yet another a mixture of both. I tried to use the MARC21 description to help me differentiate between these “Uniform Title” columns, but that didn’t really get me anywhere, so I decided to keep these columns separate.
Another approach I took to combat columns with multiple data points was splitting the columns. Before diving into this data pool, I thought MARC21 formatting was pretty specific. After diving in, however, I realized that a lot of columns contained a mixture of data points, and these data points were not very organized at all (i.e. Daumier, Honoré, 1808-1879, http://id.loc.gov/authorities/names/n79089112). Before splitting these columns, I looked at the text facet of the column to find a separator that I could use to split it. A separator could be anything from a period, comma, semicolon, or hyphen. Most of the time, however, the values I wanted to separate were mixed together with values that I didn’t want to separate. Some cells, for instance, contained a mixture of terms and Library of Congress links. Instead of making one clean cut to separate the differing values apart, I had to separate the column into all of its separate parts first. In the “Personal Name” column, for example, I decided to use the periods as separators. This left me with “Daumier, Honoré,” “1808-1879,” “http://id,” ”loc,” and “gov/authorities/names/n79089112.” Then, I went back and re-merged the columns with now similar content. This meant that “http://id,” ”loc,” and “gov/authorities/names/n79089112” would merge back into the link “http://id.loc.gov/authorities/names/n79089112.”
With columns that didn’t contain a distinct separator, I used Open Refine’s value.replace function to transform the cells. This function allowed me to replace the 18 in “1897”, for instance, with “*18”. This then allowed me to separate dates from other data points by using the asterisk as the separator. It’s worth noting that when splitting cells with a separator, you’re basically removing said separator from the data once you use it as a splitting place for a cell.
However lengthy the process, splitting columns was vitally important for the analysis portion of our project. Without separating and organizing columns that contained mixed data points, we wouldn’t have been able to later analyze, create data visualisations with, and draw conclusions from our data.
Hacking the System
As is with most software, there are algorithyms in place that we must work around. The first few times I tried to merge columns, for instance, I received an error message saying that cells with null values could not be merged. To solve this problem, I went to the text facet of the columns I was trying to merge and added an ampersand and asterisk to all blank cells (this value is in no way significant, but rather a standard for filling null cells that I created myself). That way, the cells would no longer show up as null and would merge properly. Afterwards, I had to go back and use value.replace to replace all “&*” with nothing.
Once I got most of the columns split and merged the way the class had suggested and the way I saw best fit based on content, I decided to go back and clean each column on its own. A problem that arose during this was with the “Cluster and Edit” feature. The job of “Cluster and Edit” is to help with the standardization of cell values. This means that names, for example, that are misspelled, not capitalized, missing an accent, or shortened, will be corrected to the standard name. This standard name was essentially up to me. I got to decide whether Becassine or Bécassine, for example, would be the standard title for that collection (Figure 4).
The "Cluster and Edit" feature helped clean up text facets of columns a ton. It also made me realize something very important about this data set. Things like names, something we try and center our data set around, are not always constant. The Doubleday publishing company, for instance, has merged with other companies and changed its name five times over the course of history. Picking this subtle information up upon first glance of the data is almost impossible. This just drives home even more so the fact that humanists need to “dive into” their material in order to fully understand it, as Posner says in “Humanities Data: A Necessary Contradiction.”
Inconsistencies in Classification
Disjuncts in classification throughout the data were, more likely than not, caused by human error. Since the data is input into the fields by humans, inherent mistake makers, it is not surprising that there were some mistakes. With that said, there is no perfect system for classification. Although making classifications is an inherent function of humankind, we humans are not machines. You cannot put one comic book in front of us, for instance, and expect us to analyze and summarize it in the exact same way as the next one. As Bowker and Star discuss in Sorting Things Out, though we have standards and classification, “these dimensions of standards are in some sense idealized. They embody goals of practice and production that are never perfectly realized, like Plato’s triangles.”
In addition to human generated errors, I also ran into some technological issues while cleaning the data. For starters, running OpenRefine on my laptop caused an error symbol to show up for special characters in the data. There wasn’t much I could do to fix this besides use a school computer to clean the data. So, use a school computer I did!
Some standards that were consistent throughout the dataset were measurements and separators. Page sizes, dictated in column 300, were either measured in centimeters or millimeters. Standards for separators seemed to range from colons, semi-colons, commas, or double dashes. Where standards were consistent, I could separate columns in the blink of an eye without ever having to go back and re-merge parts of the columns I had previously separated. Because this process was a lot simpler than the rest, I decided to tackle most of the standardized columns first.
In cleaning the standardized columns of the dataset, I found myself desiring standards for the rest of the data as well. As I continued to wrestle with the standardless columns, however, I realized something. Standards, in one sense, make things like data cleaning a whole lot simpler. On the other hand, they come with a ton of inlaid biases. Standards regarding gender and race are an example of this. Both attempt to fit another human’s outlook on themselves into a nice, neat, universal box. These surface level standardizations contradict our digital humanist “dive” into the depths of our datasets. If all of the columns in our dataset were perfectly structured with consistent standards, I wouldn't have had to dive in. Instead, I’d simply be acting as a computer, separating and merging where the separator indicated, never looking past the separator to the data itself. Messy data allows the humanist to be human, inherent mistakes and all, and not just a computer that follows algorithmic orders.
In recognition of human generated errors and their inherent existence, I quote English mathematician Charles Babbage who once said, “Errors using inadequate data are much less than those using no data at all” (Anderson, Carl. Creating a data-driven organization: practical advice from the trenches. Sebastopol, CA: O'Reilly, 2015. Print.). This quote has given me much solace throughout this intricately messy data cleaning process. It reminds me that the work we put into the foundation of our projects, the data, is vital for the life of the rest of the project. It also reminds me, however, that caring about data in the way we do is uncommon at best. In a world that makes conclusions about politics, religion, etc. by reading no further than the heading of an article, it is our responsibility as digital humanists to, as Posner states, dive into our data and know it intimately before we start drawing any conclusions.