Data Visualization & Critical Theory - 2 Experiments
Created by Anne. Last edited on May 3, 2017.
“There's something almost quite magical about visual information. It's effortless, it literally pours in.” (David McCandless)
The idea that data visualizations need to be intuitive and yield immediate insights is widespread and illustrated by David McCandless' quote above. Taken from his TED Talk titled “The beauty of data visualization” from August 2010, it corresponds not only to the expectation that data visualizations facilitate access to information by making sense of numbers and characters that would otherwise be incomprehensible or overwhelming, but also to the expectation that they allow effortless insights that can passively “pour” into our minds.
While there is great potential in data visualization and the way it can fit complex information and connections into compressed forms, modern data visualization techniques rest upon a hierarchical mode of knowledge transmission (see Klein). As visual schemas they communicate knowledge based on a ranked arrangement ("above," "below," or "at the same level as") that categorizes objects of enquiry along the lines of more or less important, relevant, etc. What is more, in order to make data visualizations legible and create these hierarchies, the complexity of our objects of enquiry needs to be reduced.
Forming part of a seminar project, which set out to design and build an online collection of 10 comics from the MSU Comic Collection, this paper investigates and challenges the issues bound up in transmitting knowledge through data visualization and examines how critical theory can be used to explore alternatives to hierarchical modes of knowledge transmission. Critical theory can be understood in two different ways. The first is critical theory as a mode of analyzing cultural products. The second is using critical theory to create and investigate cultural objects that take critical concepts as the basis of their design. It is this second way that this paper is concerned with.
I will be working with two sets of data relating to the MSU Comic Art Collection: a smaller set of 10 comic books and a bigger set of data covering MSU's comic book collection until 1939 (approximately 900 data points). By means of different types of data visualizations I address the procedure of reduction involved in making data visualizations and explore ways in which critical theory can productively intervene in this process. As has been noted in canonical texts such as “Dialectic of Enlightenment”, the “regression” of reason into myth is a starting point for critical theory to re-examine existing logic and thus also points to some of the issues discussed in this paper. The desired intuitiveness of data visualization and the expectation that it should yield immediate information and accessibility emerge as a result of historical progress and development in visual culture. Consequently, and as Miriam Posner points out, a cultural critique needs to look at the logic that got us to where we are now. This means that a critical engagement with data visualization needs to expose the fact that not only the objects of enquiry (in this case the comics), but also the technologies we employ to digitize them and the tools we use to create data visualizations about them, are the products of history and therefore subject to the choices and categories of humans. As part of this seminar project we were interested in interacting critically not only with the comics themselves, but also with the tools we use to analyze them. For my part on data visualization this means surveying the different ways of critically working with data and exploring alternative ways of visualizing data.
Visual 1: Comic publication in time and place
There are different categories for visualization techniques. Exemplifying a broad range of ways data can be plotted, they differentiate, broadly speaking, between time-series data, statistical distribution, maps, hierarchies and networks. While these techniques fulfill different purposes they all have in common the employment of points, lines and other geometric shapes in order to stand in for the actual objects of inquiry (in this case the comics).
The first images in this paper rely on using geometrical forms as stand-ins for the comics and seek to reproduce some of the characteristics of modern data visualization to retrace and expose processes of reduction. Based on the bigger dataset, which relates to the MSU Comic Collection until 1939, these images aim to provide a digestible visualization that shows publications per year in relation to the place of publication, allowing the viewer to compare different places and the number of publications these places put forward over time, by means of bar charts, lines and dots.
According to Manovich, it is this reduction of objects into data points and geometric shapes that has given visualization its specific identity. It is also those features of data visualization (reduction and spatial variables) that allow us to reveal patterns and compress knowledge. However, there is a price to be paid. Manovich argues that we discard “99% of what is specific about each object to represent only 1%- in the hope of revealing patterns across this 1% of objects’ characteristics” (6). In other words, data visualization is a trade-off where only certain qualities of an object are taken into account (such as publication place, year of publication and number of publications), while everything else that is specific about that object (in this case a comic) is disregarded.
For the first image I created, I wanted to have different colors representing the different places of publication and turned to RAW. The X-axis indicates the year of publication, the height of the bar represents the number of publications in that year and each color indicates one specific place of publication. Comparing the height of the bars with the data, it becomes clear that the program counted all publications in one specific year (not just the ones from one specific location), but attributed only one place of publication to each year. Based on comparisons with other graphed bar charts and the data, my assumption is that the program picked the dominant location for each year.
Timeline/bar chart created in RAW. X-axis: year of publication, Y-axis: number of publications. Green bars are places in the US; black is for publications whose place of publication is unknown. The black bar all the way to the right does not have a year of publication assigned to it.
This visualization gives us a summary of an overall trend of publishing for the comics in MSU's collection between 1833 and 1939. It shows for example the peaks in 1930 (of unknown publication places), 1934, 1935 and 1938, and a general increase in comic publication during the first half of the 20th century. It also reveals a large number of publications from unknown places in 1930, as well as a number of comics whose publication year is unknown (the black bar all the way to the right of the graph). Taking a look at the data I realized that this bar to the right (without a publication year) might be the result of unclean data.
The location data included a lot of places that appeared twice but with different spellings (e.g. “Racine, WI” and “Racine, Wisc.”). I used Excel's “find and replace” function to get uniform spelling and clean the data of duplicates. The cleaning of data also included (but was not limited to) removing all question marks, commas and alphabetic characters from the year values and making sure that there was only one year value in each cell. This process of data cleaning presented two sources of reduction.
1) The elimination of question marks after the year value wiped out the uncertainty or ambivalence about the year of production of that specific comic, revealing that there is no room for ambivalence in this visualization.
2) The reduction of multiple dates to one date only. This reduction erased the possibility of accounting for a period in which a comic was published or of retrieving information on whether a comic had a particularly short or long lifespan. In the same way, the reduction to one year value also deleted the possibility of accounting for republications of a comic.
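For readers who would rather script this step than use a spreadsheet, the cleaning described above can be sketched in Python. This is only an illustration under stated assumptions, not the Excel procedure actually used: the function names `normalize_place` and `clean_year`, the alias table and the sample values are all hypothetical. The sketch has the side effect of making both reductions visible as explicit lines of code.

```python
import re

# Hypothetical alias table: variant spellings mapped to one canonical form.
PLACE_ALIASES = {
    "Racine, Wisc.": "Racine, WI",
}


def normalize_place(place):
    """Return a uniform spelling for a place of publication."""
    place = place.strip()
    return PLACE_ALIASES.get(place, place)


def clean_year(raw):
    """Reduce a raw date cell to a single four-digit year, or None.

    Question marks and other non-digit characters are dropped (reduction 1),
    and only the first year of a multi-date cell is kept (reduction 2).
    """
    years = re.findall(r"\d{4}", raw)
    return int(years[0]) if years else None
```

Under these assumptions, a cell such as "1934?" and a cell such as "1934" become indistinguishable after cleaning, which is exactly the loss of ambivalence described in point 1.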
Next I used the timeline function in Palladio to create the graph below, which is helpful when it can be accessed on the Palladio website in an interactive way. On the website one moves the cursor over the different shades of grey and they are immediately highlighted in turquoise to indicate the place of publication. As a non-interactive image, however, the different locations are difficult to distinguish from one another.
Timeline created with Palladio. X-axis: year of publication, Y-axis: Number of publications, Highlighted area in the Example: Racine, WI (place of publication).
Because I did not have enough control over the display options in RAW and in Palladio to create a graph that would allow me to assign different colors representing the different places of publication while the X-axis indicates the year of publication and the Y-axis the number of publications, I turned to Excel.
Instead of working with a table that was structured into three columns (year of publication, place of publication and number of publications), I created a matrix in which each year of publication and each place is linked to the number of publications that occurred in that year for that specific place. The first step in creating this matrix was to separate the two necessary columns (place of publication and year of publication) from the master data. Then I used the SUMPRODUCT function in Excel to assign each point in the matrix a value. Column A lists all the years of publication and Column B the places of publication. Within the SUMPRODUCT function I told Excel to go to Column A and check whether each entry is the same as G1 (which is the year 1833). Note that it is important to fix only the 1 of G1 (written G$1), so that the formula always reads the year from the header row while the column letter changes to H1, I1, J1, etc. when the formula is copied across the matrix. Then I told Excel to look in Column B (the place of publication) for all occurrences of the place named in cell F2; here only the column letter is fixed ($F2), so that the row changes to F3, F4, etc. when the formula is copied down. SUMPRODUCT then calculates how many times both conditions are true and displays, for example, the total number of publications in New York in 1833 (in this case 0). The actual formula looks like this: =SUMPRODUCT(($A:$A=G$1)*($B:$B=$F2)).
Part of the matrix I created in Excel to generate the stacked area chart
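The counting that SUMPRODUCT performs for each cell of the matrix can also be expressed in a few lines of Python. The following is a minimal sketch rather than a translation of the spreadsheet: the function name `count_matrix` is hypothetical and the sample rows are made up, not taken from the MSU data.

```python
from collections import Counter


def count_matrix(records):
    """Build a place-by-year matrix of publication counts.

    `records` is a list of (year, place) pairs, one per comic, mirroring
    Columns A and B of the spreadsheet.
    """
    counts = Counter(records)  # how often each (year, place) pair occurs
    years = sorted({year for year, _ in records})
    places = sorted({place for _, place in records})
    # One row per place, one column per year; pairs that never occur count as 0.
    return {
        place: {year: counts[(year, place)] for year in years}
        for place in places
    }


# Made-up sample rows for illustration only.
sample = [
    (1833, "London"),
    (1934, "Racine, WI"),
    (1934, "Racine, WI"),
    (1934, "New York"),
]
matrix = count_matrix(sample)
```

Here `matrix["New York"][1833]` plays the role of one spreadsheet cell: it holds the number of rows in which both the year and the place match.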
Using predesigned tables in Excel, but selecting the data manually, I then created a stacked area chart. This chart displays how much each location contributed to the total number of publications in a given year. Since there were 68 places in total (hence 68 different colors), I decided to graph only the places which had 10 publications or more and to group the other ones into the category “Others”. The decision to bring down the number of places is another example of how reduction plays an integral part in making data visualization legible and easily digestible. What is more, in order to intuitively understand this graph and to draw conclusions from it, I had to reduce the number of places even more.
Stacked area chart created in Excel. X-axis: year of publication, Y-axis: Number of publications, each color refers to a different place of publication, as indicated on the right side of the graph. Showing only places with more than 10 publications in total.
This next graph shows only the top five places in terms of number of publications. Working with this dataset, it struck me that only drastic reduction of the data (in several different ways) makes the resulting visualization legible, allowing an “effortless pouring in of information”. Returning to Manovich's quote above, it becomes clear that this type of data visualization is not only a reduction to 1% of what is specific about each comic: displaying only nine places out of 68 total places of publication takes into account only about 13% of all the places of publication, while displaying the top five places takes into account only about 7%.
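The grouping of smaller places into “Others” is itself a small algorithm, and writing it out shows how a single threshold decides which places remain visible. The sketch below assumes the cutoff of 10 publications or more mentioned in the text; the function name `group_small_places` and the example totals are hypothetical.

```python
def group_small_places(totals, threshold=10):
    """Keep places with at least `threshold` publications; pool the rest.

    `totals` maps each place of publication to its total number of comics.
    """
    grouped = {}
    others = 0
    for place, count in totals.items():
        if count >= threshold:
            grouped[place] = count
        else:
            others += count  # this place disappears into the pooled category
    if others:
        grouped["Others"] = others
    return grouped


# Made-up totals for illustration only.
example = group_small_places({"Place A": 12, "Place B": 3, "Place C": 4})
```

In this made-up example, Place B and Place C no longer appear under their own names; only their combined count survives.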
Stacked area chart created in Excel. X-axis: year of publication, Y-axis: Number of publications, each color refers to a different place of publication, as indicated on the right side of the graph. Showing only the top five places of publication (in terms of total numbers of publications)
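The percentages quoted above follow from a simple ratio of displayed places to all 68 places. As a quick check, using a hypothetical helper named `share_shown`:

```python
def share_shown(shown, total):
    """Percentage of all places of publication that appear in a chart."""
    return 100 * shown / total


nine_places = share_shown(9, 68)  # chart limited to the nine largest places
five_places = share_shown(5, 68)  # chart limited to the top five places
```

Rounded to whole numbers, the two values correspond to the figures of about 13% and about 7% given in the text.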
While these last two visualizations make information about MSU's comic collection accessible and legible, they only do so after significant complexity reduction. Before addressing the question of alternative ways of visualizing data (in the next section), I want to briefly highlight how these data visualizations (despite their limitations) can nonetheless provide insights that trigger critical engagement with the data.
At the outset of working with the data I had the assumption that comics published in the USA would dominate the MSU collection. This assumption was confirmed, but also challenged. On the one hand we can see that three out of the top five places (in terms of total numbers of publications) are in the USA. The chart also shows that with the beginning of the 20th century not only comic publication overall, but also the number of comics published in the US, increased (at least in the MSU dataset). Finally, the chart reveals the short but intense boom of comic publication in Racine. On the other hand, it is interesting to see that prior to the 20th century most comics were published in Europe, with London and Paris also being among the places with the most publications overall.
The reduction of complexity necessary to make this legible and the domination of US and European cities leave places like Johannesburg or Buenos Aires unknown and invisible in the data visualization. While this is based on just one data set and not statistically relevant in terms of bigger claims regarding comic publication, it may lead us to a variety of interesting research questions about MSU's comic collection and cooperation structure: In what ways is the existence or lack of intellectual, political and cultural cooperation between MSU and non-European countries (there are not many comics in MSU's possession that were published in non-European or non-US territory) reflected in their comic collection? Is the MSU data set representative of comic publication that was going on in countries outside of the USA and Europe? Are the small publication numbers in Johannesburg and Buenos Aires a result of lacking means of production in those places, a lack of interest from MSU in collecting non-European/non-US comics, or something else?
Visual 2: “Direct” visualization and non-hierarchical knowledge transmission
As mentioned in the introduction, the aim of this page is to explore ways of critically interacting with data and the tools used to analyze and visualize it, but also to explore alternative ways of data visualization. The technical requirements of the tools employed in the previous section and the choices that went into creating those charts indicate just how much these visual schemas are based on a ranked arrangement of “more or less important”.
The next visuals seek to work with data in different ways. First, I created a visual that displays the comics in a random order, for which I used the smaller dataset made up of 10 comic books that have been digitized into 201 images, with each image making up a datapoint. By using the digital reproduction of a comic (a photograph) this approach of working with data is less reductionist than using geometrical stand-ins. In order to create this visual I had to manipulate the data and assign each image two values, one X-axis value and one Y-axis value. However, before starting to work with ImagePlot I had to create the actual dataset. Since the program works with a set of image files and the data about these images saved in a txt file, I first merged all the small datasets (there was initially one dataset for each comic; to learn more about the creation of datasets, visit the creation of archive section).
Once again the data cleaning process was the most tedious and time-consuming part. I had to delete duplicates because after merging all files I had 245 values, although I only had 201 pictures. In addition to that there were errors in the data, for example wrong column descriptions or inconsistencies in how the individual csv files were structured, which caused errors in the final data set. While in the previous stacked area graph the objects visualized were assigned arbitrary signifiers (e.g. a year became a dot), the following data visualization uses a visual representation of the original artifact as the data point.
Image generated with ImagePlot. X-axis: random numbers 1-10, Y-axis: random numbers 1-20
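The preparation of this dataset (merging the per-comic files, removing duplicate rows and assigning each image a random position) can be sketched in Python as follows. This is an assumption-laden illustration, not the workflow actually used: the column name `filename`, the function names and the tab-separated output layout are hypothetical, and the exact file format ImagePlot expects should be checked against its documentation.

```python
import random


def merge_and_deduplicate(datasets):
    """Concatenate per-comic row lists, keeping one row per image filename."""
    seen = set()
    merged = []
    for rows in datasets:
        for row in rows:
            if row["filename"] not in seen:
                seen.add(row["filename"])
                merged.append(row)
    return merged


def assign_random_positions(rows, x_max=10, y_max=20, seed=None):
    """Give every image a random X (1..x_max) and Y (1..y_max) value."""
    rng = random.Random(seed)
    return [
        dict(row, x=rng.randint(1, x_max), y=rng.randint(1, y_max))
        for row in rows
    ]


def to_tab_delimited(rows, columns=("filename", "x", "y")):
    """Serialize rows as tab-separated text with a header line."""
    lines = ["\t".join(columns)]
    lines += ["\t".join(str(row[c]) for c in columns) for row in rows]
    return "\n".join(lines)
```

The default ranges mirror the axis values given in the caption above (1 to 10 and 1 to 20). Passing a `seed` makes the otherwise arbitrary layout reproducible, which is worth noting: even a "random" arrangement is the product of a choice.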
As a data visualization that does not seem to 'make sense' in an obvious way, and does not use graphical primitives as stand-ins, the visual representation of the artifact itself moves to the foreground. Using the actual images of the comics preserves some of the visual details of the original artifact and allows a more aesthetic experience.
What I mean by aesthetic experience is that the display of the digital copy of the comic demands a different mode of sensory engagement than a geometrical stand-in. A photograph of a comic, more so than a line or dot that represents a comic (see Manovich 14), relocates the aesthetic qualities and specificities in the experience of the viewer and involves him or her in an emotional and sensational way of experiencing the object of enquiry. Klein describes this sort of engagement when she talks about the mural charts that Peabody used to teach history: "They replace the decorative and utilitarian function of a rug with an experience designed to generate knowledge, and in so doing, they require viewers to reconsider the actual position of their bodies in relation to their objects of knowledge" (Klein). By appealing more directly to the senses than abstract geometrical forms, this kind of visualization demands a different engagement from the viewer, foregrounding for example the haptic qualities of the materiality or the fragility of the paper.
Because the visuals are not plotted according to a hierarchical order, there is no apparent reason why one image is above, below, to the right, or to the left of another image. This way of plotting the visuals perhaps comes closest to what Klein called for when she referred to horizontal knowledge transmission where the source of knowledge is located “[…] in the interplay between viewer image and text”. An analysis of this visualization can encourage as many interpretations as there are audiences, and the possibility of seeing the digitized version of the actual artifact (instead of seeing a dot or a bar) might even help to trigger an affective and embodied response to these comics. In the image we can see the actual pages of the comic, their size and the material, the colors, and if we zoom in we can even see the drawings and read the text.
As a counter-image to the stacked area chart this is a visualization that is not intuitive in its own right, nor does it provide evidence or proof of something. I would not dare to claim that it is as useful as Klein's “mural charts” as a tool for horizontal knowledge production, but since it is not immediately clear what it is showing, each onlooker needs to make sense of it for himself or herself.
Visual 3: Saturation in time/space and sequential art
Manovich points out that an advanced degree of abstraction can make patterns more perceivable, but at the same time it withdraws the viewer further from the aesthetic experience he or she would have in a more direct encounter with the object of enquiry. Conversely he states: “[s]taying closer to the original artifact preserves the original detail and aesthetic experience but may not be able to reveal some of the patterns” (14). In other words, we trade off one problem for another. In the following section I tried to generate data visualizations that would allow both: abstraction to reveal patterns and an aesthetic experience. These final images are “direct” visualizations using ImagePlot to examine possible patterns, namely the relationship between saturation (the perceived intensity level of color) and time/space.
In order to plot the images according to their saturation I first had to acquire the information on how saturated each image in my dataset is. To do this I used ImagePlot's color calculation function. After measuring the visual features, the most challenging part was to convert the result of that analysis into usable data. For each image the program generated data (the mean and the standard deviation) for three image characteristics: hue, saturation and brightness. After assigning each image the corresponding information, I used the program to plot the saturation mean for each image against the year in which the work was published.
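To give a sense of what such a measurement involves, the following Python sketch computes the mean and standard deviation of hue, saturation and brightness for a list of pixels. It is not ImagePlot's implementation: the function name `hsb_statistics` is hypothetical, the values are reported on a 0 to 1 scale (ImagePlot may use a different scale), and hue is averaged arithmetically even though it is an angle, which is a simplification.

```python
import colorsys
from statistics import mean, pstdev


def hsb_statistics(pixels):
    """Mean and standard deviation of hue, saturation and brightness.

    `pixels` is a list of (r, g, b) tuples with channel values from 0 to 255.
    All results are on a 0 to 1 scale.
    """
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    stats = {}
    for index, name in enumerate(("hue", "saturation", "brightness")):
        values = [triple[index] for triple in hsv]
        stats[name] = {"mean": mean(values), "stdev": pstdev(values)}
    return stats


# Two made-up pixels: one fully saturated red, one neutral grey.
stats = hsb_statistics([(255, 0, 0), (128, 128, 128)])
```

In this two-pixel example the saturated red and the unsaturated grey average out to a saturation mean of 0.5, which illustrates how a single number per image flattens very different color distributions.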
The first image (with the white dots) shows the abstract visualization and the second image demonstrates the direct visualization with the comic images.
Images created with ImagePlot. X-axis: Year of Publication, Y-axis: Saturation mean.
These visualizations seem to suggest that saturation increases over time (see the slight increase in saturation as the years progress, illustrated by the images towards the upper right corner), but they also trigger more questions regarding the relationship between saturation and comic production. Did saturation really increase over the years? What kind of technological developments (for example advancements in coloring techniques) could be responsible for this? In order to answer any of these questions, however, we would first have to produce a bigger dataset with more than just 10 comics and also go back to the comics themselves to understand and study the different coloring techniques.
After I saw this first visualization involving saturation, I wondered whether there would be an interesting relationship between page numbers and saturation. The hypothesis was that the first page (which is the cover page in all 10 comics) would be more saturated than the other pages (perhaps to make it look more attractive). In order to examine this assumption I plotted page numbers on the X-axis and the saturation mean on the Y-axis, but we can see that this assumption did not hold true. There is no decrease of saturation after the first page.
Image created with ImagePlot. X-axis: page numbers, Y-axis: saturation mean
Finally, it is important to acknowledge that the calculation of the saturation and plotting the saturation mean is made possible only by turning analog data (the comics) into a digital representation (Manovich 14), presenting us once again with the constant trade-off and negotiation between abstraction and the aesthetic experience.
Comics are an art form that makes use of images arranged in sequence for telling a story. They use text and visuals and the story develops in the interaction between the two. In fact, one without the other cannot convey the whole picture. This interaction between image and text is exactly what Scott McCloud writes is the critical potential of comics as a medium: “Juxtaposed pictorial and other images in deliberate sequence, intended to convey information and/or to produce an aesthetic response in the viewer” (9).
In terms of data visualization we have a similar situation. A data visualization without a critical engagement, one that lays open the trade-offs and reductionism involved in the process of creating the visual schemas, only tells part of the story. A digital archive that only displays the digitized comics and data points, without critical reflection on the tools and processes that shaped its final form, obscures complexity for the sake of intuitiveness and easy insights. The critical potential lies in an interaction between text and images.