“Data science is not just about AI or machine learning. It is the discipline of turning raw data into understanding.”
- Julia Silge, useR Conference 2019
Over the next few lessons, we’ll start learning how to transform data into information. The technical details will take practice to master, but that goal should stay front and center throughout.
Below is a short case study on the 1986 Space Shuttle Challenger explosion. In this example, we’ll examine data on the relationship between the temperature at launch for a series of NASA shuttle flights and damage to the rocket O-rings - the root cause of the accident. Don’t worry yet if you don’t understand the specific code used - just try to follow along to see how we can pull information out of raw data.
On January 28, 1986, the space shuttle Challenger exploded. In his book “Visual Explanations”, Edward Tufte (1997) provides a detailed account of the background to the incident. In short, the temperature on the day of the launch was too low, which caused the O-rings in the rocket to fail and led to an explosion that destroyed the rocket and killed the seven-person crew, pictured below.
The R package DAAG has a dataset called orings which contains data on temperatures and O-ring damage during launches prior to the Challenger incident. Let’s load the DAAG library and preview the data:
library(DAAG)
head(orings)
#>   Temperature Erosion Blowby Total
#> 1          53       3      2     5
#> 2          57       1      0     1
#> 3          58       1      0     1
#> 4          63       1      0     1
#> 5          66       0      0     0
#> 6          67       0      0     0
We can see that the dataset contains observations about the temperatures of launches and O-ring damage, but we don’t yet have information. One step forward towards information is to simply plot the data to see if there might be a relationship between temperature and O-ring damage:
library(ggplot2)
challengerPlot <- ggplot(orings, aes(x = Temperature, y = Total)) +
  geom_point(size = 1.5) +
  scale_x_continuous(limits = c(25, 85), breaks = seq(25, 85, 5)) +
  scale_y_continuous(limits = c(-0.15, 8), breaks = seq(0, 8, 2)) +
  labs(x = 'Temperature (°F) of field joints at time of launch',
       y = 'Total o-ring damage') +
  theme_bw() +
  theme(panel.grid.minor = element_blank())
challengerPlot
The graph above shows O-ring damage on the y-axis and temperature on the x-axis. We can easily see that no prior launch below 66 degrees F was damage-free, and it appears that at lower temperatures (such as 53 degrees) the damage was even more severe.
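The reading of the plot can also be checked numerically. The sketch below is an illustration rather than part of the original analysis: to stay self-contained, it rebuilds only the six rows printed by head(orings) in a data frame with the hypothetical name preview, so it says nothing about the rest of the dataset. With DAAG loaded, the same expressions can be run on orings itself.

```r
# Assumption: only the six rows shown by head(orings), retyped by hand.
# `preview` is a hypothetical name; use `orings` from DAAG for the full data.
preview <- data.frame(
  Temperature = c(53, 57, 58, 63, 66, 67),
  Total       = c(5, 1, 1, 1, 0, 0)
)

# Launches colder than 66 degrees F
cold <- subset(preview, Temperature < 66)

# TRUE if every one of those cold launches had some O-ring damage
all_cold_damaged <- all(cold$Total > 0)
all_cold_damaged

# The coldest launch in these rows and its total damage
coldest <- preview[which.min(preview$Temperature), ]
coldest
```

A check like this is no substitute for the plot, but it makes the claim explicit and easy to re-run when the data changes.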
Now, what temperature was forecasted for the day of the Challenger launch? 26 to 29 degrees. Let’s add that context to our plot:
annotation <- paste("26°-29°:", "Range of forecasted temperatures",
                    "for Jan. 28, 1986 Challenger launch", sep = "\n")
challengerPlot +
  annotate("rect", xmin = 26, xmax = 29, ymin = -0.15, ymax = 0.15,
           alpha = 0.6, fill = "grey60") +
  annotate("text", x = 26, y = 1.4, label = annotation, hjust = 0)
Now we have some information. The transformation of the raw data into a visualization makes it obvious that the temperature forecasted for the day of the Challenger launch should raise red flags. It falls far below the temperature range of prior launches, and those prior launches suggest that O-ring damage may be correlated with decreasing temperature.
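To put rough numbers on these two observations, here is a minimal sketch that uses only the six rows from the head(orings) preview (retyped into a hypothetical data frame named preview) together with the 26 to 29 degree forecast quoted above. It assumes the coldest previewed launch is also the coldest in the full data, which is worth confirming with min(orings$Temperature). The correlation is computed on six rows only, so its sign is the point, not its size.

```r
# Assumption: the six rows shown by head(orings), retyped by hand.
preview <- data.frame(
  Temperature = c(53, 57, 58, 63, 66, 67),
  Total       = c(5, 1, 1, 1, 0, 0)
)

# Forecast range for the Challenger launch, in degrees F
forecast <- c(low = 26, high = 29)

# Gap between the coldest launch in these rows and the warmest forecast value
gap <- min(preview$Temperature) - forecast[["high"]]
gap  # 53 - 29 = 24 degrees

# Pearson correlation between temperature and total damage;
# a negative value means more damage at lower temperatures
r <- cor(preview$Temperature, preview$Total)
r
```

Even the warmest forecast value sits well outside the temperatures at which any O-ring behavior had been observed, which is exactly what the shaded rectangle in the plot conveys at a glance.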
To their credit, the engineers working on the Challenger were worried about the potential for O-ring failure. But the critical step in making the link to temperature was not thoroughly communicated. Instead, the raw data was presented in tabular form along with diagrams like the one below, which show how erosion in the primary O-ring interacted with the secondary O-ring:
While the above diagram contains a lot of data, the critical information about the relationship between launch temperature and O-ring damage is not obvious. In contrast, the scatterplot achieves this without putting much cognitive load on the viewer. Just about anyone can look at that plot and understand that the forecasted temperature on January 28, 1986 might be a risk for O-ring failure.
References: