ANLY-503: Advanced Data Visualization
The ideas behind visual EDA date back over 100 years
Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone -- as the first step
John Tukey, 1977
An approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to maximize insight into a data set, uncover underlying structure, extract important variables, detect outliers and anomalies, test underlying assumptions, develop parsimonious models and determine optimal factor settings.
U.S. National Institute of Standards and Technology
Maximize the analyst’s insight into a data set and into the underlying structure of a data set, while providing all of the specific items that an analyst would want to extract from a data set
There are of course many univariate and bivariate visualizations you can use to understand your data, including histograms, density plots, scatter plots, and the like.
Our first steps should be to get an overall view of the dataset to see if
We’ll look at some tools that summarize the whole dataset
ggplot2::msleep
Rows: 83
Columns: 11
$ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater shor…
$ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bra…
$ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "carn…
$ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Art…
$ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "dome…
$ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5…
$ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, …
$ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, N…
$ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0, 1…
$ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0…
$ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.04…
Not very useful, since we can’t see much information
     name              genus               vore              order
 Length:83          Length:83          Length:83          Length:83
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
 conservation        sleep_total      sleep_rem      sleep_cycle
 Length:83          Min.   : 1.90   Min.   :0.100   Min.   :0.1167
 Class :character   1st Qu.: 7.85   1st Qu.:0.900   1st Qu.:0.1833
 Mode  :character   Median :10.10   Median :1.500   Median :0.3333
                    Mean   :10.43   Mean   :1.875   Mean   :0.4396
                    3rd Qu.:13.75   3rd Qu.:2.400   3rd Qu.:0.5792
                    Max.   :19.90   Max.   :6.600   Max.   :1.5000
                                    NA's   :22      NA's   :51
     awake          brainwt            bodywt
 Min.   : 4.10   Min.   :0.00014   Min.   :   0.005
 1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174
 Median :13.90   Median :0.01240   Median :   1.670
 Mean   :13.57   Mean   :0.28158   Mean   : 166.136
 3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750
 Max.   :22.10   Max.   :5.71200   Max.   :6654.000
                 NA's   :27
A little better, but not great. Good univariate summaries
We can also look at how correlated the numerical variables are with each other using a correlation heatmap
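A correlation heatmap can be sketched in Python with pandas and matplotlib. This is a minimal sketch, not the code behind the slide: it uses only the first seven rows of msleep as they appear in the glimpse output above, and the helper name `plot_corr_heatmap` is hypothetical. pandas computes each correlation from pairwise-complete observations, so with this few rows the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

nan = np.nan

# First seven rows of the numeric msleep columns, copied from the glimpse output.
msleep_head = pd.DataFrame({
    "sleep_total": [12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7],
    "sleep_rem":   [nan, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4],
    "sleep_cycle": [nan, nan, nan, 0.1333333, 0.6666667, 0.7666667, 0.3833333],
    "awake":       [11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3],
    "brainwt":     [nan, 0.01550, nan, 0.00029, 0.42300, nan, nan],
    "bodywt":      [50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490],
})

# Pearson correlations; missing values are dropped pair by pair.
corr = msleep_head.corr()


def plot_corr_heatmap(corr_matrix):
    """Draw a correlation matrix as a heatmap (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    fig, ax = plt.subplots()
    image = ax.imshow(corr_matrix.to_numpy(), cmap="RdBu_r", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr_matrix.columns)))
    ax.set_xticklabels(corr_matrix.columns, rotation=45, ha="right")
    ax.set_yticks(range(len(corr_matrix.index)))
    ax.set_yticklabels(corr_matrix.index)
    fig.colorbar(image, ax=ax, label="Pearson correlation")
    fig.tight_layout()
    return fig


print(corr.round(2))
```

Calling `plot_corr_heatmap(corr)` draws the figure. In these seven rows sleep_total and awake always sum to 24 hours, so their correlation is -1.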
We can also check whether the data meet expectations, or whether there are “outliers” or potential issues in particular observations
This visualization provides both missing data patterns and summary statistics about the missing data
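The summary statistics in such a plot can be approximated with pandas alone. The sketch below is an assumption about what is being summarized, not the code that produced the figure, and it uses only the first seven rows of msleep from the glimpse output above: it counts missing values per column, expresses them as a percentage, and counts the rows with no missing values.

```python
import numpy as np
import pandas as pd

nan = np.nan

# First seven rows of selected msleep columns, copied from the glimpse output.
msleep_head = pd.DataFrame({
    "conservation": ["lc", None, "nt", "lc", "domesticated", None, "vu"],
    "sleep_total":  [12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7],
    "sleep_rem":    [nan, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4],
    "sleep_cycle":  [nan, nan, nan, 0.1333333, 0.6666667, 0.7666667, 0.3833333],
    "brainwt":      [nan, 0.01550, nan, 0.00029, 0.42300, nan, nan],
})

# Boolean frame: True wherever a value is missing.
is_missing = msleep_head.isna()

# Per-column count and percentage of missing values, most missing first.
missing_summary = pd.DataFrame({
    "n_missing": is_missing.sum(),
    "pct_missing": (is_missing.mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

# Rows in which no column is missing.
n_complete_rows = int((~is_missing.any(axis=1)).sum())

print(missing_summary)
print("complete rows:", n_complete_rows)
```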
The naniar package by Nicholas Tierney provides more detailed looks at missing data patterns
This is a clever use of a standard visualization where the red dots show the values of one variable when the other variable is missing. This can show
The klib package
There are a couple of packages in Python for visually looking at full datasets and missing patterns: klib and missingno.
The klib package is more feature rich
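Both packages expose one-line plotting calls on a pandas DataFrame. The sketch below assumes missingno and klib are installed. The wrapper `plot_missingness` and its `backend` argument are hypothetical names, used only to put the two calls side by side. The data are the first seven rows of three msleep columns from the glimpse output above.

```python
import numpy as np
import pandas as pd

nan = np.nan

# First seven rows of three msleep columns, copied from the glimpse output.
msleep_head = pd.DataFrame({
    "conservation": ["lc", None, "nt", "lc", "domesticated", None, "vu"],
    "sleep_rem":    [nan, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4],
    "brainwt":      [nan, 0.01550, nan, 0.00029, 0.42300, nan, nan],
})


def plot_missingness(df, backend="missingno"):
    """Plot missing-data patterns with one of the two packages (hypothetical wrapper)."""
    if backend == "missingno":
        import missingno as msno  # assumed to be installed
        return msno.matrix(df)  # one row per observation, gaps where data are missing
    if backend == "klib":
        import klib  # assumed to be installed
        return klib.missingval_plot(df)  # missing-value overview with marginal summaries
    raise ValueError(f"unknown backend: {backend!r}")
```

missingno also provides `msno.bar`, `msno.heatmap` and `msno.dendrogram`. klib adds plots beyond missingness, such as `klib.corr_plot`, `klib.cat_plot` and `klib.dist_plot`.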
Looking at multivariate relationships among categorical variables
applied to missing data
The UpSet plot was originally developed at Harvard in 2014.
The main purpose was to solve, in an intuitive manner, the problem of visualizing set intersections when you have many sets (an extension of Venn diagrams)
It tries to solve the problem created by the following visualization looking at the intersection of 6 sets
Let’s look at this from a missing data perspective. Each “set” is the missing/non-missing annotation of each variable in a data set, and we’re interested in when the missing data co-occur.
UpSetR package
conda install -c conda-forge upsetplot
vore   sleep_rem  brainwt  conservation  sleep_cycle
False  False      False    False         False          20
                                         True            9
                           True          False           9
                                         True            5
                  True     False         False           1
                                         True           10
                           True          False           1
                                         True            1
       True       False    False         True            5
                           True          True            3
                  True     False         True            7
                           True          True            5
True   False      False    False         True            2
                           True          False           1
                                         True            2
       True       True     True          True            2
dtype: int64
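Counts of this shape can be produced from the missingness indicators with pandas alone. The sketch below is an assumption about the approach, not the code that produced the output above. It uses only the first seven rows of msleep from the glimpse output, so its counts are much smaller. The helper name `plot_upset` is hypothetical and needs the upsetplot package.

```python
import numpy as np
import pandas as pd

nan = np.nan

# First seven rows of the msleep columns with missing values (from the glimpse output).
msleep_head = pd.DataFrame({
    "conservation": ["lc", None, "nt", "lc", "domesticated", None, "vu"],
    "sleep_rem":    [nan, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4],
    "sleep_cycle":  [nan, nan, nan, 0.1333333, 0.6666667, 0.7666667, 0.3833333],
    "brainwt":      [nan, 0.01550, nan, 0.00029, 0.42300, nan, nan],
})

# Each column becomes a "set": True means the value is missing in that row.
is_missing = msleep_head.isna()

# Count how often each combination of missing/non-missing occurs.
# The result is a Series indexed by a boolean MultiIndex, one level per variable.
pattern_counts = is_missing.groupby(list(is_missing.columns)).size()


def plot_upset(counts):
    """Draw an UpSet plot from the pattern counts (hypothetical helper)."""
    import upsetplot  # assumed to be installed

    return upsetplot.plot(counts)


print(pattern_counts)
```

Passing `pattern_counts` to `plot_upset` draws one bar per observed combination, with the dot matrix below the bars showing which variables are missing together.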
The UpSet.js library is a JS re-implementation of the UpSetR R package. It is wrapped in an htmlwidget and provided as the R package upsetjs
Shifting gears
Stories are a survival mechanism across generations
Often, your job as a data scientist is to be an effective communicator
There is more to communication than numbers on a paper
Stories are up to 22 times more memorable than facts alone
When in doubt, tell stories
Data stories appear to be most effective when they have constrained interaction at various checkpoints within a narrative, allowing the user to explore the data without veering too far from the intended narrative.
It is not merely:
a technical matter of creating an image
designing the right chart
Rather it is:
To tell a story you have to define a story
A story is how what happens affects someone who is trying to achieve what turns out to be a difficult goal, and how they change as a result
The old adage on how to present anything:
https://medium.com/nightingale/the-past-present-and-future-of-scrollytelling-10dd37dc1003
https://opensourcelibs.com/lib/rolldown
https://elementor.com/blog/guide-to-scrollytelling/
https://www.visualstorytell.com/blog/what-is-visual-storytelling
https://mathisonian.github.io/idyll/scaffolding-interactives/
https://idl.cs.washington.edu/
https://distill.pub/2020/communicating-with-interactive-articles/
Let's take a 10-minute break before moving on to the lab.