library(RCurl)
library(lemon)
library(dplyr)
library(data.table)
library(ggplot2)
library(sf)
library(rnaturalearth)
library(tidyverse)
library(ggspatial)
library(cartogram)
library(ggthemes)
knit_print.data.frame <- lemon_print
One of the many online courses for cleaning data in R can be found at https://www.datacamp.com/courses/cleaning-data-in-r.
In this example, I would like to download Covid data and create a simple choropleth map with the country-level death rate. However, I need to clean the data first.
John Hopkins University’s Center for Systems Science and Engineering has daily updates on the number of Covid-19 cases and publishes it on Github: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports. You can get data directly from a githunb CSV by using library RCurl: copy the link from Github page with raw data, use function getURL, and then read.csv. Whenever you imported data, always inspect it with head(), or click on the name of the data frame in the upper right corner of RStudio.
x <- getURL("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/09-26-2020.csv")
COVID_cases <- read.csv(text = x)
head(COVID_cases)
FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key | Incidence_Rate | Case.Fatality_Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NA | Afghanistan | 2020-09-27 04:23:45 | 33.93911 | 67.70995 | 39192 | 1453 | 32635 | 5104 | Afghanistan | 100.67729 | 3.707389 | ||
NA | Albania | 2020-09-27 04:23:45 | 41.15330 | 20.16830 | 13153 | 375 | 7397 | 5381 | Albania | 457.05052 | 2.851061 | ||
NA | Algeria | 2020-09-27 04:23:45 | 28.03390 | 1.65960 | 50914 | 1711 | 35756 | 13447 | Algeria | 116.10670 | 3.360569 | ||
NA | Andorra | 2020-09-27 04:23:45 | 42.50630 | 1.52180 | 1836 | 53 | 1263 | 520 | Andorra | 2376.23762 | 2.886710 | ||
NA | Angola | 2020-09-27 04:23:45 | -11.20270 | 17.87390 | 4672 | 171 | 1639 | 2862 | Angola | 14.21518 | 3.660103 | ||
NA | Antigua and Barbuda | 2020-09-27 04:23:45 | 17.06080 | -61.79640 | 98 | 3 | 92 | 3 | Antigua and Barbuda | 100.07352 | 3.061224 |
Summary() is useful for a quick overview of the columns in the data frame. It displays the length of the columns, number of missing values, as well as min, 1st quartile, median, mean, 3rd quartile and max values. It can point to outliers, messy data types in the columns and a general usefulness of the variables (columns) in your data.
summary(COVID_cases)
FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key | Incidence_Rate | Case.Fatality_Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. : 66 | Length:3955 | Length:3955 | Length:3955 | Length:3955 | Min. :-52.37 | Min. :-174.16 | Min. : 0.0 | Min. : 0.0 | Min. : 0 | Min. :-2750459 | Length:3955 | Min. : 0 | Min. : 0.0000 |
1st Qu.:19054 | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 33.27 | 1st Qu.: -96.61 | 1st Qu.: 152.5 | 1st Qu.: 1.0 | 1st Qu.: 0 | 1st Qu.: 125 | Class :character | 1st Qu.: 689 | 1st Qu.: 0.6823 |
Median :30069 | Mode :character | Mode :character | Mode :character | Mode :character | Median : 37.94 | Median : -86.88 | Median : 537.0 | Median : 9.0 | Median : 0 | Median : 463 | Mode :character | Median : 1332 | Median : 1.6260 |
Mean :32395 | Mean : 35.99 | Mean : -72.34 | Mean : 8281.0 | Mean : 251.1 | Mean : 5720 | Mean : 2312 | Mean : 1659 | Mean : 2.3396 | |||||
3rd Qu.:47038 | 3rd Qu.: 42.16 | 3rd Qu.: -77.64 | 3rd Qu.: 2226.5 | 3rd Qu.: 50.0 | 3rd Qu.: 0 | 3rd Qu.: 1530 | 3rd Qu.: 2271 | 3rd Qu.: 2.9432 | |||||
Max. :99999 | Max. : 71.71 | Max. : 178.06 | Max. :1300757.0 | Max. :37270.0 | Max. :2750459 | Max. : 418503 | Max. :15441 | Max. :124.1779 | |||||
NA s :696 |
NA s :81 |
NA s :81 |
NA s :3 |
NA s :81 |
NA s :47 |
Before you start editing and cleaning the data set, it’s always a good idea to make a copy you can come back to.
COVID_cases_copy <- copy(COVID_cases)
It seems that there are several columns we don’t need. Let’s select the ones we do.
COVID_cases <- select(COVID_cases, Province_State, Country_Region, Confirmed, Deaths)
head(COVID_cases)
## Province_State Country_Region Confirmed Deaths
## 1 Afghanistan 39192 1453
## 2 Albania 13153 375
## 3 Algeria 50914 1711
## 4 Andorra 1836 53
## 5 Angola 4672 171
## 6 Antigua and Barbuda 98 3
Upon further inspection, it looks like that some countries have a country-level data, while the others have a province-level data.
Canada <- COVID_cases %>% filter(Country_Region == "Canada")
head(Canada)
## Province_State Country_Region Confirmed Deaths
## 1 Alberta Canada 17343 261
## 2 British Columbia Canada 8641 230
## 3 Diamond Princess Canada 0 1
## 4 Grand Princess Canada 13 0
## 5 Manitoba Canada 1829 19
## 6 New Brunswick Canada 200 2
To get the country-level data, I need to group the provinces by the country, and then sum the numbers.
Function group_by() is used to group the data according to one or more variables. Grouping doesn’t change how the data looks, but it tells R to look for the same values in the specified column.
Once grouped, values can be added, averaged, or counted (similar to the pivot tables in Excel).
Covid_by_country <- COVID_cases %>%
group_by(Country_Region) %>%
summarize(Confirmed_by_country = sum(Confirmed), Deaths_by_country = sum(Deaths))
head(Covid_by_country)
## # A tibble: 6 x 3
## Country_Region Confirmed_by_country Deaths_by_country
## <chr> <int> <int>
## 1 Afghanistan 39192 1453
## 2 Albania 13153 375
## 3 Algeria 50914 1711
## 4 Andorra 1836 53
## 5 Angola 4672 171
## 6 Antigua and Barbuda 98 3
Other functions that summarize() can take are n(), mean(), median(), min(), and max(). Let’s calculate death rate.
Covid_by_country <- Covid_by_country %>%
mutate(Death_rate = Deaths_by_country/Confirmed_by_country*100)
Let’s plot that.
world <- ne_countries(scale = 50, returnclass = "sf") %>%
filter(name != "Antarctica")
covid_rate <- world %>%
left_join(Covid_by_country, by = c("name_long" = "Country_Region"))
ggplot(covid_rate, aes(fill = Death_rate)) +
geom_sf(color="grey") +
scale_fill_gradient(low = "yellow", high = "red",
name = "Covid Death rate [%]",
breaks = seq(0, 30, by = 5),
labels = formatC(seq(0, 30, by = 5),
big.mark = ",", format = "f", digits = 1)) +
theme_map()
Some countries are missing! The most likely cause is that the names of a country is different in two data sets. Let’s explore missing countries further. Function is.na() helps us find missing values.
missing_countries <- covid_rate %>% filter(is.na(Death_rate))
n <- nrow(missing_countries)
print(paste("The number of missing countries is ", n))
## [1] "The number of missing countries is 72"
Upon inspecting missing_countries, it looks like some places are not sovereign states, but dependencies which might not have separate COVID data. Let’s focus only on sovereign countries.
missing_countries <- missing_countries %>% filter(type == "Sovereign country" | type=="Country")
missing_countries$Death_rate <- NULL
missing_countries$Deaths_by_country <- NULL
missing_countries$Confirmed_by_country <- NULL
n <- nrow(missing_countries)
print(paste("The number of missing countries we will analyze is ", n))
## [1] "The number of missing countries we will analyze is 39"
When you compare COVID data and world data, it looks like countries are not missing, but they are named differently. For example, in the COVID data base, US is called United States, while in the map data - United States of America.To get around that, we can use fuzzyjoin. Fuzzyjoin is equivalent to join, except it matches partial names instead of the full entire name. For example, join_left will not join “Bahamas” and “The Bahamas”. Fuzzyjoin will. Let’s try that.
library(stringr)
library(fuzzyjoin)
missing_countries <- missing_countries %>%
fuzzy_inner_join(Covid_by_country, by = c("admin" = "Country_Region"), match_fun = str_detect)
n <- nrow(missing_countries)
print(paste("The number of missing countries for which we managed to find data is ", n))
## [1] "The number of missing countries for which we managed to find data is 7"
# select only the columns we need
missing_countries <- missing_countries %>%
select(admin, Death_rate, geometry)
covid_rate <- covid_rate %>%
select(admin, Death_rate, geometry)
# match and replace the data we found for the missing countries with the original data set
covid_rate[match(missing_countries$admin, covid_rate$admin), ] <- missing_countries
ggplot(covid_rate, aes(fill = Death_rate)) +
geom_sf(color="grey") +
scale_fill_gradient(low = "yellow", high = "red",
name = "Covid Death rate [%]",
breaks = seq(0, 30, by = 5),
labels = formatC(seq(0, 30, by = 5),
big.mark = ",", format = "f", digits = 1)) +
theme_map()
That’s mildly better, but still disappointing. The countries that are missing on the map either have:
When you have missing data, the following options are available:
I will replace missing values with averages.
# calculate the average death rate for all countries. Make sure to include na.omit() to exclude NA values from the calculation.
averate_death_rate <- mean(na.omit(covid_rate$Death_rate))
#replace na's with the average death rate calculated above
covid_rate <- covid_rate %>%
mutate(Death_rate = ifelse(is.na(Death_rate),averate_death_rate, Death_rate))
Let’s see how that looks…
ggplot(covid_rate, aes(fill = Death_rate)) +
geom_sf(color="grey") +
scale_fill_gradient(low = "yellow", high = "red",
name = "Covid Death rate [%]",
breaks = seq(0, 30, by = 5),
labels = formatC(seq(0, 30, by = 5),
big.mark = ",", format = "f", digits = 1)) +
theme_map()
It looks like one country has significantly larger death rate than the others which is causing the rest to be almost indistinguishable from the others.
# the simplest plot funciton in R is plot()
plot(covid_rate$Death_rate)
Indeed, there’s one outlier. What happens with the map if we remov it from the data set?
covid_no_max <- covid_rate %>% filter(Death_rate < 25 | is.na(Death_rate))
ggplot(covid_no_max, aes(fill = Death_rate)) +
geom_sf(color="grey") +
scale_fill_gradient(low = "yellow", high = "red",
name = "Covid Death rate [%]",
breaks = seq(0, 15, by = 3),
labels = formatC(seq(0, 15, by = 3),
big.mark = ",", format = "f", digits = 1)) +
theme_map()
Now the map looks a bit better, but the country with the maximum value is now gone from the map, which is not acceptable. Instead, we could manually set the value for the outlier to be equal to the second largest value. That way, we both preserve the maximum value and the differentiation between the countries on the map.
# find the second largest value in Death_rate
second_largest <- sort(covid_rate$Death_rate, decreasing = TRUE)[2]
# which.max returns the index of max value in Death_rate column
outlier_index <- which.max(covid_rate$Death_rate)
# let's replace max with the second largest
covid_rate[outlier_index,"Death_rate"] <- second_largest
ggplot(covid_rate, aes(fill = Death_rate)) +
geom_sf(color="grey") +
scale_fill_gradient(low = "yellow", high = "red",
name = "Covid Death rate [%]",
breaks = seq(0, 12, by = 3),
labels = formatC(seq(0, 12, by = 3),
big.mark = ",", format = "f", digits = 1)) +
theme_map()