library(RCurl)
library(lemon)
library(dplyr)
library(data.table)
library(ggplot2)
library(sf)
library(rnaturalearth)
library(tidyverse)
library(ggspatial)
library(cartogram)
library(ggthemes)
knit_print.data.frame <- lemon_print

One of the many online courses for cleaning data in R can be found at https://www.datacamp.com/courses/cleaning-data-in-r.

Analysis of COVID cases

In this example, I would like to download Covid data and create a simple choropleth map with the country-level death rate. However, I need to clean the data first.

1. Download the data

John Hopkins University’s Center for Systems Science and Engineering has daily updates on the number of Covid-19 cases and publishes it on Github: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports. You can get data directly from a githunb CSV by using library RCurl: copy the link from Github page with raw data, use function getURL, and then read.csv. Whenever you imported data, always inspect it with head(), or click on the name of the data frame in the upper right corner of RStudio.

x <- getURL("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/09-26-2020.csv")
COVID_cases <- read.csv(text = x)

2. Explore the data with head() and summary() functions

head(COVID_cases)
FIPS Admin2 Province_State Country_Region Last_Update Lat Long_ Confirmed Deaths Recovered Active Combined_Key Incidence_Rate Case.Fatality_Ratio
NA Afghanistan 2020-09-27 04:23:45 33.93911 67.70995 39192 1453 32635 5104 Afghanistan 100.67729 3.707389
NA Albania 2020-09-27 04:23:45 41.15330 20.16830 13153 375 7397 5381 Albania 457.05052 2.851061
NA Algeria 2020-09-27 04:23:45 28.03390 1.65960 50914 1711 35756 13447 Algeria 116.10670 3.360569
NA Andorra 2020-09-27 04:23:45 42.50630 1.52180 1836 53 1263 520 Andorra 2376.23762 2.886710
NA Angola 2020-09-27 04:23:45 -11.20270 17.87390 4672 171 1639 2862 Angola 14.21518 3.660103
NA Antigua and Barbuda 2020-09-27 04:23:45 17.06080 -61.79640 98 3 92 3 Antigua and Barbuda 100.07352 3.061224

Summary() is useful for a quick overview of the columns in the data frame. It displays the length of the columns, number of missing values, as well as min, 1st quartile, median, mean, 3rd quartile and max values. It can point to outliers, messy data types in the columns and a general usefulness of the variables (columns) in your data.

summary(COVID_cases)
FIPS Admin2 Province_State Country_Region Last_Update Lat Long_ Confirmed Deaths Recovered Active Combined_Key Incidence_Rate Case.Fatality_Ratio
Min. : 66 Length:3955 Length:3955 Length:3955 Length:3955 Min. :-52.37 Min. :-174.16 Min. : 0.0 Min. : 0.0 Min. : 0 Min. :-2750459 Length:3955 Min. : 0 Min. : 0.0000
1st Qu.:19054 Class :character Class :character Class :character Class :character 1st Qu.: 33.27 1st Qu.: -96.61 1st Qu.: 152.5 1st Qu.: 1.0 1st Qu.: 0 1st Qu.: 125 Class :character 1st Qu.: 689 1st Qu.: 0.6823
Median :30069 Mode :character Mode :character Mode :character Mode :character Median : 37.94 Median : -86.88 Median : 537.0 Median : 9.0 Median : 0 Median : 463 Mode :character Median : 1332 Median : 1.6260
Mean :32395 Mean : 35.99 Mean : -72.34 Mean : 8281.0 Mean : 251.1 Mean : 5720 Mean : 2312 Mean : 1659 Mean : 2.3396
3rd Qu.:47038 3rd Qu.: 42.16 3rd Qu.: -77.64 3rd Qu.: 2226.5 3rd Qu.: 50.0 3rd Qu.: 0 3rd Qu.: 1530 3rd Qu.: 2271 3rd Qu.: 2.9432
Max. :99999 Max. : 71.71 Max. : 178.06 Max. :1300757.0 Max. :37270.0 Max. :2750459 Max. : 418503 Max. :15441 Max. :124.1779
NAs :696 NAs :81 NAs :81 NAs :3 NAs :81 NAs :47

Before you start editing and cleaning the data set, it’s always a good idea to make a copy you can come back to.

COVID_cases_copy <- copy(COVID_cases)

3. Select only the necessary columns

It seems that there are several columns we don’t need. Let’s select the ones we do.

COVID_cases <- select(COVID_cases, Province_State, Country_Region, Confirmed, Deaths)
head(COVID_cases)
##   Province_State      Country_Region Confirmed Deaths
## 1                        Afghanistan     39192   1453
## 2                            Albania     13153    375
## 3                            Algeria     50914   1711
## 4                            Andorra      1836     53
## 5                             Angola      4672    171
## 6                Antigua and Barbuda        98      3

Upon further inspection, it looks like that some countries have a country-level data, while the others have a province-level data.

Canada <- COVID_cases %>% filter(Country_Region == "Canada")
head(Canada)
##     Province_State Country_Region Confirmed Deaths
## 1          Alberta         Canada     17343    261
## 2 British Columbia         Canada      8641    230
## 3 Diamond Princess         Canada         0      1
## 4   Grand Princess         Canada        13      0
## 5         Manitoba         Canada      1829     19
## 6    New Brunswick         Canada       200      2

To get the country-level data, I need to group the provinces by the country, and then sum the numbers.

4. Group and aggregate the rows

Function group_by() is used to group the data according to one or more variables. Grouping doesn’t change how the data looks, but it tells R to look for the same values in the specified column.

Once grouped, values can be added, averaged, or counted (similar to the pivot tables in Excel).

Covid_by_country <- COVID_cases %>% 
  group_by(Country_Region) %>%
  summarize(Confirmed_by_country = sum(Confirmed), Deaths_by_country = sum(Deaths))
head(Covid_by_country)
## # A tibble: 6 x 3
##   Country_Region      Confirmed_by_country Deaths_by_country
##   <chr>                              <int>             <int>
## 1 Afghanistan                        39192              1453
## 2 Albania                            13153               375
## 3 Algeria                            50914              1711
## 4 Andorra                             1836                53
## 5 Angola                              4672               171
## 6 Antigua and Barbuda                   98                 3

Other functions that summarize() can take are n(), mean(), median(), min(), and max(). Let’s calculate death rate.

Covid_by_country <-  Covid_by_country %>%
  mutate(Death_rate = Deaths_by_country/Confirmed_by_country*100)

Let’s plot that.

world <- ne_countries(scale = 50, returnclass = "sf") %>%
  filter(name != "Antarctica")

covid_rate <- world %>%
  left_join(Covid_by_country, by = c("name_long" = "Country_Region"))


ggplot(covid_rate, aes(fill = Death_rate)) +
  geom_sf(color="grey") +
  scale_fill_gradient(low = "yellow", high = "red",
    name = "Covid Death rate [%]",
      breaks = seq(0, 30, by = 5),
    labels = formatC(seq(0, 30, by = 5), 
                     big.mark = ",", format = "f", digits = 1)) +
  theme_map()

5. Missing values

Some countries are missing! The most likely cause is that the names of a country is different in two data sets. Let’s explore missing countries further. Function is.na() helps us find missing values.

missing_countries <- covid_rate %>% filter(is.na(Death_rate))
n <- nrow(missing_countries)
print(paste("The number of missing countries is ", n))
## [1] "The number of missing countries is  72"

Upon inspecting missing_countries, it looks like some places are not sovereign states, but dependencies which might not have separate COVID data. Let’s focus only on sovereign countries.

missing_countries <- missing_countries %>% filter(type == "Sovereign country" | type=="Country")
missing_countries$Death_rate <- NULL
missing_countries$Deaths_by_country <- NULL
missing_countries$Confirmed_by_country <- NULL

n <- nrow(missing_countries)
print(paste("The number of missing countries we will analyze is ", n))
## [1] "The number of missing countries we will analyze is  39"

Same country, different name (enter James Franco meme)

When you compare COVID data and world data, it looks like countries are not missing, but they are named differently. For example, in the COVID data base, US is called United States, while in the map data - United States of America.To get around that, we can use fuzzyjoin. Fuzzyjoin is equivalent to join, except it matches partial names instead of the full entire name. For example, join_left will not join “Bahamas” and “The Bahamas”. Fuzzyjoin will. Let’s try that.

library(stringr)
library(fuzzyjoin)
missing_countries <- missing_countries %>% 
  fuzzy_inner_join(Covid_by_country, by = c("admin" = "Country_Region"), match_fun = str_detect)

n <- nrow(missing_countries)
print(paste("The number of missing countries for which we managed to find data is ", n))
## [1] "The number of missing countries for which we managed to find data is  7"
# select only the columns we need
missing_countries <- missing_countries %>% 
  select(admin, Death_rate, geometry)

covid_rate <- covid_rate %>%
  select(admin, Death_rate, geometry)
 
# match and replace the data we found for the missing countries with the original data set
covid_rate[match(missing_countries$admin, covid_rate$admin), ] <- missing_countries
ggplot(covid_rate, aes(fill = Death_rate)) +
  geom_sf(color="grey") +
  scale_fill_gradient(low = "yellow", high = "red",
    name = "Covid Death rate [%]",
      breaks = seq(0, 30, by = 5),
    labels = formatC(seq(0, 30, by = 5), 
                     big.mark = ",", format = "f", digits = 1)) +
  theme_map()

That’s mildly better, but still disappointing. The countries that are missing on the map either have:

  • completely different names that cannot be matched by a partial search (e.g. Czech Republic is called Czechia in the COVID database, “United States” in one data base is “US” in the other).
  • Data might actually be missing (e.g. data for Greenland doesn’t exist in the COVID database).

When you have missing data, the following options are available:

  • Exclude all rows or columns that contain missing values using the function na.exclude() or na.omit(). However this can be wasteful because it removes all rows, regardless if the row only has 1 missing value. Additionally, this will leave an empty space in the map.
  • Replace missing values with another value, such as zero, or the mean or median value for that column. Example of such code: data[is.na(data$column_name), ] <- 0.
  • Edit data by hand.

I will replace missing values with averages.

# calculate the average death rate for all countries. Make sure to include na.omit() to exclude NA values from the calculation. 
averate_death_rate <- mean(na.omit(covid_rate$Death_rate))

#replace na's with the average death rate calculated above
covid_rate <- covid_rate %>%
  mutate(Death_rate = ifelse(is.na(Death_rate),averate_death_rate, Death_rate))

Let’s see how that looks…

ggplot(covid_rate, aes(fill = Death_rate)) +
  geom_sf(color="grey") +
  scale_fill_gradient(low = "yellow", high = "red",
    name = "Covid Death rate [%]",
      breaks = seq(0, 30, by = 5),
    labels = formatC(seq(0, 30, by = 5), 
                     big.mark = ",", format = "f", digits = 1)) +
  theme_map()

6. Outliers

It looks like one country has significantly larger death rate than the others which is causing the rest to be almost indistinguishable from the others.

# the simplest plot funciton in R is plot()
plot(covid_rate$Death_rate)

Indeed, there’s one outlier. What happens with the map if we remov it from the data set?

covid_no_max <- covid_rate %>% filter(Death_rate < 25 | is.na(Death_rate))
ggplot(covid_no_max, aes(fill = Death_rate)) +
  geom_sf(color="grey") +
  scale_fill_gradient(low = "yellow", high = "red",
    name = "Covid Death rate [%]",
      breaks = seq(0, 15, by = 3),
    labels = formatC(seq(0, 15, by = 3), 
                     big.mark = ",", format = "f", digits = 1)) +
  theme_map()

Now the map looks a bit better, but the country with the maximum value is now gone from the map, which is not acceptable. Instead, we could manually set the value for the outlier to be equal to the second largest value. That way, we both preserve the maximum value and the differentiation between the countries on the map.

# find the second largest value in Death_rate
second_largest <- sort(covid_rate$Death_rate, decreasing = TRUE)[2]

# which.max returns the index of max value in Death_rate column
outlier_index <- which.max(covid_rate$Death_rate)

# let's replace max with the second largest
covid_rate[outlier_index,"Death_rate"] <- second_largest
ggplot(covid_rate, aes(fill = Death_rate)) +
  geom_sf(color="grey") +
  scale_fill_gradient(low = "yellow", high = "red",
    name = "Covid Death rate [%]",
      breaks = seq(0, 12, by = 3),
    labels = formatC(seq(0, 12, by = 3), 
                     big.mark = ",", format = "f", digits = 1)) +
  theme_map()