Tidyverse

Manipulate, format, and visualize data with ease

Groupe BioStatInfo

A wise man once said …

“Happy families are all alike; every unhappy family is unhappy in its own way.” –– Leo Tolstoy

… and the one we care about today took it one step further:

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham.

Messy and unhappy

The world of messy data is vast …

# A tibble: 6 × 4
  country      year type           count
  <chr>       <dbl> <chr>          <dbl>
1 Afghanistan  1999 cases            745
2 Afghanistan  1999 population  19987071
3 Afghanistan  2000 cases           2666
4 Afghanistan  2000 population  20595360
5 Brazil       1999 cases          37737
6 Brazil       1999 population 172006362

… and varied

# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Messy and unhappy

Sometimes it is even fragmented and scattered across several tables!

# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
# A tibble: 3 × 3
  country         `1999`     `2000`
  <chr>            <dbl>      <dbl>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583


Fortunately, there is …

A galaxy of tools

And a philosophy

The 3 commandments of the tidyverse:

Each variable must have its own column.

Each observation must have its own row.

Each value must have its own cell.

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
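
As a rough illustration of the three rules, the sketch below tidies each messy layout shown earlier. It assumes those tables are the example datasets bundled with tidyr (`table2`, `table3`, `table4a`, `table4b`); the names `tidy2`, `tidy3` and `tidy4` are only illustrative.

```r
library(dplyr)
library(tidyr)

# table2: one observation is spread over two rows (cases, population).
tidy2 <- table2 %>%
  pivot_wider(names_from = type, values_from = count)

# table3: two values share a single cell ("cases/population").
tidy3 <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

# table4a / table4b: the year is stored in the column names, and cases and
# population live in two separate tables.
tidy4 <- left_join(
  pivot_longer(table4a, -country, names_to = "year", values_to = "cases"),
  pivot_longer(table4b, -country, names_to = "year", values_to = "population"),
  by = c("country", "year")
)
```

All three results have one row per country and year, with `cases` and `population` as separate columns. Note that `year` stays a character column in `tidy4`, because it comes from column names.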

Installation and loading

#install.packages("tidyverse") 
library(tidyverse)


tidyverse is a meta-package: loading it attaches a set of very handy packages that together form a common foundation for clean, reproducible, modern R code.

We will now walk through a few recurring use cases, all built around the same illustrative dataset.

readr: data import


Reimplements R's base import functions and loads datasets as tibbles.


  • read_csv(): comma-separated values (CSV)
  • read_tsv(): tab-separated values (TSV)
  • read_csv2(): semicolon-separated values
  • read_delim(): delimited files, more options
write_csv(table1, "table1.csv")

data = read_csv("table1.csv")
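
By default `read_csv()` guesses the column types. A minimal sketch of declaring them explicitly is given below; the temporary file path and the choice of an integer `year` column are assumptions made for the demonstration.

```r
library(readr)
library(tidyr)  # provides the example dataset table1

# Write to a temporary file so the example leaves nothing behind.
path <- tempfile(fileext = ".csv")
write_csv(table1, path)

# Declare the expected type of every column instead of letting readr guess.
data <- read_csv(
  path,
  col_types = cols(
    country    = col_character(),
    year       = col_integer(),
    cases      = col_double(),
    population = col_double()
  )
)
```

Spelling out `col_types` makes the import fail loudly, through parsing problems, when a file does not match what the script expects.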

tibble: modern data.frame

data_df = as.data.frame(data)
data_df
      country year  cases population
1 Afghanistan 1999    745   19987071
2 Afghanistan 2000   2666   20595360
3      Brazil 1999  37737  172006362
4      Brazil 2000  80488  174504898
5       China 1999 212258 1272915272
6       China 2000 213766 1280428583
as_tibble(data_df)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
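
Beyond the nicer printing, a tibble is stricter than a data.frame. The sketch below illustrates two well-known differences using `table1` from tidyr; the variable names are illustrative.

```r
library(tibble)
library(tidyr)  # provides the example dataset table1

df <- as.data.frame(table1)
tb <- as_tibble(df)

# Selecting one column with [ , ]: the data.frame drops to a bare vector,
# the tibble stays a one-column tibble.
col_df <- df[, "cases"]
col_tb <- tb[, "cases"]

# Partial matching with $: the data.frame silently matches `population`,
# the tibble returns NULL and warns about the unknown column.
pop_df <- df$pop
pop_tb <- suppressWarnings(tb$pop)
```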

magrittr: this is a pipe %>%

Introduces the pipe operator %>% (Ctrl+Shift+M in RStudio), which chains a sequence of instructions: the result of the left-hand side becomes the first argument of the function on the right.

data %>% 
  filter(year == 1999) %>% 
  pull(cases) %>% 
  mean()
[1] 83580

Since R 4.1, a native pipe operator |> is also available, and it is meant to replace %>% in the long run.

data |>
  filter(year == 1999) |>
  pull(cases) |>
  mean()
[1] 83580
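
To make explicit what the pipe does, the sketch below writes the same computation three ways on `table1` from tidyr: as nested calls, with `%>%`, and with `|>` (which requires R 4.1 or later). The variable names are illustrative.

```r
library(dplyr)
library(tidyr)  # provides the example dataset table1

# Nested calls read inside-out.
nested <- mean(pull(filter(table1, year == 1999), cases))

# Piped calls read top to bottom, one step per function.
piped  <- table1 %>% filter(year == 1999) %>% pull(cases) %>% mean()
native <- table1 |> filter(year == 1999) |> pull(cases) |> mean()

all.equal(nested, piped)
all.equal(nested, native)
```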

tidyr: tidy data for a tidy mind


Let's start simple, with a recurring problem: missing data.

data[5,3] = NA 
data
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999     NA 1272915272
6 China        2000 213766 1280428583

There are several ways to handle this.

Dropping the rows that contain NA values.


data %>% 
  drop_na()
# A tibble: 5 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        2000 213766 1280428583

Replacing them with a given value.


data %>% 
  replace_na(list(cases = 42))
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999     42 1272915272
6 China        2000 213766 1280428583

Replacing them with the previous value.


data %>%
  fill(cases)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999  80488 1272915272
6 China        2000 213766 1280428583

Replacing them with the next value.


data %>% 
  fill(cases, .direction = "up")
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 213766 1272915272
6 China        2000 213766 1280428583
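
Note that a plain `fill(cases)` carries Brazil's 2000 value into China's 1999 row. One possible way to keep the fill inside each country is to group first, as sketched below. The copy `data_na` and the choice of `.direction = "downup"` are assumptions made for this illustration.

```r
library(dplyr)
library(tidyr)  # provides fill() and the example dataset table1

# Recreate the missing value from the example above (China, 1999).
data_na <- table1
data_na$cases[5] <- NA

# Fill within each country only: first downwards, then upwards for
# values that have no predecessor in their group.
filled <- data_na %>%
  group_by(country) %>%
  fill(cases, .direction = "downup") %>%
  ungroup()
```

Here the missing value is taken from China's own 2000 row rather than from another country.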

tidyr: force them to be tidy

data %>% pivot_wider(names_from = country, values_from = cases)
# A tibble: 6 × 5
   year population Afghanistan Brazil  China
  <dbl>      <dbl>       <dbl>  <dbl>  <dbl>
1  1999   19987071         745     NA     NA
2  2000   20595360        2666     NA     NA
3  1999  172006362          NA  37737     NA
4  2000  174504898          NA  80488     NA
5  1999 1272915272          NA     NA 212258
6  2000 1280428583          NA     NA 213766
data %>% pivot_wider(names_from = country, values_from = c(cases, population))
# A tibble: 2 × 7
   year cases_Afghanistan cases_Brazil cases_China population_Afghanistan
  <dbl>             <dbl>        <dbl>       <dbl>                  <dbl>
1  1999               745        37737      212258               19987071
2  2000              2666        80488      213766               20595360
# ℹ 2 more variables: population_Brazil <dbl>, population_China <dbl>
data %>% 
  pivot_longer(cols = c(cases, population), names_to = "type", values_to = "number")
# A tibble: 12 × 4
   country      year type           number
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
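
In the first `pivot_wider()` call above, `population` is kept as an identifier column, which is why the result has six rows full of NA. A minimal sketch of one way around this, assuming only `year` should identify a row, uses the `id_cols` argument; the names `wide` and `long` are illustrative.

```r
library(dplyr)
library(tidyr)  # provides the example dataset table1

# Keep only `year` as the row identifier; `population` is dropped.
wide <- table1 %>%
  pivot_wider(id_cols = year, names_from = country, values_from = cases)

# pivot_longer() reverses the operation.
long <- wide %>%
  pivot_longer(-year, names_to = "country", values_to = "cases")
```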

dplyr: the one and only

If only one package were to remain, it would be this one.

data %>% 
  filter(year == 1999)
# A tibble: 3 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Brazil       1999  37737  172006362
3 China        1999 212258 1272915272
data %>% 
  slice(2:4)
# A tibble: 3 × 4
  country      year cases population
  <chr>       <dbl> <dbl>      <dbl>
1 Afghanistan  2000  2666   20595360
2 Brazil       1999 37737  172006362
3 Brazil       2000 80488  174504898
data %>% 
  slice_sample(prop = 0.5) ## random sample
# A tibble: 3 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 China        2000 213766 1280428583
2 Afghanistan  1999    745   19987071
3 Brazil       1999  37737  172006362
data %>% select(c(country, cases))
# A tibble: 6 × 2
  country      cases
  <chr>        <dbl>
1 Afghanistan    745
2 Afghanistan   2666
3 Brazil       37737
4 Brazil       80488
5 China       212258
6 China       213766
data %>% pull(year)
[1] 1999 2000 1999 2000 1999 2000
data %>% summarize(mean_cases = mean(cases))
# A tibble: 1 × 1
  mean_cases
       <dbl>
1     91277.
data %>% count(country)
# A tibble: 3 × 2
  country         n
  <chr>       <int>
1 Afghanistan     2
2 Brazil          2
3 China           2
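
These verbs are designed to be combined. The sketch below chains several of them on `table1` from tidyr, and shows that `slice_sample()` can be made reproducible by fixing the random seed; the seed value 42 is arbitrary.

```r
library(dplyr)
library(tidyr)  # provides the example dataset table1

# Filter rows, keep two columns, then summarise in a single chain.
summary_2000 <- table1 %>%
  filter(year == 2000) %>%
  select(country, cases) %>%
  summarize(mean_cases = mean(cases), n = n())

# The same seed gives the same random sample.
set.seed(42)
s1 <- table1 %>% slice_sample(n = 3)
set.seed(42)
s2 <- table1 %>% slice_sample(n = 3)
identical(s1, s2)
```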

dplyr: to gather them all

data2 = tibble(country='France', year=2001, cases=42, population=42)

data %>% 
  bind_rows(data2)
# A tibble: 7 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
7 France       2001     42         42
data %>% 
  bind_cols(Month = c(5, 7, 12, 3, 8, 7))
# A tibble: 6 × 5
  country      year  cases population Month
  <chr>       <dbl>  <dbl>      <dbl> <dbl>
1 Afghanistan  1999    745   19987071     5
2 Afghanistan  2000   2666   20595360     7
3 Brazil       1999  37737  172006362    12
4 Brazil       2000  80488  174504898     3
5 China        1999 212258 1272915272     8
6 China        2000 213766 1280428583     7
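data3 = data %>% rename(disease = cases, pop = population)

# Name the key columns explicitly instead of relying on the automatic match
data %>% 
  left_join(data3, by = c("country", "year"))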
# A tibble: 6 × 6
  country      year  cases population disease        pop
  <chr>       <dbl>  <dbl>      <dbl>   <dbl>      <dbl>
1 Afghanistan  1999    745   19987071     745   19987071
2 Afghanistan  2000   2666   20595360    2666   20595360
3 Brazil       1999  37737  172006362   37737  172006362
4 Brazil       2000  80488  174504898   80488  174504898
5 China        1999 212258 1272915272  212258 1272915272
6 China        2000 213766 1280428583  213766 1280428583

There are also right_join, inner_join and full_join.
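
The join variants only differ when some keys have no match. The sketch below uses two small toy tables built for this illustration (their names and their partial overlap are assumptions; the values are reused from the examples above).

```r
library(dplyr)

cases <- tibble(
  country = c("Afghanistan", "Brazil", "France"),
  cases   = c(745, 37737, 42)
)
pop <- tibble(
  country    = c("Brazil", "China"),
  population = c(172006362, 1272915272)
)

left  <- left_join(cases, pop, by = "country")   # all rows of `cases`
right <- right_join(cases, pop, by = "country")  # all rows of `pop`
inner <- inner_join(cases, pop, by = "country")  # matching rows only
full  <- full_join(cases, pop, by = "country")   # all rows of both tables
```

Only Brazil appears in both tables, so the results have 3, 2, 1 and 4 rows respectively, with NA wherever a value is missing on one side.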

dplyr: we are better as a group

data %>% 
  mutate(sum = cases + population)
# A tibble: 6 × 5
  country      year  cases population        sum
  <chr>       <dbl>  <dbl>      <dbl>      <dbl>
1 Afghanistan  1999    745   19987071   19987816
2 Afghanistan  2000   2666   20595360   20598026
3 Brazil       1999  37737  172006362  172044099
4 Brazil       2000  80488  174504898  174585386
5 China        1999 212258 1272915272 1273127530
6 China        2000 213766 1280428583 1280642349
data %>% 
  group_by(year) %>% 
  summarize(cases = mean(cases))
# A tibble: 2 × 2
   year  cases
  <dbl>  <dbl>
1  1999 83580 
2  2000 98973.
data %>% 
  group_by(year) %>% 
  summarise(across(c(cases, population), mean))
# A tibble: 2 × 3
   year  cases population
  <dbl>  <dbl>      <dbl>
1  1999 83580  488302902.
2  2000 98973. 491842947 
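
`across()` also accepts a named list of functions, which gives several summaries per column in one call. A minimal sketch on `table1` from tidyr follows; the choice of `mean` and `max` is only an example.

```r
library(dplyr)
library(tidyr)  # provides the example dataset table1

# One output column per (input column, function) pair,
# named "<column>_<function>" by default.
stats <- table1 %>%
  group_by(year) %>%
  summarise(across(c(cases, population), list(mean = mean, max = max)))

names(stats)
```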

A few other packages

purrr: applies a function to each element of a list or to each column of a table, avoiding explicit loops.

data %>% 
  map(mean)
$country
[1] NA

$year
[1] 1999.5

$cases
[1] 91276.67

$population
[1] 490072924
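
In the output above, `mean()` returns NA (with a warning) for the character column `country`. A small sketch of one way to avoid this, assuming only numeric columns are of interest, selects them first and uses the typed variant `map_dbl()`:

```r
library(dplyr)
library(purrr)
library(tidyr)  # provides the example dataset table1

# Keep numeric columns only, then return a named numeric vector
# instead of a list.
means <- table1 %>%
  select(where(is.numeric)) %>%
  map_dbl(mean)

means
```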

lubridate: makes dates, durations, and the like easier to handle. In the example below, as_date() reads each number as a count of days since 1970-01-01.

data %>% 
  mutate(date = as_date(cases))
# A tibble: 6 × 5
  country      year  cases population date      
  <chr>       <dbl>  <dbl>      <dbl> <date>    
1 Afghanistan  1999    745   19987071 1972-01-16
2 Afghanistan  2000   2666   20595360 1977-04-20
3 Brazil       1999  37737  172006362 2073-04-27
4 Brazil       2000  80488  174504898 2190-05-15
5 China        1999 212258 1272915272 2551-02-22
6 China        2000 213766 1280428583 2555-04-10
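
Converting case counts to dates is only a toy example. A slightly more realistic sketch builds a date from the `year` column and extracts components back; placing each observation on 1 January is an assumption made for this illustration.

```r
library(dplyr)
library(lubridate)
library(tidyr)  # provides the example dataset table1

dated <- table1 %>%
  mutate(
    date    = make_date(year, 1, 1),          # hypothetical: 1 January of each year
    weekday = wday(date, week_start = 1)      # 1 = Monday ... 7 = Sunday
  )

# A number passed to as_date() is a count of days since 1970-01-01.
epoch_plus_745 <- as_date(745)
```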

stringr: makes character strings easier to handle.

data %>% 
  mutate(is_brazil = str_detect(country, 'raz'))
# A tibble: 6 × 5
  country      year  cases population is_brazil
  <chr>       <dbl>  <dbl>      <dbl> <lgl>    
1 Afghanistan  1999    745   19987071 FALSE    
2 Afghanistan  2000   2666   20595360 FALSE    
3 Brazil       1999  37737  172006362 TRUE     
4 Brazil       2000  80488  174504898 TRUE     
5 China        1999 212258 1272915272 FALSE    
6 China        2000 213766 1280428583 FALSE    
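
`str_detect()` interprets its pattern as a regular expression by default. The sketch below shows an anchored pattern, a literal match with `fixed()`, and two other common helpers, on a small illustrative vector.

```r
library(stringr)

countries <- c("Afghanistan", "Brazil", "China")

starts_with_b <- str_detect(countries, "^B")          # regex anchored at the start
has_raz       <- str_detect(countries, fixed("raz"))  # literal, not a regex
upper         <- str_to_upper(countries)
prefix        <- str_sub(countries, 1, 3)             # first three characters
```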

forcats: makes factors easier to handle.

data %>% 
  mutate(year_fact = as_factor(year)) %>% 
  pull(year_fact) %>% 
  levels()
[1] "1999" "2000"
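
One practical difference from base R concerns the order of levels for character input, which matters for plot axes and legends. A short sketch on an illustrative vector:

```r
library(forcats)

countries <- c("China", "Brazil", "Afghanistan", "Brazil")

# Base R sorts the levels alphabetically.
lv_base <- levels(factor(countries))

# as_factor() keeps the order of first appearance.
lv_fct <- levels(as_factor(countries))

# fct_rev() reverses the level order.
lv_rev <- levels(fct_rev(as_factor(countries)))
```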

ggplot2: best graphs in town


  • ggplot(data, aes(x, y)) + geom_*()

  • The three key elements:

    1. Data (data)

    2. Aesthetics (aes()): the variables to plot

    3. Geometries (geom_*()): the type of plot

A ggplot2 plot is a stack of layers.

ggplot(data, aes(x = cases, y = population)) + geom_point()
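
The layering idea can be sketched by building a plot step by step on `table1` from tidyr. The object names and the added line layer are illustrative choices.

```r
library(ggplot2)
library(tidyr)  # provides the example dataset table1

# Data and aesthetics alone give an empty canvas.
base <- ggplot(table1, aes(x = cases, y = population))

# Each `+ geom_*()` adds one layer on top of the previous ones.
p <- base +
  geom_point() +
  geom_line(aes(group = country))

# Data actually drawn by the first layer (the points).
points_data <- layer_data(p, 1)
```

Printing `p` draws the points first and the per-country lines on top of them.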

Customizing plots

library(InraeThemes)

ggplot(data, aes(x = cases, y = population, color = country)) + 
  geom_point() +
  theme_inrae()
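
InraeThemes may not be installed everywhere. As a rough alternative, the sketch below customizes the same plot with functions that ship with ggplot2 itself; it swaps in `theme_minimal()` for `theme_inrae()`, and the label texts are made up for the illustration.

```r
library(ggplot2)
library(tidyr)  # provides the example dataset table1

p <- ggplot(table1, aes(x = cases, y = population, color = country)) +
  geom_point() +
  labs(
    title = "Cases versus population",  # illustrative labels
    x     = "Cases",
    y     = "Population",
    color = "Country"
  ) +
  theme_minimal()
```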

A few examples

ggplot(data, aes(x = country)) + 
  geom_bar(fill = "blue", color = "black")

ggplot(data, aes(x = country, y = cases)) + 
  geom_boxplot(fill = "lightblue") + theme_classic()

ggplot(data, aes(x = population, y = cases, color = country)) + 
  geom_line() + theme_minimal()

ggplot(data, aes(x = population, y = cases, color = country)) + 
  geom_point() + facet_wrap(~year) + theme_inrae()

Conclusion

Easy bind, tidy data, happy mind