Introduction to Tidyverse

Module 12

Mahendra Mariadassou

MaIAGE

Christelle Hennequet-Antier

MaIAGE

June 19, 2023

Introduction

Practical information

9h00 - 17h00
Two breaks: one in the morning, one in the afternoon
Lunch at INRAE restaurant (not mandatory)
Questions allowed
Everyone has something to learn from each other

About you

Who are you?

Institution / Laboratory / position

What is your scientific topic?

Research question
A few keywords

What is your background?

What are your data manipulation needs?
How knowledgeable are you about R?
Have you ever used the tidyverse?

Get to know us

Open infrastructure dedicated to life sciences
- Computing resources, tools, databanks…
Dissemination of expertise in bioinformatics
Design and development of applications
Data analysis

Data analysis service

We are specialized in genomics/metagenomics
Bioinformaticians and Statisticians
More than 110 projects since 2016
2 types of services
- Collaboration or Accompaniment

Training objectives

After this training day, you will be:

able to import data using the read_*() functions
able to select, filter and summarize
able to tidy your data
able to pivot your data
able to merge your data
familiar with the tidyverse and its verbs

Program - Day I

Morning

Introduction & Round table
Introduction to tidyverse
First steps with the vignettes
Tibbles

Break

Data selection: theory
Data selection: practice

Afternoon

Summaries: theory
Summaries: practice

Break

Mutate: theory
Challenges and recap

Program - Day II

Morning

Mutate: practice
Tidy data: theory

Break

Pivot: theory
Tidy data and pivot: practice

Afternoon

Merging tables: theory

Break

Merging tables: practice
Additional challenges
Q&A and recap

Setup

Setting up your computer

Start RStudio
Install the vignettes

remotes::install_github("mahendra-mariadassou/tidytraining")

For additional fun, download the prenoms database (provided as an R package)

remotes::install_github("ThinkR-open/prenoms")

Start the tutorials with the tutorial button

Tidyverse principles

Attribution

The following slides are remixed from original material by the awesome Olivier Gimenez available here under the CC BY 4.0 licence.

The material has been shortened to keep only selected slides and changed from xaringan to quarto format.

Tidyverse (I)

Called “Ordocosme” in 🇫🇷, from tidy (“bien rangé”, neatly arranged) and verse (“univers”, universe)
A collection of R 📦 developed by H. Wickham and others at RStudio

Tidyverse (II)

“A framework for managing data that aims at making the cleaning and preparing steps [muuuuuuuch] easier” (Julien Barnier).
Main characteristics of a tidy dataset:
- each variable is a column
- each observation is a row
- each value is in a different cell

Is this dataset tidy?

#> # A tibble: 12 x 4
#>   country      year type           count
#>   <chr>       <int> <chr>          <int>
#> 1 Afghanistan  1999 cases            745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 Afghanistan  2000 population  20595360
#> 5 Brazil       1999 cases          37737
#> 6 Brazil       1999 population 172006362
#> # … with 6 more rows

Nope
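A minimal sketch of how this table could be tidied. It assumes the table on the slide is the `table2` dataset bundled with tidyr (the name `tidy2` is only illustrative). Each observation is spread over two rows, so `pivot_wider()` moves the values of `type` into their own columns:

```r
library(dplyr)
library(tidyr)

# Assumption: the slide shows tidyr's bundled `table2` dataset.
# `type` holds variable names and `count` holds their values,
# so each (country, year) observation spans two rows.
tidy2 <- table2 %>%
  pivot_wider(names_from = type, values_from = count)

# One row per (country, year), with `cases` and `population` columns
tidy2
```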

What about this one?

#> # A tibble: 6 x 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

Not this one either
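Here two values share a single cell. One possible fix, assuming the table is tidyr's bundled `table3` dataset, is to split the `rate` column at the slash with `separate()` (the name `tidy3` is only illustrative):

```r
library(dplyr)
library(tidyr)

# Assumption: the slide shows tidyr's bundled `table3` dataset.
# `rate` stores "cases/population" as text; split it into two columns
# and let convert = TRUE turn the pieces into numbers.
tidy3 <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

tidy3
```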

And this one?

# Spread across two tibbles
# cases
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766
# population
#> # A tibble: 3 x 3
#>   country         `1999`     `2000`
#> * <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583

Try again
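In this layout the years are column names and the two variables live in separate tables. A hedged sketch of one way to repair it, assuming the two tibbles are tidyr's bundled `table4a` (cases) and `table4b` (population): lengthen each table, then join them on country and year.

```r
library(dplyr)
library(tidyr)

# Assumption: the slide shows tidyr's bundled `table4a` and `table4b`.
# Step 1: turn the year columns into a `year` variable in each table.
cases <- table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

population <- table4b %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")

# Step 2: merge the two long tables on their shared keys.
tidy4 <- left_join(cases, population, by = c("country", "year"))

tidy4
```

Note that `year` comes out as a character column here, because it was built from column names.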

And that one?

#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

Finally 🎉

Why is the tidy format useful?

Provides a consistent format that powerful tools are designed to work with
Makes data manipulation feel natural
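As an illustration of why the tidy layout is convenient, here is a short sketch using tidyr's bundled `table1`, assumed to be the tidy table shown earlier. The names `with_rate` and `by_year` are only illustrative:

```r
library(dplyr)

# Assumption: tidyr's bundled `table1` is the tidy table from the slides.
tidy <- tidyr::table1

# A new variable is a single mutate() call: cases per 10,000 inhabitants
with_rate <- tidy %>%
  mutate(rate = cases / population * 10000)

# A grouped summary is just as direct: total cases per year
by_year <- tidy %>%
  group_by(year) %>%
  summarise(total_cases = sum(cases))

with_rate
by_year
```

Neither step needs any reshaping first, because each variable already sits in its own column.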

Tidyverse is a collection of R 📦

Workflow in data science

Workflow in data science with tidyverse

Focus on piping

Syntax with pipe

Verb(Subject, Complement) is replaced by Subject %>% Verb(Complement)
No need to name unimportant intermediate variables
Clear syntax (readability)

Base R from Lise Vaudor’s blog

white_and_yolk <- crack(egg, add_seasoning)
omelette_batter <- beat(white_and_yolk)
omelette_with_chives <- cook(omelette_batter, add_chives)

Piping from Lise Vaudor’s blog

egg %>%
  crack(add_seasoning) %>%
  beat() %>%
  cook(add_chives) -> omelette_with_chives
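The omelette functions are only an analogy and do not exist in R. A runnable counterpart, sketched here with the built-in `mtcars` data frame and dplyr verbs (the names `nested` and `piped` are illustrative), shows that the nested and piped forms give the same result:

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- head(arrange(filter(mtcars, cyl == 4), desc(mpg)), 3)

# Piped calls: read from top to bottom, no intermediate variables
piped <- mtcars %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg)) %>%
  head(3)

# Both forms perform the same operations in the same order
identical(nested, piped)
```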

Focus on joins

Tidyexplain
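A small sketch of the main dplyr joins, using two toy tables `x` and `y` made up for illustration (they are not taken from the training material):

```r
library(dplyr)

# Two hypothetical tables sharing the key column `id`
x <- tibble(id = c(1, 2, 3), x = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4), y = c("y1", "y2", "y4"))

inner <- inner_join(x, y, by = "id")  # keys present in both tables
left  <- left_join(x, y, by = "id")   # every row of x, NA where y has no match
full  <- full_join(x, y, by = "id")   # every key from either table
anti  <- anti_join(x, y, by = "id")   # rows of x with no match in y

inner
left
full
anti
```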

Tidying tables

Long and wide formats

From Long to wide and vice-versa
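A hedged sketch of the round trip between the two formats with tidyr's `pivot_longer()` and `pivot_wider()`. The `wide` table is typed in by hand from the cases table shown earlier; the names `wide`, `long` and `back` are illustrative:

```r
library(dplyr)
library(tidyr)

# Wide format: one column per year
wide <- tibble(
  country = c("Afghanistan", "Brazil", "China"),
  `1999`  = c(745, 37737, 212258),
  `2000`  = c(2666, 80488, 213766)
)

# Wide to long: year columns become a `year` variable
long <- wide %>%
  pivot_longer(-country, names_to = "year", values_to = "cases")

# Long to wide: the values of `year` become columns again
back <- long %>%
  pivot_wider(names_from = year, values_from = cases)

long
back
```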

Conclusion

To dive deeper 🤿

Learn the tidyverse: books, workshops and online courses
Olivier’s selection of books (I also recommend them)
- R for Data Science and Advanced R
- Introduction à R et au tidyverse
Tidy Tuesdays videos by D. Robinson
Material of the 2-day workshop Data Science in the tidyverse held at the RStudio 2019 conference
Material of the stat545 course on Data wrangling, exploration, and analysis with R at the University of British Columbia
List of best R packages (with their description) on data import, wrangling and visualization

The RStudio Cheat Sheets