Introduction to Tidyverse

Module 12ter

Mahendra Mariadassou

MaIAGE

Christelle Hennequet-Antier

MaIAGE

April 2, 2024

Introduction

Practical information

  • 9h00 - 17h00
  • 2 breaks: one in the morning, one in the afternoon
  • Lunch at INRAE restaurant (not mandatory)
  • Questions are welcome at any time
  • Everyone has something to learn from each other

About you

Who are you?

  • Institution / Laboratory / position

What is your scientific topic?

  • Research question
  • A few keywords

What is your background?

  • What are your data manipulation needs?
  • How knowledgeable are you about R?
  • Have you ever used tidyverse?

Get to know us

  • Open infrastructure dedicated to life sciences
    • Computing resources, tools, databanks…
  • Dissemination of expertise in bioinformatics
  • Design and development of applications
  • Data analysis

Data analysis service

  • We are specialized in genomics/metagenomics
  • Bioinformaticians and Statisticians
  • More than 110 projects since 2016
  • 2 types of services
    • Collaboration or accompaniment

Training objectives

After this training day, you will be:

  • able to import data using the read_*() functions
  • able to select, filter and summarize
  • able to tidy your data
  • able to pivot your data
  • able to merge your data
  • familiar with tidyverse and tidyverse verbs

Program - Day I

Morning

  • Introduction & Round table
  • Introduction to tidyverse
  • First steps with the vignettes
  • Tibbles

Break

  • Data selection: theory
  • Data selection: practice

Afternoon

  • Summaries: theory
  • Summaries: practice

Break

  • Mutate: theory
  • Challenges and recap

Program - Day II

Morning

  • Mutate: practice
  • Tidy data: theory

Break

  • Pivot: theory
  • Tidy data and pivot: practice

Afternoon

  • Merging tables: theory

Break

  • Merging tables: practice
  • Additional challenges
  • Q&A and recap

Setup

Setting up your computer

  • Start RStudio
  • Install the vignettes
# install.packages("remotes")  # only needed if remotes is not installed yet
remotes::install_github("mahendra-mariadassou/tidytraining")
  • For additional fun, download the prenoms database (provided as an R package)
remotes::install_github("ThinkR-open/prenoms")
  • Start the tutorials with the tutorial button

Tidyverse principles

Attribution

The following slides are remixed from original material by the awesome Olivier Gimenez available here under the CC BY 4.0 licence.

The material has been shortened to keep only selected slides and changed from xaringan to quarto format.

Tidyverse (I)

  • Ordocosme in 🇫🇷 with Tidy for “bien rangé” and verse for “univers”

  • A collection of R 📦 developed by H. Wickham and others at RStudio

Tidyverse (II)

  • “A framework for managing data that aims at making the cleaning and preparing steps [muuuuuuuch] easier” (Julien Barnier).

  • Main characteristics of a tidy dataset:

    • each variable is a column
    • each observation is a row
    • each value is in a different cell

Is this dataset tidy?

#> # A tibble: 12 x 4
#>   country      year type           count
#>   <chr>       <int> <chr>          <int>
#> 1 Afghanistan  1999 cases            745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 Afghanistan  2000 population  20595360
#> 5 Brazil       1999 cases          37737
#> 6 Brazil       1999 population 172006362
#> # … with 6 more rows

Nope

What about this one?

#> # A tibble: 6 x 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

Not yet

And this one?

# Spread across two tibbles
# cases
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766
# population
#> # A tibble: 3 x 3
#>   country         `1999`     `2000`
#> * <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583

Try again

And that one?

#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

Finally 🎉
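The printouts above appear to match the example tables shipped with tidyr (table1, table2, table3, table4a and table4b). Assuming that is the case, and assuming tidyr 1.0.0 or later, here is a minimal sketch of how each untidy version could be brought back to the tidy layout. The object names tidy2, tidy3, tidy4, cases and population are chosen for illustration only.

```r
library(dplyr)
library(tidyr)

# table2: one observation is spread over two rows (cases / population)
# -> give each type its own column
tidy2 <- table2 %>%
  pivot_wider(names_from = type, values_from = count)

# table3: two values share a single cell ("cases/population")
# -> split the cell into two columns and convert them to numbers
tidy3 <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

# table4a / table4b: years are used as column names,
# and the two variables live in two separate tables
cases <- table4a %>%
  pivot_longer(-country, names_to = "year", values_to = "cases")
population <- table4b %>%
  pivot_longer(-country, names_to = "year", values_to = "population")

# Merge both long tables and restore year as a number
tidy4 <- left_join(cases, population, by = c("country", "year")) %>%
  mutate(year = as.integer(year))
```

All three results should have one row per country and year, with cases and population as separate columns, like the last table shown above.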

Why is the tidy format useful?

  • Allows using a consistent format for which powerful tools work

  • Makes data manipulation pretty natural

Tidyverse is a collection of R 📦

Workflow in data science

Workflow in data science with tidyverse

Focus on piping

Syntax with pipe

  • Verb(Subject, Complement) is replaced by Subject %>% Verb(Complement)
  • No need to name unimportant intermediate variables
  • Clear syntax (readability)

Base R from Lise Vaudor’s blog

white_and_yolk <- crack(egg, add_seasoning)
omelette_batter <- beat(white_and_yolk)
omelette_with_chives <- cook(omelette_batter, add_chives)

Piping from Lise Vaudor’s blog

egg %>%
  crack(add_seasoning) %>%
  beat() %>%
  cook(add_chives) -> omelette_with_chives
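The omelette example is pseudocode. As a runnable counterpart, here is a small sketch on the built-in mtcars dataset, assuming dplyr is installed. The dataset and the names nested and piped are not part of the original slides and are used for illustration only.

```r
library(dplyr)

# Nested calls: Verb(Subject, Complement), read from the inside out
nested <- summarise(
  group_by(filter(mtcars, cyl > 4), cyl),
  mean_mpg = mean(mpg)
)

# Same computation with the pipe: Subject %>% Verb(Complement),
# read from top to bottom, with no intermediate variables to name
piped <- mtcars %>%
  filter(cyl > 4) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
```

Both objects hold the same result: one row per remaining value of cyl, with the mean of mpg for that group.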

Focus on joins

Tidyexplain
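A minimal sketch of the main join verbs, using the band_members and band_instruments example tables that ship with dplyr. These tables are not part of the original slides. They share a name column, and each contains one musician that is missing from the other table.

```r
library(dplyr)

# Keep every row of the left table; unmatched rows get NA
left <- left_join(band_members, band_instruments, by = "name")

# Keep only rows with a match in both tables
inner <- inner_join(band_members, band_instruments, by = "name")

# Keep every row of both tables
full <- full_join(band_members, band_instruments, by = "name")

# Keep rows of the left table that have no match in the right table
unmatched <- anti_join(band_members, band_instruments, by = "name")
```

Comparing the number of rows of left, inner, full and unmatched is a quick way to check which keys failed to match.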

Tidying tables

Long and wide formats

From Long to wide and vice-versa
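As a rough illustration of the round trip between the two formats, here is a sketch that assumes tidyr 1.0.0 or later. The table is hypothetical: the column names sample, gene and count and all values are made up for this example.

```r
library(tidyr)
library(tibble)

# Hypothetical long table: one row per (sample, gene) pair; values are made up
long <- tribble(
  ~sample, ~gene,   ~count,
  "s1",    "geneA", 10,
  "s1",    "geneB",  0,
  "s2",    "geneA",  7,
  "s2",    "geneB",  3
)

# Long to wide: one row per sample, one column per gene
wide <- long %>%
  pivot_wider(names_from = gene, values_from = count)

# Wide to long: back to one row per (sample, gene) pair
long_again <- wide %>%
  pivot_longer(-sample, names_to = "gene", values_to = "count")
```

Here long_again contains the same rows as long, which shows that the two verbs undo each other on a complete table.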

Conclusion

To dive deeper 🤿

The RStudio Cheat Sheets