Tutorial R advanced with tidyverse

Illustration with a RNA-Seq data set

Authors

Christelle Hennequet-Antier

Mahendra Mariadassou

Published

March 14, 2024

Introduction

In this tutorial, you will learn how to organize data in R with the principles of tidy data. Datasets are most often represented in the form of a rectangular table. To be considered as a tidy data, it should be structured:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

For this purpose we will use tidyverse which is a collection of R core packages (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats) [1].

Make data tidy using tidyverse R package will facilitate statistical analyses and visualization. We illustrate some operations to manipulate the data in the framework of RNA-Seq data analysis.

To install tidyverse

Code
install.packages("tidyverse")

Case study

The RNA-Seq data of this case study is described in [2]. The sequence count data are publicly available from the Gene Expression Omnibus (GEO) at the series accession number GSE60450. This study examines the expression profiles of basal stem-cell enriched cells (B) and committed luminal cells (L) in the mammary gland of virgin, pregnant and lactating mice. Six groups are present, with one for each combination of cell type and mouse status. Each group contains two biological replicates.

Differential analysis consists of identifying differences in gene expression between several experimental conditions. It was previously performed with edgeR package. Data and results were obtained during the “Statistical analysis of RNA-Seq data with R” course from “Bioinformatics by practicing” cycle trainings located at INRAE, Jouy-en-Josas center, MaIAGE.

See also Bioconductor edgeR vignette, case study “RNA-Seq profiles of mouse mammary gland”.

Data files and Resources

Connect to R migale with login and password provided over the two days of training: https://rstudio.migale.inrae.fr/

The data files should be stored in a folder named Data in your working directory.

  • Experimental design data file (separator character is tabs “\t”): GSE60450_Lactation-targets.csv
  • Counts data file (separator character is tabs “\t”): GSE60450_Lactation-GenewiseCounts.txt
  • Results of differential analysis from basal cells between pregnant and lactating mice (separator character is “;”): complete_Basal.Pregnant_vs_Lactate.csv

Bioconductor/R Packages

Load tidyverse packages needed for this tutorial.

Code
#Use packages and functions
library(tidyverse) #load the core tidyverse packages

If you need to install theses packages

Code
# Install from CRAN
install.packages("tidyverse")

Activity 1 - Reading a file, viewing the data

Description of the biological experiment

The aim is to create a design object from experimental design data file GSE60450_Lactation-targets.csv with a tabs separator (“\t”). We use the readr package from the tidyverse ecosystem. A tibble object is produced, which is is an enhanced data.frame. Below is an extract of the data.

File Sample CellType Status
GSM1480297_MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1.txt GSM1480297 Basal virgin
GSM1480298_MCL1-DH_BC2CTUACXX_CAGATC_L002_R1.txt GSM1480298 Basal virgin
GSM1480299_MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1.txt GSM1480299 Basal pregnant
GSM1480300_MCL1-DJ_BC2CTUACXX_CGATGT_L002_R1.txt GSM1480300 Basal pregnant
GSM1480301_MCL1-DK_BC2CTUACXX_TTAGGC_L002_R1.txt GSM1480301 Basal lactate
GSM1480302_MCL1-DL_BC2CTUACXX_ATCACG_L002_R1.txt GSM1480302 Basal lactate

Each row in the table represents a given sample. The variables are File (counts file processed using bioinformatics tools), Sample (the name of the sample), CellType (the type of cells) and Status (the developmental stage of mice).

  • Read the delimited file GSE60450_Lactation-targets.csv using readr package. Attach it to a tibble called design.
  • View design object.
  • Check the path of the file and the delim character
  • Use readr package. Note that readr package is included in tidyverse package.
  • ?read_delim # help
  • read_csv() and read_tsv() are special cases of the more general read_delim(). They’re useful for reading the most common types of flat file data, comma separated values and tab separated values, respectively. read_csv2() uses ⁠;⁠ for the field separator and ⁠,⁠ for the decimal point. This format is common in some European countries.
  • https://readr.tidyverse.org/
  • https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-import.pdf
  • Use View() or glimpse() to view the entire data set
Code
design <- read_tsv("./Data/GSE60450_Lactation-targets.csv")
design 
# A tibble: 12 × 4
   File                                             Sample     CellType Status  
   <chr>                                            <chr>      <chr>    <chr>   
 1 GSM1480297_MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1.txt GSM1480297 Basal    virgin  
 2 GSM1480298_MCL1-DH_BC2CTUACXX_CAGATC_L002_R1.txt GSM1480298 Basal    virgin  
 3 GSM1480299_MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1.txt GSM1480299 Basal    pregnant
 4 GSM1480300_MCL1-DJ_BC2CTUACXX_CGATGT_L002_R1.txt GSM1480300 Basal    pregnant
 5 GSM1480301_MCL1-DK_BC2CTUACXX_TTAGGC_L002_R1.txt GSM1480301 Basal    lactate 
 6 GSM1480302_MCL1-DL_BC2CTUACXX_ATCACG_L002_R1.txt GSM1480302 Basal    lactate 
 7 GSM1480291_MCL1-LA_BC2CTUACXX_GATCAG_L001_R1.txt GSM1480291 Luminal  virgin  
 8 GSM1480292_MCL1-LB_BC2CTUACXX_TGACCA_L001_R1.txt GSM1480292 Luminal  virgin  
 9 GSM1480293_MCL1-LC_BC2CTUACXX_GCCAAT_L001_R1.txt GSM1480293 Luminal  pregnant
10 GSM1480294_MCL1-LD_BC2CTUACXX_GGCTAC_L001_R1.txt GSM1480294 Luminal  pregnant
11 GSM1480295_MCL1-LE_BC2CTUACXX_TAGCTT_L001_R1.txt GSM1480295 Luminal  lactate 
12 GSM1480296_MCL1-LF_BC2CTUACXX_CTTGTA_L001_R1.txt GSM1480296 Luminal  lactate 
Code
# To view the column headers
spec(design)
cols(
  File = col_character(),
  Sample = col_character(),
  CellType = col_character(),
  Status = col_character()
)
Code
colnames(design)
[1] "File"     "Sample"   "CellType" "Status"  
Code
# To remove Column specification's message, use show_col_types = FALSE argument
design <- read_tsv("./Data/GSE60450_Lactation-targets.csv", show_col_types = FALSE )
design
# A tibble: 12 × 4
   File                                             Sample     CellType Status  
   <chr>                                            <chr>      <chr>    <chr>   
 1 GSM1480297_MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1.txt GSM1480297 Basal    virgin  
 2 GSM1480298_MCL1-DH_BC2CTUACXX_CAGATC_L002_R1.txt GSM1480298 Basal    virgin  
 3 GSM1480299_MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1.txt GSM1480299 Basal    pregnant
 4 GSM1480300_MCL1-DJ_BC2CTUACXX_CGATGT_L002_R1.txt GSM1480300 Basal    pregnant
 5 GSM1480301_MCL1-DK_BC2CTUACXX_TTAGGC_L002_R1.txt GSM1480301 Basal    lactate 
 6 GSM1480302_MCL1-DL_BC2CTUACXX_ATCACG_L002_R1.txt GSM1480302 Basal    lactate 
 7 GSM1480291_MCL1-LA_BC2CTUACXX_GATCAG_L001_R1.txt GSM1480291 Luminal  virgin  
 8 GSM1480292_MCL1-LB_BC2CTUACXX_TGACCA_L001_R1.txt GSM1480292 Luminal  virgin  
 9 GSM1480293_MCL1-LC_BC2CTUACXX_GCCAAT_L001_R1.txt GSM1480293 Luminal  pregnant
10 GSM1480294_MCL1-LD_BC2CTUACXX_GGCTAC_L001_R1.txt GSM1480294 Luminal  pregnant
11 GSM1480295_MCL1-LE_BC2CTUACXX_TAGCTT_L001_R1.txt GSM1480295 Luminal  lactate 
12 GSM1480296_MCL1-LF_BC2CTUACXX_CTTGTA_L001_R1.txt GSM1480296 Luminal  lactate 
Code
#Other ways to view the entire dataset
#View(design) # open via Environment pane
glimpse(design)
Rows: 12
Columns: 4
$ File     <chr> "GSM1480297_MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1.txt", "GSM14802…
$ Sample   <chr> "GSM1480297", "GSM1480298", "GSM1480299", "GSM1480300", "GSM1…
$ CellType <chr> "Basal", "Basal", "Basal", "Basal", "Basal", "Basal", "Lumina…
$ Status   <chr> "virgin", "virgin", "pregnant", "pregnant", "lactate", "lacta…

Activity 2 - Selecting variables of interest

If we want to create a new dataset that only includes certain variables, we can use the select() function from the dplyr package.

Note that select() does not change the original tibble, but makes a new tibble with the specified columns. If you want to keep this new dataset, create a new object.

  • Create a new data called new_design which contains only Sample, CellType, Status variables.
  • Use select() function from the dplyr package.
  • vignette(package=“dplyr”, topic =“dplyr”)
  • ?select
  • Use select() function
Code
# select variables of interest
select(design, Sample, CellType, Status)
# A tibble: 12 × 3
   Sample     CellType Status  
   <chr>      <chr>    <chr>   
 1 GSM1480297 Basal    virgin  
 2 GSM1480298 Basal    virgin  
 3 GSM1480299 Basal    pregnant
 4 GSM1480300 Basal    pregnant
 5 GSM1480301 Basal    lactate 
 6 GSM1480302 Basal    lactate 
 7 GSM1480291 Luminal  virgin  
 8 GSM1480292 Luminal  virgin  
 9 GSM1480293 Luminal  pregnant
10 GSM1480294 Luminal  pregnant
11 GSM1480295 Luminal  lactate 
12 GSM1480296 Luminal  lactate 
Code
# store the selection in a new dataset
new_design <- select(design, Sample, CellType, Status)
new_design
# A tibble: 12 × 3
   Sample     CellType Status  
   <chr>      <chr>    <chr>   
 1 GSM1480297 Basal    virgin  
 2 GSM1480298 Basal    virgin  
 3 GSM1480299 Basal    pregnant
 4 GSM1480300 Basal    pregnant
 5 GSM1480301 Basal    lactate 
 6 GSM1480302 Basal    lactate 
 7 GSM1480291 Luminal  virgin  
 8 GSM1480292 Luminal  virgin  
 9 GSM1480293 Luminal  pregnant
10 GSM1480294 Luminal  pregnant
11 GSM1480295 Luminal  lactate 
12 GSM1480296 Luminal  lactate 
Code
# an other way 
new_design <- select(design, -File)
new_design
# A tibble: 12 × 3
   Sample     CellType Status  
   <chr>      <chr>    <chr>   
 1 GSM1480297 Basal    virgin  
 2 GSM1480298 Basal    virgin  
 3 GSM1480299 Basal    pregnant
 4 GSM1480300 Basal    pregnant
 5 GSM1480301 Basal    lactate 
 6 GSM1480302 Basal    lactate 
 7 GSM1480291 Luminal  virgin  
 8 GSM1480292 Luminal  virgin  
 9 GSM1480293 Luminal  pregnant
10 GSM1480294 Luminal  pregnant
11 GSM1480295 Luminal  lactate 
12 GSM1480296 Luminal  lactate 
Code
# with pipe operator %>% from magrittr package
new_design <- design %>% select(-File)

Activity 3 - Filtering rows of interest

If we want to subset a data frame keeping only rows that satisfy some criteria based on variables, we can use the filter() function from the dplyr package. You need to specify an expression named Boolean expression whose evaluation results in a value of TRUE or FALSE.

  • Create data frame retaining only the rows where the samples come from Luminal cells.
  • Create data frame retaining only the rows where the samples come from lacting mice.
  • Filter data to keep only the rows where the samples come from lacting mice and Luminal cells.
  • ?filter
  • Use filter() function
  • Use operator ==
Code
# only the rows where the samples come from Luminal cells
subset_1 <- filter(design, CellType == "Luminal")
# only the rows where the samples come from lacting mice.
subset_2 <- filter(design, Status == "lactate")
# only the rows where the samples come  from lacting mice and Luminal cells
subset_3 <- filter(design, CellType == "Luminal", Status == "lactate")
# same result
subset_3 <- filter(design, CellType == "Luminal" & Status == "lactate")

Activity 4 - Add a new variable

To create new variables, we use the mutate() function from dplyr package. Note that if you want to save this new column, you need to save it to an object.

To identify genes that are differentially expressed between different biological conditions, it is convenient to define a new group variable. This synthetic variable is the combination of several variables. It is very useful to calculate some statistics and also to construct the statistical model of the differential analysis.

  • Create a new variable group which is the combination between CellType and Status.
  • Attach the group variable to the design data frame using the mutate() function.
  • Use str_c() from stringr package which is equivalent to paste() function.
  • ?mutate
  • Use mutate() function
Code
# group is defined as the combination between CellType and Status
group <- str_c(design$CellType, design$Status, sep = "_")
group <- paste(design$CellType, design$Status, sep = "_")
# Attach group to the design data frame
design <- mutate(design, group = str_c(CellType, Status, sep = "_"))

Activity 5 - Arranging the data

Description of differential analysis results

The file complete_Basal.Pregnant_vs_Lactate.csv contains the results of a differential analysis performed using edgeR package. Read counts were normalized using trimmed mean of M-values method (TMM). We are interested in identifying genes whose expression changes between the Basal_pregnant and Basal_lactate groups. A generalized linear model was fitted for each gene [3]. A positive (respectively, negative) log-foldchange indicates genes that are up- (respectively, down-) regulated in Basal_pregnant over Basal_lactate group. Multiplicity correction is performed by applying the Benjamini-Hochberg method on the p-values, to control the false discovery rate (FDR). A gene is declared differentially expressed if its FDR is lower than 0.05.

Below is an extract of the results of a differential analysis.

...1 ENTREZID SYMBOL GENENAME logFC logCPM LR PValue FDR
497097 497097 Xkr4 X-linked Kx blood group related 4 -1.2575060 2.572255 5.9808162 0.0144623 0.0408293
20671 20671 Sox17 SRY (sex determining region Y)-box 17 0.3670579 1.310053 0.6801565 0.4095328 0.5453077
27395 27395 Mrpl15 mitochondrial ribosomal protein L15 -0.0353455 3.998092 0.0348689 0.8518705 0.9081256
18777 18777 Lypla1 lysophospholipase 1 -0.0854954 5.055784 0.4514209 0.5016610 0.6290739
21399 21399 Tcea1 transcription elongation factor A (SII) 1 0.2226795 5.629010 3.4048849 0.0650036 0.1357270

Note that ENTREZID is the gene identifier, SYMBOL and GENENAME are other information about the gene. The logFC measure is the fold-change transformed into log2 and represents the difference in expression. A log(FC) of 1 means twice as expressed. The variables LR and PValue are likelihood ratio test and the associated p-value, respectively. FDR represents the false discovery rate calculated with the Benjamini-Hochberg correction.

This Volcano plot enable us to visualize the significance of change (i.e. p-value or adjusted p-value like FDR) versus the fold change (i.e. logFC). Each point represents a gene which are colored according to its significance. The most differentially expressed genes are located towards the top of the plot.

The results of differential analysis are very often explored by sorting the data file according to the significance of the test and the fold-change value. We will use the arrange() function of the dplyr package which allows to easily order the rows of a data table by the values of the selected columns.

Part1: arranging

  • Read the file complete_Basal.Pregnant_vs_Lactate.csv using readr package. Attach it to a tibble called res. It contains the results of the differential analysis between the Basal_pregnant and Basal_lactate groups.
  • Order the rows by values of FDR using arrange() function from dplyr package. The top genes are the most differentially expressed and the bottom genes are not differentially expressed. Give the top ten of genes.
  • Order the rows in descending order of logFC using arrange() function from dplyr package. At the top of the list are the genes with the highest difference in expression but which are not necessarily differentially expressed. Give an example of a gene which is not differentially expressed although its logFC is large.

Part2: challenge

  • Create the data frame up_genes including only differentially expressed genes (FDR < 0.05) which are up regulated (logFC > 0).
  • Modify the code to create the data frame down_genes including only differentially expressed genes which are down regulated (logFC < 0).
  • Order differentially expressed genes from highest value to lowest of logFC in the up_genes data frame and conversely in the down_genes data frame.

Part1: arranging

  • Use readr package
  • Have a look at the path of the file and the delim character
  • Choose read_csv() and read_csv2() for csv files, read_tsv() for tabs separated files
  • ?read_delim
  • ?arrange
  • arrange()

Part2: challenge

  • Use filter() and arrange() functions
  • The pipe operator %>% from magrittr package can facilitate the chaining of lines of code

Part1: arranging

Code
# read file
res <- read_csv2(file = "./Data/complete_Basal.Pregnant_vs_Lactate.csv")
# orders the rows by values of FDR
res_signif_ord <- arrange(res, FDR)
# top 10 of genes
res_signif_ord
# A tibble: 15,804 × 9
     ...1 ENTREZID SYMBOL  GENENAME         logFC logCPM    LR   PValue      FDR
    <dbl>    <dbl> <chr>   <chr>            <dbl>  <dbl> <dbl>    <dbl>    <dbl>
 1 211577   211577 Mrgprf  MAS-related GPR… -5.15   2.74  407. 1.61e-90 2.55e-86
 2  12992    12992 Csn1s2b casein alpha s2… -6.08  10.2   404. 7.71e-90 6.09e-86
 3  14620    14620 Gjb3    gap junction pr…  3.58   5.50  316. 1.20e-70 6.33e-67
 4  12740    12740 Cldn4   claudin 4         5.32   9.87  310. 2.09e-69 8.25e-66
 5 140474   140474 Muc4    mucin 4           7.17   6.05  258. 3.86e-58 1.22e-54
 6 226101   226101 Myof    myoferlin        -2.32   6.44  252. 1.20e-56 3.17e-53
 7  21953    21953 Tnni2   troponin I, ske… -5.74   3.86  242. 1.45e-54 3.27e-51
 8 231830   231830 Micall2 MICAL-like 2      2.25   5.18  230. 6.10e-52 1.21e-48
 9  78896    78896 Ecrg4   ECRG4 augurin p…  2.81   6.68  225. 7.87e-51 1.38e-47
10  24117    24117 Wif1    Wnt inhibitory …  1.82   6.76  221. 4.32e-50 6.83e-47
# ℹ 15,794 more rows
Code
# other solution
# res_signif_ord %>% slice_head(n = 10)
# rows are sorted in descending order of logFC
res_FC_ord <- arrange(res, desc(logFC))
#View(res_FC_ord)
# this gene is not differentially expressed (FDR > 0.05) although its logFC is large
filter(res_FC_ord, ENTREZID==100123473)
# A tibble: 1 × 9
       ...1  ENTREZID SYMBOL GENENAME           logFC logCPM    LR PValue    FDR
      <dbl>     <dbl> <chr>  <chr>              <dbl>  <dbl> <dbl>  <dbl>  <dbl>
1 100123473 100123473 Trdc   T cell receptor d…  4.24  -1.33  5.22 0.0223 0.0582
Code
# the first 5 genes which are not differentially expressed although their logFC is large
res_FC_ord %>% filter(FDR > 0.05) %>% slice_head(n = 5)
# A tibble: 5 × 9
       ...1  ENTREZID SYMBOL  GENENAME          logFC logCPM    LR PValue    FDR
      <dbl>     <dbl> <chr>   <chr>             <dbl>  <dbl> <dbl>  <dbl>  <dbl>
1 100123473 100123473 Trdc    T cell receptor …  4.24 -1.33   5.22 0.0223 0.0582
2     19144     19144 Klk6    kallikrein relat…  3.95  0.878  4.56 0.0328 0.0791
3     78076     78076 Lcn8    lipocalin 8        3.95 -2.41   4.84 0.0278 0.0694
4     20309     20309 Cxcl15  chemokine (C-X-C…  3.94  4.11   4.78 0.0288 0.0714
5 100503289 100503289 Gm14133 predicted gene 1…  3.94 -1.09   4.81 0.0283 0.0704

Part2: challenge

Code
# challenge arrange
up_genes <- filter(res, FDR < 0.05, logFC > 0) %>%
  arrange(desc(logFC)) # Note that we order from highest to lowest value
down_genes <- filter(res, FDR < 0.05, logFC < 0) %>%
  arrange(logFC) # Note that we order from lowest to highest value

Activity 6 - Grouping and summarising

Descriptive statistics are used to describe the data in a study and provide some summary statistics. They can be calculated using the function summarise() from dplyr package.

We want to obtain descriptive statistics (e.g. mean, median, variance) on differentially expressed genes for both up- and down- regulated genes.

Part1: summarising

  • Create the data frame up_genes including only differentially expressed genes (FDR < 0.05) which are up regulated (logFC > 0). Provide the following statistics: number of genes up regulated, the minimum, the mean, the median and the maximum of the logFC variable.
  • Optional: Reproduce the code for the data frame down_genes including only differentially expressed genes which are down regulated.

Part2: grouping

  • Subset the rows from res tibble to keep only differentially expressed genes (FDR < 0.05).
  • Convert this tibble into a grouped tibble defined by the evaluation of the criterion (logFC > 0) using group_by() function form dplyr package.
  • Provide the following statistics from the grouped tibble: number of genes (up- or down-) regulated, the minimum, the mean, the median and the maximum of the logFC variable. What is the highest and the lowest logFC ?

Part1: summarising

  • subset rows using values from FDR and logFC variables
  • Use filter() function
  • ?summarise
  • Use summarise()function

Part2: grouping

  • ?group_by
  • Use summarise() function
  • The pipe operator %>% from magrittr package can facilitate the chaining of lines of code

Part1: summarising

Code
# create data containing only up, respectively down genes
up_genes <- filter(res, FDR < 0.05, logFC > 0)
down_genes <- filter(res, FDR < 0.05, logFC < 0) 
# summary statistics for up_genes data frame
summarise(up_genes, 
          Nb = n(),
          min_logFC = min(logFC),
          mean_logFC = mean(logFC),
          median_logFC = median(logFC),
          max_logFC = max(logFC))
# A tibble: 1 × 5
     Nb min_logFC mean_logFC median_logFC max_logFC
  <int>     <dbl>      <dbl>        <dbl>     <dbl>
1  3060     0.253       1.08        0.793      7.72
Code
# summary statistics for down_genes data frame          
summarise(down_genes, 
          Nb = n(),
          min_logFC = min(logFC),
          mean_logFC = mean(logFC),
          median_logFC = median(logFC),
          max_logFC = max(logFC))
# A tibble: 1 × 5
     Nb min_logFC mean_logFC median_logFC max_logFC
  <int>     <dbl>      <dbl>        <dbl>     <dbl>
1  2812     -6.48     -0.812       -0.654    -0.257

Part2: grouping

Code
res %>% 
  filter(FDR < 0.05) %>% 
  group_by(logFC > 0) %>% 
  summarise(Nb = n(),
          min_logFC = min(logFC),
          mean_logFC = mean(logFC),
          median_logFC = median(logFC),
          max_logFC = max(logFC),
          .groups = "drop") # if you want to "drop" the groupings for further operations
# A tibble: 2 × 6
  `logFC > 0`    Nb min_logFC mean_logFC median_logFC max_logFC
  <lgl>       <int>     <dbl>      <dbl>        <dbl>     <dbl>
1 FALSE        2812    -6.48      -0.812       -0.654    -0.257
2 TRUE         3060     0.253      1.08         0.793     7.72 

Activity 7 - Formatting the data from wide to long format

Description of the counts data

Below is the first lines of the counts data table in the wide format, named counts. Each row in the table represents a given gene identified by its EntrezGeneID identifier. The other variables are Length (gene’s length) and the name of the studied samples. The counts data table contains the counts, i.e. the number of reads that align at a particular genomic position for each gene and sample.

EntrezGeneID Length MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1 MCL1-DH_BC2CTUACXX_CAGATC_L002_R1 MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1 MCL1-DJ_BC2CTUACXX_CGATGT_L002_R1 MCL1-DK_BC2CTUACXX_TTAGGC_L002_R1 MCL1-DL_BC2CTUACXX_ATCACG_L002_R1 MCL1-LA_BC2CTUACXX_GATCAG_L001_R1 MCL1-LB_BC2CTUACXX_TGACCA_L001_R1 MCL1-LC_BC2CTUACXX_GCCAAT_L001_R1 MCL1-LD_BC2CTUACXX_GGCTAC_L001_R1 MCL1-LE_BC2CTUACXX_TAGCTT_L001_R1 MCL1-LF_BC2CTUACXX_CTTGTA_L001_R1
497097 3634 438 300 65 237 354 287 0 0 0 0 0 0
100503874 3259 1 0 1 1 0 4 0 0 0 0 0 0
100038431 1634 0 0 0 0 0 0 0 0 0 0 0 0

We want to convert the counts data table from wide format into long format using pivot_longer() function from tidyr package. This will make it easier to perform statistical data analyses and create graphs with ggplot2 package. In the wide format count values are stored in multiple columns for each sample. We convert it into long format containing all the count values in a single column. The number of rows increases and the number of columns decreases in the long format.

Below is the first lines of the counts data table reformatted in the long format, named counts_long.

EntrezGeneID Length Sample Count
497097 3634 MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1 438
497097 3634 MCL1-DH_BC2CTUACXX_CAGATC_L002_R1 300
497097 3634 MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1 65
497097 3634 MCL1-DJ_BC2CTUACXX_CGATGT_L002_R1 237
497097 3634 MCL1-DK_BC2CTUACXX_TTAGGC_L002_R1 354

Part1: Reading data

  • Read the counts data file GSE60450_Lactation-GenewiseCounts.txt with tabs “\t” separator character is tabs using readr package. Attach it to a tibble called counts.
  • View counts object. How many genes were expressed ? How many samples were studied ?

Part2: Convert data from wide format into long format

  • Use pivot_longer() function to convert the counts data from wide format to long format. Attach it to a tibble called counts_long which contains two new columns Sample and Count columns.

Part1: Reading data

  • Check path and delimiter character to read file data
  • ?read_delim or ?read_csv or ?read_tsv for help
  • Use View() or glimpse() to view the entire data set
  • Use slice_head() to view the first lines

Part2: Convert data from wide format into long format

  • vignette(“pivot”)
  • Have a look to cols, names_to and values_to arguments
  • cols argument indicates the columns to reformat
  • To select columns to pivot in the cols argument, you can use starts_with() function, or c() or : between first and last column of interest
  • Two new columns are thus created in the reformatted data (in longer format) with the name specified with names_to corresponding to Sample and values_to arguments named Count containing gene expression values.

Part1: Reading data

Code
counts <- read_tsv("./Data/GSE60450_Lactation-GenewiseCounts.txt", show_col_types = FALSE)
# To know the data
spec(counts)      # type of columns
cols(
  EntrezGeneID = col_double(),
  Length = col_double(),
  `MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1` = col_double(),
  `MCL1-DH_BC2CTUACXX_CAGATC_L002_R1` = col_double(),
  `MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1` = col_double(),
  `MCL1-DJ_BC2CTUACXX_CGATGT_L002_R1` = col_double(),
  `MCL1-DK_BC2CTUACXX_TTAGGC_L002_R1` = col_double(),
  `MCL1-DL_BC2CTUACXX_ATCACG_L002_R1` = col_double(),
  `MCL1-LA_BC2CTUACXX_GATCAG_L001_R1` = col_double(),
  `MCL1-LB_BC2CTUACXX_TGACCA_L001_R1` = col_double(),
  `MCL1-LC_BC2CTUACXX_GCCAAT_L001_R1` = col_double(),
  `MCL1-LD_BC2CTUACXX_GGCTAC_L001_R1` = col_double(),
  `MCL1-LE_BC2CTUACXX_TAGCTT_L001_R1` = col_double(),
  `MCL1-LF_BC2CTUACXX_CTTGTA_L001_R1` = col_double()
)
Code
#View(counts)      # view entire data
glimpse(counts)   # view data
Rows: 27,179
Columns: 14
$ EntrezGeneID                        <dbl> 497097, 100503874, 100038431, 1988…
$ Length                              <dbl> 3634, 3259, 1634, 9747, 3130, 4203…
$ `MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1` <dbl> 438, 1, 0, 1, 106, 309, 652, 0, 16…
$ `MCL1-DH_BC2CTUACXX_CAGATC_L002_R1` <dbl> 300, 0, 0, 1, 182, 234, 515, 1, 14…
$ `MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1` <dbl> 65, 1, 0, 0, 82, 337, 948, 0, 1721…
$ `MCL1-DJ_BC2CTUACXX_CGATGT_L002_R1` <dbl> 237, 1, 0, 0, 105, 300, 935, 0, 13…
$ `MCL1-DK_BC2CTUACXX_TTAGGC_L002_R1` <dbl> 354, 0, 0, 0, 43, 290, 928, 0, 115…
$ `MCL1-DL_BC2CTUACXX_ATCACG_L002_R1` <dbl> 287, 4, 0, 0, 82, 270, 791, 0, 106…
$ `MCL1-LA_BC2CTUACXX_GATCAG_L001_R1` <dbl> 0, 0, 0, 10, 16, 560, 826, 0, 1334…
$ `MCL1-LB_BC2CTUACXX_TGACCA_L001_R1` <dbl> 0, 0, 0, 3, 25, 464, 862, 1, 1258,…
$ `MCL1-LC_BC2CTUACXX_GCCAAT_L001_R1` <dbl> 0, 0, 0, 10, 18, 489, 668, 2, 1068…
$ `MCL1-LD_BC2CTUACXX_GGCTAC_L001_R1` <dbl> 0, 0, 0, 2, 8, 328, 646, 0, 926, 6…
$ `MCL1-LE_BC2CTUACXX_TAGCTT_L001_R1` <dbl> 0, 0, 0, 0, 3, 307, 544, 2, 508, 2…
$ `MCL1-LF_BC2CTUACXX_CTTGTA_L001_R1` <dbl> 0, 0, 0, 0, 10, 342, 581, 2, 500, …
Code
slice_head(counts, n=3) # view first lines
# A tibble: 3 × 14
  EntrezGeneID Length `MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1` MCL1-DH_BC2CTUACXX_C…¹
         <dbl>  <dbl>                               <dbl>                  <dbl>
1       497097   3634                                 438                    300
2    100503874   3259                                   1                      0
3    100038431   1634                                   0                      0
# ℹ abbreviated name: ¹​`MCL1-DH_BC2CTUACXX_CAGATC_L002_R1`
# ℹ 10 more variables: `MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1` <dbl>,
#   `MCL1-DJ_BC2CTUACXX_CGATGT_L002_R1` <dbl>,
#   `MCL1-DK_BC2CTUACXX_TTAGGC_L002_R1` <dbl>,
#   `MCL1-DL_BC2CTUACXX_ATCACG_L002_R1` <dbl>,
#   `MCL1-LA_BC2CTUACXX_GATCAG_L001_R1` <dbl>,
#   `MCL1-LB_BC2CTUACXX_TGACCA_L001_R1` <dbl>, …
Code
is_tibble(counts) # to check the structure of the data
[1] TRUE

There are 27179 genes and 12 samples.

Part2: Convert data from wide format into long format

Code
# Use pivot_longer() function to convert data from wide format into long format
# cols argument describes which columns need to be reshaped
# use starts_with() function to select columns of interest
counts_long <- pivot_longer(counts, cols = starts_with("MCL1"), names_to = "Sample", values_to = "Count")

# code with pipe operator
counts_long <- counts %>%
  pivot_longer(cols = starts_with("MCL1"), names_to = "Sample", values_to = "Count")

# use c() and - to remove specified columns
counts_long <- counts %>% 
  pivot_longer(cols = -c("EntrezGeneID", "Length"), names_to = "Sample", values_to = "Count")

# use ":" to specify first and last column of interest 
counts_long <- counts %>% 
  pivot_longer(cols = 'MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1':'MCL1-LF_BC2CTUACXX_CTTGTA_L001_R1', 
               names_to = "Sample", 
               values_to = "Count")

Activity 8 - Joining two tables

Now, we want to add metadata about the samples provided by the design table into the counts_long table. This will make it easier to identify the information carried by each sample (i.e. whether it is a basal cell type) and use it in the graphs.

So, the joined data counts_long_allinfo looks like that.

EntrezGeneID Length Sample Count File Sample.y CellType Status group
497097 3634 MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1 438 GSM1480297_MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1.txt GSM1480297 Basal virgin Basal_virgin
497097 3634 MCL1-DH_BC2CTUACXX_CAGATC_L002_R1 300 GSM1480298_MCL1-DH_BC2CTUACXX_CAGATC_L002_R1.txt GSM1480298 Basal virgin Basal_virgin
497097 3634 MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1 65 GSM1480299_MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1.txt GSM1480299 Basal pregnant Basal_pregnant
  • Copy the source code below and run it into the Console. We added a new variable File2 in the design table using mutate() function. File2 is a key variable which matches to the Sample variable from the counts_long table. In this case, this step creates the relationship between the two tables counts_long and design and is necessary before joining.
Code
# create File2 variable as key variable from design table
design <- design %>%
  mutate(File2 = str_remove(File, ".txt")) %>%  # pattern '.txt' is removed
  mutate(File2 = str_sub(File2, 12))            # begin at the 12th character 
  • Use full_join() function to join the two tables counts_long and design by key variables identifying the samples. Attach it to counts_long_allinfo.
  • Work with counts_long and design data
  • ?full_join
  • Check that key variables included in by argument will match, argument by = c("Sample" = "File2")
Code
# create File2 variable as key variable from design table
design <- design %>%
  mutate(File2 = str_remove(File, ".txt")) %>%
  mutate(File2 = str_sub(File2, 12) )
# File 2 from design table corresponds to Sample from counts_long table
counts_long_allinfo <- full_join(counts_long, design, by = c("Sample" = "File2"))

Activity 9 - Formatting the data from long to wide

Now, we can easily perform statistics by taking advantage of the long format of the count data table counts_long. Next, we format the data table in wide format for a more comfortable presentation. To do it, we use pivot_wider() function which is the inverse function of pivot_longer(). In this wide format, the number of columns increases and the number of rows decreases.

Have a look below at the table called counts_mean which contains the mean of counts for each gene by biological condition (ie group).

EntrezGeneID mean_Basal_lactate mean_Basal_pregnant mean_Basal_virgin mean_Luminal_lactate mean_Luminal_pregnant mean_Luminal_virgin
11287 7.0 7.5 4.0 3.0 1.0 7
11298 0.5 0.5 0.5 0.5 1.0 0
11302 450.5 666.0 704.0 3.5 3.5 14

Reproduce the table above.

  • Calculate mean of counts for each gene by biological condition (group variable) from the counts_long_allinfo table.
  • Reformat data table into wide format using pivot_wider() function. Attach it to counts_mean.
  • Work with counts_long_allinfo data
  • Use group_by() and summarise() functions to calculate mean of counts for each gene by biological condition
  • ?pivot_wider
  • Look at names_from, values_from, names_prefix arguments
Code
# calculate the mean of counts for each gene by biological condition (ie group)
tmp <- counts_long_allinfo %>% 
  select(EntrezGeneID, group, Count) %>% 
  group_by(EntrezGeneID, group) %>% 
  summarise(mean = mean(Count))
Code
# reformatted data in wide format
# each line corresponds to a given gene with its average counts by group
counts_mean <- tmp %>%
  pivot_wider(names_from = group, values_from = mean)

# customize column headers using names_prefix
counts_mean <- tmp %>%
  pivot_wider(names_from = group, 
              values_from = mean,
              names_prefix = "mean_",
              )

R and packages version

Code
sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.3  forcats_1.0.0    stringr_1.5.1    dplyr_1.1.4     
 [5] purrr_1.0.2      readr_2.1.4      tidyr_1.3.0      tibble_3.2.1    
 [9] ggplot2_3.4.4    tidyverse_2.0.0  kableExtra_1.3.4

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        generics_0.1.3    xml2_1.3.6        stringi_1.8.3    
 [5] hms_1.1.3         digest_0.6.33     magrittr_2.0.3    evaluate_0.23    
 [9] grid_4.3.3        timechange_0.2.0  fastmap_1.1.1     jsonlite_1.8.8   
[13] httr_1.4.7        rvest_1.0.3       fansi_1.0.6       viridisLite_0.4.2
[17] scales_1.3.0      cli_3.6.2         crayon_1.5.2      rlang_1.1.2      
[21] bit64_4.0.5       munsell_0.5.0     withr_2.5.2       yaml_2.3.8       
[25] parallel_4.3.3    tools_4.3.3       tzdb_0.4.0        colorspace_2.1-0 
[29] webshot_0.5.5     vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4  
[33] bit_4.0.5         htmlwidgets_1.6.4 vroom_1.6.5       pkgconfig_2.0.3  
[37] pillar_1.9.0      gtable_0.3.4      glue_1.6.2        systemfonts_1.0.5
[41] highr_0.10        xfun_0.41         tidyselect_1.2.0  rstudioapi_0.15.0
[45] knitr_1.45        farver_2.1.1      htmltools_0.5.7   labeling_0.4.3   
[49] rmarkdown_2.25    svglite_2.1.3     compiler_4.3.3   

References

1. Wickham H, Vaughan D, Girlich M. Tidyr: Tidy messy data. 2023.
2. Fu NY, Rios AC, Pal B, Soetanto R, Lun ATL, Liu K, et al. EGF-mediated induction of mcl-1 at the switch to lactation is essential for alveolar cell survival. Nature Cell Biology. 2015;17:365–75. doi:10.1038/ncb3117.
3. Mc DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 2012;40:4288–97.
 

A work by Migale Bioinformatics Facility

https://migale.inrae.fr

Our two affiliations to cite us:

Université Paris-Saclay, INRAE, MaIAGE, 78350, Jouy-en-Josas, France

Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, 78350, Jouy-en-Josas, France