Code
install.packages("tidyverse")
Illustration with a RNA-Seq data set
In this tutorial, you will learn how to organize data in R with the principles of tidy data. Datasets are most often represented in the form of a rectangular table. To be considered as a tidy data, it should be structured:
For this purpose we will use tidyverse which is a collection of R core packages (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats) [1].
Make data tidy using tidyverse
R package will facilitate statistical analyses and visualization. We illustrate some operations to manipulate the data in the framework of RNA-Seq data analysis.
To install tidyverse
install.packages("tidyverse")
The RNA-Seq data of this case study is described in [2]. The sequence count data are publicly available from the Gene Expression Omnibus (GEO) at the series accession number GSE60450. This study examines the expression profiles of basal stem-cell enriched cells (B) and committed luminal cells (L) in the mammary gland of virgin, pregnant and lactating mice. Six groups are present, with one for each combination of cell type and mouse status. Each group contains two biological replicates.
Differential analysis consists of identifying differences in gene expression between several experimental conditions. It was previously performed with edgeR
package. Data and results were obtained during the “Statistical analysis of RNA-Seq data with R” course from “Bioinformatics by practicing” cycle trainings located at INRAE, Jouy-en-Josas center, MaIAGE.
See also Bioconductor edgeR vignette, case study “RNA-Seq profiles of mouse mammary gland”.
Connect to R migale with login and password provided over the two days of training: https://rstudio.migale.inrae.fr/
The data files should be stored in a folder named Data
in your working directory.
GSE60450_Lactation-targets.csv
GSE60450_Lactation-GenewiseCounts.txt
complete_Basal.Pregnant_vs_Lactate.csv
Load tidyverse packages needed for this tutorial.
#Use packages and functions
library(tidyverse) #load the core tidyverse packages
If you need to install theses packages
# Install from CRAN
install.packages("tidyverse")
Description of the biological experiment
The aim is to create a design
object from experimental design data file GSE60450_Lactation-targets.csv
with a tabs separator (“\t”). We use the readr package from the tidyverse
ecosystem. A tibble
object is produced, which is is an enhanced data.frame. Below is an extract of the data.
File | Sample | CellType | Status |
---|---|---|---|
GSM1480297_MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1.txt | GSM1480297 | Basal | virgin |
GSM1480298_MCL1-DH_BC2CTUACXX_CAGATC_L002_R1.txt | GSM1480298 | Basal | virgin |
GSM1480299_MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1.txt | GSM1480299 | Basal | pregnant |
GSM1480300_MCL1-DJ_BC2CTUACXX_CGATGT_L002_R1.txt | GSM1480300 | Basal | pregnant |
GSM1480301_MCL1-DK_BC2CTUACXX_TTAGGC_L002_R1.txt | GSM1480301 | Basal | lactate |
GSM1480302_MCL1-DL_BC2CTUACXX_ATCACG_L002_R1.txt | GSM1480302 | Basal | lactate |
Each row in the table represents a given sample. The variables are File (counts file processed using bioinformatics tools), Sample (the name of the sample), CellType (the type of cells) and Status (the developmental stage of mice).
GSE60450_Lactation-targets.csv
using readr
package. Attach it to a tibble called design
.design
object.readr
package. Note that readr
package is included in tidyverse package.View()
or glimpse()
to view the entire data set<- read_tsv("./Data/GSE60450_Lactation-targets.csv")
design design
# A tibble: 12 × 4
File Sample CellType Status
<chr> <chr> <chr> <chr>
1 GSM1480297_MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1.txt GSM1480297 Basal virgin
2 GSM1480298_MCL1-DH_BC2CTUACXX_CAGATC_L002_R1.txt GSM1480298 Basal virgin
3 GSM1480299_MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1.txt GSM1480299 Basal pregnant
4 GSM1480300_MCL1-DJ_BC2CTUACXX_CGATGT_L002_R1.txt GSM1480300 Basal pregnant
5 GSM1480301_MCL1-DK_BC2CTUACXX_TTAGGC_L002_R1.txt GSM1480301 Basal lactate
6 GSM1480302_MCL1-DL_BC2CTUACXX_ATCACG_L002_R1.txt GSM1480302 Basal lactate
7 GSM1480291_MCL1-LA_BC2CTUACXX_GATCAG_L001_R1.txt GSM1480291 Luminal virgin
8 GSM1480292_MCL1-LB_BC2CTUACXX_TGACCA_L001_R1.txt GSM1480292 Luminal virgin
9 GSM1480293_MCL1-LC_BC2CTUACXX_GCCAAT_L001_R1.txt GSM1480293 Luminal pregnant
10 GSM1480294_MCL1-LD_BC2CTUACXX_GGCTAC_L001_R1.txt GSM1480294 Luminal pregnant
11 GSM1480295_MCL1-LE_BC2CTUACXX_TAGCTT_L001_R1.txt GSM1480295 Luminal lactate
12 GSM1480296_MCL1-LF_BC2CTUACXX_CTTGTA_L001_R1.txt GSM1480296 Luminal lactate
# To view the column headers
spec(design)
cols(
File = col_character(),
Sample = col_character(),
CellType = col_character(),
Status = col_character()
)
colnames(design)
[1] "File" "Sample" "CellType" "Status"
# To remove Column specification's message, use show_col_types = FALSE argument
<- read_tsv("./Data/GSE60450_Lactation-targets.csv", show_col_types = FALSE )
design design
# A tibble: 12 × 4
File Sample CellType Status
<chr> <chr> <chr> <chr>
1 GSM1480297_MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1.txt GSM1480297 Basal virgin
2 GSM1480298_MCL1-DH_BC2CTUACXX_CAGATC_L002_R1.txt GSM1480298 Basal virgin
3 GSM1480299_MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1.txt GSM1480299 Basal pregnant
4 GSM1480300_MCL1-DJ_BC2CTUACXX_CGATGT_L002_R1.txt GSM1480300 Basal pregnant
5 GSM1480301_MCL1-DK_BC2CTUACXX_TTAGGC_L002_R1.txt GSM1480301 Basal lactate
6 GSM1480302_MCL1-DL_BC2CTUACXX_ATCACG_L002_R1.txt GSM1480302 Basal lactate
7 GSM1480291_MCL1-LA_BC2CTUACXX_GATCAG_L001_R1.txt GSM1480291 Luminal virgin
8 GSM1480292_MCL1-LB_BC2CTUACXX_TGACCA_L001_R1.txt GSM1480292 Luminal virgin
9 GSM1480293_MCL1-LC_BC2CTUACXX_GCCAAT_L001_R1.txt GSM1480293 Luminal pregnant
10 GSM1480294_MCL1-LD_BC2CTUACXX_GGCTAC_L001_R1.txt GSM1480294 Luminal pregnant
11 GSM1480295_MCL1-LE_BC2CTUACXX_TAGCTT_L001_R1.txt GSM1480295 Luminal lactate
12 GSM1480296_MCL1-LF_BC2CTUACXX_CTTGTA_L001_R1.txt GSM1480296 Luminal lactate
#Other ways to view the entire dataset
#View(design) # open via Environment pane
glimpse(design)
Rows: 12
Columns: 4
$ File <chr> "GSM1480297_MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1.txt", "GSM14802…
$ Sample <chr> "GSM1480297", "GSM1480298", "GSM1480299", "GSM1480300", "GSM1…
$ CellType <chr> "Basal", "Basal", "Basal", "Basal", "Basal", "Basal", "Lumina…
$ Status <chr> "virgin", "virgin", "pregnant", "pregnant", "lactate", "lacta…
If we want to create a new dataset that only includes certain variables, we can use the select()
function from the dplyr
package.
Note that select()
does not change the original tibble, but makes a new tibble with the specified columns. If you want to keep this new dataset, create a new object.
new_design
which contains only Sample, CellType, Status variables.select()
function from the dplyr
package.select()
function# select variables of interest
select(design, Sample, CellType, Status)
# A tibble: 12 × 3
Sample CellType Status
<chr> <chr> <chr>
1 GSM1480297 Basal virgin
2 GSM1480298 Basal virgin
3 GSM1480299 Basal pregnant
4 GSM1480300 Basal pregnant
5 GSM1480301 Basal lactate
6 GSM1480302 Basal lactate
7 GSM1480291 Luminal virgin
8 GSM1480292 Luminal virgin
9 GSM1480293 Luminal pregnant
10 GSM1480294 Luminal pregnant
11 GSM1480295 Luminal lactate
12 GSM1480296 Luminal lactate
# store the selection in a new dataset
<- select(design, Sample, CellType, Status)
new_design new_design
# A tibble: 12 × 3
Sample CellType Status
<chr> <chr> <chr>
1 GSM1480297 Basal virgin
2 GSM1480298 Basal virgin
3 GSM1480299 Basal pregnant
4 GSM1480300 Basal pregnant
5 GSM1480301 Basal lactate
6 GSM1480302 Basal lactate
7 GSM1480291 Luminal virgin
8 GSM1480292 Luminal virgin
9 GSM1480293 Luminal pregnant
10 GSM1480294 Luminal pregnant
11 GSM1480295 Luminal lactate
12 GSM1480296 Luminal lactate
# an other way
<- select(design, -File)
new_design new_design
# A tibble: 12 × 3
Sample CellType Status
<chr> <chr> <chr>
1 GSM1480297 Basal virgin
2 GSM1480298 Basal virgin
3 GSM1480299 Basal pregnant
4 GSM1480300 Basal pregnant
5 GSM1480301 Basal lactate
6 GSM1480302 Basal lactate
7 GSM1480291 Luminal virgin
8 GSM1480292 Luminal virgin
9 GSM1480293 Luminal pregnant
10 GSM1480294 Luminal pregnant
11 GSM1480295 Luminal lactate
12 GSM1480296 Luminal lactate
# with pipe operator %>% from magrittr package
<- design %>% select(-File) new_design
If we want to subset a data frame keeping only rows that satisfy some criteria based on variables, we can use the filter()
function from the dplyr
package. You need to specify an expression named Boolean expression whose evaluation results in a value of TRUE or FALSE.
filter()
function==
# only the rows where the samples come from Luminal cells
<- filter(design, CellType == "Luminal")
subset_1 # only the rows where the samples come from lacting mice.
<- filter(design, Status == "lactate")
subset_2 # only the rows where the samples come from lacting mice and Luminal cells
<- filter(design, CellType == "Luminal", Status == "lactate")
subset_3 # same result
<- filter(design, CellType == "Luminal" & Status == "lactate") subset_3
To create new variables, we use the mutate()
function from dplyr
package. Note that if you want to save this new column, you need to save it to an object.
To identify genes that are differentially expressed between different biological conditions, it is convenient to define a new group variable. This synthetic variable is the combination of several variables. It is very useful to calculate some statistics and also to construct the statistical model of the differential analysis.
group
which is the combination between CellType and Status.group
variable to the design
data frame using the mutate()
function.str_c()
from stringr
package which is equivalent to paste()
function.mutate()
function# group is defined as the combination between CellType and Status
<- str_c(design$CellType, design$Status, sep = "_")
group <- paste(design$CellType, design$Status, sep = "_")
group # Attach group to the design data frame
<- mutate(design, group = str_c(CellType, Status, sep = "_")) design
Description of differential analysis results
The file complete_Basal.Pregnant_vs_Lactate.csv
contains the results of a differential analysis performed using edgeR
package. Read counts were normalized using trimmed mean of M-values method (TMM). We are interested in identifying genes whose expression changes between the Basal_pregnant and Basal_lactate groups. A generalized linear model was fitted for each gene [3]. A positive (respectively, negative) log-foldchange indicates genes that are up- (respectively, down-) regulated in Basal_pregnant over Basal_lactate group. Multiplicity correction is performed by applying the Benjamini-Hochberg method on the p-values, to control the false discovery rate (FDR). A gene is declared differentially expressed if its FDR is lower than 0.05.
Below is an extract of the results of a differential analysis.
...1 | ENTREZID | SYMBOL | GENENAME | logFC | logCPM | LR | PValue | FDR |
---|---|---|---|---|---|---|---|---|
497097 | 497097 | Xkr4 | X-linked Kx blood group related 4 | -1.2575060 | 2.572255 | 5.9808162 | 0.0144623 | 0.0408293 |
20671 | 20671 | Sox17 | SRY (sex determining region Y)-box 17 | 0.3670579 | 1.310053 | 0.6801565 | 0.4095328 | 0.5453077 |
27395 | 27395 | Mrpl15 | mitochondrial ribosomal protein L15 | -0.0353455 | 3.998092 | 0.0348689 | 0.8518705 | 0.9081256 |
18777 | 18777 | Lypla1 | lysophospholipase 1 | -0.0854954 | 5.055784 | 0.4514209 | 0.5016610 | 0.6290739 |
21399 | 21399 | Tcea1 | transcription elongation factor A (SII) 1 | 0.2226795 | 5.629010 | 3.4048849 | 0.0650036 | 0.1357270 |
Note that ENTREZID is the gene identifier, SYMBOL and GENENAME are other information about the gene. The logFC measure is the fold-change transformed into log2 and represents the difference in expression. A log(FC) of 1 means twice as expressed. The variables LR and PValue are likelihood ratio test and the associated p-value, respectively. FDR represents the false discovery rate calculated with the Benjamini-Hochberg correction.
This Volcano plot enable us to visualize the significance of change (i.e. p-value or adjusted p-value like FDR) versus the fold change (i.e. logFC). Each point represents a gene which are colored according to its significance. The most differentially expressed genes are located towards the top of the plot.
The results of differential analysis are very often explored by sorting the data file according to the significance of the test and the fold-change value. We will use the arrange()
function of the dplyr
package which allows to easily order the rows of a data table by the values of the selected columns.
Part1: arranging
complete_Basal.Pregnant_vs_Lactate.csv
using readr
package. Attach it to a tibble called res
. It contains the results of the differential analysis between the Basal_pregnant and Basal_lactate groups.arrange()
function from dplyr
package. The top genes are the most differentially expressed and the bottom genes are not differentially expressed. Give the top ten of genes.arrange()
function from dplyr
package. At the top of the list are the genes with the highest difference in expression but which are not necessarily differentially expressed. Give an example of a gene which is not differentially expressed although its logFC is large.Part2: challenge
up_genes
including only differentially expressed genes (FDR < 0.05)
which are up regulated (logFC > 0)
.down_genes
including only differentially expressed genes which are down regulated (logFC < 0)
.up_genes
data frame and conversely in the down_genes
data frame.Part1: arranging
readr
packageread_csv()
and read_csv2()
for csv files, read_tsv()
for tabs separated filesarrange()
Part2: challenge
filter()
and arrange()
functionsmagrittr
package can facilitate the chaining of lines of codePart1: arranging
# read file
<- read_csv2(file = "./Data/complete_Basal.Pregnant_vs_Lactate.csv")
res # orders the rows by values of FDR
<- arrange(res, FDR)
res_signif_ord # top 10 of genes
res_signif_ord
# A tibble: 15,804 × 9
...1 ENTREZID SYMBOL GENENAME logFC logCPM LR PValue FDR
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 211577 211577 Mrgprf MAS-related GPR… -5.15 2.74 407. 1.61e-90 2.55e-86
2 12992 12992 Csn1s2b casein alpha s2… -6.08 10.2 404. 7.71e-90 6.09e-86
3 14620 14620 Gjb3 gap junction pr… 3.58 5.50 316. 1.20e-70 6.33e-67
4 12740 12740 Cldn4 claudin 4 5.32 9.87 310. 2.09e-69 8.25e-66
5 140474 140474 Muc4 mucin 4 7.17 6.05 258. 3.86e-58 1.22e-54
6 226101 226101 Myof myoferlin -2.32 6.44 252. 1.20e-56 3.17e-53
7 21953 21953 Tnni2 troponin I, ske… -5.74 3.86 242. 1.45e-54 3.27e-51
8 231830 231830 Micall2 MICAL-like 2 2.25 5.18 230. 6.10e-52 1.21e-48
9 78896 78896 Ecrg4 ECRG4 augurin p… 2.81 6.68 225. 7.87e-51 1.38e-47
10 24117 24117 Wif1 Wnt inhibitory … 1.82 6.76 221. 4.32e-50 6.83e-47
# ℹ 15,794 more rows
# other solution
# res_signif_ord %>% slice_head(n = 10)
# rows are sorted in descending order of logFC
<- arrange(res, desc(logFC))
res_FC_ord #View(res_FC_ord)
# this gene is not differentially expressed (FDR > 0.05) although its logFC is large
filter(res_FC_ord, ENTREZID==100123473)
# A tibble: 1 × 9
...1 ENTREZID SYMBOL GENENAME logFC logCPM LR PValue FDR
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100123473 100123473 Trdc T cell receptor d… 4.24 -1.33 5.22 0.0223 0.0582
# the first 5 genes which are not differentially expressed although their logFC is large
%>% filter(FDR > 0.05) %>% slice_head(n = 5) res_FC_ord
# A tibble: 5 × 9
...1 ENTREZID SYMBOL GENENAME logFC logCPM LR PValue FDR
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100123473 100123473 Trdc T cell receptor … 4.24 -1.33 5.22 0.0223 0.0582
2 19144 19144 Klk6 kallikrein relat… 3.95 0.878 4.56 0.0328 0.0791
3 78076 78076 Lcn8 lipocalin 8 3.95 -2.41 4.84 0.0278 0.0694
4 20309 20309 Cxcl15 chemokine (C-X-C… 3.94 4.11 4.78 0.0288 0.0714
5 100503289 100503289 Gm14133 predicted gene 1… 3.94 -1.09 4.81 0.0283 0.0704
Part2: challenge
# challenge arrange
<- filter(res, FDR < 0.05, logFC > 0) %>%
up_genes arrange(desc(logFC)) # Note that we order from highest to lowest value
<- filter(res, FDR < 0.05, logFC < 0) %>%
down_genes arrange(logFC) # Note that we order from lowest to highest value
Descriptive statistics are used to describe the data in a study and provide some summary statistics. They can be calculated using the function summarise()
from dplyr
package.
We want to obtain descriptive statistics (e.g. mean, median, variance) on differentially expressed genes for both up- and down- regulated genes.
Part1: summarising
up_genes
including only differentially expressed genes (FDR < 0.05)
which are up regulated (logFC > 0)
. Provide the following statistics: number of genes up regulated, the minimum, the mean, the median and the maximum of the logFC variable.down_genes
including only differentially expressed genes which are down regulated.Part2: grouping
res
tibble to keep only differentially expressed genes (FDR < 0.05)
.(logFC > 0)
using group_by()
function form dplyr
package.Part1: summarising
filter()
functionsummarise()
functionPart2: grouping
summarise()
functionmagrittr
package can facilitate the chaining of lines of codePart1: summarising
# create data containing only up, respectively down genes
<- filter(res, FDR < 0.05, logFC > 0)
up_genes <- filter(res, FDR < 0.05, logFC < 0)
down_genes # summary statistics for up_genes data frame
summarise(up_genes,
Nb = n(),
min_logFC = min(logFC),
mean_logFC = mean(logFC),
median_logFC = median(logFC),
max_logFC = max(logFC))
# A tibble: 1 × 5
Nb min_logFC mean_logFC median_logFC max_logFC
<int> <dbl> <dbl> <dbl> <dbl>
1 3060 0.253 1.08 0.793 7.72
# summary statistics for down_genes data frame
summarise(down_genes,
Nb = n(),
min_logFC = min(logFC),
mean_logFC = mean(logFC),
median_logFC = median(logFC),
max_logFC = max(logFC))
# A tibble: 1 × 5
Nb min_logFC mean_logFC median_logFC max_logFC
<int> <dbl> <dbl> <dbl> <dbl>
1 2812 -6.48 -0.812 -0.654 -0.257
Part2: grouping
%>%
res filter(FDR < 0.05) %>%
group_by(logFC > 0) %>%
summarise(Nb = n(),
min_logFC = min(logFC),
mean_logFC = mean(logFC),
median_logFC = median(logFC),
max_logFC = max(logFC),
.groups = "drop") # if you want to "drop" the groupings for further operations
# A tibble: 2 × 6
`logFC > 0` Nb min_logFC mean_logFC median_logFC max_logFC
<lgl> <int> <dbl> <dbl> <dbl> <dbl>
1 FALSE 2812 -6.48 -0.812 -0.654 -0.257
2 TRUE 3060 0.253 1.08 0.793 7.72
Description of the counts data
Below is the first lines of the counts data table in the wide format, named counts
. Each row in the table represents a given gene identified by its EntrezGeneID identifier. The other variables are Length (gene’s length) and the name of the studied samples. The counts data table contains the counts, i.e. the number of reads that align at a particular genomic position for each gene and sample.
EntrezGeneID | Length | MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1 | MCL1-DH_BC2CTUACXX_CAGATC_L002_R1 | MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1 | MCL1-DJ_BC2CTUACXX_CGATGT_L002_R1 | MCL1-DK_BC2CTUACXX_TTAGGC_L002_R1 | MCL1-DL_BC2CTUACXX_ATCACG_L002_R1 | MCL1-LA_BC2CTUACXX_GATCAG_L001_R1 | MCL1-LB_BC2CTUACXX_TGACCA_L001_R1 | MCL1-LC_BC2CTUACXX_GCCAAT_L001_R1 | MCL1-LD_BC2CTUACXX_GGCTAC_L001_R1 | MCL1-LE_BC2CTUACXX_TAGCTT_L001_R1 | MCL1-LF_BC2CTUACXX_CTTGTA_L001_R1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
497097 | 3634 | 438 | 300 | 65 | 237 | 354 | 287 | 0 | 0 | 0 | 0 | 0 | 0 |
100503874 | 3259 | 1 | 0 | 1 | 1 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
100038431 | 1634 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
We want to convert the counts
data table from wide format into long format using pivot_longer()
function from tidyr
package. This will make it easier to perform statistical data analyses and create graphs with ggplot2
package. In the wide format count values are stored in multiple columns for each sample. We convert it into long format containing all the count values in a single column. The number of rows increases and the number of columns decreases in the long format.
Below is the first lines of the counts data table reformatted in the long format, named counts_long
.
EntrezGeneID | Length | Sample | Count |
---|---|---|---|
497097 | 3634 | MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1 | 438 |
497097 | 3634 | MCL1-DH_BC2CTUACXX_CAGATC_L002_R1 | 300 |
497097 | 3634 | MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1 | 65 |
497097 | 3634 | MCL1-DJ_BC2CTUACXX_CGATGT_L002_R1 | 237 |
497097 | 3634 | MCL1-DK_BC2CTUACXX_TTAGGC_L002_R1 | 354 |
Part1: Reading data
GSE60450_Lactation-GenewiseCounts.txt
with tabs “\t” separator character is tabs using readr
package. Attach it to a tibble called counts
.counts
object. How many genes were expressed ? How many samples were studied ?Part2: Convert data from wide format into long format
pivot_longer()
function to convert the counts data from wide format to long format. Attach it to a tibble called counts_long
which contains two new columns Sample
and Count
columns.Part1: Reading data
View()
or glimpse()
to view the entire data setslice_head()
to view the first linesPart2: Convert data from wide format into long format
cols
, names_to
and values_to arguments
cols
argument indicates the columns to reformatcols
argument, you can use starts_with()
function, or c()
or :
between first and last column of interestnames_to
corresponding to Sample
and values_to arguments
named Count
containing gene expression values.Part1: Reading data
<- read_tsv("./Data/GSE60450_Lactation-GenewiseCounts.txt", show_col_types = FALSE)
counts # To know the data
spec(counts) # type of columns
cols(
EntrezGeneID = col_double(),
Length = col_double(),
`MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1` = col_double(),
`MCL1-DH_BC2CTUACXX_CAGATC_L002_R1` = col_double(),
`MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1` = col_double(),
`MCL1-DJ_BC2CTUACXX_CGATGT_L002_R1` = col_double(),
`MCL1-DK_BC2CTUACXX_TTAGGC_L002_R1` = col_double(),
`MCL1-DL_BC2CTUACXX_ATCACG_L002_R1` = col_double(),
`MCL1-LA_BC2CTUACXX_GATCAG_L001_R1` = col_double(),
`MCL1-LB_BC2CTUACXX_TGACCA_L001_R1` = col_double(),
`MCL1-LC_BC2CTUACXX_GCCAAT_L001_R1` = col_double(),
`MCL1-LD_BC2CTUACXX_GGCTAC_L001_R1` = col_double(),
`MCL1-LE_BC2CTUACXX_TAGCTT_L001_R1` = col_double(),
`MCL1-LF_BC2CTUACXX_CTTGTA_L001_R1` = col_double()
)
#View(counts) # view entire data
glimpse(counts) # view data
Rows: 27,179
Columns: 14
$ EntrezGeneID <dbl> 497097, 100503874, 100038431, 1988…
$ Length <dbl> 3634, 3259, 1634, 9747, 3130, 4203…
$ `MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1` <dbl> 438, 1, 0, 1, 106, 309, 652, 0, 16…
$ `MCL1-DH_BC2CTUACXX_CAGATC_L002_R1` <dbl> 300, 0, 0, 1, 182, 234, 515, 1, 14…
$ `MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1` <dbl> 65, 1, 0, 0, 82, 337, 948, 0, 1721…
$ `MCL1-DJ_BC2CTUACXX_CGATGT_L002_R1` <dbl> 237, 1, 0, 0, 105, 300, 935, 0, 13…
$ `MCL1-DK_BC2CTUACXX_TTAGGC_L002_R1` <dbl> 354, 0, 0, 0, 43, 290, 928, 0, 115…
$ `MCL1-DL_BC2CTUACXX_ATCACG_L002_R1` <dbl> 287, 4, 0, 0, 82, 270, 791, 0, 106…
$ `MCL1-LA_BC2CTUACXX_GATCAG_L001_R1` <dbl> 0, 0, 0, 10, 16, 560, 826, 0, 1334…
$ `MCL1-LB_BC2CTUACXX_TGACCA_L001_R1` <dbl> 0, 0, 0, 3, 25, 464, 862, 1, 1258,…
$ `MCL1-LC_BC2CTUACXX_GCCAAT_L001_R1` <dbl> 0, 0, 0, 10, 18, 489, 668, 2, 1068…
$ `MCL1-LD_BC2CTUACXX_GGCTAC_L001_R1` <dbl> 0, 0, 0, 2, 8, 328, 646, 0, 926, 6…
$ `MCL1-LE_BC2CTUACXX_TAGCTT_L001_R1` <dbl> 0, 0, 0, 0, 3, 307, 544, 2, 508, 2…
$ `MCL1-LF_BC2CTUACXX_CTTGTA_L001_R1` <dbl> 0, 0, 0, 0, 10, 342, 581, 2, 500, …
slice_head(counts, n=3) # view first lines
# A tibble: 3 × 14
EntrezGeneID Length `MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1` MCL1-DH_BC2CTUACXX_C…¹
<dbl> <dbl> <dbl> <dbl>
1 497097 3634 438 300
2 100503874 3259 1 0
3 100038431 1634 0 0
# ℹ abbreviated name: ¹`MCL1-DH_BC2CTUACXX_CAGATC_L002_R1`
# ℹ 10 more variables: `MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1` <dbl>,
# `MCL1-DJ_BC2CTUACXX_CGATGT_L002_R1` <dbl>,
# `MCL1-DK_BC2CTUACXX_TTAGGC_L002_R1` <dbl>,
# `MCL1-DL_BC2CTUACXX_ATCACG_L002_R1` <dbl>,
# `MCL1-LA_BC2CTUACXX_GATCAG_L001_R1` <dbl>,
# `MCL1-LB_BC2CTUACXX_TGACCA_L001_R1` <dbl>, …
is_tibble(counts) # to check the structure of the data
[1] TRUE
There are 27179 genes and 12 samples.
Part2: Convert data from wide format into long format
# Use pivot_longer() function to convert data from wide format into long format
# cols argument describes which columns need to be reshaped
# use starts_with() function to select columns of interest
<- pivot_longer(counts, cols = starts_with("MCL1"), names_to = "Sample", values_to = "Count")
counts_long
# code with pipe operator
<- counts %>%
counts_long pivot_longer(cols = starts_with("MCL1"), names_to = "Sample", values_to = "Count")
# use c() and - to remove specified columns
<- counts %>%
counts_long pivot_longer(cols = -c("EntrezGeneID", "Length"), names_to = "Sample", values_to = "Count")
# use ":" to specify first and last column of interest
<- counts %>%
counts_long pivot_longer(cols = 'MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1':'MCL1-LF_BC2CTUACXX_CTTGTA_L001_R1',
names_to = "Sample",
values_to = "Count")
Now, we want to add metadata about the samples provided by the design
table into the counts_long
table. This will make it easier to identify the information carried by each sample (i.e. whether it is a basal cell type) and use it in the graphs.
So, the joined data counts_long_allinfo
looks like that.
EntrezGeneID | Length | Sample | Count | File | Sample.y | CellType | Status | group |
---|---|---|---|---|---|---|---|---|
497097 | 3634 | MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1 | 438 | GSM1480297_MCL1-DG_BC2CTUACXX_ACTTGA_L002_R1.txt | GSM1480297 | Basal | virgin | Basal_virgin |
497097 | 3634 | MCL1-DH_BC2CTUACXX_CAGATC_L002_R1 | 300 | GSM1480298_MCL1-DH_BC2CTUACXX_CAGATC_L002_R1.txt | GSM1480298 | Basal | virgin | Basal_virgin |
497097 | 3634 | MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1 | 65 | GSM1480299_MCL1-DI_BC2CTUACXX_ACAGTG_L002_R1.txt | GSM1480299 | Basal | pregnant | Basal_pregnant |
File2
in the design
table using mutate()
function. File2
is a key variable which matches to the Sample
variable from the counts_long
table. In this case, this step creates the relationship between the two tables counts_long
and design
and is necessary before joining.# create File2 variable as key variable from design table
<- design %>%
design mutate(File2 = str_remove(File, ".txt")) %>% # pattern '.txt' is removed
mutate(File2 = str_sub(File2, 12)) # begin at the 12th character
full_join()
function to join the two tables counts_long
and design
by key variables identifying the samples. Attach it to counts_long_allinfo
.counts_long
and design
databy
argument will match, argument by = c("Sample" = "File2")
# create File2 variable as key variable from design table
<- design %>%
design mutate(File2 = str_remove(File, ".txt")) %>%
mutate(File2 = str_sub(File2, 12) )
# File 2 from design table corresponds to Sample from counts_long table
<- full_join(counts_long, design, by = c("Sample" = "File2")) counts_long_allinfo
Now, we can easily perform statistics by taking advantage of the long format of the count data table counts_long
. Next, we format the data table in wide format for a more comfortable presentation. To do it, we use pivot_wider()
function which is the inverse function of pivot_longer()
. In this wide format, the number of columns increases and the number of rows decreases.
Have a look below at the table called counts_mean
which contains the mean of counts for each gene by biological condition (ie group
).
EntrezGeneID | mean_Basal_lactate | mean_Basal_pregnant | mean_Basal_virgin | mean_Luminal_lactate | mean_Luminal_pregnant | mean_Luminal_virgin |
---|---|---|---|---|---|---|
11287 | 7.0 | 7.5 | 4.0 | 3.0 | 1.0 | 7 |
11298 | 0.5 | 0.5 | 0.5 | 0.5 | 1.0 | 0 |
11302 | 450.5 | 666.0 | 704.0 | 3.5 | 3.5 | 14 |
Reproduce the table above.
group
variable) from the counts_long_allinfo
table.pivot_wider()
function. Attach it to counts_mean
.counts_long_allinfo
datagroup_by()
and summarise()
functions to calculate mean of counts for each gene by biological conditionnames_from
, values_from
, names_prefix
arguments# calculate the mean of counts for each gene by biological condition (ie group)
<- counts_long_allinfo %>%
tmp select(EntrezGeneID, group, Count) %>%
group_by(EntrezGeneID, group) %>%
summarise(mean = mean(Count))
# reformatted data in wide format
# each line corresponds to a given gene with its average counts by group
<- tmp %>%
counts_mean pivot_wider(names_from = group, values_from = mean)
# customize column headers using names_prefix
<- tmp %>%
counts_mean pivot_wider(names_from = group,
values_from = mean,
names_prefix = "mean_",
)
sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
[3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.2
[5] purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[9] ggplot2_3.4.2 tidyverse_2.0.0 kableExtra_1.3.4
loaded via a namespace (and not attached):
[1] tidyselect_1.2.0 xfun_0.39 colorspace_2.1-0 vctrs_0.6.2
[5] generics_0.1.3 htmltools_0.5.5 viridisLite_0.4.2 yaml_2.3.7
[9] utf8_1.2.3 rlang_1.1.1 pillar_1.9.0 glue_1.6.2
[13] withr_2.5.0 bit64_4.0.5 lifecycle_1.0.3 munsell_0.5.0
[17] gtable_0.3.3 rvest_1.0.3 htmlwidgets_1.6.2 evaluate_0.21
[21] labeling_0.4.2 knitr_1.42 tzdb_0.3.0 fastmap_1.1.1
[25] parallel_4.2.1 fansi_1.0.4 highr_0.10 scales_1.2.1
[29] vroom_1.6.3 webshot_0.5.4 jsonlite_1.8.4 farver_2.1.1
[33] systemfonts_1.0.4 bit_4.0.5 hms_1.1.3 digest_0.6.31
[37] stringi_1.7.12 grid_4.2.1 cli_3.6.1 tools_4.2.1
[41] magrittr_2.0.3 crayon_1.5.2 pkgconfig_2.0.3 xml2_1.3.4
[45] timechange_0.2.0 rmarkdown_2.21 svglite_2.1.1 httr_1.4.6
[49] rstudioapi_0.14 R6_2.5.1 compiler_4.2.1
A work by Migale Bioinformatics Facility
https://migale.inrae.fr
Our two affiliations to cite us:
Université Paris-Saclay, INRAE, MaIAGE, 78350, Jouy-en-Josas, France
Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, 78350, Jouy-en-Josas, France