OENOVARDOCCPNL

Description of the project

Author
Affiliation

Christelle Hennequet-Antier

Migale bioinformatics facility

Published

March 9, 2026

Modified

June 4, 2026

Note

This document is a report of the analyses performed. You will find all the code used to analyze these data. The version of the tools (maybe in code chunks) and their references are indicated, for questions of reproducibility.

Aim of the project

Etude des facteurs de variation du microbiote des variétés résistantes aux maladies

Acquérir des données sur la composition des communautés microbiennes des variétés résistantes

Identifier et hiérarchiser les facteurs structurants qui impactent les communautés microbiennes : Effet millésime (2023, 2024), Effet région viticole (4 domaines), Effet variété (analysé dans 3 domaines Cazes, Pech Rouge, Vassal)

Analyse différentielles d’abondances réalisées avec ANCOM-BC2 par Gabriela

The aim of the following analyses is to explore the network of taxa in different context and identify clusters of taxa.

Partners

  • Olivier Rué - Migale bioinformatics facility - BioInfomics - INRAE
  • Christelle Hennequet-Antier - Migale bioinformatics facility - BioInfomics - INRAE
  • Gabriela Pinto Miguel - SPO - INRAE
  • Cécile Neuvéglise - SPO - INRAE

Deliverables

Deliverables agreed at the preliminary meeting (Table 1).

Table 1: Deliverables
  Definition Status
1 HTML report ✔️

Data management

Important

All data is managed by the migale facility for the duration of the project. Once the project is over, the Migale facility does not keep your data. We will provide you with the raw data and associated metadata that will be deposited on public repositories before the results are used. We can guide you in the submission process. We will then decide which files to keep, knowing that this report will also be provided to you and that the analyses can be replayed if needed.

Raw data

The raw data were phyloseq objects based on ASV and kmer.

Bioinformatics

Done by O. Rué and G. Pinto Miguel.

Biostatistics based on ASV

library(InraeThemes)
library(tidyverse) # data manipulation
library(phyloseq) # analysis of microbiome census data
library(phyloseq.extended)
library(mia)
library(fastcluster)
library(ComplexHeatmap)
library(igraph) # visualisation network
library(tidygraph)
library(ggraph)
library(ggplot2)
library(data.table)
library(DT)
library(openxlsx)
library(kableExtra)
library(paletteer)
library(patchwork) # arrange multiple graphs

Microbiome dataset exploration

Clean taxonomies

Some taxonomies needed to be cleaned to respect the lineage.

library(tidyverse) # data manipulation
library(stringr)
library(phyloseq)
library(phyloseq.extended)
library(microViz)


clean_unidentified_taxa_until_genus <- function(matrix_data, species_list) {
  for (sp in species_list) {
    rows_to_mod <- which(matrix_data[, "Species"] == sp)
    if (length(rows_to_mod) > 0) {
      matrix_data[rows_to_mod, "Species"] <- "S__Unidentified_sp."
      matrix_data[rows_to_mod, "Genus"]   <- "G__Unidentified"
    }
  }
  return(matrix_data)
}

clean_unidentified_taxa_until_family <- function(matrix_data, species_list) {
  for (sp in species_list) {
    rows_to_mod <- which(matrix_data[, "Species"] == sp)
    if (length(rows_to_mod) > 0) {
      matrix_data[rows_to_mod, "Species"] <- "S__Unidentified_sp."
      matrix_data[rows_to_mod, "Genus"]   <- "G__Unidentified"
      matrix_data[rows_to_mod, "Family"]  <- "F__Unidentified"
    }
  }
  return(matrix_data)
}

clean_unidentified_taxa_until_order <- function(matrix_data, species_list) {
  for (sp in species_list) {
    rows_to_mod <- which(matrix_data[, "Species"] == sp)
    if (length(rows_to_mod) > 0) {
      matrix_data[rows_to_mod, "Species"] <- "S__Unidentified_sp."
      matrix_data[rows_to_mod, "Genus"]   <- "G__Unidentified"
      matrix_data[rows_to_mod, "Family"]  <- "F__Unidentified"
      matrix_data[rows_to_mod, "Order"]  <- "O__Unidentified"
    }
  }
  return(matrix_data)
}

clean_unidentified_taxa_until_class <- function(matrix_data, species_list) {
  for (sp in species_list) {
    rows_to_mod <- which(matrix_data[, "Species"] == sp)
    if (length(rows_to_mod) > 0) {
      matrix_data[rows_to_mod, "Species"] <- "S__Unidentified_sp."
      matrix_data[rows_to_mod, "Genus"]   <- "G__Unidentified"
      matrix_data[rows_to_mod, "Family"]  <- "F__Unidentified"
      matrix_data[rows_to_mod, "Order"]  <- "O__Unidentified"
      matrix_data[rows_to_mod, "Class"]  <- "C__Unidentified"
    }
  }
  return(matrix_data)
}

clean_unidentified_taxa_until_phylum <- function(matrix_data, species_list) {
  for (sp in species_list) {
    rows_to_mod <- which(matrix_data[, "Species"] == sp)
    if (length(rows_to_mod) > 0) {
      matrix_data[rows_to_mod, "Species"] <- "S__Unidentified_sp."
      matrix_data[rows_to_mod, "Genus"]   <- "G__Unidentified"
      matrix_data[rows_to_mod, "Family"]  <- "F__Unidentified"
      matrix_data[rows_to_mod, "Order"]  <- "O__Unidentified"
      matrix_data[rows_to_mod, "Class"]  <- "C__Unidentified"
      matrix_data[rows_to_mod, "Phylum"]  <- "P__Unidentified"
    }
  }
  return(matrix_data)
}

physeq <- readRDS(file = "./input_data/ps_without_OC36_OC37.rds")

##### Remplacement des taxa names pour faciliter les modifs
tax_mat <- as(tax_table(physeq), "matrix")
taxa_names(physeq) <- paste0("ASV", 1:ntaxa(physeq))
#############################################################

##### Kingdom ########################################
# Tout doit être Fungi au niveau Kingdom
tax_table(physeq)[, "Kingdom"] <- "Fungi"
######################################################

##### Modification de la taxonomie double ############
tax_mat <- as(tax_table(physeq), "matrix")
tax_mat[tax_mat[, "Phylum"] == "c(\"Basidiomycota\", \"Basidiomycota\")", "Phylum"] <- "Basidiomycota"
tax_mat[tax_mat[, "Class"] == "c(\"Tremellomycetes\", \"Tremellomycetes\")", "Class"] <- "Tremellomycetes"
tax_mat[tax_mat[, "Order"] == "c(\"Tremellales\", \"Tremellales\")", "Order"] <- "Tremellales"
tax_mat[tax_mat[, "Family"] == "c(\"Bulleribasidiaceae\", \"Bulleribasidiaceae\")", "Family"] <- "Bulleribasidiaceae"
tax_mat[tax_mat[, "Genus"] == "c(\"Vishniacozyma\", \"Vishniacozyma\")", "Genus"] <- "Vishniacozyma"
tax_mat[tax_mat[, "Species"] == "c(\"Vishniacozyma_sp\", \"Vishniacozyma_sp\")", "Species"] <- "Vishniacozyma_sp"
tax_table(physeq) <- tax_table(tax_mat)
######################################################

##### Uniformasation des unidentified ################
tax_mat <- as(tax_table(physeq), "matrix")
tax_mat[tax_mat == "Fungi_sp"] <- "Unidentified"
tax_mat[tax_mat == "unidentified"] <- "Unidentified"
tax_mat[tax_mat == "Ascomycota_sp"] <- "Unidentified"
tax_mat[tax_mat == "Basidiomycota_sp"] <- "Unidentified"
tax_table(physeq) <- tax_table(tax_mat)
######################################################

##### Ajout du préfixe ###############################
physeq <- physeq %>% tax_prepend_ranks(sep="__")
######################################################

##### Modification de la taxonomie vide ##############
tax_mat <- as(tax_table(physeq), "matrix")
tax_mat["ASV1010","Kingdom"] <- "K__Fungi"
tax_mat["ASV1010","Phylum"] <- "P__Unidentified_Fungi"
tax_mat["ASV1010","Class"] <- "C__Unidentified_Fungi"
tax_mat["ASV1010","Order"] <- "O__Unidentified_Fungi"
tax_mat["ASV1010","Family"] <- "F__Unidentified_Fungi"
tax_mat["ASV1010","Genus"] <- "G__Unidentified_Fungi"
tax_mat["ASV1010","Species"] <- "S__Unidentified_Fungi"
tax_table(physeq) <- tax_table(tax_mat)
######################################################

##### Phylum #########################################
tax_mat <- as(tax_table(physeq), "matrix")
tax_mat[tax_mat[, "Phylum"] == "P__Basidiomycota_sp","Phylum"] <- "P__Basidiomycota"

tax_table(physeq) <- tax_table(tax_mat)

physeq %>% comp_barplot(tax_level = "Phylum") + coord_flip()

#######################################################

##### Class #########################################
physeq <- physeq %>% tax_fix(unknowns = c("C__Multi-affiliation", "C__Unidentified"))

physeq %>% comp_barplot(tax_level = "Class") + coord_flip()

#####################################################

##### Order #########################################
physeq <- physeq %>% tax_fix(unknowns = c("O__Multi-affiliation", "O__Unidentified"))

tax_mat <- as(tax_table(physeq), "matrix")
tax_mat[tax_mat[, "Order"] == "O__Saccharomycetales", "Class"] <- "C__Saccharomycetes"
tax_mat[tax_mat[, "Order"] == "O__Pleosporales", "Class"] <- "C__Dothideomycetes"
tax_table(physeq) <- tax_table(tax_mat)

physeq %>% comp_barplot(tax_level = "Order") + coord_flip()

#####################################################

##### Family #########################################
physeq <- physeq %>% tax_fix(unknowns = c("F__Multi-affiliation", "F__Unidentified"))

tax_mat <- as(tax_table(physeq), "matrix")
tax_mat[tax_mat[, "Family"] == "F__Cladosporiaceae", "Order"] <- "O__Cladosporiales"
tax_mat[tax_mat[, "Family"] == "F__Microbotryomycetes", "Order"] <- "O__Sporidiobolales"
tax_mat[tax_mat[, "Family"] == "F__Mycosphaerellaceae", "Order"] <- "O__Mycosphaerellales"
tax_mat[tax_mat[, "Family"] == "F__Neodevriesiaceae", "Order"] <- "O__Mycosphaerellales"
tax_mat[tax_mat[, "Family"] == "F__Symmetrosporaceae", "Order"] <- "O__Cystobasidiomycetes incertae sedis"
tax_mat[tax_mat[, "Family"] == "F__Teratosphaeriaceae", "Order"] <- "O__Mycosphaerellales"
tax_mat[tax_mat[, "Family"] == "F__Microbotryomycetes", "Order"] <- "O__Sporidiobolales"

tax_table(physeq) <- tax_table(tax_mat)

physeq %>% comp_barplot(tax_level = "Family") + coord_flip()

##### Genus #########################################
physeq <- physeq %>% tax_fix(unknowns = c("G__Multi-affiliation", "G__Unidentified"))

tax_mat <- as(tax_table(physeq), "matrix")
tax_mat[tax_mat[, "Genus"] == "G__Aureobasidium", "Family"] <- "F__Aureobasidiaceae"
tax_mat[tax_mat[, "Genus"] == "G__Curvibasidium", "Family"] <- "F__Microbotryomycetes"
tax_mat[tax_mat[, "Family"] == "F__Microbotryomycetes", "Order"] <- "O__Sporidiobolales"
tax_mat[tax_mat[, "Genus"] == "G__Pichia", "Family"] <- "F__Pichiaceae"
tax_table(physeq) <- tax_table(tax_mat)

physeq %>% comp_barplot(tax_level = "Genus") + coord_flip()

##### Species #########################################
physeq <- physeq %>% tax_fix(unknowns = c("S__Multi-affiliation", "S__Unidentified"))

tax_mat <- as(tax_table(physeq), "matrix")

species_modif_genus <- c("S__Phaeosphaeriaceae_sp.",
            "S__Pseudeurotiaceae_sp.",
            "S__Neodevriesiaceae_sp.",
            "S__Lophiostomataceae_sp.",
            "S__Herpotrichiellaceae_sp.",
            "S__Dipodascaceae_sp."
)
tax_mat <- clean_unidentified_taxa_until_genus(tax_mat, species_modif_genus)

species_modif_family <- c("S__Ostropales_sp.",
            "S__Pleosporales_sp.",
            "S__Polyporales_sp.",
            "S__Saccharomycetales_sp.",
            "S__Teratosphaeriaceae_sp.",
            "S__Tremellales_sp.",
            "S__Xylariales_sp.",
            "S__Helotiales_sp.",
            "S__Dothideales_sp.",
            "S__Capnodiales_sp."
)
tax_mat <- clean_unidentified_taxa_until_family(tax_mat, species_modif_family)

species_modif_order <- c("S__Agaricomycetes_sp.","S__Dothideomycetes_sp."
)
tax_mat <- clean_unidentified_taxa_until_order(tax_mat, species_modif_order)

species_modif_phylum <- c("S__Basidiomycota_sp.","S__Fungi_sp."
)
tax_mat <- clean_unidentified_taxa_until_phylum(tax_mat, species_modif_phylum)

tax_table(physeq) <- tax_table(as.matrix(tax_mat))

#physeq1 <- physeq

tax_mat <- as(tax_table(physeq), "matrix")

# 1. Extraction en data.frame
tax_mat <- as(tax_table(physeq), "matrix")
tax_df  <- as.data.frame(tax_mat)

# 2. On cible uniquement les lignes où l'espèce est "S_Unidentified_sp."
lignes_cibles <- which(tax_df$Species == "S__Unidentified_sp.")

# 3. Ordre des rangs pour l'analyse
rangs <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")

# 4. Boucle uniquement sur les lignes concernées
for (idx in lignes_cibles) {
  
  # A. Trouver le dernier rang valide (qui ne contient pas "Unidentified")
  last_valid_rank <- "K__Fungi" # Valeur par défaut de sécurité
  
  for (rang in rangs) {
    valeur_case <- tax_df[idx, rang]
    if (!grepl("unidentified", valeur_case, ignore.case = TRUE) & !is.na(valeur_case)) {
      last_valid_rank <- valeur_case
    } else {
      break # On s'arrête dès qu'on croise le premier "Unidentified"
    }
  }
  
  # B. Nettoyer le nom du parent trouvé (ex: "O__Polyporales" -> "Polyporales")
  nom_propre <- gsub("^[A-Za-z]__", "", last_valid_rank)
  
  # C. Remplacer les rangs qui étaient "Unidentified" par la nouvelle nomenclature
  if (grepl("unidentified", tax_df[idx, "Phylum"], ignore.case = TRUE)) {
    tax_df[idx, "Phylum"]  <- paste0("P__Unidentified_", nom_propre)
  }
  if (grepl("unidentified", tax_df[idx, "Class"], ignore.case = TRUE)) {
    tax_df[idx, "Class"]  <- paste0("C__Unidentified_", nom_propre)
  }
  if (grepl("unidentified", tax_df[idx, "Order"], ignore.case = TRUE)) {
    tax_df[idx, "Order"]  <- paste0("O__Unidentified_", nom_propre)
  }
  if (grepl("unidentified", tax_df[idx, "Family"], ignore.case = TRUE)) {
    tax_df[idx, "Family"]  <- paste0("F__Unidentified_", nom_propre)
  }
  if (grepl("unidentified", tax_df[idx, "Genus"], ignore.case = TRUE)) {
    tax_df[idx, "Genus"]   <- paste0("G__Unidentified_", nom_propre)
  }
  
  # L'espèce devient systématiquement "S_Unidentified_NomDuParent_sp"
  tax_df[idx, "Species"] <- paste0("S__Unidentified_", nom_propre, "_sp.")
}

# 5. Réinjection dans l'objet phyloseq
tax_table(physeq) <- tax_table(as.matrix(tax_df))

#-----------------------#
# Suppression du préfixe
tax <- as.data.frame(tax_table(physeq))
tax[] <- lapply(tax, function(x) {
  gsub("^[A-Za-z]__+", "", x)
})

tax_table(physeq) <- tax_table(as.matrix(tax))

### Modif des Ascomycota Phylum Class Order Family restants
tax <- as.data.frame(tax_table(physeq), stringsAsFactors = FALSE)

# taxa concernés
idx <- tax$Species == "Ascomycota_sp."

# reconstruction propre
tax[idx, "Class"]  <- paste0(tax[idx, "Phylum"], "_phylum")
tax[idx, "Order"]  <- paste0(tax[idx, "Phylum"], "_class")
tax[idx, "Family"] <- paste0(tax[idx, "Phylum"], "_order")
tax[idx, "Genus"]  <- paste0(tax[idx, "Phylum"], "_family")

#-----------------------#

# 5. Réinjection dans l'objet phyloseq
tax_table(physeq) <- tax_table(as.matrix(tax))
physeq %>% comp_barplot(tax_level = "Species") + coord_flip()

saveRDS(physeq, file="input_data/physeq_taxcleaned.rds")

Description of the entire dataset

Description of the entire dataset which is at ASV taxonomic rank. The taxonomic table was was cleaned to respect the lineage.

physeq <- readRDS(file = "./input_data/physeq_taxcleaned.rds")

Experimental design

# experimental design
sample_data(physeq) |> 
  as_tibble() |>
  datatable()

Number of biological repetition per factor

# count number of samples per condition
sample_data(physeq) |>
  as_tibble() |>
  group_by(Year) |>
  dplyr::count(Year, Variety_type, locality, treatment) |>
  datatable()

Check affiliation at Species rank

physeq %>% comp_barplot(tax_level = "Species") + coord_flip()

Separate the microbiome dataset by Year in TreeSummarizedExperiment format [1]. Manipulation of data will be performed using Microbiome analysis tools, mia R package [2].

# create TreeSummarizedExperiment objets
tse <- physeq |>
  convertFromPhyloseq() 
# manipulation
# assay(tse, "counts")[1:3, 1:3] #counts
# rowData(tse) |> head() # taxonomic table
# colData(tse) |> head() # samples table

We would like to agglomerate counts at species rank. So we need available and unicity of taxonomic rank names. We also renamed rownames of rowData to facilitate visualisation.

#taxonomyRanks(tse)
# il n'y a pas unicité au niveau espèce - faire une agglomération
# renommer les row si sequence ASV
#rowData(tse)$Species |> head()
table(taxonomyRankEmpty(tse, rank = "Species"))

FALSE 
 1314 
#rowData(tse)[is.na(rowData(tse)$Species), ]
index <- which(is.na(rowData(tse)$Species) == TRUE)
# assay(tse, "counts")[index, ] |> 
#   data.frame() |> 
#   colSums()
# nbASV <- assay(tse, "counts") |> 
#   data.frame() |> 
#   rowSums()  
#assay(tse, "counts")[nbASV < 150, ]
# rare <- getRare(tse, rank = "Species", prevalence = 0.2, detection = 0)
# rare |> head()

Aggregate at Species Rank To be done at Genus ?

Affiliation corrections were made.

tse_species <- agglomerateByRank(tse, rank = "Species")
tse_species
class: TreeSummarizedExperiment 
dim: 409 210 
metadata(1): agglomerated_by_rank
assays(1): counts
rownames(409): Acanthophysium_bisporum Acaromyces_ingoldii ...
  Zygosporium_pseudogibbum Zymoseptoria_brevis
rowData names(7): Kingdom Phylum ... Genus Species
colnames(210): 2023-OC01-1-B 2023-OC01-2-B ... 2024-OC09-2-B
  2024-OC09-3-B
colData names(15): Type Variety_name ... Plot Plot_2
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (409 rows)
rowTree: 1 phylo tree(s) (409 leaves)
colLinks: NULL
colTree: NULL
referenceSeq: a DNAStringSet (409 sequences)

Analyses by year

We analyzed separatly each year.

Year 2023

# separate tse per year
tse_2023 <- tse_species[, tse_species$Year =="2023"]
tse_2023
class: TreeSummarizedExperiment 
dim: 409 102 
metadata(1): agglomerated_by_rank
assays(1): counts
rownames(409): Acanthophysium_bisporum Acaromyces_ingoldii ...
  Zygosporium_pseudogibbum Zymoseptoria_brevis
rowData names(7): Kingdom Phylum ... Genus Species
colnames(102): 2023-OC01-1-B 2023-OC01-2-B ... 2023-OC09-1-B
  2023-OC09-2-B
colData names(15): Type Variety_name ... Plot Plot_2
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (409 rows)
rowTree: 1 phylo tree(s) (409 leaves)
colLinks: NULL
colTree: NULL
referenceSeq: a DNAStringSet (409 sequences)

Remove taxa with no counts

tse_2023 <- tse_2023[ , colSums(assay(tse_2023, "counts")) != 0 ]

Remove rare taxa with prevalence < 0.2

# get rare taxa
# getRare(tse_2023, prevalence = 0.01) |> 
#   head()
tse_2023 <- subsetByPrevalent(tse_2023, prevalence = 0.2, detection = 0)
colData(tse_2023)$SampleName_short <- str_remove(colData(tse_2023)$SampleName, "2023-")
# Calculate summary tables
summary(tse_2023, assay.type = "counts")
$samples
# A tibble: 1 × 6
  total_counts min_counts max_counts median_counts mean_counts stdev_counts
         <dbl>      <dbl>      <dbl>         <dbl>       <dbl>        <dbl>
1      9312353      44948     132792         91773      91298.       14976.

$features
# A tibble: 1 × 3
  total singletons per_sample_avg
  <int>      <int>          <dbl>
1   362          0           240.

Year 2024

tse_2024 <- tse_species[, tse_species$Year =="2024"]
tse_2024
class: TreeSummarizedExperiment 
dim: 409 108 
metadata(1): agglomerated_by_rank
assays(1): counts
rownames(409): Acanthophysium_bisporum Acaromyces_ingoldii ...
  Zygosporium_pseudogibbum Zymoseptoria_brevis
rowData names(7): Kingdom Phylum ... Genus Species
colnames(108): 2024-OC06-1-B 2024-OC06-2-B ... 2024-OC09-2-B
  2024-OC09-3-B
colData names(15): Type Variety_name ... Plot Plot_2
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (409 rows)
rowTree: 1 phylo tree(s) (409 leaves)
colLinks: NULL
colTree: NULL
referenceSeq: a DNAStringSet (409 sequences)

Remove taxa with no counts

tse_2024 <- tse_2024[ , colSums(assay(tse_2024, "counts")) != 0 ]

Get rare taxa with prevalence < 0.2

# get rare taxa
# getRare(tse_2024, prevalence = 0.01) |> 
#   head()
tse_2024 <- subsetByPrevalent(tse_2024, prevalence = 0.2, detection = 0)
colData(tse_2024)$SampleName_short <- str_remove(colData(tse_2024)$SampleName, "2024-")
# Calculate summary tables
summary(tse_2024, assay.type = "counts")
$samples
# A tibble: 1 × 6
  total_counts min_counts max_counts median_counts mean_counts stdev_counts
         <dbl>      <dbl>      <dbl>         <dbl>       <dbl>        <dbl>
1      6266132      40533     127883        56244.      58020.       13300.

$features
# A tibble: 1 × 3
  total singletons per_sample_avg
  <int>      <int>          <dbl>
1   306          0           177.

Barplot of the total number of reads per sample by year

colData(tse_2023)$nbtotal <- colSums(assay(tse_2023, "counts"))
colData(tse_2024)$nbtotal <- colSums(assay(tse_2024, "counts"))
tmp2023 <- colData(tse_2023) %>% 
  as_tibble() %>% 
  select(Variety_name, treatment, locality, SampleName_short, nbtotal)
tmp2024 <- colData(tse_2024) %>% 
  as_tibble() %>% 
  select(Variety_name, treatment, locality, SampleName_short, nbtotal)
tmp_paired <- dplyr::full_join(tmp2023, tmp2024, 
                               by = join_by(SampleName_short == SampleName_short),
                               suffix = c(".2023", ".2024"))
tmp_paired_nbtotaldiff <- tmp_paired$nbtotal.2023 - tmp_paired$nbtotal.2024

We observed a difference of sequence depth in median as {r median(tmp_paired_nbtotaldiff)} reads between the two years, based on paired samples.

p1 <- ggplot(colData(tse_2023), aes(x = SampleName, y = nbtotal, fill = locality)) + 
  geom_col() +
  facet_grid(~ Year + Variety_type + locality + treatment, scales = "free_x", space = "free") +
  theme(axis.text.x = element_text(size = 6, hjust = 1, angle=90)) +
  theme(legend.position="top")
p2 <- ggplot(colData(tse_2024), aes(x = SampleName, y = nbtotal, fill = locality)) + 
  geom_col() +
  facet_grid(~ Year + Variety_type + locality + treatment, scales = "free_x", space = "free") +
  theme(axis.text.x = element_text(size = 6, hjust = 1, angle=90)) +
  theme(legend.position="none")
p1 / p2

heatmap

Produce heatmap on normalized abundances using clr transformation. Rows and columns were clustered using euclidean distance and average aggregation method.

#calculate clr normalized counts
tse_2023 <- transformAssay(
    x = tse_2023,
    assay.type = "counts",
    method = "clr",
    pseudocount = TRUE,
    name = "clr"
)
# access clr normalized counts
#assay(tse_2023, "clr") |> head()
# heatmap options and annotations
# faster hclust
ht_opt$fast_hclust = TRUE
ht_opt$message = FALSE

# colors for column annotation
year_palette <- c(
  "2023" = "#B0CBE7FF",
  "2024" = "#5D74A5FF"
  )

variety_type_palette <- c(
  "resistant" = "#8EBACDFF",
  "conventional" = "#246893FF",
  "mediterraneen" = "#163274FF"
  )

treatment_palette <- c(
  "none" = "#FEF7C7FF",
  "bio" = "#88B560FF",
  "bas-intrants" = "#D89873FF",
  "conventional" = "#803342FF"
    )

locality_palette <- c(
  "Pech Rouge" = "#F5C762FF",
  "Cazes" = "#F8A856FF",
  "Chapitre" = "#F5894CFF",
  "Vassal" = "#DF5B47FF"
)


# column annotation
anno_df = data.frame(
    year = colData(tse_2023)$Year,
    variety_type = colData(tse_2023)$Variety_type,
    treatment = colData(tse_2023)$treatment,
    locality = colData(tse_2023)$locality
)

column_ha = HeatmapAnnotation(df = anno_df,
    col = list(
          year = year_palette,
          variety_type = variety_type_palette,
          treatment = treatment_palette,
          locality = locality_palette)
)


# Defined a large set of colors from RColorBrewer package
library(RColorBrewer)
qual_col_pals <- brewer.pal.info[brewer.pal.info$category == 'qual',]
col_vector <- unlist(mapply(brewer.pal, qual_col_pals$maxcolors, rownames(qual_col_pals)))
#pie(rep(1,n), col=sample(col_vector, n))

# colors for row annotation
n <- length(unique(rowData(tse_2023)[, "Phylum"]))
rowphylum <- col_vector[1:n]
names(rowphylum) <- unique(rowData(tse_2023)[, "Phylum"])

# rows annotation
row_ha = HeatmapAnnotation(
  name = "Phylum",
  phylum = rowData(tse_2023)[, "Phylum"],
  col = list(phylum = rowphylum),
  which = "row"
)

# cell colors
# custom_colors <- colorRampPalette(c("skyblue", "white", "red"))
ht <- Heatmap(
  assay(tse_2023, "clr"),
  name = "clr(Abund.)",  
#  col = custom_colors(1000),                
  row_title = "Species",
  column_title = "Samples",
  clustering_distance_rows = "euclidean",
 clustering_distance_columns = "euclidean",
 clustering_method_columns = "average",
  clustering_method_rows = "average",
  cluster_columns = TRUE,
# row_split = nbclusters,
  row_dend_reorder = TRUE,
#  top_annotation = col_ha,
  top_annotation = column_ha,
  right_annotation = row_ha,
# row_names_gp = gpar(fontsize = 7),
  show_column_names = FALSE,
  show_row_names = FALSE,
  use_raster = TRUE
)

draw(ht, merge_legend = TRUE)

#calculate clr normalized counts
tse_2024 <- transformAssay(
    x = tse_2024,
    assay.type = "counts",
    method = "clr",
    pseudocount = TRUE,
    name = "clr"
)
# access clr normalized counts
#assay(tse_2024, "clr") |> head()
# heatmap options and annotations
# faster hclust
ht_opt$fast_hclust = TRUE
ht_opt$message = FALSE

# colors for column annotation
year_palette <- c(
  "2023" = "#B0CBE7FF",
  "2024" = "#5D74A5FF"
  )

variety_type_palette <- c(
  "resistant" = "#8EBACDFF",
  "conventional" = "#246893FF",
  "mediterraneen" = "#163274FF"
  )

treatment_palette <- c(
  "none" = "#FEF7C7FF",
  "bio" = "#88B560FF",
  "bas-intrants" = "#D89873FF",
  "conventional" = "#803342FF"
    )

locality_palette <- c(
  "Pech Rouge" = "#F5C762FF",
  "Cazes" = "#F8A856FF",
  "Chapitre" = "#F5894CFF",
  "Vassal" = "#DF5B47FF"
)


# column annotation
anno_df = data.frame(
    year = colData(tse_2024)$Year,
    variety_type = colData(tse_2024)$Variety_type,
    treatment = colData(tse_2024)$treatment,
    locality = colData(tse_2024)$locality
)

column_ha = HeatmapAnnotation(df = anno_df,
    col = list(
          year = year_palette,
          variety_type = variety_type_palette,
          treatment = treatment_palette,
          locality = locality_palette)
)


# Defined a large set of colors from RColorBrewer package
library(RColorBrewer)
qual_col_pals <- brewer.pal.info[brewer.pal.info$category == 'qual',]
col_vector <- unlist(mapply(brewer.pal, qual_col_pals$maxcolors, rownames(qual_col_pals)))
#pie(rep(1,n), col=sample(col_vector, n))

# colors for row annotation
n <- length(unique(rowData(tse_2024)[, "Phylum"]))
rowphylum <- col_vector[1:n]
names(rowphylum) <- unique(rowData(tse_2024)[, "Phylum"])

# rows annotation
row_ha = HeatmapAnnotation(
  name = "Phylum",
  phylum = rowData(tse_2024)[, "Phylum"],
  col = list(phylum = rowphylum),
  which = "row"
)

# cell colors
# custom_colors <- colorRampPalette(c("skyblue", "white", "red"))
ht <- Heatmap(
  assay(tse_2024, "clr"),
  name = "clr(Abund.)",  
#  col = custom_colors(1000),                
  row_title = "Species",
  column_title = "Samples",
  clustering_distance_rows = "euclidean",
 clustering_distance_columns = "euclidean",
 clustering_method_columns = "average",
  clustering_method_rows = "average",
  cluster_columns = TRUE,
# row_split = nbclusters,
  row_dend_reorder = TRUE,
#  top_annotation = col_ha,
  top_annotation = column_ha,
  right_annotation = row_ha,
# row_names_gp = gpar(fontsize = 7),
  show_column_names = FALSE,
  show_row_names = FALSE,
  use_raster = TRUE
)

draw(ht, merge_legend = TRUE)

References

Computing information

write_rds(tse_2023, "./output_data/tse_2023.rds")
write_rds(tse_2024, "./output_data/tse_2024.rds")
sessioninfo::session_info(pkgs = "attached")
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.6.0 (2026-04-24)
 os       Ubuntu 24.04.4 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  fr_FR.UTF-8
 ctype    fr_FR.UTF-8
 tz       Europe/Paris
 date     2026-06-04
 pandoc   3.8.3 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
 quarto   1.9.38 @ /usr/local/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
 package                  * version date (UTC) lib source
 Biobase                  * 2.72.0  2026-04-28 [1] Bioconductor 3.23 (R 4.6.0)
 BiocGenerics             * 0.58.1  2026-05-14 [1] Bioconductor 3.23 (R 4.6.0)
 Biostrings               * 2.80.1  2026-05-22 [1] Bioconductor 3.23 (R 4.6.0)
 ComplexHeatmap           * 2.28.0  2026-04-28 [1] Bioconductor 3.23 (R 4.6.0)
 data.table               * 1.18.4  2026-05-06 [1] CRAN (R 4.6.0)
 dplyr                    * 1.2.1   2026-04-03 [1] CRAN (R 4.6.0)
 DT                       * 0.34.0  2025-09-02 [1] CRAN (R 4.6.0)
 fastcluster              * 1.3.0   2025-05-07 [1] CRAN (R 4.6.0)
 forcats                  * 1.0.1   2025-09-25 [1] CRAN (R 4.6.0)
 generics                 * 0.1.4   2025-05-09 [1] CRAN (R 4.6.0)
 GenomicRanges            * 1.64.0  2026-04-28 [1] Bioconductor 3.23 (R 4.6.0)
 ggplot2                  * 4.0.3   2026-04-22 [1] CRAN (R 4.6.0)
 ggraph                   * 2.2.2   2025-08-24 [1] CRAN (R 4.6.0)
 igraph                   * 2.3.1   2026-05-04 [1] CRAN (R 4.6.0)
 InraeThemes              * 3.1.0   2026-05-20 [1] Github (davidcarayon/InraeThemes@0564f7d)
 IRanges                  * 2.46.0  2026-04-28 [1] Bioconductor 3.23 (R 4.6.0)
 kableExtra               * 1.4.0   2024-01-24 [1] CRAN (R 4.6.0)
 lubridate                * 1.9.5   2026-02-04 [1] CRAN (R 4.6.0)
 MatrixGenerics           * 1.24.0  2026-04-28 [1] Bioconductor 3.23 (R 4.6.0)
 matrixStats              * 1.5.0   2025-01-07 [1] CRAN (R 4.6.0)
 mia                      * 1.20.0  2026-04-28 [1] Bioconductor 3.23 (R 4.6.0)
 miaViz                   * 1.20.0  2026-04-28 [1] Bioconductor 3.23 (R 4.6.0)
 microViz                 * 0.13.1  2026-05-26 [1] Github (david-barnett/microViz@b35bee4)
 MultiAssayExperiment     * 1.38.0  2026-04-28 [1] Bioconductor 3.23 (R 4.6.0)
 openxlsx                 * 4.2.8.1 2025-10-31 [1] CRAN (R 4.6.0)
 paletteer                * 1.7.0   2026-01-08 [1] CRAN (R 4.6.0)
 patchwork                * 1.3.2   2025-08-25 [1] CRAN (R 4.6.0)
 phyloseq                 * 1.56.0  2026-04-28 [1] Bioconductor 3.23 (R 4.6.0)
 phyloseq.extended        * 0.1.5   2026-05-05 [1] Github (mahendra-mariadassou/phyloseq-extended@d59e472)
 purrr                    * 1.2.2   2026-04-10 [1] CRAN (R 4.6.0)
 RColorBrewer             * 1.1-3   2022-04-03 [1] CRAN (R 4.6.0)
 readr                    * 2.2.0   2026-02-19 [1] CRAN (R 4.6.0)
 S4Vectors                * 0.50.1  2026-05-13 [1] Bioconductor 3.23 (R 4.6.0)
 Seqinfo                  * 1.2.0   2026-04-28 [1] Bioconductor 3.23 (R 4.6.0)
 SingleCellExperiment     * 1.34.0  2026-04-28 [1] Bioconductor 3.23 (R 4.6.0)
 stringr                  * 1.6.0   2025-11-04 [1] CRAN (R 4.6.0)
 SummarizedExperiment     * 1.42.0  2026-04-28 [1] Bioconductor 3.23 (R 4.6.0)
 tibble                   * 3.3.1   2026-01-11 [1] CRAN (R 4.6.0)
 tidygraph                * 1.3.1   2024-01-30 [1] CRAN (R 4.6.0)
 tidyr                    * 1.3.2   2025-12-19 [1] CRAN (R 4.6.0)
 tidyverse                * 2.0.0   2023-02-22 [1] CRAN (R 4.6.0)
 TreeSummarizedExperiment * 2.20.0  2026-04-28 [1] Bioconductor 3.23 (R 4.6.0)
 XVector                  * 0.52.0  2026-04-28 [1] Bioconductor 3.23 (R 4.6.0)

 [1] /home/chennequet/R/x86_64-pc-linux-gnu-library/4.6
 [2] /usr/local/lib/R/site-library
 [3] /usr/lib/R/site-library
 [4] /usr/lib/R/library
 * ── Packages attached to the search path.

──────────────────────────────────────────────────────────────────────────────

Infer network with Poisson LogNormal model

We used these TreeSummarizedExperiment objects (tse) [1] at species taxonomic rank. The Poisson LogNormal model (PLN model) [3] was performed to infer network for each year separately. The best model was chosen with StARS criterion applied to PLN model.

years <- c(2023, 2024)

The PLN model is design to handle counts data. It is defined as follows :

\[ Y_{ij} | Z_{ij} \sim \mathcal{P}(\text{exp}(Z_{ij})), \quad Z_i \sim \mathcal{N}(o_i + X_i^T B, \Omega^{-1}) \quad \text{with} \quad ||\Omega||_{1,0} \leq c \]

PLN-network with offsets (vector \(o_i\) corresponds to the total reads in sample i). Here, no covariate was included in the model. Inference in PLN-network focuses on the regression parameters \(B\) and the inverse covariance \(\Omega\).

Report analyses

Analysis and visualisation of networks.

We used tidygraph [4] and igraph [5] R packages to analyse and visualise networks.

years <- c(2023, 2024)

Report analyses

Export of the association edges in xlsx format:

Export of the nodes informations involved in the association edges in xlsx format:

Export of graph in graphML format:

Biostatistics based on kmer

We decided to not perform these network analyses as it will be difficult to interpret the results (2026/04/27).

References

1. Huang R, Soneson C, Ernst FGM, Rue-Albrecht KC, Yu G, Hicks SC, et al. TreeSummarizedExperiment: A S4 class for data with hierarchical structure. F1000Research. 2020;9:1246.
2. Borman T, Ernst FGM, Shetty SA, Lahti L. Mia: Microbiome analysis. 2026. https://microbiome.github.io/mia/.
3. Chiquet J, Mariadassou M, Robin S. The poisson-lognormal model as a versatile framework for the joint analysis of species abundances. Frontiers in Ecology and Evolution. 2021;9. doi:10.3389/fevo.2021.588292.
4. Pedersen TL. Tidygraph: A tidy API for graph manipulation. 2025. https://tidygraph.data-imaginist.com.
5. Antonov M, Csárdi G, Horvát S, Müller K, Nepusz T, Noom D, et al. Igraph enables fast and robust network analysis across programming languages. 2023. https://arxiv.org/abs/2311.10260.

Reuse

This document will not be accessible without prior agreement of the partners

A work by Migale Bioinformatics Facility
Université Paris-Saclay, INRAE, MaIAGE, 78350, Jouy-en-Josas, France
Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, 78350, Jouy-en-Josas, France