Plant-CATdb is a large catalogue of transcriptomic experiments dedicated to
plant gene expression studies.
This database, which was published in 2008 (Gagnot S., Tamby J.Ph. et al. 2008 Nucl. Ac. Res. 36: D986-90.),
was first created for managing transcriptomic data obtained with CATMA arrays. CATMA probes
consist of gene-specific sequence tags (GSTs) of 150 to 500 bp in length covering the Arabidopsis
thaliana genome.
The projects processed on the POPS platform and
stored in CATdb have since expanded to cover other plant species (more than 30
different plants today) and to follow the evolution of biotechnologies. CATdb contains experiments done with Affymetrix,
NimbleGen and Agilent DNA chips, and for more than ten years arrays have been progressively replaced by RNA-Seq based on NGS
technology.
The wealth and originality of CATdb are:
a complete description of each project, from the sample description (species, genotype, organ/tissue and growth conditions) to the raw data (pre-processing) and the computed data (DE gene lists or count tables);
the same pipeline, i.e. the same bioinformatics and statistical methods, is used to analyze all the data, and it is adapted to follow the evolution of each technology and tool;
all the data are distributed via a public FTP server.
Bioinformatics tools and Statistical pipelines for RNA-Seq
To facilitate comparisons, each sample follows the same analysis steps, from trimming to counting.
RNA-Seq preprocessing includes trimming library adapters and performing quality controls with dedicated software tools: the raw data (FASTQ) were trimmed with Trimmomatic (Bolger A.M. et al., 2014) to keep bases with a Phred quality score (Qscore) >20 and reads longer than 30 bases, and ribosomal RNA sequences were removed with SortMeRNA (Kopylova E. et al., 2012).
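The quality and length thresholds above can be illustrated with a minimal Python sketch. It is not Trimmomatic's algorithm: it only trims low-quality bases from the 3' end and then applies the length filter. The function names and the Phred+33 encoding are assumptions made for this illustration.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def trim_read(sequence, qualities, min_quality=20, min_length=30):
    """Trim low-quality bases from the 3' end, then apply the length filter.

    Returns the trimmed (sequence, qualities) pair, or None when the read
    is discarded. Simplified illustration, not the Trimmomatic procedure.
    """
    end = len(sequence)
    # Drop trailing bases whose Phred score is not above the threshold.
    while end > 0 and qualities[end - 1] <= min_quality:
        end -= 1
    # Keep only reads strictly longer than min_length after trimming.
    if end <= min_length:
        return None
    return sequence[:end], qualities[:end]


# Usage: a 40-base read whose last 5 bases have a low quality ('#' = Q2).
read = "A" * 40
scores = phred_scores("I" * 35 + "#" * 5)  # 'I' = Q40
trimmed = trim_read(read, scores)          # keeps the first 35 bases
```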
Reads were then mapped against a reference to obtain the count table. Two strategies were used, depending on genome accessibility and on the quality of the transcriptomic data. In the first, the mapper Bowtie version 2 (Langmead B., Salzberg S.L., 2012) was used to align reads against a transcriptome reference, with the local option and default values for the other parameters. The abundance of each gene was calculated by a custom script that parses the SAM files. In the second, the genomic mapper STAR (version 2.7, Dobin A. et al., 2013) was used to align reads against the complete genome, with the options local and outSAMprimaryFlag AllBestScore to keep the best results. The abundance of each gene was calculated with STAR. In both strategies, only reads mapping unambiguously to one gene were kept, thereby removing multi-hits.
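The counting rule for the transcriptome strategy can be sketched as follows. The actual CATdb script is not shown here, so this Python example is only one possible reading: it assumes that every alignment of a read appears as a SAM line, that mates share a read name and count as one read, and that a gene identifier is obtained by stripping the isoform suffix (a hypothetical naming rule such as AT1G01010.1 giving AT1G01010).

```python
from collections import Counter, defaultdict


def transcript_to_gene(reference_name):
    """Hypothetical naming rule: 'AT1G01010.1' -> 'AT1G01010'."""
    return reference_name.split(".")[0]


def count_unique_reads(sam_lines):
    """Count reads per gene, keeping only reads that hit exactly one gene."""
    genes_per_read = defaultdict(set)
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        read_name, flag, reference = fields[0], int(fields[1]), fields[2]
        if flag & 0x4 or reference == "*":  # unmapped read
            continue
        genes_per_read[read_name].add(transcript_to_gene(reference))

    counts = Counter()
    for genes in genes_per_read.values():
        if len(genes) == 1:  # unambiguous: all alignments fall on one gene
            counts[next(iter(genes))] += 1
    return counts


# Usage with a few hand-written SAM records.
example = [
    "@HD\tVN:1.6",
    "read1\t0\tAT1G01010.1\t1\t42\t50M\t*\t0\t0\t*\t*",
    "read2\t0\tAT1G01010.1\t5\t1\t50M\t*\t0\t0\t*\t*",
    "read2\t256\tAT1G01020.1\t9\t1\t50M\t*\t0\t0\t*\t*",  # second gene: multi-hit
    "read3\t0\tAT1G01010.2\t3\t1\t50M\t*\t0\t0\t*\t*",
    "read3\t256\tAT1G01010.1\t3\t1\t50M\t*\t0\t0\t*\t*",  # same gene, other isoform
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",                # unmapped
]
counts = count_unique_reads(example)
```

In this example read2 is dropped because it hits two genes, whereas read3 is kept because both of its alignments belong to isoforms of the same gene.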
Statistical analysis of RNA-Seq data
Differential analyses were performed using the procedure described
by Rigaill G. et al. (2016). Briefly, genes with less than 1 read after counts
per million (CPM) normalization in at least one half of the samples were
discarded. Library size was normalized using the trimmed mean of M-values (TMM)
method and the count distribution was modeled with a negative binomial
generalized linear model. Dispersion was estimated with the edgeR package
(Version 1.12.0, McCarthy et al., 2012) in the statistical software ‘R’
(Version 3.2.5, R Development Core Team, 2005).
Differentially expressed genes were found using the DicoExpress software
tool (Lambert I. et al., 2020) developed by the
GNet team at IPS2.
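The low-count filter described above can be illustrated with a short Python sketch. The analyses themselves are run in R with edgeR and DicoExpress; the code below only shows one reading of the rule (a gene is discarded when its CPM is below 1 in at least half of the samples). The function names and the data layout (a dictionary mapping each gene to its list of raw counts) are assumptions for this example, and TMM normalization is not reproduced.

```python
def cpm(counts_by_gene):
    """Convert raw counts to counts per million, sample by sample.

    Assumes every gene has one count per sample and no library is empty.
    """
    n_samples = len(next(iter(counts_by_gene.values())))
    library_sizes = [
        sum(row[j] for row in counts_by_gene.values()) for j in range(n_samples)
    ]
    return {
        gene: [1e6 * row[j] / library_sizes[j] for j in range(n_samples)]
        for gene, row in counts_by_gene.items()
    }


def filter_low_counts(counts_by_gene, min_cpm=1.0):
    """Discard genes with CPM below min_cpm in at least half of the samples."""
    cpm_values = cpm(counts_by_gene)
    kept = {}
    for gene, row in counts_by_gene.items():
        n_low = sum(1 for value in cpm_values[gene] if value < min_cpm)
        if 2 * n_low < len(row):  # low in strictly fewer than half the samples
            kept[gene] = row
    return kept


# Usage: four samples, each with a library size of exactly one million reads.
raw_counts = {
    "A": [500000, 500000, 500000, 500000],
    "B": [499997, 499997, 499999, 499998],
    "C": [1, 1, 0, 0],  # CPM < 1 in two of four samples: discarded
    "D": [2, 2, 1, 2],  # CPM >= 1 everywhere: kept
}
kept = filter_low_counts(raw_counts)
```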
Statistical analysis of array data
For array technologies with two samples per chip (control and tested samples),
like CATMA arrays, the raw data comprised the logarithm of median
feature pixel intensity at wavelengths 635 nm (red) and 532 nm
(green). A global intensity-dependent normalization using the loess
procedure (Yang Y.H. et al., 2002) was performed to correct for the dye bias.
The differential analysis is based on the log-ratios averaged over
the duplicate probes and over the technical replicates. Hence the
number of available data points for each gene equals the number of
biological replicates, and these values are used to calculate the moderated t-test
(Smyth et al., 2004). Limma highlights no evidence that
the specific variances vary between probes;
consequently, under the null hypothesis, the moderated t-statistic is assumed to follow a
standard normal distribution.
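The averaging step can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline's code: intensities are taken to be already log-transformed, loess-normalized and oriented as tested versus control, and the tuple layout of a spot is hypothetical. With a balanced design, pooling duplicate probes and technical replicates in a single mean gives the same value as averaging them in two stages.

```python
from collections import defaultdict
from statistics import mean


def average_log_ratios(spots):
    """Average log-ratios per gene and biological replicate.

    Each spot is (gene, biological_replicate, log_tested, log_control).
    Duplicate probes and technical replicates of the same biological
    replicate are pooled, leaving one value per biological replicate.
    """
    grouped = defaultdict(list)
    for gene, bio_rep, log_tested, log_control in spots:
        # On a log scale, the ratio is a simple difference.
        grouped[(gene, bio_rep)].append(log_tested - log_control)

    averaged = defaultdict(dict)
    for (gene, bio_rep), log_ratios in grouped.items():
        averaged[gene][bio_rep] = mean(log_ratios)
    return {gene: dict(values) for gene, values in averaged.items()}


# Usage: one gene, two biological replicates, two spots each.
spots = [
    ("geneA", 1, 10.0, 9.0),
    ("geneA", 1, 10.5, 9.0),
    ("geneA", 2, 11.0, 10.0),
    ("geneA", 2, 11.0, 10.5),
]
log_ratios = average_log_ratios(spots)
```

The result holds one averaged log-ratio per biological replicate, which is the input expected by the moderated t-test.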
For technologies with one sample per array, like the Affymetrix GeneChip, the raw files
were imported into R with the Bioconductor software and were normalized with the
GCRMA algorithm available in Bioconductor.
For both technologies, the false discovery rate (FDR) was controlled by calculating adjusted p-values with the optimized FDR approach of Storey et al. (2003). A gene/probe was considered differentially expressed when its adjusted p-value was ≤ 0.05. All the analyses were done with the R software.
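The idea behind this adjustment can be sketched in Python. The example below is a simplified version of the Storey approach and not the R implementation used for CATdb: the proportion of true null hypotheses (pi0) is estimated with a single fixed tuning value, lam = 0.5, which is an assumption of this sketch, and the fallback to pi0 = 1 (the Benjamini-Hochberg case) when no p-value exceeds lam is also a choice made here.

```python
def storey_qvalues(p_values, lam=0.5):
    """Simplified Storey-type adjusted p-values (q-values) with a fixed lambda."""
    m = len(p_values)
    # Estimate the proportion of true nulls from the p-values above lam.
    pi0 = sum(1 for p in p_values if p > lam) / (m * (1.0 - lam))
    pi0 = min(1.0, pi0)
    if pi0 == 0.0:
        pi0 = 1.0  # assumption: fall back to the conservative BH adjustment

    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q_values[i] = running_min
    return q_values


# Usage: flag genes/probes with an adjusted p-value <= 0.05.
p_values = [0.001, 0.01, 0.02, 0.04, 0.3, 0.6, 0.7, 0.9]
q_values = storey_qvalues(p_values)
significant = [i for i, q in enumerate(q_values) if q <= 0.05]
```

Here three of the eight p-values exceed 0.5, so pi0 is estimated at 0.75 and the first three tests pass the 0.05 threshold.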
References