Plant-CATdb description  


Plant-CATdb is a large catalogue of transcriptomic experiments dedicated to plant gene expression studies.
This database, published in 2008 (Gagnot S., Tamby J.Ph. et al. 2008 Nucl. Ac. Res. 36: D986-90), was first created to manage transcriptomic data obtained with CATMA arrays. CATMA probes consist of gene-specific sequence tags (GSTs) of 150 to 500 bp in length covering the Arabidopsis thaliana genome.

The projects processed on the POPS platform and stored in CATdb have since evolved to cover other plant species (today more than 30 different plants) and to follow changes in biotechnologies. CATdb contains experiments done with Affymetrix, NimbleGen and Agilent DNA chips; for more than ten years now, arrays have progressively been replaced by RNA-Seq based on NGS technology.


The richness and originality of CATdb lie in:

  • a complete project description, from the sample description (species, genotype, organ/tissue and growth conditions) to the raw data (pre-processing) and the computed data (DE gene lists or count tables);

  • a single pipeline, i.e. the same bioinformatics and statistical methods, used to analyze all data and adapted to follow the evolution of each technology and tool;

  • all the data are distributed via a public FTP server.


Bioinformatics tools and Statistical pipelines for RNA-Seq

To facilitate comparisons, each sample follows the same analysis steps, from trimming to counting.

  • RNA-Seq preprocessing includes trimming library adapters and performing quality controls: the raw data (fastq) were trimmed with Trimmomatic (Bolger A.M. et al., 2014), requiring a Phred quality score (Qscore) >20 and a read length >30 bases, and ribosomal RNA sequences were removed with SortMeRNA (Kopylova E. et al., 2012).

  • Mapping reads against a reference to obtain the counts table. Two strategies were used, depending on genome availability and on the quality of the transcriptomic data. In the first, the mapper Bowtie version 2 (Langmead B., Salzberg S.L., 2012) was used to align reads against a reference transcriptome, with the local option and default values for the other parameters; the abundance of each gene was calculated by a custom script that parses the SAM files. In the second, the genomic mapper STAR (version 2.7, Dobin A. et al., 2013) was used to align reads against the complete genome, with the options local and outSAMprimaryFlag AllBestScore to keep the best results; the abundance of each gene was calculated with STAR. In both strategies, only reads mapping unambiguously to a single gene were kept, thereby removing multi-hits.
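As an illustration of the counting rule in the first strategy, here is a minimal sketch of a SAM parser that keeps only reads assigned to a single gene. It is not the script used for CATdb: the function name count_unique_hits, the transcript_to_gene mapping and the assumption that every reported alignment of a read appears as its own SAM record are hypothetical choices made for this example.

```python
from collections import Counter


def count_unique_hits(sam_lines, transcript_to_gene):
    """Count reads per gene, keeping only reads that hit exactly one gene.

    sam_lines: iterable of SAM text lines (header lines start with '@').
    transcript_to_gene: hypothetical dict mapping reference names to gene IDs.
    """
    genes_per_read = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        read_name, flag, ref_name = fields[0], int(fields[1]), fields[2]
        if flag & 0x4 or ref_name == "*":
            continue  # skip unmapped reads
        gene = transcript_to_gene.get(ref_name, ref_name)
        genes_per_read.setdefault(read_name, set()).add(gene)

    counts = Counter()
    for genes in genes_per_read.values():
        if len(genes) == 1:  # discard reads that hit several genes
            counts[next(iter(genes))] += 1
    return counts


# Toy input: r2 hits two genes (discarded), r3 hits two isoforms of one gene (kept).
example_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tAT1G01010.1\t10\t42\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tAT1G01010.1\t30\t1\t50M\t*\t0\t0\t*\t*",
    "r2\t256\tAT1G01020.1\t80\t1\t50M\t*\t0\t0\t*\t*",
    "r3\t0\tAT1G01020.1\t15\t1\t50M\t*\t0\t0\t*\t*",
    "r3\t256\tAT1G01020.2\t15\t1\t50M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
example_map = {
    "AT1G01010.1": "AT1G01010",
    "AT1G01020.1": "AT1G01020",
    "AT1G01020.2": "AT1G01020",
}
example_counts = count_unique_hits(example_sam, example_map)
```

Because records are grouped by read name, both mates of a pair contribute a single count when they fall in the same gene.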


Statistical analysis of RNA-Seq data

Differential analyses were performed using the procedure described by G. Rigaill et al. 2016. Briefly, genes with less than 1 read after counts per million (CPM) normalization in at least one half of the samples were discarded. Library size was normalized using the trimmed mean of M-values (TMM) method and the count distribution was modeled with a negative binomial generalized linear model. Dispersion was estimated with the edgeR package (Version 1.12.0, McCarthy D. et al., 2012) in the statistical software ‘R’ (Version 3.2.5, R Development Core Team, 2005).
Differentially expressed genes were found using the DicoExpress software tool (Lambert I. et al., 2020) developed by the GNet team at IPS2.
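The low-count filter described above can be sketched as follows. This is only an illustration in plain Python, not the R code of the pipeline; the function names cpm and filter_low_counts and the toy count table are assumptions of this example, and the rule is read literally (a gene is discarded when its CPM is below 1 in at least half of the samples). TMM normalization and the negative binomial model are not reproduced here.

```python
def cpm(counts):
    """Convert raw counts to counts per million.

    counts: dict gene -> list of raw counts, one value per sample.
    Assumes every sample has a non-zero library size.
    """
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [1e6 * value / lib_sizes[j] for j, value in enumerate(row)]
        for gene, row in counts.items()
    }


def filter_low_counts(counts, min_cpm=1.0):
    """Drop genes whose CPM is below min_cpm in at least half of the samples."""
    n_samples = len(next(iter(counts.values())))
    kept = {}
    for gene, row in cpm(counts).items():
        n_low = sum(1 for value in row if value < min_cpm)
        if 2 * n_low < n_samples:  # low in fewer than half of the samples
            kept[gene] = counts[gene]
    return kept


# Toy table with four samples (hypothetical values).
example_counts = {
    "geneA": [999990, 999990, 999990, 999990],
    "geneB": [10, 10, 10, 10],
    "geneC": [0, 0, 0, 5],    # low in 3 of 4 samples: discarded
    "geneD": [0, 0, 10, 10],  # low in exactly half of the samples: discarded
}
example_kept = filter_low_counts(example_counts)
```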


Statistical analysis of array data

For array technologies with 2 samples on a chip (control and tested samples), like CATMA arrays, the raw data comprised the logarithm of the median feature pixel intensity at wavelengths 635 nm (red) and 532 nm (green). A global intensity-dependent normalization using the loess procedure (Yang Y.H. et al., 2002) was performed to correct for the dye bias. The differential analysis is based on the log-ratios averaged over the duplicate probes and over the technical replicates. Hence the number of data points available for each gene equals the number of biological replicates, and these are used to calculate the moderated t-test (Smyth G.K., 2004). Under the null hypothesis, Limma highlights no evidence that the specific variances vary between probes, and consequently the moderated t-statistic is assumed to follow a standard normal distribution.
For technologies with one sample per array, like Affymetrix GeneChip, the raw files were imported into the Bioconductor software in R and normalized with the GCRMA algorithm available in the package.
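For two-colour arrays, the reduction from probe-level measurements to one value per biological replicate can be sketched as below. This is a simplified illustration only: the function name average_log_ratios and the input layout are hypothetical, the loess normalization and the moderated t-test are not reproduced, and all log-ratios are assumed to be already oriented as tested versus control.

```python
from statistics import mean


def average_log_ratios(measurements):
    """Average log-ratios within each biological replicate for one gene.

    measurements: list of (bio_rep, log_red, log_green) tuples, one per
    duplicate probe and technical replicate; intensities are already
    log-transformed, so the log-ratio is a simple difference.
    Returns a dict bio_rep -> mean log-ratio.
    """
    by_replicate = {}
    for bio_rep, log_red, log_green in measurements:
        by_replicate.setdefault(bio_rep, []).append(log_red - log_green)
    return {rep: mean(values) for rep, values in by_replicate.items()}


# Two biological replicates, each measured twice (hypothetical values).
example_measurements = [
    ("rep1", 10.0, 9.0),
    ("rep1", 10.5, 9.0),
    ("rep2", 8.0, 8.5),
    ("rep2", 8.0, 7.5),
]
example_means = average_log_ratios(example_measurements)
```

With a balanced design, this flat average equals averaging first over duplicate probes and then over technical replicates.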

In both technologies, to control the false discovery rate (FDR), adjusted p-values were calculated using the optimized FDR approach of Storey and Tibshirani (2003). Genes/probes with an adjusted p-value ≤ 0.05 were considered differentially expressed. All the analyses were done with the R software.
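To make the adjustment concrete, here is a simplified sketch of q-value computation in the spirit of Storey and Tibshirani (2003). It is not the R code used for CATdb: the function name storey_qvalues is hypothetical, the proportion of true null hypotheses (pi0) is estimated with a single fixed tuning value lam = 0.5 rather than a smoothed estimate over several values, and the fallback to pi0 = 1 is an assumption of this example.

```python
def storey_qvalues(pvalues, lam=0.5):
    """Return q-values for a list of p-values (simplified, fixed lambda)."""
    m = len(pvalues)
    # Estimate the proportion of true nulls from the p-values above lam.
    pi0 = min(1.0, sum(1 for p in pvalues if p > lam) / (m * (1.0 - lam)))
    if pi0 == 0.0:
        pi0 = 1.0  # conservative fallback chosen for this sketch

    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[index] / rank)
        qvalues[index] = running_min
    return qvalues


# Hypothetical p-values for eight genes.
example_pvalues = [0.001, 0.01, 0.03, 0.2, 0.6, 0.9, 0.04, 0.7]
example_qvalues = storey_qvalues(example_pvalues)
# Indices of genes called differentially expressed at the 0.05 threshold.
example_de = [i for i, q in enumerate(example_qvalues) if q <= 0.05]
```

Setting pi0 to 1 in this sketch gives the Benjamini-Hochberg adjustment, which is why the fallback is described as conservative.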


References

  • Bolger A.M. et al. 2014 Bioinformatics 30(15):2114-20
  • Dobin A. et al. 2013 Bioinformatics 29(1):15-21
  • Gagnot S. et al. 2008 Nucleic Acids Research 36:D986-90
  • Kopylova E. et al. 2012 Bioinformatics 28(24):3211-7
  • Lambert I. et al. 2020 Plant Methods 16:68
  • Langmead B. & Salzberg S.L. 2012 Nature Methods 9:357-9
  • McCarthy D. et al. 2012 Nucleic Acids Research 40(10):4288-97
  • Rigaill G. et al. 2016 Briefings in Bioinformatics 19(1):65-76
  • Smyth G.K. 2004 Statistical Applications in Genetics and Molecular Biology 3:article3
  • Storey J.D. & Tibshirani R. 2003 Proc. Natl. Acad. Sci. USA 100(16):9440-5
  • Yang Y.H. et al. 2002 Nucleic Acids Research 30(4):e15