Acropora pulchra PolyA and rRNA depletion comparison
Comparison of deep dive Acropora pulchra E5 samples processed with PolyA to rRNA depletion heatwave gametogenesis samples
Goal:
Use samples from five sequence samples collected in Moorea, French Polynesia part of the E5 Rules of Life project to map to A. pulchra de novo transcriptome to compare PolyA RNAseq and rRNA depletion RNAseq sequencing methods gene counts and identify if the duplication and overrepresented sequences we see in our rRNA depletion samples is an issue for downstream analysis.
Acropora pulchra E5 Rules of Life project transcriptome data files on University of Washington OWL data storage HPC database
Location on OWL, the HPC server for UW:
https://owl.fish.washington.edu/nightingales/A_pulchra/30-789513166/
Downloaded all files to Andromeda URI HPC location
cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw/
wget -r -nd -A .fastq.gz https://owl.fish.washington.edu/nightingales/A_pulchra/30-789513166/
wget -r -nd -A .md5 https://owl.fish.washington.edu/nightingales/A_pulchra/30-789513166/
ACR-140_TP2_R1_001.fastq.gz
ACR-140_TP2_R2_001.fastq.gz
ACR-145_TP2_R1_001.fastq.gz
ACR-145_TP2_R2_001.fastq.gz
ACR-150_TP2_R1_001.fastq.gz
ACR-150_TP2_R2_001.fastq.gz
ACR-173_TP2_R1_001.fastq.gz
ACR-173_TP2_R2_001.fastq.gz
ACR-178_TP2_R1_001.fastq.gz
ACR-178_TP2_R2_001.fastq.gz
ACR-140_TP2_R1_001.fastq.gz.md5
ACR-140_TP2_R2_001.fastq.gz.md5
ACR-145_TP2_R1_001.fastq.gz.md5
ACR-145_TP2_R2_001.fastq.gz.md5
ACR-150_TP2_R1_001.fastq.gz.md5
ACR-150_TP2_R2_001.fastq.gz.md5
ACR-173_TP2_R1_001.fastq.gz.md5
ACR-173_TP2_R2_001.fastq.gz.md5
ACR-178_TP2_R1_001.fastq.gz.md5
ACR-178_TP2_R2_001.fastq.gz.md5
Samples QC and Qubit Results
Tube Label | RNA_ng_µl | RNA_µl | RNA µg | Link to notebook post1 |
---|---|---|---|---|
ACR-140_TP2 | 12 | 90 | 1.08 | https://kterpis.github.io/Putnam_Lab_Notebook/20211012-RNA-DNA-extractions-from-E5-project/ |
ACR-145_TP2 | 20.8 | 87 | 1.8096 | https://kterpis.github.io/Putnam_Lab_Notebook/20211012-RNA-DNA-extractions-from-E5-project/ |
ACR-150_TP2 | 13 | 87 | 1.131 | https://github.com/Kterpis/Putnam_Lab_Notebook/blob/master/_posts/2021-09-03-20210903-RNA-DNA-extractions-from-E5-project.md |
ACR-173_TP2 | 11.4 | 87 | 0.9918 | https://kterpis.github.io/Putnam_Lab_Notebook/20211102-RNA-DNA-extractions-from-E5-project/ |
ACR-178_TP2 | 12.2 | 87 | 1.0614 | https://github.com/Kterpis/Putnam_Lab_Notebook/blob/master/_posts/2021-09-02-20210902-RNA-DNA-extractions-from-E5-project.md |
ACRP-CON | 43.2 | 24 | 1.0368 | https://github.com/daniellembecker/DanielleBecker_Lab_Notebook/blob/master/_posts/2023-04-25-Acropora-pulchra-transcriptome-extraction-concentration.md |
Workflow Steps
Downloaded all trimmed E5 deep dive files to Andromeda URI HPC location
Reads for samples ACR-140, ACR-145, ACR-150, ACR-173, and ACR-178 were trimmed using the built-in version of Trimmomatic with the default settings, following the 9FastQ QC and Trimming - E5 Coral RNA-seq Data for A.pulchra protocol.
cd /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive
wget -r -nd -A .fastq.gz https://gannet.fish.washington.edu/Atumefaciens/20230519-E5_coral-fastqc-fastp-multiqc-RNAseq/A_pulchra/trimmed/
RNA-ACR-140-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
RNA-ACR-140-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
RNA-ACR-145-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
RNA-ACR-145-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
RNA-ACR-150-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
RNA-ACR-150-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
RNA-ACR-173-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
RNA-ACR-173-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
RNA-ACR-178-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
RNA-ACR-178-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
1) Perform gene counts with Trinity
Information and instructions for this section of the workflow using Trinity, found here.
- The abundance_and_estimation.pl script in Trinity is a part of the Trinity RNA-Seq assembly and analysis package. It is used for abundance estimation and provides a streamlined way to process RNA-Seq data to estimate transcript and gene expression levels.
a) Bowtie2 was used to make index for trinity.fasta in script below
#Hollie suggested first using bowtie and trinity to build an index and map and conduct gene identification
nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/bowtiebuild.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --mem=100GB
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trinity
module load Bowtie2/2.4.5-GCC-11.3.0
bowtie2-build -f /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trinity_out_dir/trinity_out_dir.Trinity.fasta Apul_denovo
sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/bowtiebuild.sh
Submitted batch job 292412
#have trinity_out_dir.Trinity.fasta.gene_trans_map and bowtie build files in this trinity_out_dir
b) Use Trinity for transcript quantification for estimating transcript abundance in a genome-free manner on deep dive samples
#Aligning Reads:
#seqType
#left
Estimating Abundance:
#est_method
#aln_method
#prep_reference: Prepare the reference for downstream analysis.
#trinity_mode: Setting –trinity_mode will automatically generate the gene_trans_map and use it.
Information for output files found here
# Create a list of sample IDs for array below
RNA-ACR-140-S1-TP2
RNA-ACR-145-S1-TP2
RNA-ACR-150-S1-TP2
RNA-ACR-173-S1-TP2
RNA-ACR-178-S1-TP2
nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/scripts/rsem_deep_dive.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --cpus-per-task=24
#SBATCH --mem=32GB
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH --output=slurm-%A-%a.out
#SBATCH --array=1-5%5
module load Bowtie2/2.4.5-GCC-11.3.0
module load Trinity/2.15.1-foss-2022a
module load SAMtools/1.16.1-GCC-11.3.0
# get job array ID to set for each individual sample to run through
AR=$SLURM_ARRAY_TASK_ID
# set path for transcript file for all samples
TRANSCRIPT="/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trinity_out_dir/Apul_denovo.fasta"
SAMPLE=$(head -n $AR /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/list_samples.txt | tail -n 1)
perl $EBROOTTRINITY/trinityrnaseq-v2.15.1/util/align_and_estimate_abundance.pl \
--transcripts $TRANSCRIPT \
--seqType fq \
--SS_lib_type RF \
--left /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/$(echo $SAMPLE)_R1_001.fastp-trim.20230519.fastq.gz \
--right /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/$(echo $SAMPLE)_R2_001.fastp-trim.20230519.fastq.gz \
--est_method RSEM \
--aln_method bowtie2 \
--trinity_mode \
--output_dir /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/align_estimate_abundance/$(echo $SAMPLE)_gene_count \
--thread_count $SLURM_CPUS_ON_NODE
sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/scripts/rsem_deep_dive.sh
Submitted batch job 293782 on 20240129 at 1:50pm
The RSEM.genes.results file is an output file generated by the Trinity software’s abundance_and_estimation.pl script when using RSEM (RNA-Seq by Expectation-Maximization) for gene-level abundance estimation. This file contains information about the estimated expression levels for genes in your RNA-Seq dataset. Let’s break down the key components of this output file:
- Gene ID (gene_id):
- Each row in the file corresponds to a unique gene identified in the RNA-Seq data. The gene_id column contains the identifier for each gene.
- Transcript ID (transcript_id):
- This column provides the identifier for the transcript associated with each gene. In Trinity, genes are often represented by multiple transcripts, and this column helps link transcripts to their respective genes.
- Length (length):
- The length column represents the length of the gene or transcript, usually measured in base pairs.
- Effective Length (effective_length):
- This column contains the effective length of the gene or transcript. Effective length is a concept used in RSEM to account for biases introduced during the sequencing process.
- Expected Counts (expected_count):
- The expected_count column provides the estimated count of reads expected to be assigned to each gene. This count is based on the RSEM algorithm and represents the expected number of reads originating from each gene.
- TPM (Transcripts Per Million) (TPM):
- TPM is a normalization method that scales the expression values to represent the relative abundance of transcripts. The TPM column provides the gene expression values in TPM.
- FPKM (Fragments Per Kilobase of transcript per Million mapped reads) (FPKM):
- FPKM is another normalization method commonly used in RNA-Seq analysis. The FPKM column provides gene expression values in FPKM.
- IsoPct (IsoPct):
- This column represents the percentage of isoforms (transcripts) contributing to the total expression of a gene.
The RSEM.genes.results file is useful for downstream analyses, such as differential expression analysis, as it provides gene-level expression estimates in a standardized format.
d) Build Transcript and Gene Expression Matrices
Using the transcript and gene-level abundance estimates for each of your samples, construct a matrix of counts and a matrix of normalized expression values using the following script:
nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/scripts/build_gene_matrix.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --mem=32GB --cpus-per-task=24
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/align_estimate_abundance/
module load Trinity/2.15.1-foss-2022a
module load RSEM/1.3.3-foss-2022a
perl $EBROOTTRINITY/trinityrnaseq-v2.15.1/util/abundance_estimates_to_matrix.pl --est_method RSEM \
--gene_trans_map /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trinity_out_dir/Apul_denovo.fasta.gene_trans_map \
--name_sample_by_basedir \
RNA-ACR-140-S1-TP2_gene_count/RSEM.genes.results RNA-ACR-145-S1-TP2_gene_count/RSEM.genes.results RNA-ACR-150-S1-TP2_gene_count/RSEM.genes.results RNA-ACR-173-S1-TP2_gene_count/RSEM.genes.results RNA-ACR-178-S1-TP2_gene_count/RSEM.genes.results
sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/scripts/build_gene_matrix.sh
Submitted batch job 293835
d) Secure-copy gene counts onto local computer, make sure to open a separate command shell outside of Andromeda on your own terminal
#copy gene count matrix
scp danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trinity/mapped/RSEM.genes.results.csv /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/trinity_output
#copy transcript count matrix
scp danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trinity/mapped/RSEM.isoforms.results /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/trinity_output