Acropora pulchra PolyA and rRNA depletion comparison

Comparison of deep dive Acropora pulchra E5 samples processed with PolyA to rRNA depletion heatwave gametogenesis samples

Goal:

Use samples from five sequence samples collected in Moorea, French Polynesia part of the E5 Rules of Life project to map to A. pulchra de novo transcriptome to compare PolyA RNAseq and rRNA depletion RNAseq sequencing methods gene counts and identify if the duplication and overrepresented sequences we see in our rRNA depletion samples is an issue for downstream analysis.

Acropora pulchra E5 Rules of Life project transcriptome data files on University of Washington OWL data storage HPC database

Location on OWL, the HPC server for UW:

  https://owl.fish.washington.edu/nightingales/A_pulchra/30-789513166/

Downloaded all files to Andromeda URI HPC location

  cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw/

  wget -r -nd -A .fastq.gz https://owl.fish.washington.edu/nightingales/A_pulchra/30-789513166/
  wget -r -nd -A .md5 https://owl.fish.washington.edu/nightingales/A_pulchra/30-789513166/

  ACR-140_TP2_R1_001.fastq.gz
  ACR-140_TP2_R2_001.fastq.gz
  ACR-145_TP2_R1_001.fastq.gz
  ACR-145_TP2_R2_001.fastq.gz
  ACR-150_TP2_R1_001.fastq.gz
  ACR-150_TP2_R2_001.fastq.gz
  ACR-173_TP2_R1_001.fastq.gz
  ACR-173_TP2_R2_001.fastq.gz
  ACR-178_TP2_R1_001.fastq.gz
  ACR-178_TP2_R2_001.fastq.gz
  ACR-140_TP2_R1_001.fastq.gz.md5
  ACR-140_TP2_R2_001.fastq.gz.md5
  ACR-145_TP2_R1_001.fastq.gz.md5
  ACR-145_TP2_R2_001.fastq.gz.md5
  ACR-150_TP2_R1_001.fastq.gz.md5
  ACR-150_TP2_R2_001.fastq.gz.md5
  ACR-173_TP2_R1_001.fastq.gz.md5
  ACR-173_TP2_R2_001.fastq.gz.md5
  ACR-178_TP2_R1_001.fastq.gz.md5
  ACR-178_TP2_R2_001.fastq.gz.md5

Samples QC and Qubit Results

Tube Label	RNA_ng_µl	RNA_µl	RNA µg	Link to notebook post1
ACR-140_TP2	12	90	1.08	https://kterpis.github.io/Putnam_Lab_Notebook/20211012-RNA-DNA-extractions-from-E5-project/
ACR-145_TP2	20.8	87	1.8096	https://kterpis.github.io/Putnam_Lab_Notebook/20211012-RNA-DNA-extractions-from-E5-project/
ACR-150_TP2	13	87	1.131	https://github.com/Kterpis/Putnam_Lab_Notebook/blob/master/_posts/2021-09-03-20210903-RNA-DNA-extractions-from-E5-project.md
ACR-173_TP2	11.4	87	0.9918	https://kterpis.github.io/Putnam_Lab_Notebook/20211102-RNA-DNA-extractions-from-E5-project/
ACR-178_TP2	12.2	87	1.0614	https://github.com/Kterpis/Putnam_Lab_Notebook/blob/master/_posts/2021-09-02-20210902-RNA-DNA-extractions-from-E5-project.md
ACRP-CON	43.2	24	1.0368	https://github.com/daniellembecker/DanielleBecker_Lab_Notebook/blob/master/_posts/2023-04-25-Acropora-pulchra-transcriptome-extraction-concentration.md

Workflow Steps

Downloaded all trimmed E5 deep dive files to Andromeda URI HPC location

Reads for samples ACR-140, ACR-145, ACR-150, ACR-173, and ACR-178 were trimmed using the built-in version of Trimmomatic with the default settings, following the 9FastQ QC and Trimming - E5 Coral RNA-seq Data for A.pulchra protocol.

cd /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive

wget -r -nd -A .fastq.gz https://gannet.fish.washington.edu/Atumefaciens/20230519-E5_coral-fastqc-fastp-multiqc-RNAseq/A_pulchra/trimmed/

RNA-ACR-140-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
RNA-ACR-140-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
RNA-ACR-145-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
RNA-ACR-145-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
RNA-ACR-150-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
RNA-ACR-150-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
RNA-ACR-173-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
RNA-ACR-173-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
RNA-ACR-178-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
RNA-ACR-178-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz

1) Perform gene counts with Trinity

Information and instructions for this section of the workflow using Trinity, found here.

The abundance_and_estimation.pl script in Trinity is a part of the Trinity RNA-Seq assembly and analysis package. It is used for abundance estimation and provides a streamlined way to process RNA-Seq data to estimate transcript and gene expression levels.

a) Bowtie2 was used to make index for trinity.fasta in script below

#Hollie suggested first using bowtie and trinity to build an index and map and conduct gene identification

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/bowtiebuild.sh


#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --mem=100GB
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trinity

module load Bowtie2/2.4.5-GCC-11.3.0

bowtie2-build -f /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trinity_out_dir/trinity_out_dir.Trinity.fasta Apul_denovo

sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/bowtiebuild.sh

Submitted batch job 292412

#have trinity_out_dir.Trinity.fasta.gene_trans_map and bowtie build files in this trinity_out_dir

b) Use Trinity for transcript quantification for estimating transcript abundance in a genome-free manner on deep dive samples

#Aligning Reads:

#seqType : Specify the type of sequencing data (e.g., fa for fasta, fq for fastq).

#left and -right : Input left and right reads for paired-end data.

Estimating Abundance:

#est_method : Specify the method for abundance estimation (e.g., RSEM, eXpress, etc.).

#aln_method : Specify the alignment method (e.g., bowtie, bowtie2).

#prep_reference: Prepare the reference for downstream analysis.

#trinity_mode: Setting –trinity_mode will automatically generate the gene_trans_map and use it.

Information for output files found here

# Create a list of sample IDs for array below

RNA-ACR-140-S1-TP2
RNA-ACR-145-S1-TP2
RNA-ACR-150-S1-TP2
RNA-ACR-173-S1-TP2
RNA-ACR-178-S1-TP2

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/scripts/rsem_deep_dive.sh

#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --cpus-per-task=24
#SBATCH --mem=32GB
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH --output=slurm-%A-%a.out
#SBATCH --array=1-5%5

module load Bowtie2/2.4.5-GCC-11.3.0
module load Trinity/2.15.1-foss-2022a
module load SAMtools/1.16.1-GCC-11.3.0

# get job array ID to set for each individual sample to run through
AR=$SLURM_ARRAY_TASK_ID

# set path for transcript file for all samples
TRANSCRIPT="/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trinity_out_dir/Apul_denovo.fasta"
SAMPLE=$(head -n $AR /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/list_samples.txt | tail -n 1)

perl $EBROOTTRINITY/trinityrnaseq-v2.15.1/util/align_and_estimate_abundance.pl \
--transcripts $TRANSCRIPT \
--seqType fq \
--SS_lib_type RF \
--left /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/$(echo $SAMPLE)_R1_001.fastp-trim.20230519.fastq.gz \
--right /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/$(echo $SAMPLE)_R2_001.fastp-trim.20230519.fastq.gz \
--est_method RSEM \
--aln_method bowtie2 \
--trinity_mode \
--output_dir /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/align_estimate_abundance/$(echo $SAMPLE)_gene_count \
--thread_count $SLURM_CPUS_ON_NODE

sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/scripts/rsem_deep_dive.sh

Submitted batch job 293782 on 20240129 at 1:50pm

The RSEM.genes.results file is an output file generated by the Trinity software’s abundance_and_estimation.pl script when using RSEM (RNA-Seq by Expectation-Maximization) for gene-level abundance estimation. This file contains information about the estimated expression levels for genes in your RNA-Seq dataset. Let’s break down the key components of this output file:

Gene ID (gene_id):
- Each row in the file corresponds to a unique gene identified in the RNA-Seq data. The gene_id column contains the identifier for each gene.
Transcript ID (transcript_id):
- This column provides the identifier for the transcript associated with each gene. In Trinity, genes are often represented by multiple transcripts, and this column helps link transcripts to their respective genes.
Length (length):
- The length column represents the length of the gene or transcript, usually measured in base pairs.
Effective Length (effective_length):
- This column contains the effective length of the gene or transcript. Effective length is a concept used in RSEM to account for biases introduced during the sequencing process.
Expected Counts (expected_count):
- The expected_count column provides the estimated count of reads expected to be assigned to each gene. This count is based on the RSEM algorithm and represents the expected number of reads originating from each gene.
TPM (Transcripts Per Million) (TPM):
- TPM is a normalization method that scales the expression values to represent the relative abundance of transcripts. The TPM column provides the gene expression values in TPM.
FPKM (Fragments Per Kilobase of transcript per Million mapped reads) (FPKM):
- FPKM is another normalization method commonly used in RNA-Seq analysis. The FPKM column provides gene expression values in FPKM.
IsoPct (IsoPct):
- This column represents the percentage of isoforms (transcripts) contributing to the total expression of a gene.

The RSEM.genes.results file is useful for downstream analyses, such as differential expression analysis, as it provides gene-level expression estimates in a standardized format.

d) Build Transcript and Gene Expression Matrices

Using the transcript and gene-level abundance estimates for each of your samples, construct a matrix of counts and a matrix of normalized expression values using the following script:

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/scripts/build_gene_matrix.sh

#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --mem=32GB --cpus-per-task=24
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/align_estimate_abundance/

module load Trinity/2.15.1-foss-2022a
module load RSEM/1.3.3-foss-2022a

perl $EBROOTTRINITY/trinityrnaseq-v2.15.1/util/abundance_estimates_to_matrix.pl --est_method RSEM \
--gene_trans_map /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trinity_out_dir/Apul_denovo.fasta.gene_trans_map \
--name_sample_by_basedir \
RNA-ACR-140-S1-TP2_gene_count/RSEM.genes.results RNA-ACR-145-S1-TP2_gene_count/RSEM.genes.results RNA-ACR-150-S1-TP2_gene_count/RSEM.genes.results RNA-ACR-173-S1-TP2_gene_count/RSEM.genes.results RNA-ACR-178-S1-TP2_gene_count/RSEM.genes.results

sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/deep_dive/scripts/build_gene_matrix.sh

Submitted batch job 293835

d) Secure-copy gene counts onto local computer, make sure to open a separate command shell outside of Andromeda on your own terminal

#copy gene count matrix

scp danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trinity/mapped/RSEM.genes.results.csv /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/trinity_output

#copy transcript count matrix
scp danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trinity/mapped/RSEM.isoforms.results /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/trinity_output

Written on January 22, 2024