Acropora pulchra RNA-seq Bioinformatic Workflow

Mapping sequences from Acropora pulchra coral samples from adult colonies exposed to marine heatwave conditions in Moorea, French Polynesia in March/April 2022 and their offspring (larvae) collected from coral spawning in October 2022 to a Acropora pulchra de novo transcriptome.

Goal

The following document contains the bioinformatic pipeline used for cleaning, aligning and assembling our raw RNA sequences. The goal is to map the adult and larval sequences to a compiled Acropora pulchra de novo transcriptome. All metadata and information for these projects can be found in this repository and in these notebook posts of the de novo transcriptome assembly and the extraction protcol for the samples used in the assembly. These commands were compiled into bash scripts to run on the URI HPC Andromeda server.

Metadata for submitted sequences

Sequence samples metadata

1) Obtain de novo transcriptome

I am using the Acropora pulchra de novo transcriptome assembled in November 2023, following methods outlined for transcript sequence reconstruction from RNA-Seq: reference generation and analysis with Trinity.

Location on Andromeda, the HPC server for URI:

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/CD-HIT/stranded_assembly_cdhit.fasta

Make folder structure for sequence mapping analysis

cd /data/putnamlab/dbecks/

mkdir Heatwave_A.pul_2022Project
cd Heatwave_A.pul_2022Project

mkdir data
mkdir scripts

cd data
mkdir raw

Copy already downloaded raw sequence files from putnam lab main storage in Andromeda

cp -r /data/putnamlab/KITT/hputnam/20231214_Apulchra_RiboDepletion/* /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw

2) Check file integrity

a) Count all files to make sure all downloaded

ls -1 | wc -l

b) Verify data transfer integrity with md5sum

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/check_transfer.sh
#!/bin/bash ###creating slurm script
#SBATCH -t 24:00:00 ###give script 24 hours to run
#SBATCH --nodes=1 --ntasks-per-node=1 ###on server, different nodes you can use for processing power, node just do one task
#SBATCH --export=NONE
#SBATCH --mem=100GB ###in server allocate 100GB amount of memory
#SBATCH --account=putnamlab ###primary account
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw ###path


md5sum /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw*.gz > URIcheckmd5.md5

md5sum /data/putnamlab/KITT/hputnam/20231214_Apulchra_RiboDepletion/*.gz > URIcheckmd5.md5

sbatch check_transfer.sh
Submitted batch job 253687

Checksum from Genewiz

b4a85239e88aba5d1762d7e90a9d70d8  ./12_R1_001.fastq.gz
eab831185c3b8ca72cf3914f57fd7330  ./12_R2_001.fastq.gz
7f620547442556c977ed6e64322ad3b4  ./16_R1_001.fastq.gz
96343dd00c1ea31d0f35af312db01405  ./16_R2_001.fastq.gz
1c05cc100907fa24681921aa520721ff  ./20_R1_001.fastq.gz
5ba8e2613376b60d42d4a79de32eeafe  ./20_R2_001.fastq.gz
4b29ffe9c1a711864d2615ba2dafbe06  ./248_R1_001.fastq.gz
332babdf0cce128c46dddf2242923d8d  ./248_R2_001.fastq.gz
32d3be14d0c8eaff9f986a8198dbf101  ./249_R1_001.fastq.gz
955885f77f56a52cfec0f5dbe1bb5aa6  ./249_R2_001.fastq.gz
e970c9fef559d883400616144469740f  ./24_R1_001.fastq.gz
1ce920d6100cb0e97efdd7acb1206afe  ./24_R2_001.fastq.gz
a9dd31f2aca0656e855990b253a216d8  ./250_R1_001.fastq.gz
b6821db41d50b1d0c01f39e1ef9fd40e  ./250_R2_001.fastq.gz
25c2235a292e2a20bf6334d74c164855  ./260_R1_001.fastq.gz
f15e2c1bb8959aa686d4848db6535c53  ./260_R2_001.fastq.gz
d424bde716d0a388d4b82910490b16f9  ./261_R1_001.fastq.gz
723bcabf15e03805e3d45819d2c8a9cf  ./261_R2_001.fastq.gz
ea8944531c703a3ae2eb92b6add58944  ./262_R1_001.fastq.gz
ae433007a1a10464af2b41bef690d7ed  ./262_R2_001.fastq.gz
de405dc4452fb440b2bd4edbc9c30a87  ./5_R1_001.fastq.gz
1f53cd6cbb43668a3bffd094b73c9ad6  ./5_R2_001.fastq.gz
5d73e8c968f198e279b68668ef3fb50f  ./8_R1_001.fastq.gz
d2de35adcc456517be1e09bb097aeadb  ./8_R2_001.fastq.gz

c) Verify data integrity with md5sum

Cross-reference the checksum document from GENEWIZ with the data we have on our computer

With a small amount of files, able to first cross-check that the sequences matched between both files on the desktop

Use the code below in terminal to cross-check the files and compare for sanity check

in directory: /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw

md5sum -c Genewiz_Provided.md5

Should output ‘OK’ next to each file name

d) Count number of reads per file using the code after @ in fastq.gz files (e.g.,@GWNJ).

zgrep -c "@A01940" *.gz > raw_seq_counts

12_R1_001.fastq.gz:22077183
12_R2_001.fastq.gz:22077183
16_R1_001.fastq.gz:21806397
16_R2_001.fastq.gz:21806397
20_R1_001.fastq.gz:22366560
20_R2_001.fastq.gz:22366560
248_R1_001.fastq.gz:24234533
248_R2_001.fastq.gz:24234533
249_R1_001.fastq.gz:24655232
249_R2_001.fastq.gz:24655232
24_R1_001.fastq.gz:23762178
24_R2_001.fastq.gz:23762178
250_R1_001.fastq.gz:21345210
250_R2_001.fastq.gz:21345210
260_R1_001.fastq.gz:26117186
260_R2_001.fastq.gz:26117186
261_R1_001.fastq.gz:24168048
261_R2_001.fastq.gz:24168048
262_R1_001.fastq.gz:25536072
262_R2_001.fastq.gz:25536072
5_R1_001.fastq.gz:23140700
5_R2_001.fastq.gz:23140700
8_R1_001.fastq.gz:23792081
8_R2_001.fastq.gz:23792081

3) Run FastQC on sequences

a) Make folders for raw FastQC results and scripts

b) Write script for checking quality with FastQC and submit as job on Andromeda

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/fastqc_raw.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file

module load FastQC/0.11.9-Java-11

for file in /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw/*.gz
do
fastqc $file --outdir /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw/qc
done
sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/fastqc_raw.sh

Submitted batch job 291946

c) Make sure all files were processed

ls -1 | wc -l

#48

Combined QC output into 1 file with MultiQC

module load MultiQC/1.9-intel-2020a-Python-3.8.2

multiqc /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw/qc

c) Copy MultiQC files to local computer

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw/qc/multiqc_report.html /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/original_fastqc

4) Trim and clean reads

a) Make trimmed reads folder in all other results folders


mkdir data/trimmed
cd trimmed

c) Write script for Trimming and run on Andromeda

#Run fastp on files #Trims 5bp from 5’ end of all reads #Trims poly G, if present

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/trim.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file

module load fastp/0.23.2-GCC-11.2.0

array1=($(ls *R1*.fastq.gz)) #Make an array of sequences to trim
for i in ${array1[@]}; do
fastp --in1 ${i} --in2 $(echo ${i}|sed s/_R1/_R2/) --detect_adapter_for_pe --trim_poly_g --trim_front1 15 --trim_front2 15 --qualified_quality_phred 20 --unqualified_percent_limit 5 --out1 ../trimmed/${i} --out2 ../trimmed/$(echo ${i}|sed s/_R1/_R2/)  
done

sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/trim.sh

Submitted batch job 292306

#first two runs with different trimming paramaters resulted in errors for per sequence qc content, so using --qualified_quality_phred 15 --unqualified_percent_limit 5 to --qualified_quality_phred sets the threshold for considering a base as high-quality (default is 15, you can adjust as needed).
--unqualified_percent_limit filters out reads with too many low-quality bases.

5) Check quality of trimmed files

a) Check number of files in /trimmed directory

ls -1 | wc -l
#24

b) Check number of reads in /trimmed directory

zgrep -c "@A01940" *.gz > trimmed_seq_counts

12_R1_001.fastq.gz:19797178
12_R2_001.fastq.gz:19797178
16_R1_001.fastq.gz:19557856
16_R2_001.fastq.gz:19557856
20_R1_001.fastq.gz:19986198
20_R2_001.fastq.gz:19986198
248_R1_001.fastq.gz:21556118
248_R2_001.fastq.gz:21556118
249_R1_001.fastq.gz:22043273
249_R2_001.fastq.gz:22043273
24_R1_001.fastq.gz:21277072
24_R2_001.fastq.gz:21277072
250_R1_001.fastq.gz:19038866
250_R2_001.fastq.gz:19038866
260_R1_001.fastq.gz:23396050
260_R2_001.fastq.gz:23396050
261_R1_001.fastq.gz:21707283
261_R2_001.fastq.gz:21707283
262_R1_001.fastq.gz:22781251
262_R2_001.fastq.gz:22781251
5_R1_001.fastq.gz:20876524
5_R2_001.fastq.gz:20876524
8_R1_001.fastq.gz:21284631
8_R2_001.fastq.gz:21284631


c) Run FastQC on trimmed data

mkdir trimmed_qc

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/fastqc_trimmed.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/trimmed_qc
#SBATCH --error="script_error"
#SBATCH --output="output_script"

module load FastQC/0.11.9-Java-11

for file in /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/*.gz
do
fastqc $file --outdir /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/trimmed_qc
done
sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/fastqc_trimmed.sh

Submitted batch job 292337

d) Run MultiQC on trimmed data

module load MultiQC/1.9-intel-2020a-Python-3.8.2
multiqc /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/trimmed_qc
scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/multiqc_report.html /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/trimmed_fastqc

6) Align reads

a) Generate transcriptome index

HiSat2 Align reads to transcriptome

HiSat2 HiSat2 Github

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/Hisat2_transcriptome_build.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/refs
#SBATCH --error="script_error"
#SBATCH --output="output_script"

module load HISAT2/2.2.1-gompi-2022a

hisat2-build -f /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/CD-HIT/stranded_assembly_cdhit.fasta Apul_trans_ref

sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/Hisat2_transcriptome_build.sh

Submitted batch job 334001

b) Align reads to transcriptome

mkdir mapped
nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/Hisat2_align2_trans.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed
#SBATCH -p putnamlab
#SBATCH --cpus-per-task=3
#SBATCH --error="hisat2_script_error_trans"
#SBATCH --output="hisat2_output_trans"


module load HISAT2/2.2.1-gompi-2022a

#Aligning paired end reads
#Has the R1 in array1 because the sed in the for loop changes it to an R2. SAM files are of both forward and reverse reads
#-p increases threads on server, --rna-strandness RF forward and reverse, -dta downstream transcriptome analysis, -q reads are fastq files, -x location of ref -S where to output sam files

for read_pair in *_R1_001.fastq.gz; do
    base_name=$(basename "$read_pair" _R1_001.fastq.gz)
    hisat2 -p 48 --rna-strandness RF --dta -q -x /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/refs/Apul_trans_ref -1 "$read_pair" -2 "${base_name}_R2_001.fastq.gz" -S "/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_mapped_trans/${base_name}.sam"
done


sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/Hisat2_align2_trans.sh

Submitted batch job 334029

#Download alignment statistics information from mapping, will be the output from your script above, even though it is a mapped output, you are in the trimmed folder here

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/script_error /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/alignment_statistics

d) Sort and convert sam to bam and check number of mapped reads and mapping percentages

  • Explanation:
    • samtools sort -o sorted.bam aligned.sam: This command sorts the SAM file and creates a sorted BAM file (Trinity_aligned.sorted.bam).
    • samtools index .sorted.bam: This command creates an index for the sorted BAM file. The index file (sorted.bam.bai) is necessary for certain operations and viewers.
    • samtools flagstat .sorted.bam: This command generates statistics about the alignment, including the number of mapped reads and mapping percentages.
nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/SAMtoBAM.sh

#There will be lots of .tmp file versions in your folder, this is normal while this script runs and they should delete at the end to make one sorted.bam file
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=8
#SBATCH --export=NONE
#SBATCH --mem=120GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_mapped_trans
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error"
#SBATCH --output="output_script"

module load SAMtools/1.16.1-GCC-11.3.0 #Preparation of alignment for assembly: SAMtools

# Directory containing SAM files
sam_directory="/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_mapped_trans/"

# Create a directory for the results
mkdir -p "${sam_directory}/results"

# Process each SAM file in the directory
for sam_file in "$sam_directory"/*.sam; do
    if [ -f "$sam_file" ]; then
        # Extract sample name from the file
        sample_name=$(basename "${sam_file%.sam}")

        # Step 1: Sort the SAM file and convert to BAM
        samtools sort -o "${sam_directory}/results/${sample_name}_sorted.bam" "$sam_file"

        # Step 2: Index the BAM file
        samtools index "${sam_directory}/results/${sample_name}_sorted.bam"

        # Step 3: Check Mapping Statistics
        samtools flagstat "${sam_directory}/results/${sample_name}_sorted.bam" > "${sam_directory}/results/${sample_name}_flagstat.txt"

        echo "Processed: ${sample_name}"
        echo "------------------------"
    fi
done

sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/SAMtoBAM.sh

Submitted batch job 334083

Alignment statistics

interactive

module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools

for i in *.bam; do
    echo "${i}" >> mapped_reads_counts_Apul
    samtools flagstat ${i} | grep "mapped (" >> mapped_reads_counts_Apul
done

39551642 + 0 mapped (80.25% : N/A)
16_sorted.bam
38941088 + 0 mapped (80.49% : N/A)
20_sorted.bam
40660121 + 0 mapped (81.81% : N/A)
248_sorted.bam
42416575 + 0 mapped (79.82% : N/A)
249_sorted.bam
46000537 + 0 mapped (83.61% : N/A)
24_sorted.bam
43401506 + 0 mapped (82.22% : N/A)
250_sorted.bam
39573073 + 0 mapped (83.52% : N/A)
260_sorted.bam
49083838 + 0 mapped (84.34% : N/A)
261_sorted.bam
44856313 + 0 mapped (83.20% : N/A)
262_sorted.bam
46812244 + 0 mapped (83.11% : N/A)
5_sorted.bam
46899980 + 0 mapped (85.23% : N/A)
8_sorted.bam
42785716 + 0 mapped (81.67% : N/A)

Pretty high mapping percentages (>79%) for all samples!

e) Download mapping percentages to desktop

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_mapped_trans/results/mapped_reads_counts_Apul /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/data/mapping_percentages

8) Perform gene counts with stringTie

a) Create structural annotation files, like .gff3 files for predicted gene structures and .cds predicted coding sequences for our de novo transcriptome

#To include putative gene information in your Trinity analysis, you can use TransDecoder.LongOrfs and TransDecoder.Predict #First step identifies likely coding regions (long open reading frames or ORFs) in your Trinity transcripts and creates a file named Trinity.fasta.transdecoder_dir/longest_orfs.pep, which contains the predicted protein sequences.

#Second step predicts likely coding regions and identifies potential coding regions using the output from the LongOrfs step and generates several output files in the Trinity.fasta.transdecoder_dir/ directory, including Trinity.fasta.transdecoder.cds (predicted coding sequences) and Trinity.fasta.transdecoder.gff3 (predicted gene structures in GFF3 format).


nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/transdecode_predict.sh

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --mem=100GB
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trinity/

module load TransDecoder/5.5.0-foss-2020b

TransDecoder.LongOrfs -t /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/CD-HIT/stranded_assembly_cdhit.fasta

TransDecoder.Predict -t /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/CD-HIT/stranded_assembly_cdhit.fasta

sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/transdecode_predict.sh

Submitted batch job 335108

b) Make directory for stringTie

mkdir stringTie
cd stringTie

mkdir BAM GTF GTF_merge

c) Copy BAM files to stringTie folder

cd /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_mapped_trans/results/
mv *_sorted.bam /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM

d) Assemble and estimate reads, following vignette for transdecoder and transcriptome specific setttings.

cd BAM

nano stringTie_assemble.sh


#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH --error="stringtie_out_error"
#SBATCH --output="stringtie_out"

module load StringTie/2.2.1-GCC-11.2.0
module load GffCompare/0.12.6-GCC-11.2.0

# Define the directory containing BAM files
F=/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM

# List all BAM files in the directory
array1=($(ls $F/*bam))

# Process each BAM file
for i in ${array1[@]}; do
    # Run StringTie with options -e -B -G for each BAM file
    stringtie -e -B -G /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trinity/stranded_assembly_cdhit.fasta.transdecoder.gff3 -o ${i}.gtf ${i}

    # Print the current BAM file name for tracking
    echo "${i}"
done

sbatch stringTie_assemble.sh

Submitted batch job 335278

e) Merge stringTie gtf results

mv *gtf ../GTF/

ls *gtf > apul_mergelist.txt
cat apul_mergelist.txt

module load StringTie/2.2.1-GCC-11.2.0

stringtie --merge -p 8 -G /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trinity/stranded_assembly_cdhit.fasta.transdecoder.gff3 -o stringtie_apul_merged.gtf apul_mergelist.txt

f) Assess assembly quality

module load GffCompare/0.12.6-GCC-11.2.0

gffcompare -r /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trinity/stranded_assembly_cdhit.fasta.transdecoder.gff3 -o Apul.merged stringtie_apul_merged.gtf

222476 reference transcripts loaded.
9006 duplicate reference transcripts discarded.
220533 query transfrags loaded.

#= Summary for dataset: stringtie_apul_merged.gtf
#     Query mRNAs :  220533 in  213470 loci  (0 multi-exon transcripts)
#            (0 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs :  213470 in  213470 loci  (0 multi-exon)
# Super-loci w/ reference transcripts:   213470
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |   100.0    |
        Exon level:   100.0     |   100.0    |
  Transcript level:   100.0     |    96.8    |
       Locus level:   100.0     |   100.0    |

     Matching intron chains:       0
       Matching transcripts:  213470
              Matching loci:  213470

          Missed exons:       0/213470  (  0.0%)
           Novel exons:       0/213470  (  0.0%)
           Missed loci:       0/213470  (  0.0%)
            Novel loci:       0/213470  (  0.0%)

 Total union super-loci across all input datasets: 213470
220533 out of 220533 consensus transcripts written in Apul.merged.annotated.gtf (0 discarded as redundant)

g) Re-estimate assembly

nano stringTie_apul_re-assemble.sh

#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH --error="re-assemble_apul_out_error"
#SBATCH --output="re-assemble_apul_out"

module load StringTie/2.2.1-GCC-11.2.0
module load GffCompare/0.12.6-GCC-11.2.0

# Define the directory containing BAM files
F=/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM

array1=($(ls $F/*bam))
for i in ${array1[@]}
do
stringtie -e -G /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trinity/stranded_assembly_cdhit.fasta.transdecoder.gff3 -o ${i}.merge.gtf ${i}
echo "${i}"
done

sbatch stringTie_apul_re-assemble.sh

Submitted batch job 335292

f) Correct gene_id naming in transcript .gtf files, since coming from a transcriptome, they have gene_id and transcript specific information, stringTie needs a unique gene_id identifer to bind together for per gene counts in next step

for file in *.gtf; do
    sed -E 's/gene_id "([^~]*)~~[^"]*"/gene_id "\1"/' "$file" > "${file%.gtf}_modified.gtf"
done

Make sure all files match after changes

for file in *_sorted.bam.merge.gtf; do
    mod_file="${file%.gtf}_modified.gtf"
    orig_count=$(awk '$2=="StringTie" {print $1}' "$file" | sort | uniq | wc -l)
    mod_count=$(awk '$2=="StringTie" {print $1}' "$mod_file" | sort | uniq | wc -l)
    echo "$file: $orig_count"
    echo "$mod_file: $mod_count"
    echo "---"
done

Yes they do!

12_sorted.bam.merge.gtf: 66975
12_sorted.bam.merge_modified.gtf: 66975
---
16_sorted.bam.merge.gtf: 66895
16_sorted.bam.merge_modified.gtf: 66895
---
20_sorted.bam.merge.gtf: 62152
20_sorted.bam.merge_modified.gtf: 62152
---
248_sorted.bam.merge.gtf: 48963
248_sorted.bam.merge_modified.gtf: 48963
---
249_sorted.bam.merge.gtf: 40468
249_sorted.bam.merge_modified.gtf: 40468
---
24_sorted.bam.merge.gtf: 58688
24_sorted.bam.merge_modified.gtf: 58688
---
250_sorted.bam.merge.gtf: 42326
250_sorted.bam.merge_modified.gtf: 42326
---
260_sorted.bam.merge.gtf: 50488
260_sorted.bam.merge_modified.gtf: 50488
---
261_sorted.bam.merge.gtf: 50974
261_sorted.bam.merge_modified.gtf: 50974
---
262_sorted.bam.merge.gtf: 51079
262_sorted.bam.merge_modified.gtf: 51079
---
5_sorted.bam.merge.gtf: 72359
5_sorted.bam.merge_modified.gtf: 72359
---
8_sorted.bam.merge.gtf: 62257
8_sorted.bam.merge_modified.gtf: 62257

Move files to folder with prepDE.py script for easy use

mv *_modified.gtf ../GTF_merge

h) Create gene matrix

module load StringTie/2.2.1-GCC-11.2.0

F=/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/GTF_merge/

array2=($(ls *_modified.gtf))

for i in ${array2[@]}
do
echo "${i} $F${i}" >> sample_list_apul.txt
done

python prepDE.py -g gene_count_apul_matrix.csv -i sample_list_apul.txt

g) Secure-copy gene counts onto local computer

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/GTF_merge/gene_count_apul_matrix.csv /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/data

9) Run RNAseq SNP calling and on transcripts for genetic differentiation

In our data we had three parental genotypes contributing to the heatwave treatment and five parental genotypes contributing to the ambient treatment. We now want to confirm the individual parental contributions to the larval pools. First, we want to confirm the genetic differentiation with Fst calculations of the adults in each treatment.

Step 1: Use SuperTranscripts in trinity for a context of genome-free de novo transcriptome assembly in that they provide a genome-like reference for studying aspects of the gene including differential transcript usage (aka. differential exon usage) and as a substrate for mapping reads and identifying allelic polymorphisms. A SuperTranscript is constructed by collapsing unique and common sequence regions among splicing isoforms into a single linear sequence.

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/super_tran.sh

#!/bin/bash
#SBATCH --job-name=super_tran            # Job name
#SBATCH --output=super_tran.out           # Output file
#SBATCH --error=super_tran.err            # Error file
#SBATCH --ntasks=1                        # Run on a single task
#SBATCH --cpus-per-task=8                 # Number of CPUs
#SBATCH --time=72:00:00                   # Time limit: 72 hours
#SBATCH --mail-type=BEGIN,END,FAIL        # Mail events (send mail on job start, end, or fail)
#SBATCH --mail-user=danielle_becker@uri.edu  # Where to send mail
#SBATCH --account=putnamlab               # Account to charge for resources
#SBATCH --mem=120G                        # Memory requirement in GB

# Load the Trinity module
module load Trinity/2.15.1-foss-2022a

# Define input Trinity fasta file path
TRINITY_FASTA=/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/CD-HIT/stranded_assembly_cdhit.fasta

# Path to Trinity SuperTranscripts
SUPERTRANS_PATH=/opt/software/Trinity/2.15.1-foss-2022a/trinityrnaseq-v2.15.1/Analysis/SuperTranscripts

# Run Trinity SuperTranscripts script
$SUPERTRANS_PATH/Trinity_gene_splice_modeler.py --trinity_fasta $TRINITY_FASTA

sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/super_tran.sh

Submitted batch job 338361

Step 2: Use Trinity variant calling workflow that follows the GATK pipeline for variant calling using the super transcripts.

First, the HISAT-aligned reads are sorted and marked for duplicates, variant calling is then performed with the GATK HaplotypeCaller tool. The GATK SelectVariants and VariantFiltration tools were used to filter the variants for each individual sample.

This step would not run on Andromeda due to RAM requirements, so data storage was moved to Unity and run there.

nano /project/pi_hputnam_uri_edu/dbecks/Heatwave_A.pul_2022Project/scripts/variant_calling.sh

#!/bin/bash
#SBATCH --job-name=variant_calling
#SBATCH --output=variant_calling.out
#SBATCH --error=variant_calling.err
#SBATCH --partition=cpu-long
#SBATCH --ntasks=1                       
#SBATCH --cpus-per-task=8                 
#SBATCH --time=96:00:00                  
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=pi_hputnam_uri_edu
#SBATCH --constraint=avx512
#SBATCH --mem=500GB

# Load necessary modules
module load uri/main Trinity/2.15.1-foss-2022a
module unload SAMtools/1.16.1-GCC-11.3.0
module load SAMtools/1.18-GCC-12.3.0
module load STAR/2.7.11b-GCC-12.3.0
module load GATK/4.3.0.0-GCCcore-11.3.0-Java-11
module load picard/2.25.1-Java-11

# Set paths for Trinity SuperTranscripts
export TRINITY_HOME=$EBROOTTRINITY
export PICARD_HOME=$EBROOTPICARD
export GATK_HOME=$EBROOTGATK

# Set SuperTranscript input paths
ST_FA=/project/pi_hputnam_uri_edu/dbecks/Heatwave_A.pul_2022Project/data/SNP_calling/trinity_genes.fasta
ST_GTF=/project/pi_hputnam_uri_edu/dbecks/Heatwave_A.pul_2022Project/data/SNP_calling/trinity_genes.gtf

# Output directory for variant calling results
OUTPUT_DIR=/project/pi_hputnam_uri_edu/dbecks/Heatwave_A.pul_2022Project/data/SNP_calling/

# Directory for trimmed FASTQ files
FASTQ_DIR=/project/pi_hputnam_uri_edu/dbecks/Heatwave_A.pul_2022Project/data/trimmed

# Directory for trinity home
TRINITY_HOME=/modules/uri_apps/software/Trinity/2.15.1-foss-2022a/trinityrnaseq-v2.15.1

# Run the variant calling pipeline
$TRINITY_HOME/Analysis/SuperTranscripts/AllelicVariants/run_variant_calling.py \
    --st_fa $ST_FA \
    --st_gtf $ST_GTF \
    -p $FASTQ_DIR/12_R1_001.fastq.gz $FASTQ_DIR/12_R2_001.fastq.gz \
    -p $FASTQ_DIR/16_R1_001.fastq.gz $FASTQ_DIR/16_R2_001.fastq.gz \
    -p $FASTQ_DIR/20_R1_001.fastq.gz $FASTQ_DIR/20_R2_001.fastq.gz \
    -p $FASTQ_DIR/24_R1_001.fastq.gz $FASTQ_DIR/24_R2_001.fastq.gz \
    -p $FASTQ_DIR/248_R1_001.fastq.gz $FASTQ_DIR/248_R2_001.fastq.gz \
    -p $FASTQ_DIR/249_R1_001.fastq.gz $FASTQ_DIR/249_R2_001.fastq.gz \
    -p $FASTQ_DIR/250_R1_001.fastq.gz $FASTQ_DIR/250_R2_001.fastq.gz \
    -p $FASTQ_DIR/260_R1_001.fastq.gz $FASTQ_DIR/260_R2_001.fastq.gz \
    -p $FASTQ_DIR/261_R1_001.fastq.gz $FASTQ_DIR/261_R2_001.fastq.gz \
    -p $FASTQ_DIR/262_R1_001.fastq.gz $FASTQ_DIR/262_R2_001.fastq.gz \
    -p $FASTQ_DIR/5_R1_001.fastq.gz $FASTQ_DIR/5_R2_001.fastq.gz \
    -p $FASTQ_DIR/8_R1_001.fastq.gz $FASTQ_DIR/8_R2_001.fastq.gz \
    -o $OUTPUT_DIR \
    -t 16 \
    -m 500000000000

sbatch /project/pi_hputnam_uri_edu/dbecks/Heatwave_A.pul_2022Project/scripts/variant_calling.sh

Submitted batch job 25043477

This script ran but only processed one sample, trying to run all samples in parallel

nano /project/pi_hputnam_uri_edu/dbecks/Heatwave_A.pul_2022Project/scripts/variant_calling_parallel.sh

#!/bin/bash
#SBATCH --job-name=variant_calling_array
#SBATCH --output=variant_calling_array_%A_%a.out
#SBATCH --error=variant_calling_array_%A_%a.err
#SBATCH --partition=cpu-long
#SBATCH --ntasks=1                       
#SBATCH --cpus-per-task=8                 
#SBATCH --time=20:00:00                  
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=pi_hputnam_uri_edu
#SBATCH --constraint=avx512
#SBATCH --mem=500GB
#SBATCH --array=1-12%12  # Adjust based on the number of samples

# Load necessary modules
module load uri/main Trinity/2.15.1-foss-2022a
module unload SAMtools/1.16.1-GCC-11.3.0
module load SAMtools/1.18-GCC-12.3.0
module load STAR/2.7.11b-GCC-12.3.0
module load GATK/4.3.0.0-GCCcore-11.3.0-Java-11
module load picard/2.25.1-Java-11

# Set paths for Trinity SuperTranscripts
export TRINITY_HOME=$EBROOTTRINITY
export PICARD_HOME=$EBROOTPICARD
export GATK_HOME=$EBROOTGATK

# Set SuperTranscript input paths
ST_FA=/project/pi_hputnam_uri_edu/dbecks/Heatwave_A.pul_2022Project/data/SNP_calling/trinity_genes.fasta
ST_GTF=/project/pi_hputnam_uri_edu/dbecks/Heatwave_A.pul_2022Project/data/SNP_calling/trinity_genes.gtf

# Output directory for variant calling results
OUTPUT_DIR=/project/pi_hputnam_uri_edu/dbecks/Heatwave_A.pul_2022Project/data/SNP_calling/

# Directory for trinity home
TRINITY_HOME=/modules/uri_apps/software/Trinity/2.15.1-foss-2022a/trinityrnaseq-v2.15.1

# Directory for trimmed FASTQ files
FASTQ_DIR=/project/pi_hputnam_uri_edu/dbecks/Heatwave_A.pul_2022Project/data/trimmed

# Get the input sample files based on the array task ID
INPUT_FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" input.txt)
R1=$(echo $INPUT_FILE | cut -d' ' -f1)
R2=$(echo $INPUT_FILE | cut -d' ' -f2)

# Extract a unique identifier from R1 (sample ID)
SAMPLE_ID=$(echo $R1 | cut -d'_' -f1)

# Create a unique output directory for each sample
SAMPLE_OUTPUT_DIR=$OUTPUT_DIR/$SAMPLE_ID

# Ensure the output directory exists
mkdir -p $SAMPLE_OUTPUT_DIR

echo "Processing $R1 and $R2..."

$TRINITY_HOME/Analysis/SuperTranscripts/AllelicVariants/run_variant_calling.py \
      --st_fa $ST_FA \
      --st_gtf $ST_GTF \
      -p $FASTQ_DIR/$R1 $FASTQ_DIR/$R2 \
      -o $SAMPLE_OUTPUT_DIR \
      -t 16 \
      -m 500000000000


sbatch /project/pi_hputnam_uri_edu/dbecks/Heatwave_A.pul_2022Project/scripts/variant_calling_parallel.sh

Submitted batch job 25054610

This script ran! However, this pipeline only output individual .vcf files when we want one .vcf file. Merging and concatenating scripts were unsuccessful as the output from the Trinity pipeline is made for individual .vcf generation. Doing more research for de novo transcriptome variant calling without a reference genome, the bcftools mpileup was more recommended and cited in the literature: Howe et al. 2013 and [Lopez-Maestre et al. 2016] (https://academic.oup.com/nar/article-pdf/44/19/e148/25368087/gkw655.pdf)to get a .vcf output with all combined samples. Also, bcftools mpileup has been suggested for use over GATK HaplotypeCaller for non-human studies as it shown to perform better in terms of recovery rate and accuracy. Running this script below.

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/SNP_calling.sh
#!/bin/bash
#SBATCH --job-name=snp_calling
#SBATCH --output=snp_calling.out
#SBATCH --error=snp_calling.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=96:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH --mem=200G

# Load required modules
module load BCFtools/1.17-GCC-12.2.0
module load SAMtools/1.9-foss-2018b

# Set input and output directories
INPUT_DIR="/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM"
OUTPUT_DIR="/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/SNP_calling/snp_BAM"

# Reference genome
REF_TRANS="/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/CD-HIT/stranded_assembly_cdhit.fasta"

# Index the reference transcriptome
echo "Indexing reference transcriptome..."
samtools faidx $REF_TRANS

# Index BAM files
echo "Indexing BAM files..."
for bam in $INPUT_DIR/*_sorted.bam
do
    samtools index $bam
done

# Call SNPs and include AD (allele depth) in the output
echo "Calling SNPs..."
bcftools mpileup -Ou -f $REF_TRANS --annotate FORMAT/AD,FORMAT/DP --max-depth 1000000 $INPUT_DIR/*_sorted.bam | \
bcftools call -mv -Ov -o $OUTPUT_DIR/variants.vcf

# Filter SNPs with additional criteria
echo "Filtering SNPs..."
bcftools filter -O z -o $OUTPUT_DIR/filtered_variants.vcf.gz -i 'QUAL>20 && FORMAT/DP>10' $OUTPUT_DIR/variants.vcf

# Index the filtered VCF file
echo "Indexing filtered VCF file..."
bcftools index $OUTPUT_DIR/filtered_variants.vcf.gz

# Filter out monomorphic sites
bcftools view -c 2:minor $OUTPUT_DIR/filtered_variants.vcf.gz -o $OUTPUT_DIR/filtered_polymorphic.vcf.gz

# Generate variant statistics
echo "Generating variant statistics..."
bcftools stats $OUTPUT_DIR/filtered_variants.vcf.gz > $OUTPUT_DIR/variant_stats.txt


echo "SNP calling complete. Output files are in $OUTPUT_DIR"

# List the output files
echo "Output files:"
ls -lh $OUTPUT_DIR
sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/SNP_calling.sh

Submitted batch job 340769

Filter .vcf just for the adult samples for downstream analysis

interactive
module load BCFtools/1.17-GCC-12.2.0

nano adult_samples.txt

/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM/5_sorted.bam
/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM/8_sorted.bam
/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM/12_sorted.bam
/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM/16_sorted.bam
/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM/20_sorted.bam
/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM/24_sorted.bam

bcftools view -S adult_samples.txt -Oz -o filtered_adult_samples.vcf.gz /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/SNP_calling/snp_BAM/filtered_polymorphic.vcf.gz

Filter .vcf just for the larval samples for downstream analysis

interactive
module load BCFtools/1.17-GCC-12.2.0

nano larval_samples.txt

/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM/260_sorted.bam
/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM/261_sorted.bam
/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM/262_sorted.bam
/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM/248_sorted.bam
/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM/249_sorted.bam
/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM/250_sorted.bam

bcftools view -S larval_samples.txt -Oz -o filtered_larval_samples.vcf.gz /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/SNP_calling/snp_BAM/filtered_polymorphic.vcf.gz

Download filtered_polymorphic.vcf.gz and to Desktop to load into R for Fst calculation Rscript.

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/SNP_calling/snp_BAM/filtered_polymorphic.vcf.gz /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/data/

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/SNP_calling/snp_BAM/filtered_adult_samples.vcf.gz /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/data/

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/SNP_calling/snp_BAM/filtered_larval_samples.vcf.gz /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/data/
Written on January 1, 2024