Pocillopora effusa RNA-seq Bioinformatic Workflow
Alignment comparison of the Pocillopora effusa genome to the Pocillopora verrucosa and Pocillopora meandrina genomes for E5 molecular RNASeq data
Goal
The following document contains the bioinformatic pipeline used for cleaning, aligning and assembling our raw RNA sequences. The goal is to compare the alignment statistics when using the Mo’ore’a Pocillopora effusa genome compared to the Red Sea Pocillopora verrucosa genome and the Hawaii Pocillopora meandrina genome. These commands were compiled into bash scripts to run on the URI HPC Andromeda server.
Project overview
Bioinformatic tools used in analysis:
- Quality check: FastQC, MultiQC
- Quality trimming: Fastp
- Alignment to reference genome: HISAT2
- Preparation of alignment for assembly: SAMtools
- Transcript assembly and quantification: StringTie
Check for required software on Andromeda
Update software to latest version
- fastqc: FastQC/0.11.9-Java-11
- MultiQC: MultiQC/1.9-intel-2020a-Python-3.8.2
- fastp: fastp/0.19.7-foss-2018b
- HISAT2: HISAT2/2.2.1-foss-2019b
- SAMtools: SAMtools/1.9-foss-2018b
- StringTie: StringTie/2.2.1-GCC-11.2.0
Prepare work space
- Upload raw reads and reference genome to server
- Assess that your files have all uploaded correctly
- Prepare your working directory
- Install all necessary programs
Upload raw reads and reference genome to server
This is done with the scp
or “secure copy” linux command. SCP allows the secure transferring of files between a local host and a remote host or between two remote hosts using ssh authorization.
Secure Copy (scp) Options:
- -r - Recursively copy entire directories
scp -r xxxx <path_to_raw_reads> danielle_becker@bluewaves.uri.edu:<path_to_storage>
scp -r /Users/Danielle/Downloads/Pocillopora_effusa_v3.annot_proteins_fasta.pep.fa ssh danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/REFS/Peff/genome_assembly
1) Obtain Reference Genome and Create Folder Structure
I am using the Mo’ore’a Pocillopora effusa genome to compare alignment statistics.
Location on Andromeda, the HPC server for URI:
cd /data/putnamlab/REFS/
mkdir Peff
All of the genome files (i.e., genome scaffolds, CDS, proteins, GFF, and functional annotations) were downloaded from the Coral Genoscope Database.
I put them on the Putnam Lab HPC Andromeda server for future reference:
cp -R /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/refs /data/putnamlab/REFS/Peff
pwd /data/putnamlab/REFS/Peff/refs
Make folder structure in personal pathway on Andromeda
pwd /data/putnamlab/dbecks/Becker_E5/
mkdir Peff_RNASeq_compare
cd Peff_RNASeq_compare
mkdir data/
mkdir scripts/
Download files into data folder
Path where we stored the RAW fastq.gz files
/data/putnamlab/KITT/hputnam/20201209_Becker_RNASeq_combo/combo
Use code to copy files from one directory to the other
cp -R /data/putnamlab/KITT/hputnam/20201209_Becker_RNASeq_combo/combo/*.gz /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/raw
2) Check file integrity
a) Count all files to make sure all downloaded
cd data/raw
ls -1 | wc -l
#64 is the right total
b) Verify data transfer integrity with md5sum
md5sum *.fastq.gz > URIcheckmd5.md5
md5sum -c URIcheckmd5.md5
#all okay
Should output ‘OK’ next to each file name
d) Count number of reads per file using the code after @ in fastq.gz files (e.g.,@GWNJ).
zgrep -c "@GWNJ" *.gz > rawread.counts.txt
C17_R1_001.fastq.gz:25158606
C17_R2_001.fastq.gz:25158606
C18_R1_001.fastq.gz:22733345
C18_R2_001.fastq.gz:22733345
C19_R1_001.fastq.gz:24846067
C19_R2_001.fastq.gz:24846067
C20_R1_001.fastq.gz:24030431
C20_R2_001.fastq.gz:24030431
C21_R1_001.fastq.gz:16484060
C21_R2_001.fastq.gz:16484060
C22_R1_001.fastq.gz:22990550
C22_R2_001.fastq.gz:22990550
C23_R1_001.fastq.gz:20905338
C23_R2_001.fastq.gz:20905338
C24_R1_001.fastq.gz:22578178
C24_R2_001.fastq.gz:22578178
C25_R1_001.fastq.gz:29417106
C25_R2_001.fastq.gz:29417106
C26_R1_001.fastq.gz:23267238
C26_R2_001.fastq.gz:23267238
C27_R1_001.fastq.gz:24990687
C27_R2_001.fastq.gz:24990687
C28_R1_001.fastq.gz:23396439
C28_R2_001.fastq.gz:23396439
C29_R1_001.fastq.gz:17900262
C29_R2_001.fastq.gz:17900262
C30_R1_001.fastq.gz:26361873
C30_R2_001.fastq.gz:26361873
C31_R1_001.fastq.gz:24642131
C31_R2_001.fastq.gz:24642131
C32_R1_001.fastq.gz:27076800
C32_R2_001.fastq.gz:27076800
E10_R1_001.fastq.gz:29131006
E10_R2_001.fastq.gz:29131006
E11_R1_001.fastq.gz:18568153
E11_R2_001.fastq.gz:18568153
E12_R1_001.fastq.gz:27805087
E12_R2_001.fastq.gz:27805087
E13_R1_001.fastq.gz:24455094
E13_R2_001.fastq.gz:24455094
E14_R1_001.fastq.gz:22630044
E14_R2_001.fastq.gz:22630044
E15_R1_001.fastq.gz:23796710
E15_R2_001.fastq.gz:23796710
E16_R1_001.fastq.gz:29523059
E16_R2_001.fastq.gz:29523059
E1_R1_001.fastq.gz:25368402
E1_R2_001.fastq.gz:25368402
E2_R1_001.fastq.gz:23770610
E2_R2_001.fastq.gz:23770610
E3_R1_001.fastq.gz:25641209
E3_R2_001.fastq.gz:25641209
E4_R1_001.fastq.gz:21751368
E4_R2_001.fastq.gz:21751368
E5_R1_001.fastq.gz:16381619
E5_R2_001.fastq.gz:16381619
E6_R1_001.fastq.gz:24937261
E6_R2_001.fastq.gz:24937261
E7_R1_001.fastq.gz:24020166
E7_R2_001.fastq.gz:24020166
E8_R1_001.fastq.gz:23675842
E8_R2_001.fastq.gz:23675842
E9_R1_001.fastq.gz:25068848
E9_R2_001.fastq.gz:25068848
=======
/data/putnamlab/KITT/hputnam/20201209_Becker_RNASeq_combo/combo/md5sum_list.txt
3) Run FastQC
a) Make folders for raw FastQC results and scripts
cd Peff_RNASeq_compare/data
mkdir fastqc_results
b) Write script for checking quality with FastQC and submit as job on Andromeda
nano /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/fastqc_raw.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/raw
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file
module load FastQC/0.11.9-Java-11
for file in /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/raw/*.gz
do
fastqc $file --outdir /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/fastqc_results/
done
sbatch /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/fastqc_raw.sh
Submitted batch job 262746 at 13:48, will take ~ 5 hours
c) Make sure all files were processed
ls -1 | wc -l
#128
Combined QC output into 1 file with MultiQC, do not need a script due to fast computational time
module load MultiQC/1.9-intel-2020a-Python-3.8.2
multiqc /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/fastqc_results/*fastqc.zip -o /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/fastqc_results/multiqc/
c) Copy MultiQC and FastQC files to local computer
scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/fastqc_results/multiqc/multiqc_report.html /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Bioinformatics/Output/RNASeq/P.effusa/original_fastqc
scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/fastqc_results/*.html /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Bioinformatics/Output/RNASeq/P.effusa/original_fastqc
4) Trim and clean reads
a) Make trimmed reads folder in all other results folders
mkdir data/trimmed
cd trimmed
c) Write script for Trimming and run on Andromeda
#Run fastp on files #Trims 20bp from 5’ end of all reads #Trims poly G, if present
nano /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/trim.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/raw
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file
module load fastp/0.19.7-foss-2018b
array1=($(ls *R1*.fastq.gz)) #Make an array of sequences to trim
for i in ${array1[@]}; do
fastp --in1 ${i} --in2 $(echo ${i}|sed s/_R1/_R2/) --detect_adapter_for_pe --trim_poly_g --trim_front1 20 --trim_front2 20 --out1 ../trimmed/${i} --out2 ../trimmed/$(echo ${i}|sed s/_R1/_R2/)
done
sbatch /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/trim.sh
Submitted batch job 263692 on 20230620 at 8:30am
5) Check quality of trimmed files
a) Check number of files in /trimmed directory
ls -1 | wc -l
#64
b) Check number of reads in /trimmed directory
zgrep -c "@GWNJ" *.gz > trimmed_seq_counts
C17_R1_001.fastq.gz:24074769
C17_R2_001.fastq.gz:24074769
C18_R1_001.fastq.gz:22161558
C18_R2_001.fastq.gz:22161558
C19_R1_001.fastq.gz:24105572
C19_R2_001.fastq.gz:24105572
C20_R1_001.fastq.gz:23305604
C20_R2_001.fastq.gz:23305604
C21_R1_001.fastq.gz:15898378
C21_R2_001.fastq.gz:15898378
C22_R1_001.fastq.gz:22275366
C22_R2_001.fastq.gz:22275366
C23_R1_001.fastq.gz:20262323
C23_R2_001.fastq.gz:20262323
C24_R1_001.fastq.gz:21738911
C24_R2_001.fastq.gz:21738911
C25_R1_001.fastq.gz:28399410
C25_R2_001.fastq.gz:28399410
C26_R1_001.fastq.gz:22355910
C26_R2_001.fastq.gz:22355910
C27_R1_001.fastq.gz:23740312
C27_R2_001.fastq.gz:23740312
C28_R1_001.fastq.gz:22661445
C29_R1_001.fastq.gz:17263262
C29_R2_001.fastq.gz:17263262
C30_R1_001.fastq.gz:25438934
C30_R2_001.fastq.gz:25438934
C31_R1_001.fastq.gz:23850142
C31_R2_001.fastq.gz:23850142
C32_R1_001.fastq.gz:25903448
C32_R2_001.fastq.gz:25903448
E10_R1_001.fastq.gz:28093089
E10_R2_001.fastq.gz:28093089
E11_R1_001.fastq.gz:17952847
E11_R2_001.fastq.gz:17952847
E12_R1_001.fastq.gz:26995262
E12_R2_001.fastq.gz:26995262
E13_R1_001.fastq.gz:23731898
E13_R2_001.fastq.gz:23731898
E14_R1_001.fastq.gz:21828746
E14_R2_001.fastq.gz:21828746
E15_R1_001.fastq.gz:22613396
E15_R2_001.fastq.gz:22613396
E16_R1_001.fastq.gz:28773827
E16_R2_001.fastq.gz:28773827
E1_R1_001.fastq.gz:24464129
E1_R2_001.fastq.gz:24464129
E2_R1_001.fastq.gz:23119297
E2_R2_001.fastq.gz:23119297
E3_R1_001.fastq.gz:25022842
E3_R2_001.fastq.gz:25022842
E4_R1_001.fastq.gz:21162083
E4_R2_001.fastq.gz:21162083
E5_R1_001.fastq.gz:15835633
E5_R2_001.fastq.gz:15835633
E6_R1_001.fastq.gz:24282809
E6_R2_001.fastq.gz:24282809
E7_R1_001.fastq.gz:23375734
E7_R2_001.fastq.gz:23375734
E8_R1_001.fastq.gz:22969558
E8_R2_001.fastq.gz:22969558
E9_R1_001.fastq.gz:24269213
E9_R2_001.fastq.gz:24269213
c) Run FastQC on trimmed data
mkdir trimmed_qc
nano /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/fastqc_trimmed.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/trimmed/trimmed_qc
#SBATCH --error="script_error"
#SBATCH --output="output_script"
module load FastQC/0.11.8-Java-1.8
for file in /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/trimmed/*.gz
do
fastqc $file --outdir /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/trimmed/trimmed_qc
done
sbatch /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/fastqc_trimmed.sh
Submitted batch job 266006 on 20230621 at 5:02pm
d) Run MultiQC on trimmed data, Combined QC output into 1 file with MultiQC, do not need a script due to fast computational time
module load MultiQC/1.9-intel-2020a-Python-3.8.2
multiqc /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/trimmed/trimmed_qc/*fastqc.zip -o /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/trimmed/trimmed_qc/trimmed_multiqc
e) Copy multiqc and fastqc to computer, use terminal window fro desktop not in server
scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/trimmed/trimmed_qc/trimmed_multiqc/multiqc_report.html /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Bioinformatics/Output/RNASeq/P.effusa/trimmed_fastqc
scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/trimmed/trimmed_qc/*.html /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Bioinformatics/Output/RNASeq/P.effusa/trimmed_fastqc
6) Align reads
a) Generate genome build
HiSat2 Align reads to refernece genome
nano /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/Hisat2_genome_build.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/refs
#SBATCH --error="script_error"
#SBATCH --output="output_script"
module load HISAT2/2.2.1-foss-2019b #Alignment to reference genome: HISAT2
# index the reference genome for Peff output index to working directory
hisat2-build -f /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/refs/Pocillopora_effusa_v3_genome_contigs.fa ./Peff_ref # called the reference genome (scaffolds)
echo "Reference genome indexed. Starting alignment" $(date)
sbatch /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/Hisat2_genome_build.sh
Submitted batch job 266036
b) Align reads to genome
mkdir data/mapped
nano /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/Hisat2_align2.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/trimmed
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error"
#SBATCH --output="output_script"
module load HISAT2/2.2.1-foss-2019b
#Aligning paired end reads
#Has the R1 in array1 because the sed in the for loop changes it to an R2. SAM files are of both forward and reverse reads
array1=($(ls *R1*.fastq.gz))
for i in ${array1[@]}; do
hisat2 -p 48 --rna-strandness RF --dta -q -x /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/refs/Peff_ref -1 /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/trimmed/${i} \
-2 /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/trimmed/$(echo ${i}|sed s/_R1/_R2/) -S /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/mapped/${i}.sam
done
sbatch /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/Hisat2_align2.sh
Submitted batch job 266041, took four hours
Sort and convert sam to bam and remove .sam files to make space
nano /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/SAMtoBAM.sh
#There will be lots of .tmp file versions in your folder, this is normal while this script runs and they should delete at the end to make one sorted.bam file
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=8
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/mapped
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error"
#SBATCH --output="output_script"
module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools
array1=($(ls *.sam))
for i in ${array1[@]}; do
samtools sort -o /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/mapped/${i}.sorted.bam /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/mapped/${i}
echo "${i} bam-ified!"
rm /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/mapped/*.sam
done
sbatch /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/SAMtoBAM.sh
Submitted batch job 266945
To view
#move alignment rates file from trimmed folder to mapped and download to computer
scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/mapped/alignment_rates /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Bioinformatics/Output/RNASeq/P.effusa
Check number of mapped reads and mapping percentages
Explanation for SAMtools functions for checking the mapped reads in a paired-end dataset found here: https://www.biostars.org/p/138116/
nano /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/mapped_read_counts.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=8
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/mapped
#SBATCH --cpus-per-task=3
module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools
for i in *.bam; do
echo "${i}" >> mapped_reads_counts_Peff
samtools flagstat ${i} | grep "mapped (" >> mapped_reads_counts_Peff
done
sbatch /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/mapped_read_counts.sh
Submitted batch job 267201
7) Perform gene counts with stringTie
mkdir counts
cd counts
b) Assemble and estimate reads
nano /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/StringTie_Assemble.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/mapped
#SBATCH --cpus-per-task=3
module load StringTie/2.2.1-GCC-11.2.0
array1=($(ls *.bam))
for i in ${array1[@]}; do
stringtie -p 48 --rf -e -G /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/refs/Pocillopora_effusa_v3.annot_genes.gff -o /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/counts/${i}.gtf /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/mapped/${i}
done
sbatch /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/StringTie_Assemble.sh
c) Merge stringTie gtf results
#in this step we are making a file with all the gtf names and stringtie will merge them all together for a master list for your specific genes
ls *gtf > mergelist.txt
cat mergelist.txt
module load StringTie/2.2.1-GCC-11.2.0
stringtie --merge -p 8 -G /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/refs/Pocillopora_effusa_v3.annot_genes.gff -o stringtie_merged.gtf mergelist.txt
d) Assess assembly quality
nano /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/gffcompare.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=8
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/counts
#SBATCH --cpus-per-task=3
module load GffCompare/0.12.6-GCC-11.2.0
gffcompare -r /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/refs/Pocillopora_effusa_v3.annot_genes.gff -o merged stringtie_merged.gtf
sbatch /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/gffcompare.sh
e) Re-estimate assembly
nano /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/re_estimate.assembly.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/mapped
#SBATCH --cpus-per-task=3
module load StringTie/2.2.1-GCC-11.2.0
array1=($(ls *.bam))
for i in ${array1[@]}; do
stringtie -e -G /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/refs/Pocillopora_effusa_v3.annot_genes.gff -o ${i}.merge.gtf ${i}
echo "${i}"
done
sbatch /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/re_estimate.assembly.sh
# move merged GTF files to their own folder
mkdir GTF_merge
mv *merge.gtf GTF_merge
f) Create gene matrix
#making a sample txt file with all gtf file names
F=/data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/mapped/GTF_merge/
array2=($(ls *merge.gtf))
for i in ${array2[@]}
do
echo "${i} $F${i}" >> sample_list.txt
done
#sample_list.txt document output
#create gene matrix
nano /data/putnamlab/hputnam/Becker_E5/Peff_RNASeq_compare/scripts/GTFtoCounts.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/mapped/GTF_merge
#SBATCH --cpus-per-task=3
module load StringTie/2.2.1-GCC-11.2.0
module load Python/2.7.18-GCCcore-9.3.0
python prepDE.py -g Peff_gene_count_matrix.csv -i sample_list.txt
sbatch /data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/scripts/GTFtoCounts.sh
g) Secure-copy gene counts onto local computer, make sure to open a seperate command shell outside of Andromeda on your own terminal
#copy gene count matrix
scp danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/mapped/GTF_merge/Poc_gene_count_matrix.csv /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/RAnalysis/Data/RNA-seq/P.effusa
#copy transcript count matrix
scp danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Peff_RNASeq_compare/data/mapped/GTF_merge/transcript_count_matrix.csv /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/RAnalysis/Data/RNA-seq/P.effusa