Acropora pulchra RNA-seq Bioinformatic Workflow

Mapping sequences from Acropora pulchra coral samples from adult colonies exposed to marine heatwave conditions in Moorea, French Polynesia in March/April 2022 and their offspring (larvae) collected from coral spawning in October 2022 to an unpublished Acropora pulchra genome.

Goal

The following document contains the bioinformatic pipeline used for cleaning, aligning and assembling our raw RNA sequences. The goal is to map the adult and larval sequences to a compiled Acropora pulchra genome. All metadata and information for these projects can be found in this repository and in these notebook posts of the genome assembly and the extraction protocol for the samples used in the assembly. These commands were compiled into bash scripts to run on the URI HPC Andromeda server.

Metadata for submitted sequences

Sequence samples metadata

1) Obtain genome

I am using the Acropora pulchra genome assembled by Jill Ashey and Trinity Conn through the University of Rhode Island and Shedd Aquarium.

Location on Andromeda, the HPC server for URI:

cd /data/putnamlab/tconn/repeats/apul_softmasked/apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds.fa.masked

Make folder structure for sequence mapping analysis

cd /data/putnamlab/dbecks/

mkdir Heatwave_A.pul_2022Project
cd Heatwave_A.pul_2022Project

mkdir data
mkdir scripts

cd data
mkdir raw

Copy already downloaded raw sequence files from putnam lab main storage in Andromeda

cp -r /data/putnamlab/KITT/hputnam/20231214_Apulchra_RiboDepletion/* /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw

2) Check file integrity

a) Count all files to make sure all downloaded

ls -1 | wc -l

b) Verify data transfer integrity with md5sum

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/check_transfer.sh

#!/bin/bash ###creating slurm script
#SBATCH -t 24:00:00 ###give script 24 hours to run
#SBATCH --nodes=1 --ntasks-per-node=1 ###on server, different nodes you can use for processing power, node just do one task
#SBATCH --export=NONE
#SBATCH --mem=100GB ###in server allocate 100GB amount of memory
#SBATCH --account=putnamlab ###primary account
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw ###path


md5sum /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw*.gz > URIcheckmd5.md5

md5sum /data/putnamlab/KITT/hputnam/20231214_Apulchra_RiboDepletion/*.gz > URIcheckmd5.md5

sbatch check_transfer.sh
Submitted batch job 253687

Checksum from Genewiz

b4a85239e88aba5d1762d7e90a9d70d8  ./12_R1_001.fastq.gz
eab831185c3b8ca72cf3914f57fd7330  ./12_R2_001.fastq.gz
7f620547442556c977ed6e64322ad3b4  ./16_R1_001.fastq.gz
96343dd00c1ea31d0f35af312db01405  ./16_R2_001.fastq.gz
1c05cc100907fa24681921aa520721ff  ./20_R1_001.fastq.gz
5ba8e2613376b60d42d4a79de32eeafe  ./20_R2_001.fastq.gz
4b29ffe9c1a711864d2615ba2dafbe06  ./248_R1_001.fastq.gz
332babdf0cce128c46dddf2242923d8d  ./248_R2_001.fastq.gz
32d3be14d0c8eaff9f986a8198dbf101  ./249_R1_001.fastq.gz
955885f77f56a52cfec0f5dbe1bb5aa6  ./249_R2_001.fastq.gz
e970c9fef559d883400616144469740f  ./24_R1_001.fastq.gz
1ce920d6100cb0e97efdd7acb1206afe  ./24_R2_001.fastq.gz
a9dd31f2aca0656e855990b253a216d8  ./250_R1_001.fastq.gz
b6821db41d50b1d0c01f39e1ef9fd40e  ./250_R2_001.fastq.gz
25c2235a292e2a20bf6334d74c164855  ./260_R1_001.fastq.gz
f15e2c1bb8959aa686d4848db6535c53  ./260_R2_001.fastq.gz
d424bde716d0a388d4b82910490b16f9  ./261_R1_001.fastq.gz
723bcabf15e03805e3d45819d2c8a9cf  ./261_R2_001.fastq.gz
ea8944531c703a3ae2eb92b6add58944  ./262_R1_001.fastq.gz
ae433007a1a10464af2b41bef690d7ed  ./262_R2_001.fastq.gz
de405dc4452fb440b2bd4edbc9c30a87  ./5_R1_001.fastq.gz
1f53cd6cbb43668a3bffd094b73c9ad6  ./5_R2_001.fastq.gz
5d73e8c968f198e279b68668ef3fb50f  ./8_R1_001.fastq.gz
d2de35adcc456517be1e09bb097aeadb  ./8_R2_001.fastq.gz

c) Verify data integrity with md5sum

Cross-reference the checksum document from GENEWIZ with the data we have on our computer

With a small amount of files, able to first cross-check that the sequences matched between both files on the desktop

Use the code below in terminal to cross-check the files and compare for sanity check

in directory: /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw

md5sum -c Genewiz_Provided.md5

Should output ‘OK’ next to each file name

d) Count number of reads per file using the code after @ in fastq.gz files (e.g.,@GWNJ).

zgrep -c "@A01940" *.gz > raw_seq_counts

12_R1_001.fastq.gz:22077183
12_R2_001.fastq.gz:22077183
16_R1_001.fastq.gz:21806397
16_R2_001.fastq.gz:21806397
20_R1_001.fastq.gz:22366560
20_R2_001.fastq.gz:22366560
248_R1_001.fastq.gz:24234533
248_R2_001.fastq.gz:24234533
249_R1_001.fastq.gz:24655232
249_R2_001.fastq.gz:24655232
24_R1_001.fastq.gz:23762178
24_R2_001.fastq.gz:23762178
250_R1_001.fastq.gz:21345210
250_R2_001.fastq.gz:21345210
260_R1_001.fastq.gz:26117186
260_R2_001.fastq.gz:26117186
261_R1_001.fastq.gz:24168048
261_R2_001.fastq.gz:24168048
262_R1_001.fastq.gz:25536072
262_R2_001.fastq.gz:25536072
5_R1_001.fastq.gz:23140700
5_R2_001.fastq.gz:23140700
8_R1_001.fastq.gz:23792081
8_R2_001.fastq.gz:23792081

3) Run FastQC on sequences

a) Make folders for raw FastQC results and scripts

b) Write script for checking quality with FastQC and submit as job on Andromeda

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/fastqc_raw.sh

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file

module load FastQC/0.11.9-Java-11

for file in /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw/*.gz
do
fastqc $file --outdir /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw/qc
done

sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/fastqc_raw.sh

Submitted batch job 291946

c) Make sure all files were processed

ls -1 | wc -l

#48

Combined QC output into 1 file with MultiQC

module load MultiQC/1.9-intel-2020a-Python-3.8.2

multiqc /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw/qc

c) Copy MultiQC files to local computer

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw/qc/multiqc_report.html /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/original_fastqc

4) Trim and clean reads

a) Make trimmed reads folder in all other results folders

mkdir data/trimmed
cd trimmed

c) Write script for Trimming and run on Andromeda

#Run fastp on files #Trims 5bp from 5’ end of all reads #Trims poly G, if present

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/trim.sh

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file

module load fastp/0.23.2-GCC-11.2.0

array1=($(ls *R1*.fastq.gz)) #Make an array of sequences to trim
for i in ${array1[@]}; do
fastp --in1 ${i} --in2 $(echo ${i}|sed s/_R1/_R2/) --detect_adapter_for_pe --trim_poly_g --trim_front1 15 --trim_front2 15 --qualified_quality_phred 20 --unqualified_percent_limit 5 --out1 ../trimmed/${i} --out2 ../trimmed/$(echo ${i}|sed s/_R1/_R2/)  
done

sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/trim.sh

Submitted batch job 292306

#first two runs with different trimming paramaters resulted in errors for per sequence qc content, so using --qualified_quality_phred 15 --unqualified_percent_limit 5 to --qualified_quality_phred sets the threshold for considering a base as high-quality (default is 15, you can adjust as needed).
--unqualified_percent_limit filters out reads with too many low-quality bases.

5) Check quality of trimmed files

a) Check number of files in /trimmed directory

ls -1 | wc -l
#24

b) Check number of reads in /trimmed directory

zgrep -c "@A01940" *.gz > trimmed_seq_counts

12_R1_001.fastq.gz:19797178
12_R2_001.fastq.gz:19797178
16_R1_001.fastq.gz:19557856
16_R2_001.fastq.gz:19557856
20_R1_001.fastq.gz:19986198
20_R2_001.fastq.gz:19986198
248_R1_001.fastq.gz:21556118
248_R2_001.fastq.gz:21556118
249_R1_001.fastq.gz:22043273
249_R2_001.fastq.gz:22043273
24_R1_001.fastq.gz:21277072
24_R2_001.fastq.gz:21277072
250_R1_001.fastq.gz:19038866
250_R2_001.fastq.gz:19038866
260_R1_001.fastq.gz:23396050
260_R2_001.fastq.gz:23396050
261_R1_001.fastq.gz:21707283
261_R2_001.fastq.gz:21707283
262_R1_001.fastq.gz:22781251
262_R2_001.fastq.gz:22781251
5_R1_001.fastq.gz:20876524
5_R2_001.fastq.gz:20876524
8_R1_001.fastq.gz:21284631
8_R2_001.fastq.gz:21284631

c) Run FastQC on trimmed data

mkdir trimmed_qc

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/fastqc_trimmed.sh

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/trimmed_qc
#SBATCH --error="script_error"
#SBATCH --output="output_script"

module load FastQC/0.11.9-Java-11

for file in /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/*.gz
do
fastqc $file --outdir /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/trimmed_qc
done

sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/fastqc_trimmed.sh

Submitted batch job 292337

d) Run MultiQC on trimmed data

module load MultiQC/1.9-intel-2020a-Python-3.8.2
multiqc /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/trimmed_qc

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/multiqc_report.html /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/trimmed_fastqc

6) Align reads

a) Generate genome index

HiSat2 Align reads to genome

HiSat2 HiSat2 Github

Make directories for genome mapping

mkdir hisat2_mapped_genome
mkdir hisat2_refs_genome

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/Hisat2_genome_build.sh

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_refs_genome
#SBATCH --error="script_error_build"
#SBATCH --output="output_script_build"

module load HISAT2/2.2.1-gompi-2022a

hisat2-build -f /data/putnamlab/tconn/repeats/apul_softmasked/apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds.fa.masked Apul_genome_ref

sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/Hisat2_genome_build.sh

Submitted batch job 338266

b) Align reads to genome

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/Hisat2_align2_genome.sh

#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mem=120GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed
#SBATCH -p putnamlab
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error_align"
#SBATCH --output="output_script_align"


module load HISAT2/2.2.1-gompi-2022a

#Aligning paired end reads
#Has the R1 in array1 because the sed in the for loop changes it to an R2. SAM files are of both forward and reverse reads
#-p increases threads on server, --rna-strandness RF forward and reverse, -dta downstream genome analysis, -q reads are fastq files, -x location of ref -S where to output sam files

for read_pair in *_R1_001.fastq.gz; do
    base_name=$(basename "$read_pair" _R1_001.fastq.gz)
    hisat2 -p 48 --rna-strandness RF --dta -q -x /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_refs_genome/Apul_genome_ref -1 "$read_pair" -2 "${base_name}_R2_001.fastq.gz" -S "/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_mapped_genome/${base_name}.sam"
done

sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/Hisat2_align2_genome.sh

Submitted batch job 338358

#Download alignment statistics information from mapping, will be the output from your script above, even though it is a mapped output, you are in the trimmed folder here

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/script_error_align_genome /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/data

d) Sort and convert sam to bam and check number of mapped reads and mapping percentages

Explanation:
- samtools sort -o sorted.bam aligned.sam: This command sorts the SAM file and creates a sorted BAM file (Trinity_aligned.sorted.bam).
- samtools index .sorted.bam: This command creates an index for the sorted BAM file. The index file (sorted.bam.bai) is necessary for certain operations and viewers.
- samtools flagstat .sorted.bam: This command generates statistics about the alignment, including the number of mapped reads and mapping percentages.

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/SAMtoBAM_genome.sh

#There will be lots of .tmp file versions in your folder, this is normal while this script runs and they should delete at the end to make one sorted.bam file

#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=8
#SBATCH --export=NONE
#SBATCH --mem=120GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_mapped_genome
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error_sam_genomme"
#SBATCH --output="output_script_sam_genome"

module load SAMtools/1.16.1-GCC-11.3.0 #Preparation of alignment for assembly: SAMtools

# Directory containing SAM files
sam_directory="/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_mapped_genome/"

# Create a directory for the results
mkdir -p "${sam_directory}/results"

# Process each SAM file in the directory
for sam_file in "$sam_directory"/*.sam; do
    if [ -f "$sam_file" ]; then
        # Extract sample name from the file
        sample_name=$(basename "${sam_file%.sam}")

        # Step 1: Sort the SAM file and convert to BAM
        samtools sort -o "${sam_directory}/results/${sample_name}_sorted.bam" "$sam_file"

        # Step 2: Index the BAM file
        samtools index "${sam_directory}/results/${sample_name}_sorted.bam"

        # Step 3: Check Mapping Statistics
        samtools flagstat "${sam_directory}/results/${sample_name}_sorted.bam" > "${sam_directory}/results/${sample_name}_flagstat.txt"

        echo "Processed: ${sample_name}"
        echo "------------------------"
    fi
done

sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/SAMtoBAM_genome.sh

Submitted batch job 338493

Alignment statistics

interactive

module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools

for i in *.bam; do
    echo "${i}" >> mapped_reads_counts_genome_Apul
    samtools flagstat ${i} | grep "mapped (" >> mapped_reads_counts_genome_Apul
done

12_sorted.bam
138010934 + 0 mapped (96.43% : N/A)
16_sorted.bam
137288283 + 0 mapped (96.33% : N/A)
20_sorted.bam
143943875 + 0 mapped (97.04% : N/A)
248_sorted.bam
157752576 + 0 mapped (97.39% : N/A)
249_sorted.bam
179799724 + 0 mapped (98.67% : N/A)
24_sorted.bam
159014625 + 0 mapped (97.32% : N/A)
250_sorted.bam
155177825 + 0 mapped (98.59% : N/A)
260_sorted.bam
186770220 + 0 mapped (98.90% : N/A)
261_sorted.bam
167907087 + 0 mapped (98.61% : N/A)
262_sorted.bam
177725506 + 0 mapped (98.72% : N/A)
5_sorted.bam
118825396 + 0 mapped (90.16% : N/A)
8_sorted.bam
154489579 + 0 mapped (97.08% : N/A)

Pretty high mapping percentages (>90%) for all samples!

e) Download mapping percentages to desktop

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_mapped_genome/results/mapped_reads_counts_genome_Apul /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/data/mapping_percentages_genome

7) Perform gene counts with stringTie

a) Load structural annotation for genome

/data/putnamlab/tconn/predict_results/Acropora_pulchra.gff3

b) Make directory for stringTie

mkdir stringTie_genome
cd stringTie_genome

mkdir BAM_genome GTF_genome GTF_merge_genome

c) Copy BAM files to stringTie folder

cd /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_mapped_genome/results/
mv *_sorted.bam /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM_genome

d) Assemble and estimate reads, following vignette for transdecoder and transcriptome specific setttings.

cd BAM_genome

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/stringTie_assemble_genome.sh


#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH --error="stringtie_genome_out_error"
#SBATCH --output="stringtie_genome_out"

module load StringTie/2.2.1-GCC-11.2.0
module load GffCompare/0.12.6-GCC-11.2.0

# Define the directory containing BAM files
F=/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM_genome

# List all BAM files in the directory
array1=($(ls $F/*bam))

# Process each BAM file
for i in ${array1[@]}; do
    # Run StringTie with options -e -B -G for each BAM file
    stringtie -e -B -G /data/putnamlab/tconn/predict_results/Acropora_pulchra.gff3 -o ${i}.gtf ${i}

    # Print the current BAM file name for tracking
    echo "${i}"
done

sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/stringTie_assemble_genome.sh

Submitted batch job 335278

e) Merge stringTie gtf results

mv *gtf ../GTF_genome/

ls *gtf > apul_mergelist.txt
cat apul_mergelist.txt

module load StringTie/2.2.1-GCC-11.2.0

stringtie --merge -p 8 -G /data/putnamlab/tconn/predict_results/Acropora_pulchra.gff3 -o stringtie_apul_merged_genome.gtf apul_mergelist_genome.txt

f) Assess assembly quality

module load GffCompare/0.12.6-GCC-11.2.0

gffcompare -r /data/putnamlab/tconn/predict_results/Acropora_pulchra.gff3 -o Apul.merged_genome stringtie_apul_merged_genome.gtf

g) Re-estimate assembly if necessary!

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/stringTie_apul_re-assemble_genome.sh

#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH --error="re-assemble_apul_genome_out_error"
#SBATCH --output="re-assemble_apul_genome_out"

module load StringTie/2.2.1-GCC-11.2.0
module load GffCompare/0.12.6-GCC-11.2.0

# Define the directory containing BAM files
F=/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie_genome/BAM

array1=($(ls $F/*bam))
for i in ${array1[@]}
do
stringtie -e -G /data/putnamlab/tconn/predict_results/Acropora_pulchra.gff3 -o ${i}.merge.gtf ${i}
echo "${i}"
done

sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/stringTie_apul_re-assemble_genome.sh

# move merged GTF files to their own folder
mkdir GTF_merge_genome

mv *merge.gtf GTF_merge_genome

f) Create gene matrix

#making a sample txt file with all gtf file names

F=/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie_genome/GTF_merge_genome/

array2=($(ls *merge.gtf))
for i in ${array2[@]}
do
echo "${i} $F${i}" >> sample_list.txt
done

#sample_list.txt document output

#create gene matrix

nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/GTFtoCounts_genome.sh

#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/
#SBATCH --cpus-per-task=3


module load StringTie/2.2.1-GCC-11.2.0
module load Python/2.7.18-GCCcore-9.3.0

python prepDE.py -g Apul_genome_gene_count_matrix.csv -i sample_list.txt

sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_RNASeq_compare/scripts/GTFtoCounts.sh

g) Secure-copy gene counts onto local computer, make sure to open a seperate command shell outside of Andromeda on your own terminal

#copy gene count matrix

scp danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Pmean_RNASeq_compare/data/GTF_merge/Poc_gene_count_matrix.csv /Users/Danielle/Desktop/Putnam_Lab/

Written on January 1, 2024