Acropora pulchra RNA-seq Bioinformatic Workflow
Mapping sequences from Acropora pulchra coral samples from adult colonies exposed to marine heatwave conditions in Moorea, French Polynesia in March/April 2022 and their offspring (larvae) collected from coral spawning in October 2022 to an unpublished Acropora pulchra genome.
Goal
The following document contains the bioinformatic pipeline used for cleaning, aligning and assembling our raw RNA sequences. The goal is to map the adult and larval sequences to a compiled Acropora pulchra genome. All metadata and information for these projects can be found in this repository and in these notebook posts of the genome assembly and the extraction protocol for the samples used in the assembly. These commands were compiled into bash scripts to run on the URI HPC Andromeda server.
Metadata for submitted sequences
1) Obtain genome
I am using the Acropora pulchra genome assembled by Jill Ashey and Trinity Conn through the University of Rhode Island and Shedd Aquarium.
Location on Andromeda, the HPC server for URI:
cd /data/putnamlab/tconn/repeats/apul_softmasked/apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds.fa.masked
Make folder structure for sequence mapping analysis
cd /data/putnamlab/dbecks/
mkdir Heatwave_A.pul_2022Project
cd Heatwave_A.pul_2022Project
mkdir data
mkdir scripts
cd data
mkdir raw
Copy already downloaded raw sequence files from putnam lab main storage in Andromeda
cp -r /data/putnamlab/KITT/hputnam/20231214_Apulchra_RiboDepletion/* /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw
2) Check file integrity
a) Count all files to make sure all downloaded
ls -1 | wc -l
b) Verify data transfer integrity with md5sum
nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/check_transfer.sh
#!/bin/bash ###creating slurm script
#SBATCH -t 24:00:00 ###give script 24 hours to run
#SBATCH --nodes=1 --ntasks-per-node=1 ###on server, different nodes you can use for processing power, node just do one task
#SBATCH --export=NONE
#SBATCH --mem=100GB ###in server allocate 100GB amount of memory
#SBATCH --account=putnamlab ###primary account
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw ###path
md5sum /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw*.gz > URIcheckmd5.md5
md5sum /data/putnamlab/KITT/hputnam/20231214_Apulchra_RiboDepletion/*.gz > URIcheckmd5.md5
sbatch check_transfer.sh
Submitted batch job 253687
Checksum from Genewiz
b4a85239e88aba5d1762d7e90a9d70d8 ./12_R1_001.fastq.gz
eab831185c3b8ca72cf3914f57fd7330 ./12_R2_001.fastq.gz
7f620547442556c977ed6e64322ad3b4 ./16_R1_001.fastq.gz
96343dd00c1ea31d0f35af312db01405 ./16_R2_001.fastq.gz
1c05cc100907fa24681921aa520721ff ./20_R1_001.fastq.gz
5ba8e2613376b60d42d4a79de32eeafe ./20_R2_001.fastq.gz
4b29ffe9c1a711864d2615ba2dafbe06 ./248_R1_001.fastq.gz
332babdf0cce128c46dddf2242923d8d ./248_R2_001.fastq.gz
32d3be14d0c8eaff9f986a8198dbf101 ./249_R1_001.fastq.gz
955885f77f56a52cfec0f5dbe1bb5aa6 ./249_R2_001.fastq.gz
e970c9fef559d883400616144469740f ./24_R1_001.fastq.gz
1ce920d6100cb0e97efdd7acb1206afe ./24_R2_001.fastq.gz
a9dd31f2aca0656e855990b253a216d8 ./250_R1_001.fastq.gz
b6821db41d50b1d0c01f39e1ef9fd40e ./250_R2_001.fastq.gz
25c2235a292e2a20bf6334d74c164855 ./260_R1_001.fastq.gz
f15e2c1bb8959aa686d4848db6535c53 ./260_R2_001.fastq.gz
d424bde716d0a388d4b82910490b16f9 ./261_R1_001.fastq.gz
723bcabf15e03805e3d45819d2c8a9cf ./261_R2_001.fastq.gz
ea8944531c703a3ae2eb92b6add58944 ./262_R1_001.fastq.gz
ae433007a1a10464af2b41bef690d7ed ./262_R2_001.fastq.gz
de405dc4452fb440b2bd4edbc9c30a87 ./5_R1_001.fastq.gz
1f53cd6cbb43668a3bffd094b73c9ad6 ./5_R2_001.fastq.gz
5d73e8c968f198e279b68668ef3fb50f ./8_R1_001.fastq.gz
d2de35adcc456517be1e09bb097aeadb ./8_R2_001.fastq.gz
c) Verify data integrity with md5sum
Cross-reference the checksum document from GENEWIZ with the data we have on our computer
With a small amount of files, able to first cross-check that the sequences matched between both files on the desktop
Use the code below in terminal to cross-check the files and compare for sanity check
in directory: /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw
md5sum -c Genewiz_Provided.md5
Should output ‘OK’ next to each file name
d) Count number of reads per file using the code after @ in fastq.gz files (e.g.,@GWNJ).
zgrep -c "@A01940" *.gz > raw_seq_counts
12_R1_001.fastq.gz:22077183
12_R2_001.fastq.gz:22077183
16_R1_001.fastq.gz:21806397
16_R2_001.fastq.gz:21806397
20_R1_001.fastq.gz:22366560
20_R2_001.fastq.gz:22366560
248_R1_001.fastq.gz:24234533
248_R2_001.fastq.gz:24234533
249_R1_001.fastq.gz:24655232
249_R2_001.fastq.gz:24655232
24_R1_001.fastq.gz:23762178
24_R2_001.fastq.gz:23762178
250_R1_001.fastq.gz:21345210
250_R2_001.fastq.gz:21345210
260_R1_001.fastq.gz:26117186
260_R2_001.fastq.gz:26117186
261_R1_001.fastq.gz:24168048
261_R2_001.fastq.gz:24168048
262_R1_001.fastq.gz:25536072
262_R2_001.fastq.gz:25536072
5_R1_001.fastq.gz:23140700
5_R2_001.fastq.gz:23140700
8_R1_001.fastq.gz:23792081
8_R2_001.fastq.gz:23792081
3) Run FastQC on sequences
a) Make folders for raw FastQC results and scripts
b) Write script for checking quality with FastQC and submit as job on Andromeda
nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/fastqc_raw.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file
module load FastQC/0.11.9-Java-11
for file in /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw/*.gz
do
fastqc $file --outdir /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw/qc
done
sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/fastqc_raw.sh
Submitted batch job 291946
c) Make sure all files were processed
ls -1 | wc -l
#48
Combined QC output into 1 file with MultiQC
module load MultiQC/1.9-intel-2020a-Python-3.8.2
multiqc /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw/qc
c) Copy MultiQC files to local computer
scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw/qc/multiqc_report.html /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/original_fastqc
4) Trim and clean reads
a) Make trimmed reads folder in all other results folders
mkdir data/trimmed
cd trimmed
c) Write script for Trimming and run on Andromeda
#Run fastp on files #Trims 5bp from 5’ end of all reads #Trims poly G, if present
nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/trim.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/raw
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file
module load fastp/0.23.2-GCC-11.2.0
array1=($(ls *R1*.fastq.gz)) #Make an array of sequences to trim
for i in ${array1[@]}; do
fastp --in1 ${i} --in2 $(echo ${i}|sed s/_R1/_R2/) --detect_adapter_for_pe --trim_poly_g --trim_front1 15 --trim_front2 15 --qualified_quality_phred 20 --unqualified_percent_limit 5 --out1 ../trimmed/${i} --out2 ../trimmed/$(echo ${i}|sed s/_R1/_R2/)
done
sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/trim.sh
Submitted batch job 292306
#first two runs with different trimming paramaters resulted in errors for per sequence qc content, so using --qualified_quality_phred 15 --unqualified_percent_limit 5 to --qualified_quality_phred sets the threshold for considering a base as high-quality (default is 15, you can adjust as needed).
--unqualified_percent_limit filters out reads with too many low-quality bases.
5) Check quality of trimmed files
a) Check number of files in /trimmed directory
ls -1 | wc -l
#24
b) Check number of reads in /trimmed directory
zgrep -c "@A01940" *.gz > trimmed_seq_counts
12_R1_001.fastq.gz:19797178
12_R2_001.fastq.gz:19797178
16_R1_001.fastq.gz:19557856
16_R2_001.fastq.gz:19557856
20_R1_001.fastq.gz:19986198
20_R2_001.fastq.gz:19986198
248_R1_001.fastq.gz:21556118
248_R2_001.fastq.gz:21556118
249_R1_001.fastq.gz:22043273
249_R2_001.fastq.gz:22043273
24_R1_001.fastq.gz:21277072
24_R2_001.fastq.gz:21277072
250_R1_001.fastq.gz:19038866
250_R2_001.fastq.gz:19038866
260_R1_001.fastq.gz:23396050
260_R2_001.fastq.gz:23396050
261_R1_001.fastq.gz:21707283
261_R2_001.fastq.gz:21707283
262_R1_001.fastq.gz:22781251
262_R2_001.fastq.gz:22781251
5_R1_001.fastq.gz:20876524
5_R2_001.fastq.gz:20876524
8_R1_001.fastq.gz:21284631
8_R2_001.fastq.gz:21284631
c) Run FastQC on trimmed data
mkdir trimmed_qc
nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/fastqc_trimmed.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/trimmed_qc
#SBATCH --error="script_error"
#SBATCH --output="output_script"
module load FastQC/0.11.9-Java-11
for file in /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/*.gz
do
fastqc $file --outdir /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/trimmed_qc
done
sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/fastqc_trimmed.sh
Submitted batch job 292337
d) Run MultiQC on trimmed data
module load MultiQC/1.9-intel-2020a-Python-3.8.2
multiqc /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/trimmed_qc
scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/multiqc_report.html /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/trimmed_fastqc
6) Align reads
a) Generate genome index
HiSat2 Align reads to genome
Make directories for genome mapping
mkdir hisat2_mapped_genome
mkdir hisat2_refs_genome
nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/Hisat2_genome_build.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_refs_genome
#SBATCH --error="script_error_build"
#SBATCH --output="output_script_build"
module load HISAT2/2.2.1-gompi-2022a
hisat2-build -f /data/putnamlab/tconn/repeats/apul_softmasked/apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds.fa.masked Apul_genome_ref
sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/Hisat2_genome_build.sh
Submitted batch job 338266
b) Align reads to genome
nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/Hisat2_align2_genome.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mem=120GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed
#SBATCH -p putnamlab
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error_align"
#SBATCH --output="output_script_align"
module load HISAT2/2.2.1-gompi-2022a
#Aligning paired end reads
#Has the R1 in array1 because the sed in the for loop changes it to an R2. SAM files are of both forward and reverse reads
#-p increases threads on server, --rna-strandness RF forward and reverse, -dta downstream genome analysis, -q reads are fastq files, -x location of ref -S where to output sam files
for read_pair in *_R1_001.fastq.gz; do
base_name=$(basename "$read_pair" _R1_001.fastq.gz)
hisat2 -p 48 --rna-strandness RF --dta -q -x /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_refs_genome/Apul_genome_ref -1 "$read_pair" -2 "${base_name}_R2_001.fastq.gz" -S "/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_mapped_genome/${base_name}.sam"
done
sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/Hisat2_align2_genome.sh
Submitted batch job 338358
#Download alignment statistics information from mapping, will be the output from your script above, even though it is a mapped output, you are in the trimmed folder here
scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/trimmed/script_error_align_genome /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/data
d) Sort and convert sam to bam and check number of mapped reads and mapping percentages
- Explanation:
- samtools sort -o sorted.bam aligned.sam: This command sorts the SAM file and creates a sorted BAM file (Trinity_aligned.sorted.bam).
- samtools index .sorted.bam: This command creates an index for the sorted BAM file. The index file (sorted.bam.bai) is necessary for certain operations and viewers.
- samtools flagstat .sorted.bam: This command generates statistics about the alignment, including the number of mapped reads and mapping percentages.
nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/SAMtoBAM_genome.sh
#There will be lots of .tmp file versions in your folder, this is normal while this script runs and they should delete at the end to make one sorted.bam file
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=8
#SBATCH --export=NONE
#SBATCH --mem=120GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_mapped_genome
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error_sam_genomme"
#SBATCH --output="output_script_sam_genome"
module load SAMtools/1.16.1-GCC-11.3.0 #Preparation of alignment for assembly: SAMtools
# Directory containing SAM files
sam_directory="/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_mapped_genome/"
# Create a directory for the results
mkdir -p "${sam_directory}/results"
# Process each SAM file in the directory
for sam_file in "$sam_directory"/*.sam; do
if [ -f "$sam_file" ]; then
# Extract sample name from the file
sample_name=$(basename "${sam_file%.sam}")
# Step 1: Sort the SAM file and convert to BAM
samtools sort -o "${sam_directory}/results/${sample_name}_sorted.bam" "$sam_file"
# Step 2: Index the BAM file
samtools index "${sam_directory}/results/${sample_name}_sorted.bam"
# Step 3: Check Mapping Statistics
samtools flagstat "${sam_directory}/results/${sample_name}_sorted.bam" > "${sam_directory}/results/${sample_name}_flagstat.txt"
echo "Processed: ${sample_name}"
echo "------------------------"
fi
done
sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/SAMtoBAM_genome.sh
Submitted batch job 338493
Alignment statistics
interactive
module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools
for i in *.bam; do
echo "${i}" >> mapped_reads_counts_genome_Apul
samtools flagstat ${i} | grep "mapped (" >> mapped_reads_counts_genome_Apul
done
12_sorted.bam
138010934 + 0 mapped (96.43% : N/A)
16_sorted.bam
137288283 + 0 mapped (96.33% : N/A)
20_sorted.bam
143943875 + 0 mapped (97.04% : N/A)
248_sorted.bam
157752576 + 0 mapped (97.39% : N/A)
249_sorted.bam
179799724 + 0 mapped (98.67% : N/A)
24_sorted.bam
159014625 + 0 mapped (97.32% : N/A)
250_sorted.bam
155177825 + 0 mapped (98.59% : N/A)
260_sorted.bam
186770220 + 0 mapped (98.90% : N/A)
261_sorted.bam
167907087 + 0 mapped (98.61% : N/A)
262_sorted.bam
177725506 + 0 mapped (98.72% : N/A)
5_sorted.bam
118825396 + 0 mapped (90.16% : N/A)
8_sorted.bam
154489579 + 0 mapped (97.08% : N/A)
Pretty high mapping percentages (>90%) for all samples!
e) Download mapping percentages to desktop
scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_mapped_genome/results/mapped_reads_counts_genome_Apul /Users/Danielle/Desktop/Putnam_Lab/A.pul_Heatwave/bioinformatics/data/mapping_percentages_genome
7) Perform gene counts with stringTie
a) Load structural annotation for genome
/data/putnamlab/tconn/predict_results/Acropora_pulchra.gff3
b) Make directory for stringTie
mkdir stringTie_genome
cd stringTie_genome
mkdir BAM_genome GTF_genome GTF_merge_genome
c) Copy BAM files to stringTie folder
cd /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/hisat2_mapped_genome/results/
mv *_sorted.bam /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM_genome
d) Assemble and estimate reads, following vignette for transdecoder and transcriptome specific setttings.
cd BAM_genome
nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/stringTie_assemble_genome.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH --error="stringtie_genome_out_error"
#SBATCH --output="stringtie_genome_out"
module load StringTie/2.2.1-GCC-11.2.0
module load GffCompare/0.12.6-GCC-11.2.0
# Define the directory containing BAM files
F=/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM_genome
# List all BAM files in the directory
array1=($(ls $F/*bam))
# Process each BAM file
for i in ${array1[@]}; do
# Run StringTie with options -e -B -G for each BAM file
stringtie -e -B -G /data/putnamlab/tconn/predict_results/Acropora_pulchra.gff3 -o ${i}.gtf ${i}
# Print the current BAM file name for tracking
echo "${i}"
done
sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/stringTie_assemble_genome.sh
Submitted batch job 335278
e) Merge stringTie gtf results
mv *gtf ../GTF_genome/
ls *gtf > apul_mergelist.txt
cat apul_mergelist.txt
module load StringTie/2.2.1-GCC-11.2.0
stringtie --merge -p 8 -G /data/putnamlab/tconn/predict_results/Acropora_pulchra.gff3 -o stringtie_apul_merged_genome.gtf apul_mergelist_genome.txt
f) Assess assembly quality
module load GffCompare/0.12.6-GCC-11.2.0
gffcompare -r /data/putnamlab/tconn/predict_results/Acropora_pulchra.gff3 -o Apul.merged_genome stringtie_apul_merged_genome.gtf
g) Re-estimate assembly if necessary!
nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/stringTie_apul_re-assemble_genome.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH --error="re-assemble_apul_genome_out_error"
#SBATCH --output="re-assemble_apul_genome_out"
module load StringTie/2.2.1-GCC-11.2.0
module load GffCompare/0.12.6-GCC-11.2.0
# Define the directory containing BAM files
F=/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie_genome/BAM
array1=($(ls $F/*bam))
for i in ${array1[@]}
do
stringtie -e -G /data/putnamlab/tconn/predict_results/Acropora_pulchra.gff3 -o ${i}.merge.gtf ${i}
echo "${i}"
done
sbatch /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/stringTie_apul_re-assemble_genome.sh
# move merged GTF files to their own folder
mkdir GTF_merge_genome
mv *merge.gtf GTF_merge_genome
f) Create gene matrix
#making a sample txt file with all gtf file names
F=/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie_genome/GTF_merge_genome/
array2=($(ls *merge.gtf))
for i in ${array2[@]}
do
echo "${i} $F${i}" >> sample_list.txt
done
#sample_list.txt document output
#create gene matrix
nano /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/scripts/GTFtoCounts_genome.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/
#SBATCH --cpus-per-task=3
module load StringTie/2.2.1-GCC-11.2.0
module load Python/2.7.18-GCCcore-9.3.0
python prepDE.py -g Apul_genome_gene_count_matrix.csv -i sample_list.txt
sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_RNASeq_compare/scripts/GTFtoCounts.sh
g) Secure-copy gene counts onto local computer, make sure to open a seperate command shell outside of Andromeda on your own terminal
#copy gene count matrix
scp danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Pmean_RNASeq_compare/data/GTF_merge/Poc_gene_count_matrix.csv /Users/Danielle/Desktop/Putnam_Lab/