Workflow for Acropora pulchra de novo transcriptome

Designing a workflow to create a de novo transcriptome for Acropora pulchra

Goal:

Important notes about de novo transcriptomes

Raghavan et al. 2022
- “A simple guide to de novo transcriptome assembly and annotation”
- De novo transcriptome assembly, in contrast, is ‘reference-free’. The process is de novo (Latin for ‘from the beginning’) as there is no external information available to guide the reconstruction process. It must be accomplished using the information contained in the reads alone.
- This approach is useful when a genome is unavailable, or when a reference-guided assembly is undesirable.
- For instance, in opposition to a de novo assembler successfully producing a transcript, a reference-guided approach might not be able to reconstruct it correctly if it were to correspond to a region on the reference containing sequencing or assembly gaps [15, 16].

Acropora genomes

Shinzato et al. 2011
- First Acropora genome sequenced was Acropora digitifera in “Using the Acropora digitifera genome to understand coral responses to environmental change”
Shinzato et al. 2020
- Sequenced genomes of 15 Acropora species (A. acuminata, A. awi, A. cytherea, A. digitifera, A. echinata, A. florida, A. gemmifera, A. hyacinthus, A. intermedia, A. microphthalma, A. muricata, A. nasta, A. selago, A. tenuis, and A. yongei)

Acropora de novo assemblies and notes

Oldach and Vize 2018
- De novo assembly and annotation of the Acropora gemmifera transcriptome
- Used Trintiy to assemble 31.6 million combined raw reads and built into 104,000 contigs
- Kitchen et al. 2015
  - De Novo Assembly and Characterization of Four Anthozoan (Phylum Cnidaria) Transcriptomes

Other de novo transcriptome resources

Wong and Putnam Porites astreoides genome
- GitHub
- Structural annotation of the P. astreoides genome was completed on the University of Rhode Island High Performance Computer ‘Andromeda’. As input for MAKER v3.01.03 (RRID:SCR_005309) [64] we used an existing P. astreoides transcriptome from samples collected in the Florida Keys, USA [32] and existing congener P. lutea peptide sequences from a sample collected in Australia [57].
Chui et al. 2020
- “De novo transcriptome assembly from the gonads of a scleractinian coral, Euphyllia ancora: molecular mechanisms underlying scleractinian gametogenesis”

Trinity Resources and Vignette

Full-length transcriptome assembly from RNA-seq data without a reference genome. Grabherr et al. 2011

De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Haas et al. 2013

Software, documentation, and demonstrations

Acropora pulchra Gametogenesis transcriptome data files on URI andromeda:

Location on Andromeda, the HPC server for URI:

  cd /data/putnamlab/KITT/hputnam/20230825_Bermuda_Reference_Transcriptomes/

  ACRP_R1_001.fastq.gz
  ACRP_R1_001.fastq.gz.md5
  ACRP_R2_001.fastq.gz
  ACRP_R2_001.fastq.gz.md5

Copied all data files to new location on Andromeda

  cp -r /data/putnamlab/KITT/hputnam/20230825_Bermuda_Reference_Transcriptomes/ACRP* /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw

  ACRP_R1_001.fastq.gz
  ACRP_R1_001.fastq.gz.md5
  ACRP_R2_001.fastq.gz
  ACRP_R2_001.fastq.gz.md5

Acropora pulchra E5 Rules of Life project transcriptome data files on University of Washington OWL data storage HPC database

Location on OWL, the HPC server for UW:

  https://owl.fish.washington.edu/nightingales/A_pulchra/30-789513166/

Downloaded all files to Andromeda URI HPC location

  cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw/

  wget -r -nd -A .fastq.gz https://owl.fish.washington.edu/nightingales/A_pulchra/30-789513166/
  wget -r -nd -A .md5 https://owl.fish.washington.edu/nightingales/A_pulchra/30-789513166/

  ACR-140_TP2_R1_001.fastq.gz
  ACR-140_TP2_R2_001.fastq.gz
  ACR-145_TP2_R1_001.fastq.gz
  ACR-145_TP2_R2_001.fastq.gz
  ACR-150_TP2_R1_001.fastq.gz
  ACR-150_TP2_R2_001.fastq.gz
  ACR-173_TP2_R1_001.fastq.gz
  ACR-173_TP2_R2_001.fastq.gz
  ACR-178_TP2_R1_001.fastq.gz
  ACR-178_TP2_R2_001.fastq.gz
  ACR-140_TP2_R1_001.fastq.gz.md5
  ACR-140_TP2_R2_001.fastq.gz.md5
  ACR-145_TP2_R1_001.fastq.gz.md5
  ACR-145_TP2_R2_001.fastq.gz.md5
  ACR-150_TP2_R1_001.fastq.gz.md5
  ACR-150_TP2_R2_001.fastq.gz.md5
  ACR-173_TP2_R1_001.fastq.gz.md5
  ACR-173_TP2_R2_001.fastq.gz.md5
  ACR-178_TP2_R1_001.fastq.gz.md5
  ACR-178_TP2_R2_001.fastq.gz.md5

Samples QC and Qubit Results

Tube Label	RNA_ng_µl	RNA_µl	RNA µg	Link to notebook post1
ACR-140_TP2	12	90	1.08	https://kterpis.github.io/Putnam_Lab_Notebook/20211012-RNA-DNA-extractions-from-E5-project/
ACR-145_TP2	20.8	87	1.8096	https://kterpis.github.io/Putnam_Lab_Notebook/20211012-RNA-DNA-extractions-from-E5-project/
ACR-150_TP2	13	87	1.131	https://github.com/Kterpis/Putnam_Lab_Notebook/blob/master/_posts/2021-09-03-20210903-RNA-DNA-extractions-from-E5-project.md
ACR-173_TP2	11.4	87	0.9918	https://kterpis.github.io/Putnam_Lab_Notebook/20211102-RNA-DNA-extractions-from-E5-project/
ACR-178_TP2	12.2	87	1.0614	https://github.com/Kterpis/Putnam_Lab_Notebook/blob/master/_posts/2021-09-02-20210902-RNA-DNA-extractions-from-E5-project.md
ACRP-CON	43.2	24	1.0368	https://github.com/daniellembecker/DanielleBecker_Lab_Notebook/blob/master/_posts/2023-04-25-Acropora-pulchra-transcriptome-extraction-concentration.md

Workflow Steps

Trim ACRP-CON reads.

Reads for samples ACR-140, ACR-145, ACR-150, ACR-173, and ACR-178 were trimmed using the built-in version of Trimmomatic with the default settings, following the 9FastQ QC and Trimming - E5 Coral RNA-seq Data for A.pulchra protocol. All fast

Downloaded all files to Andromeda URI HPC location

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed

wget -r -nd -A .fastq.gz https://gannet.fish.washington.edu/Atumefaciens/20230519-E5_coral-fastqc-fastp-multiqc-RNAseq/A_pulchra/trimmed/

RNA-ACR-140-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
RNA-ACR-140-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
RNA-ACR-145-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
RNA-ACR-145-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
RNA-ACR-150-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
RNA-ACR-150-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
RNA-ACR-173-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
RNA-ACR-173-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
RNA-ACR-178-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
RNA-ACR-178-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz

1) Run FastQC on ACRP_R1 and ACRP_R2

a) Make folders for raw FastQC results and scripts

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/

mkdir fastqc_results

b) Write script for checking quality with FastQC and submit as job on Andromeda

nano /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/fastqc_raw.sh

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file

module load FastQC/0.11.9-Java-11

for file in /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw/*ACRP
do
fastqc $file --outdir /data/putnamlab/dbecks/DeNovo_transcriptxome/2023_A.pul/data/fastqc_results/
done

sbatch /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/fastqc_raw.sh

Submitted batch job 281440

Combined QC output into 1 file with MultiQC, do not need a script due to fast computational time

module load MultiQC/1.9-intel-2020a-Python-3.8.2

multiqc /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/fastqc_results/*fastqc.zip -o /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/fastqc_results/multiqc/

c) Copy MultiQC and FastQC files to local computer

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/fastqc_results/multiqc/multiqc_report.html /Users/Danielle/Desktop/Putnam_Lab/Gametogenesis/bioinformatics/transcriptome/original_fastqc

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/fastqc_results/*.html /Users/Danielle/Desktop/Putnam_Lab/Gametogenesis/bioinformatics/transcriptome/original_fastqc

4) Trim and clean reads

a) Make trimmed reads folder in all other results folders

mkdir data/trimmed
cd trimmed

c) Write script for Trimming and run on Andromeda

#Run fastp on files #Trims 20bp from 5’ end of all reads #Trims poly G, if present

nano /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/trim.sh

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file

module load fastp/0.19.7-foss-2018b

array1=($(ls *ACRP_R1*.fastq.gz)) #Make an array of sequences to trim
for i in ${array1[@]}; do
fastp --in1 ${i} --in2 $(echo ${i}|sed s/_R1/_R2/) --detect_adapter_for_pe --trim_poly_g --trim_front1 20 --trim_front2 20 --out1 ../trimmed/${i} --out2 ../trimmed/$(echo ${i}|sed s/_R1/_R2/)  
done

sbatch /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/trim.sh

Submitted batch job 281915

5) Check quality of trimmed files

a) Check number of files in /trimmed directory

ls -1 | wc -l
#12

b) Check number of reads in /trimmed directory and in /raw directory

# look at raw reads

zgrep -c "@A01587" ACRP* > seq_counts

ACRP_R1_001.fastq.gz:178057412
ACRP_R2_001.fastq.gz:178057412

# look at trimmed reads

zgrep -c "@A01587" ACRP* > trimmed_seq_counts

ACRP_R1_001.fastq.gz:173194930
ACRP_R2_001.fastq.gz:173194930

c) Run FastQC on trimmed data

mkdir trimmed_qc

nano /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/fastqc_trimmed.sh

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trimmed_qc
#SBATCH --error="script_error"
#SBATCH --output="output_script"

module load FastQC/0.11.8-Java-1.8

for file in /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/ACRP*
do
fastqc $file --outdir /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trimmed_qc
done

sbatch /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/fastqc_trimmed.sh

Submitted batch job 281933

d) Run MultiQC on trimmed data, Combined QC output into 1 file with MultiQC, do not need a script due to fast computational time

module load MultiQC/1.9-intel-2020a-Python-3.8.2

multiqc /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trimmed_qc/*fastqc.zip -o /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trimmed_qc/trimmed_multiqc

e) Copy multiqc and fastqc to computer, use terminal window fro desktop not in server

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trimmed_qc/trimmed_multiqc/multiqc_report.html /Users/Danielle/Desktop/Putnam_Lab/Gametogenesis/bioinformatics/transcriptome/trimmed_fastqc

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trimmed_qc/*.html /Users/Danielle/Desktop/Putnam_Lab/Gametogenesis/bioinformatics/transcriptome/trimmed_fastqc

6) Run Trinity with paired-end sequences

Trinity commands as found on Trinity vignette and Roberts Lab Guide:

Trinity = runs trinity program
seqType fq = specifys FASTQ file format
SS_lib_type RF = Trinity has the option (–SS_lib_type) to specify whether or not the sequences you’re assembly are “stranded”. User should specify this in the following fasion as on option in the Trinity command (example specifies typical stranded libraries): –SS_lib_type RF
max_memory 100G = max memory for trinity 100G should not be changed, per communications with the developer
CPU 36 = match however many CPUs your computing cluster has access to, maximum number of parallel processes (for andromeda the maximum number of cores is 36. On Unity, the nodes in the uri-cpu partition can go up to 64, and in the cpu and cpu-preempt partitions it can go to 128)
left = paired end reads filenames
right = paired end reads filenames
For non-stranded assembly, you do not provide any information about the directionality of the reads, allowing Trinity to assume an equal distribution of reads from both strands.

a) Originally ran both stranded and non-stranded in same assembly, running them separate beneath this code

nano /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/trinity.sh

#!/bin/bash
#SBATCH --job-name=20230925_trinity
#SBATCH --time=30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file


# Load Trinity module

module load Trinity/2.15.1-foss-2022a

# Run Trinity

Trinity \
--seqType fq \
--SS_lib_type RF \
--max_memory 100G \
--CPU 36 \
--left \
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/ACRP_R1_001.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-140-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-145-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-150-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-173-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-178-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz \
--right \
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/ACRP_R2_001.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-140-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-145-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-150-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-173-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-178-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz

sbatch /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/trinity.sh

Submitted batch job 282369 on 20230927 at 15:56

Trinity finished 20231011 with a FailedCommands.txt file. I submitted a GitHub issues with my output script to the Trinity GitHub Issues.

Response to issue:

It looks like there's 18 failed commands.  I'm seeing the errors like this:

/glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_8/CBin_8455/c845648.trinity.reads.fa.out/inchworm.fa.tmp':


If you find the failed_commands.txt file (or similar name) in the trinity
output directory, it should have the list of the commands it's having a
hard time with.   Each should have an output directory, like in the above
example, it would be:

/glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_8/CBin_8455/c845648.trinity.reads.fa.out

If you can remove those output directories and then rerun your Trinity
command, it should hopefully process everything just fine (unless you ran
out of disk space...)

I then removed all the corrupt files (18) listed in the FailedCommands.txt file and re-ran Trinity on 20231012 at 16:04 after changing the amount of CPUs to 20 because a number of solutions in the developer’s GitHub Issues suggest that sometimes the CPUs process things too quickly for the computer’s file system to keep up, leading to problems

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_8/CBin_8455/c845648.trinity.reads.fa.out

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_6/CBin_6008/c600976.trinity.reads.fa.out

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_8/CBin_8295/c829640.trinity.reads.fa.out

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_7/CBin_7785/c778615.trinity.reads.fa.out

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_0/CBin_139/c13942.trinity.reads.fa.out

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_7/CBin_7922/c792325.trinity.reads.fa.out

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_4/CBin_4006/c400692.trinity.reads.fa.out

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_3/CBin_3881/c388169.trinity.reads.fa.out

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_8/CBin_8455/c845648.trinity.reads.fa.out

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_5/CBin_5650/c565150.trinity.reads.fa.out

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_5/CBin_5326/c532719.trinity.reads.fa.out

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_6/CBin_6986/c698736.trinity.reads.fa.out

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_6/CBin_6598/c659988.trinity.reads.fa.out

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_6/CBin_6003/c600479.trinity.reads.fa.out

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_4/CBin_4526/c452745.trinity.reads.fa.out

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_2/CBin_2558/c255895.trinity.reads.fa.out

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_2/CBin_2424/c242423.trinity.reads.fa.out

rm -rf /glfs/brick01/gv0/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/trinity_out_dir/read_partitions/Fb_7/CBin_7230/c723106.trinity.reads.fa.out

sbatch /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/trinity.sh

Submitted batch job 285355

Trinity completed successfully!!!! When Trinity completes, it creates a ‘trinity_out_dir.Trinity.fasta’ output file (or prefix based on the output directory you specify).

Trinity groups transcripts into clusters based on shared sequence content. Such a transcript cluster is very loosely referred to as a ‘gene’. This information is encoded in the Trinity fasta accession. An example Fasta entry for one of the transcripts is formatted like so:

>TRINITY_DN1000_c115_g5_i1 len=247 path=[31015:0-148 23018:149-246]
 AATCTTTTTTGGTATTGGCAGTACTGTGCTCTGGGTAGTGATTAGGGCAAAAGAAGACAC
 ACAATAAAGAACCAGGTGTTAGACGTCAGCAAGTCAAGGCCTTGGTTCTCAGCAGACAGA
 AGACAGCCCTTCTCAATCCTCATCCCTTCCCTGAACAGACATGTCTTCTGCAAGCTTCTC
 CAAGTCAGTTGTTCACAGGAACATCATCAGAATAAATTTGAAATTATGATTAGTATCTGA
 TAAAGCA

b) Run stranded assembly to provide Trinity with information about the directionality of the reads, allowing it to consider this information during the assembly process

nano /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/trinity_stranded.sh

#!/bin/bash
#SBATCH --job-name=20240418_trinity_stranded
#SBATCH --time=30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output
#SBATCH --error="script_error_stranded" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_stranded" #once your job is completed, any final job report comments will be put in this file


# Load Trinity module

module load Trinity/2.15.1-foss-2022a

# Run Trinity

Trinity \
--seqType fq \
--SS_lib_type RF \
--max_memory 100G \
--CPU 36 \
--left \
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-140-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-145-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-150-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-173-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-178-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz \
--right \
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-140-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-145-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-150-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-173-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz,\
/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/RNA-ACR-178-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz

sbatch /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/trinity_stranded.sh

Submitted batch job 312460 on 20240418 14:24

b) Run non-stranded assembly, do not provide any information about the directionality of the reads, allowing Trinity to assume an equal distribution of reads from both strands

nano /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/trinity_nonstranded.sh

#!/bin/bash
#SBATCH --job-name=20240418_trinity_nonstranded
#SBATCH --time=30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/nonstranded_output
#SBATCH --error="script_error_nonstranded" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_nonstranded" #once your job is completed, any final job report comments will be put in this file

# Load Trinity module

module load Trinity/2.15.1-foss-2022a

# Run Trinity

Trinity \
--seqType fq \
--max_memory 100G \
--CPU 36 \
--left /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/ACRP_R1_001.fastq.gz \
--right /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trimmed/ACRP_R2_001.fastq.gz \

sbatch /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/trinity_nonstranded.sh

Submitted batch job 312458 on 20240418 14:00

7) Assessing quality of the assessment

Use Trinity toolkit utilities for a assembly quality assessment

a) Run TrinityStats.pl script for stats output, need to use the path of trinity module and then add the util folder before you can access the .pl script

This script will compute the contig Nx statistics (eg. the contig N50 value), in addition to a modification of the Nx statistic that takes into consideration transcript expression (read support) data, which we call the ExN50 statistic.

Based on the lengths of the assembled transcriptome contigs, we can compute the conventional Nx length statistic, such that at least x% of the assembled transcript nucleotides are found in contigs that are at least of Nx length. The traditional method is computing N50, such that at least half of all assembled bases are in transcript contigs of at least the N50 length value.

Ran on original assembly:

/opt/software/Trinity/2.15.1-foss-2022a/trinityrnaseq-v2.15.1/util/TrinityStats.pl trinity_out_dir.Trinity.fasta > trinity_assembly_stats

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  1019918
Total trinity transcripts:      1476390
Percent GC: 44.78

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 4395
        Contig N20: 2790
        Contig N30: 1956
        Contig N40: 1417
        Contig N50: 1021

        Median contig length: 376
        Average contig: 675.99
        Total assembled bases: 998026397

The N10 through N50 values are shown computed based on all assembled contigs. In this example, 10% of the assembled bases are found in transcript contigs at least 4,395 bases in length (N10 value), and the N50 value indicates that at least half the assembled bases are found in contigs that are at least 1,021 bases in length.

The contig N50 values can often be exaggerated due to an assembly program generating too many transcript isoforms, especially for the longer transcripts. To mitigate this effect, the script will also compute the Nx values based on using only the single longest isoform per ‘gene’:

#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 3840
        Contig N20: 2221
        Contig N30: 1479
        Contig N40: 1026
        Contig N50: 731

        Median contig length: 335
        Average contig: 565.18
        Total assembled bases: 576438524

You can see that the Nx values based on the single longest isoform per gene are lower than the Nx stats based on all assembled contigs, as expected, and even though the Nx statistic is really not a reliable indicator of the quality of a transcriptome assembly, the Nx value based on using the longest isoform per gene is perhaps better for reasons described above.

Ran on nonstranded and stranded assembly:

/opt/software/Trinity/2.15.1-foss-2022a/trinityrnaseq-v2.15.1/util/TrinityStats.pl trinity_out_dir.Trinity.fasta > trinity_assembly_stats_nonstranded

/opt/software/Trinity/2.15.1-foss-2022a/trinityrnaseq-v2.15.1/util/TrinityStats.pl trinity_out_dir.Trinity.fasta > trinity_assembly_stats_stranded

N50 nonstranded results:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  379192
Total trinity transcripts:      574858
Percent GC: 43.90

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 5001
        Contig N20: 3273
        Contig N30: 2346
        Contig N40: 1722
        Contig N50: 1256

        Median contig length: 415
        Average contig: 771.09
        Total assembled bases: 443268594

## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 3857
        Contig N20: 2295
        Contig N30: 1577
        Contig N40: 1120
        Contig N50: 815

        Median contig length: 360
        Average contig: 606.70
        Total assembled bases: 230057536

N50 Results for stranded assembly

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  670597
Total trinity transcripts:      982351
Percent GC: 44.73

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 4414
        Contig N20: 2858
        Contig N30: 2047
        Contig N40: 1498
        Contig N50: 1073

        Median contig length: 367
        Average contig: 681.05
        Total assembled bases: 669031487


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 3950
        Contig N20: 2359
        Contig N30: 1575
        Contig N40: 1074
        Contig N50: 747

        Median contig length: 321
        Average contig: 560.45
        Total assembled bases: 375834024

b) Run BUSCO ( Benchmarking Universal Single-Copy Orthologs)

Commands and overview for running BUSCO here: https://busco.ezlab.org/busco_userguide.html
uses highly conserved single-copy orthologs; evolutionary informed expectations of gene content.
appears that youu can focus a BUSCO analysis to orthologs related to your target taxa.
image below shows a BUSCO analysis comparing the crayfish targetted for tde novo transcriptome assembly to 44 other arthropod species assemblies and a single vertebrate assembly:

Theissinger et al. 2016 https://www.sciencedirect.com/science/article/abs/pii/S1874778716300137

Citation: Theissinger, K., Falckenhayn, C., Blande, D., Toljamo, A., Gutekunst, J., Makkonen, J., … & Kokko, H. (2016). De Novo assembly and annotation of the freshwater crayfish Astacus astacus transcriptome. Marine Genomics, 28, 7-10.

My first run of BUSCO for original assembly, I was getting this repeated errors in my slurm output error file:

“Message: BatchFatalError(AttributeError(“‘NoneType’ object has no attribute ‘hmmer_results_lines’”))”

The issue was that the above script was referencing busco-config.ini which had old versions (based on the old modules) instead of using the installed copy $EBROOTBUSCO/config/config.ini which has the correct paths. So I changed the line in the script below to use the new correct path and it began to run!

There was also another error that the metazoa_odb10 file was an old version and to make sure BUSCO always updates these file versions, we need to use the command -l metazoa_odb10 instead of a set file path for BUSCO to automatically update the versions. Further explanation could be found in the download and automatically update lineaages BUSCO guide section.

Run updated for original assembly:

#!/bin/bash

#SBATCH --job-name="busco"
#SBATCH --time="100:00:00"
#SBATCH --nodes 1 --ntasks-per-node=20
#SBATCH --mem=250G
##SBATCH --output="busco-%u-%x-%j"
##SBATCH --account=putnamlab
##SBATCH --export=NONE

echo "START" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/dbecks/DeNovo_transcriptome/2023_A.pul/data/trinity_out_dir/trinity_out_dir.Trinity.fasta" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${dbecks}"
busco --config "$EBROOTBUSCO/config/config.ini"  -f -c 20 --long -i "${query}" -l metazoa_odb10 -o busco_output -m transcriptome

echo "STOP" $(date)

sbatch /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/busco.sh

Submitted batch job 288779

BUSCO Results for original assembly

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/busco/busco_output
less short_summary.specific.metazoa_odb10.busco_output.txt

# BUSCO version is: 5.2.2
# The lineage dataset is: metazoa_odb10 (Creation date: 2021-02-17, number of genomes: 65, number of BUSCOs: 954)
# Summarized benchmarking in BUSCO notation for file /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trinity_out_dir/trinity_out_dir.Trinity.fasta
# BUSCO was run in mode: transcriptome

        ***** Results: *****

        C:99.9%[S:21.0%,D:78.9%],F:0.1%,M:0.0%,n:954       
        953     Complete BUSCOs (C)                        
        200     Complete and single-copy BUSCOs (S)        
        753     Complete and duplicated BUSCOs (D)         
        1       Fragmented BUSCOs (F)                      
        0       Missing BUSCOs (M)                         
        954     Total BUSCO groups searched               

BUSCO completeness looks great. Completeness looks for the presence or absence of highly conserved genes in an assembly. The aim is to have the highest percentage of genes identified in your assembly, with a BUSCO complete score above 95% considered good (we have 99.9% so yay!).

However, the complete and duplicated BUSCOs are high. Transcriptomes and protein sets that are not filtered for isoforms will lead to a high proportion of duplicates. So, next step suggestions:

Filter for isoforms
Map to closest genome (Acropora millepora)
Filter symbiont genes to check if it helps duplication

Want to also run BUSCO for nonstranded and stranded assemblies separate:

Calculate non-stranded BUSCO scores:

nano /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/busco_nonstranded.sh

#!/bin/bash

#SBATCH --job-name="busco_nonstranded"
#SBATCH --time="100:00:00"
#SBATCH --nodes 1 --ntasks-per-node=20
#SBATCH --mem=250G
##SBATCH --output="busco-%u-%x-%j"
##SBATCH --account=putnamlab
##SBATCH --export=NONE

echo "START" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/dbecks/DeNovo_transcriptome/2023_A.pul/output/nonstranded_output/trinity_out_dir.Trinity.fasta" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/dbecks/DeNovo_transcriptome/2023_A.pul/output/nonstranded_output/"
busco --config "$EBROOTBUSCO/config/config.ini"  -f -c 20 --long -i "${query}" -l metazoa_odb10 -o busco_output -m transcriptome

echo "STOP" $(date)

sbatch /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/busco_nonstranded.sh

Submitted batch job 312932

BUSCO Results for nonstranded assembly

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/nonstranded_output/busco_nonstranded/busco_output
less short_summary.specific.metazoa_odb10.busco_output.txt

# BUSCO version is: 5.2.2
# The lineage dataset is: metazoa_odb10 (Creation date: 2021-02-17, number of genomes: 65, number of BUSCOs: 954)
# Summarized benchmarking in BUSCO notation for file /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/trinity_out_dir/trinity_out_dir.Trinity.fasta
# BUSCO was run in mode: transcriptome

***** Results: *****

C:97.3%[S:18.9%,D:78.4%],F:1.8%,M:0.9%,n:954       
928     Complete BUSCOs (C)                        
180     Complete and single-copy BUSCOs (S)        
748     Complete and duplicated BUSCOs (D)         
17      Fragmented BUSCOs (F)                      
9       Missing BUSCOs (M)                         
954     Total BUSCO groups searched               

Calculate stranded BUSCO scores:

nano /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/busco_stranded.sh

#!/bin/bash

#SBATCH --job-name="busco_stranded"
#SBATCH --time="100:00:00"
#SBATCH --nodes 1 --ntasks-per-node=20
#SBATCH --mem=250G
##SBATCH --output="busco-%u-%x-%j"
##SBATCH --account=putnamlab
##SBATCH --export=NONE

echo "START" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/trinity_out_dir.Trinity.fasta" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/"
busco --config "$EBROOTBUSCO/config/config.ini"  -f -c 20 --long -i "${query}" -l metazoa_odb10 -o busco_output -m transcriptome

echo "STOP" $(date)

sbatch /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/busco_stranded.sh

Submitted batch job 313009

BUSCO Results for stranded assembly

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/busco_nonstranded/busco_output
less short_summary.specific.metazoa_odb10.busco_output.txt

# BUSCO version is: 5.2.2
# The lineage dataset is: metazoa_odb10 (Creation date: 2024-01-08, number of genomes: 65, number of BUSCOs: 954)
# Summarized benchmarking in BUSCO notation for file /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/trinity_out_dir.Trinity.fasta
# BUSCO was run in mode: transcriptome

        ***** Results: *****

        C:99.8%[S:23.9%,D:75.9%],F:0.2%,M:0.0%,n:954       
        952     Complete BUSCOs (C)                        
        228     Complete and single-copy BUSCOs (S)        
        724     Complete and duplicated BUSCOs (D)         
        2       Fragmented BUSCOs (F)                      
        0       Missing BUSCOs (M)                         
        954     Total BUSCO groups searched    

c) Assess the completeness and unique transcripts between the non-stranded and stranded de novo transcriptomes using NCBI nucleotide BLAST to compare the nucleotide sequences of our transcriptomes

Create a script to run the blastn process on both transcriptomes:

nano blast_transcriptomes.sh

#!/bin/bash
#SBATCH --job-name="trans_blastn"
#SBATCH --time="04:00:00"
#SBATCH --nodes 1 --ntasks-per-node=20
#SBATCH --mem=250G
##SBATCH --output="blastn-%u-%x-%j"
##SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/ref_trans_blast
##SBATCH --export=NONE

module load BLAST+/2.15.0-gompi-2023a

# Create BLAST databases
makeblastdb -in /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/nonstranded_output/trinity_out_dir.Trinity.fasta -dbtype nucl -out non_stranded_db

makeblastdb -in /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/trinity_out_dir.Trinity.fasta -dbtype nucl -out stranded_db

# Run BLAST searches
blastn -query /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/nonstranded_output/trinity_out_dir.Trinity.fasta -db stranded_db -out nonstranded_blast.out -outfmt 6 -evalue 1e-5

blastn -query /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/trinity_out_dir.Trinity.fasta -db non_stranded_db -out stranded_blast.out -outfmt 6 -evalue 1e-5


Submitted batch job 323906

Create a Python script to parse the BLAST results and identify unique and common transcripts:

Enter interactive mode on ssh$ by typing interactive

Enter: module load SciPy-bundle

Then run script below: python parse_blast_results.py

nano parse_blast_results.py

import pandas as pd

# Load BLAST results
nonstranded = pd.read_csv('nonstranded_blast.out', sep='\t', header=None)
stranded = pd.read_csv('stranded_blast.out', sep='\t', header=None)

# Rename columns for clarity
columns = ['qseqid', 'sseqid', 'pident', 'length', 'mismatch', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']
nonstranded.columns = columns
stranded.columns = columns

# Identify unique transcripts in non-stranded transcriptome
unique_non_stranded = set(nonstranded['qseqid']) - set(stranded['sseqid'])

# Identify unique transcripts in stranded transcriptome
unique_stranded = set(stranded['qseqid']) - set(nonstranded['sseqid'])

# Output results
with open('unique_non_stranded.txt', 'w') as f:
    for transcript in unique_non_stranded:
        f.write(transcript + '\n')

with open('unique_stranded.txt', 'w') as f:
    for transcript in unique_stranded:
        f.write(transcript + '\n')

# Print results
print("Unique non-stranded transcripts:
{}".format(len(unique_non_stranded)))
print("Unique stranded transcripts: {}".format(len(unique_stranded)))

# Print results
print "Unique non-stranded transcripts: 341

print "Unique stranded transcripts: 1049

8) Remove contaminants from both stranded and nonstranded assembly

My next step is to remove any contaminant reads from the reference transcriptomes (stranded and nonstranded). Jill already did this so following her instructions from her notebook post. From Young et al. 2024: “Raw HiFi reads first underwent a contamination screening, following the methodology in [68], using BLASTn [32, 68] against the assembled mitochondrial O. faveolata genome and the following databases: common eukaryote contaminant sequences (ftp.ncbi.nlm.nih. gov/pub/kitts/contam_in_euks.fa.gz), NCBI viral (ref_ viruses_rep_genomes) and prokaryote (ref_prok_rep_ genomes) representative genome sets”.

Create a folder for output data from removing contaminants:

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/

mkdir contam_remove

Download contaminant data sets:

Eukaryote contaminants:

wget ftp.ncbi.nlm.nih.gov/pub/kitts/contam_in_euks.fa.gz

gunzip contam_in_euks.fa.gz

For the viral and prokaryote datasets:
There’s a file /data/shared/ncbi-db/.ncbirc that gets updated to point to the current directory, which the blast* tools will automatically pick up, so you can just do -db ref_prok_rep_ genomes, for example.

For other tools that can read the ncbi databases, you can use blastdb_path -db  ref_viruses_rep_genomes to get the path, although for some reason with the nr database you need to specify -dbtype prot, i.e., blastdb_path -db  nr -dbtype prot.

Now I ran run a script that blasts the stranded and nonstranded references against these sequences. In /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts, nano blast_contam_euk.sh

#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/contam_remove/
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load BLAST+/2.13.0-gompi-2022a

echo "BLASTing reference transcriptomes against eukaryote contaminant sequences" $(date)

blastn -query /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/trinity_out_dir.Trinity.fasta -subject /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw/contam_in_euks.fa -task megablast -outfmt 6 -evalue 4 -perc_identity 90 -num_threads 15 -out /stranded/contaminant_hits_euks_rr_stranded.txt

echo "BLAST complete, remove contaminant seqs from stranded" $(date)

awk '{ if( ($4 >= 50 && $4 <= 99 && $3 >=98 ) ||
         ($4 >= 100 && $4 <= 199 && $3 >= 94 ) ||
         ($4 >= 200 && $3 >= 90) )  {print $0}
    }' contaminant_hits_euks_rr_stranded.txt > contaminants_pass_filter_euks_rr_stranded.txt

echo "Contaminant seqs removed from stranded assembly" $(date)

blastn -query /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/nonstranded_output/trinity_out_dir.Trinity.fasta -subject /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw/contam_in_euks.fa -task megablast -outfmt 6 -evalue 4 -perc_identity 90 -num_threads 15 -out /nonstranded/contaminant_hits_euks_rr_nonstranded.txt

echo "BLAST complete, remove contaminant seqs from nonstranded" $(date)

awk '{ if( ($4 >= 50 && $4 <= 99 && $3 >=98 ) ||
         ($4 >= 100 && $4 <= 199 && $3 >= 94 ) ||
         ($4 >= 200 && $3 >= 90) )  {print $0}
    }' contaminant_hits_euks_rr_nonstranded.txt > contaminants_pass_filter_euks_rr_nonstranded.txt

echo "Contaminant seqs removed from nonstranded assembly" $(date)

Submitted batch job 328533 at 13:36 07/02, ran in about three minutes

I looked at the contamination hits in R. See code here.

Now I can move forward with the blasting against viral and prok genomes. In the scripts folder: nano blastn_viral.sh

#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/contam_remove/
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load BLAST+/2.13.0-gompi-2022a

echo "Blasting stranded transcriptome reference against viral genomes to look for contaminants" $(date)

blastn -query /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/trinity_out_dir.Trinity.fasta -db ref_viruses_rep_genomes -outfmt 6 -evalue 1e-4 -perc_identity 90 -out stranded/viral_contaminant_hits_rr_stranded.txt

echo "Blast complete!" $(date)

echo "Blasting nonstranded transcriptome reference against viral genomes to look for contaminants" $(date)

blastn -query /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/nonstranded_output/trinity_out_dir.Trinity.fasta -db ref_viruses_rep_genomes -outfmt 6 -evalue 1e-4 -perc_identity 90 -out nonstranded/viral_contaminant_hits_rr_nonstranded.txt

echo "Blast complete!" $(date)

Submitted batch job 328556 at 14:05 07/02, took about thirty minutes

I looked at the contamination hits in R. See code here.

In the scripts folder: nano blastn_prok.sh

#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/contam_remove/
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load BLAST+/2.13.0-gompi-2022a

echo "Blasting stranded transcriptome reference against prokaryote genomes to look for contaminants" $(date)

blastn -query /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/trinity_out_dir.Trinity.fasta -db ref_prok_rep_genomes -outfmt 6 -evalue 1e-4 -perc_identity 90 -out stranded/prok_contaminant_hits_rr_stranded.txt

echo "Blast complete!" $(date)

echo "Blasting nonstranded transcriptome reference against prokaryote genomes to look for contaminants" $(date)

blastn -query /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/nonstranded_output/trinity_out_dir.Trinity.fasta -db ref_prok_rep_genomes -outfmt 6 -evalue 1e-4 -perc_identity 90 -out nonstranded/prok_contaminant_hits_rr_nonstranded.txt

echo "Blast complete!" $(date)

Submitted batch job 328594 at 14:05 07/02 took about thirty minutes

Cat the prok and viral results together and remove anything that has a bit score <1000.

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/contam_remove/stranded

cat viral_contaminant_hits_rr_stranded.txt prok_contaminant_hits_rr_stranded.txt > all_contaminant_hits_rr_stranded.txt
awk '$12 > 1000 {print $0}' all_contaminant_hits_rr_stranded.txt > contaminant_hits_pv_passfilter_rr_stranded.txt

wc -l contaminant_hits_pv_passfilter_rr_stranded.txt
#154867

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/contam_remove/nonstranded

cat viral_contaminant_hits_rr_nonstranded.txt prok_contaminant_hits_rr_nonstranded.txt > all_contaminant_hits_rr_nonstranded.txt
awk '$12 > 1000 {print $0}' all_contaminant_hits_rr_nonstranded.txt > contaminant_hits_pv_passfilter_rr_nonstranded.txt

wc -l contaminant_hits_pv_passfilter_rr_stranded.txt
#1979

Similarly to what I did above, I looked at the contamination hits in R. See code here.

Submitted batch job 328558 at 14:11 07/02

I also need to blast the symbiont genome information. Based on the ITS2 data, the Acropora spp from the Manava site have mostly A1 and D1 symbionts, so I’ll be using the A1 genome and the D1 genome.

Going to blast to sym genomes now. First download them:

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw
wget http://smic.reefgenomics.org/download/Smic.genome.scaffold.final.fa.gz
wget https://marinegenomics.oist.jp/symbd/download/102_symbd_genome_scaffold.fa.gz

gunzip Smic.genome.scaffold.final.fa.gz
gunzip 102_symbd_genome_scaffold.fa.gz

In the assembly scripts folder: nano blastn_sym.sh

#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/contam_remove/
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load BLAST+/2.13.0-gompi-2022a

echo "Build A1 seq db" $(date)

makeblastdb -in /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw/Smic.genome.scaffold.final.fa -dbtype nucl -out /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw/A1_db

echo "Build D1 seq db" $(date)

makeblastdb -in /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw/102_symbd_genome_scaffold.fa -dbtype nucl -out /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw/D1_db

echo "Blasting stranded transcriptome reference reads against symbiont A1 genome to look for contaminants" $(date)

blastn -query /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/trinity_out_dir.Trinity.fasta -db /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw/A1_db -outfmt 6 -evalue 1e-4 -perc_identity 90 -out stranded/sym_A1_contaminant_hits_rr_stranded.txt

echo "A1 blast complete! Now blasting stranded transcriptome reference reads against symbiont D1 genome to look for contaminants" $(date)

blastn -query /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/trinity_out_dir.Trinity.fasta -db /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw/D1_db -outfmt 6 -evalue 1e-4 -perc_identity 90 -out stranded/sym_D1_contaminant_hits_rr_stranded.txt

echo "Blasting nonstranded transcriptome reference reads against symbiont A1 genome to look for contaminants" $(date)

blastn -query /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/nonstranded_output/trinity_out_dir.Trinity.fasta -db /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw/A1_db -outfmt 6 -evalue 1e-4 -perc_identity 90 -out nonstranded/sym_A1_contaminant_hits_rr_nonstranded.txt

echo "A1 blast complete! Now blasting nonstranded transcriptome reference reads against symbiont D1 genome to look for contaminants" $(date)

blastn -query /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/nonstranded_output/trinity_out_dir.Trinity.fasta -db /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/raw/D1_db -outfmt 6 -evalue 1e-4 -perc_identity 90 -out nonstranded/sym_D1_contaminant_hits_rr_nonstranded.txt

echo "D1 blast complete!"$(date)

Submitted batch job 328568

Cat the sym blast results together for stranded and nonstranded and remove anything that has a bit score <1000.

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/contam_remove/stranded

cat sym_A1_contaminant_hits_rr_stranded.txt sym_D1_contaminant_hits_rr_stranded.txt > sym_contaminant_hits_rr_stranded.txt

awk '$12 > 1000 {print $0}' sym_contaminant_hits_rr_stranded.txt > contaminant_hits_sym_passfilter_rr_stranded.txt

wc -l contaminant_hits_sym_passfilter_rr_stranded.txt
# 10742 contaminant_hits_sym_passfilter_rr_stranded.txt

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/contam_remove/nonstranded

cat sym_A1_contaminant_hits_rr_nonstranded.txt sym_D1_contaminant_hits_rr_nonstranded.txt > sym_contaminant_hits_rr_nonstranded.txt

awk '$12 > 1000 {print $0}' sym_contaminant_hits_rr_nonstranded.txt > contaminant_hits_sym_passfilter_rr_nonstranded.txt

wc -l contaminant_hits_sym_passfilter_rr_nonstranded.txt
# 11681 contaminant_hits_sym_passfilter_rr_nonstranded.txt

Make transcript length .txt files for stranded and nonstranded references:

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output

awk 'BEGIN {OFS="\t"} /^>/ {if (seq_length > 0) print transcript_id, seq_length; split($1, a, " "); transcript_id = substr(a[1], 2); seq_length = 0; next} {seq_length += length($0)} END {if (seq_length > 0) print transcript_id, seq_length}' trinity_out_dir.Trinity.fasta | sort -k2 -nr > transcript_lengths_stranded.txt

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/nonstranded_output

awk 'BEGIN {OFS="\t"} /^>/ {if (seq_length > 0) print transcript_id, seq_length; split($1, a, " "); transcript_id = substr(a[1], 2); seq_length = 0; next} {seq_length += length($0)} END {if (seq_length > 0) print transcript_id, seq_length}' trinity_out_dir.Trinity.fasta | sort -k2 -nr > transcript_lengths_nonstranded.txt

Pretty clean when bit scores <1000 are removed. Copy this data onto my computer and remove the contaminants in the R script.

Copy the files all_contam_rem_good_strand_read_list.txt and all_contam_rem_good_nonstrand_read_list.txt that were generated from the R code mentioned above. These specific files were written starting on line 257 and 509. It contains the reads that have passed contamination filtering. I copied each file into cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/contam_remove/stranded and cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/contam_remove/nonstranded

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/contam_remove/stranded

wc -l all_contam_rem_good_strand_read_list.txt
962995 all_contam_rem_good_strand_read_list.txt

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/contam_remove/nonstranded

wc -l all_contam_rem_good_nonstrand_read_list.txt
566,929 all_contam_rem_good_nonstrand_read_list.txt

The vast majority of the transcripts are retained after contamination filtering, which is a good sign of high quality sequencing. My next step is to subset the raw stranded and nonstranded transcriptome fasta files to remove the contaminants identified above. I can do this with the seqtk subseq command. In the scripts folder: nano subseq.sh

But first, I need to remove the length information from the all_contam_rem_good_strand_read_list.txt and all_contam_rem_good_nonstrand_read_list.txt files

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/contam_remove/stranded

awk '{$2=""; print $0}' all_contam_rem_good_strand_read_list.txt > output_file_strand.txt

wc -l 962,995 output_file_strand.txt

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/contam_remove/nonstranded

awk '{$2=""; print $0}' all_contam_rem_good_nonstrand_read_list.txt > output_file_nonstrand.txt

wc -l 566929 output_file_nonstrand.txt

Write script to subset the raw stranded and nonstranded fasta files to remove the contaminants identified above.

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts
nano subseq.sh

#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/contam_remove
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load seqtk/1.3-GCC-9.3.0

echo "Subsetting stranded transcripts that passed contamination filtering" $(date)

awk 'NR==FNR{ids[$1]; next} /^>/{f=0; for(i in ids) if(index($1, i)) {f=1; break}} f' \
    /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/contam_remove/stranded/output_file_strand.txt \
    /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/trinity_out_dir.Trinity.fasta \
    | seqtk seq -l0 > /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/strand_rr_allcontam_rem.fasta

echo "Subsetting complete stranded!" $(date)

echo "Subsetting nonstranded transcripts that passed contamination filtering" $(date)

awk 'NR==FNR{ids[$1]; next} /^>/{f=0; for(i in ids) if(index($1, i)) {f=1; break}} f' \
    /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/contam_remove/nonstranded/output_file_nonstrand.txt \
    /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/nonstranded_output/trinity_out_dir.Trinity.fasta \
    | seqtk seq -l0 > /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/nonstranded_output/nonstrand_rr_allcontam_rem.fasta

echo "Subsetting complete nonstranded!" $(date)

Submitted batch job 329322 at 9:48

Identify the unique transcript IDs found in the nonstranded assemmbly not found in the stranded assembly

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/

mkdir final_assembly

cd final_assembly

nano /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/concat.sh

#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL # email notifications
#SBATCH --mail-user=danielle_becker@uri.edu # your email
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

# Load necessary modules
module load SeqKit/2.3.1

# Define file paths
UNIQUE_IDS="/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/ref_trans_blast/unique_non_stranded.txt"
NONSTRANDED_FASTA="/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/nonstranded_output/nonstrand_rr_allcontam_rem.fasta"
STRANDED_FASTA="/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/strand_rr_allcontam_rem.fasta"
OUTPUT_FASTA="/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/combined_assemblies.fasta"

# Extract unique non-stranded sequences
seqkit grep -f $UNIQUE_IDS $NONSTRANDED_FASTA > unique_non_stranded_sequences.fasta

# Concatenate the unique non-stranded sequences with the stranded sequences
cat unique_non_stranded_sequences.fasta $STRANDED_FASTA > $OUTPUT_FASTA

# Clean up intermediate files
rm unique_non_stranded_sequences.fasta

echo "Unique non-stranded sequences have been extracted and concatenated with the stranded sequences."

sbatch /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/concat.sh

Submitted batch job 332872

# look for any duplicate IDs in newly assembled .fasta:

grep "^>" /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/combined_assemblies.fasta | sort | uniq -d

# two duplicates found:
>TRINITY_DN189409_c0_g1_i1 len=211 path=[0:0-210]
>TRINITY_DN297413_c0_g1_i1 len=237 path=[0:0-236]

# removed two duplicates:

awk '/^>/ {if(seen[$0]++) next} {print}' combined_assemblies.fasta > combined_assemblies_cleaned.fasta

# remove duplicates with different lengths, the one with length 277 came from nonstranded assembly:

awk '/^>/ {if ($0 == ">TRINITY_DN113515_c0_g1_i1 len=277 path=[0:0-276]") {p=0} else {p=1}} p' combined_assemblies.fasta > cleaned_assemblies.fasta

# check number of transcripts in final assembly:

grep -c "^>" /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/combined_assemblies.fasta

963413 which makes sense because stranded output fasta had 963075 and 341 unique nonstranded minus the three duplicates

Deciding to move forward with just the stranded assembly, as not many more being added with the nonstranded, which can cause downstream errors later in alignment when stranded needs to be specified.

Run BUSCO for final stranded assembly

nano /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/busco_final.sh

#!/bin/bash

#SBATCH --job-name="busco_final"
#SBATCH --time="100:00:00"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
#SBATCH --mem=120G
##SBATCH --output="busco-%u-%x-%j"
##SBATCH --account=putnamlab
##SBATCH --export=NONE

echo "START" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/strand_rr_allcontam_rem.fasta" # set this to the query (genome/transcriptome) you are running

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/"

# Run BUSCO with the --offline flag and specify the download path
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 -i "${query}" -l metazoa_odb10 -o busco_output -m transcriptome --offline --download_path "${busco_shared}/downloads/"

echo "STOP" $(date)

sbatch /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/busco_final.sh

Submitted batch job 334841

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/busco_output
less short_summary.specific.metazoa_odb10.busco_output.txt

***** Results: *****

C:99.8%[S:23.8%,D:76.0%],F:0.1%,M:0.1%,n:954       
952     Complete BUSCOs (C)                        
227     Complete and single-copy BUSCOs (S)        
725     Complete and duplicated BUSCOs (D)         
1       Fragmented BUSCOs (F)                      
1       Missing BUSCOs (M)                         
954     Total BUSCO groups searched

9) Use CD-HIT to reduce redundancy and improve transcriptome assembly quality

a) Run CD-HIT-EST

CD-HIT-EST stands for Cluster Database at High Identity with Tolerance for ESTs (Expressed Sequence Tags). The program (cd-hit-est) takes a FASTA format sequence database as input and produces a set of ‘non-redundant’ (nr) representative sequences as output. The goal is to reduce the overall size of the database while preserving sequence information by clustering sequences that are highly similar. This process eliminates redundancy by grouping similar sequences into clusters, with only one representative sequence retained per cluster. The resulting database is called non-redundant (nr) as it contains unique representatives of similar sequences, thus simplifying the dataset and focusing on diversity within the data. Essentially, cd-hit-est organizes sequences into clusters, reducing redundancy and making it easier to analyze large sequence datasets.

nano /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/cd-hit-est.sh

#!/bin/bash
#SBATCH -t 6-00:00:00          # Time limit in days-hours:minutes:seconds
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/CD-HIT
#SBATCH --cpus-per-task=8      # Adjusted to match -T
#SBATCH --output=slurm-%A.out

# Load CD-HIT module
module load CD-HIT/4.8.1-GCC-11.3.0

# Define input and output file paths
INPUT_FILE="/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/strand_rr_allcontam_rem.fasta"
OUTPUT_FILE="/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/CD-HIT/stranded_assembly_cdhit.fasta"

# Run CD-HIT
cd-hit-est -i $INPUT_FILE -o $OUTPUT_FILE -c 0.95 -n 10 -d 0 -M 16000 -T 8

sbatch /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/cd-hit-est.sh

Submitted batch job 334849

CD-HIT-EST Output file is: stranded_assembly_cdhit.fasta

total seq: 963075
longest and shortest : 51090 and 164
Total letters: 625126251
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 759M
Buffer          : 8 X 34M = 275M
Table           : 2 X 32M = 64M
Miscellaneous   : 17M
Total           : 1116M

b) Asses CD-HIT-EST output with N50 and BUSCO

Run N50 on cdhit assembly:

/opt/software/Trinity/2.15.1-foss-2022a/trinityrnaseq-v2.15.1/util/TrinityStats.pl stranded_assembly_cdhit.fasta > trinity_cdhit_assembly_stats

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  626101
Total trinity transcripts:      780970
Percent GC: 44.37

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 3595
        Contig N20: 2345
        Contig N30: 1642
        Contig N40: 1154
        Contig N50: 812

        Median contig length: 343
        Average contig: 590.67
        Total assembled bases: 461293650

The N10 through N50 values are shown computed based on all assembled contigs. In this example, 10% of the assembled bases are found in transcript contigs at least 3,594 bases in length (N10 value), and the N50 value indicates that at least half the assembled bases are found in contigs that are at least 812 bases in length.

#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 3374
        Contig N20: 2085
        Contig N30: 1393
        Contig N40: 950
        Contig N50: 673

        Median contig length: 321
        Average contig: 535.54
        Total assembled bases: 335300440

Run BUSCO on cdhit assembly:

nano /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/busco_cdhit.sh

#!/bin/bash

#SBATCH --job-name="busco_final"
#SBATCH --time="100:00:00"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=36
#SBATCH --mem=120G
##SBATCH --output="busco-%u-%x-%j"
##SBATCH --account=putnamlab
##SBATCH --export=NONE

echo "START" $(date)

export NUMEXPR_MAX_THREADS=36

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/CD-HIT/stranded_assembly_cdhit.fasta" # set this to the query (genome/transcriptome) you are running

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/CD-HIT/"

# Run BUSCO with the --offline flag and specify the download path
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 -i "${query}" -l metazoa_odb10 -o busco_output -m transcriptome --offline --download_path "${busco_shared}/downloads/"

echo "STOP" $(date)

sbatch /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/busco_cdhit.sh

Submitted batch job 334850

cd /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/CD-HIT/busco_output
less short_summary.specific.metazoa_odb10.busco_output.txt

--------------------------------------------------
|Results from dataset metazoa_odb10               |
--------------------------------------------------
|C:99.8%[S:29.4%,D:70.4%],F:0.1%,M:0.1%,n:954     |
|952    Complete BUSCOs (C)                       |
|280    Complete and single-copy BUSCOs (S)       |
|672    Complete and duplicated BUSCOs (D)        |
|1      Fragmented BUSCOs (F)                     |
|1      Missing BUSCOs (M)                        |
|954    Total BUSCO groups searched               |
--------------------------------------------------

10) Map to Acropora pulchra incomplete reference genome

a) Obtain Reference Genome and Create Folder Structure

I am using the Moorea, French Polynesia incomplete Acropora pulchra genome to compare alignment statistics.

Location on Andromeda, the HPC server for URI:

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds.fa

Bioinformatic workflow made by Jill Ashey as she works on reference genome assembly

b) Generate genome build

HiSat2 Align reads to refernece transcriptome

HiSat2 HiSat2 Github

nano /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/Hisat2_genome_build_Apul.sh

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/ref_genome_Apul
#SBATCH --error="script_error"
#SBATCH --output="output_script"

module load HISAT2/2.2.1-gompi-2022a

# index the reference genome for Apul output index to working directory
hisat2-build -f /data/putnamlab/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds.fa ./Apul_ref # called the reference genome (scaffolds)
echo "Reference genome indexed. Starting alignment" $(date)

sbatch /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/Hisat2_genome_build_Apul.sh

Submitted batch job 333445

c) Align transcriptome to genome

mkdir data/mapped

Trinity assembly in FASTA format and you want to align it to a reference genome, you would generally use a different tool like BLAT, GMAP, or minimap2, as they are designed for aligning longer sequences like transcripts or assembled contigs.
Using minimap2
- In this example, -ax splice specifies that the input is spliced transcripts, -uf specifies the reference genome, and the output is redirected to a SAM file (Trinity_aligned.sam).

nano /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/minimap_align2_Apul.sh

#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/ref_genome_Apul/mapped
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error"
#SBATCH --output="output_script"


module load minimap2/2.24-GCCcore-11.3.0

#Aligning to Trinity output file

minimap2 -d /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/ref_genome_Apul/reference_index.idx /data/putnamlab/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds.fa

minimap2 -ax splice -uf /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/ref_genome_Apul/reference_index.idx /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/final_assembly/CD-HIT/stranded_assembly_cdhit.fasta > trinity_aligned.sam

sbatch /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/minimap_align2_Apul.sh

Submitted batch job 333831

d) Sort and convert sam to bam and check number of mapped reads and mapping percentages

Explanation:
- samtools sort -o Trinity_aligned.sorted.bam Trinity_aligned.sam: This command sorts the SAM file and creates a sorted BAM file (Trinity_aligned.sorted.bam).
- samtools index Trinity_aligned.sorted.bam: This command creates an index for the sorted BAM file. The index file (Trinity_aligned.sorted.bam.bai) is necessary for certain operations and viewers.
- samtools flagstat Trinity_aligned.sorted.bam: This command generates statistics about the alignment, including the number of mapped reads and mapping percentages.

nano /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/SAMtoBAM_Apul.sh

#There will be lots of .tmp file versions in your folder, this is normal while this script runs and they should delete at the end to make one sorted.bam file

#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=8
#SBATCH --export=NONE
#SBATCH --mem=120GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/ref_genome_Apul/mapped
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error"
#SBATCH --output="output_script"

module load SAMtools/1.16.1-GCC-11.3.0 #Preparation of alignment for assembly: SAMtools

# Directory containing SAM files
sam_directory="/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/data/ref_genome_Apul/mapped"

# Process each SAM file in the directory
for sam_file in "$sam_directory"/*.sam; do
    if [ -f "$sam_file" ]; then
        # Extract sample name from the file
        sample_name=$(basename "${sam_file%.sam}")

        # Step 1: Sort the SAM file and convert to BAM
        samtools sort -o "${sam_directory}/${sample_name}_sorted.bam" "$sam_file"

        # Step 2: Index the BAM file
        samtools index "${sam_directory}/${sample_name}_sorted.bam"

        # Step 3: Check Mapping Statistics
        samtools flagstat "${sam_directory}/${sample_name}_sorted.bam" > "${sam_directory}/${sample_name}_flagstat.txt"

        echo "Processed: ${sample_name}"
        echo "------------------------"
    fi
done

sbatch /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/scripts/SAMtoBAM_Apul.sh

Submitted batch job 334118

Mapping Percentages

interactive

module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools

for i in *sorted.bam; do
    echo "${i}" >> mapped_reads_counts_trinity
    samtools flagstat ${i} | grep "mapped (" >> mapped_reads_counts_trinity
done

trinity_aligned_sorted.bam
756286 + 0 mapped (67.93% : N/A)

Transcriptome reference assembly complete! Next step is to make the functional annotation, all steps outlined in the Acropora pulchra functional annotation workflow and complete transcript expression quantification using RSEM for gene counts in this workflow.

Written on August 31, 2023