Pocillopora meandrina WGBS Bioinformatic Workflow

Alignment comparison of the Pocillopora meandrina genome to the Pocillopora verrucosa genome for E5 molecular WGBS data

Goal

The following document contains the bioinformatic pipeline used for cleaning, aligning and assembling our raw genomic DNA sequences for WGBS. The goal is to compare the alignment statistics when using the Hawaiian Pocillopora meandrina genome compared to the Red Sea Pocillopora verrucosa genome. These commands were compiled into bash scripts to run on the URI HPC Andromeda server. We will be using a subset of ~8 sequences to compare. Sequences were choosen from the highest alignment scores from the previous run.

Project overview

Bioinformatic tools used in analysis:

Check for required software on Andromeda

Update software to latest version

NF Core Methylseq
NEXTFLOW ~ version 20.04.1
nf-core/methylseq v1.5

Prepare work space

  • Upload raw reads and reference genome to server
  • Assess that your files have all uploaded correctly
  • Prepare your working directory
  • Install all necessary programs

Upload raw reads and reference genome to server

This is done with the scp or “secure copy” linux command. SCP allows the secure transferring of files between a local host and a remote host or between two remote hosts using ssh authorization.

Secure Copy (scp) Options:

  • -r - Recursively copy entire directories
scp -r xxxx <path_to_raw_reads> danielle_becker@bluewaves.uri.edu:<path_to_storage>

scp -r /Users/Danielle/Downloads/Pocillopora_meandrina_HIv1.assembly.fasta ssh danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/REFS/Pmean/genome_assembly

1) Obtain Reference Genome and Create Folder Structure

I am using the Hawaiian Pocillopora meandrina genome to compare alignment statistics.

Stephens, T. G., Lee, J., Jeong, Y., Yoon, H. S., Putnam, H. M., Majerová, E., & Bhattacharya, D. (2022). High-quality genome assembles from key Hawaiian coral species. GigaScience, 11.

Location on Andromeda, the HPC server for URI:

cd /data/putnamlab/REFS/

mkdir Pmean

All of the genome files (i.e., genome scaffolds, CDS, proteins, GFF, and functional annotations) were downloaded from the Rutgers online database.

I put them on the Putnam Lab HPC Andromeda server for future reference:

pwd /data/putnamlab/REFS/Pmean/

Make folder structure in personal pathway on Andromeda

pwd /data/putnamlab/dbecks/Becker_E5/

mkdir Pmean_WGBS_compare

cd Pmean_WGBS_compare

mkdir data/
mkdir scripts/

You can also directly download the genome files from the Rutgers online database:

#make folder for downlaoding refs in directory

cd /data/
mkdir refs

cd /refs/

wget http://cyanophora.rutgers.edu/Pocillopora_meandrina/Pocillopora_meandrina_HIv1.assembly.fasta.gz

wget http://cyanophora.rutgers.edu/Pocillopora_meandrina/Pocillopora_meandrina_HIv1.genes.gff3.gz

Unzip gff and genome file

gunzip Pocillopora_meandrina_HIv1.assembly.fasta.gz
gunzip Pocillopora_meandrina_HIv1.genes.gff3.gz

Download select files into data/raw folder

# Path where we stored the RAW fastq.gz files

/data/putnamlab/KITT/hputnam/20201206_Becker_WGBS

#make new directory for raw sequences we are using for this analysis

cd data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data
mkdir raw
cd raw/

Use code to copy files from one directory to the other

# I selected 8 files with the highest alignment rates to the P. verrucosa genome
# samples 21, 27, 12, 28, 32, 31, 1, 25

cp /data/putnamlab/KITT/hputnam/20201206_Becker_WGBS/27_* /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw

2) Check Genewiz Data for WGBS

grep -E 'gene structure "(gene)";' input_file.txt

mkdir /data/putnamlab/hputnam/Becker_E5

cd /data/putnamlab/hputnam/Becker_E5

#Sample Information



3) Check Data Transfer Fidelity

raw data - read only
/data/putnamlab/KITT/hputnam/20201206_Becker_WGBS
md5sum *.gz > URI_md5sum.txt
Genewiz md5 | sample id | URI md5
---|---|---|
de7c178629f722bbe869627bf580702a  |`1_R1_001.fastq.gz`|de7c178629f722bbe869627bf580702a
dea71bd12349e979cccdb18c634141fc  |`1_R2_001.fastq.gz`|dea71bd12349e979cccdb18c634141fc
cdf14cbb48a545f750d5c1d6dbfe838d  |`12_R1_001.fastq.gz`|cdf14cbb48a545f750d5c1d6dbfe838d
74c06455a1144bbac680f38414840c8b  |`12_R2_001.fastq.gz`|74c06455a1144bbac680f38414840c8b
1b6182d26cb3990a8c4c16704a74366f  |`21_R1_001.fastq.gz`|1b6182d26cb3990a8c4c16704a74366f
0a096ccce7a4074d0053b9a69543772b  |`21_R2_001.fastq.gz`|0a096ccce7a4074d0053b9a69543772b
8538fd3f3c5a0b28aac25a74880a0343  |`25_R1_001.fastq.gz`|8538fd3f3c5a0b28aac25a74880a0343
0c0f94d7849d7b880ae6d7a7ca1d78b9  |`25_R2_001.fastq.gz`|0c0f94d7849d7b880ae6d7a7ca1d78b9
88287eba1a92a9c694aeee9dfcf37e82  |`27_R1_001.fastq.gz`|88287eba1a92a9c694aeee9dfcf37e82
c9ff359677ded2aaa9f179681cc2cba5  |`27_R2_001.fastq.gz`|c9ff359677ded2aaa9f179681cc2cba5
9d9a8f0c4197243aac7e053857563058  |`28_R1_001.fastq.gz`|9d9a8f0c4197243aac7e053857563058
71c90daaddade7271bb793d92972510e  |`28_R2_001.fastq.gz`|71c90daaddade7271bb793d92972510e
7ed718e6214db23df2d415f887d35e25  |`31_R1_001.fastq.gz`|7ed718e6214db23df2d415f887d35e25
94a0c47fea951786bf06f93e3101f9a8  |`31_R2_001.fastq.gz`|94a0c47fea951786bf06f93e3101f9a8
0be0e18cebf3b0c52711a1cd15535bb8  |`32_R1_001.fastq.gz`|0be0e18cebf3b0c52711a1cd15535bb8
92afdc0db53a918b1bfd15c2c2ff4943  |`32_R2_001.fastq.gz`|92afdc0db53a918b1bfd15c2c2ff4943

Count the number of reads per sample

zcat *fastq.gz | echo $((`wc -l`/4)) > rawread.counts.txt
This counts reads in goups of 4 lines per read
This should match with the Genewiz summary

4) Run NF Core MethylSeq

NF Core Methylseq
NEXTFLOW ~ version 20.04.1
nf-core/methylseq v1.5

Make an input csv file following new nf-core mehthylseq directions with the path information for your files and genome.

Example input file:

| sample       | fastq_1                                                                          | fastq_2                                                                          | genome                                                                                                  |
|--------------|----------------------------------------------------------------------------------|----------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|
| C21_control  | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/21_R1_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/21_R2_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta |
| C25_control  | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/25_R1_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/25_R2_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta |
| C27_control  | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/27_R1_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/27_R2_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta |
| C28_control  | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/28_R1_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/28_R2_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta |
| C31_control  | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/31_R1_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/31_R2_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta |
| C32_control  | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/32_R1_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/32_R2_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta |
| E12_enriched | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/12_R1_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/12_R2_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta |
| E1_enriched  | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/1_R1_001.fastq.gz  | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/1_R2_001.fastq.gz  | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta |

Load to Andromeda

scp -r /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Bioinformatics/Data/WGBS/input_methylseq.csv danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw

Run the nfcore methylseq pipeline following parameters:

nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/methylseq.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/
#SBATCH -p putnamlab
#SBATCH --cpus-per-task=3

# load modules needed

module load Nextflow/22.10.1

# run nextflow methylseq

nextflow run nf-core/methylseq -profile singularity \
--aligner bismark \
--fasta /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta \
--save_reference \
--input '/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/input_methylseq.csv' \
--zymo \
--trim_poly_g \
--cytosine_report \
--relax_mismatches \
--unmapped \
--outdir Pmean_WGBS \

sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/methylseq.sh
Submitted batch job 244168 20230322

Combined QC output into 1 file with MultiQC, do not need a script due to fast computational time from trimgalore and fastqc

module load MultiQC/1.9-intel-2020a-Python-3.8.2

multiqc /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/fastqc/zips/*fastqc.zip -o /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/fastqc/multiqc/

module load MultiQC/1.9-intel-2020a-Python-3.8.2

multiqc /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/trimgalore/fastqc/zips/*fastqc.zip -o /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/trimgalore/multiqc

Download multiqc report full and for trimmed multiqc to desktop

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/trimgalore/multiqc/multiqc_report.html /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Bioinformatics/Output/WGBS/P.meandrina

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/multiqc/bismark/multiqc_report.html /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Bioinformatics/Output/WGBS/P.meandrina

Download alignment statistics and bismark reports

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/bismark/summary/bismark_summary_report.txt /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Bioinformatics/Output/WGBS/P.meandrina

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/bismark/summary/bismark_summary_report.html /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Bioinformatics/Output/WGBS/P.meandrina

I noticed that the multiqc file that was exported did not have the Bismark or Qualimap info like usual and saw this error in the slurm output:

Invalid value for '-c' / '--config': Path 'multiqc_config.yml' does not exist.

Which means that the config file was in an unidentified spot, so I need to change the config file to add /net to the singularity operation section:

nano ~/.nextflow/assets/nf-core/methylseq/nextflow.config

singularity.runOptions = '-B /glfs -B /net'

Add the -resume command to nfcore methylseq to continue and re-try the multiqc output with the new settings:

nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/methylseq.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/
#SBATCH -p putnamlab
#SBATCH --cpus-per-task=3

# load modules needed

module load Nextflow/22.10.1

# run nextflow methylseq

nextflow run nf-core/methylseq -resume \
-profile singularity \
--aligner bismark \
--fasta /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta \
--save_reference \
--input '/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/input_methylseq.csv' \
--zymo \
--trim_poly_g \
--non_directional \
--cytosine_report \
--relax_mismatches \
--unmapped \
--outdir Pmean_WGBS \

sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/methylseq.sh
Submitted batch job 244141 on 20230321

IT WORKED :)

Download multiqc report to desktop for trimmed multiqc

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/multiqc/bismark/multiqc_report.html /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Bioinformatics/Output/WGBS/P.meandrina

4) Merge strands

The Bismark coverage2cytosine command re-reads the genome-wide report and merges methylation evidence of both top and bottom strand.

nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/cov_to_cyto.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS
#SBATCH -p putnamlab
#SBATCH --cpus-per-task=3

# load modules needed

module load Bismark/0.23.1-foss-2021b

# run coverage2cytosine merge of strands

 find /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/bismark/methylation_calls/methylation_coverage/*deduplicated.bismark.cov.gz \
 | xargs basename -s _R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz \
 | xargs -I{} coverage2cytosine \
 --genome_folder /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/bismark/reference_genome/BismarkIndex \
 -o {} \
 --merge_CpG \
 --zero_based \
/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/bismark/methylation_calls/methylation_coverage/{}_R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz

sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/cov_to_cyto.sh

Scaffolds were not organized in same arrangemnet within tab files, needed to take merged files and sort them before for loop, tested on two files and it worked!

Test steps to make sure this worked on two files before writing larger for loop:

#make sorted test .cov files

bedtools sort -i 10.CpG_report.merged_CpG_evidence.cov > 10.CpG_report.merged_CpG_evidence_sorted.cov

bedtools sort -i 11.CpG_report.merged_CpG_evidence.cov > 11.CpG_report.merged_CpG_evidence_sorted.cov

#run loop to filter CpGs for 10x sorted test coverage files

for f in *merged_CpG_evidence_sorted.cov
do
  STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence_sorted.cov)
  cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4, $5, $6}}' \
  > "${STEM}"_10x_sorted_test.tab
done

#Create a file with positions found in all samples at specified coverage for two test files

module load BEDTools/2.27.1-foss-2018b

multiIntersectBed -i \
/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/10_10x_sorted_test.tab  /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/11_10x_sorted_test.tab   > /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.2samps.10x_sorted_test.bed

cat /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.2samps.10x_sorted_test.bed | awk '$4 ==2' > /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.filt.2samps.10x_sorted_test.bed

The test worked! Going to sort all .tab files so scaffold order is consistent.

Sorting the merged files so scaffolds are all in the same order and multiIntersectBed will run correctly

#run for loop using bedtools to sort all .tab files

nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/bedtools.sort.sh

#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/
#SBATCH --cpus-per-task=3

module load BEDTools/2.27.1-foss-2018b

for f in *merged_CpG_evidence.cov
do
  STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
  bedtools sort -i "${f}" \
  > "${STEM}"_sorted.cov
done

sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/bedtools.sort.sh


5) Create files for statistical analysis

Above commands were run on Hollies server folder, further commands were run on mine

Run loop to filter CpGs for 5x coverage, creating tab files with raw count for glms

for f in *_sorted.cov
do
  STEM=$(basename "${f}" _sorted.cov)
  cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 5) {print $1, $2, $3, $4, $5, $6}}' \
  > "${STEM}"_5x_sorted.tab
done

Run loop to filter CpGs for 10x coverage, creating tab files with raw count for glms


for f in *_sorted.cov
do
  STEM=$(basename "${f}" _sorted.cov)
  cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4, $5, $6}}' \
  > "${STEM}"_10x_sorted.tab
done
wc -l *5x_sorted.tab




wc -l *10x_sorted.tab


6) Create a file with positions found in all samples at specified coverage

nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/5x_intersect.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/
#SBATCH --cpus-per-task=3


# load modules needed

module load BEDTools/2.27.1-foss-2018b

multiIntersectBed -i \
/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/1_5x_sorted.tab  /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/3_5x_sorted.tab  /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/4_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/5_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/6_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/7_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/8_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/9_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/10_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/11_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/12_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/13_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/14_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/15_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/17_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/18_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/20_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/21_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/22_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/23_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/24_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/25_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/26_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/27_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/28_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/29_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/30_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/31_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/32_5x_sorted.tab > /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.29samps.5x_sorted.bed

cat /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.29samps.5x_sorted.bed | awk '$4 ==29' > /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.filt.29samps.5x_sorted.bed

sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/5x_intersect.sh

nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/10x_intersect.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/
#SBATCH --cpus-per-task=3


# load modules needed

module load BEDTools/2.27.1-foss-2018b

multiIntersectBed -i \
/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/1_10x_sorted.tab  /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/3_10x_sorted.tab  /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/4_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/5_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/6_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/7_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/8_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/9_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/10_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/11_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/12_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/13_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/14_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/15_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/17_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/18_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/20_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/21_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/22_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/23_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/24_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/25_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/26_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/27_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/28_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/29_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/30_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/31_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/32_10x_sorted.tab > /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.29samps.10x_sorted.bed

cat /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.29samps.10x_sorted.bed | awk '$4 ==29' > /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.filt.29samps.10x_sorted.bed

sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/10x_intersect.sh


7) Create bedgraphs post merge

for f in *_sorted.cov
do
  STEM=$(basename "${f}" _sorted.cov)
  cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4}}' \
  > "${STEM}"_10x_sorted.bedgraph
done


for f in *_sorted.cov
do
  STEM=$(basename "${f}" _sorted.cov)
  cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 5) {print $1, $2, $3, $4}}' \
  > "${STEM}"_5x_sorted.bedgraph
done

8) Use intersectBed to find where loci and genes intersect, allowing loci to be mapped to annotated genes

Modified Pver_genome_assembly file to only include genes in R markdown on desktop

#filtered gff3 to only include gene positions modified.gff3 > gene.gff3

awk '{if ($3 == "gene") {print}}' /data/putnamlab/REFS/Pverr/Pver_genome_assembly_v1.0_modified.gff3 > /data/putnamlab/REFS/Pverr/Pver_genome_assembly_v1.0.gene.gff3

Use intersectBed to find where loci and genes intersect, allowing loci to be mapped to annotated genes

nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/5x_intersectBed.sh
## wb: Print all lines in the second file
## a: file that ends in pos Only
## b: annotated gene list


#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/
#SBATCH --cpus-per-task=3


# load modules needed

module load BEDTools/2.27.1-foss-2018b


for i in *5x_sorted.tab
do
  intersectBed \
  -wb \
  -a ${i} \
  -b /data/putnamlab/REFS/Pverr/Pver_genome_assembly_v1.0.gene.gff3 \
  > ${i}_gene
done


sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/5x_intersectBed.sh


nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/10x_intersectBed.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/
#SBATCH --cpus-per-task=3


# load modules needed

module load BEDTools/2.27.1-foss-2018b

for i in *10x_sorted.tab
do
  intersectBed \
  -wb \
  -a ${i} \
  -b /data/putnamlab/REFS/Pverr/Pver_genome_assembly_v1.0.gene.gff3 \
  > ${i}_gene
done

sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/10x_intersectBed.sh


Intersect with file to subset only those positions found in all samples

nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/5x_intersect_final.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/
#SBATCH --cpus-per-task=3


# load modules needed

module load BEDTools/2.27.1-foss-2018b

for i in *5x_sorted.tab_gene
do
  intersectBed \
  -a ${i} \
  -b CpG.filt.29samps.5x_sorted.bed \
  > ${i}_CpG_5x_enrichment.bed
done
sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/5x_intersect_final.sh

nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/10x_intersect_final.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/
#SBATCH --cpus-per-task=3


# load modules needed

module load BEDTools/2.27.1-foss-2018b

for i in *10x_sorted.tab_gene
do
  intersectBed \
  -a ${i} \
  -b CpG.filt.29samps.10x_sorted.bed \
  > ${i}_CpG_10x_enrichment.bed
done
sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/10x_intersect_final.sh

wc -l *5x_enrichment.bed

477425 10_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 11_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 12_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 13_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 14_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 15_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 1_5x_sorted.tab_gene_CpG_5x_enrichment.bed
288120 16_5x_sorted.tab_gene_CpG_5x_enrichment.bed #not using for downstream analysis, low coverage
477425 17_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 18_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    1077 19_5x_sorted.tab_gene_CpG_5x_enrichment.bed #not using for downstream analysis, low coverage
477425 20_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 21_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 22_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 23_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 24_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 25_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    94912 2_5x_sorted.tab_gene_CpG_5x_enrichment.bed #not using for downstream analysis, low coverage
477425 26_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 27_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 28_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 29_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 30_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 31_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 32_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 3_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 4_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 5_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 6_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 7_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 8_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 9_5x_sorted.tab_gene_CpG_5x_enrichment.bed
  14229434 total



wc -l *10x_enrichment.bed

38637 10_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 1_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 11_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 12_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 13_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 14_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 15_10x_sorted.tab_gene_CpG_10x_enrichment.bed
25858 16_10x_sorted.tab_gene_CpG_10x_enrichment.bed #not using for downstream analysis, low coverage
38637 17_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 18_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    210 19_10x_sorted.tab_gene_CpG_10x_enrichment.bed #not using for downstream analysis, low coverage
38637 20_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    9866 2_10x_sorted.tab_gene_CpG_10x_enrichment.bed #not using for downstream analysis, low coverage
38637 21_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 22_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 23_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 24_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 25_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 26_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 27_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 28_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 29_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 30_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 3_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 31_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 32_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 4_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 5_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 6_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 7_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 8_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 9_10x_sorted.tab_gene_CpG_10x_enrichment.bed
  1156407 total

9) Download final .bed files to desktop for statistical analysis! :)

scp -r danielle_becker@bluewaves.uri.edu:/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/*_enrichment.bed /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/RAnalysis/Data/WGBS
Written on March 16, 2023