Pocillopora meandrina WGBS Bioinformatic Workflow
Alignment comparison of the Pocillopora meandrina genome to the Pocillopora verrucosa genome for E5 molecular WGBS data
Goal
The following document contains the bioinformatic pipeline used for cleaning, aligning and assembling our raw genomic DNA sequences for WGBS. The goal is to compare the alignment statistics when using the Hawaiian Pocillopora meandrina genome compared to the Red Sea Pocillopora verrucosa genome. These commands were compiled into bash scripts to run on the URI HPC Andromeda server. We will be using a subset of ~8 sequences to compare. Sequences were choosen from the highest alignment scores from the previous run.
Project overview
Bioinformatic tools used in analysis:
- Quality check: FastQC, MultiQC
- Quality trimming: Fastp
- Alignment to reference genome: HISAT2
- Preparation of alignment for assembly: SAMtools
- Transcript assembly and quantification: StringTie
Check for required software on Andromeda
Update software to latest version
NF Core Methylseq
NEXTFLOW ~ version 20.04.1
nf-core/methylseq v1.5
Prepare work space
- Upload raw reads and reference genome to server
- Assess that your files have all uploaded correctly
- Prepare your working directory
- Install all necessary programs
Upload raw reads and reference genome to server
This is done with the scp
or “secure copy” linux command. SCP allows the secure transferring of files between a local host and a remote host or between two remote hosts using ssh authorization.
Secure Copy (scp) Options:
- -r - Recursively copy entire directories
scp -r xxxx <path_to_raw_reads> danielle_becker@bluewaves.uri.edu:<path_to_storage>
scp -r /Users/Danielle/Downloads/Pocillopora_meandrina_HIv1.assembly.fasta ssh danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/REFS/Pmean/genome_assembly
1) Obtain Reference Genome and Create Folder Structure
I am using the Hawaiian Pocillopora meandrina genome to compare alignment statistics.
Location on Andromeda, the HPC server for URI:
cd /data/putnamlab/REFS/
mkdir Pmean
All of the genome files (i.e., genome scaffolds, CDS, proteins, GFF, and functional annotations) were downloaded from the Rutgers online database.
I put them on the Putnam Lab HPC Andromeda server for future reference:
pwd /data/putnamlab/REFS/Pmean/
Make folder structure in personal pathway on Andromeda
pwd /data/putnamlab/dbecks/Becker_E5/
mkdir Pmean_WGBS_compare
cd Pmean_WGBS_compare
mkdir data/
mkdir scripts/
You can also directly download the genome files from the Rutgers online database:
#make folder for downlaoding refs in directory
cd /data/
mkdir refs
cd /refs/
wget http://cyanophora.rutgers.edu/Pocillopora_meandrina/Pocillopora_meandrina_HIv1.assembly.fasta.gz
wget http://cyanophora.rutgers.edu/Pocillopora_meandrina/Pocillopora_meandrina_HIv1.genes.gff3.gz
Unzip gff and genome file
gunzip Pocillopora_meandrina_HIv1.assembly.fasta.gz
gunzip Pocillopora_meandrina_HIv1.genes.gff3.gz
Download select files into data/raw folder
# Path where we stored the RAW fastq.gz files
/data/putnamlab/KITT/hputnam/20201206_Becker_WGBS
#make new directory for raw sequences we are using for this analysis
cd data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data
mkdir raw
cd raw/
Use code to copy files from one directory to the other
# I selected 8 files with the highest alignment rates to the P. verrucosa genome
# samples 21, 27, 12, 28, 32, 31, 1, 25
cp /data/putnamlab/KITT/hputnam/20201206_Becker_WGBS/27_* /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw
2) Check Genewiz Data for WGBS
grep -E 'gene structure "(gene)";' input_file.txt
mkdir /data/putnamlab/hputnam/Becker_E5
cd /data/putnamlab/hputnam/Becker_E5
#Sample Information
3) Check Data Transfer Fidelity
raw data - read only
/data/putnamlab/KITT/hputnam/20201206_Becker_WGBS
md5sum *.gz > URI_md5sum.txt
Genewiz md5 | sample id | URI md5
---|---|---|
de7c178629f722bbe869627bf580702a |`1_R1_001.fastq.gz`|de7c178629f722bbe869627bf580702a
dea71bd12349e979cccdb18c634141fc |`1_R2_001.fastq.gz`|dea71bd12349e979cccdb18c634141fc
cdf14cbb48a545f750d5c1d6dbfe838d |`12_R1_001.fastq.gz`|cdf14cbb48a545f750d5c1d6dbfe838d
74c06455a1144bbac680f38414840c8b |`12_R2_001.fastq.gz`|74c06455a1144bbac680f38414840c8b
1b6182d26cb3990a8c4c16704a74366f |`21_R1_001.fastq.gz`|1b6182d26cb3990a8c4c16704a74366f
0a096ccce7a4074d0053b9a69543772b |`21_R2_001.fastq.gz`|0a096ccce7a4074d0053b9a69543772b
8538fd3f3c5a0b28aac25a74880a0343 |`25_R1_001.fastq.gz`|8538fd3f3c5a0b28aac25a74880a0343
0c0f94d7849d7b880ae6d7a7ca1d78b9 |`25_R2_001.fastq.gz`|0c0f94d7849d7b880ae6d7a7ca1d78b9
88287eba1a92a9c694aeee9dfcf37e82 |`27_R1_001.fastq.gz`|88287eba1a92a9c694aeee9dfcf37e82
c9ff359677ded2aaa9f179681cc2cba5 |`27_R2_001.fastq.gz`|c9ff359677ded2aaa9f179681cc2cba5
9d9a8f0c4197243aac7e053857563058 |`28_R1_001.fastq.gz`|9d9a8f0c4197243aac7e053857563058
71c90daaddade7271bb793d92972510e |`28_R2_001.fastq.gz`|71c90daaddade7271bb793d92972510e
7ed718e6214db23df2d415f887d35e25 |`31_R1_001.fastq.gz`|7ed718e6214db23df2d415f887d35e25
94a0c47fea951786bf06f93e3101f9a8 |`31_R2_001.fastq.gz`|94a0c47fea951786bf06f93e3101f9a8
0be0e18cebf3b0c52711a1cd15535bb8 |`32_R1_001.fastq.gz`|0be0e18cebf3b0c52711a1cd15535bb8
92afdc0db53a918b1bfd15c2c2ff4943 |`32_R2_001.fastq.gz`|92afdc0db53a918b1bfd15c2c2ff4943
Count the number of reads per sample
zcat *fastq.gz | echo $((`wc -l`/4)) > rawread.counts.txt
This counts reads in goups of 4 lines per read
This should match with the Genewiz summary
4) Run NF Core MethylSeq
NF Core Methylseq
NEXTFLOW ~ version 20.04.1
nf-core/methylseq v1.5
Make an input csv file following new nf-core mehthylseq directions with the path information for your files and genome.
Example input file:
| sample | fastq_1 | fastq_2 | genome |
|--------------|----------------------------------------------------------------------------------|----------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|
| C21_control | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/21_R1_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/21_R2_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta |
| C25_control | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/25_R1_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/25_R2_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta |
| C27_control | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/27_R1_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/27_R2_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta |
| C28_control | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/28_R1_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/28_R2_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta |
| C31_control | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/31_R1_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/31_R2_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta |
| C32_control | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/32_R1_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/32_R2_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta |
| E12_enriched | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/12_R1_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/12_R2_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta |
| E1_enriched | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/1_R1_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/1_R2_001.fastq.gz | /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta |
Load to Andromeda
scp -r /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Bioinformatics/Data/WGBS/input_methylseq.csv danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw
Run the nfcore methylseq pipeline following parameters:
nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/methylseq.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/
#SBATCH -p putnamlab
#SBATCH --cpus-per-task=3
# load modules needed
module load Nextflow/22.10.1
# run nextflow methylseq
nextflow run nf-core/methylseq -profile singularity \
--aligner bismark \
--fasta /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta \
--save_reference \
--input '/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/input_methylseq.csv' \
--zymo \
--trim_poly_g \
--cytosine_report \
--relax_mismatches \
--unmapped \
--outdir Pmean_WGBS \
sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/methylseq.sh
Submitted batch job 244168 20230322
Combined QC output into 1 file with MultiQC, do not need a script due to fast computational time from trimgalore and fastqc
module load MultiQC/1.9-intel-2020a-Python-3.8.2
multiqc /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/fastqc/zips/*fastqc.zip -o /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/fastqc/multiqc/
module load MultiQC/1.9-intel-2020a-Python-3.8.2
multiqc /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/trimgalore/fastqc/zips/*fastqc.zip -o /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/trimgalore/multiqc
Download multiqc report full and for trimmed multiqc to desktop
scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/trimgalore/multiqc/multiqc_report.html /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Bioinformatics/Output/WGBS/P.meandrina
scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/multiqc/bismark/multiqc_report.html /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Bioinformatics/Output/WGBS/P.meandrina
Download alignment statistics and bismark reports
scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/bismark/summary/bismark_summary_report.txt /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Bioinformatics/Output/WGBS/P.meandrina
scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/bismark/summary/bismark_summary_report.html /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Bioinformatics/Output/WGBS/P.meandrina
I noticed that the multiqc file that was exported did not have the Bismark or Qualimap info like usual and saw this error in the slurm output:
Invalid value for '-c' / '--config': Path 'multiqc_config.yml' does not exist.
Which means that the config file was in an unidentified spot, so I need to change the config file to add /net to the singularity operation section:
nano ~/.nextflow/assets/nf-core/methylseq/nextflow.config
singularity.runOptions = '-B /glfs -B /net'
Add the -resume command to nfcore methylseq to continue and re-try the multiqc output with the new settings:
nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/methylseq.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/
#SBATCH -p putnamlab
#SBATCH --cpus-per-task=3
# load modules needed
module load Nextflow/22.10.1
# run nextflow methylseq
nextflow run nf-core/methylseq -resume \
-profile singularity \
--aligner bismark \
--fasta /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/refs/Pocillopora_meandrina_HIv1.assembly.fasta \
--save_reference \
--input '/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/data/raw/input_methylseq.csv' \
--zymo \
--trim_poly_g \
--non_directional \
--cytosine_report \
--relax_mismatches \
--unmapped \
--outdir Pmean_WGBS \
sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/methylseq.sh
Submitted batch job 244141 on 20230321
IT WORKED :)
Download multiqc report to desktop for trimmed multiqc
scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/multiqc/bismark/multiqc_report.html /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Bioinformatics/Output/WGBS/P.meandrina
4) Merge strands
The Bismark coverage2cytosine command re-reads the genome-wide report and merges methylation evidence of both top and bottom strand.
nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/cov_to_cyto.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS
#SBATCH -p putnamlab
#SBATCH --cpus-per-task=3
# load modules needed
module load Bismark/0.23.1-foss-2021b
# run coverage2cytosine merge of strands
find /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/bismark/methylation_calls/methylation_coverage/*deduplicated.bismark.cov.gz \
| xargs basename -s _R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz \
| xargs -I{} coverage2cytosine \
--genome_folder /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/bismark/reference_genome/BismarkIndex \
-o {} \
--merge_CpG \
--zero_based \
/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/Pmean_WGBS/bismark/methylation_calls/methylation_coverage/{}_R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz
sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/cov_to_cyto.sh
Scaffolds were not organized in same arrangemnet within tab files, needed to take merged files and sort them before for loop, tested on two files and it worked!
Test steps to make sure this worked on two files before writing larger for loop:
#make sorted test .cov files
bedtools sort -i 10.CpG_report.merged_CpG_evidence.cov > 10.CpG_report.merged_CpG_evidence_sorted.cov
bedtools sort -i 11.CpG_report.merged_CpG_evidence.cov > 11.CpG_report.merged_CpG_evidence_sorted.cov
#run loop to filter CpGs for 10x sorted test coverage files
for f in *merged_CpG_evidence_sorted.cov
do
STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence_sorted.cov)
cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4, $5, $6}}' \
> "${STEM}"_10x_sorted_test.tab
done
#Create a file with positions found in all samples at specified coverage for two test files
module load BEDTools/2.27.1-foss-2018b
multiIntersectBed -i \
/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/10_10x_sorted_test.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/11_10x_sorted_test.tab > /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.2samps.10x_sorted_test.bed
cat /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.2samps.10x_sorted_test.bed | awk '$4 ==2' > /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.filt.2samps.10x_sorted_test.bed
The test worked! Going to sort all .tab files so scaffold order is consistent.
Sorting the merged files so scaffolds are all in the same order and multiIntersectBed will run correctly
#run for loop using bedtools to sort all .tab files
nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/bedtools.sort.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=danielle_becker@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/
#SBATCH --cpus-per-task=3
module load BEDTools/2.27.1-foss-2018b
for f in *merged_CpG_evidence.cov
do
STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
bedtools sort -i "${f}" \
> "${STEM}"_sorted.cov
done
sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/bedtools.sort.sh
5) Create files for statistical analysis
Above commands were run on Hollies server folder, further commands were run on mine
Run loop to filter CpGs for 5x coverage, creating tab files with raw count for glms
for f in *_sorted.cov
do
STEM=$(basename "${f}" _sorted.cov)
cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 5) {print $1, $2, $3, $4, $5, $6}}' \
> "${STEM}"_5x_sorted.tab
done
Run loop to filter CpGs for 10x coverage, creating tab files with raw count for glms
for f in *_sorted.cov
do
STEM=$(basename "${f}" _sorted.cov)
cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4, $5, $6}}' \
> "${STEM}"_10x_sorted.tab
done
wc -l *5x_sorted.tab
wc -l *10x_sorted.tab
6) Create a file with positions found in all samples at specified coverage
nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/5x_intersect.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/
#SBATCH --cpus-per-task=3
# load modules needed
module load BEDTools/2.27.1-foss-2018b
multiIntersectBed -i \
/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/1_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/3_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/4_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/5_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/6_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/7_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/8_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/9_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/10_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/11_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/12_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/13_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/14_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/15_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/17_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/18_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/20_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/21_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/22_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/23_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/24_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/25_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/26_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/27_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/28_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/29_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/30_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/31_5x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/32_5x_sorted.tab > /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.29samps.5x_sorted.bed
cat /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.29samps.5x_sorted.bed | awk '$4 ==29' > /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.filt.29samps.5x_sorted.bed
sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/5x_intersect.sh
nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/10x_intersect.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/
#SBATCH --cpus-per-task=3
# load modules needed
module load BEDTools/2.27.1-foss-2018b
multiIntersectBed -i \
/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/1_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/3_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/4_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/5_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/6_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/7_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/8_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/9_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/10_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/11_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/12_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/13_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/14_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/15_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/17_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/18_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/20_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/21_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/22_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/23_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/24_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/25_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/26_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/27_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/28_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/29_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/30_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/31_10x_sorted.tab /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/32_10x_sorted.tab > /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.29samps.10x_sorted.bed
cat /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.29samps.10x_sorted.bed | awk '$4 ==29' > /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/CpG.filt.29samps.10x_sorted.bed
sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/10x_intersect.sh
7) Create bedgraphs post merge
for f in *_sorted.cov
do
STEM=$(basename "${f}" _sorted.cov)
cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4}}' \
> "${STEM}"_10x_sorted.bedgraph
done
for f in *_sorted.cov
do
STEM=$(basename "${f}" _sorted.cov)
cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 5) {print $1, $2, $3, $4}}' \
> "${STEM}"_5x_sorted.bedgraph
done
8) Use intersectBed to find where loci and genes intersect, allowing loci to be mapped to annotated genes
Modified Pver_genome_assembly file to only include genes in R markdown on desktop
#filtered gff3 to only include gene positions modified.gff3 > gene.gff3
awk '{if ($3 == "gene") {print}}' /data/putnamlab/REFS/Pverr/Pver_genome_assembly_v1.0_modified.gff3 > /data/putnamlab/REFS/Pverr/Pver_genome_assembly_v1.0.gene.gff3
Use intersectBed to find where loci and genes intersect, allowing loci to be mapped to annotated genes
nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/5x_intersectBed.sh
## wb: Print all lines in the second file
## a: file that ends in pos Only
## b: annotated gene list
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/
#SBATCH --cpus-per-task=3
# load modules needed
module load BEDTools/2.27.1-foss-2018b
for i in *5x_sorted.tab
do
intersectBed \
-wb \
-a ${i} \
-b /data/putnamlab/REFS/Pverr/Pver_genome_assembly_v1.0.gene.gff3 \
> ${i}_gene
done
sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/5x_intersectBed.sh
nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/10x_intersectBed.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/
#SBATCH --cpus-per-task=3
# load modules needed
module load BEDTools/2.27.1-foss-2018b
for i in *10x_sorted.tab
do
intersectBed \
-wb \
-a ${i} \
-b /data/putnamlab/REFS/Pverr/Pver_genome_assembly_v1.0.gene.gff3 \
> ${i}_gene
done
sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/10x_intersectBed.sh
Intersect with file to subset only those positions found in all samples
nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/5x_intersect_final.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/
#SBATCH --cpus-per-task=3
# load modules needed
module load BEDTools/2.27.1-foss-2018b
for i in *5x_sorted.tab_gene
do
intersectBed \
-a ${i} \
-b CpG.filt.29samps.5x_sorted.bed \
> ${i}_CpG_5x_enrichment.bed
done
sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/5x_intersect_final.sh
nano /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/10x_intersect_final.sh
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/
#SBATCH --cpus-per-task=3
# load modules needed
module load BEDTools/2.27.1-foss-2018b
for i in *10x_sorted.tab_gene
do
intersectBed \
-a ${i} \
-b CpG.filt.29samps.10x_sorted.bed \
> ${i}_CpG_10x_enrichment.bed
done
sbatch /data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/scripts/10x_intersect_final.sh
wc -l *5x_enrichment.bed
477425 10_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 11_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 12_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 13_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 14_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 15_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 1_5x_sorted.tab_gene_CpG_5x_enrichment.bed
288120 16_5x_sorted.tab_gene_CpG_5x_enrichment.bed #not using for downstream analysis, low coverage
477425 17_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 18_5x_sorted.tab_gene_CpG_5x_enrichment.bed
1077 19_5x_sorted.tab_gene_CpG_5x_enrichment.bed #not using for downstream analysis, low coverage
477425 20_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 21_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 22_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 23_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 24_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 25_5x_sorted.tab_gene_CpG_5x_enrichment.bed
94912 2_5x_sorted.tab_gene_CpG_5x_enrichment.bed #not using for downstream analysis, low coverage
477425 26_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 27_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 28_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 29_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 30_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 31_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 32_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 3_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 4_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 5_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 6_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 7_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 8_5x_sorted.tab_gene_CpG_5x_enrichment.bed
477425 9_5x_sorted.tab_gene_CpG_5x_enrichment.bed
14229434 total
wc -l *10x_enrichment.bed
38637 10_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 1_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 11_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 12_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 13_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 14_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 15_10x_sorted.tab_gene_CpG_10x_enrichment.bed
25858 16_10x_sorted.tab_gene_CpG_10x_enrichment.bed #not using for downstream analysis, low coverage
38637 17_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 18_10x_sorted.tab_gene_CpG_10x_enrichment.bed
210 19_10x_sorted.tab_gene_CpG_10x_enrichment.bed #not using for downstream analysis, low coverage
38637 20_10x_sorted.tab_gene_CpG_10x_enrichment.bed
9866 2_10x_sorted.tab_gene_CpG_10x_enrichment.bed #not using for downstream analysis, low coverage
38637 21_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 22_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 23_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 24_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 25_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 26_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 27_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 28_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 29_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 30_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 3_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 31_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 32_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 4_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 5_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 6_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 7_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 8_10x_sorted.tab_gene_CpG_10x_enrichment.bed
38637 9_10x_sorted.tab_gene_CpG_10x_enrichment.bed
1156407 total
9) Download final .bed files to desktop for statistical analysis! :)
scp -r danielle_becker@bluewaves.uri.edu:/data/putnamlab/dbecks/Becker_E5/Pmean_WGBS_compare/CovtoCyto/*_enrichment.bed /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/RAnalysis/Data/WGBS