BS-SNPer on WGBS data

Code is inspired by this post, this post, and this post.

In this notebook post, I’ll use BS-Snper to call SNP variants from the Pocillopora meandrina and grandis WGBS data from the Molecular Underpinnings project. Adult colonies originated from a control and nutrient enriched reef site in Moorea, French Polynesia. DNA was extracted from whole tissue for WGBS. Reads were aligned to the P. verrucosa genome through the nf-ccore methylseq pipeline.

Other lab members, Kevin Wong and Emma Strand have tried to run BS-SNPer on their data but keep encountering this error: Too many characters in one row! Try to split the long row into several short rows (fewer than 1000000 characters per row). Error! at /opt/software/BS-Snper/1.0-foss-2021b/bin/BS-Snper.pl line 110.

I am going to try an approach with my samples, with the methods used by Steven Roberts in this post.

1. Set working directory

cd /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS

mkdir BS-SNPer

2. Identify SNP variants

I will identify variants in individual files, as well as SNPs across all samples.

2a. Merge SNP variants

First I need to sort all the deduplicated bam files

cd /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/bismark_deduplicated

nano /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/scripts/sort_bam.sh

#!/bin/bash
#SBATCH --job-name="sort_bam"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=120GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/bismark_deduplicated

# load modules needed
module load SAMtools/1.16.1-GCC-11.3.0

# List of files to exclude
EXCLUDE_FILES=("2_R1_001_val_1_bismark_bt2_pe.deduplicated.bam" "16_R1_001_val_1_bismark_bt2_pe.deduplicated.bam" "19_R1_001_val_1_bismark_bt2_pe.deduplicated.bam")

for f in *.deduplicated.bam
do
  # Check if the current file is in the exclude list
  if [[ ! " ${EXCLUDE_FILES[@]} " =~ " ${f} " ]]; then
    STEM=$(basename "${f}" _R1_001_val_1_bismark_bt2_pe.deduplicated.bam)
    samtools sort "${f}" \
    -o "${STEM}".deduplicated_sorted.bam
  fi
done

sbatch /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/scripts/sort_bam.sh

Submitted batch job 302471

To identify SNPs across all samples, I need to merge my samples, then use that as the input file for BS-Snper.

nano /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/scripts/merge_bam.sh

#!/bin/bash
#SBATCH --job-name="merge_snp"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=120GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/bismark_deduplicated

# load modules needed

module load SAMtools/1.16.1-GCC-11.3.0

# Merge Samples with SAMtools

samtools merge \
BS_SNPer_merged.bam \
*sorted.bam

sbatch /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/scripts/merge_bam.sh

Submitted batch job 334278

View output file header

This feature is only in the older module of samtools

module load SAMtools/1.9-foss-2018b

samtools view BS-SNPer_merged.bam | head

GWNJ-1013:120:GW201202000:2:2318:21766:34992_1:N:0:GGCTTAAG+GGTCACGA	163	Pver_Sc0000000_size2095917	28	42	130M	=	240	342	ATATCTAACACATTAAAACTTCCTAAATCAAAAAAAAATTCCCTCTACATCAAACACAACAAACAACATTTTAATAAAAACACACATATATATTTCTTACCCCATTTTCAATAATAATAACAAATCAATA	FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFF:F,FFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFF	NM:i:24	MD:Z:7G1G1G2G16G2G0G10G5G8G10G3G0G5G25G0G1G2G2G4G1T0G0G1G0	XM:Z:.......h.z.z..h................h..hh..........x.....x........x..........h...hh.....z.........................zx.h..h..h....h..hh.hXR:Z:GA	XG:Z:GA

GWNJ-1013:120:GW201202000:2:1535:13349:28714_1:N:0:TTACAGGA+GCTTGTCA	99	Pver_Sc0000000_size2095917	42	42	24M1D106M	=	378	465	GAAATTTTTTAAATTAAGAAGGAATTTTTTTGTATTAGATATAATAGATAATATTTTGATAGGAATATGTATATATATTTTTTGTTTTATTTTTGGTGATGATGATAAGTTGGTGTTGTGTTATTTTAGT	FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFF:FFFFFFFFFFFFFFFF:F:FFFFF,FFF::FF,FFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFF	NM:i:31	MD:Z:4C2C0C5C9^T1C0C0C1C2C2C3C1C2C3C2C13C1C1C10C2A0C0C0C0C5C11C15C1C2C2C0	XM:Z:....h..hh.....h..........hhh.x..h..x...h.h..x...h..h.............h.z.h..........h...hhhh.....z...........h...............h.h..x..h	XR:Z:CT	XG:Z:CT

GWNJ-1013:120:GW201202000:2:1117:2908:33927_1:N:0:TCTCTACT+GAACCGCG	163	Pver_Sc0000000_size2095917	68	6	130M	=	196	259	CCCTCTACATCAAACACAACAAACAACATTTTAATAAAAACACACATATATATTTCTTACCCCATTTTCAATAATAATAACAAATCAATATTATATCACTTCAACAAATACCTTTATCAACATAAAAATA	FFF,F,FFFFFFFF::FFF:FFFFFFFF:FFFFF,FFFFFFFFF:FFF,F,F,FF,,FFFFFFF,FFFFFFFFFFFF:FFF:FF,FFF,FFFFFFF:FF,F:FFF,FFFFFF,FFF,FFF,:,FFFFFFF	NM:i:25	MD:Z:6G5G8G10G3G0G5G25G0G1G2G2G4G1T0G0G1G2G1G8G5G9G4G2G1G0	XM:Z:......x.....x........x..........h...hh.....z.........................zx.h..h..h....h..hh.h..h.h........x.....h.........x....h..h.hXR:Z:GA	XG:Z:GA

GWNJ-1013:120:GW201202000:2:2622:23312:3787_1:N:0:TTACAGGA+GCTTGTCA	163	Pver_Sc0000000_size2095917	77	21	129M	=	346	399	TCAAACACAACAAACAACATTTTAATAAAAACACACATATATATTTCTTACCCCATTTTCAATAATAATAACAAATCAATATTATGTCACTTCAACAAATACCTTTATCAACATAAAAATAACAAAATT	FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:	NM:i:24	MD:Z:3G8G10G3G0G5G25G0G1G2G2G4G1T0G0G1G2G10G5G9G4G2G1G2G5	XM:Z:...x........x..........h...hh.....z.........................zx.h..h..h....h..hh.h..h.H........x.....h.........x....h..h.h..z.....	XR:Z:GA	XG:Z:GA

2b. Identify SNPs

Options for the script are found here and instructions are taken from BS-SNPer GitHub descriptions.

Usage:

You can run BS-SNPer in Linux or MAC OS, using the command:

perl BS-Snper.pl <sorted_bam_file> --fa <reference_file> --output <snp_result_file> --methcg <meth_cg_result_file> --methchg <meth_chg_result_file> --methchh <meth_chh_result_file> --minhetfreq 0.1 --minhomfreq 0.85 --minquali 15 --mincover 10 --maxcover 1000 --minread2 2 --errorate 0.02 --mapvalue 20 >SNP.out 2>ERR.log

Attention:

Both of the input and output file arguments should be passed to BS-SNPer in the form of absolute paths. Providing absolute paths ensures that BS-SNPer can accurately locate and process the specified input and output files, regardless of the current working directory or other factors.

Options:

--fa: Reference genome file in fasta format
--output: Temporary file storing SNP candidates
--methcg: CpG methylation information
--methchg: CHG methylation information
--methchh: CHH methylation information
--minhetfreq: Threshold of frequency for calling heterozygous SNP
--minhomfreq: Threshold of frequency for calling homozygous SNP
--minquali: Threshold of base quality
--mincover: Threshold of minimum depth of covered reads
--maxcover: Threshold of maximum depth of covered reads
--minread2: Minimum mutation reads number
--errorate: Minimum mutation rate
--mapvalue: Minimum read mapping value
SNP.out: Final SNP result file
ERR.log: Log file

Write script for BS-SNPer approach for our data

nano /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/scripts/bs_snper_merged_steven.sh

Following Stevens script:

#!/bin/bash
#SBATCH --job-name="BS_snper"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=120GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output

module load BS-Snper/1.0-foss-2021b

perl /opt/software/BS-Snper/1.0-foss-2021b/bin/BS-Snper.pl \
/data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/bismark_deduplicated/BS_SNPer_merged.bam \
--fa /data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/data/refs/Pverr/Pver_genome_assembly_v1.0.fasta \
--output /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/SNP-candidates.out \
--methcg /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/CpG-meth-info.tab \
--methchg /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/CHG-meth-info.tab \
--methchh /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/CHH-meth-info.tab \
--minhetfreq 0.1 \
--minhomfreq 0.85 \
--minquali 15 \
--mincover 10 \
--maxcover 1000 \
--minread2 2 \
--errorate 0.02 \
--mapvalue 20 \
> /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/SNP-results.vcf 2>/data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/merged.ERR.log

sbatch /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/scripts/bs_snper_merged_steven.sh

Submitted batch job 334300

Current issue:

Unknown option: input
FLAG: 1
refSeqFile = /data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/data/refs/Pverr/Pver_genome_assembly_v1.0.fasta.
bamFileName = /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/bismark_deduplicated/BS_SNPer_merged.bam.
snpFileName = /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/SNP-candidates.txt.
methCgFileName = /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/CpG-meth-info.tab.
methChgFileName = /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/CHG-meth-info.tab.
methChhFileName = /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/CHH-meth-info.tab.
vQualMin = 15.
nLayerMax = 1000.
vSnpRate = 0.100000.
vSnpPerBase = 0.020000.
mapqThr = 20.
Too many characters in one row! Try to split the long row into several short rows (fewer than 1000000 characters per row).
Error! at /opt/software/BS-Snper/1.0-foss-2021b/bin/BS-Snper.pl line 110.

Possible reason for error: The limit for the number of characters per row is 1,00,000 and our input fasta file has a line that is 2,095,918. Research computing patched the module and BS-SNPer.pl script to accept 3,000,000. I reran the Steven approach script.

Python script used to determine number of characters:

python /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/scripts/max_ln.py /data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/data/refs/Pverr/Pver_genome_assembly_v1.0.fasta

sbatch /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/scripts/bs_snper_merged_steven.sh

Submitted batch job 303588

Received another error:

Unknown option: input
FLAG: 1
refSeqFile = /data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/data/refs/Pverr/Pver_genome_assembly_v1.0.fasta.
bamFileName = /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/bismark_deduplicated/BS_SNPer_merged.bam.
snpFileName = /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/SNP-candidates.txt.
methCgFileName = /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/CpG-meth-info.tab.
methChgFileName = /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/CHG-meth-info.tab.
methChhFileName = /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/CHH-meth-info.tab.
vQualMin = 15.
nLayerMax = 1000.
vSnpRate = 0.100000.
vSnpPerBase = 0.020000.
mapqThr = 20.
Too many chromosomes!
Error! at /opt/software/BS-Snper/1.0-foss-2021b/bin/BS-Snper.pl line 110.

Possible reason for this error: The .pl script had a limit of 10k chromosomes. Research computing patched it to accept 30k (same multiplier as the line length). I then reran the script.

sbatch /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/scripts/bs_snper_merged_steven.sh

Submitted batch job 303600

IT IS RUNNING YAY!

Another error after the script ran:

At the end of `merged.ERR.log` it says it failed to call a copy of samtools it expected to be local to the BS-Snper install.

Research computing modified the install recipe to copy it to where it’s expecting to find it.

sbatch /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/scripts/bs_snper_merged_steven.sh

Submitted batch job 303619

The script ran without errors! The output .vcf files match the SNP output file standard VCF format.

The output files include an SNP output file and a methylation output file. The SNP output file has a standard VCF format. The methylation output file has a tab-separated format same as MethylExtract (https://sourceforge.net/projects/methylextract/files/MethylExtract_1.4/ManualMethylExtract.pdf/download):

  1. CHROM: Chromosome.
  2. POS: Sequence context most 5’ position on the Watson strand (1-based).
  3. CONTEXT: Sequence contexts with the SNVs annotated using the IUPAC nucleotide ambiguity code (referred to the Watson strand).
  4. Watson METH: The number of methyl-cytosines (referred to the Watson strand).
  5. Watson COVERAGE: The number of reads covering the cytosine in this sequence context (referred to the Watson strand).
  6. Watson QUAL: Average PHRED score for the reads covering the cytosine (referred to the Watson strand).
  7. Crick METH: The number of methyl-cytosines (referred to the Watson strand).
  8. Crick COVERAGE: The number of reads covering the guanine in this context (referred to the Watson strand).
  9. Crick QUAL: Average PHRED score for the reads covering the guanine (referred to the Watson strand).`

View SNP-results.vcf file:

head /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/SNP-results.vcf

##fileDate= 20240220 ##bssnperVersion=1.1 ##bssnperCommand=--fa /data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/data/refs/Pverr/Pver_genome_assembly_v1.0.fasta --input /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/bismark_deduplicated/BS_SNPer_merged.bam --output /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/SNP-candidates.out --methcg /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/CpG-meth-info.tab --methchg /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/CHG-meth-info.tab --methchh /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/CHH-meth-info.tab --minhetfreq 0.1 --minhomfreq 0.85 --minquali 15 --mincover 10 --maxcover 1000 --minread2 2 --errorate 0.02 --mapvalue 20 ##reference=file:///data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/data/refs/Pverr/Pver_genome_assembly_v1.0.fasta ##Bisulfite=directional> ##contig=<ID=Pver_Sc0000000_size2095917,length=2095917> ##contig=<ID=Pver_Sc0000001_size2081954,length=2081954> ##contig=<ID=Pver_Sc0000002_size1617595,length=1617595> ##contig=<ID=Pver_Sc0000003_size1576134,length=1576134>

Count number of lines in the SNP-results.vcf file:

wc -l /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/SNP-results.vcf

4,754,830

Following Stevens code, use the grep function to select for specific SNP mutations where the reference allele is C and the alternate allele is G to find CT SNPs from the output file:

grep $'C\tT' SNP-results.vcf > CT-SNP.vcf

wc -l CT-SNP.vcf

659,639 SNPs out of 4,754,830 = 13%

2c. Intersect VCF with SNP locations and CG motif track

Use BEDtools intersectBED to determine which SNPs are present at CG sites:

You first need to make CG motif file from your origin genome to filter for just CG motifs using EMBOSS fuzznuc function.

nano /data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/data/refs/Pverr/fuzznuc.sh

#!/bin/bash
#SBATCH --job-name="fuzznuc"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=120GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/data/refs/Pverr

module load EMBOSS/6.6.0-foss-2018b

fuzznuc \
-sequence /data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/data/refs/Pverr/Pver_genome_assembly_v1.0.fasta \
-pattern CG \
-outfile 20240813_fuzznuc_pverr_CGmotif.gff \
-rformat gff

sbatch /data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/data/refs/Pverr/fuzznuc.sh

Submitted batch job 334354

Then use BEDtools intersectBED to determine which SNPs are present at CG sites:

nano /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/scripts/intersectbed.SNPs.sh

#!/bin/bash
#SBATCH --job-name="BEDtools"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=120GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output

module load BEDTools/2.30.0-GCC-11.3.0

intersectBed \
-u \
-a /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/SNP-results.vcf \
-b /data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/data/refs/Pverr/20240813_fuzznuc_pverr_CGmotif.gff \
| grep "C	T" \
> /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/CT-SNPs.tab

sbatch /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/scripts/intersectbed.SNPs.sh

Submitted batch job 334357

Look at CT SNPs output file:

head CT-SNPs.tab

Components of output file:

  1. Chromosome/Contig: The first column of the output file is likely to contain information about the chromosome or contig where the SNP is located. This information is often found in the VCF file (SNP-results.vcf).
  2. Position: The second column represents the position of the SNP on the chromosome or contig.
  3. ID/Name: The third column may contain an identifier or name for the SNP.
  4. Reference Allele: The fourth column shows the reference allele at the specified position.
  5. Alternate Allele: The fifth column shows the alternate allele at the specified position.
  6. Quality Score (QUAL): There might be a column indicating the quality score of the SNP. This information is often found in the VCF file.
  7. Filter Status (FILTER): This column could show whether the SNP passes certain quality filters.
  8. INFO fields:
    • DP: This field represents the total read depth at the position of the variant.
    • ADF (Allelic Depth Forward) and ADR (Allelic Depth Reverse): These fields represent the number of reads supporting the variant allele in the forward and reverse directions, respectively.
    • AD (Allelic Depth): The total number of reads supporting the variant allele (ADF + ADR).
  9. FORMAT field (e.g., “GT:DP:ADF:ADR”):
    • GT (Genotype): Represents the genotype of the variant. For example, “0/1” indicates a heterozygous variant.
    • DP: Same as in the INFO field, representing the total read depth.
    • ADF and ADR: Same as in the INFO field, representing allelic depths for forward and reverse reads, respectively.

``

Look at CT-SNPs specific tab file to see how many CT SNPs are in my dataset:

Pver_Sc0000000_size2095917 5189 . C T 1000 PASS DP=93;ADF=0,0;ADR=75,18;AD=75,18; GT:DP:ADF:ADR:AD:BSD:BSQ:ALFR 0/1:93:0,0:75,18:75,18:0,0,73,0,0,18,75,0:0,0,36,0,0,37,36,0:0.806,0.194 Pver_Sc0000000_size2095917 6277 . C T 1000 PASS DP=10;ADF=0,0;ADR=0,10;AD=0,10; GT:DP:ADF:ADR:AD:BSD:BSQ:ALFR 0/1:10:0,0:0,10:0,10:0,0,64,0,0,10,0,0:0,0,37,0,0,37,0,0:0.000,1.000 Pver_Sc0000000_size2095917 7044 . C T 1000 PASS DP=58;ADF=0,0;ADR=24,34;AD=24,34; GT:DP:ADF:ADR:AD:BSD:BSQ:ALFR 0/1:58:0,0:24,34:24,34:0,0,28,0,0,34,24,0:0,0,36,0,0,37,37,0:0.414,0.586

wc -l CT-SNPs.tab

151,495 out of 4,754,830 = 3.18%

3. Filter SNP variants from 10x.bed files

Download CT-SNPs.tab results file that was filtered to CG motifs and CT-SNP.vcf file that was filtered for CT-SNP positions to Desktop from Andromeda

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/CT-SNPs.tab /Users/Danielle/Downloads/

scp -r danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/CT-SNP.vcf /Users/Danielle/Downloads/

Filter SNPs from CT-SNPs.tab and CT-SNP.vcf output files from 10x .bed files created from this workflow and remaining steps performed in this Rscript

4. Calculate genetic relatedness among samples using PLINK.

First, we convert the VCF file to PLINK format using vcftools. This step creates PLINK files (.ped and .map) from the VCF file.

nano /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/scripts/vcf_to_plink.sh

#!/bin/bash
#SBATCH --job-name="vcf_to_plink"
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=10GB
#SBATCH --account=putnamlab
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/PLINK

# Load PLINK module
module load PLINK/2.00a3.7-gfbf-2023a

# Define file paths
VCF_FILE="/data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/SNP-results.vcf"
OUTPUT_PREFIX="snp_data"

# Convert VCF to PLINK format, splitting multiallelic variants
plink2 --vcf $VCF_FILE \
    --make-bed \
    --out ${OUTPUT_PREFIX}_multiallelic \
    --allow-extra-chr \
    --double-id \
    --set-missing-var-ids @:#:\$r:\$a \
    --new-id-max-allele-len 1000 missing

# Remove multiallelic variants
plink2 --bfile ${OUTPUT_PREFIX}_multiallelic \
    --max-alleles 2 \
    --make-bed \
    --out $OUTPUT_PREFIX \
    --allow-extra-chr

# Calculate genetic relationship matrix (GRM)
plink2 --bfile $OUTPUT_PREFIX \
    --make-grm-gz \
    --out relatedness \
    --allow-extra-chr

echo "PLINK processing complete. Output files are in the current directory."

sbatch /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/scripts/vcf_to_plink.sh

Submitted batch job 334475

Next, calculate the genetic relatedness using the PLINK files created in the previous step. This will generate a file containing the relatedness matrix.

nano /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/scripts/plink_relatedness.sh

#!/bin/bash
#SBATCH --job-name=calculate_grm
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=10GB
#SBATCH --account=putnamlab
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/Becker_WGBS/BS-SNPer/merged_SNP_output/PLINK

# Load PLINK module
module load PLINK/2.00a3.7-gfbf-2023a

# Step 1: Calculate allele frequencies
plink2 --bfile snp_data_split --freq --out allele_freqs --allow-extra-chr

# Step 2: Calculate GRM using the calculated frequencies
plink2 --bfile snp_data_split --read-freq allele_freqs.afreq --make-grm-list --out relatedness --allow-extra-chr

sbatch /data/putnamlab/dbecks/Becker_E5/WGBS_Becker_E5/scripts/plink_relatedness.sh

Submitted batch job 334479

Written on February 12, 2024