P. verrucosa KofamScan pipeline

Overview

Follow step 3: map KEGG terms to a genome in @echille’s lab notebook 2020-10-08-M-capitata-functional-annotation-pipeline.md post for general steps and explanations on the beginning of this process.

Also, see @echille GitHub for further KEGG ontology steps in R.

Step 3: Map Kegg terms to genome

KofamScan is the command-line version of the popular KofamKOALA web-based tool, used to map Kegg terms (containing pathway information) to a genes. KofamScan and KofamKoala work by using HMMER/HMMSEARCH to search against KOfam (a customized HMM database of KEGG Orthologs (KOs). Mappings are considered robust because each Kegg term has an individual pre-defined threshold that a score has to exceed in order to map to a gene. While all mappings are outputted, high scoring (significant) assignments are highlighted with an asterisk.

The commands that I used are below. In order to run KofamScan, you will need a fasta file of predicted protein sequences (preferably the same one used to run InterProScan).

General Protocol:

i) Download and inflate the Kofam database.

To get the most up-to-date Kofam database, download it just before running KofamScan. You will also need to download the profiles associated with the Kofam database containing threshold information.

curl -O ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz | gunzip > ko_list #download and unzip KO database
curl -O ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz | tar xf > profiles #download and inflate profiles

ii) Run KofamScan

exec_annotation: executes Kofamscan

To execute Kofamscan, use the exec_annotation script provided with program download. To run, use: exec_annotation {options} path_to_query_file.

Options (See KofamScan repo for more options):

  • -o - set name of output file
  • -k - path to ko database (downloaded above)
  • -p - path to profile database (downloaded above)
  • -E - set minimum expect value for significant mappings
  • -f - output format
  • –report-unannotated - returns names of sequences with no mapped KO terms
  • -pa - enables Kegg term mapping
/opt/software/kofam_scan/1.3.0-foss-2019b/exec_annotation -o Pver_KO_annot.tsv -k ./ko_list -p ./profiles/eukaryote.hal -E 0.00001 -f detail-tsv --report-unannotated /data/putnamlab/REFS/Pverr/Pver_proteins_names_v1.0.faa

Bluewaves Script:

#!/bin/bash
#SBATCH --job-name="KofamScan"
#SBATCH -t 30-00:00:00
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --mem=100GB
#SBATCH -D /data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/KEGG/

echo "Loading modules" $(date)
module load kofam_scan/1.3.0-foss-2019b
module load libyaml/0.1.5
module unload HMMER/3.3.1-foss-2019b
module load HMMER/3.3.2-gompi-2019b
module list

#echo "Starting analysis... downloading KO database" $(date)
#wget ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz #download KO database
#wget ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
#gunzip ko_list.gz
#tar xf profiles.tar.gz

echo "Beginning mapping" $(date)
/opt/software/kofam_scan/1.3.0-foss-2019b/exec_annotation -o Pver_KO_annot.tsv -k ./ko_list -p ./profiles/eukaryote.hal -E 0.00001 -f detail-$

echo "Analysis complete!" $(date)

Step 4: Compilation of the output of different methods

Done in RStudio. See RMarkdown script.

Output files can be found here

Issues: GitHub Issue #21

I ran into an issue when downloading and using this part of @echille code to get the up-to-date Kofam database:

curl -O ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz | tar xf > profiles #download and inflate profiles

It seemed to not download/extract on of the HMM files for some reason which I realized after running the KofamScan script, adapted from @echille original found here.

In the slurm output, this error was written out:

hmmsearch was not run successfully

Error: File existence/permissions problem in trying to open HMM file
/data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/KEGG/profiles/K00637.hmm.
HMM file /data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/KEGG/profiles/K00637.hmm not found (nor an .h3m binary of it)

Some troubleshooting was done to check this error, you can search the profiles folder to see if a .hmm file was extracted correctly:

#how to check for file extraction in profiles folder:

$ ls profiles/K00637*

ls: cannot access ‘profiles/K00637*’: No such file or directory

#how to extract .hmm file from profiles.tar.gz if they do not appear initially

$ tar tf profiles.tar.gz | grep K00637

profiles/K00637.hmm

We were able to figure out that another way to download and extract all of the HMM files was to use wget:

wget ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz

Make sure to unzip the profiles.tar.gz after:

tar -xf profiles.tar.gz > profiles

After I solved this issue, I got another error in the slurm output that read:

Error: Unknown KO: K00960

This error means that for some reason when extracting/downloading the Kofam database, some of the values (K00960) and a few others that had not been updated were still being found in the tmp > tabular output folder that is created during the download/extraction.

I used the code below to see the older versions of the values that are no longer used and then had to delete them before running the job again.

ls -ltr tmp/tabular/  | grep -v Aug.19

total 232797
drwxr-xr-x 3 danielle_becker putnamlab   4096 Jul 16 12:03 ..
-rw-r--r-- 1 danielle_becker putnamlab   2238 Jul 19 13:34 K00960
-rw-r--r-- 1 danielle_becker putnamlab  26732 Jul 19 13:36 K02471
-rw-r--r-- 1 danielle_becker putnamlab  18298 Jul 19 13:52 K09311
-rw-r--r-- 1 danielle_becker putnamlab  21632 Jul 19 13:53 K09376
-rw-r--r-- 1 danielle_becker putnamlab   1654 Jul 19 13:53 K09394
-rw-r--r-- 1 danielle_becker putnamlab   3422 Jul 19 14:12 K18122

These values did not appear to be in the new profiles.tar.gz file, so they had to have come from a previous version.

I removed them and re-ran the job.

If you are still getting an error, make sure to re-download the ko_list.gz as well before re-running

curl -O ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz | gunzip > ko_list #download and unzip KO database

Sometimes the gunzip on the end does not work, make sure to check that the ko_list has contents in it. If not, use:

gunzip ko_list.gz in a separate line to properly unzip the gz file.

After all of these steps, I re-ran the script and it worked! You should have a Pver_KO_annot.tsv file that looks like this:

Screen Shot 2021-08-23 at 2 29 15 PM

Written on September 6, 2021