P. verrucosa functional annotation pipeline
Overview
All methods for this protocol were adapted following the workflows created by @echillie and @JillAshey GitHub’s here and here.
Table of Contents
-
i) On a HPC server, download nr database from NCBI. Convert this database to be Diamond-readable
ii) Align query protein sequences against database
iii) Generate readable output files
iv) View output file
-
Assign gene ontology terms to sequences
i) Run InterProScan to obtain GO terms
ii) Run BLAST2GO to obtain GO terms
- a) Download BLAST2GO to personal computer and activate the Basic subscription plan
- b) Load the XML files generated from DIAMOND BLAST
- c) Map GO terms
- d) Annotate GO terms
- e) Export annotated sequences and info
- f) View output file
iii) Run Uniprot to obtain GO terms
- a) Make a list of identifiers found by DIAMOND BLAST
- b) Navigate to Uniprot Retrieve/ID mapping page and under Provide your identifiers, click ‘upload your own file’ and upload a subsetted tab file
- c) c) Under ‘Select options’, choose the place where the identifiers were generated (From) and what database to compare to (To)
- d) Select appropiate columns to include in table
- e) Save files of interest
- f) View output file
This project aims to develop a functional genomic annotation workflow for non-model organisms (ie corals). The goal of functional annotation is to identify and tag genes in a reference genome with known functions of homologous genes in other organisms. The following document is intended as a tutorial in understanding functional annotation in non-model organisms. The genomic information from the coral Pocillopora verrucosa is used in this workflow.
For this functional annotation workflow tutorial, you will need:
- Access to a high performance computing server (ie URI Andromeda) with the following programs:
Make sure to check the modules and see if there are any new versions. If so, contact Kevin Bryant to update modules on Andromeda when needed
- DIAMOND v2.0.0-GCC-8.3.0
- InterProScan/5.52-86.0-foss-2021a
- Java v11.0.2
- Laptop with access to the Internet and the following programs installed:
- Blast2GO Basic
- R v4.0.2
- RStudio v1.3.959
Step 1: Obtain sequences of interest.
In order to conduct functional annotation steps, protein and transcript sequences are needed. There are two main hubs where coral genomic information is stored: Reef Genomics and NCBI. Other researchers store their genomic information on their own personal webpages. Genomic information must be downloaded from one of these databases in order to proceed.
i) Identify species to work with.
For this project, the coral species of interest is Pocillopora verrucosa.
ii) Download genomic files for species of interest.
Pocillopora verrucosa Full Transcripts
Pocillopora verrucosa Gene models (protein)
Ready to start annotating!
Step 2: Identify homologous sequences
Homology refers to the similarity of structure or genes in different taxa due to shared ancestry. {example}
Sequence homology is the homology between DNA, RNA, and protein sequences in terms of shared ancestry. Sequence homology is usually inferred by the similarity of nucleotide or amino acid sequences. Strong sequence similarity (or percent homology) provides evidence that two or more sequences are related through shared ancestry.
Several software programs have the ability to compare sequences to assess homology, the most popular one being NCBI’s BLAST (Basic Local Alignment Search Tool - what a great name). BLAST compares nucleotide or protein sequences of interest (called a query) to their sequence databases to find regions of similarity. If a nucleotide or protein sequence of interest significantly matches a sequence/sequences in the databases, BLAST will tag the sequence of interest with the information about the known sequence(s). For example, if a new transcript sequence is identified in a mouse, BLAST could be used to see if any other animals carry a similar sequence, and if they do, what the biological functions of the sequence are.
In this analysis, the program DIAMOND BLAST was used. Like regular BLAST, DIAMOND is a sequence aligner for nucleotide and protein sequences; unlike BLAST, it is optimized for a higher performance capability of large datasets at 100x-20,000x speed of BLAST.
i) On a HPC server, download nr database from NCBI. Convert this database to be Diamond-readable.
If you are in the Putnam Lab and on Andromeda
I used this step for this functional annotation because using the below commands in bash took way too long
This script, created by Erin Chille on August 6, 2020, downloads the most recent nr database in FASTA format from NCBI and uses it to make a Diamond-formatted nr database. This step was updated by Danielle Becker-Polinski on September 24th, 2021 because the scripts were not including the full CPUs to download and a couple other formatting errors. Go to the sbatch_executables subdirectory in the Putnam Lab shared folder and run the scripts, make_diamond_nr_db.sh
and make_diamond_nr_db.sh
in this order:
$ sbatch download_nr_database.sh
Submitted batch job NNN
$ sbatch -d afterok:NNN make_diamond_nr_db.sh
If you are not in the Putnam Lab
Download the nr database from NCBI. Then, use Diamond’s makedb
command to format the database in a Diamond-friendly format. You can also use the command dbinfo
to find version information for the database.
The nr (non-redundant) database is a collection of non-identical protein sequences compiled by NCBI. It is updated on a daily basis.
# On Andromeda
# Download db from NCBI
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
# Load Diamond module
module load DIAMOND/2.0.0-GCC-8.3.0
diamond makedb --in nr.gz -d nr
diamond dbinfo -d nr.dmnd
Options
makedb
- create a Diamond binary database filein
- database that will be used to search againstd
- base output name for database filedbinfo
- print information about database file
The file nr.dmnd is now Diamond-readable and can be used as a reference.
ii) Align query protein sequences against database
Now that the reference database has been properly generated, the sequences of interest can be aligned against it.
# On Andromeda
# Load Diamond module
module load DIAMOND/2.0.0-GCC-8.3.0 #Load DIAMOND
#Run sequence alignment against the nr database
diamond blastx -d /data/putnamlab/shared/databases/nr.dmnd -q /data/putnamlab/REFS/Pverr/Pver_transcriptome_v1.0.fasta -o Pver_annot -f 100 -b20 --more-sensitive -e 0.00001 -k1 --unal 1 --threads $SLURM_CPUS_ON_NODE --tmpdir /tmp/
Blastp: align protein query sequences against protein reference database
Options
d
- path to nr database fileq
- path to query fasta fileo
- base output namef
- output format- 100 = Diamond format
b
- Block size in billions of sequence letters to be processed at a time. Larger block sizes increase the use of memory and temporary disk space, but also improve performance. Set at 20. 20 is the highest recommended value.more-sensitive
e
- maximum expected value to report an alignment- 1e-05 is typical cutoff for sequence alignments
k
maximum top sequences- Set at 1 to only report top sequence for each gene
iii) Generate readable output files
# On Andromeda
#Converting format to XML format for BLAST2GO
diamond view -a Pver.annot.20210924.daa -o Pver.annot.20210924.xml -f 5
diamond view -a Pver.annot.20210924.daa -o Pver.annot.20210924.tab -f 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen
View: generate formatted output files
Options
a
- path to input DAA fileo
- base output namef
- output format- 5 = XML output; 6 = TAB output
The output files (XML and TAB) will both be used downstream in this workflow.
iv) View output file
This is an example of part of the DIAMOND output .tab file:
qseqid | sseqid | pident | length | mismatch | gapopen | qstart | qend | sstart | send | evalue | bitscore | qlen | slen |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Pver_evm.model.Segkk0_pilon.12 | XP_015753513.1 | 86.7 | 1103 | 15 | 1 | 1 | 2913 | 23 | 1125 | 0.0e+00 | 1807.7 | 2916 | 1125 |
Column names
- qseqid - query seq-id
- sseqid - subject seq-id (from BLAST databases)
- pident - % identical matches
- length - alignment length
- mismatch - number of mismatches
- gapopen - number of gap openings
- qstart - start of alignment in query
- qend - end of alignment in query
- sstart - start of alignment in subject
- send - end of alignment in subject
- evalue - number of expected hits of similar quality that could be found by chance (similar to a pvalue). The smaller the evalue, the better the match!
- bitscore - required size of a sequence database in which the current match could be found just by chance. The higher the bitscore, the better the sequence similarity!
- qlen - query sequence length
- slen - subject sequence length
v) Secure-copy output files to local computer
# From a new terminal window (ie not Andromeda or remote server)
scp danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/BLAST-GO-KO/Diamond/Pver_annot.xml /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Functional_Annotation/Diamond
scp danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/BLAST-GO-KO/Diamond/Pver_annot.tab /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Functional_Annotation/Diamond
DIAMOND BLAST results can now be used in further analyses.
Full Andromeda Script: This step will take over 20 hours, you can run this and the InterProScan script at the same time on Andromeda
Pver_annot_diamond.sh:
#!/bin/bash
#SBATCH --job-name="diamond-blastx"
#SBATCH -t 240:00:00
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --mem=100GB
#SBATCH --error="diamond_blastx_out_error"
#SBATCH --output="diamond_blastx_out"
#SBATCH --exclusive
echo "START" $(date)
module load DIAMOND/2.0.0-GCC-8.3.0 #Load DIAMOND
echo "Updating Pver annotation" $(date)
diamond blastx -d /data/putnamlab/shared/databases/nr.dmnd -q /data/putnamlab/REFS/Pverr/Pver_transcriptome_v1.0.fasta -o Pver_annot -f 100 -b20 --more-sensitive -e 0.00001 -k1 --unal 1 --threads $SLURM_CPUS_ON_NODE --tmpdir /tmp/
echo "Search complete... converting format to XML and tab"
diamond view -a Pver_annot.daa -o Pver_annot.xml -f 5
diamond view -a Pver_annot.daa -o Pver_annot.tab -f 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen
echo "STOP" $(date)
Submitted batch job 1931742
Step 3: Assign gene ontology terms to sequences
After DIAMOND BLAST is completed, analysis can move to assigning gene ontology (GO) terms to sequences.
The Gene Ontology is an extensive consortium that aims to understand gene function and provide functional annotation for organisms across the tree of life. It also maintains a controlled vocabulary of gene and gene attributes across species.
The Gene Ontology has a system to classify genes into terms. The terms are grouped into 3 categories:
- Molecular funciton - molecular-level activities performed by gene products (ie RNA or proteins). Describes the activities rather than the entities (molecules or protein complexes)
- Examples: Toll receptor binding, transcription regulator activity
- Cellular component - cellular anatomy and/or locations where gene products perform function
- Examples: mitochondrion, ribosome
- Biological process - larger biological processes accomplished by interaction of multiple molecular activities
- Examples: glucose transmembrane transport, DNA repair
With this information, genes can be described/annotated with multiple terms. These terms are called GO terms. Here is the basic structure of a GO term:
GOID: GO:0007165
Term: signal transduction
Ontology: BP
Definition: The cellular process in which a signal is conveyed to
trigger a change in the activity or state of a cell. Signal
transduction begins with reception of a signal (e.g. a ligand
binding to a receptor or receptor activation by a stimulus such as
light), or for signal transduction in the absence of ligand,
signal-withdrawal or the activity of a constitutively active
receptor. Signal transduction ends with regulation of a downstream
cellular process, e.g. regulation of transcription or regulation of
a metabolic process. Signal transduction covers signaling from
receptors located on the surface of the cell and signaling via
molecules located within the cell. For signaling between cells,
signal transduction is restricted to events at and within the
receiving cell.
Synonym: GO:0023033
Synonym: signaling pathway
Synonym: signalling pathway
Synonym: signaling cascade
Synonym: signalling cascade
Secondary: GO:0023033
Elements
- GOID - unique 7-digit identifier that is intended to be machine readable
- Term - human-readable name
- Ontology - molecular function (MF), cellular component (CC), biological process (BP)
- Definition - description of what the term means
- Synonym - alternate words closely related in meaning to term, relevant connections
- Secondary - ID created when 2 or more terms are identical in meaning and so are merged into a single term. Secondary ID preserves the excess GO terms
There is so much more information available with these terms, including relationships to other genes/gene products, graphical representation of related GO terms, and much more, but that is beyond the scope of this analysis.
There are many different methdods to assigning GO terms to genes/gene products of interest. This workflow will focus on InterProScan, BLAST2GO, and Uniprot. These tools were chosen because they all utilize different databases for annotation. Additionally, they will all be run in different places: InterProScan on HPC, BLAST2GO on local computer, and Uniprot online. Results can be compared across tools to assess how each performed.
i) Run InterProScan to obtain GO terms
InterProScan is a website/software that provides functional analysis of proteins by using predictive models that look across protein databases to find homologies with query proteins. If a homology is identified in one of the databases, the information about that homology is used to assign the query protein a GO term.
Note: Many fasta files willl use an asterisk to denote a STOP codon. InterProScan does not accept special characters within the sequences, so I removed them prior to running the program using the code below:
cp Pver_proteins_names_v1.0.faa /data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/BLAST-GO-KO/InterProScan/
sed -i 's/*//g' Pver_proteins_names_v1.0.faa
InterProScan utilizes several member databases to enhance the chance of obtaining good protein info:
- Structural domains - Gene3D, Superfamily
- Functional annotation of families/domains - PIRSF, TIGR, Panther, Pfam, SMART, Prints, Hamap, ProSite
- Protein features - ProSite
- Prediction of conserved domains - ProDom
a) Run InterProScan on Andromeda
# On Andromeda
# Load module
module load InterProScan/5.52-86.0-foss-2021a
module load Java/11.0.2
java -version
# Run InterProScan
interproscan.sh -version
interproscan.sh -f XML -i Pver_proteins_names_v1.0.faa -b ./Pver.interpro.20210927 -iprlookup -goterms -pa
interproscan.sh -mode convert -f GFF3 -i ./Pver.interpro.20210927.xml -b ././Pver.interpro.20210927
Options
version
- display version number of IPS and javaf
- output formati
- path to input fileb
- base output nameiprlookup
- provides mapping from matched member databases to InterPro entriesgoterms
- provides mapping to Gene Ontologypa
- provides mapping from matches to pathway info, which is based on matched mantually curated InterPro enteriesmode
- convert - change file format
b) View output file
Example of part of IPS output file:
seqid | source | type | start | end | score | strand | phase | attributes |
---|---|---|---|---|---|---|---|---|
Pver_evm.model.Segkk4293_pilon.5 | Pver | protein_match | 158 | 257 | 1.1E-19 | + | . | date=26-09-2020;Target=Pver_evm.model.Segkk4293_pilon.5 158 257;Ontology_term=”GO:0006811”,”GO:0016021”;ID=match$3_158_257;signature_desc=Neurotransmitter-gated ion-channel transmembrane region;Name=PF02932;status=T;Dbxref=”InterPro:IPR006029” |
Column names
- seqid - query sequence id
- source - algorithm or software that generated feature
- type - type of feature
- start - start coordinates of feature
- end - end coordinates of feature
- score - similar to evalue or pvalue
- strand - + for positive strand, - for negative strand
- phase - ?
- attributes - list of feature attributes in format tag=value. There can be multiple tag=value attributes
- Date
- Target - target sequence for analysis and start/stop coordinates
- Ontology - link between feature and ontology databses, like Gene Ontology (ie GO terms)
- ID - feature ID, required to be unique for features that have ‘children’ (gene or mRNA)
- Signature desc
- Name - feature name, does not have to be unique
- Status
- Dbxref - link between feature and other databases, like InterPro
I noticed some inconsistencies with my output GFF3 output table and Jill’s GFF3 output table.
On my output table I noticed there were a lot of extra pathway annotations that occurred after the Dbxref=”InterPro:IPR000504” section of the output. Compared to Jill’s that did not have this. While their is nothing wrong with this, it takes up a lot of space and makes the file much longer.
My Output GFF3 Format:
Jill’s Output GFF3 Format:
Since this information was not needed for downstream analysis, I removed it using the code below:
sed -e 's#^\(.*Dbxref="InterPro:[A-Z0-9]*"\).*#\1#' Pver.interpro.20210927.gff3 > Pver.interpro.20210927-smaller.gff3
I then checked to make sure nothing was deleted by this change:
1. looked at number of gene names
zgrep -c "^>" Pver.interpro.20210927.gff3
455630
zgrep -c "^>" Pver.interpro.20210927-smaller.gff3
455630
2. looked at number of Dbxref= references
grep -c 'Dbxref=' Pver.interpro.20210927.gff3
257193
grep -c 'Dbxref=' Pver.interpro.20210927-smaller.gff3
257193
3. looked at number of Ontology_term= references
grep -c 'Ontology_term=' Pver.interpro.20210927.gff3
127451
grep -c 'Ontology_term=' Pver.interpro.20210927-smaller.gff3
127451
4. looked at number of status=T references
grep -c 'status=T' Pver.interpro.20210927.gff3
429951
grep -c 'status=T' Pver.interpro.20210927-smaller.gff3
429951
Full Andromeda Script: Took ~16 hours to complete
Pver_InterProScan.sh:
#!/bin/bash
#SBATCH --job-name="InterProScan"
#SBATCH -t 30-00:00:00
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --mem=100GB
#SBATCH --error="interproscan_out_error"
#SBATCH --output="interproscan_out"
#SBATCH --exclusive
cd /data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/BLAST-GO-KO/InterProScan/
echo "START $(date)"
# Load module
module load InterProScan/5.52-86.0-foss-2021a
module load Java/11.0.2
java -version
# Run InterProScan
interproscan.sh --cpu $SLURM_CPUS_ON_NODE ...
interproscan.sh -version
interproscan.sh -f XML -i Pver_proteins_names_v1.0.faa -b Pver.interpro.20210927 -iprlookup -goterms -pa
interproscan.sh -mode convert -f GFF3 -i Pver.interpro.20210927.xml -b Pver.interpro.20210927
# -i is the input data
# -b is the output file base
# -f is formats
# -iprlookup enables mapping
# -goterms is GO Term
# -pa is pathway mapping
# -version displays version number
echo "DONE $(date)"
Script error:
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'InterProScan/5.46-81.0-foss-2019b'
openjdk version "11.0.2" 2019-01-15
OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)
/var/spool/slurmd/job1931747/slurm_script: line 21: interproscan.sh: command not found
/var/spool/slurmd/job1931747/slurm_script: line 22: interproscan.sh: command not found
/var/spool/slurmd/job1931747/slurm_script: line 23: interproscan.sh: command not found
Error searching for module:
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'InterProScan/5.52-86.0-foss-2019b'
#Kevin had to download the background packages on bluewaves and suggested to use andromeda for faster running times, ran on andromeda and it worked
Submitted batch job 88966
#noticed in my output script that it mentioned a new version/module of InterProtScan available, had Kevin Bryan update this to InterProScan/5.52-86.0-foss-2021a
#also Kevin suggested adding these two additions to the above code to speed up the process:
#SBATCH --exclusive
interproscan.sh --cpu $SLURM_CPUS_ON_NODE ...
#added them above and submitted
Submitted batch job 89016
c) Secure-copy output file to local computer
# From a new terminal window (ie not Andromeda or remote server)
scp danielle_becker@ssh3.hac.uri.edu:/data/putnamlab/dbecks/Becker_E5/Becker_RNASeq/BLAST-GO-KO/InterProScan/Pver.interpro.20210927-smaller.gff3 /Users/Danielle/Desktop/Putnam_Lab/Becker_E5/Functional_Annotation/InterProScan/
ii) Run BLAST2GO to obtain GO terms
BLAST2GO (B2G) is a bioinformatics tools for functional annotation and analysis of gene or protein sequences. It was originally developed to provide a user-friendly interface for GO annotation and now hosts many different functional annotation tools. The B2G software makes it possible to generate annotation without requiring writing any code. While B2G can do an extensive array of functions, this analysis primarily utilizes the GO mapping and annotation functions.
a) Download BLAST2GO to personal computer and activate the Basic subscription plan.
The B2G application can be downloaded here. B2G is available for Mac, Windows, and Linux systems. 2GB of RAM is recommended. Additionally, Internet connection is required to use most application features.
Register for B2G Basic here. B2G Basic is free and includes the necessary features for this analysis. Registering will generate an activation key, which be put into the B2G software. You must be a part of some research institution to obtain B2G Basic.
b) Load the XML files generated from DIAMOND BLAST.
While B2G has the ability to run BLAST, this analysis prefers to use results from DIAMOND because of its high performance capability and senstivity. The XML file generated from DIAMOND BLAST will be used here.
To load the file, go to File<Load<Load Blast results<Load Blast XML (Legacy)
Once the file is loaded, a table loads with info about those BLAST results (nr, Tags, SeqName, Description, Length, Hits, e-Value, and sim mean). All of the cells should be orange with Tags that say BLASTED. This indicates that these sequences have only been blasted, not mapped or annotated.
c) Map GO terms
Mapping is the process of retrieving GO terms associated with the Description obtained by the DIAMOND BLAST search. Several mapping {steps} occur:
- BLAST result accessions are used to retrieve gene names from NCBI and those gene names are used to search the GO database.
- BLAST result accessions are used to retrieve protein information (with GO terms already annotated) through UniProt, which uses the databases SD, UniProt, Swiss-Prot, TrEMBL, RefSeq, GenPept and PDB.
- BLAST result accessions are directly searched in GO database.
To map results, select the mapping icon (white circle with green circle inside) at the top of the screen. Then select Run Mapping. A box will open up; don’t change anything, click run. Depending on the number of BLAST results, mapping could take hours to days. B2G will continue running if the computer is in sleep mode. Mapping status can be checked under the Progress tab in the lower left box. If mapping retrieved a GO term that may be related to a certain sequence, that sequence row will turn light green.
d) Annotate GO terms
Now that GO terms have been retrieved, the annotation process will select GO terms from the GO pool obtained through mapping and assign them to query sequences. Several annotation steps occur:
- For all found GO terms, an annotation rule (AR) is applied. The rule seeks to find the most specific annotations with a certain level of reliability and can be adjusted in the settings prior to running annotation.
- If a candidate GO term is found, an annotation score is calculated that weights the highest hit similarity of that candidate GO term by its evidence code (from mapping step). The candidate GO term will only be assigned to a sequence if it passes a certain threshold based on calculations above. For more info about the math behind annotation, go here.
To annotate, select the annot icon (white circle with blue circle inside) at the top of the screen. Then select Run Annotation. A box will open up; don’t change anything unless thresholds need to be adjusted. If no changes are necessary, click next through the boxes until the final one and click run. Depending on the mapping results, annotating could take hours to days. B2G will continue running if the computer is in sleep mode. Annotation status can be checked under the Progress tab in the lower left box. If a GO term has been successfully assigned to a sequence, that sequence row will turn blue.
e) Export annotated sequences and info
To export the file with B2G annotated genes, go to File<Export<Export as Table. A box will open up, and save as a text file. Re-save as csv as needed.
f) View output file
seqName | top_hit | length | evalue | simMean | GO.ID | GO_names |
---|---|---|---|---|---|---|
Pver_evm.model.Segkk0_pilon.11 | EDO40119.1 | 1779 | 1.9e-199 | 68.38 | P:GO:0007166; P:GO:0007275; C:GO:0016020; C:GO:0120025 | P:cell surface receptor signaling pathway; P:multicellular organism development; C:membrane; C:plasma membrane bounded cell projection |
Column names
- seqName - query sequence
- top_hit - top match from DIAMOND BLAST
- length - alignment length
- evalue - number of expected hits of similar quality that could be found by chance (similar to a pvalue). The smaller the evalue, the better the match!
- simMean - mean similarity between query and top match
- GO.ID - gene ontology mapping results, GO terms and evidence codes
- GO_names - annotated GO terms
iii) Run Uniprot to obtain GO terms
Uniprot (Universal Protein Resource) provides information for protein sequence and annotation data. It maintains several protein databases:
- UniProt Knowledgebase (UniProtKB) - central collection hub for functional protein info and annotation
- UniProt Reference Clusters (UniRef) - provides clustering from UniProtKB to ensure complete coverage of sequences while masking redundant sequences
- UniProt Archive (UniParc) - database of publicly available protein sequences
In this analysis, Uniprot uses BLAST input to search against its protein databases.
a) Make a list of identifiers found by DIAMOND BLAST.
Uniprot uses the .tab file generated from DIAMOND BLAST as input. The column ‘sseqid’ is the primary input to Uniprot, as it can use that BLAST input to find protein matches. Because UniProt is a website and lacks proper storage and RAM, it cannot handle an entire .tab DIAMOND BLAST file. To ensure that UniProt reads all inputs, subset files so that a file only has ~2000-3000 lines per file in Terminal.
# Check how many lines in the full file
wc -l Pver_annot.tab
21606 Pver_annot.tab
#https://kb.iu.edu/d/afar (for instructions on split command)
#use split command to split into multiple files that have 2,000 designated lines each
split -l 2000 Pver_annot.tab tab
tabaa tabad tabag tabaj
tabab tabae tabah tabak
tabac tabaf tabai
b) Navigate to Uniprot Retrieve/ID mapping page and under Provide your identifiers, click ‘upload your own file’ and upload a subsetted tab file.
UniProt will be able to process the smaller/subsetted files. However, each subsetted .tab file will need to be put in as input one at a time.
c) Under ‘Select options’, choose the place where the identifiers were generated (From) and what database to compare to (To).
In this analysis, the identifiers were generated from the EMBL/GenBank/DDBJ CDS option (aka NCBI/BLAST) and they will be compared to (UniProtKB). Once that is finished, hit submit. It may take a few minutes). If the mapping fails and the page displays ‘Service Unavailable’, go back and try again.
d) Select appropiate columns to include in table.
If mapping was successful, the screen should have something like X out of Y EMBL/GenBank/DDBJ CDS identifiers were successfully mapped to X UniProtKB IDs in the table below
. There will be a table below with the results. Select the ‘Columns’ tab right above the table. This will open a window to pick more column options so more information can be included in the table. Many of the column options don’t apply to this analysis.
Under ‘Names & Taxonomy’, Entry name, Gene names, Organism, and Protein Names should already be selected. Under ‘Sequences’, Length should already be selected. Under ‘Miscellanous’, the gold paper with a star and the blue paper symbols should already be selected. If they are not, select those columns. Under ‘Gene Ontology’, select Gene ontology (GO) and Gene ontology IDs. Under ‘Genome Annotation’, select KEGG. Once selections are completed, scroll back to the top of the page and hit Save in the upper-right hand corner.
e) Save files of interest
To save the main table, click Download in the top left of the table. Select Download all, Uncompressed, and Tab-seperated as Format. Click Go.
To save the unmapped identifiers, click on the ‘Click here to download the 6281 unmapped identifiers’. To save UniParc results (if any are found), click on the UniParc link under the main header of X out of Y EMBL/GenBank/DDBJ CDS identifiers were successfully mapped to X UniProtKB IDs in the table below
. To finish downloading UniParc results, click Download in the top left of the table. Select Download all, Uncompressed, and Tab-seperated as Format. Click Go.
f) View output files
This is an example of Uniprot output .tab file:
my_list | top_hit | uniprotkb_entry | status | protein_names | gene_names | organism | length | go_ids | gene_ontology | kegg |
---|---|---|---|---|---|---|---|---|---|---|
EDO36971.1 | A7SH22 | A7SH22_NEMVE | unreviewed | Predicted protein (Fragment) | v1g118173 | Nematostella vectensis (Starlet sea anemone) | 239 | GO:0004888; GO:0005230; GO:0005887; GO:0007165; GO:0007268; GO:0030594; GO:0034220; GO:0042391; GO:0043005; GO:0045202; GO:0050877 | integral component of plasma membrane [GO:0005887]; neuron projection [GO:0043005]; synapse [GO:0045202]; extracellular ligand-gated ion channel activity [GO:0005230]; neurotransmitter receptor activity [GO:0030594]; transmembrane signaling receptor activity [GO:0004888]; chemical synaptic transmission [GO:0007268]; ion transmembrane transport [GO:0034220]; nervous system process [GO:0050877]; regulation of membrane potential [GO:0042391]; signal transduction [GO:0007165] | nve:5508437 |
Column names
- my_list - query id (generated from DIAMOND BLAST)
- top_hit - top match from Uniprot databases
- uniprotkb_entry - official Uniprot entry for top_hit
- status - entry status; indicates if entry has been manually annotated and reviewed (reviewed = SwissProt section of Uniprot, unreviewed = computer-annotated TrEMBL section)
- protein_names - list of all names for a particular protein (Predicted protein = unreviewed = omputer-annotated TrEMBL section)
- gene_names - name of the genes that code for particular protein sequence
- organism - name of organism that is source of protein information
- length - alignment length
- go_ids - gene ontology mapping results, GO terms and evidence codes
- gene_ontology - annotated GO terms
- ko - K number used to reconstruct biological pathways in KEGG mapper (outside scope of this analysis)
- kegg - gene identifier used to search biological pathways in KEGG mapper (outside scope of this analysis)
Step 4: Merge all information for full annotation
Follow instructions/code here to compile all annotation information in R.