Proteinortho

Proteinortho is a tool to detect orthologous genes within different species. To do so, it compares similarities of given gene sequences and clusters them to find significant groups. The algorithm was designed to handle large-scale data and can be applied to hundreds of species at once. Details can be found in Lechner et al., BMC Bioinformatics. 2011 Apr 28;12:124. To enhance the prediction accuracy, the relative order of genes (synteny) can be used as an additional feature for the discrimination of orthologs. The corresponding extension, namely PoFF (manuscript in preparation), is already built into Proteinortho. The general workflow of Proteinortho is depicted [here].

New Features of Proteinortho Version 6!

Continuous Integration

The pipeline status badge indicates the current status of the continuous integration (CI) across various platforms (ubuntu, centos, debian, fedora) and GNU C++ versions (5, 6, latest). The whole git repository is deployed on a clean docker image (gcc:latest, gcc:5, ubuntu:latest, fedora:latest, debian:latest, centos:latest), compiled (make all), and tested (make test). The badge is green only if all tests pass. For more information see Continuous Integration (proteinortho wiki).

Table of Contents

  1. Installation
  2. Synopsis and Description
  3. Options/Parameters
  4. PoFF synteny extension
  5. Output description
  6. Examples
  7. Error Codes and Troubleshooting <- look here if you cannot compile/run (proteinortho wiki)
  8. Large compute jobs example (proteinortho wiki)
  9. Biological example (proteinortho wiki)

Bug reports: See chapter 7, or send a mail to incoming+paulklemm-phd-proteinortho-7278443-issue-@incoming.gitlab.com (please include the 'Parameter-vector' that is printed for all errors). You can also send a mail to lechner@staff.uni-marburg.de.

Installation

Proteinortho comes with precompiled binaries of all executables (Linux/x86), so you should be able to run perl proteinortho6.pl in the downloaded directory. You can also move all executables to your favorite directory (e.g. with make install PREFIX=/home/paul/bin). If you cannot execute src/BUILD/Linux_x86_64/proteinortho_clustering, you have to recompile with make; see section 2, Building and installing proteinortho from source.


Easy installation with (bio)conda (for Linux + OSX)

conda install proteinortho

If you need conda (see here) and the bioconda channel: conda config --add channels defaults && conda config --add channels bioconda && conda config --add channels conda-forge.



Easy installation with brew (for OSX)

brew install proteinortho

If you need brew (see here)



Easy installation with docker

docker pull quay.io/biocontainers/proteinortho



Easy installation with dpkg (root privileges are required)

The deb package can be downloaded here: https://packages.debian.org/unstable/proteinortho. Afterwards the deb package can be installed with sudo dpkg -i proteinortho*deb.


(Easy installation with apt-get)

Disclaimer: work in progress. Proteinortho will be released to stable with Debian 11 (~2021); after that it can be installed with sudo apt-get install proteinortho (currently this installs the outdated version v5.16b).


1. Prerequisites

Proteinortho uses standard software that is often already installed or is part of the package repositories and can thus easily be installed. The sources come with a precompiled version of Proteinortho for 64-bit Linux.

To run Proteinortho, you need: (Click to expand)


To compile Proteinortho (linux/osx), you need: (Click to expand)


2. Building and installing proteinortho from source (linux and osx)

Proteinortho can use an existing LAPACK library. Check whether one is installed with 'dpkg --get-selections | grep lapack'. If not, install LAPACK, e.g. with 'apt-get install libatlas3-base' or liblapack3.

If you don't have LAPACK, 'make' will automatically compile LAPACK v3.8.0 for you.

Fetch the latest source code archive from here

or with one of the following commands:

git clone https://gitlab.com/paulklemm_PHD/proteinortho

wget https://gitlab.com/paulklemm_PHD/proteinortho/-/archive/master/proteinortho-master.zip


Additional information for OSX (the -fopenmp error)

Install a newer g++ compiler with -fopenmp support (multithreading) using brew (get brew here: https://brew.sh/index_de):
brew install gcc --without-multilib
You should then have g++-7 or a newer version (g++-8, g++-9, ...). Next, tell make to use the new compiler in one of two ways. Either link it as the default compiler:
ln -s /usr/local/bin/gcc-7 /usr/local/bin/gcc
ln -s /usr/local/bin/g++-7 /usr/local/bin/g++
or specify the new g++ directly with 'make CXX=/usr/local/bin/g++-7 all'.

'make' successful output:

[  0%] Prepare proteinortho_clustering ...
[ 20%] Building **proteinortho_clustering** with LAPACK (static/dynamic linking)
[ 25%] Building **graphMinusRemovegraph**
[ 50%] Building **cleanupblastgraph**
[ 75%] Building **po_tree**
[100%] Everything is compiled with no errors.

The compilation of proteinortho_clustering has multiple fall-back routines. If everything fails, please see Troubleshooting (proteinortho wiki).

3. Make test output

'make test' successful output:

Everything is compiled with no errors.
[TEST] 1. basic proteinortho6.pl -step=2 tests
 [1/11] -p=blastp+ test: passed
 [2/11] -p=blastp+ synteny (PoFF) test: passed
 [3/11] -p=diamond test: passed
 [4/11] -p=diamond (--moresensitive) test (subparaBlast): passed
 [5/11] -p=lastp (lastal) test: passed
 [6/11] -p=topaz test: passed
 [7/11] -p=usearch test: passed
 [8/11] -p=ublast test: passed
 [9/11] -p=rapsearch test: passed
 [10/11] -p=blatp (blat) test: passed
 [11/11] -p=mmseqsp (mmseqs) test: passed
[TEST] 2. -step=3 tests (proteinortho_clustering)
 [1/2] various test functions of proteinortho_clustering (-test): passed
 [2/2] Compare results of 'with lapack' and 'without lapack': passed
[TEST] Clean up all test files...
[TEST] All tests passed

If you have problems compiling or running the program, see Troubleshooting (proteinortho wiki).


SYNOPSIS

proteinortho6.pl [options] <fasta file(s)> (one fasta for each species, at least 2)

OR

proteinortho [options] <fasta file(s)>

DESCRIPTION

Proteinortho is a tool to detect orthologous genes within different species. To do so, it compares similarities of given gene sequences and clusters them to find significant groups. The algorithm was designed to handle large-scale data and can be applied to hundreds of species at once. Details can be found in Lechner et al., BMC Bioinformatics. 2011 Apr 28;12:124. To enhance the prediction accuracy, the relative order of genes (synteny) can be used as an additional feature for the discrimination of orthologs. The corresponding extension, namely PoFF (manuscript in preparation), is already built into Proteinortho.

Proteinortho assumes that you have all your gene sequences in FASTA format, represented either as amino acids or as nucleotides. The source code archive contains some examples, namely C.faa, E.faa, L.faa and M.faa, located in the test/ directory. By default, Proteinortho assumes amino acid sequences and thus uses diamond (-p=diamond) to compare sequences. If you have nucleotide sequences, you need to change this by adding the parameter -p=blastn+ (or some other algorithm). (If you only have NCBI BLAST legacy installed, you need to specify this too, by adding either -p=blastp or -p=blastn.) The full command for the example files would thus be

proteinortho6.pl -project=test test/C.faa test/E.faa test/L.faa test/M.faa

Instead of naming the FASTA files one by one, you could also use test/*.faa. Please note that the parameter -project=test is optional; it sets the prefix of the output files generated by Proteinortho. If you skip the project parameter, the default project name will be myproject.

OPTIONS graphical user interface

Open proteinorthoHelper.html in your favorite browser or visit lechnerlab.de/proteinortho online for an interactive exploration of the different options of Proteinortho.

OPTIONS

Main parameters (can be used with -- or -)



Synteny options (optional, step 2) (output: .ffadj-graph, .poff-graph)



Clustering options (step 3) (output: .proteinortho.tsv, .proteinortho.html, .proteinortho-graph)



Misc options



Large compute jobs


PoFF

The PoFF extension allows you to use the relative order of genes (synteny) as an additional criterion to disentangle complex co-orthology relations. To do so, add the parameter -synteny. You can use it either to come closer to one-to-one orthology relations by preferring syntenically conserved copies in the presence of two very similar paralogs (default), or just to reduce noise in the predictions by detecting multiple copies of genomic areas (add the parameter -dups=3).

Please note that you need additional data to include synteny, namely the gene positions in GFF3 format. As Proteinortho is primarily made for proteins, it will only accept GFF entries of type CDS (column #3 in the GFF file). The attributes column (#9) must contain Name=GENE IDENTIFIER, where GENE IDENTIFIER corresponds to the respective identifier in the FASTA file. The identifier must not contain a semicolon (;). Alternatively, you can set ID=GENE IDENTIFIER.

Example files are provided in the source code archive. Hence, we can run proteinortho6.pl -project=test -synteny test/A1.faa test/B1.faa test/E1.faa test/F1.faa to add synteny information to the calculations. Of course, this only makes sense if the species are sufficiently similar. You won't gain much when comparing e.g. bacteria with fungi. When the analysis is done, you will find an additional file in your current working directory, namely test.poff.tsv (tab separated file). This file is equivalent to the test.proteinortho.tsv file (described under Output) but can be considered more accurate, as synteny was used in its construction.
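Before running with -synteny, it can help to check that every CDS entry in a GFF3 file names a gene that also occurs in the matching FASTA file. The following Python sketch illustrates the rules described above. The function names and the inline sample data are hypothetical and not part of Proteinortho. Two assumptions are made: the FASTA identifier is taken to be the first whitespace-delimited token of the header line, and Name= is looked up before ID= when both are present.

```python
def fasta_ids(fasta_text):
    """Collect sequence identifiers (first token after '>') from FASTA text."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def gff_cds_ids(gff_text):
    """Return the gene identifier of every CDS row in GFF3 text.

    Rows whose type (column 3) is not CDS are skipped. Name=... is tried
    first and ID=... second (the order is an assumption of this sketch).
    """
    found = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        attrs = {}
        for field in cols[8].split(";"):
            if "=" in field:
                key, value = field.split("=", 1)
                attrs[key.strip()] = value.strip()
        found.append(attrs.get("Name") or attrs.get("ID"))
    return found


def missing_in_fasta(fasta_text, gff_text):
    """CDS identifiers (None for unnamed rows) that the FASTA text lacks."""
    known = fasta_ids(fasta_text)
    return [gene for gene in gff_cds_ids(gff_text) if gene not in known]


# Made-up sample data, for illustration only.
FASTA = ">A_1\nMKT\n>A_2\nMLS\n"
GFF = (
    "##gff-version 3\n"
    "chr1\tsrc\tgene\t1\t90\t.\t+\t.\tID=gene1\n"
    "chr1\tsrc\tCDS\t1\t90\t.\t+\t0\tName=A_1\n"
    "chr1\tsrc\tCDS\t100\t190\t.\t-\t0\tID=A_2;Parent=gene2\n"
    "chr1\tsrc\tCDS\t200\t290\t.\t+\t0\tName=A_9\n"
)
print(missing_in_fasta(FASTA, GFF))
```

An empty list means every CDS entry could be matched to a FASTA identifier; anything else is worth fixing before starting a long run.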

Output

BLAST Search (step 1-2)

myproject.blast-graph

Filtered raw blast data based on adaptive reciprocal best blast
matches (= reciprocal best match plus all reciprocal matches within a
range of 95% by default). The first two rows are comments
explaining the meaning of each column. Whenever a further comment line
(starting with #) follows, it indicates that results comparing two species
are about to follow. E.g. # M.faa L.faa means that the next lines represent
results for species M and L. All matches are reciprocal matches. If
e.g. a match for M_15 L_15 is shown, L_15 M_15 exists implicitly.
E-values and bit scores for both directions are given after each
match.
The 4 numbers in the comment line below the species line
('# 3.8e-124        434.9...') are the median values of
evalue_ab, bitscore_ab, evalue_ba and bitscore_ba.

  # file_a    file_b
  # a   b     evalue_ab     bitscore_ab   evalue_ba     bitscore_ba 
  # E.faa     C.faa   
  # 3.8e-124        434.9   2.8e-126        442.2
  E_11  C_11  5.9e-51 190.7   5.6e-50 187.61
  E_10  C_10  3.8e-124    434.9   2.8e-126    442.2
  ...
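The layout described above can be read with a few lines of Python. This is a minimal sketch, not a tool shipped with Proteinortho: the function name parse_blast_graph and the dictionary keys are chosen here for illustration, and the sketch assumes whitespace-separated columns as in the excerpt.

```python
def parse_blast_graph(text):
    """Split blast-graph text into per-pair medians and a list of matches."""
    medians = {}   # (file_a, file_b) -> (evalue_ab, bitscore_ab, evalue_ba, bitscore_ba)
    matches = []   # one dict per reciprocal match
    pair = None    # species pair announced by the latest "# X.faa Y.faa" line
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            body = line[1:].split()
            if len(body) == 2 and body != ["file_a", "file_b"]:
                pair = (body[0], body[1])          # species line
            elif len(body) == 4 and pair is not None:
                medians[pair] = tuple(float(value) for value in body)
            continue                                # other comments: column headers
        fields = line.split()
        if len(fields) != 6 or pair is None:
            continue  # skip anything that is not a six-column match line
        matches.append({
            "file_a": pair[0], "file_b": pair[1],
            "a": fields[0], "b": fields[1],
            "evalue_ab": float(fields[2]), "bitscore_ab": float(fields[3]),
            "evalue_ba": float(fields[4]), "bitscore_ba": float(fields[5]),
        })
    return medians, matches


# The excerpt shown above, used as sample input.
SAMPLE = """\
# file_a    file_b
# a   b     evalue_ab     bitscore_ab   evalue_ba     bitscore_ba
# E.faa     C.faa
# 3.8e-124        434.9   2.8e-126        442.2
E_11  C_11  5.9e-51 190.7   5.6e-50 187.61
E_10  C_10  3.8e-124    434.9   2.8e-126    442.2
"""

medians, matches = parse_blast_graph(SAMPLE)
for match in matches:
    print(match["file_a"], match["a"], match["file_b"], match["b"], match["bitscore_ab"])
```

Because every match is reciprocal, each pair of genes appears only once; code that needs both directions can read evalue_ba and bitscore_ba from the same record.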


Clustering (step 3)

myproject.proteinortho-graph

The clustered myproject.blast-graph. Its connected components are represented in myproject.proteinortho.tsv / myproject.proteinortho.html. The format of myproject.proteinortho-graph is the same as that of the blast-graph (see above).

  # file_a    file_b
  # a   b     evalue_ab     bitscore_ab   evalue_ba     bitscore_ba
  # E.faa     C.faa
  E_10  C_10  3.8e-124    434.9   2.8e-126    442.2
  E_11  C_11  5.9e-51 190.7   5.6e-50 187.6


myproject.proteinortho.tsv

The connected components. The first line, starting with #, is a comment line indicating the meaning of each column. Each of the following lines represents one orthologous group. The first column gives the number of species covered by the group. The second column gives the number of genes included in the group. Often, this number equals the number of species, meaning that there is a single ortholog in each species. If the number of genes is larger than the number of species, co-orthologs are present. The third column gives the algebraic connectivity of the group. Basically, this indicates how densely the genes are connected in the orthology graph that was used for clustering. A connectivity of 1 indicates a perfectly dense cluster in which each gene is similar to every other gene. By default, Proteinortho splits each group into two denser subgroups when the connectivity is below 0.1 (can be user defined). Hint: you can open this file in Excel / Numbers / Open Office.

  # Species   Genes   Alg.-Conn.    C.faa   C2.faa  E.faa   L.faa   M.faa
  2   5     0.16  *     *     *     L_643,L_641   M_649,M_640,M_642
  3   6     0.138   C_164,C_166,C_167,C_2   *     *     L_2   M_2
  2   4     0.489   *     *     *     L_645,L_647   M_644,M_646
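As a rough illustration of the column layout, the table can be loaded in Python as follows. The function name parse_proteinortho_tsv and the dictionary keys are hypothetical, and the sketch assumes tab-separated columns with * marking species that have no gene in the group, as in the excerpt above.

```python
def parse_proteinortho_tsv(text):
    """Return (species names, list of groups) from proteinortho.tsv text."""
    species = []
    groups = []
    for line in text.splitlines():
        if not line.strip():
            continue
        cols = line.split("\t")
        if line.startswith("#"):
            species = cols[3:]            # header: species start at column 4
            continue
        members = {}
        for name, cell in zip(species, cols[3:]):
            members[name] = [] if cell == "*" else cell.split(",")
        groups.append({
            "n_species": int(cols[0]),
            "n_genes": int(cols[1]),
            "alg_conn": float(cols[2]),
            "members": members,
        })
    return species, groups


def has_coorthologs(group):
    """More genes than species means at least one species has several genes."""
    return group["n_genes"] > group["n_species"]


# The rows shown above, rebuilt with explicit tab separators.
ROWS = [
    ["# Species", "Genes", "Alg.-Conn.", "C.faa", "C2.faa", "E.faa", "L.faa", "M.faa"],
    ["2", "5", "0.16", "*", "*", "*", "L_643,L_641", "M_649,M_640,M_642"],
    ["3", "6", "0.138", "C_164,C_166,C_167,C_2", "*", "*", "L_2", "M_2"],
    ["2", "4", "0.489", "*", "*", "*", "L_645,L_647", "M_644,M_646"],
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS) + "\n"

species, groups = parse_proteinortho_tsv(SAMPLE)
for group in groups:
    present = [name for name, genes in group["members"].items() if genes]
    print(group["alg_conn"], present, has_coorthologs(group))
```

Counting the non-empty cells and the listed genes per row is a quick way to confirm that the first two columns were read correctly.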


myproject.proteinortho.html

The HTML version of the myproject.proteinortho.tsv file.

POFF (-synteny)

The synteny-based graph files (myproject.ffadj-graph and myproject.poff-graph) have two additional columns: same_strand and simscore. The first one indicates whether the two genes of a match are located on the same strand (1) or not (-1). The second one is an internal score that can be interpreted as a normalized weight ranging from 0 to 1 based on the respective e-values. Moreover, a second comment line follows the species line, e.g.

# M.faa L.faa
# Scores: 4   39    34.000000     39.000000

myproject.ffadj-graph

Filtered blast data based on adaptive reciprocal best blast matches
and synteny (only if -synteny is set).


myproject.poff-graph

The clustered ffadj graph. Its connected components are represented in
myproject.poff.tsv (tab separated file) (only if -synteny is set).


EXAMPLES

Calling proteinortho

Sequences are typically given in plain FASTA format, like the files in test/:

test/C.faa:

>C_10
VVLCRYEIGGLAQVLDTQFDMYTNCHKMCSADSQVTYKEAANLTARVTTDRQKEPLTGGY
HGAKLGFLGCSLLRSRDYGYPEQNFHAKTDLFALPMGDHYCGDEGSGNAYLCDFDNQYGR
...

test/E.faa:

>E_10
CVLDNYQIALLRNVLPKLFMTKNFIEGMCGGGGEENYKAMTRATAKSTTDNQNAPLSGGF
NDGKMGTGCLPSAAKNYKYPENAVSGASNLYALIVGESYCGDENDDKAYLCDVNQYAPNV
...

To run proteinortho for these sequences, simply call

perl proteinortho6.pl test/C.faa test/E.faa test/L.faa test/M.faa

To give the outputs the name 'test', call

perl proteinortho6.pl -project=test test/*faa

To use blast instead of the default diamond, call

perl proteinortho6.pl -project=test -p=blastp+ test/*faa

If installed with make install, you can also call

proteinortho -project=test -p=blastp+ test/*faa

Hints

Using .faa to indicate that your file contains amino acids and .fna to show it contains nucleotides makes life much easier.

Sequence IDs must be unique within a single FASTA file. Consider renaming otherwise. Note: Until version 5.15, sequence IDs had to be unique across the whole dataset. Proteinortho now keeps track of name and species to avoid the necessity of renaming.

You need write permissions in the directory of your FASTA files as Proteinortho will create blast databases. If this is not the case, consider using symbolic links to the FASTA files.

The directory src contains useful tools, e.g. proteinortho_grab_proteins.pl, which fetches the protein sequences of orthologous groups from a Proteinortho output table. (These files are installed during 'make install'.)
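The uniqueness requirement for sequence IDs from the hints above can be checked before a run. The following Python sketch is one way to do it; the function name duplicate_ids is hypothetical, and the sketch assumes that the ID is the first whitespace-delimited token of each header line.

```python
from collections import Counter


def duplicate_ids(fasta_text):
    """Return the sorted sequence IDs that occur more than once."""
    counts = Counter(
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].split()
    )
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


# Made-up sample with one repeated ID, for illustration only.
SAMPLE = ">C_10\nVVLCRYEIGG\n>C_11\nHGAKLGFLGC\n>C_10\nMKTAYIAKQR\n"
print(duplicate_ids(SAMPLE))
```

To check a real file, read it with open(path).read() and pass the text to duplicate_ids; an empty list means no renaming is needed.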

Kmere Heuristic

Example 1

In the following example, a huge blast graph is used for step 3 (clustering). The first connected component contains 7410694 nodes, hence the kmere heuristic is activated. Since the Fiedler vector would result in a good split, the kmere heuristic is then deactivated immediately.

As fallback:

...
[CRITICAL WARNING]   Failed to partition subgraph with 6929 nodes into (6929,0,0) sized groups, now using kmere heuristic as fall-back.
...

Working example for large graphs:

...
17:32:15 [DEBUG] (kmere-heuristic) The current connected component is so large that the k-mere heuristic can be used. First: Testing if a normal split would result in a good partition (|.|>20%) of the CC.
 [WARNING] (kmere-heuristic) A normal split would NOT result in a good partition (|.|>20%) of the CC, therefore  the k-mere heuristic is now used. The current connected component will be split in 3.85373 (= number of proteins <6929> / ( nodes per species <1> * number of species <1798>)) groups greedily accordingly to the fiedler vector.
...
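The number of groups reported in the warning follows directly from the formula printed in the log line. The Python snippet below only reproduces that arithmetic; the function name kmere_group_count is hypothetical and does not appear in the Proteinortho source.

```python
def kmere_group_count(num_proteins, nodes_per_species, num_species):
    """Group count as printed in the log: proteins / (nodes per species * species)."""
    return num_proteins / (nodes_per_species * num_species)


# Values taken from the log line above.
print(round(kmere_group_count(6929, 1, 1798), 5))
```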

Example for large graphs, where the kmere heuristic is tested but not needed:

...
20:27:07 [DEBUG] (kmere-heuristic) The current connected component is so large that the k-mere heuristic can be used. First: Testing if a normal split would result in a good partition (|.|>20%) of the CC.
20:27:09 [DEBUG] (kmere-heuristic) A normal split would result in a good partition (|.|>20%) of the CC, therefore returning now to the normal algorithm (no k-mere heuristic).
...

Credit where credit is due

ONLINE INFORMATION

For download and online information, see https://www.bioinf.uni-leipzig.de/Software/proteinortho/ or https://gitlab.com/paulklemm_PHD/proteinortho

REFERENCES

Lechner, M., Findeisz, S., Steiner, L., Marz, M., Stadler, P. F., & Prohaska, S. J. (2011). Proteinortho: detection of (co-) orthologs in large-scale analysis. BMC bioinformatics, 12(1), 124.