TREND

Analysis

Studying the evolutionary history and function of a protein is an intriguing but nontrivial task.

Computational analysis usually starts with collecting homologous proteins.

Subsequent steps include:

creating a multiple sequence alignment;
building a phylogenetic tree;
identifying protein domains;
identifying transmembrane regions and other protein features;
determining gene neighborhoods.

The next important step is to analyze sequence conservation, protein features, and gene neighborhoods in the context of phylogenetic clustering.

Such a comprehensive analysis is time-consuming and error-prone.

To help scientists analyze their data, we have created a protein analysis hub that automates these steps and provides additional tools.

Domains

Full pipeline mode.

There are two ways the pipeline can be used in this mode:

1. Predicting protein features and building a phylogenetic tree using full-length sequences.

To use this option, paste your sequences or upload a file and click the SUBMIT button.

You can also provide a list of NCBI or MiST protein ids/locus tags instead of protein sequences, and our id analyzer will extract the sequences from NCBI and MiST.

Just don't forget to tick Retrieve sequences (NCBI and MiST).

If you have already prepared and edited your own alignment, you may skip the alignment step of the pipeline and use your alignment instead: paste it and uncheck the Align option.

If you want to use TREND only to align your sequences and build a tree, you may skip the domain identification step by unchecking the Identify Features option.

2. Building a phylogenetic tree using fragments of protein sequences and identifying protein features using full-length sequences.

To use this option:

a) Click the Add second area button. A new query area will appear.

b) Put full-length sequences into the first area (or choose a file). Protein features will be identified using these sequences.

c) Put fragments of sequences into the second area (or choose a file). Protein sequences in this area will be used for the alignment and for building the phylogenetic tree.

Important: sequence names in the two areas should be identical. If you downloaded the sequences from the same database or changed sequence names consistently in all the files, this will happen naturally.
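A quick local check before submitting can catch mismatched names. The following is only a minimal sketch: it assumes both inputs are FASTA text and that a "sequence name" is the whole header line after the ">" character. The function names are hypothetical and are not part of TREND.

```python
def fasta_names(text):
    """Return the set of sequence names (header lines without '>') in FASTA text."""
    return {line[1:].strip() for line in text.splitlines() if line.startswith(">")}


def name_mismatches(full_length_fasta, fragments_fasta):
    """Return the names that occur in only one of the two inputs."""
    full = fasta_names(full_length_fasta)
    frag = fasta_names(fragments_fasta)
    return sorted(full - frag), sorted(frag - full)


# Toy data: the second name differs only in letter case ("B" vs "b").
full_length = ">WP_000809774.1 protein A\nMKTAYIAKQR\n>b3210 protein B\nMSDNELLK\n"
fragments = ">WP_000809774.1 protein A\nAYIAK\n>b3210 protein b\nDNELL\n"

only_full, only_frag = name_mismatches(full_length, fragments)
print("Only in full-length input:", only_full)
print("Only in fragments input:", only_frag)
```

Both lists are empty when the names in the two areas are identical.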

Partial pipeline mode.

If you have your own phylogenetic tree and the protein sequences used to produce it, use this mode.

Important: sequence names in all the files should match. If you downloaded the sequences from the same database or changed sequence names consistently in all the files, this will happen naturally.

First way:

a) In the Domains section, click the Partial Pipeline button.

b) Upload the protein sequences used to build the tree (Choose file with protein sequences (fasta) button).

c) Optionally, if you want the alignment to be ordered according to the order of the tree leaves, upload your alignment (Choose file with alignment (fasta) button).

d) Upload the tree in Newick format (Choose file with tree (Newick) button).

If you only want to reorder the sequences in your alignment according to the tree leaves, upload just the alignment and the tree.
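To illustrate what reordering an alignment by tree leaves means, here is a minimal Python sketch. It is not TREND's implementation. It assumes that the leaf order is the order in which leaf names appear in the Newick string, that leaf names are unquoted, and that they match the FASTA names exactly. All function names are hypothetical.

```python
import re


def newick_leaf_order(newick):
    """Return leaf names in the order they appear in a Newick string.

    Simplification: a leaf name is the text directly after '(' or ','.
    Internal node labels follow ')' and are therefore skipped.
    Quoted names are not handled.
    """
    found = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return [name.strip() for name in found if name.strip()]


def parse_fasta(text):
    """Parse FASTA text into a dict mapping name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def reorder_alignment(alignment_fasta, newick):
    """Return the alignment as FASTA text, ordered like the tree leaves."""
    records = parse_fasta(alignment_fasta)
    ordered = [n for n in newick_leaf_order(newick) if n in records]
    return "".join(f">{n}\n{records[n]}\n" for n in ordered)


alignment = ">C\nMK-T\n>A\nMKAT\n>B\nMR-T\n"
tree = "((A:0.1,B:0.2):0.3,C:0.4);"
reordered = reorder_alignment(alignment, tree)
print(reordered)
```

Sequences whose names are absent from the tree are dropped in this sketch, which is one reason the names must match.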

Second way:

a) Put NCBI/MiST protein ids/locus tags at the beginning of the tree leaf names, separated from the rest of each name by a space, underscore (_), vertical bar (|) or forward slash (/).

b) Tick Retrieve sequences (NCBI and MiST). Our id analyzer will extract the sequences from NCBI and MiST.

c) Upload the tree in Newick format (Choose file with tree (Newick) button) and start the analysis.

If you downloaded sequences from NCBI or MiST and used them to build the tree, the ids will already be at the beginning of the sequence names and you don't have to do anything.

The Domains pipeline produces a phylogenetic tree combined with interactive protein features.

Clicking on a feature opens an information block with details of the identified feature and highlights the corresponding part of the sequence. Domain analysis details contain links to the entries in the corresponding databases (Pfam and CDD) for each identified domain. To zoom in or out, use the mouse wheel.

All the produced data is downloadable.

Neighborhoods

By running our neighborhoods analysis pipeline, you can cluster prokaryotic genes based on the shared domains of the encoded proteins and visualize the clusters and gene neighborhoods on a phylogenetic tree.

Protein names in the file with the sequences or the tree should start with protein identifiers separated from the rest of the name by a space, underscore (_), vertical bar (|) or forward slash (/).

Suitable identifiers:

a) NCBI RefSeq Id (e.g., YP_026207.1 or WP_000809774.1);

b) Locus tag, either old or new (e.g., b3210);

c) MiST Id (e.g., GCF_000005845.2-b3210).
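Because RefSeq and MiST Ids themselves contain an underscore, it can be useful to check which part of a name would be read as the identifier. The sketch below is only an illustration and does not reproduce TREND's id analyzer. The two regular expressions are assumptions inferred solely from the examples above, and the function name is hypothetical.

```python
import re

# Assumed patterns, inferred only from the example identifiers above.
REFSEQ_LIKE = re.compile(r"^[A-Z]{2}_\d+\.\d+(?=[\s_|/]|$)")
MIST_LIKE = re.compile(r"^GC[AF]_\d+\.\d+-[^\s_|/]+")
# Allowed separators: space, underscore, vertical bar, forward slash.
DELIMITERS = re.compile(r"[\s_|/]")


def leading_identifier(name):
    """Return the identifier at the start of a sequence or leaf name."""
    for pattern in (REFSEQ_LIKE, MIST_LIKE):
        match = pattern.match(name)
        if match:
            return match.group(0)
    # Otherwise treat the text before the first separator as a locus tag.
    return DELIMITERS.split(name, maxsplit=1)[0]


names = [
    "YP_026207.1_hypothetical protein",
    "WP_000809774.1 sensor kinase",
    "b3210|some protein",
    "GCF_000005845.2-b3210/some protein",
]
identifiers = [leading_identifier(n) for n in names]
print(identifiers)
```

Locus tags that contain an underscore would be cut at that underscore by this sketch, so real names may need a different separator after such tags.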

The analyzer can be run in four ways:

1) by pasting protein sequences (or uploading the corresponding file);

2) by pasting a phylogenetic tree in Newick format (or uploading the corresponding file);

3) by pasting a list of NCBI/MiST protein ids/locus tags (or uploading the corresponding file). Don't forget to check Retrieve sequences (NCBI and MiST);

4) by pasting an alignment in FASTA format (or uploading the corresponding file).

Bear in mind that RefSeq Ids are not organism-specific. If you want to explore the neighborhoods of genes from a particular organism, you should provide locus tags or MiST Ids.

The 'Operon tolerance' parameter is the maximum distance in nucleotides between neighboring genes for them to be considered as encoded in one operon.
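The effect of a distance threshold of this kind can be sketched as follows. This is a simplified illustration, not TREND's algorithm. It assumes genes are given as (name, start, end) tuples on one replicon and uses distance as the only criterion, ignoring strand and any other factor. The function name and coordinates are made up.

```python
def group_by_operon_tolerance(genes, tolerance):
    """Group genes whose distance to the previous gene is within the tolerance.

    genes: list of (name, start, end) tuples in nucleotide coordinates.
    Returns a list of groups, each a list of gene names.
    """
    ordered = sorted(genes, key=lambda gene: gene[1])
    groups = []
    previous_end = None
    for name, start, end in ordered:
        if previous_end is not None and start - previous_end <= tolerance:
            groups[-1].append(name)  # close enough: same putative operon
        else:
            groups.append([name])  # too far from the previous gene: new group
        previous_end = end
    return groups


# Made-up coordinates: geneA-geneB gap is 50 nt, geneB-geneC gap is 700 nt.
genes = [("geneA", 100, 900), ("geneB", 950, 1800), ("geneC", 2500, 3300)]
operons = group_by_operon_tolerance(genes, tolerance=200)
print(operons)
```

Raising the tolerance to 700 in this toy example would place all three genes in one group.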

The 'Not shared domains tolerance' parameter is the number of domains not shared between any two proteins that is allowed for the corresponding genes still to be considered members of the same cluster.
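One way to picture this parameter is as a pairwise test on domain sets. The sketch below assumes that "not shared" domains are those present in exactly one of the two proteins, and that repeated copies of a domain count once. How TREND actually counts them is not stated here, so treat both points as assumptions. The function names and domain lists are illustrative only.

```python
def not_shared_domains(domains_a, domains_b):
    """Return the domains present in exactly one of the two proteins."""
    return set(domains_a) ^ set(domains_b)


def within_tolerance(domains_a, domains_b, tolerance):
    """True if the number of not shared domains does not exceed the tolerance."""
    return len(not_shared_domains(domains_a, domains_b)) <= tolerance


# Illustrative domain architectures.
protein_a = ["HAMP", "HisKA", "HATPase_c"]
protein_b = ["HisKA", "HATPase_c"]
protein_c = ["Response_reg", "GerE"]

print(within_tolerance(protein_a, protein_b, tolerance=1))
print(within_tolerance(protein_a, protein_b, tolerance=0))
print(within_tolerance(protein_a, protein_c, tolerance=1))
```

With a tolerance of 0, only proteins with identical domain sets pass this test.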

The 'Number of neighboring genes on one side (max 15)' parameter is the number of neighboring genes displayed on each side of a gene of interest. The maximum is 15, i.e. up to 30 neighboring genes in total will be displayed.
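The window arithmetic can be sketched in a few lines. The example assumes genes are held in an ordered list for a linear sequence, so circular chromosomes are not wrapped around. The constant and function name are hypothetical.

```python
MAX_NEIGHBORS_PER_SIDE = 15  # upper limit stated for the parameter


def neighborhood(genes, query_index, per_side):
    """Return the genes to the left and right of the query gene."""
    per_side = min(per_side, MAX_NEIGHBORS_PER_SIDE)
    left = genes[max(0, query_index - per_side):query_index]
    right = genes[query_index + 1:query_index + 1 + per_side]
    return left, right


genes = [f"gene{i}" for i in range(50)]
left, right = neighborhood(genes, query_index=20, per_side=15)
print(len(left) + len(right), left[0], right[-1])
```

Near the end of a sequence, fewer genes are available on that side, so the total can be below 30.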

The Neighborhoods pipeline produces a phylogenetic tree combined with interactive gene neighborhoods.

Genes that belong to the same cluster are shown in the same color. Genes that belong to the same operon have borders of the same color.

Query genes that were used as input are shown with thick borders.

'Codirect' option.

Genes that are encoded in genomes in the same direction as query genes will be shown by default as forward genes (oriented left to right) regardless of their actual orientation.

Genes that are encoded in genomes in the opposite direction will be shown as reverse genes (oriented right to left) regardless of their actual orientation. This substantially facilitates pattern identification.

The default gene rendering can be changed using the 'Codirect' switch on the Neighborhoods pipeline results page. Switching it off shows genes as they are actually encoded in genomes. Reorienting genes may take some time, especially for a large input data set, so give the switch a moment.
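The rendering rule can be summarized with a small sketch. It assumes strands are recorded as "+" and "-", and that with the option switched off a "+" gene is drawn left to right. Both are assumptions made for illustration, and the function name is hypothetical.

```python
def displayed_orientation(gene_strand, query_strand, codirect=True):
    """Return 'forward' (left to right) or 'reverse' (right to left)."""
    if not codirect:
        # Option off: draw the gene as it is actually encoded.
        return "forward" if gene_strand == "+" else "reverse"
    # Option on: orientation is relative to the query gene.
    return "forward" if gene_strand == query_strand else "reverse"


# The query gene lies on the '-' strand in this toy case.
print(displayed_orientation("-", "-"))
print(displayed_orientation("+", "-"))
print(displayed_orientation("-", "-", codirect=False))
```

With the option on, the query gene itself is always drawn forward, which lines up neighborhoods from different genomes.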

Hovering the mouse over a gene in the picture shows the corresponding gene information, including its product name, NCBI and MiST ids, links to the databases, encoded domains and cluster ids. To zoom in or out, use the mouse wheel.

All the produced data is downloadable.

Credits

MAFFT

FastTree

MEGA

HMMER

TMHMM

CD-HIT

BLAST

ETE Toolkit

jsPhyloSVG

CDD

Pfam

COG

KOG

SMART

PRK

TIGRFAMs

MiST

Citation

If you use our platform for your research, please cite us. This will help us keep running it and implement new functionality.

Gumerov VM, Zhulin IB (2020) TREND: a platform for exploring protein function in prokaryotes based on phylogenetic, domain architecture and gene neighborhood analyses. Nucleic Acids Research 48: W72–W76

Browser compatibility

The first supported version is shown.

OS	Version	Chrome	Firefox	Opera	Microsoft Edge	Safari
Linux	Ubuntu 14+	63+	55+	49+	n/a	n/a
Windows	10+	63+	55+	49+	17+ Partially	n/a
MacOS	10.14+	69+	62+	56+	n/a	12+