TRee-based Exploration of Neighborhoods and Domains


1) Protein names on the tree can be selected by holding the Ctrl key and clicking them with the mouse pointer.

2) The Enumerate option adds a unique number to each protein sequence name in the alignment and to the corresponding tree leaf. This is useful for identifying patterns in sequences that are grouped in separate clades on the tree.
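
As an illustration of the idea only, the numbering can be sketched in Python as below. The function name and the "number_name" header format are assumptions made for this sketch; the exact format TREND writes is not described here.

```python
def enumerate_fasta(fasta_text: str) -> str:
    """Prefix every FASTA header with a running number.

    Hypothetical format: ">1_name", ">2_name", ... (an assumption,
    not necessarily what TREND produces).
    """
    out = []
    counter = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            counter += 1
            # Keep the original name after the number so it stays searchable.
            out.append(f">{counter}_{line[1:]}")
        else:
            # Sequence lines are copied unchanged.
            out.append(line)
    return "\n".join(out)
```

Because the same number is attached to the alignment row and to the tree leaf, a sequence seen in one view can be located in the other by its number.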

3) To draw gene neighborhoods successfully, provide protein identifiers from the RefSeq database. On the BLASTP page this database is listed in the 'Database' section as the 'Reference proteins (refseq_protein)' option. Once you have selected this option, run BLASTP against the database with your protein of interest, collect the homologous proteins, and use the collected sequence set as input for TREND.
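
Before submitting, it can help to check which identifiers in your set look like RefSeq protein accessions. The sketch below is a minimal pre-check, assuming the usual RefSeq protein accession shape (a prefix such as NP_, XP_, YP_, WP_ or AP_, digits, and an optional version suffix). The function name is hypothetical, and a matching pattern does not guarantee that the genome is already available to TREND.

```python
import re

# RefSeq protein accessions: two-letter prefix, underscore, digits,
# optional ".version" suffix.
REFSEQ_PROTEIN = re.compile(r"^(AP|NP|WP|XP|YP)_\d+(\.\d+)?$")


def split_by_refseq(identifiers):
    """Split identifiers into (RefSeq-like, other) lists by pattern only."""
    refseq, other = [], []
    for raw in identifiers:
        ident = raw.strip()
        if REFSEQ_PROTEIN.match(ident):
            refseq.append(ident)
        else:
            other.append(ident)
    return refseq, other


# Example-shaped identifiers, used only to show the behaviour.
refseq_ids, other_ids = split_by_refseq(["WP_000000001.1", "P0AE67", "NP_414543.1"])
```

Identifiers that land in the second list are the ones most likely to come back without a gene neighborhood.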

4) If you still cannot retrieve gene neighborhoods for some of your proteins even when using RefSeq proteins, the corresponding organisms are not in our database yet. Thousands of new genomes are deposited to NCBI every week. After quality assessment confirms that the genomes meet the deposition standards, they are gradually migrated to the RefSeq database. Once the genomes are there, the MiST database collects and processes them, and TREND can then process and show the neighboring genes of those genomes.

5) When you collect a set of homologous proteins, for example by running BLASTP, a redundancy reduction step is necessary because the set will contain numerous identical or very similar sequences. TREND generates a file of sequence clusters; use it to identify how many and which sequences are represented by each representative sequence that was sent to the pipeline after the reduction step.
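
The relationship between a representative and the sequences it stands for can be illustrated with a deliberately simplified sketch. It collapses only exactly identical sequences, whereas a real redundancy reduction step also groups very similar ones; the function name and the returned dictionary layout are assumptions and do not reflect the format of the cluster file that TREND writes.

```python
def collapse_identical(records):
    """Group (name, sequence) records by exact sequence identity.

    Returns {representative_name: [names of identical sequences]}.
    The first record seen for each sequence becomes the representative.
    """
    representative_by_sequence = {}
    clusters = {}
    for name, sequence in records:
        key = sequence.upper()  # treat upper and lower case as the same residue
        if key not in representative_by_sequence:
            representative_by_sequence[key] = name
            clusters[name] = []
        else:
            clusters[representative_by_sequence[key]].append(name)
    return clusters


clusters = collapse_identical([("a", "MKV"), ("b", "mkv"), ("c", "MKL")])
```

Here "a" represents itself and "b", while "c" has no other members; only "a" and "c" would continue into the alignment.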

6) FFT-NS-i is a fast, high-quality alignment algorithm. Once you understand your data, you may run the more robust L-INS-i, G-INS-i or E-INS-i algorithms on a smaller, refined set of representative sequences. L-INS-i is recommended when the proteins have one common alignable region, G-INS-i when the proteins can be aligned along their entire length, and E-INS-i when the proteins have several alignable regions interspersed with unalignable, less common regions.
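
For readers who want to reproduce these alignments outside TREND, the four strategies correspond to MAFFT command-line options as listed in the MAFFT manual. The sketch below only builds and runs the command; the function names are hypothetical, a local mafft executable is assumed, and the exact settings TREND itself passes to MAFFT are not stated here.

```python
import subprocess

# Option sets as documented in the MAFFT manual for each strategy.
MAFFT_MODES = {
    "FFT-NS-i": ["--retree", "2", "--maxiterate", "2"],
    "L-INS-i": ["--localpair", "--maxiterate", "1000"],
    "G-INS-i": ["--globalpair", "--maxiterate", "1000"],
    "E-INS-i": ["--ep", "0", "--genafpair", "--maxiterate", "1000"],
}


def mafft_command(mode, input_fasta):
    """Return the argument list for the chosen MAFFT strategy."""
    if mode not in MAFFT_MODES:
        raise ValueError(f"unknown MAFFT mode: {mode}")
    return ["mafft", *MAFFT_MODES[mode], input_fasta]


def run_mafft(mode, input_fasta, output_fasta):
    """Run MAFFT and write the alignment (MAFFT prints it to stdout)."""
    with open(output_fasta, "w") as handle:
        subprocess.run(mafft_command(mode, input_fasta), stdout=handle, check=True)
```

A typical workflow would call run_mafft("FFT-NS-i", ...) on the full set first and switch to one of the slower modes only for the reduced set.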

7) FastTree is an approximately maximum-likelihood algorithm that produces phylogenetic trees of very good quality. We recommend using it as a first exploratory step of your analysis, or as the only step if you have a large dataset. Once you have explored the FastTree tree and established what kind of data you are dealing with, try the MEGA algorithms on a smaller, refined dataset. You can reduce dataset redundancy by running the redundancy reduction step of TREND.
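
A stand-alone FastTree run on a protein alignment can be sketched as follows. This assumes an executable named FastTree on the PATH (some packages install it under a different name), and the function names are hypothetical. FastTree uses the JTT model for proteins unless -lg or -wag is given; which model TREND selects is not stated here.

```python
import subprocess


def fasttree_command(alignment_path, model=None):
    """Build the FastTree argument list for a protein alignment.

    model: None (FastTree default, JTT), "lg" or "wag".
    """
    command = ["FastTree"]
    if model in ("lg", "wag"):
        command.append(f"-{model}")
    elif model is not None:
        raise ValueError(f"unsupported model: {model}")
    command.append(alignment_path)
    return command


def run_fasttree(alignment_path, tree_path, model=None):
    """Run FastTree and save the Newick tree it prints to stdout."""
    with open(tree_path, "w") as handle:
        subprocess.run(fasttree_command(alignment_path, model), stdout=handle, check=True)
```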

8) Adjusting the 'Not shared domains tolerance' parameter can help to identify gene clusters that have only some domains in common. Subtle, non-obvious regularities can be uncovered with this parameter.
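
One way to picture such a tolerance is sketched below. This is an assumed interpretation for illustration only, not TREND's documented rule: two genes are treated as belonging to the same cluster when they share at least one domain and the number of domains found in only one of them does not exceed the tolerance. The function names are hypothetical.

```python
def unshared_domain_count(domains_a, domains_b):
    """Count domains present in exactly one of the two genes."""
    return len(set(domains_a) ^ set(domains_b))


def same_cluster(domains_a, domains_b, tolerance):
    """Assumed rule: at least one shared domain, and unshared <= tolerance."""
    shared = set(domains_a) & set(domains_b)
    return bool(shared) and unshared_domain_count(domains_a, domains_b) <= tolerance


kinase = ["HisKA", "HATPase_c"]
kinase_with_sensor = ["HisKA", "HATPase_c", "PAS"]

strict = same_cluster(kinase, kinase_with_sensor, tolerance=0)
relaxed = same_cluster(kinase, kinase_with_sensor, tolerance=1)
```

Under this reading, raising the tolerance from 0 to 1 lets the two proteins group together despite the extra PAS domain, which is the kind of partial similarity the parameter is meant to expose.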