Institute of Bioinformatics WWU Münster
TEclass - Classification of TE consensus sequences

Description

TEclass classifies unknown transposable element (TE) consensus sequences into four categories according to their mechanism of transposition: DNA transposons, LTRs, LINEs, and SINEs. The classification uses support vector machines, random forests, and learning vector quantisation, and TEclass also predicts ORFs. In the current version, the input sequences must be in FASTA format. You can either upload the file you want to process or paste the sequences directly. Note that the tool cannot distinguish between TEs and non-TEs, so every sequence is classified into one of the four categories (or, in ambiguous cases, marked as unknown) even if it is not a TE.
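Because the input must be in FASTA format, it can help to check a file locally before uploading it. The following is a minimal sketch of such a check, not part of TEclass itself; the function name parse_fasta and its error messages are assumptions made for illustration.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs.

    Raises ValueError if sequence data precedes the first header or if a
    record has no sequence. This is an illustrative check, not the parser
    used by the TEclass server.
    """
    records: List[Tuple[str, List[str]]] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A header line opens a new record.
            records.append((line[1:].strip(), []))
        elif not records:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequence lines are appended to the most recent record.
            records[-1][1].append(line)

    parsed = [(header, "".join(chunks)) for header, chunks in records]
    for header, sequence in parsed:
        if not sequence:
            raise ValueError(f"record {header!r} has no sequence")
    return parsed


if __name__ == "__main__":
    example = ">te1 demo\nACGT\nTTGA\n>te2\nGGCC\n"
    for header, sequence in parse_fasta(example):
        print(header, len(sequence))
```

Multi-line sequences are joined into one string per record, so the length of each consensus can be read directly from the result.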

Notes

TEclass is not a tool for annotating whole-genome data, so it is not a replacement for RepeatMasker or Censor. Its primary purpose is to classify repeat libraries that can subsequently be used by these two tools. The input should therefore not contain more than a few thousand sequences; if you have significantly more, it is a sign that you are almost certainly using TEclass improperly.

The entered data must not exceed 1 MB in size!
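The two limits in these notes (input size and number of sequences) can be checked before submission. The sketch below is only an illustration: the text does not say whether 1 MB means 10^6 or 2^20 bytes, so the stricter 10^6 is assumed, and the cutoff of 5000 sequences is a hypothetical stand-in for "a few thousand".

```python
from typing import List

# Assumption: "1 MB" is read as 10**6 bytes; the server may instead use 2**20.
MAX_BYTES = 1_000_000
# Assumption: hypothetical cutoff standing in for "a few thousand sequences".
MAX_SEQUENCES = 5_000


def check_submission(
    fasta_text: str,
    max_bytes: int = MAX_BYTES,
    max_sequences: int = MAX_SEQUENCES,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []

    # Measure the encoded size, since the limit applies to the data sent.
    size = len(fasta_text.encode("utf-8"))
    if size > max_bytes:
        problems.append(f"input is {size} bytes, above the {max_bytes}-byte limit")

    # Each FASTA record starts with a '>' header line.
    n_records = sum(1 for line in fasta_text.splitlines() if line.startswith(">"))
    if n_records > max_sequences:
        problems.append(
            f"input has {n_records} sequences, more than {max_sequences}; "
            "it may be whole-genome data rather than a repeat library"
        )
    return problems


if __name__ == "__main__":
    print(check_submission(">te1\nACGTACGT\n"))
```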

Methods

We analyze repeats in four length classes (0-600 bp, 601-1800 bp, 1801-4000 bp, and >4000 bp) and build independent classifiers for each of them. We use libsvm as the SVM engine, with a Gaussian kernel. Classification proceeds as a series of binary decisions, in the following order: forward versus reverse sequence orientation, then DNA transposon versus retrotransposon, then LTRs versus non-LTRs (for retroelements), then LINEs versus SINEs (for non-LTR repeats). The last step is performed only for repeats with lengths below 1800 bp, because we are not aware of SINEs longer than 1800 bp. Separate classifiers were built for each length class and for each classification step. If the different classification methods lead to conflicting results, TEclass reports the repeat either as unknown or as the last category on which the classification methods agree.
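The length classes, the binary cascade, and the handling of conflicting results can be sketched as follows. This is an illustration under stated assumptions, not the TEclass implementation: the classifiers are hypothetical callables keyed by the made-up names is_dna, is_ltr, and is_sine, the orientation step is left out for brevity, the 1800 bp cutoff is taken as the upper bound of the 601-1800 bp class, long non-LTR repeats are assumed to be labelled LINE, and the agreement rule is assumed to be the longest shared prefix of the per-method decision paths.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for a trained binary classifier (SVM, random forest, ...).
BinaryClassifier = Callable[[str], bool]


def length_class(n_bp: int) -> str:
    """Return the length class from the text for a repeat of n_bp bases."""
    if n_bp <= 600:
        return "0-600"
    if n_bp <= 1800:
        return "601-1800"
    if n_bp <= 4000:
        return "1801-4000"
    return ">4000"


def classify_path(seq: str, steps: Dict[str, BinaryClassifier]) -> List[str]:
    """Walk the binary cascade and return the categories chosen, in order."""
    if steps["is_dna"](seq):
        return ["DNA"]
    path = ["Retro"]
    if steps["is_ltr"](seq):
        return path + ["LTR"]
    path.append("nonLTR")
    if len(seq) <= 1800:
        # LINE versus SINE is only decided for the two shortest length classes.
        path.append("SINE" if steps["is_sine"](seq) else "LINE")
    else:
        # Assumption: longer non-LTR repeats are labelled LINE without a classifier.
        path.append("LINE")
    return path


def consensus(paths: Sequence[Sequence[str]]) -> str:
    """Return the last category all methods agree on, or 'unknown' if none."""
    agreed: List[str] = []
    for level in zip(*paths):
        if len(set(level)) != 1:
            break  # methods diverge at this step
        agreed.append(level[0])
    return agreed[-1] if agreed else "unknown"


if __name__ == "__main__":
    toy_steps = {
        "is_dna": lambda s: False,
        "is_ltr": lambda s: False,
        "is_sine": lambda s: len(s) < 500,
    }
    seq = "ACGT" * 75  # 300 bp toy sequence
    print(length_class(len(seq)), classify_path(seq, toy_steps))
    print(consensus([["Retro", "LTR"], ["Retro", "nonLTR", "LINE"]]))
```

In a real setup, the dictionary passed to classify_path would be chosen per length class, since separate classifiers exist for each class and each step.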

Citation

Please cite: Abrusan G, Grundmann N, DeMeester L, Makalowski W 2009. TEclass: a tool for automated classification of unknown eukaryotic transposable elements. Bioinformatics 25:1329-1330.

Download

You can download the TEclass scripts from here, and the pre-built classifiers from here. The classifiers were built using version 21 of RepBase.

Links

Tools for de novo reconstruction of repeat consensi:
RepeatModeler (Arian Smit and Robert Hubley)
RepeatScout (Price et al. 2005)
RECON (Bao and Eddy 2002)
Piler (Edgar and Myers 2005)

Tools for similarity-based repeat identification:
RepeatMasker (Arian Smit and Robert Hubley)
Censor (Jurka et al.)

Credits

Please contact the bioinformatics team or the author directly at gyorgy01||gmail||com (replace || with the appropriate signs) if you have any questions. The classification tool was written by György Abrusán and was funded by the Katholieke Universiteit Leuven, Belgium (postdoctoral fellowship for G.A.) and the University of Münster.