This script attempts to classify unknown transposable element (TE) consensus sequences into four categories, according to their mechanism of transposition: DNA transposons, LTRs, LINEs, and SINEs. The classification is based on machine learning and uses LIBSVM as the classification engine. In the current version the input sequences must be in FASTA format. You can either upload the file you want to process or paste the sequences directly. Note that the tool cannot distinguish between TEs and non-TEs, so every sequence will be classified into one of the four categories (or, in ambiguous cases, marked as unknown).
We analyze repeats in four size categories: 0-600 bp, 601-1800 bp, 1801-4000 bp, and >4000 bp. We use LIBSVM as the SVM engine, with a Gaussian kernel. Classification proceeds through a series of binary steps: forward versus reverse sequence orientation > DNA transposon versus retrotransposon > LTRs versus non-LTRs (for retroelements) > LINEs versus SINEs (for non-LTR repeats). The last step is performed only for repeats with lengths below 1800 bp, because we are not aware of SINEs longer than 1800 bp. Separate classifiers were built for each length class and for each classification step. In each classification step the sequence of a TE is represented as a vector of oligomer frequencies, which is used as the input for the SVM engine.
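As an illustration of the representation described above, the following Python sketch assigns a sequence to one of the four length classes and converts it into a vector of oligomer frequencies. It is not the TEclass code. The function names are hypothetical, and several details are assumptions not stated here: overlapping windows are counted, windows containing characters other than A, C, G, T are skipped, and counts are normalized to sum to 1.

```python
from itertools import product

# Closed length classes (bp) as listed in the text; the last class is open-ended.
LENGTH_CLASSES = [(0, 600), (601, 1800), (1801, 4000)]


def length_class(n):
    """Return a label for the length class that a sequence of n bp falls into."""
    for lo, hi in LENGTH_CLASSES:
        if lo <= n <= hi:
            return f"{lo}-{hi}"
    return ">4000"


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k oligomers, in lexicographic order.

    Assumptions (not confirmed by the source): overlapping windows,
    windows with non-ACGT characters are skipped, counts sum to 1.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skips windows containing N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]
```

With k = 4 the vector has 256 entries and with k = 5 it has 1024. A vector of this kind, computed for a sequence in a given length class, would be the input to the classifier built for that class and classification step.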
Two complete sets of classifiers were built, one using tetramers and one using pentamers, and they are used in two separate rounds of classification. The first round uses the models based on tetramer frequencies, and the second round uses the models based on pentamer frequencies. The reported result is the last step at which the two rounds agree; for example, if the first round classifies a TE as LTR while the second classifies it as LINE, it is reported as a retroelement.
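The agreement rule can be sketched in Python by writing each final class as its path through the binary steps and reporting the deepest label shared by both rounds. This is a minimal sketch, not the TEclass implementation: the label strings, the `PATHS` table, and the `consensus` function are hypothetical, the orientation step is left out, and returning "unknown" when the rounds disagree at the first step is an assumption.

```python
# Each class written as its path through the binary decision steps
# (hypothetical labels; the orientation step is omitted).
PATHS = {
    "DNA": ["DNA"],
    "LTR": ["Retro", "LTR"],
    "LINE": ["Retro", "nonLTR", "LINE"],
    "SINE": ["Retro", "nonLTR", "SINE"],
}


def consensus(tetramer_call, pentamer_call):
    """Return the deepest label on which both rounds agree.

    Assumption: disagreement at the very first step yields "unknown".
    """
    agreed = "unknown"
    for a, b in zip(PATHS[tetramer_call], PATHS[pentamer_call]):
        if a != b:
            break
        agreed = a
    return agreed
```

For example, `consensus("LTR", "LINE")` gives `"Retro"`, which matches the retroelement example in the text, while `consensus("LINE", "LINE")` gives `"LINE"`.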
A detailed description of the methods can be found in Abrusan et al. 2009. Bioinformatics 25:1329-1330
You can download the TEclass scripts from here, and the pre-built tetramer and pentamer models.
Both were built using the 2009/01/20 release of RepBase (RepeatMasker edition).
Tools for de novo reconstruction of repeat consensi:
RepeatModeler (Arian Smit and Robert Hubley)
RepeatScout (Price et al. 2005)
RECON (Bao and Eddy 2002)
Piler (Edgar and Myers 2005)
Tools for similarity based repeat identification:
RepeatMasker (Arian Smit and Robert Hubley)
Censor (Jurka et al.)