This is the website for Vmatch,
a versatile software tool for efficiently
solving large-scale sequence matching tasks.
Vmatch subsumes the software tool
but is much more general, with a very flexible user interface,
and improved space and time requirements.
Usually, in a large scale matching problem, extensive portions of the sequences
under consideration are static, i.e. they do not change much over time.
Therefore it makes sense to preprocess this static data
to extract information from it and to store this in a structured manner,
allowing efficient searches.
Vmatch does exactly this: it preprocesses a set of sequences
into an index structure. This is stored as a collection of several files
constituting the persistent index. The index efficiently represents all
substrings of the preprocessed sequences and, unlike
many other sequence comparison tools, allows matching tasks to be solved
in time that is independent of the size of the index. Different matching
tasks require different parts of the index, but only the required parts
of the index are accessed during the matching process.
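
The idea of preprocessing static sequences once and then answering many queries against the result can be illustrated with a plain suffix array. The following Python sketch is only an illustration under simplifying assumptions: it is not Vmatch's implementation, it keeps the index in memory instead of in persistent files, and its binary search still depends logarithmically on the text length. The names build_suffix_array and find_occurrences are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction by sorting suffixes; suitable for small examples only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of pattern in text via binary search."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


if __name__ == "__main__":
    text = "acgtacgtt"
    index = build_suffix_array(text)          # preprocessing step, done once
    print(find_occurrences(text, index, "acg"))  # query step, repeated often
```

The index is built once and reused for every query, which is the property the paragraph above relies on.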
Most software tools for sequence analysis are restricted to DNA
and/or protein sequences. In contrast, Vmatch can process sequences
over any user defined alphabet not larger than 250 symbols.
Vmatch fully implements the concept of symbol mappings,
denoting alphabet transformations. These allow the user to specify that
different characters in the input sequences should be considered identical
in the matching process. This feature is used
to group similar amino acids, for example.
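
A symbol mapping can be pictured as a table that sends every input character to the code of its group. The sketch below is a minimal illustration in Python, not Vmatch's symbol map file format. The function names and the amino acid grouping are assumptions made up for this example.

```python
def build_symbol_map(groups):
    """Map every character of each group to one shared integer code."""
    mapping = {}
    for code, group in enumerate(groups):
        for symbol in group:
            mapping[symbol] = code
    return mapping


def encode(sequence, mapping):
    """Translate a sequence into the list of group codes of its characters."""
    return [mapping[symbol] for symbol in sequence]


# Hypothetical grouping: characters in the same string are treated as identical.
EXAMPLE_GROUPS = ["LVIM", "DE", "KR", "ST"]

if __name__ == "__main__":
    symbol_map = build_symbol_map(EXAMPLE_GROUPS)
    # "LDK" and "VER" differ as strings but are identical after the mapping.
    print(encode("LDK", symbol_map) == encode("VER", symbol_map))
```

Matching on the encoded sequences instead of the raw characters is what makes different symbols count as identical.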
Vmatch allows a multitude of different matching tasks to be solved
using the persistent index. Every matching task is basically characterized by
(1) the kind of sequences to be matched,
(2) the kind of matches sought,
(3) additional constraints on the matches, and
(4) the kind of postprocessing to be done with the matches.
In the standard case, Vmatch matches sequences over the same alphabet.
Additionally, DNA sequences can be matched against a protein sequence
index in all six reading frames. Finally, DNA sequences can be
translated in all six reading frames and compared against themselves.
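
Translation in all six reading frames means translating the three codon offsets of the forward strand and the three offsets of the reverse complement. The following Python sketch shows this under stated assumptions: it uses the standard genetic code, writes "X" for codons containing other characters, and is not taken from Vmatch or mkdna6idx. All names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for all three codon positions.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return the translations of forward frames 0-2, then reverse frames 0-2."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:])
            for strand in strands for offset in range(3)]


if __name__ == "__main__":
    for frame, protein in enumerate(six_frame_translations("ATGGCC")):
        print(frame, protein)
```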
Where appropriate, Vmatch can compute the
following kinds of matches, using state-of-the-art algorithms:
maximal substring matches
using the algorithm of .
maximal unique substring matches
using the algorithm of .
using the algorithms of  and 
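
What makes a substring match maximal can be shown by extending an exact match in both directions until the neighbouring characters differ or a sequence boundary is reached. The Python sketch below illustrates only this definition. It is not one of the algorithms cited above, and the function name is hypothetical.

```python
def extend_to_maximal(s, t, i, j, length):
    """Extend the exact match s[i:i+length] == t[j:j+length] to a maximal one.

    Returns the start positions in s and t and the length of the match
    after it can be extended neither to the left nor to the right.
    """
    # Extend to the left while the preceding characters agree.
    while i > 0 and j > 0 and s[i - 1] == t[j - 1]:
        i -= 1
        j -= 1
        length += 1
    # Extend to the right while the following characters agree.
    while (i + length < len(s) and j + length < len(t)
           and s[i + length] == t[j + length]):
        length += 1
    return i, j, length


if __name__ == "__main__":
    # The single matching character at position 3 grows to the match "acgt".
    print(extend_to_maximal("xxacgtyy", "zzacgtww", 3, 3, 1))
```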
To compute degenerate substring matches or degenerate repeats,
each kind of match (with the exception of tandem repeats and complete matches)
can be taken as an exact seed and extended by either
of two different strategies:
the maximum error extension strategy, as described in
 for repeat detection,
Vmatch is based on enhanced suffix arrays
described in .
This data structure
has been shown to be as powerful as suffix trees, with the advantage of
a reduced space requirement and reduced processing time. Careful
implementation of the algorithms and data structures incorporated in
Vmatch has led to exceedingly fast and robust software,
allowing very large sequence sets to be processed quickly.
The 32-bit version of Vmatch can process
up to 400 million symbols, if enough memory is available.
For large server-class machines, Vmatch
is available as a 64-bit version, enabling gigabytes of sequences
to be processed.
The most common formats for input sequences (Fasta, Genbank, EMBL, and
SWISSPROT) are accepted. The user does not have to specify the input
format. It is automatically recognized. All input files can contain
an arbitrary number of sequences. Gzip-compressed inputs are accepted.
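
Automatic format recognition can be approximated by looking at how a file begins. The Python sketch below is a heuristic illustration, not Vmatch's recognition logic: it assumes that Fasta records start with ">", GenBank entries with "LOCUS", and EMBL or SWISSPROT entries with an "ID" line, and it detects gzip compression by the two leading magic bytes. The function names are hypothetical.

```python
import gzip


def open_maybe_gzipped(path):
    """Open a text file, transparently decompressing it if it is gzipped."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "r")


def guess_format(first_line):
    """Guess the sequence format from the first line of an input file."""
    if first_line.startswith(">"):
        return "fasta"
    if first_line.startswith("LOCUS"):
        return "genbank"
    if first_line.startswith("ID "):
        # EMBL and SWISSPROT entries both begin with an ID line.
        return "embl-or-swissprot"
    return "unknown"


if __name__ == "__main__":
    print(guess_format(">seq1 example sequence"))
```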
Vmatch's output can be parsed by other programs easily.
Furthermore, several options allow for its customization.
XML output is available and new output formats can easily be
incorporated without changing Vmatch's
program code. Certain matches can easily be selected by
user defined criteria, without intermediate output and subsequent
Up until now we have referred to Vmatch as a collection
of programs. In the following we use the same name,
vmatch (in typewriter font), for the most important
program in this collection. Besides vmatch, there are
the following programs available:
mkvtree constructs the persistent index and stores it in files.
mkdna6idx constructs an index for a DNA sequence
after translating this in all six reading frames.
vseqinfo delivers information about indexed database sequences.
vstree2tex outputs a representation of the index
in LaTeX format. It can be used, for example, for educational
or debugging purposes.
vseqselect selects indexed sequences satisfying specific
vsubseqselect selects substrings of a specified length range
from an index.
vmigrate.sh converts an index from big endian to
little endian architectures, or vice versa.
vmatchselect sorts and selects matches
delivered by vmatch.
chain2dim computes optimal chains of matches from files
matchcluster computes clusters of matches from files
GenomeThreader is a software tool to compute gene structure predictions. The gene structure
predictions are calculated using a similarity-based approach where additional
cDNA/EST and/or protein sequences are used to predict gene structures via
spliced alignments. GenomeThreader uses the matching capabilities
of Vmatch to efficiently map the reference sequence to a genomic
sequence. For details, see .
Following is a list of completed and ongoing projects in which Vmatch
has been successfully used:
developed at the Lawrence Livermore National
Laboratory, uses Vmatch to detect unique substrings in large
collections of DNA sequences. These unique substrings serve as
signatures allowing for rapid and accurate diagnostics
to identify pathogenic bacteria and viruses. A similar application
is reported in .
In , Vmatch is used to compute a
non-redundant set from a large collection of protein sequences
from maize (Zea mays). Similar applications are used in the
For the development of the
Barley1 GeneChip, Vmatch is used to search
The latest assembly of the Arabidopsis thaliana genome
(GenBank entries of 2/19/04) contains vector sequence contaminations.
For example, region 3,617,880 to 3,625,027 of chromosome II is a cloning
vector. Vmatch was used to detect the vector contamination, see
 developed by Jacques van Helden
use Vmatch to purge
sequences before computing
sequence statistics. Similar applications are reported in
The program SpliceNest
 computes gene indices and uses Vmatch to map
clustered sequences to large genomes.
The oligo design program
Promide, developed by
Sven Rahmann, is based on the persistent index structure of Vmatch.
Promide uses mkvtree for generating the index.
e2g is a web-based server which efficiently maps large
EST and cDNA data sets to genomic DNA. The use of Vmatch
makes it possible to significantly extend the size of the data that can be
mapped in reasonable time. e2g is available as a web service and hosts
large collections of EST sequences (e.g. 4.1 million mouse ESTs
of 1.87 Gbp) in a precomputed persistent index. For details see
PlantGDB  provides
a service called
for genome wide pattern searches in plant sequences. The service is based
The Mu Transposon Information
used Vmatch to (1) match 130,861 vector-trimmed sequences against the
maize repeat database, and (2) cluster near-identical sequences.
See  for details.
In  Vmatch was used to
reveal long repeats inside human chromosome 1 and long similar regions
between human chromosome 1 and all other human chromosomes.
In  Vmatch was used to cluster
317,242 EST and cDNA sequences from Xenopus laevis.
Vmatch was chosen for the following reasons:
First, there was no clustering tool available which could handle
large data sets efficiently, and which was documented well enough to
allow a detailed replication and evaluation of existing clusters.
Second, Vmatch identifies similarities between sequences rapidly,
and it provides additional options to cluster a set of sequences
based on these matches. Furthermore, the Vmatch output provides
information about how the clusters were derived. Due to the
efficiency of Vmatch, it was possible to perform the clustering for a
wide variety of parameters on the complete sequence set.
This makes it possible to study the effect of the parameter choice on the clustering.
In  Vmatch was used for three different tasks:
(1) searching spliced mRNA in the Arabidopsis genome to detect
micromatches of length at least 20 with at most 2 mismatches,
(2) finding matches of length at least 15 with at most one mismatch
between predicted mature miRNA sequences and a set of ESTs as well
as sequences from the Arabidopsis Small RNA Project (ASRP), and
(3) aligning and performing single linkage clustering
of the predicted mature miRNA sequences. Candidate pairs aligning over at least
17 bases, allowing an edit distance of 1, were grouped in the same family.
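
Single linkage clustering as used in the third task puts two sequences in the same family whenever a chain of pairwise similar sequences connects them. The Python sketch below illustrates this with a union-find structure. It is a simplified assumption-laden example, not the Vmatch clustering code: it compares whole strings instead of local alignments over at least 17 bases, and the function names are hypothetical.

```python
def within_one_edit(a, b):
    """Return True if a and b differ by at most one substitution or indel."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equally long) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substitution at position i
    return a[i:] == b[i + 1:]          # one extra character in b at position i


def single_linkage(items, similar):
    """Group items so that similar pairs, directly or via chains, share a cluster."""
    parent = list(range(len(items)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if similar(items[i], items[j]):
                parent[find(i)] = find(j)

    clusters = {}
    for i, item in enumerate(items):
        clusters.setdefault(find(i), []).append(item)
    return list(clusters.values())


if __name__ == "__main__":
    sequences = ["acgt", "acga", "ccga", "tttt"]
    print(single_linkage(sequences, within_one_edit))
```

In the example, "acgt" and "ccga" differ in two positions but still end up in one cluster because both are within one edit of "acga".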
CrossLink is a versatile computational tool which aids in visualizing
relationships between RNA sequences (particularly between ncRNAs and
their putative target transcripts) in an intuitive and accessible way.
Besides BLAST, CrossLink uses Vmatch to reveal the sequence
relationships to be visualized.
The DOE Joint Genome Institute
used Vmatch to
identify and mask all continuous non-unique sequence fragments over
500 bp in Frankia sp. and Shewanella oneidensis.
In , Seibel et al. describe
methods for creating web-services and give examples which, among other tools,
also integrate Vmatch.
In , Pobigaylo et al.
use Vmatch to map signature tags to the genome of S. meliloti.
In , Liang et al. use Vmatch for vector screening.
[21,20] make use of Vmatch to
efficiently find maximal repeats, as a first step in localizing
clustered regularly interspaced short palindromic repeats (CRISPRs).
The program Gepard  uses mkvtree to
compute enhanced suffix arrays.
The MIPSPlantsDB database
uses Vmatch to cluster large sequence sets.
In , Vmatch was
used to compare target genes of the tomato Chs RNAi to a tomato gene index.
In , Vmatch was used to search different
plant genomes for matches of length at least 20 with a maximum of 2 mismatches.
Here it is important that Vmatch performs an exhaustive search.
In , Vmatch was used to map
millions of short sequence reads to the A. thaliana genome.
Up to four mismatches and up to three indels were allowed in the matching
process. The seed size was chosen to be 0. The reads were aligned using the
best match strategy by iteratively increasing the allowed number of
mismatches and gaps in each round.
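
The best match strategy described above can be sketched as a loop that starts with zero allowed mismatches and relaxes the threshold only if no hit is found. The Python example below is a simplified illustration and not the pipeline used in the cited work: it scans the genome by brute force instead of using an index, counts mismatches only (no indels), and uses hypothetical function names.

```python
def hamming_hits(read, genome, max_mismatches):
    """Return all positions where read matches genome with few enough mismatches."""
    hits = []
    for pos in range(len(genome) - len(read) + 1):
        window = genome[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(pos)
    return hits


def best_match(read, genome, max_mismatches=4):
    """Return (mismatches, positions) for the smallest threshold that yields hits.

    Returns None if the read does not match even with max_mismatches.
    """
    for allowed in range(max_mismatches + 1):
        hits = hamming_hits(read, genome, allowed)
        if hits:
            return allowed, hits
    return None


if __name__ == "__main__":
    genome = "aaacgtaaa"
    print(best_match("acgt", genome))  # exact hit found in the first round
    print(best_match("acct", genome))  # found only after allowing one mismatch
```

Because the threshold grows round by round, a read is always reported with the smallest number of mismatches at which it maps.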
In , Vmatch was used to map
millions of short sequence reads to the A. thaliana genome.
Vmatch was part of a multi-step pipeline, combining a fast
matching algorithm (Vmatch) for initial read mapping and
an optimal alignment algorithm based on dynamic programming (QPALMA)
for high quality detection of splice sites.
In , Vmatch was used to map
RNAseq reads in both forward and reverse complementary orientation (options
and -p), allowing up to two mismatches (option -h 2), requiring the whole read
to map (option -l 36), and generating maximal substring matches that are unique
in the reference dataset (option -mum cand).
Vmatch is available in executable format for the following platforms:
32-bit Linux for Intel and AMD architectures
64-bit Linux for Intel and AMD architectures
Mac OSX for Apple Intel.
If you need Vmatch for an additional platform, then please contact
If you want to use Vmatch for
academic research, educational and demonstration purposes you may obtain
a free of charge non-commercial license as follows: download the
license agreement form, read
it, sign it, and fax it to the number given in the agreement. If you
want to obtain a commercial license for Vmatch, then
please directly contact
M.I. Abouelhoda, S. Kurtz, and E. Ohlebusch.
The enhanced suffix array and its applications to genome analysis.
In Proceedings of the Second Workshop on Algorithms in
Bioinformatics, pages 449-463. Lecture Notes in Computer Science 2452,
M.I. Abouelhoda and E. Ohlebusch.
A Local Chaining Algorithm and its Applications in Comparative
In Proc. 3rd Worksh. Algorithms in Bioinformatics (WABI 2003),
number 2812 in Lecture Notes in Bioinformatics, pages 1-16. Springer-Verlag,
P.G. Buckley, C. Jarbo, U. Menzel, T. Mathiesen, C. Scott, S.G. Gregory, C.F.
Langford, and J.P. Dumanski.
Comprehensive DNA Copy Number Profiling of Meningioma Using a
Chromosome 1 Tiling Path Microarray identifies Novel Candidate Tumor
Cancer Res., 65(7):2653-2661, 2005.
T. Dezulian, M. Schaefer, R. Wiese, D. Weigel, and D.H. Huson.
CrossLink: visualization and exploration of sequence relationships
between (micro) RNAs.
Nucleic Acids Res., 34(Web Server Issue):W400-W404, 2006.
J.A. Eisen, R.S. Coyne, M. Wu, D. Wu, M. Thiagarajan, J.R. Wortman, J.H.
Badger, Q. Ren, P. Amedeo, and K.M. Jones et al.
Macronuclear Genome Sequence of the Ciliate Tetrahymena thermophila,
a Model Eukaryote.
PLoS Biology, 4(9):e286, 2006.
J.P. Fitch, S.N. Gardner, T.A. Kuczmarski, S. Kurtz, R. Myers, L.L. Ott, T.R.
Slezak, E.A. Vitalis, A.T. Zemla, and P.M. McCready.
Rapid development of nucleic acid diagnostics.
Proceedings of the IEEE, 90(11):1708-1721, 2002.
S.N. Gardner, T.A. Kuczmarski, E.A. Vitalis, and T.R. Slezak.
Limitations of TaqMan PCR for Detecting Viral Pathogens Illustrated
by Hepatitis A, B, C, and E Viruses and Human Immunodeficiency Virus.
J. of Clinical Microbiology, 41(6):2417-2427, 2003.
R.J.M. Hulzink, H. Weerdesteyn, A.F. Croes, M.M.A. Gerats, T. van Herpen, and
J. van Helden.
In Silico Identification of Putative Regulatory Sequence Elements in
the 5'-Untranslated Region of Genes That Are Expressed during Male
Gametogenesis Gene Co-regulation.
Plant Physiol., 132:75-83, 2003.
S. Kurtz, J.V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and
REPuter: The manifold applications of repeat analysis on a genomic
Nucleic Acids Res., 29(22):4633-4642, 2001.
N. Pobigaylo, D. Wetter, S. Szymczak, U. Schiller, S. Kurtz, F. Meyer, T.W.
Nattkemper, and A. Becker.
Construction of a large signature-tagged mini-Tn5 transposon
library and its application to mutagenesis of Sinorhizobium
Appl Environ Microbiol., 72(6):4329-4337, 2006.
J.-F. Pombert, C. Lemieux, and M. Turmel.
The complete chloroplast DNA sequence of the green alga
Oltmannsiellopsis viridis reveals a distinctive quadripartite architecture in
the chloroplast genome of early diverging ulvophytes.
BMC Biology, 4:3, 2006.
E.G.W.M. Schijlen, C.H. Ric de Vos, S. Martens, H.H. Jonker, F.M. Rosin, J.W.
Molthoff, Y.M. Tikunov, G.C. Angenent, A.J. van Tunen, and A.G. Bovy.
RNA interference silencing of chalcone synthase, the first step
in the flavonoid biosynthesis pathway, leads to parthenocarpic tomato fruits.
Plant Physiol, 144(3):1520-30, 2007.
P.N. Seibel, J. Krüger, S. Hartmeier, K. Schwarzer, K. Löwenthal,
H. Mersch, T. Dandekar, and R. Giegerich.
XML schemas for common bioinformatic data types and their
application in workflow systems.
BMC Bioinformatics, 7:490, 2006.
T. Slezak, T. Kuczmarski, L. Ott, C. Torres, D. Medeiros, J. Smith, B. Truitt,
N. Mulakken, M. Lam, E. Vitalis, A. Zemla, C.E. Zhou, and S. Gardner.
Comparative Genomics Tools Applied to Bioterrorism Defense.
Briefings in Bioinformatics, 4(2):133-149, 2003.
M. Spannagl, O. Noubibou, D. Haase, L. Yang, H. Gundlach, T. Hindemitt,
K. Klee, G. Haberer, H. Schoof, and K.F.X. Mayer.
MIPSPlantsDB-plant database resource for integrative and
comparative plant genome research.
Nucleic Acids Res, 35(Database issue):D834-40, 2007.
M. Turmel, C. Otis, and C. Lemieux.
The Chloroplast Genome Sequence of Chara vulgaris Sheds New Light
into the Closest Green Algal Relatives of Land Plants.
Molecular Biology and Evolution, 23:1324-1338, 2006.
P. Wolff, I. Weinhofer, J. Seguin, P. Roszak,
C. Beisel, M.T. Donoghue, C Spillane, M. Nordborg,
M. Rehmsmeier, and C. Köhler.
High-Resolution Analysis of Parent-of-Origin Allelic Expression in the
PLoS Genet., 7(6):e1002126, 2011.