SilkDB

Clustalw: Multiple Alignments (Des Higgins)



your e-mail



Sequences File (or Alignment File for Bootstrap and Tree actions) (-infile) : please enter either :
  1. the name of a file:

    Please enter the sequences in FASTA format or as silkworm Gene IDs.

    Note: if you enter Gene IDs, the line must start with "BmID:", and each gene name must end with "-TA" to indicate a nucleotide sequence or with "-PA" to indicate a protein sequence.

    For example, BmID:BGIBMGA012615-PA,BGIBMGA012616-PA,BGIBMGA012617-PA

  2. or the actual data here:

(sequence format)
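
The Gene ID line format described above can be checked before submission. The following is a minimal sketch, not the server's own parser; the function name `parse_bmid_line` is hypothetical, and the only rules it applies are the ones stated in the note (the "BmID:" prefix and the "-TA" / "-PA" suffixes).

```python
def parse_bmid_line(line):
    """Split a 'BmID:' line into (gene_id, kind) pairs.

    kind is 'nucleotide' for a -TA suffix and 'protein' for a -PA suffix.
    """
    prefix = "BmID:"
    if not line.startswith(prefix):
        raise ValueError("line must start with 'BmID:'")
    entries = []
    for raw in line[len(prefix):].split(","):
        gene_id = raw.strip()
        if gene_id.endswith("-TA"):
            entries.append((gene_id, "nucleotide"))
        elif gene_id.endswith("-PA"):
            entries.append((gene_id, "protein"))
        else:
            raise ValueError(f"{gene_id!r} lacks a -TA or -PA suffix")
    return entries


for gene_id, kind in parse_bmid_line(
    "BmID:BGIBMGA012615-PA,BGIBMGA012616-PA,BGIBMGA012617-PA"
):
    print(gene_id, kind)
```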



Actions

Phylip alignment output format (-output)

Multiple Alignments parameters

Fast Pairwise Alignments parameters

Slow Pairwise Alignments parameters

Tree parameters

Output parameters

Profile Alignments parameters

Structure Alignments parameters


Multiple Alignments parameters

Toggle Slow/Fast pairwise alignments (-quicktree) ? Slow Fast

Protein or DNA (-type) ? [default] protein DNA

Protein weight matrix (-matrix)

DNA weight matrix (-dnamatrix) ? [default] IUB CLUSTALW

Gap opening penalty (-gapopen)

Gap extension penalty (-gapext)

End gap separation penalty (-endgaps)

Gap separation pen. range (-gapdist)

Residue specific penalties (Pascarella gaps) (-nopgap)

Hydrophilic gaps (-nohgap)



Hydrophilic residues list (-hgapresidues):



Delay divergent sequences : % ident. for delay (-maxdiv)

Negative values in matrix ? (-negative)

Transitions weight (between 0 and 1) (-transweight)

File for new guide tree (-newtree)



File for old guide tree (-usetree) : please enter either :
  1. the name of a file:
  2. or the actual data here:





[Return to the main part with your favorite browser's Back function]


Fast Pairwise Alignments parameters

Word size (-ktuple)

Number of best diagonals (-topdiags)

Window around best diags (-window)

Gap penalty (-pairgap)

Percent or absolute score ? (-score) ? [default] percent absolute





Slow Pairwise Alignments parameters

Protein weight matrix (-pwmatrix)

DNA weight matrix (-pwdnamatrix) ? [default] IUB CLUSTALW

Gap opening penalty (-pwgapopen)

Gap extension penalty (-pwgapext)





Tree parameters

Use Kimura's correction (multiple substitutions) ? (-kimura)

Ignore positions with gaps ? (-tossgaps)

Bootstrap a NJ tree (give the number of bootstraps, 0 for none) (-bootstrap)

Phylip bootstrap positions (-bootlabels) ? [default] NODE labels BRANCH labels

Seed number for bootstraps (-seed)

Output tree/distance format (-outputtree)





Output parameters

Alignment File (-outfile)

Output format (-output)

Upper case GDE output (-case)

Result order (-outorder) ? [default] input aligned

Output sequence numbers in the output file (clustalw format) (-seqnos)





Profile Alignments parameters



Profile 1 (-profile1) : please enter either :
  1. the name of a file:
  2. or the actual data here:





Profile 2 (-profile2) : please enter either :
  1. the name of a file:
  2. or the actual data here:





File for old guide tree for profile1 (-usetree1) : please enter either :
  1. the name of a file:
  2. or the actual data here:





File for old guide tree for profile2 (-usetree2) : please enter either :
  1. the name of a file:
  2. or the actual data here:



File for new guide tree for profile1 (-newtree1)

File for new guide tree for profile2 (-newtree2)





Structure Alignments parameters

Use profile 1 secondary structure / penalty mask (-nosecstr1)

Use profile 2 secondary structure / penalty mask (-nosecstr2)

Helix gap penalty (-helixgap)

Strand gap penalty (-strandgap)

Loop gap penalty (-loopgap)

Secondary structure terminal penalty (-terminalgap)

Helix terminal positions: number of residues inside helix to be treated as terminal (-helixendin)

Helix terminal positions: number of residues outside helix to be treated as terminal (-helixendout)

Strand terminal positions: number of residues inside strand to be treated as terminal (-strandendin)

Strand terminal positions: number of residues outside strand to be treated as terminal (-strandendout)

Output in alignment (-secstrout)







Some explanations about the options



Main parameters
enter either the name of a file or the actual data
if you are using Netscape 2.x or later, you can select a file by typing its name, or better, by selecting it with the Netscape file browser (Browse button)
OR you can type your data in the next area, or cut and paste it from another application.
(but not both)


Profile Alignments parameters
By PROFILE ALIGNMENT, we mean alignment using existing alignments. Profile alignments allow you to store alignments of your favourite sequences and add new sequences to them in small bunches at a time. A profile is simply an alignment of one or more sequences (e.g. an alignment output file from CLUSTAL W). Each input can be a single sequence. One or both sets of input sequences may include secondary structure assignments or gap penalty masks to guide the alignment.
Give 2 profiles to align the 2 profiles to each other


Fast Pairwise Alignments parameters
These similarity scores are calculated from fast, approximate, global alignments, which are controlled by 4 parameters. 2 techniques are used to make these alignments very fast: 1) only exactly matching fragments (k-tuples) are considered; 2) only the 'best' diagonals (the ones with most k-tuple matches) are used.
Word size (-ktuple)
K-TUPLE SIZE: This is the size of exactly matching fragment that is used. INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity. For longer sequences (e.g. >1000 residues) you may need to increase the default.
Number of best diagonals (-topdiags)
The number of k-tuple matches on each diagonal (in an imaginary dot-matrix plot) is calculated. Only the best ones (with most matches) are used in the alignment. This parameter specifies how many. Decrease for speed; increase for sensitivity.
Window around best diags (-window)
WINDOW SIZE: This is the number of diagonals around each of the 'best' diagonals that will be used. Decrease for speed; increase for sensitivity.
Gap penalty (-pairgap)
This is a penalty for each gap in the fast alignments. It has little effect on the speed or sensitivity except for extreme values.
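
The two speed-ups described above (exact k-tuple matches, then only the best diagonals) can be illustrated with a short sketch. This is not the Wilbur and Lipman implementation used by the program; the function name and the default values are hypothetical, and only the counting step is shown, not the window or gap penalty handling.

```python
from collections import Counter


def best_diagonals(seq_a, seq_b, ktuple=1, topdiags=5):
    """Count exact k-tuple matches per diagonal and return the best ones.

    A diagonal is identified by i - j, where i and j are the start
    positions of a matching k-tuple in seq_a and seq_b.
    """
    # Index every k-tuple of seq_b by its start positions.
    positions = {}
    for j in range(len(seq_b) - ktuple + 1):
        positions.setdefault(seq_b[j:j + ktuple], []).append(j)

    # Each exact match adds one hit to the diagonal it lies on.
    diagonal_hits = Counter()
    for i in range(len(seq_a) - ktuple + 1):
        for j in positions.get(seq_a[i:i + ktuple], []):
            diagonal_hits[i - j] += 1

    # Keep only the diagonals with the most matches.
    return diagonal_hits.most_common(topdiags)


print(best_diagonals("ACGTACGT", "ACGTACGT", ktuple=2, topdiags=3))
```

A larger `ktuple` produces fewer candidate matches (faster, less sensitive); a larger `topdiags` keeps more diagonals (slower, more sensitive).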


Slow Pairwise Alignments parameters
These parameters do not have any effect on the speed of the alignments. They are used to give initial alignments which are then rescored to give percent identity scores. These % scores are the ones which are displayed on the screen. The scores are converted to distances for the trees.
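
As an illustration of the rescoring step, the sketch below computes a percent identity from one pairwise alignment and turns it into a distance. The conversion `1 - identity/100` and the choice to skip gapped columns are assumptions made for this example; the function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # gapped columns are not scored in this sketch
        compared += 1
        identical += x == y
    return 100.0 * identical / compared if compared else 0.0


def identity_to_distance(percent):
    """Assumed conversion: 100% identity -> 0.0, 0% identity -> 1.0."""
    return 1.0 - percent / 100.0


score = percent_identity("ACGT-A", "ACCTTA")
print(score, identity_to_distance(score))
```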
Protein weight matrix (-pwmatrix)
The scoring table which describes the similarity of each amino acid to each other. For DNA, an identity matrix is used.
BLOSUM (Henikoff). These matrices appear to be the best available for carrying out data base similarity (homology searches). The matrices used are: Blosum80, 62, 40 and 30.
The Gonnet Pam 250 matrix has been reported as the best single matrix for alignment, if you only choose one matrix. Our experience with profile database searches is that the Gonnet series is unambiguously superior to the Blosum series at high divergence. However, we did not get the series to perform systematically better than the Blosum series in Clustal W (communication of the authors).
PAM (Dayhoff). These have been extremely widely used since the late '70s. We use the PAM 120, 160, 250 and 350 matrices.
DNA weight matrix (-pwdnamatrix)
For DNA, a single matrix (not a series) is used. Two hard-coded matrices are available:
1) IUB. This is the default scoring matrix used by BESTFIT for the comparison of nucleic acid sequences. X's and N's are treated as matches to any IUB ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.
2) CLUSTALW(1.6). The previous system used by ClustalW, in which matches score 1.0 and mismatches score 0. All matches for IUB symbols also score 0.




Structure Alignments parameters
These options, when doing a profile alignment, allow you to set 2D structure parameters. If a solved structure is available, it can be used to guide the alignment by raising gap penalties within secondary structure elements, so that gaps will preferentially be inserted into unstructured surface loops. Alternatively, a user-specified gap penalty mask can be supplied directly.
A gap penalty mask is a series of numbers between 1 and 9, one per position in the alignment. Each number specifies how much the gap opening penalty is to be raised at that position (raised by multiplying the basic gap opening penalty by the number) i.e. a mask figure of 1 at a position means no change in gap opening penalty; a figure of 4 means that the gap opening penalty is four times greater at that position, making gaps 4 times harder to open.
Gap penalty masks are to be supplied with the input sequences. The masks work by raising gap penalties in specified regions (typically secondary structure elements) so that gaps are preferentially opened in the less well conserved regions (typically surface loops).
CLUSTAL W can read the masks from SWISS-PROT, CLUSTAL or GDE format input files. For many 3-D protein structures, secondary structure information is recorded in the feature tables of SWISS-PROT database entries. You should always check that the assignments are correct - some are quite inaccurate. CLUSTAL W looks for SWISS-PROT HELIX and STRAND assignments e.g.
FT HELIX 100 115
FT STRAND 118 119
The structure and penalty masks can also be read from CLUSTAL alignment format as comment lines beginning !SS_ or !GM_ e.g.
!SS_HBA_HUMA ..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA
!GM_HBA_HUMA 112224444444444222122244444444442222224222111111111222444444
HBA_HUMA VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
Note that the mask itself is a set of numbers between 1 and 9 each of which is assigned to the residue(s) in the same column below. In GDE flat file format, the masks are specified as text and the names must begin with SS_ or GM_. Either a structure or penalty mask or both may be used. If both are included in an alignment, the user will be asked which is to be used.
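
The effect of a gap penalty mask can be sketched as a per-position multiplication, as described above. The function name and the base penalty of 10.0 are hypothetical values chosen for illustration.

```python
def masked_gap_open(mask, base_gap_open=10.0):
    """Return the gap opening penalty at each alignment position.

    Each mask character is a digit from 1 to 9 that multiplies the
    basic gap opening penalty at that position.
    """
    if not mask or any(c not in "123456789" for c in mask):
        raise ValueError("mask must consist of digits 1-9 only")
    return [base_gap_open * int(c) for c in mask]


# A mask figure of 1 leaves the penalty unchanged; 4 makes it four times larger.
print(masked_gap_open("112224444"))
```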
Use profile 1 secondary structure / penalty mask (-nosecstr1)
This option controls whether the input secondary structure information or gap penalty masks will be used.
Use profile 2 secondary structure / penalty mask (-nosecstr2)
This option controls whether the input secondary structure information or gap penalty masks will be used.
Helix gap penalty (-helixgap)
This option provides the value for raising the gap penalty at core Alpha Helical (A) residues. In CLUSTAL format, capital residues denote the A and B core structure notation. The basic gap penalties are multiplied by the amount specified.
Strand gap penalty (-strandgap)
This option provides the value for raising the gap penalty at Beta Strand (B) residues. In CLUSTAL format, capital residues denote the A and B core structure notation. The basic gap penalties are multiplied by the amount specified.
Loop gap penalty (-loopgap)
This option provides the value for the gap penalty in Loops. By default this penalty is not raised. In CLUSTAL format, loops are specified by . in the secondary structure notation.
Secondary structure terminal penalty (-terminalgap)
This option provides the value for setting the gap penalty at the ends of secondary structures. Ends of secondary structures are observed to grow and/or shrink in related structures. Therefore by default these are given intermediate values, lower than the core penalties. All secondary structure read in as lower case in CLUSTAL format gets the reduced terminal penalty.
Helix terminal positions: number of residues inside helix to be treated as terminal (-helixendin)
This option (together with the -helixendout option) specifies the range of structure termini for the intermediate penalties. In the alignment output, these are indicated as lower case. For Alpha Helices, by default, the range spans the end helical turn.
Helix terminal positions: number of residues outside helix to be treated as terminal (-helixendout)
This option (together with the -helixendin option) specifies the range of structure termini for the intermediate penalties. In the alignment output, these are indicated as lower case. For Alpha Helices, by default, the range spans the end helical turn.
Strand terminal positions: number of residues inside strand to be treated as terminal (-strandendin)
This option (together with the -strandendout option) specifies the range of structure termini for the intermediate penalties. In the alignment output, these are indicated as lower case. For Beta Strands, the default range spans the end residue and the adjacent loop residue, since sequence conservation often extends beyond the actual H-bonded Beta Strand.
Strand terminal positions: number of residues outside strand to be treated as terminal (-strandendout)
This option (together with the -strandendin option) specifies the range of structure termini for the intermediate penalties. In the alignment output, these are indicated as lower case. For Beta Strands, the default range spans the end residue and the adjacent loop residue, since sequence conservation often extends beyond the actual H-bonded Beta Strand.
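
How the helix, strand, loop and terminal penalties combine into a mask can be sketched from the CLUSTAL notation alone: capitals for core residues, lower case for termini, and . for loops. The function name is hypothetical, and the numeric values (4 for core, 2 for termini, 1 for loops) are assumptions chosen to mirror the !SS_ / !GM_ example above, not confirmed program defaults.

```python
def structure_to_mask(ss, helixgap=4, strandgap=4, loopgap=1, terminalgap=2):
    """Translate a CLUSTAL-style secondary structure line into a penalty mask.

    'A' / 'B' are core helix / strand residues, 'a' / 'b' are terminal
    positions of a helix / strand, and '.' is a loop position.
    """
    table = {
        "A": helixgap,
        "B": strandgap,
        "a": terminalgap,
        "b": terminalgap,
        ".": loopgap,
    }
    try:
        return "".join(str(table[c]) for c in ss)
    except KeyError as err:
        raise ValueError(f"unexpected structure symbol {err.args[0]!r}") from None


print(structure_to_mask("..aaaAAAAaaa.bBBb"))
```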
Output in alignment (-secstrout)
This option lets you choose whether or not to include the masks in the CLUSTAL W output alignments. Showing both is useful for understanding how the masks work. The secondary structure information is itself very useful in judging the alignment quality and in seeing how residue conservation patterns vary with secondary structure.


Multiple Alignments parameters
Multiple alignments are carried out in 3 stages :
1) all sequences are compared to each other (pairwise alignments);
2) a dendrogram (like a phylogenetic tree) is constructed, describing the approximate groupings of the sequences by similarity (stored in a file).
3) the final multiple alignment is carried out, using the dendrogram as a guide.
Pairwise alignment parameters control the speed/sensitivity of the initial alignments.
Multiple alignment parameters control the gaps in the final multiple alignments.
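
For readers who run the program locally rather than through this form, the option names shown in parentheses on this page can be assembled into a command line. The sketch below only builds the argument list and does not run anything. The executable name `clustalw`, the `-align` action flag and the `-option=value` syntax reflect the usual command-line program and are assumptions with respect to this page; the function name is hypothetical.

```python
def clustalw_command(infile, outfile, quicktree=False, seq_type=None,
                     gapopen=None, gapext=None, executable="clustalw"):
    """Build an argument list for a full multiple alignment run."""
    args = [executable, f"-infile={infile}", "-align", f"-outfile={outfile}"]
    if quicktree:
        # Fast (approximate) pairwise alignments instead of slow ones.
        args.append("-quicktree")
    if seq_type is not None:
        args.append(f"-type={seq_type}")
    if gapopen is not None:
        args.append(f"-gapopen={gapopen}")
    if gapext is not None:
        args.append(f"-gapext={gapext}")
    return args


print(" ".join(clustalw_command("seqs.fasta", "seqs.aln", seq_type="protein")))
```

The resulting list could be handed to `subprocess.run` on a machine where the program is installed.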
Toggle Slow/Fast pairwise alignments (-quicktree)
slow: by dynamic programming (slow but accurate)
fast: method of Wilbur and Lipman (extremely fast but approximate)
Protein weight matrix (-matrix)
There are three 'in-built' series of weight matrices offered. Each consists of several matrices which work differently at different evolutionary distances. To see the exact details, read the documentation. Crudely, we store several matrices in memory, spanning the full range of amino acid distance (from almost identical sequences to highly divergent ones). For very similar sequences, it is best to use a strict weight matrix which only gives a high score to identities and the most favoured conservative substitutions. For more divergent sequences, it is appropriate to use 'softer' matrices which give a high score to many other frequent substitutions.
BLOSUM (Henikoff). These matrices appear to be the best available for carrying out data base similarity (homology searches). The matrices used are: Blosum80, 62, 40 and 30.
The Gonnet Pam 250 matrix has been reported as the best single matrix for alignment, if you only choose one matrix. Our experience with profile database searches is that the Gonnet series is unambiguously superior to the Blosum series at high divergence. However, we did not get the series to perform systematically better than the Blosum series in Clustal W (communication of the authors).
PAM (Dayhoff). These have been extremely widely used since the late '70s. We use the PAM 120, 160, 250 and 350 matrices.
DNA weight matrix (-dnamatrix)
1) IUB. This is the default scoring matrix used by BESTFIT for the comparison of nucleic acid sequences. X's and N's are treated as matches to any IUB ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.
2) CLUSTALW(1.6). The previous system used by ClustalW, in which matches score 1.0 and mismatches score 0. All matches for IUB symbols also score 0.
Gap opening penalty (-gapopen)
End gap separation penalty (-endgaps)
End gap separation treats end gaps just like internal gaps for the purposes of avoiding gaps that are too close (as set by the gap separation distance, -gapdist). If you turn this off, end gaps will be ignored for this purpose. This is useful when you wish to align fragments where the end gaps are not biologically meaningful.
Gap separation pen. range (-gapdist)
Gap separation distance tries to decrease the chances of gaps being too close to each other. Gaps that are less than this distance apart are penalised more than other gaps. This does not prevent close gaps; it makes them less frequent, promoting a block-like appearance of the alignment.
Residue specific penalties (Pascarella gaps) (-nopgap)
Residue specific penalties are amino acid specific gap penalties that reduce or increase the gap opening penalties at each position in the alignment or sequence. As an example, positions that are rich in glycine are more likely to have an adjacent gap than positions that are rich in valine.
Table of residue specific gap modification factors:
Residue  Factor    Residue  Factor
A        1.13      M        1.29
C        1.13      N        0.63
D        0.96      P        0.74
E        1.31      Q        1.07
F        1.20      R        0.72
G        0.61      S        0.76
H        1.00      T        0.89
I        1.32      V        1.25
K        0.96      Y        1.00
L        1.21      W        1.23
The values are normalised around a mean value of 1.0 for H. The lower the value, the greater the chance of having an adjacent gap. These are derived from the original table of relative frequencies of gaps adjacent to each residue (12) by subtraction from 2.0.
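
To show how the table might be applied to an alignment position, the sketch below averages the factors of the residues found in one column. Averaging over the column is an assumption made for this example, and the names are hypothetical; the factor values are copied from the table above.

```python
# Residue specific gap modification factors, copied from the table above.
GAP_FACTORS = {
    "A": 1.13, "C": 1.13, "D": 0.96, "E": 1.31, "F": 1.20,
    "G": 0.61, "H": 1.00, "I": 1.32, "K": 0.96, "L": 1.21,
    "M": 1.29, "N": 0.63, "P": 0.74, "Q": 1.07, "R": 0.72,
    "S": 0.76, "T": 0.89, "V": 1.25, "Y": 1.00, "W": 1.23,
}


def column_gap_factor(column):
    """Average the factors of the residues in one alignment column.

    Gap characters and unknown symbols are ignored; an empty column
    leaves the gap opening penalty unchanged (factor 1.0).
    """
    factors = [GAP_FACTORS[r] for r in column.upper() if r in GAP_FACTORS]
    return sum(factors) / len(factors) if factors else 1.0


# A glycine-rich column lowers the penalty; a valine-rich one raises it.
print(column_gap_factor("GGGS"), column_gap_factor("VVIV"))
```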
Hydrophilic gaps (-nohgap)
Hydrophilic gap penalties are used to increase the chances of a gap within a run (5 or more residues) of hydrophilic amino acids; these are likely to be loop or random coil regions where gaps are more common. The residues that are 'considered' to be hydrophilic are set by the hydrophilic residues list (-hgapresidues).
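
Finding the runs that qualify for a reduced penalty is a simple scan. In the sketch below the function name is hypothetical, and the residue list "GPSNDQEKR" is an assumed value for -hgapresidues used only for illustration; the minimum run length of 5 comes from the text above.

```python
def hydrophilic_runs(sequence, residues="GPSNDQEKR", min_run=5):
    """Return (start, end) index pairs of hydrophilic runs, end exclusive."""
    runs = []
    start = None
    for i, residue in enumerate(sequence):
        if residue in residues:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(sequence) - start >= min_run:
        runs.append((start, len(sequence)))
    return runs


print(hydrophilic_runs("AVGPSNDLLKK"))
```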
Delay divergent sequences : % ident. for delay (-maxdiv)
Delays the alignment of the most distantly related sequences until after the most closely related sequences have been aligned. The setting shows the percent identity level required to delay the addition of a sequence; sequences that are less identical than this level to any other sequences will be aligned later.
Transitions weight (between 0 and 1) (-transweight)
The transition weight option for aligning nucleotide sequences has been changed in version 1.7 from an on/off toggle to a weight between 0 and 1. A weight of zero means that the transitions are scored as mismatches; a weight of 1 gives transitions the full match score. For distantly related DNA sequences, the weight should be near to zero; for closely related sequences it can be useful to assign a higher score.
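
The weighting can be pictured as an interpolation between the mismatch and match scores. The sketch below assumes a simple match/mismatch scheme (1.0 and 0.0) and a linear interpolation; both are illustrative assumptions, and the function name and default weight are hypothetical.

```python
# Transitions are purine-purine (A<->G) or pyrimidine-pyrimidine (C<->T) changes.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def dna_pair_score(x, y, transweight=0.5, match_score=1.0, mismatch_score=0.0):
    """Score one pair of nucleotides with a transition weight between 0 and 1."""
    if not 0.0 <= transweight <= 1.0:
        raise ValueError("transweight must be between 0 and 1")
    x, y = x.upper(), y.upper()
    if x == y:
        return match_score
    if frozenset((x, y)) in TRANSITIONS:
        # Weight 0 scores a transition as a mismatch, weight 1 as a match.
        return mismatch_score + transweight * (match_score - mismatch_score)
    return mismatch_score


print(dna_pair_score("A", "G"), dna_pair_score("A", "C"))
```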
File for old guide tree (-usetree)
You can give a previously computed tree (.dnd file) - on the same data


Tree parameters
If you ask for an alignment, the program automatically computes the tree as well; but you can also ask for a tree, given an alignment (file .aln), with specific options.
The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First you calculate distances (percent divergence) between all pairs of sequences from a multiple alignment; second you apply the NJ method to the distance matrix.
Use Kimura's correction (multiple substitutions) ? (-kimura)
For small divergence (say <10%) this option makes no difference. For greater divergence, this option corrects for the fact that observed distances underestimate actual evolutionary distances. This is because, as sequences diverge, more than one substitution will happen at many sites. However, you only see one difference when you look at the present day sequences. Therefore, this option has the effect of stretching branch lengths in trees (especially long branches). The corrections used here (for DNA or proteins) are both due to Motoo Kimura. See the documentation for details.
For VERY divergent sequences, the distances cannot be reliably corrected. You will be warned if this happens. Even if none of the distances in a data set exceed the reliable threshold, if you bootstrap the data, some of the bootstrap distances may randomly exceed the safe limit.
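
As an illustration of such a correction, the sketch below applies Kimura's well-known approximation for proteins, K = -ln(1 - D - D^2/5), where D is the observed fraction of differing sites. Whether the program uses exactly this expression over the whole range of D is an assumption; the function name is hypothetical. The error raised for large D mirrors the warning mentioned above.

```python
import math


def kimura_protein_distance(observed):
    """Correct an observed protein distance D for multiple substitutions."""
    if not 0.0 <= observed <= 1.0:
        raise ValueError("observed distance must be between 0 and 1")
    argument = 1.0 - observed - 0.2 * observed ** 2
    if argument <= 0.0:
        # Too divergent: the logarithm is undefined, so no reliable correction.
        raise ValueError("sequences too divergent for a reliable correction")
    return -math.log(argument)


for d in (0.05, 0.30, 0.60):
    print(d, round(kimura_protein_distance(d), 4))
```

For small D the corrected value stays close to D, while larger observed distances are stretched progressively more.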
Ignore positions with gaps ? (-tossgaps)
With this option, any alignment positions where ANY of the sequences have a gap will be ignored. This means that 'like' will be compared to 'like' in all distances. It also automatically throws away the most ambiguous parts of the alignment, which are usually concentrated around gaps. The disadvantage is that you may throw away much of the data if there are many gaps.
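
The column filtering can be sketched in a few lines. The function name is hypothetical, and "-" is assumed to be the gap character.

```python
def toss_gap_columns(alignment):
    """Drop every column in which at least one sequence has a gap."""
    kept = [column for column in zip(*alignment) if "-" not in column]
    if not kept:
        return ["" for _ in alignment]
    # Transpose the kept columns back into rows (one string per sequence).
    return ["".join(row) for row in zip(*kept)]


print(toss_gap_columns(["AC-GT", "A-CGT", "ACCGT"]))
```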
Bootstrap a NJ tree (give the number of bootstraps, 0 for none) (-bootstrap)
BOOTSTRAPPING is a method for deriving confidence values for the groupings in a tree (first adapted for trees by Joe Felsenstein). It involves making N random samples of sites from the alignment (N should be LARGE, e.g. 500 - 1000); drawing N trees (1 from each sample) and counting how many times each grouping from the original tree occurs in the sample trees. You must supply a seed number for the random number generator. Different runs with the same seed will give the same answer. See the documentation for details.
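
The resampling step can be sketched as drawing alignment columns with replacement from a seeded random number generator. This is an illustration only: the function name is hypothetical, Python's generator is not the one used by the program, and the tree building and counting steps are omitted.

```python
import random


def bootstrap_alignments(alignment, n_samples, seed):
    """Return n_samples pseudo-alignments built by resampling columns."""
    rng = random.Random(seed)  # the same seed reproduces the same samples
    length = len(alignment[0])
    samples = []
    for _ in range(n_samples):
        # Draw as many columns as the alignment has, with replacement.
        columns = [rng.randrange(length) for _ in range(length)]
        samples.append(["".join(seq[c] for c in columns) for seq in alignment])
    return samples


replicates = bootstrap_alignments(["ACGTAC", "ACGTTC", "ACCTAC"], n_samples=3, seed=111)
for replicate in replicates:
    print(replicate)
```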
Phylip bootstrap positions (-bootlabels)
The bootstrap values written in the phylip tree file format can be assigned either to branches or nodes. The default is to write the values on the nodes, as this can be read by several commonly-used tree display programs. But note that this can lead to confusion if the tree is rooted and the bootstraps may be better attached to the internal branches: Software developers should ensure they can read the branch label format.
Output tree/distance format (-outputtree)
Clustal format output: This format is verbose and lists all of the distances between the sequences and the number of alignment positions used for each. The tree is described at the end of the file. It lists the sequences that are joined at each alignment step and the branch lengths. After two sequences are joined, the pair is referred to later as a NODE. The number of a NODE is the number of the lowest sequence in that NODE.
Phylip format tree output: This format is the New Hampshire format, used by many phylogenetic analysis packages. It consists of a series of nested parentheses, describing the branching order, with the sequence names and branch lengths. It can be used by the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP package to see the trees graphically. This is the same format used during multiple alignment for the guide trees.
The distance matrix only: This format just outputs a matrix of all the pairwise distances in a format that can be used by the Phylip package. It used to be useful when one could not produce distances from protein sequences in the Phylip package but is now redundant (Protdist of Phylip 3.5 now does this).
NEXUS format tree: This format is used by several popular phylogeny programs, including PAUP and MacClade.
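
The nested-parentheses layout of the Phylip (New Hampshire) tree format can be illustrated by serialising a small tree. The tuple-based tree representation and the function names below are hypothetical choices made for this sketch.

```python
def _newick(node):
    """Serialise one node given as (label_or_children, branch_length)."""
    content, length = node
    if isinstance(content, str):
        text = content  # a leaf: just the sequence name
    else:
        text = "(" + ",".join(_newick(child) for child in content) + ")"
    return text if length is None else f"{text}:{length}"


def to_newick(root):
    """Return the whole tree as one New Hampshire string ending in ';'."""
    return _newick(root) + ";"


tree = (
    [("seqA", 0.1), ("seqB", 0.2), ([("seqC", 0.3), ("seqD", 0.4)], 0.05)],
    None,  # the root has no branch length
)
print(to_newick(tree))
```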
Sequence format
The sequence will be automatically converted to the format needed by the program,
provided you enter it either
in plain (raw) sequence format or in one of the following known formats:
IG,GenBank,NBRF,EMBL,GCG,DNAStrider,Fitch,fasta,Phylip,PIR,MSF,ASN,PAUP,CLUSTALW
You may enter in the text area a database entry code, or an accession number, in this form:

database:entry_name

or:

database:accession.

References:

Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.