This script tries to find suitable score parameters (substitution and gap scores) for aligning some given sequences.
It (probabilistically) aligns the sequences using some initial score parameters, then estimates better score parameters based on the alignments, and repeats this procedure until the parameters stop changing.
The usage is like this:
lastdb mydb reference.fasta last-train mydb queries.fasta
last-train prints a summary of each alignment step, followed by the final score parameters in a format that can be read by lastal's -p option.
-h, --help Show a help message, with default option values, and exit.
--revsym Force the substitution scores to have reverse-complement symmetry, e.g. score(A→G) = score(T→C). This is often appropriate, if neither strand is "special". --matsym Force the substitution scores to have directional symmetry, e.g. score(A→G) = score(G→A). --gapsym Force the insertion costs to equal the deletion costs. --pid=PID Ignore alignments with > PID% identity. This aims to optimize the parameters for low-similarity alignments, similarly to the BLOSUM matrices.
-r SCORE Initial match score. -q COST Initial mismatch cost. -p NAME Initial match/mismatch score matrix. -a COST Initial gap existence cost. -b COST Initial gap extension cost. -A COST Initial insertion existence cost. -B COST Initial insertion extension cost.
-D LENGTH Query letters per random alignment. -E EG2 Maximum expected alignments per square giga. -s NUMBER Which query strand to use: 0=reverse, 1=forward, 2=both. -S NUMBER Score matrix applies to forward strand of: 0=reference, 1=query. This matters only if the scores lack reverse-complement symmetry. -m COUNT Maximum number of initial matches per query position. -T NUMBER Type of alignment: 0=local, 1=overlap. -Q NUMBER Query sequence format: 0=fasta, 1=fastq-sanger. Important: if you use option -Q, last-train will only train the gap scores, and leave the substitution scores at their initial values.
last-train assumes that gap lengths roughly follow a geometric distribution. If they do not (which is often the case), the results may be poor.
last-train can fail for various reasons, e.g. if the sequences are too dissimilar.