Mapping tags to a genome with LAST
==================================

LAST has many adjustable parameters, providing many ways of mapping
tags to a genome.  We cannot tell you which way is best, but here are
some ideas that might be helpful.

1. A simple mapping procedure
-----------------------------

Suppose we wish to map tags of length 36 to the mouse genome.  The
following commands do the job fairly quickly and accurately:

  lastdb -s16G mousedb mouse/chr*.fa
  lastal -a2 -e30 -f0 mousedb tags.fa

Here, we used -s16G to indicate that 16 gigabytes of memory are
available.  This will make lastal run faster.  If you don't have 16
gigabytes, omit this option.  We then used -a2 to set the gap
existence cost to 2, and -e30 to get alignments with score >= 30.  We
left the other score parameters at their default values: match score =
1, mismatch cost = 1, gap extension cost = 1.  (The default gap
existence cost is tuned for genome-versus-genome alignments, and it
may be too high for short tags.)  These parameters allow a few
mismatches and/or a few small gaps.  The -f0 option simply selects the
compact tabular output format.
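
As a rough check on what these score parameters allow, the arithmetic
can be sketched in shell.  This is only an illustration: it assumes a
gapless alignment that covers the whole tag, and the function name
max_mismatches is ours, not part of LAST.

```shell
# Hypothetical helper (not part of LAST): the largest number of
# mismatches that a gapless, full-length alignment can contain while
# still reaching the score threshold.
# usage: max_mismatches TAG_LENGTH MATCH_SCORE MISMATCH_COST THRESHOLD
max_mismatches() {
  if [ $(( $1 * $2 )) -lt "$4" ]; then
    echo -1        # even a perfect match scores below the threshold
    return
  fi
  # score = (length - m) * match - m * mismatch >= threshold
  echo $(( ($1 * $2 - $4) / ($2 + $3) ))
}

max_mismatches 36 1 1 30   # prints 3
```

With the parameters above, a full-length gapless alignment of a
36-base tag can therefore contain up to three mismatches.  Shorter
(partial) alignments and alignments with gaps are scored differently
and are not covered by this sketch.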

2. How this procedure works, and its limitations
------------------------------------------------

If you want to understand how this mapping procedure works in more
detail, read on.  LAST uses a two-step approach: first find initial
matches, then extend alignments from these matches.  In this case, the
"initial matches" are: all exact matches of any part of a tag to the
genome, of any size, where the match occurs at most ten times in the
genome.

One consequence of this is that repetitive tags will not be mapped: if
a tag perfectly matches more than ten locations in the genome, it gets
dropped at the first step.

Another wrinkle is the effect of database volumes.  LAST is designed
to work with 2 gigabytes of memory, so it splits large
(e.g. mammalian) genomes into "volumes", and maps tags to each volume
in turn as if they were separate genomes.  If a tag perfectly matches
more than ten locations in one volume, but at most ten in another,
then the former matches will not be reported but the latter will.  You
can avoid this inconsistency by using -s16G to put the whole mouse
genome into one volume.  Even if the genome is in one volume, however,
the two strands get searched separately.

The main point is that this procedure is not guaranteed to find all
alignments with score >= 30.  It is more likely to miss alignments
that have uniformly-spaced mismatches/gaps, and less likely to miss
alignments with mismatches/gaps concentrated at the ends.  We think it
does a good job in practice.
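
The volume effect can be illustrated with a toy model.  The sketch
below is not how LAST is implemented: it only considers exact matches
of a whole tag, ignores the separate treatment of the two strands, and
the function name reported_per_volume is hypothetical.

```shell
# Toy model (hypothetical helper): given the repeat limit and the number
# of exact matches of one tag in each volume, print how many of those
# matches survive the first step in each volume.
# usage: reported_per_volume LIMIT COUNT_IN_VOLUME_1 [COUNT_IN_VOLUME_2 ...]
reported_per_volume() {
  limit=$1
  shift
  for count in "$@"; do
    if [ "$count" -le "$limit" ]; then
      echo "$count"    # within the limit: these matches are kept
    else
      echo 0           # too repetitive in this volume: all dropped
    fi
  done
}

reported_per_volume 10 12 3   # two volumes: prints 0, then 3
reported_per_volume 10 15     # same genome in one volume: prints 0
```

In this model, the tag with 12 + 3 matches is partly reported when the
genome is split into two volumes, but not reported at all when the
genome is in one volume.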

3. Counting exact matches
-------------------------

We can gain information on repetitive tags as follows:

  lastal -j0 -l36 mousedb tags.fa

Here, -j0 tells lastal to just report counts of initial matches.  In
this case, there is no limit on how often the matches occur: matches
that occur more than ten times in the genome are counted too.  So
nothing is missed, and there is no effect from database volumes.  The
-l36 option requests matches of size >= 36 only: this makes it faster
and makes the output smaller.  (Without -l36, it counts all matches of
size >= 1: this is still quite fast.)

4. Finding all matches with up to N mismatches
----------------------------------------------

One approach to tag mapping is to guarantee finding all matches with
up to N mismatches.  The "guarantee" part sounds good, but there are
some drawbacks to this approach:

* It does not allow for insertions or deletions.

* It does not allow for higher error rates near the ends of tags.

* It is not suitable for partial matches, e.g. if a tag crosses a
  splice junction.

* Usually, some tags match repetitively to millions of genome
  locations: finding all these matches is slow and produces huge
  output.

You can mitigate the last drawback by counting exact matches (as
explained above) and then removing tags with many exact matches.
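
One way to do the removal is sketched below.  It assumes that the
match counts have first been converted into a simple two-column file
(tag name, then count): that file layout, the file names and the
function name drop_repetitive_tags are our assumptions for this
illustration, not something lastal produces directly.

```shell
# Hypothetical helper: print only the FASTA records whose exact-match
# count does not exceed MAX_COUNT.  Tags absent from the counts file are
# treated as having zero matches and are kept.
# usage: drop_repetitive_tags COUNTS_FILE FASTA_FILE MAX_COUNT
drop_repetitive_tags() {
  awk -v max="$3" '
    NR == FNR { count[$1] = $2; next }   # first file: name and count
    /^>/ { name = substr($1, 2)          # header line: strip the ">"
           keep = (count[name] + 0 <= max) }
    keep                                 # print header and sequence lines
  ' "$1" "$2"
}
```

For example, "drop_repetitive_tags counts.txt tags.fa 10 > tags2.fa"
would keep the tags with at most ten exact matches.  The counts file
must not be empty, because the NR == FNR test relies on the first file
having at least one line.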

Suppose we wish to find all matches of our length-36 tags to the
genome, allowing up to two mismatches.  A naive approach is to start
by finding all exact matches of size 12, and extend alignments from
these.  This works because any length-36 tag with two mismatches is
guaranteed to have an exact match of size 12.  It will be very slow,
however, because there will be many unproductive size-12 matches.
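
The size-12 figure follows from a pigeonhole argument, which can be
sketched as shell arithmetic (the function name is ours):

```shell
# N mismatches split the remaining (length - N) matching positions into
# at most N + 1 runs, so the longest run has at least
# ceiling((length - N) / (N + 1)) positions.
# usage: guaranteed_exact_match TAG_LENGTH MISMATCHES
guaranteed_exact_match() {
  matched=$(( $1 - $2 ))   # positions that still match
  runs=$(( $2 + 1 ))       # maximum number of runs they are split into
  echo $(( (matched + runs - 1) / runs ))   # integer ceiling
}

guaranteed_exact_match 36 2   # prints 12
```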

We can do better by finding matches using a spaced seed, and then
extending alignments.  For example, our tags are guaranteed to have a
match using this spaced seed pattern: 11111011000111110110001111.
Since this seed has 18 matched positions (18 "1"s), we will get far
fewer unproductive matches.  With LAST, we can do this as follows:

  lastdb -m11111011000 mydb genome.fa
  lastal -l18 -m4000000000 -j1 -q0 -d34 mydb tags.fa

In the lastdb command, the seed pattern gets cyclically repeated, so
we only need to specify the repeating unit of the pattern.  In the
lastal command, we used -l18 to require 18 matched positions in
initial matches, and -m4000000000 to accept hugely repeated initial
matches.  We also used -j1 to request gapless alignments, -q0 to set
the mismatch cost to 0, and -d34 to request alignments with score >=
34.  This will give us all 36-mer alignments with at most two
mismatches.
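
To see how the repeating unit given to lastdb relates to the full seed
pattern, the cyclic repetition can be sketched as follows.  This is
only an illustration of the pattern arithmetic, not LAST's own code,
and the function name expand_seed is hypothetical.

```shell
# Repeat the seed unit cyclically until the pattern contains WEIGHT
# matched positions ("1"s), then print the resulting pattern.
# usage: expand_seed UNIT WEIGHT
expand_seed() {
  awk -v unit="$1" -v weight="$2" 'BEGIN {
    if (unit !~ /1/) exit 1            # a unit without "1"s never ends
    n = length(unit)
    out = ""
    for (i = 0; ones < weight; i++) {
      c = substr(unit, i % n + 1, 1)   # next character of the cycle
      out = out c
      if (c == "1") ones++
    }
    print out
  }'
}

expand_seed 11111011000 18   # prints 11111011000111110110001111
```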

The following table shows optimal spaced seed patterns for various tag
sizes and numbers of mismatches.  Each entry shows the number of
matched positions (e.g. 18) and the pattern (e.g. 11111011000).

====  ===========  ================  ==================  ======================
Tag   1 mismatch   2 mismatches      3 mismatches        4 mismatches
size
====  ===========  ================  ==================  ======================
16    10 11110      7 1110100         4 11010000          3 1110
17    11 11110      7 1110100         5 11010000          4 1110
18    12 11110      8 1110100         5 11010000          4 1110
19    12 11110      8 1110100         6 11010000          4 1110
20    13 11110      8 1110100         6 11010000          4 1100010000
21    14 11110      9 1110100         6 11010000          5 1100010000
22    15 11110     10 1110100         7 1110100000        5 1100010000
23    16 11110     11 1110100         7 11101001000       5 1100010000
24    16 111110    11 1110100         8 11101001000       5 1100010000
25    17 111110    12 1110100         8 11101001000       6 1100010000
26    18 111110    12 1110100         9 11101001000       6 1100010000
27    19 111110    12 1110100         9 11101001000       6 1110100000000
28    20 111110    13 1110100         9 11101001000       7 1110100000000
29    20 111110    14 1110100        10 11101001000       7 1110100000000
30    21 111110    15 1110100        10 11101001000       8 1110100000000
31    22 111110    15 1110100        11 1110110100000     8 1110100000000
32    23 111110    16 1110100        11 111101011001000   8 111010010000000
33    24 111110    16 1110100        12 111101011001000   8 111010010000000
34    25 111110    17 1111101110010  12 111101011001000   9 111010010000000
35    25 1111110   17 11111011000    13 111101011001000   9 111010010000000
36    26 1111110   18 11111011000    13 111101011001000   9 11110010000001000
37    27 1111110   19 11111011000    14 111101011001000  10 11110010000001000
38    28 1111110   19 11111011000    15 111101011001000  10 11110010000001000
39    29 1111110   20 11111011000    15 111101011001000  10 11110010000001000
40    30 1111110   21 11111011000    15 111101011001000  11 11110010000001000
41    30 1111110   21 11111011000    16 111101011001000  11 11110010000001000
42    31 1111110   22 1111110101100  16 111101011001000  11 1101110100000010000
43    32 1111110   23 1111110101100  16 111101011001000  12 1101110100000010000
44    33 1111110   24 1111110101100  17 1110110100000    12 1101110100000010000
45    34 1111110   24 1111110101100  17 111101011001000  13 1101110100000010000
46    35 1111110   25 1111110101100  18 111101011001000  13 1101001110100000000
47    36 1111110   26 1111101110010  19 111101011001000  13 1101001110100000000
48    36 11111110  26 1111101110010  20 111101011001000  14 1101001110100000000
49    37 11111110     ?              20 111101011001000  14 1101001110100000000
50    38 11111110     ?              21 111101011001000  14 1101001110100000000
====  ===========  ================  ==================  ======================

This table was made using software kindly provided by the authors of
these publications:

* G Kucherov, L Noé, M Roytberg (2005) IEEE/ACM Trans Comput Biol
  Bioinform 2:51-61.
* S Burkhardt, J Kärkkäinen (2003) Fundamenta Informaticae 56:51-70.

For longer tags, it becomes harder to determine the optimal seed patterns.
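
For small cases, the guarantee itself can be checked by brute force.
The sketch below handles two mismatches only, and it assumes that a
seed match may start at any tag position with the pattern always
beginning at its first character.  It is our own illustration (with a
hypothetical function name), not the software used to make the table.

```shell
# Print "yes" if, for every pair of mismatch positions in a tag of
# length TAG_LENGTH, the seed fits somewhere in the tag without any of
# its "1" positions landing on a mismatch; otherwise print "no".
# usage: seed_covers_two_mismatches UNIT WEIGHT TAG_LENGTH
seed_covers_two_mismatches() {
  awk -v unit="$1" -v weight="$2" -v len="$3" 'BEGIN {
    if (unit !~ /1/) { print "no"; exit }
    # Expand the repeating unit until it holds WEIGHT ones.
    n = length(unit)
    span = 0
    for (ones = 0; ones < weight; span++)
      if (substr(unit, span % n + 1, 1) == "1") { must[span] = 1; ones++ }
    # Try every pair of mismatch positions p < q.
    for (p = 0; p < len; p++)
      for (q = p + 1; q < len; q++) {
        hit = 0
        for (o = 0; o + span <= len && !hit; o++) {
          a = p - o
          b = q - o
          hit = !(a in must) && !(b in must)
        }
        if (!hit) { print "no"; exit }
      }
    print "yes"
  }'
}

seed_covers_two_mismatches 11111011000 18 36   # prints yes
seed_covers_two_mismatches 1 18 36             # prints no
```

The second call shows that an exact (unspaced) match of size 18 is not
guaranteed for a length-36 tag with two mismatches.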

5. Some useful Unix pipelines
-----------------------------

5a. Merging identical tag sequences
-----------------------------------

Suppose we have tag sequences in a FASTA format file called "tags.fa":

  >tagA
  GGACAAAAACCAAAAAAAACAAAAAAAAAAAAAAAA
  >tagB
  GGACAAAAACCAAAAAAAACAAAAAAAAAAAAAAAA
  >tagC
  GGACAAAAACCAAAAAAAACAAAAAAAAAAAAAAAA
  >tagD
  GGCACTCTTTCCCTACACGACGCTCTTCCGATCTGG

If there are many identical sequences, we can speed up the mapping by
merging them.  The following Unix pipeline merges identical sequences
(assuming each sequence is all on one line):

  grep -v '>' tags.fa | sort | uniq -c | awk '{print ">" NR ":" $1 "\n" $2}'

The output of this command looks as follows:

  >1:3
  GGACAAAAACCAAAAAAAACAAAAAAAAAAAAAAAA
  >2:1
  GGCACTCTTTCCCTACACGACGCTCTTCCGATCTGG

The number after the colon is the count of the tag, and the number
before the colon is just a serial number.
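
If the sequences are wrapped over several lines, they can first be
joined so that the one-line assumption holds.  A minimal sketch (the
function name unwrap_fasta is ours):

```shell
# Join the sequence lines of each FASTA record into a single line.
# Reads FASTA on standard input, writes FASTA on standard output.
unwrap_fasta() {
  awk '
    /^>/ { if (seq != "") print seq   # finish the previous record
           print                      # header line, unchanged
           seq = ""
           next }
    { seq = seq $0 }                  # accumulate sequence lines
    END { if (seq != "") print seq }
  '
}
```

For example, "unwrap_fasta < wrapped.fa > tags.fa" (with a
hypothetical input file wrapped.fa) would prepare the input for the
merging pipeline above.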

5b. Discarding sub-optimal mappings
-----------------------------------

LAST will often align a tag to more than one genome location.  We may
wish to keep only the highest-scoring alignment(s) for each tag.
Suppose we have LAST alignments, in tabular format, in a file called
"mymap".  The following pipeline obtains the highest-scoring
alignment(s) for each tag:

  grep -v '#' mymap | sort -k7,7 -k1,1nr | awk '$7!=n {n=$7; s=$1} $1==s'
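
The same pipeline is written out below with comments, as a sketch for
readers who want to follow each step.  It relies on the score being in
column 1 and the tag name in column 7 of the tabular format; the
function name best_alignments is hypothetical.

```shell
# usage: best_alignments TABULAR_FILE
best_alignments() {
  grep -v '#' "$1" |          # drop comment lines
    sort -k7,7 -k1,1nr |      # group by tag name, highest score first
    awk '
      $7 != tag { tag = $7; best = $1 }  # first line of a new tag: top score
      $1 == best                         # print lines that tie with it
    '
}
```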

5c. Discarding tags that map equally well to multiple locations
---------------------------------------------------------------

After using the previous pipeline, there may still be some tags that
map to more than one location (with equal scores).  We may wish to
discard such multi-mapping tags.  The following pipeline accomplishes
this, assuming that the output of the previous pipeline is in
"mymap2":

  awk '{print $0 "\t" $7 "\t" $1}' mymap2 | uniq -uf12 | cut -f1-12
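
An alternative sketch, which does not require the lines for each tag to
be adjacent, reads the file twice with awk: once to count the lines per
tag name (column 7), and once to print the tags seen exactly once.
This is our own variant (with a hypothetical function name), not a
replacement recommended by the LAST authors.

```shell
# usage: unique_mappers TABULAR_FILE
unique_mappers() {
  awk '
    NR == FNR { seen[$7]++; next }   # first pass: count lines per tag
    seen[$7] == 1                    # second pass: keep single-line tags
  ' "$1" "$1"
}
```

Because the file is named twice on the awk command line, the input must
be a regular file rather than a pipe.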

6. Using sequence quality scores
--------------------------------

LAST can also use sequence quality scores.  The quality scores can be
in FASTQ or PRB format.  For example:

  lastal -Q3 mousedb tags_prb.txt

The quality scores have no effect on finding initial matches, but they
do affect extending alignments from the initial matches.  If quality
scores are used, the default alignment scoring scheme is +6 for a
high-quality match and -18 for a high-quality mismatch.  Low-quality
matches and mismatches get scores between these values, as shown in
the following table.  If your tags are very short, make sure that the
alignment score threshold does not exceed the tag length times 6,
otherwise you will not get any matches!

======  ========  ========      ======  ========  ========
Solexa  Match     Mismatch      Phred   Match     Mismatch
score   score     score         score   score     score
======  ========  ========      ======  ========  ========
]  29       6       -18         >  29       6       -18
\  28       6       -17         =  28       6       -17
[  27       6       -17         <  27       6       -17
Z  26       6       -17         ;  26       6       -17
Y  25       6       -17         :  25       6       -17
X  24       6       -17         9  24       6       -17
W  23       6       -17         8  23       6       -17
V  22       6       -16         7  22       6       -16
U  21       6       -16         6  21       6       -16
T  20       6       -15         5  20       6       -15
S  19       6       -15         4  19       6       -15
R  18       6       -14         3  18       6       -14
Q  17       6       -14         2  17       6       -14
P  16       6       -13         1  16       6       -13
O  15       6       -13         0  15       6       -12
N  14       6       -12         /  14       6       -12
M  13       6       -11         .  13       6       -11
L  12       6       -10         -  12       6       -10
K  11       6       -10         ,  11       6        -9 
J  10       6        -9         +  10       6        -8 
I   9       5        -8         *   9       5        -7 
H   8       5        -7         )   8       5        -7 
G   7       5        -6         (   7       5        -6 
F   6       5        -6         '   6       5        -5 
E   5       5        -5         &   5       4        -4 
D   4       5        -4         %   4       4        -3 
C   3       4        -3         $   3       3        -2 
B   2       4        -3         #   2       2        -1 
A   1       3        -2         "   1      -1         0  
@   0       3        -2         !   0     -18         1  
?  -1       2        -1
>  -2       2        -1
=  -3       1        -1
<  -4       1         0
;  -5       0         0
   -6      -1         0
   -7      -2         0
   -8      -3         1
   -9      -3         1
  -10      -4         1
  -11      -5         1
  -12      -6         1
  -13      -7         1
  -14      -8         1
  -15      -9         1
  -16     -10         1
  -17     -10         1
  -18     -11         1
  -19     -12         1
  -20     -13         1
  -21     -13         1
  -22     -14         1
  -23     -15         1
  -24     -15         1
  -25     -16         1
  -26     -16         1
  -27     -16         1
  -28     -17         1
  -29     -17         1
  -30     -17         1
  -31     -17         1
  -32     -17         1
  -33     -17         1
  -34     -18         1
======  ========  ========      ======  ========  ========
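
The symbols in the first and fourth columns of this table are
consistent with the usual ASCII encodings: the character code minus 64
for Solexa scores, and minus 33 for Phred scores.  The sketch below
decodes a quality character on that assumption, and also computes the
largest alignment score a tag can reach under the default scheme.  Both
function names are hypothetical.

```shell
# usage: quality_score CHAR OFFSET   (OFFSET: 64 for Solexa, 33 for Phred)
quality_score() {
  code=$(printf '%d' "'$1")   # ASCII code of the character
  echo $(( code - $2 ))
}

# Largest possible score for a tag: every base a high-quality match (+6).
# usage: max_alignment_score TAG_LENGTH
max_alignment_score() {
  echo $(( $1 * 6 ))
}

quality_score ']' 64     # prints 29
quality_score '>' 33     # prints 29
max_alignment_score 36   # prints 216
```

A score threshold above the value printed by max_alignment_score can
never be reached by a tag of that length.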
