Metadata-Version: 2.1
Name: lamassemble
Version: 1.7.2
Summary: Merge overlapping "long" DNA reads into a consensus sequence
Home-page: https://gitlab.com/mcfrith/lamassemble
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: License :: OSI Approved :: MIT License
Description-Content-Type: text/markdown
License-File: LICENSE

# lamassemble

`lamassemble` merges overlapping "long" DNA reads into a consensus
sequence (i.e. assembles them).  It works OK when the number of reads
is less than a thousand or so.

The merging is not always accurate.  In particular, if the reads come
from huge tandem repeats, wrong parts of the reads may get merged.

**Usage option 1:** use this [web
interface](https://mafft.cbrc.jp/alignment/server/index-rawreads.html).

**Usage option 2:** run it on your computer.  You can install it from
[Debian Med][] or [bioconda][]:

    conda install -c bioconda lamassemble

Or install it manually.  First, you need to install
[LAST](https://gitlab.com/mcfrith/last) (version >= 935) and
[MAFFT](https://mafft.cbrc.jp/alignment/software/) (version >= 7.450)
on your computer, i.e. put them in your `PATH`.  You can install
`lamassemble` in various ways, perhaps by running these commands:

    git clone https://gitlab.com/mcfrith/lamassemble.git
    cp lamassemble/lamassemble ~/bin/

After installing, you can run it like this:

    lamassemble last-train-file sequences.fx > consensus.fa

* The sequence file may be in fasta or fastq format (it makes no difference).
* It's OK to use gzipped (`.gz`) files.

You need to give it a file made by `last-train`, with the rates of
insertions, deletions, and substitutions in the reads.

## Built-in `last-train` files

These files (with case-insensitive names) probably won't be ideal for
your data, but they might be good enough:

* `promethion-2019`: from human DNA sequenced with PromethION R9.4,
  base-called with Guppy 1.4.0 ([De Coster et al. Genome
  Res. 2019](https://www.ncbi.nlm.nih.gov/pubmed/31186302),
  ERR2631604).

* `promethion-rna-2019`: from human RNA (direct RNA), sequenced with
  PromethION R9.4, base-called with Albacore.

* `sequel-II-CLR-2019`: from human DNA sequenced with PacBio Sequel II,
  Continuous Long Reads ([Wenger et
  al. Nat. Biotechnol. 2019](https://www.ncbi.nlm.nih.gov/pubmed/31406327),
  SRR9972588).

Different versions of sequencing hardware or software often have
similar error rates, so these files may work OK, but it's hard to make
guarantees.  If your reads have very different base frequencies
(e.g. AT-rich _Plasmodium_ DNA), a specialized last-train file should
be used.

## Making your own `last-train` file

You can make a `last-train` file by comparing your reads to a genome
of the same (or very closely-related) species, [like
this](https://github.com/mcfrith/last-rna/blob/master/last-long-reads.md).

## Details

* `lamassemble` runs `LAST` to find pairwise alignments between the
  sequences, then runs `MAFFT` to align all the sequences together,
  guided by the `LAST` alignments.

* `lamassemble` assumes that the rates of insertion, deletion, and
  substitutions between two reads are "double" those in the
  `last-train` file, because errors occur in both reads.

* The substitution error rates in the forward and reverse strands of
  the reads may not be the same.  So the reads should all be kept on
  the same strands for `last-train` and `lamassemble`.

## Missing sequences

You may see a warning message like this:

    lamassemble: using 29 out of 32 sequences (linked by pairwise alignments)

This could mean that some sequences are not similar to the others, or
that `LAST` didn't find the similarities.  (To save time, `LAST` does
not find all pairwise similarities.)  You can make `LAST` find more
similarities by increasing option `-m` (and/or decreasing `-u` or `-W`).

## Options

- `-h`, `--help`: show a help message, with default option values, and exit.

- `-a`, `--alignment`: print an alignment, not a consensus sequence.

- `-c`, `--consensus`: just make a consensus, of already-aligned sequences.

- `-f FMT`, `--format=FMT`: consensus output format, fasta/fa or
  fastq/fq.  Fastq shows the error probability of each base (assuming
  the alignment is correct, so over-optimistic).  The format name is
  case-insensitive.

- `-g G`, `--gap-max=G`: make the consensus sequence from alignment
  columns with <= G% gaps.

- `--end`: make the consensus sequence from alignment columns with <=
  G% gaps *including gaps past the ends of the sequences*.  This may
  be useful for full-length reads (e.g. of PCR products) with
  occasional spurious flanks.

- `-s S`, `--seq-min=S`: omit consensus flanks covered by < S
  sequences.  You can specify either a number (e.g. `-s 3`) or a
  percentage (e.g. `-s 50%`).

- `-n NAME`, `--name=NAME`: name of the consensus sequence.

- `-o BASE`, `--out=BASE`: just write MAFFT input files, named `BASE.xxx`.

- `-p P`, `--prob=P`: use pairwise anchors that have error probability <= P.

- `-d D`, `--diagonal-max=D`: only use pairwise alignments whose
  "diagonal" changes by <= D from neighboring higher-scoring pairwise
  alignments.  "Diagonal" means: coordinate in one sequence minus
  coordinate in the other sequence.

- `-v`, `--verbose`: show progress messages.

- `--all`: keep all of each sequence.  The default is to trim ends
  that lack any pairwise LAST alignment.

- `--mafft=ARGS`: additional arguments for MAFFT.

### Options that get passed to `LAST`:

- `-P P`: number of parallel threads.

- `-u U`: go faster by getting `U`-fold fewer initial matches between
  2 sequences.  `U` must be 4, 8, 16, 32, 64, or 128 (values allowed
  by `lastdb -uRY`).

- `-W W`: get initial matches between 2 sequences starting at any base
  that is the "minimum" in any window of `W` consecutive bases.
  ("Minimum" means that the sequence starting here is alphabetically
  earliest.)  Increasing `W` reduces time and memory usage, but may
  reduce accuracy.

- `-m M`: maximum number of initial matches starting at any base in
  the "query" sequence.  Decreasing `M` reduces time and memory usage,
  but may reduce accuracy.

- `-z Z`: maximum gap length.  Larger values may improve accuracy a
  bit, but increase run time.

[bioconda]: https://bioconda.github.io/user/install.html
[Debian Med]: https://www.debian.org/devel/debian-med/
