Quickstart

This section gives a brief overview of how to get started with synthaser and of its main features.

Pre-search configuration using the config module

The NCBI requires that you provide some identification before using their services in order to prevent abuse. This can be an e-mail address, or more recently, an API key (https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/).

You can use the config module to set these parameters for synthaser searches (you’ll only have to do this once!). This module will save a file, config.ini, wherever your operating system stores configuration files (for example, in Linux it will be saved in ~/.local/config/synthaser). When you run a remote search, synthaser first looks for this file and then checks whether an e-mail address or API key is saved in it; if neither is found, synthaser will throw an error.

To set an e-mail address:

synthaser config --email "foo@bar.com"

…or an API key:

synthaser config --api_key <your API key>

Running a synthaser search using the search module

Once configured, a synthaser search is as easy as:

synthaser search --query_file sequences.fasta

This will take all sequences in sequences.fasta and start a remote CD-Search run.

Note

Most of the arguments used with synthaser have shorter forms which can be found in the help menus (e.g. synthaser search --help). For example, the short version of --query_file here is -qf.

Alternatively, you can use the -qi/--query_ids argument to start a search using just a collection of NCBI identifiers. For example:

synthaser search -qi KAF4294870.1 KAF4294328.1 KAF4293514.1

This will retrieve each sequence and start a remote CD-Search run.
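If your identifiers live in a text file rather than on the command line, a small wrapper can feed them to -qi. The following is only a sketch: the function name search_ids_from_file and the one-identifier-per-line file layout are assumptions made for illustration, and only the documented -qi argument is used.

```shell
# Read NCBI identifiers (one per line) from the file given as $1 and
# pass them all to a single synthaser search via -qi.
# Blank lines and lines starting with '#' are skipped.
search_ids_from_file() {
    ids=""
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            ''|'#'*) continue ;;
        esac
        ids="$ids $line"
    done < "$1"

    if [ -z "$ids" ]; then
        echo "No identifiers found in $1" >&2
        return 1
    fi

    # Word splitting of $ids is intentional here:
    # each identifier becomes a separate argument.
    synthaser search -qi $ids
}
```

With a hypothetical file ids.txt holding the three accessions above, calling search_ids_from_file ids.txt would be equivalent to the command shown earlier.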

There are several optional arguments you can use which control the output synthaser will generate. To generate a synthaser plot of your sequences, add the -p/--plot argument, optionally specifying a name to generate a portable HTML file:

synthaser search -qf sequences.fasta -p <plot.html>

The plot will then open directly in your web browser.

You can save a file containing the data of a synthaser search by using the -json/--json_file argument:

synthaser search -qf sequences.fasta -json session.json

This is particularly useful in larger searches, as the search won’t have to be completely redone if you want to e.g. generate a new visualisation or output table. If the file specified does not exist, synthaser will create it; if it does exist, synthaser will attempt to load it.
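The create-or-load behaviour described above lends itself to a small wrapper that always points at the same session file. This is a sketch only: the function name search_with_session is hypothetical, and it assumes that the query file is still supplied when reloading and that output arguments such as -lf or -p can be combined with -json.

```shell
# Run (or reload) a search that is backed by a session file.
#   $1 = query FASTA file
#   $2 = session file (created on the first run, loaded on later runs)
#   any remaining arguments are passed through to synthaser unchanged
search_with_session() {
    query=$1
    session=$2
    shift 2
    synthaser search -qf "$query" -json "$session" "$@"
}

# Example usage (not executed here):
#   search_with_session sequences.fasta session.json                # runs the remote search
#   search_with_session sequences.fasta session.json -lf            # reloads, tabular output
#   search_with_session sequences.fasta session.json -p plot.html   # reloads, new plot
```

Keeping the session file name in one place makes it harder to start a fresh remote search by accident through a mistyped file name.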

The default output will show your query sequences and their domain architectures, grouped by their classifications. For example, the search using --query_ids from above produces:

PKS --> Type I --> Partially-reducing
-------------------------------------
KAF4294870.1    KS-AT-DH-MT-KR-ACP

Thiolase
--------
KAF4293514.1    KS
KAF4294328.1    KS

You can make this tabular by using the -lf/--long_form argument, which produces comma-separated output like:

Synthase,Length (aa),Architecture,Classification
KAF4294870.1,2445,KS-AT-DH-MT-KR-ACP,PKS|Type I|Partially-reducing
KAF4294328.1,413,KS,Thiolase
KAF4293514.1,419,KS,Thiolase

Each row contains the sequence identifier, its length, domain architecture and classification, so the output can be imported directly into spreadsheet software.
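Because the long-form output is plain comma-separated text, it can also be filtered with standard command-line tools. The sketch below uses awk to list the identifiers of every sequence whose classification starts with PKS; the function name pks_ids is made up for this example, and the sample rows are copied from the output above.

```shell
# Print the identifier (column 1) of every row whose Classification
# (column 4) starts with "PKS". NR > 1 skips the header row.
pks_ids() {
    awk -F',' 'NR > 1 && $4 ~ /^PKS/ { print $1 }'
}

# Feed the example table from above through the filter.
pks_ids <<'EOF'
Synthase,Length (aa),Architecture,Classification
KAF4294870.1,2445,KS-AT-DH-MT-KR-ACP,PKS|Type I|Partially-reducing
KAF4294328.1,413,KS,Thiolase
KAF4293514.1,419,KS,Thiolase
EOF
# prints: KAF4294870.1
```

This simple comma split works here because none of the fields contain commas; the classification levels are separated by | instead.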

Another very useful argument is --cdsid. This allows you to resume or load a CD-Search run at a later time. The CDSID (CD-Search identifier) is reported by synthaser at the start of every search, and takes the form:

QM3-qcdsearch-XXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY

For example, in the output of the above search:

[14:57:52] INFO - Starting synthaser
[14:57:56] INFO - Launching new CD-Search run
[14:57:58] INFO - Run ID: QM3-qcdsearch-894E2B07233244A-1C6342BEDF36CB85

To then get the tabular output for the same run, simply re-use the CDSID:

synthaser search \
        --query_ids KAF4294870.1 KAF4294328.1 KAF4293514.1 \
        --cdsid QM3-qcdsearch-894E2B07233244A-1C6342BEDF36CB85 \
        --long_form
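If you keep a copy of the log output, the CDSID can be pulled out of it automatically instead of being copied by hand. The following is a sketch under assumptions: the function name extract_cdsid is hypothetical, the pattern is based solely on the "Run ID:" line shown above, and it assumes you have saved the log text yourself.

```shell
# Read log text on stdin and print the first CD-Search run ID found
# on a line of the form "... Run ID: QM3-qcdsearch-<chars>-<chars>".
extract_cdsid() {
    sed -n 's/.*Run ID: \(QM3-qcdsearch-[0-9A-Za-z]*-[0-9A-Za-z]*\).*/\1/p' | head -n 1
}

# Try it on the log lines from the example above.
extract_cdsid <<'EOF'
[14:57:52] INFO - Starting synthaser
[14:57:56] INFO - Launching new CD-Search run
[14:57:58] INFO - Run ID: QM3-qcdsearch-894E2B07233244A-1C6342BEDF36CB85
EOF
# prints: QM3-qcdsearch-894E2B07233244A-1C6342BEDF36CB85
```

The captured value can then be stored in a shell variable and passed to --cdsid in a later search.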

CD-Search parameters can be altered using the following arguments:

Argument          Description
--cdsid           CD-Search run ID (e.g. QM3-qcdsearch-XXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY)
--smode           Search mode (auto, prec or live; def. auto)
--useid1          Look for sequences in archival database if not found (def. true)
--compbasedadj    Use composition-corrected scoring (def. 1)
--filter          Filter out compositionally biased regions from queries (def. true)
--evalue          Maximum E-value (def. 3)
--maxhit          Maximum number of hits to return (def. 500)
--dmode           Level of hits to report (full, rep or std; def. full)

For a fuller explanation of these arguments, see the NCBI CD-Search documentation.
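As an illustration of how these arguments might be combined, the sketch below wraps a stricter search in a function. The function name search_tuned and the specific values chosen are assumptions for demonstration only, not recommended settings; the argument names themselves come from the table above.

```shell
# Launch a search on the FASTA file given as $1 with a tighter E-value
# cutoff, fewer hits per query, and only representative hits reported.
# The values below are illustrative, not recommendations.
search_tuned() {
    synthaser search \
        --query_file "$1" \
        --evalue 0.01 \
        --maxhit 100 \
        --dmode rep
}

# Example usage (not executed here):
#   search_tuned sequences.fasta
```

Any argument left out keeps the default listed in the table.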

The synthaser plot

The synthaser plot is fairly intuitive, but deserves a little explanation of its own. Here is what the search above looks like plotted:

[Figure: synthaser plot]

Query sequences are drawn to scale, with coloured boxes representing the domains that were found. Sequences are grouped by their classifications, and annotation bars for classifications are drawn in the appropriate locations automatically. This looks a lot nicer when you have more sequences covering more classifications!

Hovering over a domain box in a sequence will show a tooltip menu that looks like:

[Figure: domain tooltip]

This shows you a summary of the best domain family hit, with linkouts to their corresponding entries in the NCBI databases. You can also copy amino acid sequences of either the entire query or just the specific domain by clicking the buttons at the bottom.

The legend contains a list of the domains found in all of your sequences next to a box with the corresponding colour. This colour can be changed simply by clicking the box and selecting another colour.

Sequences can be hidden by clicking on their names. If a sequence containing the last occurrence of a specific domain in the plot is hidden, that domain will automatically be removed from the legend as well.

Sizing and positioning of plot elements can be controlled by the settings in the dropdown menu on the right-hand side of the plot. In the above image, the only change from default was the maximum sequence length (in pixels); by default, this is set to 600 px.

You can generate FASTA files containing extracted sequences of specific domain types from your sequences using the Download domain sequences section at the bottom of the dropdown menu. Simply select a domain type and click the Download! button.

Once you are happy with your figure, you can download an SVG image file by clicking the Save SVG button at the top of the menu.