synthaser.results

This module stores functions for parsing CD-Search output.

All functionality is provided by parse, which takes an open file handle corresponding to a CD-Search hit table and returns a list of fully characterized Synthase objects. For example:

>>> from synthaser import results
>>> with open('results.txt') as handle:
...     synthases = results.parse(handle)
>>> synthases
[AN6791.2 KS-AT-DH-MT-ER-KR-ACP, ... ]

synthaser uses the results.DOMAINS dictionary to control which domain families are saved in any given search, and the quality thresholds (length, bitscore) that hits to those families must meet. This dictionary can be edited directly using update_domains, or loaded from a JSON file using load_domain_json.

For further details on the structure of an entry, how to obtain these values and how to use a custom domain file, please refer to the user guide.

synthaser.results.choose_representative_domain(group, by='evalue')

Select the best domain from a collection of overlapping domains.

This function tests rules stored in special_rules, which are lambdas that take two arguments. It sorts the group (by E-value by default), then tests each rule using the container (the first, best-scoring Domain) against all other Domains in the group.

If any test is True, the container's type is set to the rule key and the container is returned. Otherwise, this function returns the container Domain unmodified.

Parameters:
  • group (list) – Overlapping Domain objects
  • by (str) – Measure to use when determining the best domain of the group. Choices: ‘bitscore’ (return the domain with the highest bitscore, relative to threshold), ‘evalue’ (return the domain with the lowest E-value), ‘length’ (return the longest domain hit)
Returns:

Highest scoring Domain in the group. If any special rules have been satisfied, the type of this Domain will be set to that rule (e.g. Condensation -> Epimerization).

Return type:

Domain
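
The selection logic described above can be sketched roughly as follows. This is a minimal illustration, not synthaser's implementation: Hit is a hypothetical stand-in for the Domain class, SPECIAL_RULES contains a single made-up rule, and bitscores are compared directly rather than relative to a threshold.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical stand-in for synthaser's Domain class."""
    type: str
    start: int
    end: int
    evalue: float
    bitscore: float


# Hypothetical rule table: each key is the type to assign, each value is a
# two-argument test taking (container, other).
SPECIAL_RULES = {
    "E": lambda a, b: a.type == "C" and b.type == "E",
}


def choose_representative(group, by="evalue"):
    """Return the best hit of a group, retyped if any special rule fires."""
    sort_keys = {
        "evalue": lambda h: h.evalue,            # lowest E-value first
        "bitscore": lambda h: -h.bitscore,       # highest bitscore first
        "length": lambda h: -(h.end - h.start),  # longest hit first
    }
    if by not in sort_keys:
        raise ValueError(f"Unknown measure: {by!r}")
    # The first hit after sorting is the container; the rest are tested against it.
    container, *others = sorted(group, key=sort_keys[by])
    for new_type, rule in SPECIAL_RULES.items():
        if any(rule(container, other) for other in others):
            container.type = new_type
            break
    return container
```

In this sketch, a strong condensation hit that overlaps a weaker epimerization hit is returned with its type changed to the rule key, while a group with no matching rule returns its best hit unchanged.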

synthaser.results.domain_from_row(row)

Parse a domain hit from a row in a CD-search results file.

For example, a typical row might look like:

>>> print(row)
Q#1 - >AN6791.2     specific        225858  9       1134    0       696.51  COG3321 PksD    -       cl09938

Using this function will generate:

>>> domain_from_row(row)
PksD [KS] 9-1134

Parameters: row (str) – Tab-separated row from a CD-Search results file
Returns: Instance of the Domain class containing information about this hit
Return type: Domain
Raises: ValueError – If the domain in this row is not in the DOMAINS dictionary.

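
As a rough sketch of what row parsing involves, the snippet below splits a row like the one above into typed fields. It assumes the standard CD-Search hit table column order (query, hit type, PSSM ID, from, to, E-value, bitscore, accession, short name, incomplete, superfamily). Hit, FAMILY_TYPES and hit_from_row are hypothetical names used for illustration only and are not part of synthaser.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical stand-in for synthaser's Domain class."""
    query: str
    family: str
    type: str
    start: int
    end: int
    evalue: float
    bitscore: float


# Hypothetical lookup from accepted domain families to domain types.
FAMILY_TYPES = {"PksD": "KS"}


def hit_from_row(row):
    """Build a Hit from one tab-separated CD-Search hit table row."""
    fields = row.rstrip("\n").split("\t")
    (query, _hit_type, _pssm_id, start, end,
     evalue, bitscore, _accession, family, *_rest) = fields
    if family not in FAMILY_TYPES:
        raise ValueError(f"{family} is not a recognised domain family")
    return Hit(
        query=query.split(">")[-1].strip(),  # "Q#1 - >AN6791.2" gives "AN6791.2"
        family=family,
        type=FAMILY_TYPES[family],
        start=int(start),
        end=int(end),
        evalue=float(evalue),
        bitscore=float(bitscore),
    )
```

Rejecting unknown families at this stage mirrors the documented ValueError, so callers can decide whether to skip such rows or stop.
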
synthaser.results.filter_domains(domains, by='evalue', coverage_pct=0.5, tolerance_pct=0.1)

Filter overlapping Domain objects and test adjacency rules.

Adjacency rules are tested again here, in case they are missed within overlap groups. For example, the NRPS-para261 domain is not always entirely contained by a condensation domain, so should be caught by this pass.

Parameters:
  • domains (list) – Domain instances to be filtered
  • by (str) – Metric used to choose the representative domain hit (default: ‘evalue’)
  • coverage_pct (float) – Conserved domain coverage percentage threshold
  • tolerance_pct (float) – CD length tolerance percentage threshold
Returns:

Domain objects remaining after filtering

Return type:

list

synthaser.results.filter_results(results, **kwargs)

Build Synthase objects from a parsed results dictionary.

Any additional kwargs are passed to filter_domains.

Parameters: results (dict) – Grouped Domains; output from parse_cdsearch.
Returns: Synthase objects containing all Domain objects found in the CD-Search.
Return type: list

synthaser.results.group_overlapping_hits(domains)

Iterator that groups Domain objects based on overlapping locations.

Parameters: domains (list) – Collection of Domain objects belonging to a Synthase
Yields: group (list) – Group of overlapping Domain objects

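
Grouping by overlapping location can be sketched with a single pass over the hits sorted by start position. The generator below works on plain (start, end) tuples rather than Domain objects, and it treats any shared position as an overlap; both choices are assumptions made for illustration, and group_overlapping is a hypothetical name.

```python
def group_overlapping(hits):
    """Yield lists of (start, end) hits whose locations overlap."""
    ordered = sorted(hits, key=lambda hit: hit[0])
    if not ordered:
        return
    group = [ordered[0]]
    group_end = ordered[0][1]
    for hit in ordered[1:]:
        start, end = hit
        if start <= group_end:
            # Hit begins inside the current group, so extend the group.
            group.append(hit)
            group_end = max(group_end, end)
        else:
            # Gap found: emit the finished group and start a new one.
            yield group
            group = [hit]
            group_end = end
    yield group
```

Tracking the furthest end seen so far, rather than the end of the previous hit, keeps a short hit nested inside a long one from splitting the group prematurely.
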
synthaser.results.is_fragmented_domain(one, two, coverage_pct=0.5, tolerance_pct=0.1)

Detect if two adjacent domains are likely a single domain.

This is useful in cases where a domain is detected with multiple small hits. For example, an NRPS may have two adjacent condensation (C) domain hits that are both individually too small and low-scoring, but should likely just be merged.

If two hits are close enough together, such that the distance between the start of the first and the end of the second is within some tolerance (default ±10%) of the total length of the domain's PSSM, this function will return True.

Parameters:
  • one (Domain) – Domain instance
  • two (Domain) – Domain instance
  • coverage_pct (float) – Conserved domain hit percentage coverage threshold. A hit is considered truncated if its total length is less than coverage_pct * CD length.
  • tolerance_pct (float) – Percentage of CD length to use when calculating acceptable lower/upper bounds for combined domains.
Returns:

True if the Domain instances are likely fragmented and should be combined; False if they should remain separate.

Return type:

bool
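
The test described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: hits are plain (start, end) tuples with one preceding two, cd_length is passed in explicitly instead of being looked up from the domain's PSSM, and both hits are required to be truncated before the combined span is checked. looks_fragmented is a hypothetical name.

```python
def looks_fragmented(one, two, cd_length, coverage_pct=0.5, tolerance_pct=0.1):
    """Guess whether two adjacent hits are pieces of a single domain."""

    def truncated(hit):
        # A hit is truncated if it covers less than coverage_pct of the CD.
        return hit[1] - hit[0] < coverage_pct * cd_length

    if not (truncated(one) and truncated(two)):
        return False

    # Distance from the start of the first hit to the end of the second.
    span = two[1] - one[0]
    lower = cd_length * (1 - tolerance_pct)
    upper = cd_length * (1 + tolerance_pct)
    return lower <= span <= upper
```

With a conserved domain of length 400 and the default thresholds, two truncated hits are merged only if together they span between 360 and 440 positions.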

synthaser.results.load_domains(rule_file)

Loads domains from a synthaser rule file.

Rule file domain schema:

    {
        'name': KS,
        'domains': [
            {
                'accession': 'smart00825',
                'name': 'PKS_KS'
                …
            }
        ]
    }

This function flattens the domain type array to create a dictionary of domain families, so these can be easily looked up directly from CD-Search rows.
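
A minimal sketch of this flattening step is shown below. It assumes the domain type array has already been read from the rule file, and it keys the result on each family's name; the real function may key on a different field. flatten_domain_types is a hypothetical helper, and the example data only uses the fields shown in the schema above.

```python
import json


def flatten_domain_types(domain_types):
    """Map each domain family name to its entry, annotated with its type."""
    families = {}
    for domain_type in domain_types:
        for family in domain_type["domains"]:
            # Copy the family entry and record which domain type it belongs to.
            families[family["name"]] = {**family, "type": domain_type["name"]}
    return families


# Example array containing a single domain type with one family.
rule_text = """
[
  {"name": "KS", "domains": [{"accession": "smart00825", "name": "PKS_KS"}]}
]
"""
families = flatten_domain_types(json.loads(rule_text))
```

A flat dictionary means that a family name read from a CD-Search row can be resolved to its domain type with a single lookup.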

synthaser.results.parse(handle, mode='remote', **kwargs)

Parse CD-Search results.

Any additional kwargs are passed to filter_results.

Parameters:
  • handle (file) – An open CD-Search results file handle. If you used the website to analyse your sequences, the file you should download is Domain hits, Data mode: Full, ASN text. When using a CDSearch object, this format is automatically selected.
  • mode (str) – Search mode (‘local’ or ‘remote’)
Returns:

A list of Synthase objects parsed from the results file.

Return type:

list

Raises:

ValueError – Search mode not ‘local’ or ‘remote’

synthaser.results.parse_cdsearch(handle)

Parse a CD-Search results table and instantiate Domain objects for each hit.

Parameters: handle (file) – Open file handle corresponding to a CD-Search results file.
Returns: Lists of Domain objects keyed on the query they were found in.
Return type: dict

synthaser.results.parse_rpsbproc(handle)

Parse a results file generated by rpsblast->rpsbproc.

This function takes a handle corresponding to an rpsbproc output file. local.rpsbproc returns a subprocess.CompletedProcess object, which contains the results as a byte string in its stdout attribute.
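
Because the results arrive as bytes while the parser expects a handle, one way to bridge the two is to decode stdout and wrap it in io.StringIO. The sketch below uses only the standard library. handle_from_process is a hypothetical helper, and the CompletedProcess is built by hand with placeholder bytes that are not real rpsbproc output.

```python
import io
import subprocess


def handle_from_process(process):
    """Decode a CompletedProcess's byte-string stdout into a text handle."""
    return io.StringIO(process.stdout.decode("utf-8"))


# Placeholder process object standing in for the result of a real search.
process = subprocess.CompletedProcess(
    args=["rpsbproc"], returncode=0, stdout=b"first line\nsecond line\n"
)
handle = handle_from_process(process)
lines = [line.rstrip("\n") for line in handle]
```

The resulting object supports line iteration like an open text file, so it could be passed wherever a results file handle is expected.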