synthaser.classify

This module contains the logic for classifying synthase objects based on user-defined rules.

To classify a collection of sequences, use the classify function:

>>> from synthaser.classify import classify
>>> classify(my_sequences)

A custom classification rule file can be provided to this function like so:

>>> classify(my_sequences, rule_file="my_rules.json")

Briefly, rule files should contain:

  1. Rule entries, specifying the domain combinations required to satisfy them
  2. Rule graph, encoding the hierarchy and order in which rules are evaluated

Alternatively, you could build a RuleGraph object in Python, e.g.:

>>> from synthaser.classify import Rule, RuleGraph
>>> one = Rule(name="Rule 1", domains=["D1", "D2"], evaluator="0 and 1")
>>> two = Rule(name="Rule 2", domains=["D3", "D4", "D5"], evaluator="(0 and 1) or 2")
>>> three = Rule(name="Rule 3", domains=["D6", "D7"], evaluator="0 or 1")
>>> graph = [
...     "Rule 1",
...     {
...         "Rule 2": ["Rule 3"]
...     }
... ]
>>> rg = RuleGraph(rules=[one, two, three], graph=graph)

And then save it to a file:

>>> with open("my_rules.json", "w") as fp:
...     rg.to_json(fp)

This RuleGraph object can directly classify Synthase objects:

>>> rg.classify(my_sequences)

For further explanation of rule files, refer to the documentation.

class synthaser.classify.Rule(name=None, order=None, renames=None, domains=None, filters=None, evaluator=None, **kwargs)

A classification rule.

name

Name given to proteins satisfying this rule.

Type:str
domains

Domain types required to satisfy rule.

Type:list
filters

Specific CDD families for each domain type.

Type:dict
evaluator

Evaluatable rule satisfaction statement.

Type:str
evaluate(conditions)

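Evaluates the rule's evaluator string given evaluated conditions.

Iterates backwards to avoid bad substitutions for larger (>=10) indices. For example, substituting forwards in "0 and 1 and … and 13" would replace the "1" inside "13" before index 13 is reached, giving "False and True and … and True3".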

Parameters:conditions (list) – Boolean values corresponding to domains in this rule.
Returns:True if rule is satisfied, otherwise False.
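
The backwards substitution described above can be sketched as follows. This is a minimal illustration rather than the library's implementation; the standalone function name `evaluate_backwards` is hypothetical, and it assumes the evaluator string contains only index placeholders, `and`/`or`/`not` and parentheses.

```python
def evaluate_backwards(evaluator, conditions):
    """Replace index placeholders with booleans, then evaluate the result.

    Hypothetical standalone helper, not part of synthaser.
    """
    # Walk indices from highest to lowest so that "13" is substituted
    # before "1" or "3" can corrupt it.
    for index in reversed(range(len(conditions))):
        evaluator = evaluator.replace(str(index), str(conditions[index]))
    # After substitution the string holds only True/False, boolean
    # operators and parentheses, so it can be evaluated as an expression.
    return eval(evaluator)


small = evaluate_backwards("(0 and 1) or 2", [True, False, True])
large = evaluate_backwards("0 and 13", [True] * 14)
```

Because "True" and "False" contain no digits, a substituted value can never be mistaken for a remaining placeholder.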
rename_domains(domains)

Renames domain types if substitutions are specified in the rule.

The rename dictionary maps domain types to other domain types. For example, an ACP domain in a PKS matches the same PP-binding domain as a T domain in an NRPS, so to follow the naming convention the NRPS rule renames ACPs to Ts.

Additionally, rename rules can be nested dicts to allow extra rules. For example, in a PKS-NRPS, the PP-binding domain in the NRPS module should be named T, not ACP. So, its rule is {'after': ['A', 'C'], 'to': 'T'}; any ACP domains after the first A or C will be renamed T.
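
To make the two rename forms concrete, here is a small sketch that works on plain lists of domain type strings. It assumes the rename dictionary maps a source type either to a target type or to a nested dict with 'after' and 'to' keys, as described above; the function name `rename_types` is hypothetical and the real method may store and apply renames differently.

```python
def rename_types(types, renames):
    """Return a copy of `types` with rename rules applied.

    Hypothetical helper operating on type strings, not Domain objects.
    """
    renamed = list(types)
    for source, rule in renames.items():
        if isinstance(rule, str):
            # Simple rename: every `source` domain becomes `rule`.
            start, target = 0, rule
        else:
            # Conditional rename: only domains after the first anchor.
            anchors = [i for i, t in enumerate(types) if t in rule["after"]]
            if not anchors:
                continue
            start, target = anchors[0] + 1, rule["to"]
        for i in range(start, len(types)):
            if types[i] == source:
                renamed[i] = target
    return renamed


hybrid = ["KS", "AT", "ACP", "C", "A", "ACP", "R"]
result = rename_types(hybrid, {"ACP": {"after": ["A", "C"], "to": "T"}})
```

Here only the second ACP, which follows the first C domain, is renamed; the ACP in the PKS module keeps its name.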

satisfied_by(domains)

Evaluates this rule against a collection of domains.

Checks that: 1) required domain types are represented in the supplied domains, and 2) domains are of the desired CDD families, if any are specified.

Placeholders in the evaluator string are then replaced by their respective booleans, and evaluated.

Once a domain in the supplied domains has been matched to one in the rule, it cannot be matched to another. This enables rules based on domain counts (e.g. a multi-modular PKS with 2 KS domains).
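
The one-match-per-domain behaviour can be illustrated with a short sketch. It ignores CDD family filtering and works on plain type strings; `match_conditions` is a hypothetical name, and the output is assumed to be the list of booleans that would then be substituted into the evaluator string.

```python
def match_conditions(required, available):
    """Return one boolean per required type, consuming each matched domain.

    Hypothetical helper; family filters are deliberately left out.
    """
    remaining = list(available)
    conditions = []
    for wanted in required:
        if wanted in remaining:
            # A matched domain is removed so it cannot satisfy a second
            # requirement of the same type.
            remaining.remove(wanted)
            conditions.append(True)
        else:
            conditions.append(False)
    return conditions


single_module = match_conditions(["KS", "KS"], ["KS", "AT", "ACP"])
multi_module = match_conditions(["KS", "KS"], ["KS", "AT", "ACP", "KS", "AT", "ACP"])
```

A rule listing "KS" twice with the evaluator "0 and 1" would therefore only be satisfied by the second, two-KS input.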

valid_family(domain)

Checks a given domain matches a specified CDD family in the rule.

If no families have been specified for the given domain type, this function will return True (i.e. any family of the type is accepted).

This behaviour is controlled by the filters property of a synthaser rule. For example, to restrict a KS domain to certain CDD families:

"filters": [
    {
        "type": "KS",
        "domains": ["one", "two"]
    }
]
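
A minimal sketch of this check is shown below. It assumes each filter entry is a dictionary with "type" and "domains" keys, and that a domain exposes its type and its CDD family name; the `Domain` namedtuple and its `type` and `domain` field names are stand-ins chosen for illustration, not the real synthaser.models.Domain.

```python
from collections import namedtuple

# Stand-in for a domain object: `type` is the domain type (e.g. "KS"),
# `domain` is the CDD family name. Both field names are assumptions.
Domain = namedtuple("Domain", ["type", "domain"])


def family_allowed(domain, filters):
    """Return True if the domain's family passes the filters for its type."""
    for entry in filters:
        if entry["type"] == domain.type:
            return domain.domain in entry["domains"]
    # No filter for this domain type: any family is accepted.
    return True


filters = [{"type": "KS", "domains": ["one", "two"]}]
accepted = family_allowed(Domain("KS", "one"), filters)
rejected = family_allowed(Domain("KS", "three"), filters)
unfiltered = family_allowed(Domain("AT", "anything"), filters)
```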
valid_order(domains: List[synthaser.models.Domain]) → bool

Checks given domains match specified order, if any.

Used for rules where domain order matters, e.g. to distinguish a hybrid NRPS-PKS, where the NRPS module comes before the PKS module, from a PKS-NRPS, where it comes after.

Iterates the rule's domain order list, finding the earliest matching index of each entry in the given domain list. If a domain is not found, or its index is lower than that of the previous entry (i.e. it occurs earlier than the previous domain), the order is invalid and False is returned.
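
The order check can be sketched on plain type strings as follows. Both function names are hypothetical; `earliest_index` plays the role that get_domain_index is described as playing, returning None when the type is absent.

```python
from typing import List, Optional


def earliest_index(query: str, types: List[str]) -> Optional[int]:
    """Return the first index of `query` in `types`, or None if absent."""
    for index, domain_type in enumerate(types):
        if domain_type == query:
            return index
    return None


def order_is_valid(order: List[str], types: List[str]) -> bool:
    """Check that each type in `order` first occurs no earlier than the last."""
    previous = -1
    for wanted in order:
        current = earliest_index(wanted, types)
        if current is None or current < previous:
            return False
        previous = current
    return True


pks_nrps = ["KS", "AT", "ACP", "C", "A", "T"]
ks_then_a = order_is_valid(["KS", "A"], pks_nrps)
a_then_ks = order_is_valid(["A", "KS"], pks_nrps)
```

With this input, an order of KS then A is valid, whereas A then KS is not, which is how the same set of domain types could be told apart as PKS-NRPS or NRPS-PKS.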

class synthaser.classify.RuleGraph(rules=None, graph=None)

A hierarchy of classification rules.

The RuleGraph is used to classify synthases based on their domains. It stores Rule objects, as well as a directed graph controlling the order and hierarchy of classification.

An example synthaser rule graph looks like this:

[
    "Hybrid",
    {"PKS": ["HR-PKS", "PR-PKS", "NR-PKS"]},
    "NRPS"
]

In this example, the "Hybrid" rule is evaluated first. If it is not satisfied, the "PKS" rule is evaluated. If "PKS" is satisfied, synthaser recurses into its child rules and evaluates "HR-PKS", "PR-PKS" and "NR-PKS" in turn, and so on. Each rule name must have a corresponding entry in the rules attribute.

Note that terminal leaves in the graph are placed in lists, whereas hierarchies are written as dictionaries of lists. This preserves rule order in Python, as well as preventing empty, unnecessary dictionaries at every level.
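
As an illustration of this layout, the sketch below walks a graph of bare names and dictionaries of lists, and reports any rule name that has no entry in a rules dictionary. The helper names `iter_rule_names` and `missing_rules` are hypothetical and not part of synthaser.

```python
def iter_rule_names(graph):
    """Yield rule names in written order, parents before their children."""
    for node in graph:
        if isinstance(node, dict):
            # A hierarchy: each key is a parent rule, each value its children.
            for parent, children in node.items():
                yield parent
                yield from iter_rule_names(children)
        else:
            # A terminal leaf: a bare rule name.
            yield node


def missing_rules(graph, rules):
    """Return graph rule names that lack an entry in `rules`."""
    return [name for name in iter_rule_names(graph) if name not in rules]


graph = ["Hybrid", {"PKS": ["HR-PKS", "PR-PKS", "NR-PKS"]}, "NRPS"]
names = list(iter_rule_names(graph))
missing = missing_rules(graph, {"Hybrid": None, "PKS": None, "NRPS": None})
```

Running a check like this before classification would surface a graph that names a rule which was never defined.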

rules

Collection of synthaser rules.

Type:dict
graph

Hierarchy of synthaser rules for classification.

Type:dict
synthaser.classify.classify(synthases, rule_file=None)

Classifies synthases based on defined rules.

If no rule_file is provided, the packaged rules.json will be loaded by default.

Parameters:
  • synthases (list) – Synthase objects to classify.
  • rule_file (str) – Path to custom classification rule file.
synthaser.classify.get_domain_index(query: str, domains: List[synthaser.models.Domain]) → Optional[int]

Finds the earliest index of a domain in a list of domains, if present.

synthaser.classify.traverse_graph(graph, rules, domains, classifiers=None)

Traverses a rule graph and classifies a collection of domains.

Each node is a dictionary with the schema:

{
    "title": "Rule name",
    "children": [
        {
            "title": "Rule name",
            "children": [ … ]
        }
    ]
}

Rules are evaluated in order. If a rule is successfully evaluated, this function will recurse into any child rules, if any exist.

Finally a classification list, containing the path of rules satisfied by the given domains, is returned.

Parameters:
  • graph (list, dict) – Rule graph to traverse.
  • rules (dict) – Rule objects to evaluate on domains.
  • domains (list) – Domain objects to classify.
  • classifiers (list) – Current classifiers for a Domain collection.
Returns:

classifiers
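
The traversal described above might look roughly like the following sketch. It uses the "title"/"children" node schema, and replaces Rule objects with plain predicate functions so that it runs on its own. Stopping at the first satisfied rule on each level is an assumption, as are the function name `traverse` and the example rules.

```python
def traverse(graph, rules, domains, classifiers=None):
    """Collect the path of satisfied rule names through the graph.

    Hypothetical sketch: `rules` maps rule names to predicates over a list
    of domain type strings, standing in for Rule.satisfied_by.
    """
    if classifiers is None:
        classifiers = []
    for node in graph:
        is_satisfied = rules[node["title"]]
        if is_satisfied(domains):
            classifiers.append(node["title"])
            # Recurse into child rules, if any exist.
            traverse(node.get("children", []), rules, domains, classifiers)
            break  # assumption: the first satisfied rule on a level wins
    return classifiers


graph = [
    {"title": "PKS", "children": [{"title": "NR-PKS", "children": []}]},
    {"title": "NRPS", "children": []},
]
rules = {
    "PKS": lambda types: "KS" in types,
    "NR-PKS": lambda types: "SAT" in types,
    "NRPS": lambda types: "A" in types,
}
path = traverse(graph, rules, ["SAT", "KS", "AT", "ACP"])
```

Defaulting `classifiers` to None and creating the list inside the function avoids sharing one list between separate top-level calls.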