synthaser.ncbi

This module handles all interaction with NCBI.

Given a collection of Synthase objects, a workflow might look like:

  1. Launch new CD-Search run
>>> cdsid = ncbi.launch(synthases)
>>> cdsid
QM3-qcdsearch-B4BAD4B59BC5B80-3E7CFCD3F93E21D0

The query sequences are sent to the batch CD-Search API, where a new run is started and assigned a unique CD-Search identifier (CDSID) which can be used to check on search progress.

Note that search parameters for this search are specified by the values in SEARCH_PARAMS:

>>> ncbi.SEARCH_PARAMS
{
  'db': 'cdd',
  'smode': 'auto',
  'useid1': 'true',
  'compbasedadj': '1',
  'filter': 'true',
  'evalue': '3.0',
  'maxhit': '500',
  'dmode': 'full',
  'tdata': 'hits'
}

which can be freely edited, either directly or by using the set_search_params function.

  2. Poll CD-Search API for results using the CDSID
>>> response = ncbi.retrieve(cdsid)

This function polls the API at regular intervals until either results are returned or an error occurs. Internally, it calls check(), which takes a CDSID and sends a single request to the API. It returns a Response object (from the requests library), which holds any search content in its text or content properties.

>>> print(response.text)
#Batch CD-search tool   NIH/NLM/NCBI
#cdsid  QM3-qcdsearch-B4BAD4B59BC5B80-3E7CFCD3F93E21D0
#datatype       hitsFull Results
#status 0
#Start time     2019-09-03T04:21:23     Run time        0:00:04:23
#status success

  3. Parse results and create Synthase objects
>>> from synthaser import results
>>> handle = response.text.split("\n")
>>> synthases = results.parse(handle)

Additionally, this module provides efetch_sequences, a function for fetching sequences from NCBI for a collection of accessions. For example:

>>> ncbi.efetch_sequences(['CBF71467.1', 'XP_681681.1'])
{'CBF71467.1': 'MQSAGMHRATA...', 'XP_681681.1': 'MQDLIAIVGSA...'}

The accessions are sent to the NCBI’s Entrez API, which returns the sequences in FASTA format. They are parsed using fasta.parse, and the resulting dictionary is returned.

synthaser.ncbi.check(cdsid)

Checks the status of a running CD-search job.

CD-Search runs are assigned a unique search ID, which typically takes the form:

QM3-qcdsearch-xxxxxxxxxxx-yyyyyyyyyyy

This function queries NCBI for the status of a running CD-Search job corresponding to the search ID specified by cdsid.

>>> response = check('QM3-qcdsearch-B4BAD4B59BC5B80-3E7CFCD3F93E21D0')

If the job has finished, this function will return the requests.Response object which contains the run results. If the job is still running, this function will return None. If an error is encountered, a ValueError will be thrown with the corresponding error code and message.

Parameters:

cdsid (str) – CD-search identifier (CDSID).

Returns:

True if the job has completed and is ready for download; False if the job is still running.

Return type:

bool

Raises:
  • ValueError – If the returned results file has a successful status code but is actually empty (i.e. contains no results), perhaps due to an invalid query.
  • ValueError – When a status code of 1, 2, 4 or 5 is returned from the request.
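
The status handling described above can be sketched without any network access. The following is a minimal illustration, not the module's actual code: parse_status and job_finished are hypothetical names, and treating status 3 as "still running" is an assumption of this sketch. It does not cover the empty-results case.

```python
import re

# Status codes as described above: 0 means finished; 1, 2, 4 and 5 are errors.
# Treating 3 as "still running" is an assumption of this sketch.
ERROR_CODES = {1, 2, 4, 5}
RUNNING_CODE = 3


def parse_status(text):
    """Return the integer status code from a CD-Search response header."""
    match = re.search(r"^#status\s+(\d+)", text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("No numeric #status line found in response")
    return int(match.group(1))


def job_finished(text):
    """Return True if finished, False if running; raise ValueError on errors."""
    status = parse_status(text)
    if status in ERROR_CODES:
        raise ValueError(f"CD-Search returned error status {status}")
    if status == RUNNING_CODE:
        return False
    if status == 0:
        return True
    raise ValueError(f"Unexpected status code {status}")
```

Matching only a numeric value means a later line such as "#status success" is ignored rather than misread as a code.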

synthaser.ncbi.efetch_sequences(headers)

Retrieve protein sequences from NCBI for supplied accessions.

This function uses EFetch from the NCBI E-utilities to retrieve the sequences for all synthases specified in headers. It then calls fasta.parse to parse the returned response; note that extra processing has to occur because the returned FASTA will contain a full sequence description in the header line after the accession.

Parameters: headers (list) – A collection of NCBI sequence identifiers (accession, GI, etc.)
Returns: Sequences downloaded from NCBI
Return type: sequences (dict)
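
The extra header processing mentioned above can be illustrated with a small standalone parser. This is a sketch only: parse_fasta is a hypothetical stand-in and is not claimed to match fasta.parse. It assumes the accession is the first whitespace-delimited token of each header line.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict keyed by the first header token."""
    chunks = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the accession, dropping the trailing description
            fields = line[1:].split(None, 1)
            if not fields:
                raise ValueError("Encountered an empty FASTA header")
            accession = fields[0]
            chunks[accession] = []
        elif accession is not None:
            # Sequence lines may be wrapped, so collect and join them later
            chunks[accession].append(line)
    return {acc: "".join(parts) for acc, parts in chunks.items()}
```

For example, a record whose header reads ">CBF71467.1 some description" ends up under the key 'CBF71467.1', which matches the shape of the dictionary shown in the module overview.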

synthaser.ncbi.get_results(cdsid)

Downloads results corresponding to a CDSID.

Parameters: cdsid (str) – CD-Search identifier
Returns: Response object containing search results
Return type: requests.Response
Raises: ValueError – If response has bad status code

synthaser.ncbi.launch(query)

Launches a new CD-Search run.

Parameters:

query (Synthase, SynthaseContainer) – Synthase objects to be searched. This could either be a single Synthase object or a SynthaseContainer; other objects could be used as long as they implement a to_fasta method.

Returns:

CD-Search ID (CDSID) corresponding to the new run. This takes the form: QM3-qcdsearch-XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYY.

Return type:

cdsid (str)

Raises:
  • AttributeError – query has no to_fasta method
  • AttributeError – No CDSID was returned from NCBI
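
The two failure modes listed above can be sketched in isolation. The helpers below are hypothetical and make no claim about how launch() is actually written; the regular expression is an assumption based on the example identifiers shown in this document.

```python
import re

# Assumed pattern, inferred from the example CDSIDs in this documentation
CDSID_PATTERN = re.compile(r"QM3-qcdsearch-[A-Z0-9]+-[A-Z0-9]+")


def query_to_fasta(query):
    """Return FASTA text for any object that implements to_fasta()."""
    if not hasattr(query, "to_fasta"):
        raise AttributeError("Query object has no to_fasta method")
    return query.to_fasta()


def extract_cdsid(text):
    """Pull the first CDSID out of a response body."""
    match = CDSID_PATTERN.search(text)
    if match is None:
        raise AttributeError("No CDSID was found in the response")
    return match.group(0)
```

Relying on the presence of to_fasta rather than on a specific class is what allows a single Synthase, a SynthaseContainer, or any compatible object to be passed as the query.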

synthaser.ncbi.retrieve(cdsid, max_retries=-1, delay=20)

Poll CD-Search for results.

This function queries the NCBI for results from a CD-Search job corresponding to the supplied cdsid. If max_retries is -1, this function will check for results every delay seconds until something is returned.

If you wish to save the results of a CD-Search run to file, you can supply an open file handle via the output parameter:

>>> with open('results.tsv', 'w') as results:
...     retrieve(
...         'QM3-qcdsearch-B4BAD4B59BC5B80-3E7CFCD3F93E21D0',
...         output=results
...     )

This function returns the Response object returned by check():

>>> response = retrieve('QM3-qcdsearch-B4BAD4B59BC5B80-3E7CFCD3F93E21D0')
>>> print(response.text)
#Batch CD-search tool       NIH/NLM/NCBI
#cdsid      QM3-qcdsearch-B4BAD4B59BC5B80-3E7CFCD3F93E21D0
#datatype   hitsFull Results
#status     0
...

Parameters:
  • cdsid (str) – CD-search job ID. Looks like QM3-qcdsearch-xxxxxxxxxxx-yyyyyyyyyyy.
  • output (file pointer) – Save results to a given open file handle instead of a local file. This facilitates usage of e.g. tempfile objects.
  • max_retries (int) – Maximum number of retries for checking job completion. If -1 is given, this function will keep polling for results until something is returned.
  • delay (int) – Number of seconds to wait between each request to the NCBI. The wait time is recalculated before each request, taking into account the time taken by the previous request. By default, this is set to 20; giving a value less than 10 will result in a ValueError being thrown.
Returns:

Response returned by check()

Return type:

(requests.models.Response)

Raises:
  • ValueError – If delay is less than 10.
  • ValueError – If no Response is returned by check()
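
The polling behaviour described above (a minimum delay, an optional retry limit, and a wait adjusted for the time spent on the previous request) might look roughly like the sketch below. It is not the module's implementation: poll is a hypothetical name, the check callable is injected so the loop can be run offline, and counting max_retries as retries after the first attempt is an assumption.

```python
import time


def poll(check, cdsid, max_retries=-1, delay=20,
         sleep=time.sleep, clock=time.monotonic):
    """Call check(cdsid) until it returns something truthy.

    sleep and clock are injectable (hypothetical parameters) so the loop
    can be exercised without real waiting.
    """
    if delay < 10:
        raise ValueError("delay must be at least 10 seconds")
    attempts = 0
    while True:
        started = clock()
        result = check(cdsid)
        if result:
            return result
        attempts += 1
        # max_retries of -1 means keep polling indefinitely
        if max_retries != -1 and attempts > max_retries:
            break
        # Subtract the time the request took so requests stay `delay` apart
        elapsed = clock() - started
        sleep(max(0.0, delay - elapsed))
    raise ValueError("No results were returned within the retry limit")
```

Passing a fake check function and a no-op sleep makes it possible to test the retry logic without contacting NCBI.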

synthaser.ncbi.set_search_params(database=None, smode=None, useid1=None, compbasedadj=None, filter=None, evalue=None, maxhit=None, dmode=None)

Set CD-Search search parameters.

All search parameters are stored in SEARCH_PARAMS; this can either be edited directly, or through this function, prior to a search.

Parameters:
  • database (str) – Name of search database. Available options are ‘cdd’ (default), ‘pfam’, ‘smart’, ‘tigrfam’, ‘cog’ and ‘kog’. Only applies when smode is live.
  • smode (str) – Search mode; ‘auto’ (automatic), ‘prec’ (precalculated only) or ‘live’ (live searches).
  • useid1 (str) – Search archived sequences (‘true’ or ‘false’)
  • compbasedadj (str) – Composition-corrected scoring (‘0’ or ‘1’)
  • filter (str) – Filter out compositionally biased regions (‘true’ or ‘false’)
  • evalue (float) – E-value cutoff
  • maxhit (int) – Maximum number of hits per query
  • dmode (str) – Data mode of output (‘rep’, ‘std’, or ‘full’)

For a full description of parameters, refer to the NCBI’s documentation.
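
As an illustration of how keyword arguments could map onto SEARCH_PARAMS, here is a minimal standalone sketch. It is a hypothetical re-implementation, not the module's code: it assumes that database corresponds to the 'db' key, that None means "leave unchanged", and that values are stored as strings as in the dictionary shown earlier.

```python
# Local copy of the defaults listed in the module overview
SEARCH_PARAMS = {
    "db": "cdd",
    "smode": "auto",
    "useid1": "true",
    "compbasedadj": "1",
    "filter": "true",
    "evalue": "3.0",
    "maxhit": "500",
    "dmode": "full",
    "tdata": "hits",
}


def set_search_params(database=None, smode=None, useid1=None,
                      compbasedadj=None, filter=None, evalue=None,
                      maxhit=None, dmode=None):
    """Update SEARCH_PARAMS in place, skipping arguments left as None."""
    updates = {
        "db": database,  # assumed mapping from `database` to the 'db' key
        "smode": smode,
        "useid1": useid1,
        "compbasedadj": compbasedadj,
        "filter": filter,
        "evalue": evalue,
        "maxhit": maxhit,
        "dmode": dmode,
    }
    for key, value in updates.items():
        if value is not None:
            # Stored as strings to match the existing dictionary values
            SEARCH_PARAMS[key] = str(value)
```

A call such as set_search_params(smode="live", maxhit=100) would then change only those two entries and leave the rest of the dictionary untouched.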