Python API

Note

This section is currently incomplete. We’re working to fill out the details of the Python API as soon as possible.

Configuration

The immunedb.common.config module provides methods to initialize a connection to a new or existing database.

Most programs using ImmuneDB will start with code similar to:

import immunedb.common.config as config
parser = config.get_base_arg_parser('Some description of the program')
# ... add any additional arguments to the parser ...
args = parser.parse_args()

session = config.init_db(args.db_config)

When run, this script requires at least one argument: the path to a database configuration file (as generated with immunedb_admin). From that configuration, init_db creates a Session object connected to the associated database.
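To make the shape of such a script concrete, the sketch below uses only the standard library's argparse to imitate what the text describes: a parser with one required positional argument for the configuration path, to which the program then adds its own options. The function make_base_parser and the --min-copies option are hypothetical stand-ins for illustration; they are not part of ImmuneDB, and the real get_base_arg_parser may define further arguments.

```python
import argparse


def make_base_parser(description):
    """Hypothetical stand-in for config.get_base_arg_parser (an assumption)."""
    parser = argparse.ArgumentParser(description=description)
    # The one argument the text says every script requires.
    parser.add_argument('db_config',
                        help='path to a database configuration file')
    return parser


parser = make_base_parser('Some description of the program')
# Program-specific arguments are added before parsing (illustrative only).
parser.add_argument('--min-copies', type=int, default=1)

# Passing an explicit list instead of reading sys.argv keeps this runnable.
args = parser.parse_args(['path/to/config', '--min-copies', '5'])
print(args.db_config, args.min_copies)
```

In a real script, parse_args() would be called without arguments so that the values come from the command line.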

The path to a configuration can also be specified directly:

import immunedb.common.config as config
session = config.init_db('path/to/config')

Alternatively, a dictionary with the same information can be passed:

import immunedb.common.config as config
session = config.init_db({
    'host': '...',
    'database': '...',
    'username': '...',
    'password': '...',
})

In each case, init_db returns a Session object which can be used to interact with the database.
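The relationship between the path form and the dictionary form can be sketched as follows. The sketch assumes, without confirmation from this section, that the configuration file is JSON containing the same four keys as the dictionary. load_db_config is a hypothetical helper, not an ImmuneDB function, and it only normalizes its input rather than opening a connection.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ('host', 'database', 'username', 'password')


def load_db_config(config):
    """Return a config dict from either a dict or a path to a JSON file.

    Hypothetical helper: the JSON layout is an assumption, not ImmuneDB's
    documented file format.
    """
    if not isinstance(config, dict):
        with open(config) as fh:
            config = json.load(fh)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError('missing config keys: {}'.format(', '.join(missing)))
    return config


# Placeholder values for illustration only.
example = {'host': 'localhost', 'database': 'example_db',
           'username': 'example_user', 'password': 'example_pass'}

# Write the same information to a temporary file and load it by path.
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as fh:
    json.dump(example, fh)
    config_path = fh.name

from_dict = load_db_config(example)
from_path = load_db_config(config_path)
os.remove(config_path)
```

Both calls yield the same dictionary, which mirrors how either form is described as leading to an equivalent Session.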

Using the Session

ImmuneDB is built using SQLAlchemy as a MySQL abstraction layer. Simply put, the database is queried using Python constructs rather than hand-written SQL. Full documentation on using the session can be found in SQLAlchemy’s documentation.

Once a session is created, the models listed below can be queried.

Example Queries

Below are some example queries that demonstrate how to use the ImmuneDB API.

Clone CDR3s

Get all clones with a given V-gene and print their CDR3 AA sequences.

Input

import immunedb.common.config as config
from immunedb.common.models import Clone

session = config.init_db(...)

for clone in session.query(Clone).filter(Clone.v_gene == 'IGHV3-30'):
    print('clone {} has AAs {}'.format(clone.id, clone.cdr3_aa))

Output

clone 37884 has AAs CARGYSSSYFDYW
clone 37886 has AAs CARSRTSLSIYGVVPTGDFDSW
clone 37885 has AAs CARNGLNTVSGVVISPKYWLDPW
clone 37887 has AAs CARDLFRGVDFYYYGMDVW

Clone Frequency

Determine how many sequences appear in each sample belonging to clone 1234.

Note that the CloneStats model has one entry for each clone/sample combination, plus one entry whose sample_id field is null, which represents the clone overall.

Input

import immunedb.common.config as config
from immunedb.common.models import CloneStats

session = config.init_db(...)
for stat in session.query(CloneStats).filter(
        CloneStats.clone_id == 1234).order_by(CloneStats.sample_id):
    print('clone {} has {} unique sequences and {} copies {}'.format(
        stat.clone_id,
        stat.unique_cnt,
        stat.total_cnt,
        ('in sample ' + stat.sample.name) if stat.sample else 'overall'))

Output

clone 1234 has 53 unique sequences and 1331 copies overall
clone 1234 has 27 unique sequences and 379 copies in sample sample1
clone 1234 has 27 unique sequences and 339 copies in sample sample3
clone 1234 has 24 unique sequences and 311 copies in sample sample4
clone 1234 has 28 unique sequences and 302 copies in sample sample10
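The overall row is worth separating from the per-sample rows before doing arithmetic on them. The sketch below reuses the numbers printed above as plain tuples (no database is involved, and the tuple layout is only for illustration). In this example the per-sample copy counts add up to the overall count while the unique counts do not, presumably because the same unique sequence can occur in several samples.

```python
# (sample name or None for the overall row, unique_cnt, total_cnt),
# copied from the example output for clone 1234.
rows = [
    (None, 53, 1331),
    ('sample1', 27, 379),
    ('sample3', 27, 339),
    ('sample4', 24, 311),
    ('sample10', 28, 302),
]

# The row without a sample describes the clone as a whole.
overall = [row for row in rows if row[0] is None][0]
per_sample = [row for row in rows if row[0] is not None]

total_copies = sum(total for _, _, total in per_sample)
total_unique = sum(unique for _, unique, _ in per_sample)

# Fraction of the clone's copies found in each sample.
fractions = {name: total / float(overall[2]) for name, _, total in per_sample}

print(total_copies == overall[2])
print(total_unique, overall[1])
```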

V-gene Usage

This more complex query gathers the V-gene usage of all sequences that are (a) in the subject with ID 1, (b) associated with a clone, and (c) unique to the subject, printing the genes from least to most frequent.

Input

from sqlalchemy import func

import immunedb.common.config as config
from immunedb.common.models import Sequence, SequenceCollapse

session = config.init_db(...)

subject_unique_seqs = session.query(
    func.count(Sequence.seq_id).label('count'),
    Sequence.v_gene
).join(
    SequenceCollapse
).filter(
    Sequence.subject_id == 1,
    ~Sequence.clone_id.is_(None),
    SequenceCollapse.copy_number_in_subject > 0
).group_by(
    Sequence.v_gene
).order_by(
    'count'
)

for seq in subject_unique_seqs:
    print('{} {}'.format(seq.v_gene, seq.count))

Output

# ... output trimmed ...
IGHV4-34 1128
IGHV1-2 1160
IGHV3-48 1169
IGHV4-39 1310
IGHV3-7 1345
IGHV3-30|3-30-5|3-33 1607
IGHV3-23|3-23D 1626
IGHV3-21 1878
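For readers more comfortable with SQL, the query above corresponds roughly to a join, filter, group, and sort. The sketch below runs that shape of query against a throwaway in-memory SQLite database from the standard library. The table names, the join condition (sample ID plus auto-increment value, inferred from the SequenceCollapse fields listed under Data Models), and all rows are assumptions made up for illustration. It does not show ImmuneDB's actual schema or the SQL that SQLAlchemy emits.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE sequences (
        ai INTEGER, sample_id INTEGER, subject_id INTEGER,
        seq_id TEXT, v_gene TEXT, clone_id INTEGER
    );
    CREATE TABLE sequence_collapse (
        sample_id INTEGER, seq_ai INTEGER, copy_number_in_subject INTEGER
    );
''')
conn.executemany('INSERT INTO sequences VALUES (?, ?, ?, ?, ?, ?)', [
    (1, 1, 1, 'seq1', 'IGHV3-21', 10),
    (2, 1, 1, 'seq2', 'IGHV3-21', 10),
    (3, 1, 1, 'seq3', 'IGHV3-7', 11),
    (4, 1, 1, 'seq4', 'IGHV3-7', None),   # not in a clone: excluded
    (5, 2, 1, 'seq5', 'IGHV3-21', 10),    # copy number 0: excluded
    (6, 3, 2, 'seq6', 'IGHV4-34', 12),    # other subject: excluded
])
conn.executemany('INSERT INTO sequence_collapse VALUES (?, ?, ?)', [
    (1, 1, 3), (1, 2, 1), (1, 3, 2), (1, 4, 1), (2, 5, 0), (3, 6, 1),
])

usage = conn.execute('''
    SELECT s.v_gene, COUNT(s.seq_id) AS seq_count
    FROM sequences AS s
    JOIN sequence_collapse AS c
      ON c.sample_id = s.sample_id AND c.seq_ai = s.ai
    WHERE s.subject_id = 1
      AND s.clone_id IS NOT NULL
      AND c.copy_number_in_subject > 0
    GROUP BY s.v_gene
    ORDER BY seq_count
''').fetchall()

for v_gene, seq_count in usage:
    print(v_gene, seq_count)
```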

Data Models

class immunedb.common.models.Clone(**kwargs)

A group of sequences likely originating from the same germline

Parameters:
  • id (int) – An auto-assigned unique identifier for the clone
  • functional (bool) – If the clone is functional
  • v_gene (str) – The V-gene assigned to the sequence
  • j_gene (str) – The J-gene assigned to the sequence
  • cdr3_nt (str) – The consensus nucleotides for the clone
  • cdr3_num_nts (int) – The number of nucleotides in the group’s CDR3
  • cdr3_aa (str) – The amino-acid sequence of the group’s CDR3
  • subject_id (int) – The ID of the subject to which the clone belongs
  • subject (Relationship) – Reference to the associated Subject instance
  • germline (str) – The germline sequence for this clone
  • tree (str) – The textual representation of the clone’s lineage tree
  • parent_id (int) – The (possibly null) ID of the clone’s parent

consensus_germline

Returns the consensus germline for the clone

regions

Returns the IMGT region boundaries for the clone

class immunedb.common.models.CloneStats(**kwargs)

Stores statistics for a given clone and sample. If sample is null the statistics are for the specified clone in all samples.

Parameters:
  • clone_id (int) – The clone ID
  • clone (Relationship) – Reference to the associated Clone instance
  • functional (bool) – If the associated clone is functional. This is a denormalized field.
  • sample_id (int) – The sample ID
  • sample (Relationship) – Reference to the associated Sample instance
  • unique_cnt (int) – The number of unique sequences in the clone in the sample
  • total_cnt (int) – The number of total sequences in the clone in the sample
  • mutations (str) – A JSON stanza of mutation count information

class immunedb.common.models.DuplicateSequence(**kwargs)

A sequence which is a duplicate of a Sequence. This is used to minimize the size of the sequences table. The copy_number attribute of Sequence instances is equal to the number of its duplicate sequences plus one.
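The relationship between duplicates and copy_number can be illustrated with a small standalone sketch. It treats reads as duplicates only when their nucleotide strings are exactly identical, which is an assumption made for simplicity; ImmuneDB's actual collapsing rules are not described in this section. The reads and the dictionary layout are made up for illustration.

```python
from collections import OrderedDict

# (seq_id, nucleotide sequence) reads from one sample; made-up data.
reads = [('r1', 'ACGT'), ('r2', 'ACGT'), ('r3', 'TTGA'), ('r4', 'ACGT')]

# The first read with a given sequence is kept as the representative;
# later identical reads are recorded only as its duplicates.
collapsed = OrderedDict()
for seq_id, nts in reads:
    entry = collapsed.setdefault(nts, {'seq_id': seq_id, 'duplicates': []})
    if entry['seq_id'] != seq_id:
        entry['duplicates'].append(seq_id)

# copy_number is the number of duplicates plus one for the representative.
copy_numbers = {entry['seq_id']: len(entry['duplicates']) + 1
                for entry in collapsed.values()}
print(copy_numbers)
```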

Parameters:
  • pk (int) – A primary key for this duplicate sequence
  • seq_id (str) – A unique identifier for the sequence as output by the sequencer
  • duplicate_seq_ai (str) – The auto-increment value of the sequence in the same sample with the same sequence
  • duplicate_seq (Relationship) – Reference to the associated Sequence instance of which this is a duplicate
  • sample_id (int) – The ID of the sample from which this sequence came

class immunedb.common.models.ModificationLog(**kwargs)

A log message for a database modification

Parameters:
  • id (int) – The ID of the log message
  • datetime (datetime) – The date and time of the message
  • action_type (str) – A short string representing the action
  • info (str) – A JSON stanza with log message information

class immunedb.common.models.NoResult(**kwargs)

A sequence which could not be matched with a V or J gene.

Parameters:
  • pk (int) – A primary key for this no result
  • seq_id (str) – A unique identifier for the sequence as output by the sequencer
  • sample_id (int) – The ID of the sample from which this sequence came
  • sample (Relationship) – Reference to the associated Sample instance
  • sequence (str) – The sequence of the non-identifiable input
  • quality (str) – The quality of the non-identifiable input

class immunedb.common.models.Sample(**kwargs)

A sample of sequences.

Parameters:
  • id (int) – An auto-assigned unique identifier for the sample
  • name (str) – A unique name for the sample as defined by the experimenter
  • study_id (int) – The ID of the study under which the subject was sampled
  • study (Relationship) – Reference to the associated Study instance
  • subject_id (int) – The ID of the subject from which the sample was taken
  • subject (Relationship) – Reference to the associated Subject instance
  • v_ties_mutations (float) – Average mutation rate of sequences in the sample
  • v_ties_len (float) – Average length of sequences in the sample

class immunedb.common.models.SampleStats(**kwargs)

Aggregate statistics for a sample. This exists to reduce the time queries take for a sample.

Parameters:
  • sample_id (int) – The ID of the sample for which the statistics were generated
  • sample (Relationship) – Reference to the associated Sample instance
  • filter_type (str) – The type of filter for the statistics (e.g. functional)
  • outliers (bool) – If outliers were included in the statistics
  • full_reads (bool) – If only full reads were included in the statistics
  • v_identity_dist (str) – Distribution of V gene identity
  • v_match_dist (str) – Distribution of V gene match count
  • v_length_dist (str) – Distribution of V gene total length
  • j_match_dist (str) – Distribution of J gene match count
  • j_length_dist (str) – Distribution of J gene total length
  • v_gene_dist (str) – Distribution of V-gene assignments
  • j_gene_dist (str) – Distribution of J-gene assignments
  • copy_number_dist (str) – Distribution of copy number
  • cdr3_length_dist (str) – Distribution of CDR3 lengths
  • sequence_cnt (int) – The total number of sequences
  • in_frame_cnt (int) – The number of in-frame sequences
  • stop_cnt (int) – The number of sequences containing a stop codon
  • functional_cnt (int) – The number of functional sequences
  • no_result_cnt (int) – The number of invalid sequences

class immunedb.common.models.Sequence(**kwargs)

Represents a single unique sequence.

Parameters:
  • ai (int) – An auto-incremented value for the sequence
  • subject_id (int) – The ID of the subject for this sequence
  • seq_id (str) – A unique identifier for the sequence as output by the sequencer
  • sample_id (int) – The ID of the sample from which this sequence came
  • sample (Relationship) – Reference to the associated Sample instance
  • partial (bool) – If the sequence is a partial read
  • probable_indel_or_misalign (bool) – If the sequence likely has an indel or is a bad alignment
  • v_gene (str) – The V-gene assigned to the sequence
  • j_gene (str) – The J-gene assigned to the sequence
  • num_gaps (int) – Number of inserted gaps within the V read
  • seq_start (int) – The offset from the germline where the sequence starts
  • v_match (int) – The number of V-gene nucleotides matching the germline
  • v_length (int) – The length of the V-gene segment prior to a streak of mismatches in the CDR3
  • j_match (int) – The number of J-gene nucleotides matching the germline
  • j_length (int) – The length of the J-gene segment after a streak of mismatches in the CDR3
  • removed_prefix (str) – The sequence (if any) which was removed from the beginning of the sequence during alignment. Possibly used during indel correction
  • removed_prefix_qual (str) – The quality (if any) which was removed from the beginning of the sequence during alignment. Possibly used during indel correction
  • pre_cdr3_length (int) – The length of the V-gene prior to the CDR3
  • pre_cdr3_match (int) – The number of V-gene nucleotides matching the germline prior to the CDR3
  • post_cdr3_length (int) – The length of the J-gene after the CDR3
  • post_cdr3_match (int) – The number of J-gene nucleotides matching the germline after the CDR3
  • in_frame (bool) – If the sequence’s CDR3 has a length divisible by 3
  • functional (bool) – If the sequence is functional
  • stop (bool) – If the sequence contains a stop codon
  • copy_number (int) – Number of reads in the sample which collapsed to this sequence
  • cdr3_num_nts (int) – The number of nucleotides in the CDR3
  • cdr3_nt (str) – The nucleotides comprising the CDR3
  • cdr3_aa (str) – The amino-acids comprising the CDR3
  • sequence (str) – The (possibly-padded) sequence
  • quality (str) – Optional Phred quality score (in Sanger format) for each base in sequence
  • germline (str) – The germline sequence for this sequence
  • clone_id (int) – The clone ID to which this sequence belongs
  • clone (Relationship) – Reference to the associated Clone instance
  • mutations_from_clone (str) – A JSON stanza with mutation information

clone_sequence

Gets the sequence within the context of the associated clone by adding insertions from other sequences to this one.

get_v_extent(in_clone)

Returns the estimated V length, including the portion in the CDR3

original_quality

Returns the original quality given with the J end trimmed to the germline

original_sequence

Returns the original sequence given with the J end trimmed to the germline

regions

Returns the IMGT region boundaries for the sequence
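Two of the Sequence fields described above follow simple rules that can be shown in isolation: in_frame is described as the CDR3 length being divisible by 3, and quality holds Phred scores in Sanger format, which encodes each score as the character with code score + 33. The helper names below are hypothetical and not part of ImmuneDB. The sketch also assumes the quality string contains only Sanger-encoded characters; any padding characters that accompany a padded sequence would need separate handling.

```python
def is_in_frame(cdr3_num_nts):
    """True if the CDR3 length is divisible by 3, as in_frame is described."""
    return cdr3_num_nts % 3 == 0


def phred_scores(quality):
    """Decode a Sanger (Phred+33) quality string into integer scores."""
    return [ord(char) - 33 for char in quality]


# 'CARGYSSSYFDYW' from the clone example is 13 amino acids, i.e. 39 nt.
cdr3_num_nts = len('CARGYSSSYFDYW') * 3
print(is_in_frame(cdr3_num_nts))
print(is_in_frame(cdr3_num_nts + 1))
print(phred_scores('I5!'))
```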

class immunedb.common.models.SequenceCollapse(**kwargs)

A one-to-many table that links sequences from different samples that collapse to one another. This is used instead of a field in Sequence for performance reasons.

Parameters:
  • sample_id (int) – The ID of the sample with the sequence being collapsed
  • seq_ai (int) – The auto-increment value of the sequence being collapsed
  • clone (Relationship) – Reference to the associated Sequence instance being collapsed
  • collapse_to_subject_sample_id (int) – The ID of the sample to which the collapse-to sequence belongs
  • collapse_to_subject_seq_ai (int) – The auto-increment value of the sequence collapsing to
  • collapse_to_subject_seq_id (int) – The sequence ID of the sequence collapsing to. This is a denormalized field.
  • instances_in_subject (int) – The number of instances of the sequence in the subject
  • copy_number_in_subject (int) – The aggregate copy number of the sequence in the subject

collapse_to_seq

Returns the sequence being collapsed to

class immunedb.common.models.Study(**kwargs)

A study which aggregates related samples.

Parameters:
  • id (int) – An auto-assigned unique identifier for the study
  • name (str) – A unique name for the study
  • info (str) – Optional information about the study

class immunedb.common.models.Subject(**kwargs)

A subject which was sampled for a study.

Parameters:
  • id (int) – An auto-assigned unique identifier for the subject
  • identifier (str) – An identifier for the subject as defined by the experimenter
  • study_id (int) – The ID of the study under which the subject was sampled
  • study (Relationship) – Reference to the associated Study instance

immunedb.common.models.check_string_length(cls, key, inst)

Checks if a string can properly fit into a given field. If it is too long, a ValueError is raised. This prevents MySQL from truncating fields that are too long.
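The idea behind such a check can be sketched in plain Python. The function below is a hypothetical analogue written for illustration: it does not reproduce the signature or implementation of check_string_length, and the field name and length limit are made up. It only shows the behavior described above, raising ValueError for an over-long string rather than letting the database truncate it.

```python
def check_length(value, max_length, field_name):
    """Raise ValueError instead of letting an over-long string be truncated.

    Hypothetical plain-Python analogue; not ImmuneDB's implementation.
    """
    if value is not None and len(value) > max_length:
        raise ValueError('{} is {} characters; the limit is {}'.format(
            field_name, len(value), max_length))
    return value


# A value within the (made-up) limit passes through unchanged.
name = check_length('sample1', 128, 'name')

# A value over the limit is rejected before it could reach the database.
try:
    check_length('x' * 200, 128, 'name')
    rejected = False
except ValueError as error:
    rejected = True
    print(error)
```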