Directly Querying the Database

ImmuneDB is backed by a MySQL database that can be queried directly to gather information, bypassing the Python API.

Accessing the Database

There are many ways to access the database directly. The two introduced here are directly through MySQL or using immunedb_sql which simply wraps a call to MySQL.

Directly with MySQL

From the command line, you may access an ImmuneDB database DATABASE from user USERNAME with:

$ mysql -u USERNAME -p DATABASE

This will prompt for a password and then to the database. This method of access is useful for quickly querying the database. To save results of a query QUERY run the command:

$ mysql -u USERNAME -p DATABASE -e "QUERY" > output

Since ImmuneDB stores usernames and passwords in config files immunedb_sql is provided as a small wrapper around the mysql command. It can be invoked with:

With the immunedb_sql wrapper

$ immunedb_sql PATH_TO_CONFIG

This is entirely equivalent to using mysql and will drop to the MySQL interpreter.

Querying

The data is split into different referential tables based on the data models defined in Models.

Most tables contain data that will generally be aggregated by a query directly. For example, to determine how many sequences are in sample 10, one could simply count the rows:

> SELECT COUNT(*) FROM sequences where sample_id=10;

This query will work, but may be slow with large datasets. Two tables can assist in some computationally-expensive queries by providing pre-aggregated information. The CloneStats and SampleStats tables are pre-populated with the immunedb_clone_stats and immunedb_sample_stats pipeline commands.

This same result could be achieved more quickly with:

> SELECT sequence_cnt FROM sample_stats WHERE filter_type='all' AND
    outlier=1 AND full_reads=0 and sample_id=10;

This is more verbose but is quicker and also makes other more complex tasks simpler. For example, let’s find how many unique sequences are in sample 10 which are full reads and not outliers:

> SELECT sequence_cnt FROM sample_stats WHERE filter_type='unique' AND
    outlier=0 AND full_reads=1 and sample_id=10;

The CloneStats table also has useful information that has been pre-aggregated. For example, how many unique sequences in sample 10 are in clone 5:

> SELECT unique_cnt FROM clone_stats where sample_id=10 and clone_id=5;

Or how many total unique sequences are in clone 5 (NULL is a placeholder in the sample_id column meaning “All Samples”):

> SELECT unique_cnt FROM clone_stats where sample_id=NULL and clone_id=5;

Other Example Queries

How many clones have a CDR3 starting with `CARD`?

> SELECT COUNT(*) FROM clones WHERE clones.cdr3_aa like 'CARD%';

Get a list of non-identifiable sequences in FASTA format.

> SELECT CONCAT('>', seq_id, '\n', sequence) from noresults;

How many indels and total sequences do I have?”

> SELECT SUM(IF(probable_indel_or_misalign=1, 1, 0)) AS indels, count(*) AS
total FROM sequences;