Karenina

Simulation and modeling tools for studying Anna Karenina effects in animal microbiomes This package aims to develop tools for modeling microbiome variability in disease. Initial versions focus on simulating microbiome change over time using simple Ornstein-Uhlenbeck (OU) models.

Happy families are all alike; every unhappy family is unhappy in its own way.
Leo Tolstoy

Installation

Karenina is in active development and is available as a direct command-line program (karenina), or as a Qiime2 plugin (q2-karenina). Karenina is a python3.5+ program and requires scipy, pandas, matplotlib, and seaborn.

  1. Install ffmpeg (visualization dependency)
sudo apt-get install ffmpeg -y
  1. Install karenina
pip install git+https://github.com/zaneveld/karenina.git

Qiime2 Users

Qiime2 users will have the ability to use their Q2 Artifacts and receive output compatible with other Q2 plugins. q2-karenina requires both ffmpeg and karenina.

  1. Install q2-karenina
pip install git+https://github.com/zaneveld/q2-karenina.git
  1. Update Qiime2 cache to enable newly installed plugin.
qiime dev refresh-cache

Usage

Script Description
spatial_ornstein_uhlenbeck OU Simulation Modeling
fit_timeseries Fit PCoA Data to OU Models
karenina_visualization Visualize PCoA Timeseries
qiime karenina spatial-ornstein-uhlenbeck Q2 application of OU Modeling
qiime karenina fit-timeseries Q2 application of PCoA fitting
qiime karenina visualization Q2 application of 3D Visualization

karenina

spatial_ornstein_uhlenbeck

>spatial_ornstein_uhlenbeck -h
Usage: spatial_ornstein_uhlenbeck -o ./simulation_results

This script simulates microbiome change over time using Ornstein-Uhlenbeck
(OU) models.  These are similar to Brownian motion models, with the exception
that they include reversion to a mean. Output is a tab-delimited data table
and figures.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit

  Required options:
        -o OUTPUT, --output=OUTPUT
                                                the output folder for the simulation results

  Optional options:
        --pert_file_path=PERT_FILE_PATH
                                                file path to a perturbation file specifying parameters
                                                for the simulation results
        --treatment_names=TREATMENT_NAMES
                                                Comma seperated list of treatment named
                                                [default:control,destabilizing_treatment]
        -n N_INDIVIDUALS, --n_individuals=N_INDIVIDUALS
                                                Comma-separated number of individuals to simulate per
                                                treatment.Note: This value must be enclosed in quotes.
                                                Example: "35,35". [default: 35,35]
        -t N_TIMEPOINTS, --n_timepoints=N_TIMEPOINTS
                                                Number of timepoints to simulate. (One number, which
                                                is the same for all treatments) [default: 10]
        -p PERTURBATION_TIMEPOINT, --perturbation_timepoint=PERTURBATION_TIMEPOINT
                                                Timepoint at which to apply a perturbation. Must be
                                                less than --n_timepoints [default: 5]
        -d PERTURBATION_DURATION, --perturbation_duration=PERTURBATION_DURATION
                                                Duration that the perturbation lasts. [default: 100]
        --interindividual_variation=INTERINDIVIDUAL_VARIATION
                                                Starting variability between individuals. [default:
                                                0.01]
        --delta=DELTA       Starting delta parameter for Brownian motion and
                                                Ornstein-Uhlenbeck processes. A higher number
                                                indicates more variability over time. [default: 0.25]
        -l L, --L=L         Starting lambda parameter for Ornstein-Uhlenbeck
                                                processes. A higher number indicates a greater
                                                tendancy to revert to the mean value. [default: 0.2]
        --fixed_start_pos=FIXED_START_POS
                                                Starting x,y,z position for all points, as comma
                                                separated floating point values, e.g. 0.0,0.1,0.2. If
                                                not supplied, starting positions will be randomized
                                                based on the interindividual_variation parameter
                                                [default: none]
        -v, --verbose
                                                allows for verbose output [default: False]

The spatial_ornstein_uhlenbeck function takes in simulation parameters and develops a distance matrix, PCoA ordination, and metadata file for use in other processes. This tutorial will explore the functionality of the script, and provide sample input data for fit_timeseries and karenina_visualization.

The following parameters are used to generate a simulation using Ornstein Uhlenbeck and Brownian motion models:

-o [Direct/filepath/to/output/dir/]
--n_timepoints 50
--perturbation_timepoint 15
--n_individuals 3,3

The following parameters were used as their default input:

--treatment_names control,destabilizing_treatment
--pert_file_path os.path.abspath(resource_filename('karenina.data','set_xyz_lambda_zero.tsv'))
--perturbation_duration 100
--interindividual_variation 0.01
--delta 0.25
--L 0.20

After installing karenina, the following command is used to execute the simulation from within the output directory.

spatial_ornstein_uhlenbeck -o ./ --n_timepoints 50 --perturbation_timepoint 15 --n_individuals 3,3

This generates the following files:

  • ordination.txt : Simulation PCoA data
  • metadata.tsv : Simulation metadata file
  • log.txt : Simulation parameters
  • euclidean.txt : Euclidean distance-matrix of Simulation data
  • simulation_video.mp4 : Visualization of simulated data

These files can be viewed here:

https://github.com/zaneveld/karenina/tree/master/data/outputs/OU_simulation_50.3.3

fit_timeseries

>fit_timeseries -h
Usage: fit_timeseries -o ./simulation_results

This script fits microbiome change over time to Ornstein-Uhlenbeck (OU)
models.Demo fitting an OU model using default parameters.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit

  Required options:
        -o OUTPUT, --output=OUTPUT
                                                the output folder for the simulation results

  Optional options:
        --fit_method=FIT_METHOD
                                                Global optimization_method to use
                                                [default:basinhopping]
        --pcoa_qza=PCOA_QZA
                                                The input PCoA qza file
        --metadata=METADATA
                                                The input metadata tsv file, if not defined, metadata
                                                will be extracted from PCoA qza
        --individual=INDIVIDUAL
                                                The host subject ID column identifier, separate by
                                                comma to combine TWO columns
        --timepoint=TIMEPOINT
                                                The timepoint column identifier
        --treatment=TREATMENT
                                                The treatment column identifier
        -v, --verbose
                                                allows for verbose output [default: False]

The fit_timeseries process allows users to fit PCoA and Metadata information to Ornstein Uhlenbeck models.

Warning

This script does not allow for optimization of single-point observations, where individuals are only measured or treated at one specific timepoint. If subjects are applied treatments at a single timepoint, it may be necessary to manually modify the metadata file to apply a boolean label to “treatment x applied” or not.

Note

The optimal number of measured timepoints is roughly 50 per subject, based on benchmarking.

Benchmark Lambda Error Benchmark Sigma Error Benchmark Theta Error

Using the data from the previously ran simulation (see spatial_ornstein_uhlenbeck), the fit_timeseries process will attempt to fit Ornstein Uhlenbeck models to the individuals and treatment cohorts of the input data.

The data being used for this tutorial can be found here:

https://github.com/zaneveld/karenina/blob/master/data/fit_timeseries/simulation.qza?raw=true

Note

fit_timeseries is currently configured to utilize basinhopping as its global optimizer, and more optimization methods will be added in the future.

fit_timeseries is currently configured to only accept Qiime2 formatted data, but will accept direct PCoA and metadata files in the future.

The following parameters are defined to fit the PCoA data to OU models:

-o [Direct/filepath/to/output/dir/]
--individual Subject
--timepoint Timepoint
--treatment Treatment

After installing karenina the following command may be used to generate OU model parameter output from within the same directory as simulation.qza. The output folder in this instance is /simulation_fit_ts.

fit_timeseries -o ./simulation_fit_ts/ \
--pcoa_qza ./simulation.qza \
--individual Subject \
--timepoint Timepoint \
--treatment Treatment

Note

fit_timeseries will create and optimize models for n_timepoints and n_individuals, and thus takes quite some time on larger datasets. Because this simulation data requires optimization of over 7000 entries, we have provided the data below.

With the treatment column identified, the program will generate two output files containing metadata for individuals and cohorts. Below is a sample of the cohort output generated. With input parameters being optimized to sigma/ delta: 0.25, lambda: 0.20, theta/ mu: 0.00 (from the OU simulation), the fit_timeseries process estimates the Ornstein Uhlenbeck model parameters.

This data can be found at the following link:

https://github.com/zaneveld/karenina/tree/master/data/outputs/fit_timeseries/tx_defined

Below is a sample of the cohort output generated, which estimates the control cohort to have parameters approaching the input. The perturbation set lambda to zero, and the estimation clearly displays the parameter approaching zero in the destabilizing_treatment cohort.

Subject pc sigma lambda theta nLogLik aic
control_pc1 pc1 0.25387 0.23143 -0.0109 -190.49 -4.4992
control_pc2 pc2 0.25388 0.23144 -0.0109 -190.49 -4.4992
control_pc3 pc3 0.25388 0.23144 -0.0109 -190.49 -4.4992
destabilizing_treatment_pc1 pc1 0.24395 0.01951 -0.1203 -201.98 -4.6163
destabilizing_treatment_pc2 pc2 0.24395 0.01951 -0.1203 -201.98 -4.6163
destabilizing_treatment_pc3 pc3 0.24395 0.01951 -0.1203 -201.98 -4.6163

Because the simulation data is large, and thus takes some time to process, we can alternatively fit the timeseries from Moving Pictures. The weighted_unifrac_pcoa_results.qza file will be used.

The PCoA data can be found here:

https://github.com/zaneveld/karenina/karenina/blob/master/data/fit_timeseries/weighted_unifrac_pcoa_results.qza?raw=true

Because the Moving Pictures data contains single-point observations for their treatments, being that their treatments are applied at timepoint-0 only, fit_timeseries will fit OU models to only individuals if we do not define a treatment identifier.

After installing karenina the following command can be ran from the same directory as weighted_unifrac_pcoa_results.qza.

Note

The individual column combines the columns Subject and BodySite, this is entered as a comma-separated list.

Note

The treatments are single-point observations, so fit_timeseries will fit to only individuals if treatment-col is None.

fit_timeseries \
-o ./moving_pictures_fit_ts/
--pcoa_qza ./weighted_unifrac_pcoa_results.qza
--individual Subject,BodySite
--timepoint DaysSinceExperimentStart

The output data from this execution can be found here:

https://github.com/zaneveld/karenina/blob/master/data/outputs/fit_timeseries/tx_undefined

Subject_BodySite pc sigma lambda theta nLogLik aic
subject-1_gut pc1 0.41463 0.00360 -0.5421 -30.398 1.17122
subject-1_gut pc2 -0.4051 0.03268 0.02503 -30.398 1.17122
subject-1_gut pc3 0.42444 0.04177 -0.0019 -30.398 1.17122
subject-1_left-palm pc1 -0.4043 0.02676 0.18309 -30.398 1.17122
subject-1_left-palm pc2 -0.4021 -0.0245 -0.2053 -30.398 1.17122
subject-1_left-palm pc3 0.39130 -0.0246 0.28571 -30.398 1.17122
subject-1_right-palm pc1 -0.4045 0.03113 0.15796 -30.398 1.17122
subject-1_right-palm pc2 0.41556 0.00018 -2.5981 -30.398 1.17122
subject-1_right-palm pc3 0.41527 -0.2349 0.13882 -30.398 1.17122
subject-1_tongue pc1 -0.8731 0.04062 0.27590 -27.550 3.36793
subject-1_tongue pc2 -0.8970 0.00520 0.29044 -24.257 3.62250
subject-1_tongue pc3 -0.8792 0.03174 -0.0404 -26.767 3.42560
subject-2_gut pc1 0.40325 -0.0210 -0.5796 -30.398 1.17122
subject-2_gut pc2 -0.4024 0.00664 0.01683 -30.398 1.17122
subject-2_gut pc3 -0.4132 0.01286 0.00126 -30.398 1.17122
subject-2_left-palm pc1 0.39737 0.01596 0.18911 -30.398 1.17122
subject-2_left-palm pc2 0.41552 0.03488 0.04242 -30.398 1.17122
subject-2_left-palm pc3 -0.4025 -0.0174 0.09704 -30.398 1.17122
subject-2_right-palm pc1 -0.4023 0.00992 0.33786 -30.398 1.17122
subject-2_right-palm pc2 -0.4132 0.04034 -0.0563 -30.398 1.17122
subject-2_right-palm pc3 0.41496 0.00458 -0.0590 -30.398 1.17122
subject-2_tongue pc1 -0.4038 0.46210 0.29795 -30.398 1.17122
subject-2_tongue pc2 -0.4145 0.33915 0.21338 -30.398 1.17122
subject-2_tongue pc3 0.41374 -0.1617 0.07803 -30.398 1.17122

karenina_visualization

Warning

This function is currently experiencing some bugs, and is not recommended for use at this time. Visualizations can be generated through execution of spatial_ornstein_uhlenbeck, and is currently being addressed.

>karenina_visualization -h
Usage: karenina_visualization -o ./simulation_results

This script fits microbiome change over time to Ornstein-Uhlenbeck (OU)
models.Demo fitting an OU model using default parameters.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit

  Required options:
        -o OUTPUT, --output=OUTPUT
                                                the output folder for the simulation results

  Optional options:
        --pcoa_qza=PCOA_QZA
                                                The input PCoA qza file
        --metadata=METADATA
                                                The input metadata tsv file, if not defined, metadata
                                                will be extracted from PCoA qza
        --individual=INDIVIDUAL
                                                The host subject ID column identifier, separate by
                                                comma to combine TWO columns
        --timepoint=TIMEPOINT
                                                The timepoint column identifier
        --treatment=TREATMENT
                                                The treatment column identifier
        -v, --verbose
                                                allows for verbose output [default: False]

The karenina_visualization function allows users to visualize their PCoA results as an mp4 movie.

Using the data from the previously ran simulation (see spatial_ornstein_uhlenbeck), the karenina_visualization program will visualize the PCoA Results, and return an mp4 movie called simulation_video.mp4.

The data being used for this tutorial can be found here:

https://github.com/zaneveld/karenina/blob/master/data/fit_timeseries/simulation.qza?raw=true

The following parameters are defined to visualize the PCoA data:

-o [Direct/filepath/to/output/dir/]
--individual Subject
--timepoint Timepoint
--treatment Treatment

After installing karenina the following command may be used to generate a 3d visualization video from within the same directory as simulation.qza. The output folder in this instance is /simulation_vis.

karenina_visualization -o ./simulation_vis/ \
--pcoa_qza ./simulation.qza \
--individual Subject \
--timepoint Timepoint \
--treatment Treatment

The output video can be viewed below, and the file can be found here:

https://github.com/zaneveld/karenina/tree/master/data/outputs/visualization

q2-karenina

>qiime karenina --help
Usage: qiime karenina [OPTIONS] COMMAND [ARGS]...

Description: This script simulates microbiome change over time using
Ornstein-Uhlenbeck (OU) models.  These are similar to Brownian motion
models, with the exception that they include reversion to a mean. Output
is a tab-delimited data table and figures.

Plugin website: https://github.com/zaneveld/karenina

Getting user support: Please post to the QIIME 2 forum for help with this
plugin: https://forum.qiime2.org

Options:
        --version    Show the version and exit.
        --citations  Show citations and exit.
        --help       Show this message and exit.

Commands:
fit-timeseries              Fit OU Models to PCoA Ordination output
spatial-ornstein-uhlenbeck  Spatial Ornstein Uhlenbeck microbial community
                                                        simulation
visualization               Generates 3D animations of PCoA Timeseries

qiime karenina spatial-ornstein-uhlenbeck

Warning

This method is currently under construction and will be available for use in Qiime2 in the near future. to implement this method, please reference karenina spatial_ornstein_uhlenbeck.

>qiime karenina spatial-ornstein-uhlenbeck --help
Usage: qiime karenina spatial-ornstein-uhlenbeck [OPTIONS]

This method simulates microbial behavior over time usingOrnstein Uhlenbeck
models. This are similar to Brownian Motionwith the exception that they
include reversion to a mean.

Options:
        --p-perturbation-fp TEXT
                                                                filepath for perturbation parameters for
                                                                simulation results  [required]
        --p-treatment-names TEXT
                                                                ['control,destabilizing_treatment'] Names
                                                                for simulation treatments  [required]
        --p-n-individuals TEXT
                                                                ['35,35'] Number of individuals per
                                                                treatment  [required]
        --p-n-timepoints INTEGER
                                                                ['10'] Number of simulation timepoints
                                                                [required]
        --p-perturbation-timepoint INTEGER
                                                                ['5'] Timepoint at which to apply treatment
                                                                (<n_timepoints)  [required]
        --p-perturbation-duration INTEGER
                                                                ['100'] Duration of perturbation.
                                                                [required]
        --p-interindividual-variation FLOAT
                                                                ['0.01']Starting variability between
                                                                individuals  [required]
        --p-delta FLOAT
                                                                ['0.25'] Starting Delta parameter for
                                                                Brownian Motion/ OU models. Higher values
                                                                indicate greater variability over time
                                                                [required]
        --p-lam FLOAT
                                                                ['0.20'] Starting Lambda value for OU
                                                                process. Higher values indicate a greater
                                                                tendancy to revert to the mean value.
                                                                [required]
        --p-fixed-start-pos TEXT
                                                                Starting x,y,z position for each point. If
                                                                not defined, starting positions will be
                                                                randomized based on
                                                                interindividual_variation; type: string, eg:
                                                                ['0.0,0.1,0.2'].  [required]
        --o-ordination ARTIFACT PATH PCoAResults
                                                                Sample PCoA file containing simulation data
                                                                [required if not passing --output-dir]
        --o-distance-matrix ARTIFACT PATH DistanceMatrix
                                                                Sample Distance Matrix containing simulation
                                                                data  [required if not passing --output-dir]
        --output-dir DIRECTORY
                                                                Output unspecified results to a directory
        --cmd-config PATH
                                                                Use config file for command options
        --verbose
                                                                Display verbose output to stdout and/or
                                                                stderr during execution of this action.
                                                                [default: False]
        --quiet
                                                                Silence output if execution is successful
                                                                (silence is golden).  [default: False]
        --citations
                                                                Show citations and exit.
        --help
                                                                Show this message and exit.

Note

The tutorial for this Q2 Method is being created and will be available in the future.

qiime karenina fit-timeseries

Note

This visualization currently does not accept Qiime2 artifacts, and is currently being addressed. Please note that the input is the filepath to the PCoA Results qza file.

>qiime karenina fit-timeseries --help
Usage: qiime karenina fit-timeseries [OPTIONS]

This visualizer generates OU model parameters for PCoA outputdata, for
each individual and each defined treatment cohort.

Options:
        --p-pcoa TEXT
                                                                filepath to PCoA results  [required]
        --p-metadata TEXT
                                                                filepath to Sample metadata  [required]
        --p-method [basinhopping]
                                                                global optimization method  [required]
        --p-individual-col TEXT
                                                                individual column identifier  [required]
        --p-timepoint-col TEXT
                                                                timepoint column identifier  [required]
        --p-treatment-col TEXT
                                                                treatment column identifier  [required]
        --o-visualization VISUALIZATION PATH
                                                                [required if not passing --output-dir]
        --output-dir DIRECTORY
                                                                Output unspecified results to a directory
        --cmd-config PATH
                                                                Use config file for command options
        --verbose
                                                                Display verbose output to stdout and/or
                                                                stderr during execution of this action.
                                                                [default: False]
        --quiet
                                                                Silence output if execution is successful
                                                                (silence is golden).  [default: False]
        --citations
                                                                Show citations and exit.
        --help
                                                                Show this message and exit.

The visualization allows for the use of Qiime 2 to fit Ornstein Uhlenbeck models to PCoA data. Utilizing previous simulation data which can be found at the following link, we will walk through the use of this plugin.

Warning

This script does not allow for optimization of single-point observations, where individuals are only measured or treated at one specific timepoint. If subjects are applied treatments at a single timepoint, it may be necessary to manually modify the metadata file to apply a boolean label to “treatment x applied” or not.

Note

The optimal number of measured timepoints is roughly 50 per subject, based on benchmarking.

Benchmark Lambda Error Benchmark Sigma Error Benchmark Theta Error

The data being used for this tutorial can be found here:

https://github.com/zaneveld/karenina/blob/master/data/fit_timeseries/simulation.qza?raw=true

The following parameters are defined to fit the PCoA data to OU models:

Notice the metadata is within the qza file, so the metadata parameter is defined as None

--p-pcoa [Direct/filepath/to/pcoaResults.qza]
--p-metadata None

Note

The only optimization method employed at this time is basinhopping

--p-method basinhopping

Within the metadata file, we see that the column identifying individuals, timepoints, and treatment are Subject, Timepoint, Treatment.

We define the following parameters to match these identifiers

--p-individual-col Subject
--p-timepoint-col Timepoint
--p-treatment-col Treatment

The output directory is defined as such:

--output-dir /home/qiime2/simulation_ou_fit_ts/

After installing q2-karenina the following command may be used to generate OU model parameter output from within the same directory as simulation.qza.

qiime karenina fit-timeseries \
--p-pcoa ./simulation.qza \
--p-metadata None \
--p-method basinhopping \
--p-individual-col Subject \
--p-timepoint-col Timepoint \
--p-treatment-col Treatment \
--output-dir /home/qiime2/simulation_ou_fit_ts/

A successful visualization provides the following output:

Saved Visualization to: /home/qiime2/simulation_ou_fit_ts/visualization.qzv

Within the visualization.qzv, we have two output data files which contain our modeled timeseries results. With input parameters being optimized to sigma/ delta: 0.25, lambda: 0.20, theta/ mu: 0.00 (from the OU simulation), the fit_timeseries modeled individuals and cohorts, which can be found here:

https://github.com/zaneveld/q2-karenina/blob/master/data/simulation_ou_fit_ts

Below is a sample of the cohort output generated, which estimates the control cohort to have parameters approaching the input. The perturbation set lambda to zero, and the estimation clearly displays the parameter approaching zero in the destabilizing_treatment cohort.

Subject pc sigma lambda theta nLogLik aic
control_pc1 pc1 0.25387 0.23143 -0.0109 -190.49 -4.4992
control_pc2 pc2 0.25388 0.23144 -0.0109 -190.49 -4.4992
control_pc3 pc3 0.25388 0.23144 -0.0109 -190.49 -4.4992
destabilizing_treatment_pc1 pc1 0.24395 0.01951 -0.1203 -201.98 -4.6163
destabilizing_treatment_pc2 pc2 0.24395 0.01951 -0.1203 -201.98 -4.6163
destabilizing_treatment_pc3 pc3 0.24395 0.01951 -0.1203 -201.98 -4.6163

Because the simulation data is large, and thus takes some time to process, we can alternatively fit the timeseries from Moving Pictures. The weighted_unifract_pcoa_results.qza file will be used.

The PCoA data can be found here:

https://github.com/zaneveld/karenina/blob/master/data/fit_timeseries/weighted_unifrac_pcoa_results.qza?raw=true

Because the Moving Pictures data contains single-point observations for their treatments, being that their treatments are applied at timepoint-0 only, fit_timeseries will fit OU models to only individuals if we do not define a treatment identifier.

After installing q2-karenina the following command can be ran from the same directory as weighted_unifrac_pcoa_results.qza.

Note

The individual column combines the columns Subject and BodySite, this is entered as a comma-separated list.

Note

The treatments are single-point observations, so fit_timeseries will fit to only individuals if treatment-col is None.

qiime karenina fit-timeseries \
--p-pcoa ./weighted_unifrac_pcoa_results.qza \
--p-metadata None \
--p-method basinhopping \
--p-individual-col Subject,BodySite \
--p-timepoint-col DaysSinceExperimentStart \
--p-treatment-col None \
--output-dir ./moving_pictures_fit_ts/

A successful visualization provides the following output:

Saved Visualization to: ./moving_pictures_fit_ts/visualization.qzv

The output data from this execution can be found here:

https://github.com/zaneveld/q2-karenina/blob/master/data/moving_pictures_fit_ts

Subject_BodySite pc sigma lambda theta nLogLik aic
subject-1_gut pc1 0.41463 0.00360 -0.5421 -30.398 1.17122
subject-1_gut pc2 -0.4051 0.03268 0.02503 -30.398 1.17122
subject-1_gut pc3 0.42444 0.04177 -0.0019 -30.398 1.17122
subject-1_left-palm pc1 -0.4043 0.02676 0.18309 -30.398 1.17122
subject-1_left-palm pc2 -0.4021 -0.0245 -0.2053 -30.398 1.17122
subject-1_left-palm pc3 0.39130 -0.0246 0.28571 -30.398 1.17122
subject-1_right-palm pc1 -0.4045 0.03113 0.15796 -30.398 1.17122
subject-1_right-palm pc2 0.41556 0.00018 -2.5981 -30.398 1.17122
subject-1_right-palm pc3 0.41527 -0.2349 0.13882 -30.398 1.17122
subject-1_tongue pc1 -0.8731 0.04062 0.27590 -27.550 3.36793
subject-1_tongue pc2 -0.8970 0.00520 0.29044 -24.257 3.62250
subject-1_tongue pc3 -0.8792 0.03174 -0.0404 -26.767 3.42560
subject-2_gut pc1 0.40325 -0.0210 -0.5796 -30.398 1.17122
subject-2_gut pc2 -0.4024 0.00664 0.01683 -30.398 1.17122
subject-2_gut pc3 -0.4132 0.01286 0.00126 -30.398 1.17122
subject-2_left-palm pc1 0.39737 0.01596 0.18911 -30.398 1.17122
subject-2_left-palm pc2 0.41552 0.03488 0.04242 -30.398 1.17122
subject-2_left-palm pc3 -0.4025 -0.0174 0.09704 -30.398 1.17122
subject-2_right-palm pc1 -0.4023 0.00992 0.33786 -30.398 1.17122
subject-2_right-palm pc2 -0.4132 0.04034 -0.0563 -30.398 1.17122
subject-2_right-palm pc3 0.41496 0.00458 -0.0590 -30.398 1.17122
subject-2_tongue pc1 -0.4038 0.46210 0.29795 -30.398 1.17122
subject-2_tongue pc2 -0.4145 0.33915 0.21338 -30.398 1.17122
subject-2_tongue pc3 0.41374 -0.1617 0.07803 -30.398 1.17122

qiime karenina visualization

Warning

This visualization is currently under construction and will be available for use in Qiime2 in the near future. to implement this method, please reference karenina_visualization.

>qiime karenina visualization --help
Usage: qiime karenina visualization [OPTIONS]

This visualizer generates 3D animations of PCoA Timeseries.

Options:
        --p-pcoa TEXT
                                                                filepath to PCoA results  [required]
        --p-metadata TEXT
                                                                filepath to Sample metadata  [required]
        --p-individual-col TEXT
                                                                individual column identifier  [required]
        --p-timepoint-col TEXT
                                                                timepoint column identifier  [required]
        --p-treatment-col TEXT
                                                                treatment column identifier  [required]
        --o-visualization VISUALIZATION PATH
                                                                [required if not passing --output-dir]
        --output-dir DIRECTORY
                                                                Output unspecified results to a directory
        --cmd-config PATH
                                                                Use config file for command options
        --verbose
                                                                Display verbose output to stdout and/or
                                                                stderr during execution of this action.
                                                                [default: False]
        --quiet
                                                                Silence output if execution is successful
                                                                (silence is golden).  [default: False]
        --citations
                                                                Show citations and exit.
        --help
                                                                Show this message and exit.

Note

The tutorial for this Q2 Visualization is being created and will be available in the future.

spatial_ornstein_uhlenbeck.py

karenina.spatial_ornstein_uhlenbeck.check_perturbation_timepoint(perturbation_timepoint, n_timepoints)

Raise ValueError if perturbation_timepoint is < 0 or >n_timepoints

Parameters:
  • perturbation_timepoint – defined timepoint for perturbation application
  • n_timepoints – number of timepoints
karenina.spatial_ornstein_uhlenbeck.ensure_exists(output_dir)

Ensure that output_dir exists

Parameters:output_dir – path to output directory
karenina.spatial_ornstein_uhlenbeck.parse_perturbation_file(pert_file_path, perturbation_timepoint, perturbation_duration)

Return a list of perturbations infile – a .tsv file describing one perturbation per line assume input file is correctly formatted (no warnings if not)

NOTE: each pertubation should be in the format:

set_xyz_lambda_low = {“start”:opts.perturbation_timepoint, “end”:opts.perturbation_timepoint + opts.perturbation_duration, “params”:{“lambda”:0.005}, “update_mode”:”replace”, “axes”:[“x”,”y”,”z”]}

Parameters:
  • pert_file_path – perturbation file path
  • perturbation_timepoint – timepoint to apply perturbation
  • perturbation_duration – duration of perturbation
Returns:

perturbation list parsed from pert file contents

karenina.spatial_ornstein_uhlenbeck.write_options_to_log(log, opts)

Writes user’s input options to log file

Parameters:
  • log – log filename
  • opts – options

fit_timeseries.py

karenina.fit_timeseries.aic(n_param, nLogLik)

calculates AIC with 2 * n_parameters - 2 * LN(-1 * nLogLik)

Parameters:
  • n_param – number of parameters
  • nLogLik – negative log likelihood
Returns:

aic score

karenina.fit_timeseries.fit_cohorts(input, ind, tp, tx, method, verbose=False)

Completes the same operation as fit_input, for cohorts. Fit and minimize the timeseries for each subject and cohort defined.

Parameters:
  • input – dataframe: [#SampleID,individual,timepoint,treatment,pc1,pc2,pc3]
  • ind – subject column identifier
  • tp – timepoint column identifier
  • tx – treatment column identifier
  • method – “basinhopping” if not defined in opts
Returns:

dataframe: [ind,”pc”,”sigma”,”lambda”,”theta”,”optimizer”,”nLogLik”,”n_parameters”,”aic”,tp,tx,”x”]

karenina.fit_timeseries.fit_input(input, ind, tp, tx, method)

Fit and minimize the timeseries for each subject defined

Parameters:
  • input – dataframe: [#SampleID,individual,timepoint,treatment,pc1,pc2,pc3]
  • ind – subject column identifier
  • tp – timepoint column identifier
  • tx – treatment column identifier
  • method – “basinhopping” if not defined in opts
Returns:

dataframe: [ind,”pc”,”sigma”,”lambda”,”theta”,”optimizer”,”nLogLik”,”n_parameters”,”aic”,tp,tx,”x”]

karenina.fit_timeseries.fit_normal(data)

Return the mean and standard deviation of normal data

Parameters:data – fit data to normal distribution
Returns:mu, std, nLogLik
karenina.fit_timeseries.fit_timeseries(fn_to_optimize, x0, xmin=array([-inf, -inf, -inf]), xmax=array([inf, inf, inf]), global_optimizer='basinhopping', local_optimizer='Nelder-Mead', stepsize=0.01, niter=200, verbose=False)

Minimize a function returning input & result fn_to_optimize – the function to minimize. Must return a single value or array x.

x0 – initial parameter value or array of parameter values xmax – max parameter values (use inf for infinite) xmin – min parameter values (use -inf for infinite) global_optimizer – the global optimization method (see scipy.optimize) local_optimizer – the local optimizer (must be supported by global method)

Parameters:
  • fn_to_optimize – function that is being optimized, generated from fn_to_optimize
  • x0 – initial parameter value or array of parameter values
  • xmin – min parameter values (-inf for infinite)
  • xmax – max parameter values (inf for infinite)
  • global_optimizer – global optimization method
  • local_optimizer – local optimization method (must be supported by global)
  • stepsize – size for each step (.01)
  • niter – number of iterations (200)
  • verbose – verbose output, default = False
Returns:

global_min, f_at_global_min

karenina.fit_timeseries.gen_output(fit_ts, ind, tp, tx, method)

Generate output dataframe for either cohort, or non-cohort data

Parameters:
  • fit_ts – dataframe [0: [[Sigma, Lambda, Theta], nLogLik], 1: Individuals, 2: Times 3: Treatments, 4: Values, 5: PC Axis
  • ind – individual identifier
  • tp – timepoint identifier
  • tx – treatment identifier
  • method – optimization method
Returns:

Formatted dataframe for csv output

karenina.fit_timeseries.get_OU_nLogLik(x, times, Sigma, Lambda, Theta)

Return the negative log likelihood for an OU model given data x – an array of x values, which MUST be ordered by time times – the time values at which x was taken. MUST match x in order.

OU model parameters

  1. Sigma – estimated Sigma for OU model (extent of change over time)
  2. Lambda – estimated Lambda for OU model (tendency to return to average position)
  3. Theta – estimated Theta for OU model (average or ‘home’ position)
Parameters:
  • x – array of values ordered by time
  • times – times associated with x values
  • Sigma – initial sigma value
  • Lambda – initial lambda value
  • Theta – initial theta value
Returns:

nLogLik value

karenina.fit_timeseries.make_OU_objective_fn(x, times, verbose=False)

Make an objective function for use with basinhopping with data embedded

scipy.optimize.basinhopping needs a single array p, a function, and will minimize f(p). So we want to embded or dx data and time data in the function, and use the values of p to represent parameter values that could produce the data.

Parameters:
  • x – individual x values for objective function
  • times – timepoints associated with passed-in x values
  • verbose – verbose output, default = False
Returns:

fn_to_optimize

karenina.fit_timeseries.make_OU_objective_fn_cohort(x, times, verbose=False)

Sums nLogLik and build objective function based on cohorts. Operates in the same manner as make_OU_objective_fn, except that it considers a treatment cohort, not just the individuals.

Make an objective function for use with basinhopping with data embedded

scipy.optimize.basinhopping needs a single array p, a function, and will minimize f(p). So we want to embded or dx data and time data in the function, and use the values of p to represent parameter values that could produce the data.

Overall strategy: Treat this exactly like the per-individual fitting, BUT within the objective function loop over all individuals in a given treatment (not just one) to get nLogLiklihoods. The nLogLiklihood for the individuals in the treatment is then just the sum of the individual nLogLikelihoods for each individuals timeseries.

data – a dict of {“individual1”: (fixed_x,fixed_times)}

Parameters:
  • x – x values for cohort objective function
  • times – times associated with x values from cohort
  • verbose – verbose output, default = False
Returns:

fn_to_optimize

karenina.fit_timeseries.parse_metadata(metadata, individual, timepoint, treatment, site)

Parse relevant contents from metadata file to complete input dataframe

Parameters:
  • metadata – tsv file location
  • individual – subject column identifier
  • timepoint – timepoint column identifier
  • treatment – treatment column identifier
  • site – [subjectID, x1, x2, x3]
Returns:

input dataframe: [#SampleID,individual,timepoint,treatment,pc1,pc2,pc3]

karenina.fit_timeseries.parse_pcoa(pcoa_qza, individual, timepoint, treatment, metadata)

Load data from PCoA output in Qiime2 Format

Parameters:
  • pcoa_qza – Location of PCoA file
  • individual – Subject column identifier(s) [ex: Subject; Subject,BodySite]
  • timepoint – Timepoint column identifier
  • treatment – Treatment column identifier
  • metadata – optionally defined metadata file location, if not defined, will use metadata from PCoA.qza
Returns:

input dataframe, tsv filepath location

benchmark.py

karenina.benchmark.benchmark_simulated_dataset(n_timepoints, verbose=False, simulation_type='Ornstein-Uhlenbeck', simulation_params={'delta': 0.25, 'lambda': 0.12, 'mu': 0.5}, local_optimizer=['L-BFGS-B'], local_optimizer_niter=[5], local_optimizer_stepsize=0.005, log=[], dt=0.1)
karenina.benchmark.benchmark_simulated_datasets(max_timepoints=300, output_dir=None, simulation_type='Ornstein-Uhlenbeck', simulation_params={'delta': 0.25, 'lambda': 0.12, 'mu': 0.5}, local_optimizer=['L-BFGS-B'], niter=[5], verbose=False)

Test the ability of fit_timeseries to recover correct OU model params

Parameters:
  • output – location for output log
  • max_timepoints – maximum timepoints to test. The benchmark will test every number of timepoints up to this number. So specifying max_timpoints = 50 will test all numbers of timepoints between 2 and 50.
  • simulation_type – string describing the model for the simulation (passed to a Process object)
  • simulation_params – dict of parameters for the simulation (these will be passed on to the Process object).
  • local_optimizer – list of names of local optimizers to use
  • niter – list of niter values to try. This parameter controls the number of independent restarts for global optimizer
  • verbose – verbosity
Returns:

output dataframe of benchmarked data

karenina.benchmark.calculate_errors(observed, expected, parameter_names=None)

Return per-parameter absolute and squared errors to results

Parameters:
  • names (parameter) – list of strings describing parameter names (in order)
  • observed – an array of observed values
  • expected – an array of expected values
karenina.benchmark.graph_absolute_errors(df, output_dir, model_name_col='model')

Graph absolute error in simulated datasets with varying numbers of tested timepoints

Parameters:
  • df – Dataframe containing model, timepoints, and list of errors
  • output_dir – output directory
  • model_name_col – column of the datafram that has the name of each model/benchmark
karenina.benchmark.log_lines_from_result_dict(results)

Return a list of result lines

Parameters:results – a dict of results (all values must be able to be converted to strings)
karenina.benchmark.make_scatterplot(x_col, y_col, data, hue, palette=['firebrick'], title='', outfile='./results_scatter.png')
karenina.benchmark.write_logfile(output_fp, log_lines)

Write logfile lines to output filepath

Parameters:
  • output_fp – a filepath (as string) for the output log file
  • log_lines – lines of the logfile (linebreaks will be appended)

visualization.py

karenina.visualization.get_timeseries_data(individuals, axes=['x', 'y', 'z'])

Provides timeseries data from list of individuals.

Parameters:
  • individuals – array of individual objects
  • axes – [x,y,z] axes to return
Returns:

array of timeseries data

karenina.visualization.save_simulation_data(data, ids, output)

Saves simulation output data in PCoA format

Parameters:
  • data – data to save
  • ids – Sample_IDs
  • output – output filepath
karenina.visualization.save_simulation_figure(individuals, output_folder, n_individuals, n_timepoints, perturbation_timepoint)

Save a .pdf image of the simulated PCoA plot

Parameters:
  • individuals – array of individuals
  • output_folder – output filepath
  • n_individuals – number of individuals
  • n_timepoints – number of timepoints
  • perturbation_timepoint – timepoint of perturbation application
karenina.visualization.save_simulation_movie(individuals, output_folder, n_individuals, n_timepoints, black_background=True, data_o=False, verbose=False)

Save an .ffmpg move of the simulated community change

Parameters:
  • individuals – array of individuals to visualize
  • output_folder – output directory filepath
  • n_individuals – number of individuals
  • n_timepoints – number of timepoints
  • black_background – T/F, default = True
  • verbose – verbose output, default = False
karenina.visualization.update_3d_plot(end_t, timeseries_data, ax, lines, points=None, start_t=0)

Updates visualization 3d plot

Parameters:
  • end_t – end timepoint
  • timeseries_data – data from timeseries
  • ax – visualization ax
  • lines – lines of data
  • points – values to update
  • start_t – start timepoint (0)

experiment.py

class karenina.experiment.Experiment(treatment_names, n_individuals, n_timepoints, individual_base_params, treatment_params, interindividual_variation, verbose)

Bases: object

This class has responsibility for simulating an experimental design for simulation.

A fixed number of individuals are simulated across experimental conditions called ‘treatments’ Each treatment can have different numbers of individuals. All treatment must have the same number of timepoints.

Each treatment can be associated with the imposition of one or more Perturbations. Each perturbation is inserted or removed from all individuals at a fixed time-point.

check_n_timepoints_is_int(n_timepoints)

Raise a ValueError if n_timepoints can’t be cast as an int

Parameters:n_timepoints – number of timepoints
check_variable_specified_per_treatment(v, verbose=False)

Raise a ValueError if v is not the same length as the number of treatments

Parameters:
  • v – variable
  • verbose – verbose output, default = False
q2_data()

generate output data from object’s self for Qiime2

Returns:data, ids
run()

Run the experiment, simulating timesteps; saves the simulation figure and movie

simulate_timestep(t)

Simulate timestep t of the experiemnt

Approach:

  1. First apply any perturbations that should be active but aren’t yet
  2. Second, simulate the timestep
  3. Third, remove any perturbations that should be off

NOTE: this means that a perturbation that starts and ends at t=1 will be active at t=1 That is, the start and end times are inclusive.

Parameters:t – timestep
simulate_timesteps(t_start, t_end, verbose=False)

Simulate multiple timesteps

Parameters:
  • t_start – start timepoint
  • t_end – end timepoint
  • verbose – verbose output, default = False
write_to_movie_file(output_folder, verbose=False)

Write an MPG movie to output folder

Parameters:
  • output_folder – output directory
  • verbose – verbose output, default = False

individual.py

class karenina.individual.Individual(subject_id, coords=['x', 'y', 'z'], metadata={}, params={}, interindividual_variation=0.01, verbose=False)

Bases: object

Generates an individual for OU simulation

apply_perturbation(perturbation)

Apply a perturbation to the appropriate axes

Parameters:perturbation – perturbation to apply
apply_perturbation_to_axis(axis, perturbation)

Apply a perturbation to a Processes objects

Parameters:
  • axis – Axis to apply perturbation to
  • perturbation – perturbation to apply
check_identity(verbose=False)

Check identity of movement process

Parameters:verbose – verbose output, default = False
Returns:True if processes are equivalent, False if not
get_data(n_timepoints)

get data from movement processes

Parameters:n_timepoints – number of timepoints to gather data for
Returns:data for timepoints
remove_perturbation(perturbation)

Remove a perturbation from the appropriate axes

Parameters:perturbation – perturbation to remove
remove_perturbation_from_axis(axis, perturbation)

Remove a perturbation from one or more Process objects

Parameters:
  • axis – axis to remove perturbation from
  • perturbation – perturbation to remove from axis
simulate_movement(n_timepoints, params=None)

Simulate movement over timepoints

Parameters:
  • n_timepoints – number of timepoints to simulate
  • params – parameters to change from baseparams
karenina.individual.random() → x in the interval [0, 1).

perturbation.py

class karenina.perturbation.Perturbation(start, end, params, update_mode='replace', axes=['x', 'y', 'z'])

Bases: object

Alter a simulation to impose a press disturbance, shifting the microbiome torwards a new configuration

start –inclusive timepoint to start perturbation. Note that this is read at the Experiment level, not by underlying Process objects.

end – inclusive timepoint to end perturbation. Note that this is read at the Experiment level, not by underlying Process objects.

params – dict of parameter values altered by the disturbance

  • ‘mode’ – how the perturbation updates parameter values.
  • ‘replace’ – replace old value with new one
  • ‘add’ – add new value to old one
  • ‘multiply’ – multiply the two values

axes – axes to which the perturbation applies. Like Start and End this is a ‘dumb’ value, read externally by the Experiment object

is_active(t)

determines if timepoint is active

Parameters:t – timepoint
Returns:True if timepoint is active, False if not
update_by_addition(curr_param, perturbation_param)

Update parameters by addition

Parameters:
  • curr_param – current parameter
  • perturbation_param – new parameter
Returns:

curr_param + perturbation_param

update_by_multiplication(curr_param, perturbation_param)

Update parameters by multiplication

Parameters:
  • curr_param – current parameter
  • perturbation_param – new parameter
Returns:

curr_param * perturbation_param

update_by_replacement(curr_param, perturbation_param)

Update parameters by replacement

Parameters:
  • curr_param – current parameter
  • perturbation_param – new parameter
Returns:

perturbation_param

update_params(params)

Update baseparams

Parameters:params – params to update
Returns:new_params

process.py

class karenina.process.Process(start_coord, motion='Ornstein-Uhlenbeck', history=None, params={'L': 0.2, 'delta': 0.25})

Bases: object

Represents a 1d process in a Euclidean space

bm_change(dt, delta)

Change the Brownian motion process

Parameters:
  • dt – time elapsed since last update
  • delta – delta value to adjust
Returns:

change

bm_update(dt, delta)

Update the Brownian motion process

Parameters:
  • dt – time elapsed since last update
  • delta – delta value to adjust
Returns:

change

ou_change(dt, mu, L, delta)

Change the Ornstein Uhlenbeck motion process

The Ornstein Uhlenbeck process is modelled as:

ds = lambda * (mu - s) * dt + dW

ds – change in our process from the last timepoint L – lambda, the speed of reversion to mean (NOTE: lambda is a reserved keyword in Python so I use L) mu – mean position s – current position dt – how much time has elapsed since last update dW – the Weiner Process (basic Brownian motion)

This says we update as usual for Brownian motion, but add in a term that reverts us to some mean position (mu) over time (dt) at some speed (lambda)

Parameters:
  • dt – time elapsed since last update
  • delta – delta value to adjust
Returns:

change in process since last timepoint

ou_update(dt, mu, L, delta, min_bound=-1.0, max_bound=1.0)

Update the Brownian motion process

Parameters:
  • dt – time elapsed since last update
  • mu – mean position
  • L – speed of reversion to the mean
  • delta – variance from the mean
  • min_bound – minimum bound
  • max_bound – maximum bound
update(dt)

Update the process

Parameters:dt – time elapsed since last update