
Our Technology

Medicine wants to move away from a ‘one-size-fits-all’ approach, and instead use the vast amounts of information that can now be recorded from patients to tailor treatments optimally: ‘the right drug at the right time for the right patient’. This is the agenda of personalized precision medicine. To achieve it, collecting data is not enough. One needs quantitative algorithms that can predict, from the data of individual patients, how their disease would evolve under each possible treatment regime. This requires extracting patterns from the data that link what is measured now to clinical states in the future, and that allow for counterfactual reasoning (what would happen if …). Moreover, our algorithms must be interpretable, i.e. able at all times to explain why they arrive at their conclusions; ‘black box’ decision-making is not ethically acceptable in medicine.

 

The main barrier to extracting predictive patterns from medical data is the need to distinguish real patterns from flukes, i.e. ‘signal’ from ‘noise’. The danger of an inference algorithm mistaking noise for signal, known as ‘overfitting’, becomes more acute as more measurements per patient are included, or when the number of patients in a data set is small; this is the regime of high-dimensional data. The conventional statistical methods used in medicine, although fully interpretable, were not designed for high-dimensional inference. Standard (deep learning) AI methods, with their very many parameters, also struggle with high-dimensional data, and in addition tend to be black boxes. As a consequence, data are often not used effectively: either many measurements are simply discarded at the outset, or predictions based on all measurements later turn out to be non-reproducible.
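
The effect is easy to demonstrate with a minimal, hedged example (generic open-source tools, not our technology): when the number of covariates approaches the number of patients, a standard model fits pure noise almost perfectly, yet predicts no better than chance on new patients.

```python
# Toy illustration (generic tools, not our pipelines): with almost as many
# covariates as patients, a standard model fits pure noise nearly perfectly,
# yet predicts no better than chance on new patients.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_patients, n_covariates = 100, 80  # high-dimensional regime: p close to n

# Pure noise: the covariates carry NO information about the outcome.
X_train = rng.normal(size=(n_patients, n_covariates))
y_train = rng.integers(0, 2, size=n_patients)
X_new = rng.normal(size=(n_patients, n_covariates))
y_new = rng.integers(0, 2, size=n_patients)

# Essentially unregularized logistic regression (very large C)
model = LogisticRegression(C=1e6, max_iter=10000).fit(X_train, y_train)
print("apparent (in-sample) accuracy:", model.score(X_train, y_train))  # close to 1.0
print("accuracy on new patients:     ", model.score(X_new, y_new))      # close to 0.5
```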


Our approach to reliable medical inference and prediction from high-dimensional data is to develop new mathematical and statistical models and algorithms, in collaboration with medical professionals and academia, designed to meet the challenges of modern data-driven personalized medicine. They combine the precision and interpretability of Bayesian approaches with mathematical tools from theoretical physics that make Bayesian computations feasible (a toy illustration of this saddle-point idea follows the list below), and they focus on:

 

  • Undoing the effects of overfitting in clinical outcome prediction, for high-dimensional signals or small data sets.

  • Decontaminating inferences for the effects of disease interactions.

  • Inferring responder subgroups in clinical trials.

  • Identifying optimal personalized treatments and drug doses.
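
The saddle-point idea referenced above can be made concrete with a deliberately simple example (illustrative only, not our actual methodology): Laplace’s method, the most basic saddle-point technique from physics, replaces an intractable Bayesian evidence integral by an expansion around the mode of the integrand. For the Beta-Bernoulli model below the exact evidence is known, so the approximation can be checked directly.

```python
# Hedged toy example: Laplace's method (the simplest saddle-point
# approximation) makes a Bayesian evidence integral tractable. The model is
# deliberately simple so that the exact answer is available for comparison.
import numpy as np
from scipy.special import betaln

n, k = 50, 20  # n patients, k events, uniform prior on the event rate theta

# Exact log-evidence: integral of theta^k (1-theta)^(n-k) over [0,1],
# which equals the Beta function B(k+1, n-k+1).
log_Z_exact = betaln(k + 1, n - k + 1)

# Saddle point (mode of the integrand) and the curvature there
theta_star = k / n
log_f_star = k * np.log(theta_star) + (n - k) * np.log(1 - theta_star)
curvature = n**3 / (k * (n - k))  # minus the second derivative of log f at the mode

# Laplace approximation: Z ~ f(theta*) * sqrt(2*pi / curvature)
log_Z_laplace = log_f_star + 0.5 * np.log(2 * np.pi / curvature)

print(f"exact   log-evidence: {log_Z_exact:.4f}")    # about -35.42
print(f"Laplace log-evidence: {log_Z_laplace:.4f}")  # about -35.40
```

The same kind of principle, in far more elaborate form, underlies the Bayesian computations in our pipelines.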


Below we give a more in-depth description of the technology behind our two main inference pipelines, SaddlePoint Signature and SaddlePoint Mosaics. Both have by now been used in multiple medical studies and publications.

SaddlePoint Signature

This fully automated pipeline computes reliable risk and treatment response signatures for high-dimensional data, i.e. in the overfitting regime. It can handle multiple clinical outcome types: time-to-event outcomes, ordinal class outcomes, and real-valued clinical scales. It suppresses overfitting effects by combining iterative removal of irrelevant covariates based on probabilistic arguments (as opposed to z-scores) with bias removal and optimal regularization based on a statistical mechanical theory of overfitting in Generalized Linear Models (GLMs). This theory, developed by Saddle Point in collaboration with academic partners (see the publications list), is based on the so-called replica method. The benefit of this strategy over more standard regularization approaches to overfitting (as used in other statistical software packages) is that no data need to be sacrificed to optimize the hyperparameters: they are either not needed (in bias removal mode) or computed analytically (in regularization mode).
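
The replica-based theory itself does not fit in a code snippet, but the bias that it corrects is easy to exhibit. The hedged sketch below only demonstrates the bias; it does not implement our correction formulae. In the regime p/n = 0.2, maximum-likelihood logistic regression systematically overestimates the true association strengths by a roughly constant factor:

```python
# Demonstration of overfitting-induced bias in a GLM (illustrative only;
# this exhibits the bias that SaddlePoint Signature removes, it does not
# implement the replica-based correction itself).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p = 2000, 400  # p/n = 0.2: overfitting regime

# True association strengths, scaled to a fixed overall signal strength
beta_true = rng.normal(size=p)
beta_true *= np.sqrt(5.0) / np.linalg.norm(beta_true)

X = rng.normal(size=(n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

# Essentially unregularized maximum-likelihood fit (very large C)
beta_ml = LogisticRegression(C=1e6, max_iter=10000).fit(X, y).coef_.ravel()

# Projection of fitted onto true coefficients: 1.0 for unbiased estimates,
# but noticeably larger here (around 1.5 in this regime).
inflation = (beta_ml @ beta_true) / (beta_true @ beta_true)
print(f"coefficient inflation factor: {inflation:.2f}")
```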

 

Other features include: 


  • Handling of informative covariate missingness.

  • Inclusion of covariate interactions if required.

  • Shadow analysis (analysis replication with outcome-randomized data, to exclude false positive inferences; a minimal sketch of this idea follows the list).

  • Generation of simulated clinical data.

  • Scripting.

  • Validation of predictions on further data sets.

  • Fully automated report generation.
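
The shadow analysis listed above can be sketched as follows (illustrative only, not our implementation): the analysis is repeated on copies of the data in which the outcomes have been randomly permuted, so that any apparent signal surviving the randomization is, by construction, a false positive.

```python
# Minimal sketch of the shadow-analysis idea (not our implementation):
# repeat the analysis on outcome-randomized copies of the data; any apparent
# signal that survives the randomization must be a false positive.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def shadow_analysis(X, y, n_shadows=20, seed=0):
    rng = np.random.default_rng(seed)
    model = LogisticRegression(max_iter=2000)
    real = cross_val_score(model, X, y, cv=5).mean()
    shadows = [cross_val_score(model, X, rng.permutation(y), cv=5).mean()
               for _ in range(n_shadows)]
    print(f"real cross-validated accuracy: {real:.3f}")
    print(f"shadow accuracy (mean +/- sd): "
          f"{np.mean(shadows):.3f} +/- {np.std(shadows):.3f}")

# On pure-noise data the real score falls inside the shadow band,
# correctly flagging any would-be signature as a false positive.
rng = np.random.default_rng(1)
shadow_analysis(rng.normal(size=(120, 40)), rng.integers(0, 2, size=120))
```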


SaddlePoint Mosaics

While SaddlePoint Signature constructs treatment response scores if required, these are limited to scenarios where the available covariates carry (possibly complex) information on the likelihood of an individual’s treatment response. The Bayesian pipeline SaddlePoint Mosaics goes further: it also detects responder subgroups in cases where the measured covariates are not informative. It uses Bayesian model selection to infer, more generally (a toy analogue is sketched after the list below):


  • The number of statistically significant patient subgroups in a given clinical data set.

  • Whether these subgroups differ in frailties, risk associations, or base hazard rates.

  • The sizes and quantitative characteristics of each subgroup.

  • The probabilities for individual patients in the data set to belong to each of the detected subgroups.

  • Whether and how subgroup membership can be predicted a priori from the covariates.
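
The toy analogue promised above (illustrative only: SaddlePoint Mosaics performs Bayesian model selection on clinical outcome models, not on a simple Gaussian mixture) conveys the flavour of this subgroup inference: the number of hidden subgroups, their sizes, and per-patient membership probabilities are all recovered from unlabeled data.

```python
# Toy analogue of subgroup inference (illustrative only; SaddlePoint Mosaics
# uses Bayesian model selection on clinical outcome models, not a Gaussian
# mixture). Two hidden patient subgroups are recovered from unlabeled data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
group_a = rng.normal(loc=0.0, scale=1.0, size=(150, 2))  # e.g. non-responders
group_b = rng.normal(loc=3.0, scale=1.0, size=(50, 2))   # e.g. responders
X = np.vstack([group_a, group_b])

# Model selection: fit mixtures with 1..5 components and keep the best BIC
# score (a crude stand-in for the Bayesian evidence used in Mosaics).
fits = [GaussianMixture(n_components=c, random_state=0).fit(X) for c in range(1, 6)]
best = min(fits, key=lambda m: m.bic(X))

print("inferred number of subgroups:", best.n_components)        # expect 2
print("inferred subgroup sizes:", np.bincount(best.predict(X)))  # roughly 150 and 50
print("membership probabilities, first patient:", best.predict_proba(X[:1]).round(3))
```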


If (some of) the subgroups differ in their association with the treatment variable, this amounts to responder identification. SaddlePoint Mosaics thus informs the user of (i) the extent to which a cohort is stratifiable and (ii) the characteristics of the strata, and (iii) provides a rational tool for finding stratifying biomarkers (variables that correlate with the reported class membership probabilities). The pipeline can handle multiple risks, decontaminate survival predictions for competing-risk effects, and generate fully automated analysis reports.
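
The competing-risk decontamination mentioned above addresses a well-known pitfall: treating a competing event (e.g. death from another cause) as mere censoring inflates the estimated incidence of the event of interest. A self-contained numerical sketch of the phenomenon (illustrative only; not the Mosaics implementation), contrasting the naive one-minus-Kaplan-Meier estimate with the Aalen-Johansen cumulative incidence:

```python
# Why competing risks must be handled explicitly (illustrative sketch, not
# the Mosaics implementation): with two competing events of equal rate, at
# most half of the patients can ever experience the event of interest, yet
# the naive estimator that censors competing events reports far more.
import numpy as np

rng = np.random.default_rng(3)
n = 20000
t_interest = rng.exponential(1.0, size=n)  # event of interest, rate 1
t_compete = rng.exponential(1.0, size=n)   # competing event, rate 1
time = np.minimum(t_interest, t_compete)
cause = np.where(t_interest <= t_compete, 1, 2)  # which event occurred first

order = np.argsort(time)
time, cause = time[order], cause[order]
at_risk = np.arange(n, 0, -1)  # patients still at risk at each sorted event time

# Naive: Kaplan-Meier that wrongly treats competing events as censored
cif_naive = 1.0 - np.cumprod(1.0 - (cause == 1) / at_risk)

# Aalen-Johansen: overall survival weighted by the cause-1 hazard increments
s_all = np.cumprod(1.0 - 1.0 / at_risk)      # every time is an event of some cause
s_prev = np.concatenate(([1.0], s_all[:-1]))  # left-continuous version
cif_aj = np.cumsum(s_prev * (cause == 1) / at_risk)

upto = time <= 3.0
print(f"naive 1-KM incidence at t=3:     {cif_naive[upto][-1]:.3f}")  # ~0.95 (wrong)
print(f"Aalen-Johansen incidence at t=3: {cif_aj[upto][-1]:.3f}")     # ~0.50 (right)
```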

Are you interested in working with one or both of these programs?
