This opportunity is not published. No applications will be accepted.

Integration of Change Point Detection Algorithms with Spline-Based Smoothers for Drift Correction of Metabolomics Data

Metabolomics is the study of small molecules in various tissues such as blood and urine. Applications of metabolomics include monitoring of clinical trials and drug and biomarker discovery. In a typical metabolomics experiment, samples are placed in numbered wells on plates and processed by a mass spectrometer well by well, plate by plate. The resulting data can thus be interpreted as a high-dimensional time series. Metabolomic measurements are prone to batch effects, instrumental drifts, and abrupt jumps, which need to be removed in a pre-processing step. Change (or break) point detection is concerned with localizing abrupt distributional changes in time series. We propose to estimate drifts and jumps simultaneously using change point detection.

Keywords: change point detection, metabolomics

Description
Metabolomics is the study of small molecules in a variety of tissues such as blood and urine. Applications of metabolomics include monitoring of clinical trials and drug and biomarker discovery, among others (Wishart 2016). An improved ability to detect low concentrations of small molecules in a variety of tissues and faster processing times, coupled with cost-efficient metabolite extraction techniques, have revolutionized metabolomics, which is now a fast-growing field. When analyzed correctly, metabolomic data enable phenotyping and the identification of biomarkers years prior to the onset of disease.

In a typical metabolomics experiment, samples are placed in numbered wells on plates, which are then processed by a mass spectrometer well by well, plate by plate. The analysis order is also called the run order. One can interpret the resulting data as a high-dimensional time series, with the sample order as time and the relative intensity of individual metabolites per sample as values. Large metabolomic studies often involve the analysis of thousands of tissue samples, which can take multiple days to weeks to process. In such large-scale studies, metabolomic measurements are prone to batch effects and instrumental drifts (Figure) (Dunn et al. 2011, Watrous et al. 2019). Drifts may result in an increase or decrease in intensity with respect to run order. Additionally, adjustments to the instrument by technicians introduce abrupt jumps, mostly occurring between plates. These drifts and jumps, which are purely artifacts of the measurement, need to be removed in a pre-processing step.
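
To make the time-series view concrete, the following is a minimal simulation sketch of such a run, not real data. Everything specific in it is an assumption made for illustration: the function name `simulate_run`, 96 wells per plate, a sinusoidal drift shape, one random offset per plate and metabolite (so a jump at every plate boundary, although in practice not every plate change need coincide with a jump), and Gaussian noise.

```python
import numpy as np


def simulate_run(n_plates=4, wells_per_plate=96, n_metabolites=3, noise_sd=0.1, seed=0):
    """Simulate intensities indexed by run order (hypothetical toy model)."""
    rng = np.random.default_rng(seed)
    n = n_plates * wells_per_plate
    run_order = np.arange(n)               # "time" axis of the series
    plate = run_order // wells_per_plate   # plate index of every sample

    # Smooth drift: one slow sinusoid per metabolite with a random amplitude.
    amplitude = rng.uniform(0.2, 0.5, size=n_metabolites)
    drift = amplitude * np.sin(2 * np.pi * run_order[:, None] / n)

    # Abrupt jumps: one offset per plate and metabolite, constant within a plate.
    offsets = rng.normal(scale=1.0, size=(n_plates, n_metabolites))
    jumps = offsets[plate]

    noise = rng.normal(scale=noise_sd, size=(n, n_metabolites))
    return run_order, plate, drift + jumps + noise


run_order, plate, intensities = simulate_run()
print(intensities.shape)                      # samples x metabolites
print(np.flatnonzero(np.diff(plate)) + 1)     # run-order positions of plate changes
```

Each column of `intensities` is one univariate series of the kind a drift-correction method would have to clean.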

A number of signal-correction methods (also called normalization methods) can be used during pre-processing. The simplest are batch-correction methods such as mean centering and median scaling. “ComBat” (Johnson et al. 2007), a normalization method commonly used in omics experiments, does not correct for drifts. Dunn et al. (2011) fit low-order nonlinear locally estimated scatterplot smoothing (LOESS) curves to the quality control data with respect to the order of injection (QC-RLSC). However, none of these methods has been tested in rigorous simulations and, most importantly, none of them uses change point detection. We have developed an algorithm that corrects for batch effects and drifts using splines and tests for autocorrelation (https://f1000research.com/posters/9-746).

Change (or break) point detection is concerned with localizing abrupt distributional changes in ordered observations (Killick et al. 2012, Truong et al. 2020). The classical scenario of changes in the mean of a univariate Gaussian variable has been thoroughly studied. In that scenario, the data are i.i.d. Gaussian with constant mean and variance between breakpoints. We propose to relax this assumption and model the mean (corresponding to the drift) of the Gaussian random variables with splines.
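
One possible way to write down the proposed model for a single metabolite is given below. The notation is an assumption introduced here and not taken from the project description: y_t is the intensity at run-order position t, the tau_k are the K unknown change points, f_k is a spline on the k-th segment, and a common noise variance is used for simplicity. The snippet needs the amsmath package.

```latex
\[
  y_t = f_k(t) + \varepsilon_t,
  \qquad \tau_{k-1} < t \le \tau_k, \quad k = 1, \dots, K + 1,
\]
\[
  \varepsilon_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2),
  \qquad 0 = \tau_0 < \tau_1 < \dots < \tau_K < \tau_{K+1} = n .
\]
```

The classical change-in-mean scenario is the special case in which every f_k is a constant, f_k(t) = \mu_k.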

Work Plan:

1. Read introductions to metabolomics and change point detection.

2. Develop a prototype based on splines and binary segmentation or dynamic programming for drift correction of a single metabolite. Find an adequate stopping criterion, possibly motivated by a BIC-like criterion. Analyze the performance on a simulated dataset.

3. Dynamic programming requires a number of fits that is quadratic in the number of observations. Killick et al. (2012) reduced this to linear time in the average case. For changes in the mean, fits of neighboring segments can be recovered with O(1) updates. See whether a similar approach is possible for splines.

4. The resulting method should be able to correct drifts for all metabolites separately in a reasonable time.

5. Analyze the result and compare it to existing methods in simulated environments and on real-world metabolomic data.

6. If time allows: structural breaks often co-occur for multiple metabolites. See whether changes can be estimated for all metabolites simultaneously. This probably requires modifying the stopping criterion. Apply typical analyses/estimations done following the drift correction on metabolomics data. See whether our change point detection approach improves their performance.
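
Step 2 of the work plan could be prototyped roughly as follows. This is a sketch under stated assumptions, not the project's method: the segment model is a cubic regression spline with a single knot at the segment midpoint (fitted by least squares), the noise variance is estimated from first differences, and the stopping rule compares the reduction in residual sum of squares with a BIC-like penalty. The names `spline_rss` and `binary_segmentation`, the minimum segment length, and the simulated series are all hypothetical.

```python
import numpy as np

N_PARAMS = 5  # cubic polynomial (4 coefficients) + one truncated-cubic knot term


def spline_rss(y):
    """Residual sum of squares of a cubic regression spline with one midpoint knot."""
    u = np.linspace(0.0, 1.0, len(y))
    basis = np.column_stack(
        [np.ones_like(u), u, u**2, u**3, np.clip(u - 0.5, 0.0, None) ** 3]
    )
    coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
    resid = y - basis @ coef
    return float(resid @ resid)


def binary_segmentation(y, penalty, min_size=20):
    """Split recursively while the best split lowers the RSS by more than `penalty`."""
    change_points = []
    stack = [(0, len(y))]
    while stack:
        start, end = stack.pop()
        if end - start < 2 * min_size:
            continue  # too short to hold two admissible segments
        rss_whole = spline_rss(y[start:end])
        candidates = range(start + min_size, end - min_size + 1)
        gains = [
            rss_whole - spline_rss(y[start:t]) - spline_rss(y[t:end])
            for t in candidates
        ]
        best = int(np.argmax(gains))
        if gains[best] > penalty:  # stopping criterion
            split = candidates[best]
            change_points.append(split)
            stack += [(start, split), (split, end)]
    return sorted(change_points)


# Simulated single metabolite: smooth drift, one jump at position 100, Gaussian noise.
rng = np.random.default_rng(0)
n = 200
t = np.arange(n)
y = 0.5 * (t / n) ** 2 + np.where(t >= 100, 3.0, 0.0) + rng.normal(scale=0.1, size=n)

# Robust noise-variance estimate from first differences (MAD based); an assumption.
d = np.diff(y)
sigma2 = (np.median(np.abs(d - np.median(d))) / 0.6745) ** 2 / 2.0

# BIC-like penalty: (parameters per segment + 1 change point) * log(n) * sigma^2.
penalty = sigma2 * (N_PARAMS + 1) * np.log(n)
detected = binary_segmentation(y, penalty)
print(detected)
```

Every candidate split triggers two fresh least-squares fits here, which is the cost that step 3 asks to reduce.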

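For step 3, the O(1) update in the change-in-mean case can be illustrated with cumulative sums: once the running sums of y and y² are stored, the residual sum of squares of any segment follows from two differences. The sketch below combines this with the dynamic program and the pruning rule of Killick et al. (2012). It covers piecewise constant means only, not splines. The function names and the hand-picked penalty are assumptions made for this toy example.

```python
import numpy as np


def make_mean_cost(y):
    """Return cost(s, e): RSS of y[s:e] around its own mean, evaluated in O(1)."""
    csum = np.concatenate([[0.0], np.cumsum(y)])
    csum2 = np.concatenate([[0.0], np.cumsum(y**2)])

    def cost(s, e):
        total = csum[e] - csum[s]
        return (csum2[e] - csum2[s]) - total * total / (e - s)

    return cost


def pelt_mean(y, penalty):
    """Penalized optimal partitioning for changes in the mean, with pruning."""
    n = len(y)
    cost = make_mean_cost(y)
    best = np.full(n + 1, np.inf)   # best[e]: optimal penalized cost of y[:e]
    best[0] = -penalty
    last = np.zeros(n + 1, dtype=int)  # last[e]: start of the final segment of y[:e]
    candidates = [0]
    for e in range(1, n + 1):
        values = [best[s] + cost(s, e) + penalty for s in candidates]
        i = int(np.argmin(values))
        best[e] = values[i]
        last[e] = candidates[i]
        # Pruning: a start s with best[s] + cost(s, e) > best[e] can never be optimal later.
        candidates = [s for s, v in zip(candidates, values) if v - penalty <= best[e]]
        candidates.append(e)
    # Backtrack through the stored segment starts.
    change_points = []
    e = n
    while last[e] > 0:
        change_points.append(int(last[e]))
        e = last[e]
    return sorted(change_points)


rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 0.1, 50), rng.normal(2.0, 0.1, 50)])
penalty = 1.0  # hand-picked for this toy example
print(pelt_mean(y, penalty))
```

For splines, an analogous shortcut would have to update the least-squares fit of a segment as its end points move, which is the open question posed in step 3.
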
There will be the possibility to write a scientific publication about the project outcome in place of the thesis.
Prerequisites:
1. An interest in biological applications and willingness to learn about and adapt to metabolomics data analysis pipelines and approaches
2. Knowledge of statistical methods
3. Experience in Python and/or R programming
4. Interest in developing novel methodology independently and implementing ideas in algorithms.

References:

Dunn, W. B., D. Broadhurst, P. Begley, E. Zelena, S. Francis-McIntyre, N. Anderson, M. Brown, J. D. Knowles, A. Halsall and J. N. Haselden (2011). "Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry." Nature Protocols 6(7): 1060-1083.

Johnson, W. E., C. Li and A. Rabinovic (2007). "Adjusting batch effects in microarray expression data using empirical Bayes methods." Biostatistics 8(1): 118-127.

Killick, R., P. Fearnhead and I. A. Eckley (2012). "Optimal detection of changepoints with a linear computational cost." Journal of the American Statistical Association 107(500): 1590-1598.

Truong, C., L. Oudre and N. Vayatis (2020). "Selective review of offline change point detection methods." Signal Processing 167: 107299.

Watrous, J. D., T. J. Niiranen, K. A. Lagerborg, M. Henglin, Y. J. Xu, J. Rong, S. Sharma, R. S. Vasan, M. G. Larson, A. Armando, S. Mora, O. Quehenberger, E. A. Dennis, S. Cheng and M. Jain (2019). "Directed Non-targeted Mass Spectrometry and Chemical Networking for Discovery of Eicosanoids and Related Oxylipins." Cell Chem Biol 26(3): 433-442 e434.

Wishart, D. S. (2016). "Emerging applications of metabolomics in drug discovery and precision medicine." Nature Reviews Drug Discovery 15(7): 473.
Goal
The goal of this project is to apply change point detection to metabolomics data to simultaneously estimate jump locations as change points and model drifts between the jumps with splines. This would allow for simple pre-processing in settings where natural change locations are unknown and in scenarios where jumps occur within a plate. Also, as we do not assume a priori that each plate change coincides with a jump, the resulting model will have fewer degrees of freedom and should thus better preserve the underlying signal.
Contact Details
malte.londschien@ai.ethz.ch
odemler@bwh.harvard.edu

Calendar

Earliest start	No date
Latest end	No date

Location

ETH Competence Center - ETH AI Center (ETHZ)

Labels

Master Thesis

Topics

Mathematical Sciences
Information, Computing and Communication Sciences
Biology

Documents

Name	Comment	Size	Actions
MA_Metabolomics_Change_Point_Detection.pdf		519KB	Download