UNIverse - Public Research Portal
Prof. Dr.
Volker Roth
Department of Mathematics and Computer Science
My Research Interests

My main research interests include machine learning, statistical models for data analysis and biomedical applications.



Selected Publications
Arend Torres, Fabricio, Negri, Marcello Massimo, Inversi, Marco, Aellen, Jonathan, & Roth, Volker. (2024, May 7). Lagrangian Flow Networks for Conservation Laws. The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=Nshk5YpdWE
Negri, Marcello Massimo, Arend Torres, Fabricio, & Roth, Volker. (2023). Conditional Matrix Flows for Gaussian Graphical Models. Advances in Neural Information Processing Systems, 36, 25095–25111. https://proceedings.neurips.cc/paper_files/paper/2023/file/4eef8829319316d0b552328715c836c3-Paper-Conference.pdf
Bachmann, Nadine, von Siebenthal, Chantal, Vongrad, Valentina, Turk, Teja, Neumann, Kathrin, Beerenwinkel, Niko, Bogojeska, Jasmina, Fellay, Jacques, Roth, Volker, Kok, Yik Lim, Thorball, Christian W., Borghesi, Alessandro, Parbhoo, Sonali, Wieser, Mario, Böni, Jürg, Perreau, Matthieu, Klimkait, Thomas, Yerly, Sabine, Battegay, Manuel, et al. (2019). Determinants of HIV-1 reservoir size and long-term dynamics during suppressive ART. Nature Communications, 10. https://doi.org/10.1038/s41467-019-10884-9
Wu, Mike, Hughes, Michael C., Parbhoo, Sonali, Zazzi, Maurizio, Roth, Volker, & Doshi-Velez, Finale. (2018, January 1). Beyond Sparsity: Tree Regularization of Deep Models for Interpretability. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16285
Prabhakaran, Sandhya, Rey, Melanie, Zagordi, Osvaldo, Beerenwinkel, Niko, & Roth, Volker. (2014). HIV Haplotype Inference Using a Propagating Dirichlet Process Mixture Model. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(1), 182–191. https://doi.org/ea017b9f-ec09-4077-85ab-ded3af538c48
Selected Projects & Collaborations
weObserve: Integrating Citizen Observers and High Throughput Sensing Devices for Big Data Collection, Integration, and Analysis
Research Project  | 6 Project Members
Even though hypothesis-driven research is the fundamental core of scientific advance, for certain phenomena of the real world scientific progress depends on monitoring and subsequent analysis. Specialized high-throughput sensor devices can be used in controlled lab environments to provide very large data collections. These collections can be analysed to arrive at new findings by verifying or falsifying concrete hypotheses. However, the majority of scientific domains are more complex and cannot rely on such rather simple data-gathering and processing pipelines: first, the phenomena to be monitored in the real world are complex in their spatial and temporal dynamics, and confounding factors are typically of multi-causal origin. Thus, the phenomena cannot be isolated, nor can natural environments be rebuilt in controlled lab environments, which was one of the lessons learned from the Biosphere II programme. Monitoring in the field is essential, but it can hardly be done with conventional sensor technology alone at landscape scale, since there are always technical trade-offs between spatial resolution, coverage, temporal resolution and interpretability. Furthermore, even if the resolution and coverage of sensors are satisfactory, it is not obvious where and when to deploy these high-precision/high-throughput sensors to capture relevant phenomena. In the weObserve project, we will rely on citizen observers to provide semantically rich information directly from the field to complement and enrich existing sensor data. In addition, monitoring data from citizen observers will be used to anticipate where interesting phenomena are expected to take place, to bring societal knowledge and judgement to bear on the relevance of phenomena, and to deploy high-resolution sensing devices in these areas.
From a technical point of view, weObserve will address the collection, integration, and processing of heterogeneous data in applications which cannot rely on off-the-shelf sensing devices for monitoring purposes. Heterogeneity includes the volume of data provided via different channels, the precision, the coverage in time and location, and also the predictability. Data collection will therefore seamlessly combine several types of data sources: (i) high-throughput sensing devices which produce very large volumes of data and cover large areas, but with rather low resolution, (ii) citizen observers who provide semantically rich data, but with varying levels of precision and substantial sampling bias, and (iii) specific high-resolution sensing devices that need to be manually deployed. Data integration will deal with such heterogeneous data. For subsequent analysis, the origin (provenance) and uncertainty of individual data items need to be kept in an integrated data set. Data analysis will detect hidden patterns and the main explanatory factors in data collections. Integration of domain knowledge into the analysis process will be essential for detecting sampling biases and confounding factors. Specific emphasis needs to be put on analysis models that can deal with multiple data queues varying in size, reliability, representation and resolution. A further important aspect concerns the visualization of detected patterns as a means for improving communication between those using the data for scientific purposes and those collecting it. WeObserve will design, implement, integrate, and evaluate the individual parts of the data collection/integration/analysis pipeline in two selected applications, namely (i) monitoring of soil degradation and landslides, and (ii) monitoring of bird migration, with complementary requirements and different ways to gather data.
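The integrated data set described above has to carry provenance and uncertainty with every individual item. A minimal sketch of what such a record could look like (all names and fields are hypothetical illustrations, not the project's actual schema):

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class SourceType(Enum):
    """The three data channels named in the project description."""
    HIGH_THROUGHPUT_SENSOR = "high_throughput_sensor"
    CITIZEN_OBSERVER = "citizen_observer"
    HIGH_RESOLUTION_SENSOR = "high_resolution_sensor"

@dataclass
class Observation:
    """One item in the integrated data set, keeping provenance and uncertainty."""
    value: float            # the measured or reported quantity
    latitude: float
    longitude: float
    timestamp: datetime
    source: SourceType      # provenance: which channel produced this item
    uncertainty: float      # e.g. a standard deviation; larger for citizen reports

def weight(obs: Observation) -> float:
    """Inverse-variance weight, one common way analysis can down-weight
    less precise sources without discarding them."""
    return 1.0 / (obs.uncertainty ** 2)
```

Keeping the source tag and an uncertainty estimate per item is what lets a downstream model treat a noisy citizen report and a calibrated sensor reading in one framework rather than in separate pipelines.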
Computer-aided Methods for Diagnosis and Early Risk Assessment for Parkinson's Disease Dementia
Research Project  | 4 Project Members
Neurodegenerative disorders begin insidiously in midlife and are relentlessly progressive. Currently, there exists no established curative or protective treatment, and they constitute a major and increasing health problem and, in consequence, an economic burden in aging populations globally. Parkinson's disease (PD) is, after Alzheimer's disease (AD), the second most common neurodegenerative disorder worldwide, estimated to occur in approximately 1% of the population above 60 and in at least 3% of individuals above 80 years of age. In Switzerland, about 15'000 persons are diagnosed with PD. In addition to motor signs, which due to recent medical progress can be treated satisfactorily in most cases, non-motor symptoms and signs severely affect the well-being of patients. They include mood disorders, psychosis, cognitive decline, disorders of circadian rhythms, as well as vegetative and cardiovascular dysregulation. Neurodegeneration in PD progresses for years before clinical diagnosis is possible, at which point e.g. 80% of dopaminergic neurons in the substantia nigra are already lost. Therefore, any clinical strategy targeting disease modification, prognosis and personalized treatment, including guiding the indication for deep brain stimulation (DBS), requires reliable and valid biomarkers. The main goal of this research project is the identification of a pertinent set of genetic and neurophysiological markers for diagnosis and early risk assessment of PD-dementia. Our approach has a distinct interdisciplinary basis, in that it fosters close collaborations between physicians, neuroscientists, psychiatrists, psychologists, computer scientists and statisticians. Based on current research findings, we postulate that a combination of (1) quantitative electroencephalographic measures (QEEG, e.g. frequency power, connectivity patterns and network analysis), (2) genetic biomarkers (e.g. MAPT, COMT, GBA, APOE) and (3) neuropsychological assessment improves early recognition and monitoring of cognitive decline in PD. To test this hypothesis, this project proposes an interdisciplinary long-term study of patients diagnosed with PD without signs of dementia, among them a subgroup of patients undergoing DBS. The workup of the proposed study includes collection of clinical, neuropsychological, neurophysiological and genotyping data at baseline, as well as at 3-, 4- and 5-year follow-ups. Sophisticated statistical models that can deal with noisy measurements, missing values and heterogeneous data types will be used to extract the best combination of biomarkers and neuropsychological variables for diagnosis and prediction of prognosis of PD-dementia. Besides this clinical perspective, this project further aims at deciphering the unknown disease mechanisms in PD on both a genetic and a neurophysiological level, with particular emphasis on the interplay of genetic markers and temporal changes in the functional connectivity of the brain.
Copula Distributions in Machine Learning: Models, Inference and Applications
Research Project  | 1 Project Members
In recent years, copula models have become popular tools for modeling multivariate data. The underlying idea is to separate the "pure" dependency between random variables from the influence of the marginals. The main focus of research, however, has been on parametric (and most often bivariate) copulas in econometric applications, and only recently have "truly" multivariate copula constructions been considered. Finding principled ways of building these constructions, however, is commonly considered a hard problem. The machine learning field was largely unaffected by these developments, despite the fact that inferring the dependency structure in high-dimensional data is one of the most fundamental problems in machine learning. On the other hand, machine learners have developed a rich repertoire of methods for structure learning, and exactly these methods have the potential to make copula constructions useful in real-world settings with noisy and partially missing observations. It is thus not surprising that there is a steadily increasing number of machine learning publications which aim at using structure learning methods for copula-based inference. In general, however, the use of copulas in machine learning has been restricted to density estimation problems based either on Gaussian copulas or on aggregating standard bivariate pair copulas, while other directions such as clustering, multi-view learning, compression and dynamical models have not been explored in this context. In this proposal we will try to close this gap by focusing on clustering, on the connection to information theory (which will also include a connection to dynamical systems), and on finding new ways of using non-parametric pair copulas.
From an application point of view, the study of such models is interesting because inferring hidden structure is presumably one of the most successful applications of machine learning methods in domains generating massive and noise-affected data volumes, such as molecular biology. The precise separation of "dependency" and "marginals" in copula models bears the potential to overcome limitations of current techniques, be it too restrictive distributional assumptions or model-selection problems. Our proposal is divided into four work packages which address the following questions: (i) How can copulas be linked to information theory, and what consequences will this link have for modeling dynamical systems? On the application side, these questions are motivated by studying time-resolved gene expression data and node-ranking problems in gene networks. (ii) How can copulas be used for deriving flexible cluster models that can handle arbitrary marginals and are robust to noise, missing values and outliers? The motivation comes from a mixed continuous/discrete dataset containing multi-channel EEG recordings and clinical measurements. (iii) How can we use copulas for simultaneously learning network structures and detecting "key modules" in these networks given external relevance information? Our interest in this question comes from analyzing gene expression data and the aim to detect subnetworks based on clinical variables. (iv) How can we use empirical pair copulas in network learning? The motivation stems from our experience that the problem of selecting "suitable" pair copulas has no obvious solution in practice. In the proposed project, we plan to answer these questions, thereby pushing the state of the art in machine learning problems involving copula distributions.
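The separation of dependency and marginals is formalized by Sklar's theorem: any joint CDF factors as F(x1, ..., xd) = C(F1(x1), ..., Fd(xd)), where C is the copula and the Fi are the marginals. A minimal stdlib-only illustration of this idea with a bivariate Gaussian copula; the exponential and uniform marginals are arbitrary choices made purely for the example:

```python
import math
import random
from statistics import NormalDist

def sample_gaussian_copula(rho: float, n: int, seed: int = 0):
    """Draw n pairs whose dependence is a Gaussian copula with correlation
    rho, but whose marginals are exponential(1) and uniform(0,1)."""
    rng = random.Random(seed)
    phi = NormalDist()  # standard normal; defines the copula
    samples = []
    for _ in range(n):
        # Step 1: correlated standard normals -- the "pure" dependency.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        # Step 2: map to uniforms via the normal CDF: u_i = Phi(z_i).
        u1, u2 = phi.cdf(z1), phi.cdf(z2)
        # Step 3: apply any inverse marginal CDFs -- the marginals are
        # chosen independently of the dependency structure.
        x1 = -math.log(1 - u1)   # exponential(1) quantile function
        x2 = u2                  # uniform(0,1) marginal
        samples.append((x1, x2))
    return samples
```

Changing the marginals in step 3 leaves the dependency (step 1) untouched, and vice versa, which is exactly the modeling flexibility the proposal exploits.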
HIV-1 whole-genome quasispecies analysis in longitudinal clinical samples
Research Project  | 4 Project Members
Genetic diversity is a hallmark of pathogen populations, associated with disease progression, immune escape, and drug resistance. We propose to use next-generation sequencing (NGS) for the analysis of viral populations within infected patients to improve viral diagnostics and treatment decisions by reconstructing the entire population structure, including low-frequency mutants and co-occurring mutations. Based on the achievements in the predecessor project (CR32I2_127017), we will develop experimental protocols and computational methods for the analysis of NGS data obtained from intra-patient HIV-1 populations. Our goal is to infer the haplotype sequences and their frequencies over the full length of the 9.2 kb HIV genome. Since the limited read length of current NGS platforms turned out to be the main bottleneck in this endeavor, we will develop two major extensions of our software tools "PredictHaplo" and "QuasiRecomb" to address this limitation. Firstly, we will use Illumina's paired-end option to generate read pairs of 2x250bp length covering the HIV genome. These data are informative about long-range phasing of single-nucleotide variants (SNVs) and will be integrated into probabilistic global haplotype reconstruction using either soft constraints on SNV linkage (PredictHaplo) or silent delete states for the insert (QuasiRecomb). Secondly, we will explore Pacific Biosciences' PacBio RS technology as an alternative long-read sequencing platform with an average read length of 1,500 bp. Using these improved experimental and computational tools, we will analyze and interpret genetic diversity in the context of HIV-1 drug resistance. 
Specifically, over 100 samples from 40 patients will be analyzed for pre- versus post-treatment changes of viral populations, for low-frequency drug resistant variants and their role in treatment failure, for linkage among drug resistance mutations and evolutionary escape pathways, for recombinants, and for viral phenotypes such as drug resistance and co-receptor usage. Our full-length haplotype approach provides, for the first time, a complete picture of the virus population and will yield new insights into drug resistance development.
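As a drastically simplified illustration of the haplotype reconstruction problem (the probabilistic machinery in PredictHaplo and QuasiRecomb is far richer, with error models and soft linkage constraints), overlapping reads can be greedily clustered by their SNV alleles, with haplotype frequencies estimated from cluster sizes; the greedy rule and all names below are illustrative only:

```python
# A read is modeled as {snv_position: allele}; real reads cover only a window.
Read = dict

def compatible(read: Read, haplotype: dict, max_mismatch: int = 0) -> bool:
    """A read fits a haplotype if the alleles agree at all shared SNV positions
    (up to max_mismatch, which a real method would set from the error rate)."""
    shared = set(read) & set(haplotype)
    mismatches = sum(read[p] != haplotype[p] for p in shared)
    return bool(shared) and mismatches <= max_mismatch

def greedy_haplotypes(reads: list) -> list:
    """Greedily grow haplotype clusters from partially overlapping reads,
    returning (haplotype, relative frequency) pairs."""
    clusters = []  # each entry: [haplotype dict, member count]
    for read in reads:
        for cluster in clusters:
            if compatible(read, cluster[0]):
                cluster[0].update(read)  # extend haplotype with new positions
                cluster[1] += 1
                break
        else:
            clusters.append([dict(read), 1])
    total = sum(count for _, count in clusters)
    return [(hap, count / total) for hap, count in clusters]
```

The sketch makes the role of read length concrete: longer reads (or paired ends) share more SNV positions with each growing haplotype, so linkage between distant variants can be resolved instead of guessed.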
HIV-1 whole-genome quasispecies analysis by ultra-deep sequencing and computational haplotype inference to determine the mechanisms of drug resistance development
Research Project  | 2 Project Members
For many biological species, the traditional view of "one organism, one genome" is insufficient to explain their behavior. The genetic diversity within these species plays an important role for their survival in a given environment. In HIV-1 infection, a diverse population of viruses is maintained in each individual host, facilitating the evolutionary escape of HIV-1 from the host's immune response. The high genetic heterogeneity of HIV-1 quasispecies constitutes a major obstacle in the development of an effective vaccine and it limits therapeutic options due to drug resistant mutants. In recent years, the next generation of high-throughput DNA sequencing technologies has been introduced. Because of their high coverage based on sequencing many short reads in parallel, this approach is referred to as ultra-deep sequencing. The new sequencing platforms have the potential to resolve the genetic variation in a sample at unprecedented detail by direct sequencing of the mixture of clones. While traditional Sanger sequencing can only infer a consensus sequence of a sample, ultra-deep sequencing generates a very large number of individual reads from the population of interest. Beyond the standard usage of deep sequencing, we propose a combined approach of ultra-deep sequencing and computational modeling to infer the genetic diversity of intra-host HIV-1 populations. In this joint effort between physician scientists, molecular biologists, computational biologists, and computer scientists, our main goal is to reconstruct the individual full-length haplotypes of HIV-1 populations derived from infected patients and to characterize these quasispecies with respect to interactions among mutations and among individual variants in the context of drug resistance development.
To achieve this goal, we will (1) develop and optimize an experimental methodology for ultra-deep sequencing of full-length HIV-1 virus populations, (2) devise computational and statistical methods for haplotype reconstruction from a set of short, error-prone, observed sequence reads, and (3) analyze 100 pre- and post-treatment patient samples with this approach in order to determine the mechanisms driving the evolution of drug resistance. The comprehensiveness of this study, owing to its whole-genome and population-wide scope, is a unique feature that is possible only with such an interdisciplinary approach. The proposed investigations will increase our understanding of the complex mechanisms of viral escape from the pressure of antiretroviral drugs. The establishment of the new experimental and computational methodologies will be widely applicable to HIV-1 infected patients in Switzerland and beyond. They will further increase the scientific value of the Swiss HIV Cohort Study, a large long-term Swiss collaboration, and may considerably improve the quality of patient care in the future.