Copula Distributions in Machine Learning: Models, Inference and Applications

Research Project | 01.07.2013 - 30.06.2016
In recent years, copula models have become popular tools for modeling multivariate data. The underlying idea is to separate the "pure" dependency between random variables from the influence of the marginals. So far, however, research has focused mainly on parametric (and most often bivariate) copulas in econometric applications, and "truly" multivariate copula constructions have been considered only recently. Finding principled ways of building these constructions is commonly considered a hard problem. The machine learning field has been largely unaffected by these developments, even though inferring the dependency structure in high-dimensional data is one of the most fundamental problems in machine learning. On the other hand, machine learners have developed a rich repertoire of methods for structure learning, and exactly these methods have the potential to make copula constructions useful in real-world settings with noisy and partially missing observations. It is thus not surprising that a steadily growing number of machine learning publications aim to use structure learning methods for copula-based inference. In general, however, the use of copulas in machine learning has been restricted to density estimation problems based either on Gaussian copulas or on aggregating standard bivariate pair copulas, while other directions such as clustering, multi-view learning, compression and dynamical models have not been explored in this context. In this proposal we will try to close this gap by focusing on clustering, on the connection to information theory (which will also include a connection to dynamical systems), and on finding new ways of using non-parametric pair copulas.
From an application point of view, the study of such models is interesting because inferring hidden structure is presumably one of the most successful applications of machine learning methods in domains that generate massive and noisy data volumes, such as molecular biology. The precise separation of "dependency" and "marginals" in copula models has the potential to overcome limitations of current techniques, whether overly restrictive distributional assumptions or model selection problems. Our proposal is divided into four work packages, which address the following questions: (i) How can copulas be linked to information theory, and what consequences will this link have for modeling dynamical systems? On the application side, these questions are motivated by the study of time-resolved gene expression data and node-ranking problems in gene networks. (ii) How can copulas be used to derive flexible cluster models that can model arbitrary marginals and are robust to noise, missing values and outliers? The motivation comes from a mixed continuous/discrete dataset containing multi-channel EEG recordings and clinical measurements. (iii) How can we use copulas for simultaneously learning network structures and detecting "key modules" in these networks, given external relevance information? Our interest in this question comes from analyzing gene expression data with the aim of detecting subnetworks based on clinical variables. (iv) How can we use empirical pair copulas in network learning? The motivation stems from our experience that the problem of selecting "suitable" pair copulas has no obvious solution in practice. In the proposed project, we plan to answer these questions, thereby advancing the state of the art in machine learning problems involving copula distributions.
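The separation of "dependency" and "marginals" described above can be illustrated with a small numerical sketch. It is not part of the project and does not reproduce any of its methods. It only assumes the standard rank-based (empirical) transform to pseudo-observations; the function name `pseudo_observations`, the simulated data and the chosen transforms are hypothetical and used purely for illustration.

```python
import numpy as np


def pseudo_observations(x):
    """Map each column to the open interval (0, 1) via scaled ranks.

    Illustrative helper (hypothetical name); assumes continuous data without ties.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    # Double argsort yields 0-based ranks per column; shift to 1..n.
    ranks = x.argsort(axis=0).argsort(axis=0) + 1
    return ranks / (n + 1.0)


rng = np.random.default_rng(0)

# Simulated bivariate data with correlated Gaussian marginals (illustration only).
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=500)

# Change the marginals with strictly increasing maps: exp and cube.
y = np.column_stack([np.exp(z[:, 0]), z[:, 1] ** 3])

u_z = pseudo_observations(z)
u_y = pseudo_observations(y)

# The rank-based pseudo-observations ignore the marginals, so both data
# sets carry the same empirical dependency information.
same_dependence = np.allclose(u_z, u_y)
print("identical pseudo-observations:", same_dependence)
```

Because strictly increasing transformations leave the ranks unchanged, the two data sets differ only in their marginals and share the same empirical dependency structure, which is the property that copula models exploit.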
Members (1)
Volker Roth
Principal Investigator