UNIverse - Public Research Portal

Dr. Marco Vogt

Department of Mathematics and Computer Science

Selected Publications

Vogt, Marco. (2022). Adaptive Management of Multimodel Data and Heterogeneous Workloads [Dissertation]. https://doi.org/10.5451/unibas-ep90279

Vogt, Marco, Lengweiler, David, Geissmann, Isabel, Hansen, Nils, Hennemann, Marc, Mendelin, Cédric, Philipp, Sebastian, & Schuldt, Heiko. (2021). Polystore Systems and DBMSs: Love Marriage or Marriage of Convenience? Lecture Notes in Computer Science, 12921, 65–69. https://doi.org/10.1007/978-3-030-93663-1_6

Vogt, Marco, Hansen, Nils, Schönholz, Jan, Lengweiler, David, Geissmann, Isabel, Philipp, Sebastian, Stiemer, Alexander, & Schuldt, Heiko. (2020). Polypheny-DB: Towards Bridging the Gap Between Polystores and HTAP Systems. In Gadepally, Vijay; Mattson, Timothy; Stonebraker, Michael; Kraska, Tim; Wang, Fusheng; Luo, Gang; Kong, Jun; Dubovitskaya, Alevtina (eds.), Lecture Notes in Computer Science. Springer. https://doi.org/10.1007/978-3-030-71055-2_2

Selected Projects & Collaborations

Polypheny-DDI

Research Project | 3 Project Members

In recent years, data-driven research has established itself as the fourth pillar in the spectrum of scientific methods, alongside theory, empirical research, and computer-based simulation. In various scientific disciplines, increasingly large amounts of data, both structured and unstructured, are being generated, or existing data collections that were originally isolated from each other are being linked in order to gain new insights. The process of generating knowledge from raw data is called Data Science or Data Analytics. The entire data analytics pipeline is quite complex, and most work focuses on the actual data analysis (using machine learning or statistical methods) while largely neglecting the other elements of the pipeline. This is particularly the case for all aspects related to data management, storage, processing, and retrieval, even though these challenges play an essential role.

A Distributed Data Infrastructure (DDI) supports the large variety of data management features demanded by the data analytics pipeline. However, DDIs are usually very heterogeneous in terms of data models, access characteristics, and performance expectations. In addition, DDIs for integrating, continuously updating, and querying data from various heterogeneous applications need to overcome the inherent structural heterogeneity and fragmentation. Recently, polystore databases have gained attention because they help overcome these limitations by allowing data to be stored in one system, yet in different formats and data models, and by offering one joint query language. In past work, we have developed Polypheny-DB, a distributed polystore that integrates several different data models and heterogeneous data stores. Polypheny-DB goes beyond most existing polystores and even supports data access with mixed workloads (e.g., OLTP and OLAP).

However, polystores are limited to rather simple object models, static data, and exact queries. When individual data items follow a complex inherent structure and consist of several heterogeneous parts between which dedicated constraints exist, when access goes beyond exact Boolean queries, when data is not static but continuously produced, and/or when objects need to be preserved in multiple versions, polystores quickly reach their limits. At the same time, these are typical requirements for data management within a data analytics pipeline. Examples are scientific instruments that continuously produce new data as data streams; social network analysis that requires support for complex object models including structured and unstructured content; data produced by imaging devices that requires sophisticated similarity search support; or frequently changing objects that are subject to time-dependent analyses.

The objective of the Polypheny-DDI project is to seamlessly combine the functionality of a polystore database with that of a distributed data infrastructure to meet the requirements of data science applications. It will focus on i.) supporting complex composite object models and enforcing constraints between their constituent parts; ii.) supporting similarity search in multimedia content; and iii.) supporting continuous data streams and temporal/multiversion data.
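The joint-query-interface idea at the heart of a polystore can be made concrete with a small sketch. The following Python snippet is a minimal, hypothetical model of routing queries over one entry point to heterogeneous backend stores; all class, method, and entity names are invented for illustration and do not reflect Polypheny-DB's actual interfaces.

# Minimal, hypothetical sketch of the polystore idea described above: one
# joint query entry point that routes each request to the heterogeneous
# store holding the data. All names are illustrative; this is NOT
# Polypheny-DB's actual API.

from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class Polystore:
    """Routes each logical entity to the backend store that holds it."""
    stores: Dict[str, Callable[[str], List[Any]]] = field(default_factory=dict)
    placement: Dict[str, str] = field(default_factory=dict)  # entity -> store name

    def register(self, name: str, execute: Callable[[str], List[Any]]) -> None:
        self.stores[name] = execute

    def place(self, entity: str, store: str) -> None:
        self.placement[entity] = store

    def query(self, entity: str, predicate: str) -> List[Any]:
        # One joint query interface; the engine decides which backend answers.
        backend = self.placement[entity]
        return self.stores[backend](f"{entity} WHERE {predicate}")

# Usage: relational rows and documents behind a single interface.
ps = Polystore()
ps.register("rowstore", lambda q: [f"rowstore answered: {q}"])
ps.register("docstore", lambda q: [f"docstore answered: {q}"])
ps.place("experiments", "rowstore")
ps.place("annotations", "docstore")
print(ps.query("experiments", "temperature > 300"))
print(ps.query("annotations", "text CONTAINS 'polystore'"))

A real polystore must additionally translate the joint query language into each backend's native language and combine partial results; the sketch only shows the placement-and-routing skeleton that the limitations discussed above (complex objects, similarity search, streams) would extend.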

Polypheny-DB: Cost- and Workload-aware Adaptive Data Management

Research Project | 3 Project Members

In the last few years, it has become obvious that the "one-size-fits-all" paradigm, according to which database systems have been designed for several decades, has come to an end. The reason is that in the very broad spectrum of applications, ranging from business over science to the private life of individuals, demands are increasingly heterogeneous. As a consequence, the data to be considered differs significantly in many ways: from immutable data to data that is frequently updated; from highly structured to unstructured data; from applications that need precise data to applications that can work with approximate and/or outdated results; from applications that demand always-consistent data to applications that tolerate lower levels of consistency. Even worse, many applications feature heterogeneous data and/or workloads, i.e., they intrinsically come with data (sub-collections) with different requirements for storage, access, consistency, etc. that have to be dealt with in the same system.

This development can be addressed either with a large zoo of specialized systems (one for each subset of data with different properties) or with a new type of flexible database system that automatically adapts to the needs and characteristics of applications, even when these change dynamically. Such behaviour is especially important in the Cloud, where several applications need to be hosted on a shared infrastructure in order to make economic use of the available resources. In addition, the infrastructure costs incurred in a Cloud are made transparent in a very fine-grained way.

The Polypheny-DB project will address these challenges by dynamically optimizing the data management layer, taking into account the resources needed and an estimation of the expected workload of applications. The core will be a comprehensive cost model that seamlessly addresses these criteria and that will be used, in two subprojects, to optimize i.) data storage and access, and ii.) data distribution. Each of the two subprojects will be addressed by one PhD student.

Data Storage and Access: different physical storage media such as spinning discs, flash storage, or main memory come with different properties and performance guarantees for data access, but also differ significantly in terms of the cost of the necessary hardware and the volume of data that can be managed at a certain price. In addition, the access characteristics of applications (e.g., read-only, mostly reads, balanced read/write ratio, high update frequencies) may favour one solution over another or even require hybrid solutions. The same applies to the data model to be used and the choice between structured, semi-structured, or schema-less data management. Similarly, applications might require data to be approximated to speed up queries or to be compressed to save storage space. In Polypheny-DB, we will jointly address these choices and alternatives by devising a comprehensive cost model for data storage that makes it possible to decide, if necessary at the level of sub-collections, how data can best be stored given the available budget and the requirements for data access. The costs for data storage and access will be continuously monitored, together with an analysis of the expected workload of applications, and the storage strategy will be updated dynamically if necessary.

Data Distribution: a high degree of data availability necessitates that data is replicated across different sites in a distributed system. For read-only applications, this also makes it possible to balance the (read) load across sites. However, in the presence of updates, additional costs for distributed transactions are incurred to keep replicas consistent, unless weaker consistency models can be tolerated. In our previous work, we have designed, implemented, and evaluated a protocol that manages data in a distributed environment either by dynamically replicating data and managing consistency based on the costs incurred and the available budget, or by partitioning data, in particular by co-locating data items that are frequently accessed together in order to minimize the number of distributed transactions. While both approaches are very effective, neither of them alone can address dynamic workloads. In Polypheny-DB, we will develop a comprehensive cost model that seamlessly combines replication and partitioning of (subsets of) data. Again, the cost model will be used to dynamically re-assess replication and/or partitioning decisions and to adapt the data distribution from a cost-effectiveness point of view.
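To make the cost-model idea concrete, here is a hedged Python sketch of how a workload- and budget-aware choice of storage tier for a single sub-collection might look. The tier names, prices, and scoring formula are invented for illustration; they are a minimal sketch of the general approach, not the project's actual model.

# Hedged sketch of a workload- and budget-aware storage decision, in the
# spirit of the cost model described above. All tiers, prices, and numbers
# are invented for illustration.

from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    read_cost: float     # cost units per 1k reads
    write_cost: float    # cost units per 1k writes
    storage_cost: float  # cost units per GB per month

@dataclass
class Workload:
    reads_per_k: float   # thousands of reads per month (monitored)
    writes_per_k: float  # thousands of writes per month (monitored)
    size_gb: float       # size of the sub-collection

TIERS = [
    Tier("main-memory", read_cost=0.01, write_cost=0.02, storage_cost=5.00),
    Tier("flash",       read_cost=0.05, write_cost=0.10, storage_cost=0.50),
    Tier("spinning",    read_cost=0.20, write_cost=0.25, storage_cost=0.05),
]

def monthly_cost(tier: Tier, w: Workload) -> float:
    return (tier.read_cost * w.reads_per_k
            + tier.write_cost * w.writes_per_k
            + tier.storage_cost * w.size_gb)

def cheapest_tier(w: Workload, budget: float) -> Tier:
    """Pick the cheapest tier within budget; fall back to the overall cheapest."""
    affordable = [t for t in TIERS if monthly_cost(t, w) <= budget]
    pool = affordable or TIERS
    return min(pool, key=lambda t: monthly_cost(t, w))

# A hot, small sub-collection versus a cold archive.
hot = Workload(reads_per_k=5000, writes_per_k=50, size_gb=2)
cold = Workload(reads_per_k=1, writes_per_k=0.1, size_gb=500)
print(cheapest_tier(hot, budget=100).name)   # -> "main-memory"
print(cheapest_tier(cold, budget=100).name)  # -> "spinning"

Re-running such a decision periodically with fresh monitoring statistics yields the adaptive behaviour the project aims for, and the same re-assessment pattern extends naturally to the replication and partitioning choices of the data distribution subproject.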