Dr. Marco Vogt
Department of Mathematics and Computer Sciences

Projects & Collaborations (6 found)

Polypheny-DDI
Research Project | 3 Project Members

In recent years, data-driven research has established itself as the fourth pillar in the spectrum of scientific methods, alongside theory, empirical research, and computer-based simulation. In various scientific disciplines, increasingly large amounts of data, both structured and unstructured, are being generated, or existing data collections that were originally isolated from each other are being linked in order to gain new insights. The process of generating knowledge from raw data is called Data Science or Data Analytics. The entire data analytics pipeline is quite complex, and most work focuses on the actual data analysis (using machine learning or statistical methods) while largely neglecting the other elements of the pipeline. This is particularly the case for all aspects related to data management, storage, processing, and retrieval, even though these challenges play an essential role. A Distributed Data Infrastructure (DDI) supports a large variety of data management features as demanded by the data analytics pipeline. However, DDIs are usually very heterogeneous in terms of data models, access characteristics, and performance expectations.
In addition, DDIs for integrating, continuously updating, and querying data from various heterogeneous applications need to overcome the inherent structural heterogeneity and fragmentation. Recently, polystore databases have gained attention because they help overcome these limitations: they allow data to be stored in one system, yet in different formats and data models, and they offer one joint query language. In past work, we have developed Polypheny-DB, a distributed polystore that integrates several different data models and heterogeneous data stores. Polypheny-DB goes beyond most existing polystores and even supports data accesses with mixed workloads (e.g., OLTP and OLAP).

However, polystores are limited to rather simple object models, static data, and exact queries. When individual data items follow a complex inherent structure and consist of several heterogeneous parts between which dedicated constraints exist, when access goes beyond exact Boolean queries, when data is not static but continuously produced, and/or when objects need to be preserved in multiple versions, polystores quickly reach their limits. At the same time, these are typical requirements for data management within a data analytics pipeline. Examples include scientific instruments that continuously produce new data as data streams; social network analysis that requires support for complex object models including structured and unstructured content; data produced by imaging devices that requires sophisticated similarity search support; and frequently changing objects that are subject to time-dependent analyses. The objective of the Polypheny-DDI project is to seamlessly combine the functionality of a polystore database with that of a distributed data infrastructure to meet the requirements of data science applications. It will focus on i.) supporting complex composite object models and enforcing constraints between the constituent parts; ii.)
supporting similarity search in multimedia content; and iii.) supporting continuous data streams and temporal/multiversion data.

Polypheny: Multi-Model Data Management
Research Project | 3 Project Members

Polypheny is a novel, innovative multi-model database system that seamlessly combines different data models, query languages, and storage systems.

Chronos
Research Project | 4 Project Members

Evaluations are an important part of research and development. While they are daily business for every database engineer, most engineers recurrently face the same problems: how to automate, monitor, and analyze the results in a straightforward and intuitive way, and how to guarantee reproducibility? With Chronos, we have developed an evaluation system that presents a solution to these major challenges.

Polypheny-DB: Cost- and Workload-aware Adaptive Data Management
Research Project | 3 Project Members

In the last few years, it has become obvious that the "one-size-fits-all" paradigm, according to which database systems have been designed for several decades, has come to an end. The reason is that in the very broad spectrum of applications, ranging from business over science to the private life of individuals, demands are increasingly heterogeneous. As a consequence, the data to be considered differs significantly in many ways: from immutable data to data that is frequently updated; from highly structured data to unstructured data; from applications that need precise data to applications that are fine with approximate and/or outdated results; and from applications that demand always-consistent data to applications that can tolerate lower levels of consistency. Even worse, many applications feature heterogeneous data and/or workloads, i.e., they intrinsically come with data (sub-collections) that have different requirements for data storage, access, consistency, etc., yet have to be dealt with in the same system.
This development can be coped with either by a large zoo of specialized systems (one for each subset of data with different properties) or by a new type of flexible database system that automatically adapts to the potentially dynamically changing needs and characteristics of applications. Such behaviour is especially important in the Cloud, where several applications need to be hosted on a shared infrastructure in order to make economic use of the available resources. In addition, the infrastructure costs incurred in a Cloud are made transparent in a very fine-grained way. The Polypheny-DB project will address these challenges by dynamically optimizing the data management layer, taking into account the resources needed and an estimation of the expected workload of applications. The core will be a comprehensive cost model that seamlessly addresses these criteria and that will be used, in two subprojects, to optimize i.) data storage and access, and ii.) data distribution. Each of the two subprojects will be addressed by one PhD student.

Data Storage and Access: Different physical storage media, such as spinning disks, flash storage, or main memory, come with different properties and performance guarantees for data access, but also differ significantly in terms of the cost of the necessary hardware and the volume of data that can be managed at a certain price. In addition, the access characteristics of applications (e.g., read-only, mostly reads, balanced read/write ratio, high update frequencies) may favour one solution over the other or even require hybrid solutions. The same also applies to the data model to be used and the choice between structured data, semi-structured data, or schema-less data management. Similarly, applications might require data to be approximated to speed up queries or to be compressed to save storage space.
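The trade-off just described between media cost and access performance can be made concrete with a toy cost comparison. The following sketch is purely illustrative: the media, prices, and latencies are rough assumptions for exposition, not figures from the project, and `best_medium` is a hypothetical helper, not part of Polypheny-DB.

```python
# Toy model of picking a storage medium for a sub-collection: among all media
# that fit the storage budget, choose the one with the lowest expected access
# latency for the observed read/write mix. All numbers are illustrative.
from dataclasses import dataclass

@dataclass
class Medium:
    name: str
    cost_per_gb: float       # monetary cost per GB stored (assumed)
    read_latency_ms: float   # typical read latency (assumed)
    write_latency_ms: float  # typical write latency (assumed)

MEDIA = [
    Medium("main_memory",   3.00, 0.0001, 0.0001),
    Medium("flash_ssd",     0.20, 0.1,    0.3),
    Medium("spinning_disk", 0.03, 8.0,    10.0),
]

def best_medium(size_gb, reads_per_s, writes_per_s, budget):
    """Cheapest-latency medium whose storage cost fits the budget,
    weighting latency by the sub-collection's read/write ratio."""
    affordable = [m for m in MEDIA if m.cost_per_gb * size_gb <= budget]
    if not affordable:
        raise ValueError("no medium fits the budget")
    total = reads_per_s + writes_per_s
    def expected_latency(m):
        return (reads_per_s * m.read_latency_ms +
                writes_per_s * m.write_latency_ms) / total
    return min(affordable, key=expected_latency)
```

With a generous budget the model keeps hot data in memory; as the budget shrinks, the same workload is pushed to flash and finally to disk, which is the kind of budget-dependent decision the cost model has to make per sub-collection.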
In Polypheny-DB, we will jointly address these choices and alternatives by devising a comprehensive cost model for data storage that makes it possible to decide, if necessary at the level of sub-collections, how data can best be stored given an available budget and the requirements for data access. The costs for data storage and access will be continuously monitored, together with an analysis of the expected workload of applications, and the storage strategy will be updated dynamically if necessary.

Data Distribution: A high degree of data availability requires that data be replicated across different sites in a distributed system. For read-only applications, this also makes it possible to balance the (read) load across sites. However, in the presence of updates, additional costs for distributed transactions are incurred to keep replicas consistent, unless weaker consistency models can be tolerated. In our previous work, we have designed, implemented, and evaluated a protocol that manages data in a distributed environment either by dynamically replicating data and managing consistency based on the costs incurred and the available budget, or by partitioning data, in particular by co-locating data items that are frequently accessed jointly in order to minimize the number of distributed transactions. While both approaches are very effective, neither of them alone can address dynamic workloads. In Polypheny-DB, we will develop a comprehensive cost model that seamlessly combines replication and partitioning of (subsets of) data. Again, the cost model will be used to dynamically re-assess replication and/or partitioning decisions and to adapt data distribution from a cost-effectiveness point of view.
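The core intuition behind combining replication and partitioning can be sketched as a per-fragment cost comparison: replication makes reads cheap but every update becomes a distributed transaction, while partitioning makes updates local but some reads remote. The cost constants and the decision rule below are assumptions made for exposition only; they are not the protocol or cost model developed in the project.

```python
# Illustrative decision between full replication and partitioning for one
# data fragment, by comparing the total expected access cost of each
# strategy. All cost constants are assumed relative units, not measurements.
def place_fragment(read_rate, write_rate,
                   replica_read_cost=1.0,      # local read at any replica
                   replicated_write_cost=5.0,  # distributed txn updating all replicas
                   partition_read_cost=2.0,    # possibly remote read of the single copy
                   partition_write_cost=1.0):  # local write at the owning partition
    cost_replicate = (read_rate * replica_read_cost +
                      write_rate * replicated_write_cost)
    cost_partition = (read_rate * partition_read_cost +
                      write_rate * partition_write_cost)
    return "replicate" if cost_replicate <= cost_partition else "partition"
```

Under these assumed constants, a read-heavy fragment ends up replicated and a write-heavy one partitioned; re-running the comparison as the observed rates drift is the kind of dynamic re-assessment described above.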
Data Partitioning and Replication in the Cloud
Research Project | 3 Project Members

In this work, we combine protocols for distributed data management, in particular for data replication and for data partitioning, to provide cost-effective, dynamic, and adaptable replicated data management in a Cloud environment.

Data Archiving in the Cloud
Research Project | 3 Project Members

With the advent of data Clouds that come with nearly unlimited storage capacity combined with low storage costs, the well-established update-in-place paradigm for data management is more and more being replaced by a multi-version approach. Especially in a Cloud environment with several geographically distributed data centers that act as replica sites, this makes it possible to keep old versions of data and thus to provide a rich set of read operations with different semantics (e.g., read the most recent version, read a version not older than a given timestamp, or read the data as of a given point in time). A combination of multi-version data management, replication, and partitioning makes it possible to redundantly store several or even all versions of data items without significantly impacting any single site. However, to avoid overloading single sites in such partially replicated data Clouds when processing archive queries that access old versions, query optimization has to jointly consider version selection and load balancing (site selection). In our work, we develop novel cost-aware index approaches (called ARCTIC) for version and site selection for a broad range of query types, covering both fresh data and archive data.
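The three read semantics mentioned above can be illustrated on a minimal multi-version store that keeps, per key, a timestamp-ordered list of versions. This is a self-contained toy sketch; the class and method names are hypothetical and unrelated to ARCTIC's actual index structures.

```python
# Toy multi-version key-value store illustrating the read semantics
# "most recent version", "as of a timestamp", and "not older than a
# timestamp". Per key, timestamps and values are kept in parallel
# sorted lists so version selection is a binary search.
import bisect

class VersionedStore:
    def __init__(self):
        self._ts = {}   # key -> sorted list of version timestamps
        self._val = {}  # key -> values, aligned with self._ts

    def put(self, key, ts, value):
        i = bisect.bisect_right(self._ts.setdefault(key, []), ts)
        self._ts[key].insert(i, ts)
        self._val.setdefault(key, []).insert(i, value)

    def read_latest(self, key):
        """'Read most recent version'."""
        return self._val[key][-1]

    def read_as_of(self, key, ts):
        """'Read data as of ts': newest version written at or before ts."""
        i = bisect.bisect_right(self._ts[key], ts)
        if i == 0:
            raise KeyError(f"no version of {key!r} at or before {ts}")
        return self._val[key][i - 1]

    def read_not_older_than(self, key, ts):
        """'Read version not older than ts': any version with timestamp
        >= ts is acceptable; here we simply return the freshest one."""
        if self._ts[key][-1] < ts:
            raise KeyError(f"freshest version of {key!r} is older than {ts}")
        return self._val[key][-1]
```

In a partially replicated Cloud, each replica site would hold only a subset of these version lists, which is why version selection and site selection have to be optimized jointly.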