Thesis

Structure-aware machine learning over multi-relational databases

Abstract:

We consider the problem of computing machine learning models over multi-relational databases. The mainstream approach involves a costly repeated loop that data scientists have to deal with on a daily basis: select features from data residing in relational databases using feature extraction queries involving joins, projections, and aggregations; export the training dataset defined by such queries; convert this dataset into the format of an external learning tool; and train the desired model using this tool.

In this thesis, we advocate for an alternative approach that avoids this loop and instead tightly integrates the query and learning tasks into one unified solution. The primary observation is that the data-intensive computation for a variety of learning tasks can be expressed as group-by aggregates over the join of the database relations.
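As a minimal illustration of this observation (a toy sketch with hypothetical relations, not the system developed in the thesis): for least-squares linear regression of one variable on another, the model parameters depend on the data only through sufficient statistics such as SUM(x*x), SUM(x*y), SUM(x), SUM(y), and COUNT(*), each of which is an aggregate query over the join of the database relations.

```python
import sqlite3

# Hypothetical toy schema: Sales(store, item, units) joins Items(item, price).
# To regress units on price, we only need aggregates over the join; the
# joined tuples themselves never have to be exported to a learning tool.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Sales(store INT, item INT, units REAL);
CREATE TABLE Items(item INT, price REAL);
INSERT INTO Sales VALUES (1,10,3.0),(1,11,1.0),(2,10,4.0);
INSERT INTO Items VALUES (10,2.0),(11,5.0);
""")
sxx, sxy, sx, sy, n = con.execute("""
    SELECT SUM(price*price), SUM(price*units), SUM(price), SUM(units), COUNT(*)
    FROM Sales NATURAL JOIN Items
""").fetchone()

# Closed-form simple linear regression, computed from the aggregates alone.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
```

The same pattern extends to multiple features: the least-squares normal equations need only the matrix of pairwise aggregates SUM(x_i * x_j), a batch of group-by aggregates over the join.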

This observation allows us to employ a combination of established and novel query evaluation techniques that exploit structure in the query and data to optimize the computation of the aggregates. As a result, we show that, for a class of machine learning models, our integrated, structure-aware approach to the end-to-end learning of models over databases can be asymptotically faster than the mainstream solution, which first materializes the result of the feature extraction query. This class of models includes supervised machine learning problems for regression and classification, as well as unsupervised learning problems.
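The source of the asymptotic gap can be sketched as follows (hypothetical toy code, not the thesis's actual algorithms): to compute SUM(a*b) over the join of R(k, a) and S(k, b), the materialized join contains |R_k| * |S_k| tuples per join key k, whereas pushing partial sums below the join touches only |R| + |S| tuples, since the sum over the join equals the sum over k of (SUM of a in R_k) * (SUM of b in S_k).

```python
from collections import defaultdict

# R and S join on their first field.
R = [(1, 2.0), (1, 3.0), (2, 4.0)]
S = [(1, 10.0), (1, 20.0), (2, 5.0)]

def naive(R, S):
    # Enumerates every tuple of the join: quadratic per join key.
    return sum(a * b for (k1, a) in R for (k2, b) in S if k1 == k2)

def factorized(R, S):
    # Pushes partial sums into each relation: linear in |R| + |S|.
    ra, sb = defaultdict(float), defaultdict(float)
    for k, a in R:
        ra[k] += a
    for k, b in S:
        sb[k] += b
    return sum(ra[k] * sb[k] for k in ra.keys() & sb.keys())
```

Both functions return the same aggregate, but the factorized variant never enumerates the join, which is what makes structure-aware evaluation asymptotically faster on skewed or many-to-many joins.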

This theoretical development informed the design and implementation of LMFAO (Layered Multiple Functional Aggregate Optimization), an in-memory optimization and execution engine for batches of aggregates over the input database. LMFAO consists of several layers of logical and code optimizations that systematically exploit factorization, sharing of computation, parallelism, and code specialization.

We conducted two types of performance benchmarks. First, we benchmarked LMFAO against PostgreSQL, MonetDB, and a commercial database management system on the computation of aggregate batches. We then compared LMFAO against several machine learning packages commonly used in data science on the end-to-end learning of a variety of models over databases. In all benchmarks, LMFAO outperforms its competitors, with speedups of up to three orders of magnitude. In many cases, LMFAO completes the end-to-end learning pipeline in less time than the competing machine learning tools need just to construct the input training dataset.

Authors

Division:
MPLS
Department:
Computer Science
Role:
Author

Contributors

Role:
Supervisor


Type of award:
DPhil
Level of award:
Doctoral
Awarding institution:
University of Oxford


Language:
English
Deposit date:
2020-06-10
