Parsing and planning SQL queries with Apache Calcite

Here at Mireo, we created our very own spatiotemporal database called SpaceTime from the ground up. We did this in order to satisfy our customers' demands for a system that could track an unlimited number of vehicles (i.e. 1,000,000+) in real time. You can read more about the challenges of such a feat in the article "Why would a normal person build its own spatiotemporal database?".

We thoroughly studied the problem of massive GPS tracking and, after a long search, finally found and implemented an algorithmic solution for it. The only thing still remaining was to build a software system around that solution so people could actually apply it in the real world. A triviality, right? It turned out not to be so.

Usability is everything

For any software solution to be successful, it is not enough that it's incredibly clever and extremely performant. It must also be usable, or it's completely worthless. The mark of a really great piece of software is that it gives extraordinary power to an ordinary user, enabling him to do with it what he could never dream of doing without it, and to make it seem effortless. And we wanted to do just that – make our customers, who are often quite conservative in regard to new and untested technologies, see SpaceTime not as yet another complicated system they must spend months to learn and then struggle daily with for years, but as a tool truly making their lives easier and their businesses more advanced.

That meant we had to make it dead-simple for an average user to take the enormous power of SpaceTime and apply it to any problem or idea that he has. But in order to do so, we had to hide all the amazing complexity of SpaceTime – its extremely fast indexes, infinite scalability mechanisms, internal partitioning schemes etc. – below the surface presented to the user, and make SpaceTime look natural and familiar to everyone who's ever worked with any sort of a database system (e.g. MySQL, Postgres, SQL Server...). And that meant we had to teach SpaceTime to speak and understand SQL.

Declaration of intent

SQL, short for Structured Query Language, is a de-facto standard language for querying relational (and many non-relational) database management systems (DBMS). It is not exactly a programming language, or at least not an ordinary one. It was designed specifically to make expressing complex operations with data (such as joining, filtering, aggregating) as simple as possible, leaving all the unnecessary details to the DBMS to determine.

A statement in the SQL language is a straightforward, English-like description of various manipulations and operations a user wants the database to perform with some set of data. In other words, it is a declaration of what is to be done, but without providing any directions on how to actually do it. For this reason, SQL belongs to the family of declarative languages, as opposed to imperative (procedural, object-oriented, scripting) languages which are more common in the programming world.

For example, an SQL statement

		SELECT registration_number, SUM(len)/1000.0 AS total_dist 
		FROM st.trips JOIN
		ndb.vehicle ON vid = id 
		GROUP BY registration_number

tells the database to compute the all-time total distance travelled by every vehicle (in kilometers), displaying it next to the registration number of that vehicle. Sending this query to SpaceTime would result in a response similar to this one (depending on the actual data in the database):

registration_number	total_dist
ST0001XX	68563.587
ZG0002XX	8954.224
RI0003XX	11815.494
ZD0004XX	28006.385
ZG0005XX	1177.886
OS0006XX	15491.375

(real distances, but imaginary registration numbers)

It's pretty easy to express such a query in SQL and obtain the desired results in a matter of seconds. But notice there are no details in the query telling the database engine how to get the data, where to put it in memory and in what order to process it. All of that is solely the engine's responsibility, and a wrong decision could easily turn the query run-time from seconds to hours.

Behind the scenes

That means the database query engine must be fairly smart, quite a bit smarter than the usual programming language compiler. The job of a regular compiler is to parse the input program, validate it and then produce the corresponding output as either assembly code or some sort of bytecode. The SQL query engine has to do all that, but also a lot more. After validation, it first has to come up with an abstract execution plan that will satisfy the query description, then it has to optimize the execution plan for maximal performance, and finally the plan has to be adapted to fit the reality of the concrete execution engine. On the way, it has to utilize some abstract mathematical concepts such as relational algebra (which is needed to ensure correct and efficient optimization of the execution plan), as well as take into account countless physical details like indexes, partitioning schemes, network topology, internal record formats etc.

It is by no means obvious how any of this works in practice, and sadly, there are few resources available detailing what actually happens inside an SQL query engine (not counting several books with thousands of pages). Therefore, what follows in this article is one such account, describing in some detail all that's occurring behind the scenes, in the belly of the SpaceTime's query engine.

Step 1 - Parsing

So then, how do we get the final set of results out of a plain simple query string, which to the eyes of SpaceTime's database engine at first looks as incomprehensible as a random sequence of Egyptian hieroglyphs?

The whole process starts with parsing. A query, in order to be understood by the database engine, must first be parsed using an SQL parser, which takes a string of characters and tries to deduce its syntactic structure in the form of a parse tree. It does that by using a set of syntax rules called language grammar, which defines how an SQL query must look like to be considered valid and acceptable to the query engine.

For instance, a rule for parsing SQL SELECT statements might look like this:

		select:
		<SELECT> expressionList
			[<FROM> table]
			[<WHERE> condition]
			[<GROUP> <BY> groupingList]
			[<HAVING> condition]

It declares that a SELECT statement must start with the keyword SELECT, followed by a list of fields and/or expressions to select (could be only one), and optionally any, or all, of the following: a FROM clause specifying the source of data from which to do a select (i.e. a table or a subquery); a WHERE clause filtering selected rows based on a Boolean condition; a GROUP BY clause aggregating rows together based on some keys; and a HAVING clause filtering out some groups if a condition isn't satisfied.

In case the input query doesn't satisfy the rules of SQL grammar, it's declared invalid and the parser refuses to continue, terminating with a corresponding error message. Otherwise, the result of a successful parsing is a parse tree, which is a structured representation of the same query with each part of it (i.e. token) being classified and assigned a specific meaning in the surrounding context.

As an illustration, the previously mentioned query calculating total driving distance for each tracked vehicle is structurally decomposed into the following parse tree:

		[select]:
		SELECT_LIST:
			- [identifier] "registration_number"
			- [operator call] "as"
				- [operator call] "/"
					- [function call]: "SUM"
							- [identifier] "len"
					- [numeric literal] 1000
		FROM:
			[join]:
					CONDITION:
						[operator call] "="
										- [identifier] "vid"
										- [identifier] "id"
					JOIN_TYPE:  inner
					LEFT_SIDE:  "st.trips"
					RIGHT_SIDE: "ndb.vehicle"
		WHERE:    null
		GROUP BY: [identifier] "registration_number"
		HAVING:   null

(Parse tree nodes also contain other info such as location in the query, which aren’t shown for clarity)

From this tree, the role of each token in regard to the whole query can be clearly identified. For example, the token registration_number occurs twice in the query and is classified as an identifier in both cases, but the first time it occurs it determines which field to add to the result dataset, and the second time it says by which field rows of the result are to be grouped in the process of aggregation.

Step 2 - Validation

After parsing, the next phase in the process is query validation. The purpose of validation is to test the semantic correctness of the query, i.e. whether a query written by a user actually makes sense or not. A few examples of queries that don't make sense are: taking the square root out of a vehicle registration number (which is a string), requesting a field sl (speed limit) from trips table when it isn't even there (a whole trip has no speed limit, only individual road segments do), referencing whole tables that don't exist (e.g. semgents, a typo out of segments), calling unknown functions etc. Validation makes sure every object mentioned in the query really exists and every operation specified can actually be done. If it isn't so, the process halts with an explanatory error message.

Another important purpose of validation is to correctly identify to which exact object (field, table, function) any identifier named in the query really refers, and also to assign correct data types to every field, row or expression that therein occurs.

In the case of our sample query, expressions in the SELECT_LIST part of the parse tree after validation look like the following:

		SELECT_LIST:
			- [identifier] "registration_number"  (= field "ndb.vehicle.registration_number") 
			-> varchar
			- [operator call] "as"             	(= built-in AS operator) -> float64
			- [operator call] "/"              	(= built-in division operator) -> float64
				- [function call]: "SUM"       	(= SUM aggregate function) -> int64
				- [identifier] "len"           	(= field "st.trips.len") -> int32
				- [numeric literal] 1000.0       		 	(= floating-point literal) -> float64
			i.e. all identifiers are successfully resolved and all types correctly deduced.

Step 3 - Translation to relational algebra

The next step on the road from the input query to the final result is conversion of the validated parse tree to a very different kind of query representation - a tree of relational operators.

Relational algebra, whence relational operators come from, is a mathematical theory which serves as a formal foundation for designing and reasoning about relational databases such as SpaceTime. Relational algebra deals with abstract transformations over sets of data, such as selection (filtering based on a predicate), projection (choosing and modifying some columns of a row), union (combining several row sets into one), aggregation (computing a scalar function over a set of rows) etc. Each of these transformations has its corresponding operator, and these operators are the building blocks of a relational operator tree. In such a tree, leaf nodes represent input data sets (i.e. database tables) and other nodes represent transformations over data. Conceptually, the data goes from the bottom up and gets transformed by operators until it reaches the root of the tree. The data set that comes out from the root is the final result of the composite operation represented by that tree.

By using relational algebra, both data operations and optimization rules can be expressed in a precise mathematical form, which provides several important benefits to the whole process. Firstly, it allows optimizing the execution plan on an abstract level completely independent from irrelevant physical details. Secondly, abstractness of the representation simplifies its later conversion to the physical plan which may involve executing different parts of the query in different ways and on different query engines (as is the case with SpaceTime), each with its own separately-treated details. Lastly, due to the mathematical exactness of relational algebra, the process of optimization can be strictly verified to ensure the equivalence of the optimized query with the original one.

Our example query, translated to the relational algebra, looks like this:

		PROJECT: exprs = [ registration_number = $0, total_dist = $1 / 1000.0 ]
		AGGREGATE: group_by = [ 0 ], aggs = [ SUM($1) ]
			PROJECT: exprs = [ registration_number = $10, len = $6 ]
			JOIN: type = inner, condition = ($0 = $7)
			SCAN: table = st.trips
			SCAN: table = ndb.vehicle

This is an informal representation – there's also a more strict, mathematical form, which looks far more intimidating.

Project, Aggregate and Join are some of the most common relational operators, while Scan is a special operator which produces data for its parents by scanning and reading contents of the specified table.

It's important to note that this kind of query representation is no longer just a different way of stating what the user wants done with the data. It is also, for the first time, an actual plan, a recipe for how to truly obtain the desired results. Admittedly, it's an abstract one, but a hypothetical relational algebra interpreter could take this tree as-is, evaluate it as described, and produce precisely the results the user requested by writing the original query.

This translation from the validated parse tree to the tree of relational operators is one of the most complex steps on the road, but it's still far from the end of it.

Step 4 - Planning and optimization

An abstract plan of execution obtained in the previous step in the form of a relational operator tree is as yet unfit for execution in the real world. It's completely abstract, and also generally quite inefficient. The purpose of the next stage in the query processing pipeline is to resolve these two issues by optimizing the query plan and then transforming it into a physical execution plan which can be run directly on a query execution engine.

Query optimization is a process in which the original query is transformed according to a set of rules into some other, equivalent query that is faster to run and/or requires less computational resources for its execution. Optimization rules can be of many kinds. Some are based on plain common sense, such as the rule that filtering operations (discarding rows based on a predicate) should come as close to the data sources as possible, so that later operations (e.g. joins, projections) process only those rows that will truly end up in the result set. Some rules are dependent on statistical estimations, such as inverting joins to minimize the expected size of the left-side data set. Other rules simplify individual expressions, replace some function calls with others that are less expensive, or remove dead-ends detected in the plan, such as when there's filtering by a predicate that is always false.

Immediately after optimization comes conversion to the physical plan. What this step entails largely depends on the query execution engine, because physical plan must closely match the engine's internal structure so that it can be directly executed there. For example, many database engines run queries by interpreting bytecode instructions, so in that case the plan must here be compiled to bytecode.

In the case of SpaceTime, which is a database federation system, this step also determines which parts of the query are to be executed on what database engine – whether on MSQL server, MySQL NDB Cluster, SpaceTime server, or KSP streaming engine (for more details about internal SpaceTime structure, see the Mireo SpaceTime – an absurdly fast moving objects database).

In our example, the optimized physical execution plan is the following:

		MsqlProject: exprs = [ registration_number = $0, total_dist = $1 / 1000.0 ]
		MsqlAggregate: group_by = [ 3 ], aggs = [ SUM0($1) ]
			MsqlProject: exprs = [ VID = $2, LEN = $3, ID = $0, REGISTRATION_NUMBER = $1 ]
			MsqlJoin: type = inner, condition = ($2 = $0)
				Ndb2MsqlConvert
				NdbQuery: query = "select id, registration_number from vehicle"
				St2MsqlUnion
				StProject: exprs = [ VID = $0, LEN = $6 ]
					StTableScan: table = trips

Several things can be observed. With respect to the physical execution, parts of the plan are executed on three different database engines - MSQL server does the top-level processing, MySQL NDB Cluster provides static vehicle data, and SpaceTime server gathers data about vehicle trips. Regarding query optimization, two improvements were made to the original plan. Projections were pushed as far down as possible to minimize the total amount of data processed and transferred internally over the network (the transfers happen during execution of Ndb2MsqlConvert and St2MsqlUnion plan nodes). The optimizer also detected that the join operation as it was specified in the original query (joining st.trips with ndb.vehicle) was inefficient because st.trips table typically contains a lot more rows, so the order of the join was inverted. In the case of more complex queries, the number of optimizations that can be performed is far higher, and the outcome on the performance even more significant.

Step 5 - Execution

In the end, a well-optimized physical execution plan is passed on to the query execution engine whose job is to realize it and produce the desired results.

There are generally two approaches to executing the plan. The first and more common one is query interpretation, whereby the query plan is read like a set of instructions (often in bytecode) that are interpreted and executed inside the execution engine. This is used in the majority of today's database systems, even though some of them (e.g. Postgres) have recently also incorporated some elements of the second approach.

That second approach to query execution is Just-In-Time (JIT) compilation, which is what SpaceTime does. JIT compilation means that the physical plan is not treated as a set of instructions to be executed, but rather as a blueprint for writing such instructions which are compiled as a standalone program that can be executed directly without any interpretation. This often provides a substantial gain in execution performance, but at the cost of a potentially more complex query engine design. More details about JIT compilation in SpaceTime can be found in the article "Compiling SQL queries for Mireo SpaceTime".

Whichever approach one takes, the query is now executed, the data collected and communicated between database nodes, then sent to the client (either a console program, a GUI, or a web application) where it is formatted and finally displayed to the eagerly-awaiting user.

Conclusion

Taken as a whole, the process of parsing, validating, planning and executing SQL queries is without question quite a long and complicated one, and implementing it in SpaceTime (or any other database) is by no means an easy task.

Some parts of it, like parsing and validation, are more or less straightforward – there are many proved libraries supplying tools for building custom parsers, and a basic sort of validation can be implemented on-one's-own without too much trouble. Query compilation and execution, whether as bytecode or using JIT, is definitely a hard problem, but it has to be solved either way, whether one uses SQL or not.

The real stumbling block turned out to be relational algebra. Using it for planning and optimizing SQL queries is a must as there are no viable alternatives. But the conversion to it, as well as working with it, is difficult in practice because it requires realizing a large number of abstract concepts and procedures in code. There are books with more than a thousand pages written on the subject, spelling out all the minor details required to implement the process correctly. Therefore, we hoped to save the time and effort by finding a suitable library or framework that would do the work for us.

Surprisingly, there seemed to be none available. Every popular SQL-supporting DBMS definitely must have relational algebra handling built-in somewhere in its engine, but none of them ever made a public release of any standalone library or framework for this particular stage of query processing. Presumably, the coupling between relational algebra handling and the rest of the database engine code in these products is too high to enable separation into a distinct library. Either way, for a moment, it seemed we were left with no other option but to do everything ourselves.

Fortunately, it turned out there is one such framework (and as far as we've seen, it's the only one), called Apache Calcite. But Apache Calcite doesn't only provide support for query planning and optimization using relational algebra. It is a full-blown (and open-source!) solution for designing database systems: it includes both an SQL parser and a validator, and even adds facilities for actual query execution over data coming from various adapters (including custom ones).

We carefully reviewed all the framework's capabilities and concluded that Apache Calcite is a very good match for our requirements. Therefore, we decided to give it a try, and haven't had second thoughts since.

But how exactly did we incorporate it into SpaceTime?That will be the topic of the second part of this article.

Parsing and planning SQL queries with Apache Calcite – Part 1