In the world of parsers, the state-of-the-art in text analysis power and functionality lies in a parser that can build an abstract syntax tree (AST) of your text. The main benefit of an AST is that you use a JSON string that contains classifications of the information that you parsed. It is extremely useful and easy for building services on top of. If the input text is a SQL statement, even more so.
Early in 2019, Google open-sourced the AST-based parser called zetaSQL that is used in production for parsing and formatting queries in Google BigQuery and Cloud Spanner. The repository is here. zetaSQL can be compiled using
bazel and consists mainly of C++ code, although a java implementation is also available.
Diving into the code, the parser has quite a few functionalities, some of which are:
- Formatting a SQL statement (sql_formatter.h)
- Analyzing a SQL statement (analyzer.h)
- Finding syntax errors and returning the error, line, and column to you (parse_helpers.h)
After looking into the repository, we quickly see that it seems easy to build. To save you time, I should point out that the file called
.bazelrc is very important, as it sets the C++ compiler version to the correct level, as supported by the other software by Google. So it is of critical importance to have this file during a build.
Also, the base zetaSQL repository doesn’t offer a
Dockerfile (yet) which allows you to build and compile the latest version of zetaSQL. I offer it here, based on Ubuntu 18.04 and the latest version of
My implementation (zetasql-analyzer-server) solves point 3 above, which is the syntax error finder. It is based on a Docker implementation made by
apstndb called the zetasql-format-server (that solves point 2 above). Under the hood it is a golang server that wraps the formatter (or analyzer) api of zetaSQL and serves an endpoint in Google Cloud Run.
Here is an example output from the analyzer endpoint for an erroneous query
SELEC 1 (the T is missing):
In the example below, the second nest of the query is missing the word
SELECT . You can see that the parser indicates this information.
What about speed and performance?
From the limited number of tests I’ve performed, it seems most responses (even for queries over 600 rows) stays within 1 second, with the mean of 300ms.
Here are timing graphs for the 2 examples from my laptop, with a high-speed internet connection:
The stack that allows this high speed parsing is using a
distroless container image (https://github.com/GoogleContainerTools/distroless) and a
serverless hardware implementation (Cloud Run) using C++ as input language.
Installation and Usage
To install this you are welcome to read the README file here. Let me know where I can help to get it going.
After creating an endpoint in Cloud Run, you can get information from it by issuing a
curl -X POST -H 'Content-type: application/text' --data 'SLECT 1, ' https://<your endpoint goes here>
- zetasql-analyzer-server: https://github.com/ebendutoit/zetasql-analyzer-server
- zetaSQL: https://github.com/google/zetasql
- zetasql-format-server: https://github.com/apstndb/zetasql-format-server