Create data lineage from Trino/Hive queries in digdag log with Python

sqllineage visualize your query log into lineage

An example lineage of Treasure Workflow with sqllineage

What’s data lineage?

Data lineage is something to describe “Where this data comes from and where it goes?”

I learned this term in my previous job. They provided “Cloudera Navigator” which includes data lineage from execution logs of Hive/Spark etc.

lineage of Cloudera Navigator via https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cn_lineage_generation.html
lineage of Cloudera Navigator via https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cn_lineage_generation.html

sqllineage is awesome open source tool for visualizing lineage

Recently, I learned there is a Python package so called sqllinage, that makes analyze and visualize data lineage from SQLs.

sqllineage consists of Python implementation to analyze SQL and web application written in React.

Visualize data lineage from Treasure Data’s workflow logs

I found that Treasure Data’s workflow log outputs SQLs in its log. But it still needs to format pure SQLs.

Then, I create digdaglog2sql to extract SQLs from Treasure Workflow logs.

You can use it with Python 3.7+. Here is the overview of the usage and check details on GitHub.

Install via pip:

pip install --user digdaglog2sql

If you have a workflow log downloaded from Treasure Data, you can convert into SQL as:

digdaglog2sql --input workflow-log.txt --output output.sql

Or, if you want extract SQLs from specific workflow, you can use Session ID of it.

export TD_API_KEY=1234XXX/YYYYYYYY
digdaglog2sql --session-id 12345 --site us --output output.sql

You can fetch SQLs from your hosted digdag as the following:

digdaglog2sql --session-id 12345 --endpoint digdag.example.com --output output.sql

Note that, as of May 5, 2022, sqllineage and sqlparse, which is an important backend of sqllineage, are not fully compatible with Trino and Hive queries.

As of 2022/05/11, the issues in sqllineage around Hive/Trino were fixed and it is available in 1.3.5 on PyPI. It means, you don’t have to have node for sqllineage installation from source.

As of 2022/10/06, the issue in sqlparse was resolved in 0.4.3.

These are the PRs that approaches the issues:

Don’t worry about it. I prepared patched branches on GitHub. You can install sqllineage and sqlparse as the following:

pip install git+https://github.com/chezou/sqlparse.git@trino#egg=sqlparse==0.4.3.dev0
pip install sqllineage

If you see some error on installation of sqllineage, double-check if you have node installed.

Then, you can visualize your SQL file as:

$ sqllineage -g -f output.sql
 * SQLLineage Running on http://localhost:5000/?f=output.sql

Now you can see visualization of data linage, both table level and column level.

SQL lineageの例
SQL lineageの例

Let’s try sqllineage!

Aki Ariga
Aki Ariga
Principal Software Engineer

Interested in Machine Learning, ML Ops, and Data driven business. If you like my blog post, I’m glad if you can buy me a tea 😉

  Gift a cup of Tea

Related