When I was administering SQL Server and consulting, I often had the opportunity to use the system views. I first encountered these in the entity relationship diagram in the cute little three-ring binder that held the Sybase 4.9 user manual. It both amazed me with its elegance and helped me quickly understand the structure and cardinality of the various Sybase components. The same core structure is retained in the fork of Sybase that became SQL Server.
Sometimes I queried and joined these tables to get information; other times I generated SQL statements from them. For instance, one client had applied the wrong permission to every single table, and I prepared a query that looped through the tables system view and built curative SQL using display statements within a cursor. In these situations, I would copy the generated SQL from the query result, inspect it, paste it into a new query window, and run it. That sure beats right-clicking on each table 300 times…
There are myriad uses of the information schema if you spend enough time with a DBMS. Recently, I wanted to see all of the Databricks Unity Catalog volumes in a single list, to see which areas of Azure storage had been covered by volumes. The utility of implementing Databricks volumes on Azure storage is significant, and a topic for another discussion.
Unity Catalog stores lots of object information in system.information_schema. The names of the tables are mostly self-explanatory as you go shopping for the information you want. My first implementation of a UC volume listing executed a 'show catalogs' SQL query and captured the output in a dataframe, followed by inner loops of 'show schemas' and 'show volumes'. This was a brute force approach, and it occurred to me that a more elegant version could be prepared using the information schema tables.
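For reference, the brute force version looked roughly like the sketch below, nesting 'show schemas' and 'show volumes' inside a 'show catalogs' loop. This is a reconstruction of the approach rather than the original notebook code: the output column names of the show commands can vary across runtime versions, so the catalog and schema values are read positionally, and the 'volume_name' column is an assumption.

# Brute force enumeration with show commands (a reconstruction, not the original code).
# Output column names vary by runtime, so catalog and schema are read positionally,
# and the 'volume_name' column below is an assumption.
volume_list = []
for cat in spark.sql("SHOW CATALOGS").collect():
    catalog = cat[0]
    for sch in spark.sql(f"SHOW SCHEMAS IN {catalog}").collect():
        schema = sch[0]
        try:
            for vol in spark.sql(f"SHOW VOLUMES IN {catalog}.{schema}").collect():
                volume_list.append(f"{catalog}.{schema}.{vol['volume_name']}")
        except Exception:
            # Skip catalogs/schemas that do not support volumes (e.g. hive_metastore)
            pass
print(volume_list)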
I note that I use Databricks Assistant quite a bit now for PySpark coding, preparing syntax details and suggesting error resolutions. The more I use it, the more surprised I am at how often it accurately guesses what I was going to place in subsequent lines of code. Databricks Assistant is also good at taking a 75-line SQL query and identifying the precise location of a syntax error, whereas the regular error message only states the SQL statement in which the error occurred. But I digress…
Below is a Python example of navigating the information schema to prepare a simple list. I placed the output into a dataframe to facilitate possible future coding and provide a tabular display, and I included an example of how to sort and filter the dataframe, for my own convenience if nothing else.
from pyspark.sql import Row

# Pull every volume registered in Unity Catalog
dfVols = spark.sql("select * from system.information_schema.volumes")

# Reshape each row into the columns of interest
rows = []
for vol in dfVols.collect():
    catalog = vol['volume_catalog']
    schema = vol['volume_schema']
    volume = vol['volume_name']
    volume_location = vol['storage_location']
    volume_type = vol['volume_type']
    volume_owner = vol['volume_owner']
    last_altered_by = vol['last_altered_by']
    rows.append(
        Row(
            catalog=catalog,
            schema=schema,
            volume=volume,
            volume_type=volume_type,
            volume_owner=volume_owner,
            volume_location=volume_location,
            last_altered_by=last_altered_by
        )
    )

# Build a dataframe for tabular display and future processing,
# then show only external volumes, sorted by catalog, schema, and volume
df_volumes = spark.createDataFrame(rows)
display(df_volumes.filter(df_volumes.volume_type == 'EXTERNAL').orderBy("catalog", "schema", "volume"))
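In hindsight, the collect-and-rebuild step is not strictly necessary. The same listing can be produced directly from the information schema table with column aliases, along the lines of this untested sketch:

from pyspark.sql import functions as F

# Same listing built directly from the information schema table,
# without collecting rows and rebuilding a dataframe
df_volumes = (
    spark.table("system.information_schema.volumes")
         .select(
             F.col("volume_catalog").alias("catalog"),
             F.col("volume_schema").alias("schema"),
             F.col("volume_name").alias("volume"),
             "volume_type",
             "volume_owner",
             F.col("storage_location").alias("volume_location"),
             "last_altered_by",
         )
)
display(df_volumes.filter("volume_type = 'EXTERNAL'").orderBy("catalog", "schema", "volume"))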