
Generating BigQuery Table DDLs

Published on June 12, 2025 by pawel

As a data engineer working with Google BigQuery, I often need to document and recreate table structures accurately. Recently, I built a Python script to streamline the process of generating DDL (Data Definition Language) statements for tables in a BigQuery dataset, complete with schema details, partitioning, clustering, and constraints like primary and foreign keys. In this blog post, I’ll walk you through what this script does, why I created it, and how it simplifies my workflow. Let’s dive in!

Dive Into the Code

I’ve shared the getAllTables.py script in a dedicated GitHub repository, so you can grab the full source and follow along.

The Motivation Behind the Script

BigQuery is a powerful data warehouse, but extracting complete table definitions—including schema, constraints, and configurations—can be tedious without automation. I needed a way to:

  1. Retrieve the DDL for all tables in a specific dataset.
  2. Include critical details like column definitions, table descriptions, partitioning, clustering, and primary/foreign key constraints.
  3. Save these DDLs as SQL files for documentation, version control, or deployment in other environments.
  4. Optionally filter tables by name to focus on specific subsets of data.

Manually querying BigQuery’s INFORMATION_SCHEMA or piecing together table metadata was time-consuming and error-prone. So, I decided to write a Python script to automate this process, leveraging the Google Cloud BigQuery client library and my trusty service account credentials.

What the Script Does

Let me break down the script’s functionality from my perspective, as I built it to solve my specific needs.

1. Setting Up Authentication

To interact with BigQuery, I use a service account JSON key file for authentication. In the script, I specify the path to this file (e.g., c:/skey/service_account.json) and initialize the BigQuery client with my project ID (xxx). This sets up a secure connection to my dataset (yyy) so I can query table metadata and schemas.
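
For reference, here is a minimal sketch of that setup. The IDs and key path below are placeholders (the same xxx/yyy stand-ins used throughout this post), not real values:

from google.cloud import bigquery
from google.oauth2 import service_account

# Placeholder values -- swap in your own project, dataset, and key path
credentials_path = "c:/skey/service_account.json"
project_id = "xxx"
dataset_id = "yyy"

# Build credentials from the service account key and create the client
credentials = service_account.Credentials.from_service_account_file(credentials_path)
client = bigquery.Client(project=project_id, credentials=credentials)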

2. Fetching Table Metadata

The script starts by listing all tables in the specified dataset using client.list_tables(). If I provide a table_filter (e.g., to only include tables with “sales” in their names), it narrows down the list to match my criteria. This is super helpful when I’m working with large datasets and only need DDLs for a subset of tables.

For each table, the script retrieves detailed metadata using client.get_table(). This gives me access to the table’s schema, description, partitioning, and clustering configurations—everything I need to reconstruct the table.
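
In rough outline, the listing and filtering looks like this (a simplified sketch rather than the exact code from getAllTables.py):

# List every table in the dataset, optionally narrowing by a substring filter
table_filter = "sales"  # set to None to include all tables

table_ids = [
    t.table_id
    for t in client.list_tables(dataset_id)
    if not table_filter or table_filter in t.table_id
]

# Pull full metadata for each table: schema, description, partitioning, clustering
for table_id in table_ids:
    table = client.get_table(f"{project_id}.{dataset_id}.{table_id}")
    print(table_id, table.time_partitioning, table.clustering_fields)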

3. Generating DDL Statements

The core of the script is the generate_table_ddl() function, which crafts a CREATE OR REPLACE TABLE statement for each table. Here’s what it includes (a simplified sketch follows the list):

  • Columns: For each column, it specifies the name, data type (e.g., STRING, INTEGER), and mode (e.g., NULLABLE, REQUIRED). If a column has a description, it’s added as an OPTIONS clause.
  • Table Options: If the table has a description, it’s included in the DDL.
  • Partitioning: If the table is partitioned (e.g., by DAY or MONTH on a specific field), the script adds a PARTITION BY clause.
  • Clustering: If clustering fields are defined, it includes a CLUSTER BY clause with the relevant columns.
  • Constraints: The script queries BigQuery’s INFORMATION_SCHEMA to retrieve primary and foreign key constraints. These are added as ALTER TABLE statements, marked as NOT ENFORCED since BigQuery doesn’t enforce constraints but supports them for documentation.
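
Here is a condensed sketch of how a function like that can be assembled from the table metadata. It is illustrative rather than a copy of getAllTables.py, and it only covers time-based partitioning on a single field:

def generate_table_ddl(table):
    # Column definitions: name, type, NOT NULL for REQUIRED mode, optional description
    columns = []
    for field in table.schema:
        col = f"  `{field.name}` {field.field_type}"
        if field.mode == "REQUIRED":
            col += " NOT NULL"
        if field.description:
            col += f' OPTIONS(description="{field.description}")'
        columns.append(col)

    ddl = f"CREATE OR REPLACE TABLE `{table.project}.{table.dataset_id}.{table.table_id}` (\n"
    ddl += ",\n".join(columns) + "\n)"

    # Partitioning and clustering clauses, if configured on the table
    if table.time_partitioning and table.time_partitioning.field:
        ddl += f"\nPARTITION BY `{table.time_partitioning.field}`"
    if table.clustering_fields:
        ddl += "\nCLUSTER BY " + ", ".join(f"`{f}`" for f in table.clustering_fields)

    # Table-level description
    if table.description:
        ddl += f'\nOPTIONS(description="{table.description}")'

    return ddl + ";"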

For constraints, I wrote a separate get_constraints() function that queries INFORMATION_SCHEMA.TABLE_CONSTRAINTS, KEY_COLUMN_USAGE, and CONSTRAINT_COLUMN_USAGE to extract primary and foreign key details. This was tricky because foreign key metadata (like referenced tables and columns) can sometimes be incomplete, so I added fallbacks to handle missing data gracefully.
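
As an illustration, a primary-key lookup against those views can look like the query below. This is a hedged sketch: the view and column names are standard INFORMATION_SCHEMA ones, but the surrounding Python is simplified compared to my get_constraints() function:

def get_primary_key_columns(client, project_id, dataset_id, table_id):
    # Find primary-key columns for one table via INFORMATION_SCHEMA
    sql = f"""
        SELECT kcu.column_name
        FROM `{project_id}.{dataset_id}`.INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS tc
        JOIN `{project_id}.{dataset_id}`.INFORMATION_SCHEMA.KEY_COLUMN_USAGE AS kcu
          ON tc.constraint_name = kcu.constraint_name
        WHERE tc.table_name = '{table_id}'
          AND tc.constraint_type = 'PRIMARY KEY'
        ORDER BY kcu.ordinal_position
    """
    return [row.column_name for row in client.query(sql).result()]

The foreign-key case additionally joins CONSTRAINT_COLUMN_USAGE to resolve the referenced table and columns, which is exactly where the incomplete-metadata fallbacks come into play.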

4. Saving DDLs to Files

Once the DDL is generated, the script saves it to a .sql file in a specified output directory (e.g., table_ddls). Each table gets its own file, named after the table ID (e.g., my_table.sql). This makes it easy to version-control the DDLs in Git or share them with my team. The script also prints the DDL to the console for quick review, along with confirmation that the file was saved.
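
The file-writing step itself is small; here is a minimal sketch, assuming table_id and the generated ddl string come from the earlier steps:

import os

output_dir = "table_ddls"  # assumed output directory
os.makedirs(output_dir, exist_ok=True)

# One .sql file per table, named after the table ID
output_path = os.path.join(output_dir, f"{table_id}.sql")
with open(output_path, "w", encoding="utf-8") as f:
    f.write(ddl)

print(ddl)
print(f"Saved DDL to {output_path}")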

5. Error Handling

I built in robust error handling to catch issues like invalid credentials, missing tables, or file-writing errors. If something goes wrong (e.g., a table lacks constraint metadata), the script logs the error and continues processing the remaining tables, ensuring I don’t lose progress.
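
Conceptually, the per-table work is wrapped so that one failure doesn’t abort the whole run; something along these lines (a sketch, not the exact code):

for table_id in table_ids:
    try:
        table = client.get_table(f"{project_id}.{dataset_id}.{table_id}")
        ddl = generate_table_ddl(table)
        # ... write the .sql file as shown above ...
    except Exception as exc:
        # Log the problem and move on to the remaining tables
        print(f"Skipping {table_id}: {exc}")
        continue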

How I Use the Script

To run the script, I configure a few parameters at the bottom of the file (a sketch of that block follows the list):

  • project_id: My BigQuery project (xxx).
  • dataset_id: The dataset I’m targeting (yyy).
  • credentials_path: The path to my service account JSON key.
  • table_filter: An optional string to filter tables by name (e.g., sales for tables related to a specific project).
  • output_dir: Where to save the generated .sql files.
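
Put together, the configuration block at the bottom of the file looks roughly like this. The values are placeholders, and the top-level function name (generate_ddls) is illustrative rather than the exact name in the script:

if __name__ == "__main__":
    project_id = "xxx"                                   # BigQuery project
    dataset_id = "yyy"                                   # target dataset
    credentials_path = "c:/skey/service_account.json"    # service account key
    table_filter = "sales"                               # optional name filter; None for all tables
    output_dir = "table_ddls"                            # where the .sql files go

    generate_ddls(project_id, dataset_id, credentials_path, table_filter, output_dir)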

Then, I simply run the script with Python:

python getAllTables.py

The script lists all matching tables, generates their DDLs, and saves them to files while printing progress to the console. For example, for a table named sales_data, I might get a file table_ddls/sales_data.sql containing:

CREATE OR REPLACE TABLE `xxx.yyy.sales_data` (
  `id` INTEGER,
  `sale_date` DATE,
  `amount` FLOAT64
)
PARTITION BY `sale_date`
CLUSTER BY `id`
OPTIONS(description="Sales data table");

ALTER TABLE `xxx.yyy.sales_data` ADD PRIMARY KEY (`id`) NOT ENFORCED;

Why This Script Saves Me Time

Before writing this script, I spent hours manually querying BigQuery’s metadata tables and stitching together DDLs. Now, I can generate DDLs for an entire dataset in minutes, complete with constraints and configurations. The output files are perfect for:

  • Documentation: Sharing table structures with my team or auditors.
  • Migration: Recreating tables in another environment (e.g., a dev or staging project).
  • Version Control: Tracking schema changes over time in a Git repository.
  • Automation: Feeding DDLs into CI/CD pipelines for automated deployments.

The table filter is a lifesaver when I only need DDLs for specific tables, and the error handling ensures I don’t get stuck debugging a single problematic table.

Challenges and Lessons Learned

One challenge was dealing with BigQuery’s INFORMATION_SCHEMA for constraints. Foreign key metadata can be sparse, so I had to add logic to handle cases where referenced tables or columns are missing. Another lesson was ensuring the DDL syntax matches BigQuery’s expectations, especially for partitioning and clustering clauses.

I also learned to appreciate Python’s google.cloud.bigquery library—it’s powerful but requires careful error handling, especially when querying metadata or writing files.

What’s Next?

I’m already thinking about enhancements:

  • Adding support for views and materialized views.
  • Including table statistics (e.g., row counts, size) in the output.
  • Generating a single SQL file with all DDLs for easier batch execution.
  • Supporting other BigQuery features like table expiration policies.

For now, though, this script is a game-changer for my BigQuery workflows. It’s a reliable, reusable tool that saves me time and ensures consistency in my table definitions.

If you’re working with BigQuery and need to generate DDLs, I hope my script inspires you to automate your own workflows. Feel free to adapt it to your needs—just make sure you’ve got your service account credentials ready!
