
Connecting to and querying BigQuery from Python

Four steps to read data from your BigQuery warehouse directly into Python


In this post, you'll learn how to connect to a Google BigQuery warehouse with Python. There are a few different ways to do this, but we'll use the official Google Cloud Python Client (google-cloud-bigquery). We'll walk through:

1. Installing the Google Cloud BigQuery Python Client (google-cloud-bigquery)
2. Authenticating and connecting to your BigQuery data warehouse
3. Running a query!

At this point, you've successfully connected to and queried BigQuery from Python, and you can write any query you want. Job done? Not quite. We'll also cover:

4. Reading data from a BigQuery query into a pandas DataFrame

Bonus: Writing a DataFrame back into a BigQuery table.

If you're running through this live, it should only take you around 10 minutes to go from zero to successful query. Let's get into it!

Installing the Google Cloud Python Client

There's nothing particularly special about the environment needed for this, so if you already know you have a working Python 3.6+ installation, you can skip to the next section, Installing the google-cloud-bigquery package.

Prepping your Python environment

There are two likely scenarios here for how you're accessing Python:

a. You're using a Jupyter notebook or a Jupyter notebook alternative:

You should be pretty much good to go in this case. If you want to make sure, you can run this Python command in a cell and look for a response that's >= 3.6:

from platform import python_version
print(python_version())
Checking the Python version of a notebook

b. You're using the terminal / command line / some other Python IDE directly:

  1. Open up your terminal (Terminal app on Mac, Command Prompt or PowerShell on Windows, etc.)

  2. Run python --version to check what Python version is installed. Mine prints out Python 3.9.12, but as long as yours is 3.6 or greater, you're all set.

If you get a command not found: python error, or your output is Python 2.x, try running python3 --version. If that works, then you have separate installations of Python 3 and Python 2, and for the rest of this tutorial you'll need to substitute python3 for python and pip3 for pip in the example commands.

If python3 --version also does not work, then you don't have Python installed. The easiest way to get this working is to download the official installer for your machine.

PS: We won't go deep into the setup of virtual environments here, but if you're doing a lot of Python work directly at the command line, you'll want to read up on them. Virtualenv works well, and lots of people also like conda.
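As a minimal sketch, here's what creating and activating an environment with Python's built-in venv module looks like (Virtualenv and conda have their own, similar commands):

python3 -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate

Anything you pip install after activating lands in that environment instead of your system Python.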

Installing the google-cloud-bigquery package

This is probably the easiest step of the whole tutorial! You just have to run one command to install the official BigQuery Python Connector:

pip install google-cloud-bigquery

If you're using a Jupyter notebook, add a ! before pip to let this command run.

!pip install google-cloud-bigquery

And if you had to run python3 --version earlier to get a working output, you'll run pip3 instead of pip to install the package:

pip3 install google-cloud-bigquery

This pip install command will spin through a bunch of messages and progress bars, and maybe a few warnings about your "pip version" or various package deprecations. This is OK, and unless anything actually says "ERROR" in red, you can probably ignore it.

Now weā€™re ready to connect to BigQuery.

Authenticating and connecting to BigQuery

In this next part, we'll be working with sensitive information: your BigQuery service account credentials. You shouldn't ever store these directly in code or check them into git. You never know what you might unthinkingly do with that code: send it to a coworker for help, post it on Stack Overflow with a question, check it into a public git repo...

Storing BigQuery credentials

BigQuery service account credentials are provided as a JSON file, so we don't need to stress about using environment variables or anything. Just make sure you keep this file somewhere safe, and don't check it into git!

If you don't already have a service account with BigQuery scopes, this is a nice resource that walks you through the steps to generate one.
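As an aside: if you'd rather keep even the file path out of your code, the client library can also discover credentials through the standard GOOGLE_APPLICATION_CREDENTIALS environment variable. A minimal sketch, assuming the same key file and project ID used below:

import os
from google.cloud import bigquery

# Point the standard environment variable at your key file (you could also
# export this in your shell instead of setting it in Python)
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/your/service-account-file.json'

# With the variable set, the client finds the credentials on its own
client = bigquery.Client(project='your-bigquery-project-id-here')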

Opening a connection to BigQuery

Now let's start working in Python. Open a new Python session (either in the terminal by running python or python3, or by opening your choice of Jupyter notebook or Jupyter notebook alternative).

We start by importing the BigQuery connector library that we just installed:

from google.cloud import bigquery
from google.oauth2 import service_account

Next, we'll open a connection to BigQuery. With this package, this means creating a client object that holds an authenticated session you can use to query data.

credentials = service_account.Credentials.from_service_account_file('path/to/your/service-account-file.json')

project_id = 'your-bigquery-project-id-here'
client = bigquery.Client(credentials=credentials, project=project_id)

If this all runs successfully, then congratulations: you've connected to BigQuery and are ready to run queries!
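If you want a quick sanity check before touching any real tables, a trivial query confirms the session works end to end:

# a throwaway query that exercises the full round trip
for row in client.query("SELECT 1 AS ok").result():
    print(row.ok)  # prints 1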

Running a query

BigQuery has some great public datasets we can use to test our connection, so you can copy/paste this example exactly. The pattern is pretty simple: you create a query job with client.query() and the SQL text you want to execute, and then fetch the result() from that job to run it.

query = client.query("""
    SELECT * FROM `bigquery-public-data.fda_food.food_events` LIMIT 1000
""")

results = query.result()
for row in results:
    print(row)

You can replace that fda_food query with any SQL query you want to run.
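Each row that comes back is a Row object supporting both dictionary-style and attribute-style access to its columns. Here's a small sketch, assuming column names like report_number and date_started from the public fda_food.food_events table:

results = client.query("""
    SELECT report_number, date_started
    FROM `bigquery-public-data.fda_food.food_events`
    LIMIT 10
""").result()

for row in results:
    # Row objects support both key and attribute access
    print(row['report_number'], row.date_started)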

But this isn't the best way to work with lots of data: you don't just want to look at the results, you want to do things with them! You could replace the print(row) command in that for loop to do something else... but there are easier ways to work with data in Python!

Reading data into a pandas DataFrame

Since most data analytics and data science projects use pandas to crunch data, what we really want is to get results from a BigQuery query into a pandas DataFrame. Luckily, the library we're using makes that really easy to do.

All you need to do to fetch the query results as a DataFrame is call to_dataframe() instead of result(). You may need to also install the db-dtypes Python package by running pip install db-dtypes.

query = client.query("""
    SELECT * FROM `bigquery-public-data.fda_food.food_events` LIMIT 1000
""")

result_df = query.to_dataframe()

This will assign the results of your query to a pandas DataFrame called result_df. You can print a sample of the results to make sure it worked:

result_df.head()
Reading data from BigQuery into a pandas DataFrame

And you're off! You can use this method to execute any BigQuery query and read the results directly into a pandas DataFrame. Now you can use any pandas functions or libraries from the greater Python ecosystem on your data, jumping into a complex statistical analysis, machine learning, geospatial analysis, or even modifying and writing data back to your data warehouse.
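For instance, a quick frequency count on one column of the sample we just pulled down is a pandas one-liner (assuming products_brand_name, one of the columns in food_events):

# tally the most common brand names in the sample
print(result_df['products_brand_name'].value_counts().head(10))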

Writing data back to BigQuery

Sometimes, an analysis isn't one way, and you want to put the results of your work back into BigQuery. If you've already configured the google-cloud-bigquery package as described above, it's not too much extra work to pipe data back into the warehouse.

To do this, you'll need to install one more package: pyarrow. The BigQuery connector uses it to serialize your DataFrame (as a Parquet file) before uploading it.

pip install pyarrow

Let's say we start with the results of that previous query, and just want to write them to another table in our own project.

from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file('path/to/your/service-account-file.json')

project_id = 'your-bigquery-project-id-here'
client = bigquery.Client(credentials=credentials, project=project_id)

query = client.query("""
    SELECT * FROM `bigquery-public-data.fda_food.food_events` LIMIT 1000
""")

result_df = query.to_dataframe()

## you could do any manipulation to the dataframe you want here

table_id = 'your_dataset_name.new_table_name'  # the dataset must already exist; the table is created if it doesn't

## optional: you can pass in a job_config with a schema if you want to define
## a specific schema
# job_config = bigquery.LoadJobConfig(schema=[
#     bigquery.SchemaField("field_name", "STRING"),
# ])

load_job = client.load_table_from_dataframe(
    result_df, table_id,
    # job_config=job_config
)  # this starts the load job

result = load_job.result()  # waits for the load job to finish
result.done()  # returns True once the job is complete
A successful DataFrame writeback to BigQuery

It's easy to verify the success of this writeback job by running a quick query against the newly created table. Here's a minimal sketch, reusing the client and table_id from above:
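# count the rows that landed in the new table
check = client.query(f"SELECT COUNT(*) AS n FROM `{table_id}`")
for row in check.result():
    print(row.n)  # should print 1000 for the query above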

Here's everything all together:

Putting it all together

This simple template can be repurposed to read or write any data on a Google BigQuery connection. For more (albeit a bit dense) information, you can check out the complete Google documentation on using the BigQuery API.

Happy querying!

PS: If you use Hex (hint: you're on our website right now!) you don't have to deal with any of these clients and credentials, and can just write SQL in a SQL cell and get the results as a DataFrame.

This is something we think a lot about at Hex, where we're creating a platform that makes it easy to build and share interactive data products that help teams be more impactful. If this is interesting, click below to get started, or to check out opportunities to join our team.