Four steps to read data from your BigQuery warehouse directly into Python
In this post, you'll learn how to connect to a Google BigQuery warehouse with Python. There are a few different ways to do this, but we'll use the official Google Cloud Python Client (google-cloud-bigquery). We'll walk through:
1. Installing the Google Cloud BigQuery Python client (google-cloud-bigquery)
2. Authenticating and connecting to your BigQuery data warehouse
3. Running a query!

At this point, you've successfully connected to and queried BigQuery from Python, and you can write any query you want. Job done? Not quite. We'll also cover:

4. Reading data from a BigQuery query into a pandas DataFrame

Bonus: Writing a DataFrame back into a BigQuery table.
If you're running through this live, it should only take you around 10 minutes to go from zero to successful query. Let's get into it!
There's nothing particularly special about the environment needed for this, so if you already know you have a working Python 3.6+ installation, you can skip to the next section, Installing the google-cloud-bigquery package.
There are two likely scenarios here for how you're accessing Python:
a. You're using a Jupyter notebook or a Jupyter notebook alternative:
You should be pretty much good to go in this case. If you want to make sure, you can run this Python command in a cell and look for a response that's >= 3.6:
from platform import python_version
print(python_version())
b. You're using the terminal / command line / some other Python IDE directly:
Open up your terminal (Terminal app on Mac, Command Prompt or PowerShell on Windows, etc.)
Run python --version to check what Python version is installed. Mine prints out Python 3.9.12, but as long as yours is 3.6 or greater, you're perfect.
If you get a command not found: python error, or your output is Python 2.x, try running python3 --version. If that works, then you have separate installations of Python 3 and Python 2, and for the rest of this tutorial you'll need to substitute python3 for python and pip3 for pip in the example commands.
If python3 --version also does not work, then you don't have Python installed. The easiest way to get this working is to download the official installer for your machine.
PS: We won't go deep into the setup of virtual environments here, but if you're doing a lot of Python work directly at the command line, you'll want to read up on them. Virtualenv works well, and lots of people also like conda.
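If you do want to try one, here's a minimal sketch using Python's built-in venv module (the environment name bigquery-env is just a placeholder):

python -m venv bigquery-env
source bigquery-env/bin/activate

On Windows, the activation command is bigquery-env\Scripts\activate instead.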
This is probably the easiest step of the whole tutorial! You just have to run one command to install the official BigQuery Python Connector:
pip install google-cloud-bigquery
If you're using a Jupyter notebook, add a ! before pip to let this command run:
!pip install google-cloud-bigquery
And if you had to run python3 --version earlier to get a working output, you'll run pip3 instead of pip to install the package:
pip3 install google-cloud-bigquery
This pip install command will spin through a bunch of messages and progress bars, and maybe a few warnings about your "pip version" or various package deprecations. This is OK, and unless anything actually says "ERROR" in red, you can probably ignore it.
Now we're ready to connect to BigQuery.
In this next part, we'll be working with sensitive information: your BigQuery service account credentials. You shouldn't ever store these directly in code or check them into git. You never know what you might unthinkingly do with that code: send it to a coworker for help, post it on Stack Overflow with a question, check it into a public git repo...
BigQuery service account credentials are provided as a JSON file, so we don't need to stress about using environment variables or anything. Just make sure you keep this file somewhere safe, and don't check it into git!
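That said, if you'd rather keep even the file path out of your code, the client library also understands the standard GOOGLE_APPLICATION_CREDENTIALS environment variable. Here's a minimal sketch; in a real project you'd set the variable in your shell profile rather than in Python:

import os
from google.cloud import bigquery

# Point the library at your key file. Setting this in code is only for
# illustration; normally it lives in your shell environment instead.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your/service-account-file.json"

# With the variable set, the client picks up credentials automatically.
client = bigquery.Client()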
If you don't already have a service account with BigQuery scopes, this is a nice resource that walks you through the steps to generate one.
Now let's start working in Python. Open a new Python session (either in the terminal by running python or python3, or by opening your choice of Jupyter notebook or Jupyter notebook alternative).
We start by importing the BigQuery connector library that we just installed:
from google.cloud import bigquery
from google.oauth2 import service_account
Next, we'll open a connection to BigQuery. With this package, this means creating a client object that holds an authenticated session you can use to query data.
credentials = service_account.Credentials.from_service_account_file('path/to/your/service-account-file.json')
project_id = 'your-bigquery-project-id-here'
client = bigquery.Client(credentials=credentials, project=project_id)
If this all runs successfully, then congratulations: you've connected to BigQuery and are ready to run queries!
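If you want a quick sanity check before writing a real query, you can list the datasets the service account has access to (an empty list still means the connection works):

# List the dataset IDs visible to this client in the project.
for dataset in client.list_datasets():
    print(dataset.dataset_id)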
BigQuery has some great public datasets that we can use to test our connection, so you can copy/paste this, but it's pretty simple. You call client.query() with the SQL text you want to execute, and then fetch the result() from that query job to run it.
query = client.query("""
SELECT * FROM `bigquery-public-data.fda_food.food_events` LIMIT 1000
""")
results = query.result()
for row in results:
    print(row)
You can replace that fda_food query with any SQL query you want to run.
But this isn't the best way to work with lots of data: you don't just want to look at the results, you want to do things with them! You could replace the print(row) command in that for loop to do something else... but there are easier ways to work with data in Python!
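For example, here's a minimal sketch that collects the rows into plain Python dicts instead of printing them; each Row object supports mapping-style access, and calling result() again gives you a fresh iterator:

# Collect each row as a plain dict keyed by column name.
rows = [dict(row) for row in query.result()]
print(len(rows))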
Since most data analytics and data science projects use pandas to crunch data, what we really want is to get results from a BigQuery query into a pandas DataFrame. Luckily, the library we're using makes that really easy to do.
All you need to do to fetch the query results as a DataFrame is call to_dataframe() instead of result(). You may need to also install the db-dtypes Python package by running pip install db-dtypes.
query = client.query("""
SELECT * FROM `bigquery-public-data.fda_food.food_events` LIMIT 1000
""")
result_df = query.to_dataframe()
This will assign the results of your query to a pandas DataFrame called result_df. You can print a sample of the results to make sure it worked:
result_df.head()
And you're off! You can use this method to execute any BigQuery query and read the results directly into a pandas DataFrame. Now you can use any pandas functions or libraries from the greater Python ecosystem on your data, jumping into a complex statistical analysis, machine learning, geospatial analysis, or even modifying and writing data back to your data warehouse.
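As one small example, a single line gets you a frequency count; products_brand_name is assumed here to be one of the columns in the fda_food.food_events table:

# Count the ten most common brand names in the sample we pulled.
print(result_df["products_brand_name"].value_counts().head(10))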
Sometimes, an analysis isn't one way, and you want to put the results of your work back into BigQuery. If you've already configured the google-cloud-bigquery package as described above, it's not too much extra work to pipe data back into the warehouse.
To do this, you'll need to install one more package: pyarrow. The BigQuery connector uses it to serialize a DataFrame (as a Parquet file) before uploading it to the warehouse.
pip install pyarrow
Let's say we start with the results of that previous query, and just want to write them to another table in our own project.
from google.cloud import bigquery
from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_file('path/to/your/service-account-file.json')
project_id = 'your-bigquery-project-id-here'
client = bigquery.Client(credentials=credentials, project=project_id)
query = client.query("""
SELECT * FROM `bigquery-public-data.fda_food.food_events` LIMIT 1000
""")
result_df = query.to_dataframe()
## you could do any manipulation to the dataframe you want here
table_id = 'your_dataset_name.new_table_name'
## optional: you can pass in a job_config with a schema if you want to define
## a specific schema
# job_config = bigquery.LoadJobConfig(schema=[
# bigquery.SchemaField("field_name", "STRING"),
# ])
load_job = client.load_table_from_dataframe(
    result_df, table_id,
    # job_config=job_config
)  # this will execute the load job
result = load_job.result()
result.done() # returns True once the job has finished
The snippet above puts everything together, from authentication to writeback. It's easy to verify the success of the load job by running a quick query against the newly created table.
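Here's a minimal sketch of that check, reusing the placeholder table name from above:

# Count the rows that landed in the new table; with the LIMIT 1000 query
# above, this should print 1000.
check = client.query("SELECT COUNT(*) AS row_count FROM your_dataset_name.new_table_name")
for row in check.result():
    print(row.row_count)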
This simple template can be repurposed to read or write any data on a Google BigQuery connection. For more (albeit a bit dense) information, you can check out the complete Google documentation on using the BigQuery API.
Happy querying!
PS: If you use Hex (hint: you're on our website right now!) you don't have to deal with any of this client and credentials setup, and can just write SQL in a SQL cell and get the results as a DataFrame.