Four steps to read data from your warehouse directly into Python
Your local Python environment and a massive Snowflake cluster up in the cloud live in very different worlds. Learning to connect them and query data from Snowflake directly into your Python environment or Jupyter notebook means you can get whatever data you need right into the tool where you need it.
There are a few different ways to connect to and query Snowflake from Python. We'll cover the most common approach in depth, and briefly mention some alternative methods. We'll walk through:
1. Installing the Snowflake Connector for Python (snowflake-connector-python)
2. Authenticating and connecting to your Snowflake data warehouse
3. Running a query!

At this point, you've successfully connected to and queried Snowflake from Python, and you can write any query you want. Job done? Not quite. We'll also cover:

4. Reading data from a Snowflake query into a pandas DataFrame
If you're running through this live, it should only take you around 10 minutes to go from zero to successful query. Let's get into it!
There's nothing particularly special about the environment needed for this, so if you already know you have a working Python 3.6+ installation, you can skip to the next section, Installing the snowflake-connector-python package.
There are two likely scenarios here for how you're accessing Python:
a. You're using a Jupyter notebook or a Jupyter notebook alternative:
You should be pretty much good to go in this case. If you want to make sure, you can run this Python command in a cell and look for a response that's >= 3.6:
from platform import python_version
print(python_version())
b. You're using the terminal / command line / some other Python IDE directly:
Open up your terminal (Terminal app on Mac, Command Prompt or PowerShell on Windows, etc.) and run python --version to check what Python version is installed. Mine prints out Python 3.9.12, but as long as yours is 3.6 or greater, you're all set.
If you get a command not found: python error, or your output is Python 2.x, try running python3 --version. If that works, then you have separate installations of Python 3 and Python 2, and for the rest of this tutorial you'll need to substitute python3 for python and pip3 for pip in the example commands.
If python3 --version also does not work, then you don't have Python installed. The easiest way to get this working is to download the official installer for your machine.
PS: We won't go deep into the setup of virtual environments here, but if you're doing a lot of Python work directly at the command line, you'll want to read up on them. Virtualenv works well, and lots of people also like conda.
This step is very easy! It only takes one command to install the official Snowflake Python connector. We'll also install pandas at the same time, since we'll use it later.
pip install "snowflake-connector-python[pandas]"
Remember, if you had to run python3 --version earlier to get a working output, you need to run pip3 instead of pip to install the package. And if you're writing this in a Jupyter notebook, add a ! before pip to let this command run as a system call.
!pip install "snowflake-connector-python[pandas]"
This pip install command will spin through a bunch of messages and progress bars, and maybe a few warnings about your "pip version" or various package deprecations. This is OK, and unless anything actually says "ERROR" in red, you can probably ignore it.
Now we're ready to connect to the Snowflake data warehouse.
In this next part, we'll be working with sensitive information: your Snowflake authentication credentials. You shouldn't ever store these directly in code. You never know what you might unthinkingly do with that code: send it to a coworker for help, post it on Stack Overflow with a question, check it into a public git repo...
Instead of directly entering your user and password credentials, we'll use environment variables. This lets us reference the values from Python without directly storing them in the Python code, entrusting their safekeeping to our computer's environment.
In a terminal (not a Python session) run the following command, replacing the placeholders with your actual username and password:
export snowflakeuser=<your_snowflake_username>
export snowflakepass=<your_snowflake_password>
In a Jupyter notebook, you can also choose to set environment variables right from the notebook using the %env 'magic' command. Make sure to delete the cell after it's run, though, or else the credentials will still be sitting right there! Some online hosted Jupyter notebooks have their own secret variable managers, and you can use those instead.
%env snowflakeuser=<your_snowflake_username>
%env snowflakepass=<your_snowflake_password>
These commands won't print any kind of output or status response, but they will have saved these sensitive values to your current environment. We'll learn how to reference them in a little bit.
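Back in Python, a quick sanity check can confirm the variables actually made it into the environment. The check_env helper below is a made-up name of ours, not part of any library, but the os.environ lookups match the variable names exported above:

```python
import os

def check_env(*names):
    """Return the names of any environment variables that are missing or empty."""
    return [name for name in names if not os.environ.get(name)]

missing = check_env("snowflakeuser", "snowflakepass")
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Credentials found in the environment.")
```

If anything comes back missing, re-run the export (or %env) commands in the same session before continuing.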
Now let's start working in Python. Open a new Python session, either in the terminal by running python / python3, or by opening your choice of notebook tool.
We'll import the packages that we need to work with:
import pandas as pd
import os
import snowflake.connector
Now we can create a connection to Snowflake. You'll need a few other pieces of information about your Snowflake instance for this (shown as placeholders below), but these are not sensitive and don't need to be stored as environment variables.
ctx = snowflake.connector.connect(
user=os.environ['snowflakeuser'],
password=os.environ['snowflakepass'],
account='your-snowflake-account-identifier-here',
warehouse='your-warehouse-name',
role='your-role-name',
database='your-database-name',
schema='your-default-schema-name'
)
cs = ctx.cursor()
Not sure where to find this information?
Your account identifier is everything between https:// and .snowflakecomputing.com in your Snowflake web URL. If your URL is https://ab12345.us-east-2.aws.snowflakecomputing.com, then your account identifier is ab12345.us-east-2.aws.
You can find data warehouse, role, database, and schema information from the Context menu in a new Snowflake worksheet.
These commands establish a connection and open a "cursor" to your Snowflake data warehouse. Quick dictionary interlude:
ctx stands for "context", a common term used when accessing databases in Python. It refers to an object that holds "state" (aka context) about the current status of the database connection: if it's still authenticated, elapsed time since it was created, etc.
cs stands for "cursor", the standard terminology for an object used to actually access records in a database. Kind of like how you can "point" at things on the screen with your mouse cursor, this virtual cursor acts as a pointer to rows in your database and lets you point at and select them.
Also, notice how we're using the os.environ['snowflakepass'] syntax to reference the environment variable containing our password. This means if a bad actor gets their hands on this code, they still won't have anything sensitive.
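One more housekeeping note: cursors and connections hold open a session on the Snowflake side, so it's good practice to close them when you're done. Here's a minimal sketch of that pattern; run_query is a hypothetical helper name of ours, but the cursor(), execute(), fetchall(), and close() calls are standard DB API methods that the Snowflake connector implements:

```python
def run_query(conn, sql):
    """Execute a query on an open connection, closing the cursor afterwards."""
    cur = conn.cursor()
    try:
        cur.execute(sql)
        return cur.fetchall()
    finally:
        # Runs even if execute() raises, so cursors never leak.
        cur.close()
```

When your whole session is over, call close() on the connection object (ctx, in our examples) as well.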
If those commands ran successfully, we're all authenticated and ready to query!
First, we'll run a simple test query just to establish that our connection worked properly.
sql = "select 1"
cs.execute(sql)
first_row = cs.fetchone()
print(first_row[0])
If that returned 1, then congratulations: you've just run a query against your Snowflake data warehouse! You can replace "select 1" with any SQL query you want to run, and the cs object will contain the results.
Now that you're running a real query, you probably want to fetch more than one row from the results. In addition to fetchone(), the cursor object has a fetchall() method that helps iterate over the entire result set:
results = cs.fetchall()
for row in results:
    print(f"{row[0]}, {row[1]}")
But this isn't the best way to work with lots of data: you don't just want to look at the results, you want to do things with them! You could manually replace that print() line with something that dumps all the rows into a data structure you can use... or, as we'll explain now, you can use the fetch_pandas_all() method to do all that automatically!
For more information and other ways to fetch results from a cursor, dive into the official Snowflake docs.
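One pattern worth knowing from those docs: for result sets too large to hold in memory at once, the standard DB API fetchmany() method (which the Snowflake cursor also implements) lets you pull rows in fixed-size batches. A sketch, where iter_rows is a made-up helper name and the batch size of 1000 is arbitrary:

```python
def iter_rows(cursor, batch_size=1000):
    """Yield rows one at a time from an executed cursor, fetching in batches."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break  # an empty batch means the result set is exhausted
        yield from batch
```

This keeps at most batch_size rows in memory at a time while still letting you loop over every row in the result.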
Since most data analytics and data science projects use pandas to crunch data, what we really want is to get results from a Snowflake query into a pandas DataFrame. Luckily, the library we're using makes that really easy to do.
Let's use a real query instead of that SELECT 1, so we have some more data to work with. I've got a demo dataset with a table called DIM_CUSTOMERS, in a DB called ANALYTICS. You won't be able to run this query if you copy/paste it, so you'll want to replace it with a query against your own data.
If you get an error like Database 'your-db-name' does not exist or not authorized when running a query, double-check your ctx config from earlier and make sure it includes a role that is authorized to access the database and schema you're querying. Remember, if you create a new context, you then need to re-run the command to establish a new cursor.
sql = 'SELECT * FROM "ANALYTICS"."PROD"."DIM_CUSTOMERS"'
cs.execute(sql)
df = cs.fetch_pandas_all()
This will assign the results of your query to a pandas DataFrame called df. You can print a sample of the results to make sure it worked:
df.head()
And you're off to the races! You can use this method to execute any Snowflake query and read the results directly into a pandas DataFrame.
Once you've got data in this format, you can use any pandas functions or libraries from the greater Python ecosystem on your data, jumping into a complex statistical analysis, machine learning, geospatial analysis, or even modifying and writing data back to your data warehouse.
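As a small illustration of where you might go next, here's a first-pass aggregation on a DataFrame shaped like what fetch_pandas_all() returns. The column names (REGION, LIFETIME_VALUE) and the data are invented for this example; substitute columns from your own table:

```python
import pandas as pd

# Stand-in for the df returned by cs.fetch_pandas_all() -- invented data.
df = pd.DataFrame({
    "REGION": ["EMEA", "EMEA", "APAC"],
    "LIFETIME_VALUE": [120.0, 80.0, 200.0],
})

# Count customers and average lifetime value per region.
summary = df.groupby("REGION")["LIFETIME_VALUE"].agg(["count", "mean"])
print(summary)
```

From here, everything is ordinary pandas: the fact that the rows came from Snowflake no longer matters.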
Part of the joy (or chaos) of Python is that there's always more than one way to do something. There are lots of other options for connecting to Snowflake, but here are the most relevant alternatives:
SQLAlchemy: SQLAlchemy is a generic database toolkit for connecting to any database from Python. It's not specifically tailored to Snowflake, but works with it just fine. Check out their docs for detailed instructions on connecting SQLAlchemy to Snowflake.
pandas.read_sql: This isn't a replacement for the entire process, since you still have to create a Snowflake connection, but instead of running your SQL with a cursor and then using fetch_pandas_all, you could do it all from pandas directly. In my anecdotal experience, this is less performant than the other methods, though it still works fine.
df = pd.read_sql('SELECT * FROM "ANALYTICS"."PROD"."DIM_CUSTOMERS"', ctx)