Read/write data from/to Google BigQuery with Dask.
This package uses the BigQuery Storage API. Please refer to the data extraction pricing table for associated costs while using Dask-BigQuery.
dask-bigquery can be installed with pip:
pip install dask-bigquery
or with conda:
conda install -c conda-forge dask-bigquery
For reading from BigQuery, you need the following roles to be enabled on the account:
BigQuery Read Session User
BigQuery Data Viewer, BigQuery Data Editor, or BigQuery Data Owner
Alternatively, BigQuery Admin would give you full access to sessions and data.
For writing to BigQuery, the following roles are sufficient:
BigQuery Data Editor
Storage Object Creator
The minimal permissions to cover reading and writing:
BigQuery Data Editor
BigQuery Read Session User
Storage Object Creator
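The role requirements above can be expressed as a small check. This is only a sketch: can_read and can_write are hypothetical helpers, not part of dask-bigquery. Roles are matched by the display names listed above, and the write check encodes only the combination the text describes as sufficient.

```python
# Role display names as listed above; can_read/can_write are hypothetical helpers.
READ_SESSION = "BigQuery Read Session User"
DATA_ROLES = {"BigQuery Data Viewer", "BigQuery Data Editor", "BigQuery Data Owner"}
ADMIN = "BigQuery Admin"
WRITE_ROLES = {"BigQuery Data Editor", "Storage Object Creator"}


def can_read(roles):
    """True if the roles match one of the read combinations described above."""
    roles = set(roles)
    if ADMIN in roles:
        return True  # described as full access to sessions and data
    # Otherwise: the session role plus at least one of the data roles.
    return READ_SESSION in roles and bool(roles & DATA_ROLES)


def can_write(roles):
    """True if the roles include the combination listed as sufficient for writing."""
    return WRITE_ROLES <= set(roles)


# The minimal set listed above for both reading and writing.
minimal = {"BigQuery Data Editor", "BigQuery Read Session User", "Storage Object Creator"}
print(can_read(minimal), can_write(minimal))
```

The minimal set works because BigQuery Data Editor counts toward both the read and the write requirement.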
By default, dask-bigquery will use the Application Default Credentials. When running code locally, you can set these to your user credentials by running:
$ gcloud auth application-default login
User credentials require interactive login. For settings where this isn't possible, you'll need to create a service account. You can set the Application Default Credentials to the service account key using the GOOGLE_APPLICATION_CREDENTIALS environment variable:
$ export GOOGLE_APPLICATION_CREDENTIALS=/home/<username>/google.json
For information on obtaining the credentials, see the Google API documentation.
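Before starting a job, it can help to confirm that GOOGLE_APPLICATION_CREDENTIALS points at a readable key file. The following is a minimal sketch using only the standard library. The key_file_type helper is hypothetical, and it assumes the key is a JSON file with a "type" field, as service account keys normally have.

```python
import json
import os
from pathlib import Path


def key_file_type(environ=os.environ):
    """Return the "type" field of the configured key file, or None if the variable is unset."""
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        # Nothing configured here; other credential sources (such as the
        # gcloud user login shown above) may still apply.
        return None
    key_path = Path(path).expanduser()
    if not key_path.is_file():
        raise FileNotFoundError(
            f"GOOGLE_APPLICATION_CREDENTIALS points to a missing file: {key_path}"
        )
    with key_path.open() as f:
        info = json.load(f)
    return info.get("type")
```

Calling key_file_type() with no arguments inspects the current environment. Passing a plain dict instead makes the check easy to exercise without changing real settings.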
dask-bigquery assumes that you are already authenticated.
import dask_bigquery

ddf = dask_bigquery.read_gbq(
    project_id="your_project_id",
    dataset_id="your_dataset",
    table_id="your_table",
)

ddf.head()

Assuming that client and workers are already provisioned with default credentials:
import dask
import dask_bigquery

ddf = dask.datasets.timeseries(freq="1min")

res = dask_bigquery.to_gbq(
    ddf,
    project_id="my_project_id",
    dataset_id="my_dataset_id",
    table_id="my_table_name",
)

Before loading data into BigQuery, to_gbq writes intermediary Parquet files to a Google Cloud Storage bucket. The default bucket name is <your_project_id>-dask-bigquery. You can provide a different bucket name by setting the parameter bucket="my-gs-bucket". After the job is done, the intermediary data is deleted.
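The bucket naming rule can be sketched as follows. staging_bucket is a hypothetical helper written for illustration, not a dask-bigquery function. It only mirrors the default described above.

```python
def staging_bucket(project_id, bucket=None):
    """Return the bucket name described above for the intermediary Parquet files."""
    if bucket is not None:
        return bucket  # an explicit bucket="..." takes precedence
    return f"{project_id}-dask-bigquery"  # default derived from the project ID


print(staging_bucket("my_project_id"))                         # my_project_id-dask-bigquery
print(staging_bucket("my_project_id", bucket="my-gs-bucket"))  # my-gs-bucket
```

Since the account needs the Storage Object Creator role to write there, it is worth knowing which bucket a job will target before running it.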
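To use explicit service account credentials instead of the defaults, pass them through the credentials parameter:

from google.oauth2.service_account import Credentials

# service account credentials (contents of the key file)
creds_dict = {"type": ..., "project_id": ..., "private_key_id": ...}
# build a credentials object from the key contents
credentials = Credentials.from_service_account_info(creds_dict)

res = dask_bigquery.to_gbq(
    ddf,
    project_id="my_project_id",
    dataset_id="my_dataset_id",
    table_id="my_table_name",
    credentials=credentials,
)

To run the tests locally, you need to be authenticated and have a project created on that account. If you're using a service account, select the "BigQuery Admin" role in the "Grant this service account access to project" section when you create it.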
You can run the tests with
$ pytest dask_bigquery
if your default gcloud project is set, or manually specify the project ID with
$ DASK_BIGQUERY_PROJECT_ID=<your_project_id> pytest dask_bigquery
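The choice between the two invocations comes down to where the project ID comes from. Here is a minimal sketch of that lookup, assuming the environment variable takes precedence over the default project. resolve_project_id is a hypothetical helper, not the test suite's actual code.

```python
import os


def resolve_project_id(default_project=None, environ=os.environ):
    """Pick the project ID: DASK_BIGQUERY_PROJECT_ID if set, else the default project."""
    project_id = environ.get("DASK_BIGQUERY_PROJECT_ID") or default_project
    if not project_id:
        raise RuntimeError(
            "Set DASK_BIGQUERY_PROJECT_ID or configure a default gcloud project"
        )
    return project_id
```

Failing early with a clear message when neither source is available is easier to diagnose than a test run that fails partway through.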
This project stems from the discussion in this Dask issue and this initial implementation developed by Brett Naul, Jacob Hayes, and Steven Soojin Kim.