Google Sheets
Google Sheets is a cloud-based spreadsheet application offered by Google as part of its Google Workspace suite.
This Google Sheets dlt
verified source and
pipeline example
loads data using “Google Sheets API” to the destination of your choice.
Sources and resources that can be loaded using this verified source are:
Name | Description |
---|---|
google_spreadsheet | Retrieves data from a Google Spreadsheet |
range_names | Processes the range and yields data from each range |
spreadsheet_info | Information about the spreadsheet and the ranges processed |
Setup Guide
Grab credentials
There are two methods to get authenticated for using this verified source:
- OAuth credentials
- Service account credential
Here we'll discuss how to set up both OAuth tokens and service account credentials. In general, OAuth tokens are preferred when user consent is required, while service account credentials are better suited for server-to-server interactions. Here we recommend using service account credentials. You can choose the method of authentication as per your requirement.
Google service account credentials
You need to create a GCP service account to get API credentials if you don't have one. To create one, follow these steps:
Sign in to console.cloud.google.com.
Create a service account if needed.
Enable "Google Sheets API", refer Google documentation for comprehensive instructions on this process.
Generate credentials:
- Navigate to IAM & Admin in the console's left panel, and then select Service Accounts.
- Identify the service account you intend to use, and click on the three-dot menu under the "Actions" column next to it.
- Create a new JSON key by selecting "Manage Keys" > "ADD KEY" > "CREATE".
- You can download the ".json" file containing the necessary credentials for future use.
Google OAuth credentials
You need to create a GCP account to get OAuth credentials if you don't have one. To create one, follow these steps:
Open a GCP project in your GCP account.
Enable the Sheets API in the project.
Search credentials in the search bar and go to Credentials.
Go to Credentials -> OAuth client ID -> Select Desktop App from the Application type and give an appropriate name.
Download the credentials and fill "client_id", "client_secret" and "project_id" in "secrets.toml".
Go back to credentials and select the OAuth consent screen on the left.
Fill in the App name, user support email(your email), authorized domain (localhost.com), and dev contact info (your email again).
Add the following scope:
"https://www.googleapis.com/auth/spreadsheets.readonly"
Add your email as a test user.
Generate
refresh_token
:After configuring "client_id", "client_secret" and "project_id" in "secrets.toml". To generate the refresh token, run the following script from the root folder:
python google_sheets/setup_script_gcp_oauth.py
Once you have executed the script and completed the authentication, you will receive a "refresh token" that can be used to set up the ".dlt/secrets.toml".
Prepare your data
Share Google Sheet with the email
Note: For service account authentication, use the client_email. For OAuth authentication, use the email associated with the app creation and refresh token generation.
To allow the API to access the Google Sheet, open the sheet that you wish to use and do the following:
Select the share button in the top left corner.
In Add people and groups, add the "client_email" or "user_email" with at least viewer privileges.
Provide the spreadsheet ID/URL and explicit range names
When setting up the pipeline, you can use either the browser-copied URL of your spreadsheet:
https://docs.google.com/spreadsheets/d/1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4/edit?usp=sharing
or spreadsheet id (which is a part of the url)
1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4
typically you pass it directly to the google_spreadsheet function or in config.toml as defined here.
You can provide specific ranges to google_spreadsheet
pipeline, as detailed in following.
Guidelines about headers
Make sure your data has headers and is in the form of well-structured table.
The first row of any extracted range should contain headers. Please make sure:
- The header names are strings and are unique.
- All the columns that you intend to extract have a header.
- The data starts exactly at the origin of the range - otherwise a source will remove padding, but it
is a waste of resources.
When a source detects any problems with headers or table layout, it will issue a WARNING in the log. Hence, we advise running your pipeline script manually/locally and fixing all the problems.
- Columns without headers will be removed and not extracted.
- Columns with headers that do not contain any data will be removed.
- If there are any problems with reading headers (i.e. header is not string or is empty or not unique): the headers row will be extracted as data and automatic header names will be used.
- Empty rows are ignored
dlt
will normalize range names and headers into table and column names - so they may be different in the database than in Google Sheets. Prefer small cap names without special characters.
Guidelines about named ranges
We recommend to use Named Ranges to indicate which data should be extracted from a particular spreadsheet, and this is how this source will work by default - when called without setting any other options. All the named ranges will be converted into tables, named after them and stored in the destination.
You can let the spreadsheet users add and remove tables by just adding/removing the ranges, you do not need to configure the pipeline again.
You can indicate exactly the fragments of interest, and only this data will be retrieved, so it is the fastest.
You can name database tables by changing the range names.
In range_names, you can enter as follows:
range_names = ["Range_1","Range_2","Sheet1!A1:D10"]
You can pass explicit ranges to the Google Spreadsheet "ranged_names" as:
Name Example Sheet names ["Sheet1","Sheet2","custom_sheet_name"] Named ranges ["range_name1","range_name2"] Any range ["Sheet1!A1:B7","Sheet2!B3:E15"]
If you are not happy with the workflow above, you can:
Disable it by setting
get_named_ranges
option toFalse
.Enable retrieving all sheets/tabs with get_sheets option set to
True
.Pass a list of ranges as supported by Google Sheets in range_names.
Note: To retrieve all named ranges with "get_named_ranges" or all sheets with "get_sheets" methods, pass an empty
range_names
list asrange_names = []
. Even when you use a set "get_named_ranges" to false pass the range_names as an empty list to get all the sheets with "get_sheets" method.
Initialize the verified source
To get started with your data pipeline, follow these steps:
Enter the following command:
dlt init google_sheets duckdb
This command will initialize the pipeline example with Google Sheets as the source and duckdb as the destination.
If you'd like to use a different destination, simply replace
duckdb
with the name of your preferred destination.After running this command, a new directory will be created with the necessary files and configuration settings to get started.
For more information, read the guide on how to add a verified source.
Add credentials
In the
.dlt
folder, there's a file calledsecrets.toml
. It's where you store sensitive information securely, like access tokens. Keep this file safe. Here's its format for service account authentication:[sources.google_sheets.credentials] ##CHECK IT
project_id = "project_id" # please set me up!
client_email = "client_email" # please set me up!
private_key = "private_key" # please set me up!From the ".json" that you downloaded earlier, copy
project_id
,private_key
, andclient_email
under[sources.google_sheets.credentials]
.Alternatively, if you're using OAuth credentials, replace the fields and values with those you grabbed for OAuth credentials.
The secrets.toml for OAuth authentication looks like:
[sources.google_sheets.credentials] ##CHECK IT
client_id = "client_id" # please set me up!
client_secret = "client_secret" # please set me up!
refresh_token = "refresh_token" # please set me up!
project_id = "project_id" # please set me up!Finally, enter credentials for your chosen destination as per the docs.
Next you need to configure ".dlt/config.toml", which looks like:
[sources.google_sheets]
spreadsheet_url_or_id = "Please set me up!"
range_names = ["Please set me up!"]In range_names, you can enter values as discussed in Guidelines about named ranges.
Provide the spreadsheet URL or just its ID as the identifier. E.g., use either the full link:
spreadsheet_identifier = "https://docs.google.com/spreadsheets/d/1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4/edit?usp=sharing"
or spreadsheet id (which is a part of the url)
spreadsheet_identifier="1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4"
Note: You have an option to pass "range_names" and "spreadsheet_identifier" directly to the google_spreadsheet function or in ".dlt/config.toml"
For more information, read the General Usage: Credentials.
Run the pipeline
Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command:
pip install -r requirements.txt
You're now ready to run the pipeline! To get started, run the following command:
python google_sheets_pipeline.py
Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command:
dlt pipeline <pipeline_name> show
For example, the
pipeline_name
for the above pipeline example isgoogle_sheets_pipeline
, you may also use any custom name instead.
For more information, read the guide on how to run a pipeline.
Data types
The dlt
normalizer uses the first row of data to infer types and attempts to coerce subsequent rows, creating variant columns if unsuccessful. This is standard behavior.
If dlt
did not correctly determine the data type in the column, or you want to change the data type for other reasons,
then you can provide a type hint for the affected column in the resource.
Also, since recently dlt
's no longer recognizing date and time types, so you have to designate it yourself as timestamp
.
Use the apply_hints
method on the resource to achieve this.
Here's how you can do it:
for resource in resources:
resource.apply_hints(columns={
"total_amount": {"data_type": "double"},
"date": {"data_type": "timestamp"},
})
In this example, the total_amount
column is enforced to be of type double and date
is enforced to be of type timestamp.
This will ensure that all values in the total_amount
column are treated as double
, regardless of whether they are integers or decimals in the original Google Sheets data.
And date
column will be represented as dates, not integers.
For a single resource (e.g. Sheet1
), you can simply use:
source.Sheet1.apply_hints(columns={
"total_amount": {"data_type": "double"},
"date": {"data_type": "timestamp"},
})
To get the name of resources, you can use:
print(source.resources.keys())
To read more about tables, columns, and datatypes, please refer to our documentation here.
dlt
will not modify tables after they are created.
So if you changed data types with hints,
then you need to delete the dataset
or set dev_mode=True
.
Sources and resources
dlt
works on the principle of sources and
resources.
Source google_spreadsheet
This function loads data from a Google Spreadsheet. It retrieves data from all specified ranges, whether explicitly defined or named, and obtains metadata for the first two rows within each range.
def google_spreadsheet(
spreadsheet_url_or_id: str = dlt.config.value,
range_names: Sequence[str] = dlt.config.value,
credentials: Union[
GcpOAuthCredentials, GcpServiceAccountCredentials
] = dlt.secrets.value,
get_sheets: bool = False,
get_named_ranges: bool = True,
) -> Iterable[DltResource]:
...
spreadsheet_url_or_id
: ID or URL of the Google Spreadsheet.
range_names
: List of ranges (in Google Sheets format) to be transformed into tables.
credentials
: GCP credentials with Google Sheets API access.
get_sheets
: If True, imports all spreadsheet sheets into the database.
get_named_ranges
: If True, imports either all named ranges or those
specified into the database.
Resource range_names
This function processes each range name provided by the source function, loading its data into separate tables in the destination.
dlt.resource(
process_range(rows_data, headers=headers, data_types=data_types),
name=name,
write_disposition="replace",
)
process_range
: Function handles rows from a specified Google Spreadsheet range, taking data rows,
headers, and data types as arguments.
name
: Specifies the table's name, derived from the spreadsheet range.
write_disposition
: Dictates how data is loaded to the destination.
Please Note:
- Empty rows are ignored.
- Empty cells are converted to None (and then to NULL by dlt).
- Data in columns without headers will be dropped.
Resource spreadsheet_info
This resource loads the info about the sheets and range names into the destination as a table. This table refreshes after each load, storing information on loaded ranges:
- Spreadsheet ID and title.
- Range name as given to the source.
- String and parsed representation of the loaded range.
dlt.resource(
metadata_table,
write_disposition="merge",
name="spreadsheet_info",
merge_key="spreadsheet_id",
)
metadata_table
: Contains metadata about the spreadsheet and the ranges processed.
name
: Denotes the table name, set here as "spreadsheet_info".
write_disposition
: Dictates how data is loaded to the destination.
Read more.
merge_key
: Parameter is used to specify the column used to identify records for merging. In this
case,"spreadsheet_id", means that the records will be merged based on the values in this column.
Read more.
Customization
Create your own pipeline
If you wish to create your own pipelines, you can leverage source and resource methods from this verified source.
Configure the pipeline by specifying the pipeline name, destination, and dataset as follows:
pipeline = dlt.pipeline(
pipeline_name="google_sheets", # Use a custom name if desired
destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post)
dataset_name="google_spreadsheet_data" # Use a custom name if desired
)To load data from explicit range names:
load_data = google_spreadsheet(
"https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL
range_names=["range_name1", "range_name2"], # Range names
get_sheets=False,
get_named_ranges=False,
)
load_info = pipeline.run(load_data)
print(load_info)Note: You can pass the URL or spreadsheet ID and range names explicitly or in ".dlt/config.toml".
To load all the range_names from spreadsheet:
load_data = google_spreadsheet(
"https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL
get_sheets=False,
get_named_ranges=True,
)
load_info = pipeline.run(load_data)
print(load_info)Pass an empty list to range_names in ".dlt/config.toml" to retrieve all range names.
To load all the sheets from spreadsheet:
load_data = google_spreadsheet(
"https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL
get_sheets=True,
get_named_ranges=False,
)
load_info = pipeline.run(load_data)
print(load_info)Pass an empty list to range_names in ".dlt/config.toml" to retrieve all sheets.
To load all the sheets and range_names:
load_data = google_spreadsheet(
"https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL
get_sheets=True,
get_named_ranges=True,
)
load_info = pipeline.run(load_data)
print(load_info)Pass an empty list to range_names in ".dlt/config.toml" to retrieve all sheets and range names.
To load data from multiple spreadsheets:
load_data1 = google_spreadsheet(
"https://docs.google.com/spreadsheets/d/43lkHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL
range_names=["Sheet 1!A1:B10"],
get_named_ranges=False,
)
load_data2 = google_spreadsheet(
"https://docs.google.com/spreadsheets/d/3jo4HjqouQnnCIZAFa2rL6vT91YRN8aIhts22SKKO390/edit#gid=0", #Spreadsheet URL
range_names=["Sheet 1!B1:C10"],
get_named_ranges=True,
)
load_info = pipeline.run([load_data1,load_data2])
print(load_info)To load with table rename:
load_data = google_spreadsheet(
"https://docs.google.com/spreadsheets/d/43lkHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL
range_names=["Sheet 1!A1:B10"],
get_named_ranges=False,
)
data.resources["Sheet 1!A1:B10"].apply_hints(table_name="loaded_data_1")
load_info = pipeline.run(load_data)
print(load_info)
Using Airflow with Google Spreadsheets:
Consider the following when using Google Spreadsheets with Airflow:
Efficient Data Retrieval
- Our source fetches all required data with just two API calls, regardless of the number of specified data ranges. This allows for swift data loading from google_spreadsheet before executing the pipeline.
Airflow Specificity
- With Airflow, data source creation and execution are distinct processes.
- If your execution environment (runner) is on a different machine, this might cause the data to be loaded twice, leading to inefficiencies.
Airflow Helper Caution
- Avoid using
scc decomposition
because it unnecessarily creates a new source instance for every specified data range. This is not efficient and can cause redundant tasks.
Recommended Airflow Deployment
Below is the correct way to set up an Airflow DAG for this purpose:
Define a DAG to run daily, starting from say February 1, 2023. It avoids catching up for missed runs and ensures only one instance runs at a time.
Data is imported from Google Spreadsheets and directed BigQuery.
When adding the Google Spreadsheet task to the pipeline, avoid decomposing it; run it as a single task for efficiency.
@dag(
schedule_interval='@daily',
start_date=pendulum.datetime(2023, 2, 1),
catchup=False,
max_active_runs=1,
default_args=default_task_args
)
def get_named_ranges():
tasks = PipelineTasksGroup("get_named_ranges", use_data_folder=False, wipe_local_data=True)
# import your source from pipeline script
from google_sheets import google_spreadsheet
pipeline = dlt.pipeline(
pipeline_name="get_named_ranges",
dataset_name="named_ranges_data",
destination='bigquery',
)
# do not use decompose to run `google_spreadsheet` in single task
tasks.add_run(pipeline, google_spreadsheet("1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580"), decompose="none", trigger_rule="all_done", retries=0, provide_context=True)
Additional Setup guides
- Load data from Google Sheets to BigQuery in python with dlt
- Load data from Google Sheets to Azure Cosmos DB in python with dlt
- Load data from Google Sheets to AWS S3 in python with dlt
- Load data from Google Sheets to AWS Athena in python with dlt
- Load data from Google Sheets to Dremio in python with dlt
- Load data from Google Sheets to Neon Serverless Postgres in python with dlt
- Load data from Google Sheets to Timescale in python with dlt
- Load data from Google Sheets to MotherDuck in python with dlt
- Load data from Google Sheets to DuckDB in python with dlt
- Load data from Google Sheets to Microsoft SQL Server in python with dlt