Data Challenge Lab Home

Google sheets [wrangle]

(Builds on: Parsing basics)

# Libraries
library(tidyverse)
library(googlesheets)

# Parameters
  # URL for Gapminder example
url_gapminder <- "https://docs.google.com/spreadsheets/d/1BzfL0kZUz1TsI5zxJF1WNF01IxvC67FbOJUiiGMZ_mQ/"

In this reading, we’ll show you how to use the googlesheets package by Jenny Bryan to (surprise!) extract data from Google sheets. Google sheets are a surprisingly useful way to collect and collaboratively work with data, and the googlesheets package makes it easy to pull that data into R. One useful workflow involves using Google forms to collect form data into a Google sheet, and then using the googlesheets package to extract and analyze that data in R.

Public sheets

Some Google sheets are public, which means that anyone can read them. Take a look at this example of data from Gapminder.

Each Google sheet has a sheet key. You’ll need this key to load a sheet with googlesheets. Here’s how to get the sheet key from a sheet’s URL.

sheet_key <- extract_key_from_url(url_gapminder)

sheet_key
## [1] "1BzfL0kZUz1TsI5zxJF1WNF01IxvC67FbOJUiiGMZ_mQ"

Once you have the sheet key, you can use it to create a googlesheets object.

gs <- gs_key(sheet_key)

class(gs)
## [1] "googlesheet" "list"

You can also list all the worksheets in a single Google sheet with gs_ws_ls().

gs_ws_ls(gs)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

To read in a worksheet, use gs_read(). gs_read() takes a googlesheets object and, optionally, the name of a specific worksheet.

asia <- 
  gs %>% 
  gs_read(ws = "Asia")
## Accessing worksheet titled 'Asia'.

## Parsed with column specification:
## cols(
##   country = col_character(),
##   continent = col_character(),
##   year = col_double(),
##   lifeExp = col_double(),
##   pop = col_double(),
##   gdpPercap = col_double()
## )
asia
## # A tibble: 396 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 386 more rows

Private sheets

Accessing private sheets requires you to authenticate to Google. Authentication is done with this command.

# Give googlesheets permission to access spreadsheet
gs_auth()

A page will open in your browser and you’ll be prompted to log into Google. Once you’ve logged in, googlesheets will create a file called .httr-oauth in your current directory. NEVER CHECK THIS INTO GIT OR UPLOAD IT TO GITHUB. To ensure that you don’t actually push your .httr-oauth, gs_auth() will add .httr-oauth to the .gitignore file in your current directory.

The .httr-oauth file allows you to avoid logging into Google in the future. You don’t upload this file to GitHub because if someone else had it, they could use it to access your Google files.

A common error you’ll encounter involves a googlesheets function being unable find the .httr-oauth file. If you’re using an RStudio project, your working directory is usually the top level of the project, and so gs_auth() will put .httr-oauth in the top-level of the directory. However, you’ll often be working with an R Markdown document or R script located in a subfolder, and so the googlesheets function will look for .httr-oauth in that subfolder. To avoid this problem, you can copy .httr-oauth over to your subfolder, so that you have one .httr-oauth the top level of the project and one in your subfolder.

Once authenticated into Google, you now need to find the sheet key for the relevant Google sheet. Like we did earlier, you can find the sheet’s URl and then call extract_key_from_url() like we did earlier.

With private sheets, another option is to use gs_ls() to see all the sheets to which you have access. gs_ls() orders by modification time, with the most recently modified sheets first.

gs_ls() %>% view()

Locate the relevant sheet and copy its sheet key. Create a variable for the key in your parameters (e.g., sheet_key <- "XdkfjOkdjk"), and then read in the sheet using

df <- 
  gs_key(sheet_key) %>%
  gs_read()