library(rvest) # Used for web scraping and extracting HTML content
library(xml2) # Helps in working with XML and HTML data
library(tidyverse) # Core packages for data manipulation and visualization
Starbucks
Web scraping
Web scraping consists of extracting information from a web page. The difficulty or ease of extracting this information depends on how well the page is constructed. In more complex cases, the information may be behind a captcha or within an interactive panel that depends on user input.
In this simple example, I will show how to find the locations of all Starbucks stores in Brazil. The full list of active Starbucks stores can be found on the Starbucks Brasil website. As usual, we will use the tidyverse along with the rvest and xml2 packages.
The website
The full list of active Starbucks stores can be found on the Starbucks Brasil website. To read the page, we use read_html.
= "https://starbucks.com.br/lojas"
url = xml2::read_html(url) # Fetches the HTML content of the page page
The “xpath” shows the path to a specific element on the page. For example, to find the Starbucks logo in the top-left corner of the page, we can use the following code:
# Extracts the HTML element for the logo image
page %>%
  html_element(xpath = "/html/body/div[1]/div[1]/header/nav/div/div[1]/a/img")
{html_node}
<img alt="Starbucks Logo" src="/public/img/icons/starbucks-nav-logo.svg">
To learn more about xpaths, you can consult this cheatsheet.
In general, on well-constructed pages, the names of elements are fairly self-explanatory. In the case above, the "alt" attribute already indicates that the object is the Starbucks logo, and the "src" points to an SVG image file called starbucks-nav-logo. Unfortunately, this won't always be the case. On some pages, elements can be quite confusing.
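When an element has a distinctive attribute like this, the xpath can often be shortened. As a small sketch (not from the original post), the logo could also be located by matching its "alt" attribute directly:

# Hypothetical alternative: select the logo by its "alt" attribute instead of the full path
page %>%
  html_element(xpath = '//img[@alt="Starbucks Logo"]')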
To extract a specific attribute, we use the html_attr function.
page %>%
  html_element(
    xpath = "/html/body/div[1]/div[1]/header/nav/div/div[1]/a/img"
  ) %>%
  # Extracts the "src" attribute (URL to the image)
  html_attr("src")
[1] "/public/img/icons/starbucks-nav-logo.svg"
If you combine this last link with "www.starbucks.com.br", you should arrive at an image of the company's logo¹.
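This combination can also be done in code. A minimal sketch, where logo_src is just an illustrative name for the "src" attribute extracted above:

# Resolve the relative "src" path against the site's base URL
logo_src <- page %>%
  html_element(xpath = "/html/body/div[1]/div[1]/header/nav/div/div[1]/a/img") %>%
  html_attr("src")

xml2::url_absolute(logo_src, "https://www.starbucks.com.br")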
To find the big list of stores in the left panel, we will take advantage of the fact that the div holding this list has a unique class called "place-list". It's easy to verify this directly in your browser. If you use Chrome, for instance, just right-click on the panel and click on Inspect.
As I mentioned above, things aren't always well organized. Note that since we want to extract multiple elements and multiple (all) attributes, we use the variants html_elements and html_attrs.
list_attr <- page %>%
  # Selects all divs under "place-list" that hold store info
  html_elements(xpath = '//div[@class="place-list"]/div') %>%
  # Extracts all attributes of the selected elements
  html_attrs()
The extracted object is a list where each element is a character vector containing the store name, latitude/longitude, and address.
# Extracts the first store's information from the list
pluck(list_attr, 1)
class data-latitude data-longitude
"place-item r-place-15" "-23.5658059" "-46.6508012"
data-name data-street data-index
"Shopping Top Center" "Avenida Paulista, 854" "0"
At this point, the web scraping process is complete. Once again, the process was easy because the data is well structured on the Starbucks page. Now, we just need to clean the data.
Data Cleaning
I won’t go in depth about the data cleaning process. Basically, we need to convert each element of the list into a data.frame, stack the results, and then adjust the data types of each column.
# Convert the elements into data.frame
dat <- map(list_attr, \(x) as.data.frame(t(x)))
# Stack the results
dat <- bind_rows(dat)
clean_dat <- dat %>%
  as_tibble() %>%
  # Rename the columns
  rename_with(~str_remove(.x, "data-")) %>%
  rename(lat = latitude, lng = longitude) %>%
  # Select the columns of interest
  select(index, name, street, lat, lng) %>%
  # Convert lat/lng to numeric
  mutate(
    lat = as.numeric(lat),
    lng = as.numeric(lng),
    index = as.numeric(index),
    name = str_trim(name)
  )
Map
The table above is already in a pretty satisfactory format. We can check the data by building a simple map.
library(sf)
library(leaflet)
starbucks <- st_as_sf(clean_dat, coords = c("lng", "lat"), crs = 4326, remove = FALSE)

leaflet(starbucks) %>%
  addTiles() %>%
  addMarkers(label = ~name) %>%
  addProviderTiles("CartoDB") %>%
  setView(lng = -46.65590, lat = -23.561197, zoom = 12)
It’s worth noting that data extracted via web scraping almost always contains some noise. In this case, the data seems relatively clean after a bit of processing. The addresses are not always very informative, like in the case of “Rodovia Hélio Smidt, S/N,” but this happens because many stores are located inside hospitals, shopping malls, or airports.
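As a quick sanity check (a sketch, not part of the original workflow), one could confirm that no coordinates are missing or fall outside a rough bounding box for Brazil:

# Rough sanity check on the scraped coordinates (bounding box values are approximate)
clean_dat %>%
  summarise(
    missing_lat = sum(is.na(lat)),
    missing_lng = sum(is.na(lng)),
    outside_brazil = sum(lat > 6 | lat < -34 | lng > -28 | lng < -74, na.rm = TRUE)
  )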
With this data, we can already perform interesting analyses. For example, we can find out that there are five Starbucks stores just on Avenida Paulista.
starbucks %>%
  filter(str_detect(street, "Avenida Paulista"))
Simple feature collection with 5 features and 5 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -46.65895 ymin: -23.56784 xmax: -46.64809 ymax: -23.55785
Geodetic CRS: WGS 84
# A tibble: 5 × 6
index name street lat lng
* <dbl> <chr> <chr> <dbl> <dbl>
1 0 Shopping Top Center Avenida Paulista, 854 -23.6 -46.7
2 1 Shopping Cidade São Paulo Avenida Paulista, 1154 -23.6 -46.7
3 2 Paulista Trianon Avenida Paulista, 1499 -23.6 -46.7
4 3 Paulista 500 Avenida Paulista, 500 -23.6 -46.6
5 6 Shopping Center 3 Avenida Paulista, 2064 -23.6 -46.7
geometry
* <POINT [°]>
1 (-46.6508 -23.56581)
2 (-46.65438 -23.5631)
3 (-46.6558 -23.56226)
4 (-46.64809 -23.56784)
5 (-46.65895 -23.55785)
We can also count the number of stores in each airport. Apparently, there are eight stores at Guarulhos Airport, which seems like quite a high number to me.
starbucks %>%
  st_drop_geometry() %>%
  filter(str_detect(name, "Aeroporto")) %>%
  mutate(
    name_airport = str_remove(name, "de "),
    name_airport = str_extract(name_airport, "(?<=Aeroporto )\\w+"),
    name_airport = if_else(is.na(name_airport), "Confins", name_airport),
    .before = "name"
  ) %>%
  count(name_airport, sort = TRUE)
# A tibble: 9 × 2
name_airport n
<chr> <int>
1 GRU 8
2 Brasília 3
3 Florianópolis 3
4 Confins 2
5 Galeão 2
6 Viracopos 2
7 Congonhas 1
8 Curitiba 1
9 Santos 1
Finally, we can note that many Starbucks stores are located inside shopping malls. A simple calculation shows that around 75 stores are located inside malls, close to 50% of the total units².
starbucks %>%
  st_drop_geometry() %>%
  filter(str_detect(name, "Shopping|shopping")) %>%
  nrow()
[1] 75
Building
Finding each city
From this data, we can add more information. Using the geobr package, we can identify in which cities the stores are located.
dim_city = geobr::read_municipality(showProgress = FALSE)
dim_city = st_transform(dim_city, crs = 4326)
sf::sf_use_s2(FALSE)

starbucks = starbucks %>%
  st_join(dim_city) %>%
  relocate(c(name_muni, abbrev_state), .before = lat)
Now we can see which cities have the most Starbucks locations.
starbucks %>%
  st_drop_geometry() %>%
  count(name_muni, abbrev_state, sort = TRUE)
# A tibble: 43 × 3
name_muni abbrev_state n
<chr> <chr> <int>
1 São Paulo SP 45
2 Rio De Janeiro RJ 11
3 Guarulhos SP 9
4 Curitiba PR 8
5 Brasília DF 6
6 Campinas SP 6
7 Florianópolis SC 5
8 Jundiaí SP 4
9 Porto Alegre RS 4
10 Ribeirão Preto SP 3
# ℹ 33 more rows
In other words, there are more Starbucks stores on Avenida Paulista alone than in almost any other city in Brazil.
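This claim is easy to verify from the tables above. A small sketch, where n_paulista is just an illustrative name for the count of five stores found earlier:

# Cities with more Starbucks stores than Avenida Paulista alone (illustrative check)
n_paulista <- starbucks %>%
  st_drop_geometry() %>%
  filter(str_detect(street, "Avenida Paulista")) %>%
  nrow()

starbucks %>%
  st_drop_geometry() %>%
  count(name_muni, abbrev_state, sort = TRUE) %>%
  filter(n > n_paulista)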
Google Places
Adding information using Google Places API
The Google Places API allows access to data from Google Maps. The googleway package brings this data into R, already in a tidy format.
library(googleway)
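Note that google_places calls the Places API and therefore requires an API key. In googleway this is typically registered once per session with set_key; the key string below is a placeholder, not a real key.

# Register a (placeholder) Google Maps / Places API key for this session
set_key("YOUR_GOOGLE_MAPS_API_KEY")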
I’ll create a simple search to return all Starbucks locations in Brazil. A full search across the entire country would take too long, so I will use the coordinates I found via web scraping as a starting point.
The function below searches for the term “starbucks” at all the points I provide. To simplify, the function returns only a few of the columns.
# Function to grab Starbucks info
get_starbucks_info <- function(lat, lng) {

  # Search for 'Starbucks' using the provided latitude and longitude
  places = google_places(
    search_string = "starbucks", # Search term "starbucks"
    location = c(lat, lng)       # Coordinates (lat and lng) for the search
  )

  # Define the columns of interest to keep from the results
  sel_cols = c(
    "name",               # Store name
    "formatted_address",  # Store address
    "lat",                # Latitude
    "lng",                # Longitude
    "rating",             # Store rating
    "user_ratings_total", # Number of user ratings
    "business_status"     # Business status (e.g., operational or closed)
  )

  # Process the results and select the relevant columns
  places$results %>%
    tidyr::unnest("geometry") %>% # Extract the nested 'geometry' field
    tidyr::unnest("location") %>% # Extract the nested 'location' field (lat and lng)
    dplyr::select(dplyr::all_of(sel_cols)) # Select only the columns of interest
}
The code below runs the function above for all 142 stores found on the official Starbucks Brasil page.
# Remove geometry and keep only coordinates
coords_starbucks <- starbucks %>%
  st_drop_geometry() %>%
  as_tibble() %>%
  select(index, name, lat, lng)

starbucks_info = purrr::map2(
  coords_starbucks$lat,
  coords_starbucks$lng,
  get_starbucks_info
)

dat <- starbucks_info %>%
  bind_rows(.id = "search_id") %>%
  distinct()
To clean the data, I will keep only the active stores that contain “Starbucks” in their name. Additionally, I will pair the data with my web scraping dataset using st_nearest_feature(x, y). This function finds the nearest point in y for each point in x.
dat <- dat |>
  # Keep only stores with "Starbucks" in the name and that are operational
  filter(str_detect(name, "Starbucks"), business_status == "OPERATIONAL") |>
  # Arrange the results by address
  arrange(formatted_address)

# Convert to a spatial data frame using longitude and latitude
google_data <- dat %>%
  # Set coordinate reference system to WGS 84 (EPSG:4326)
  st_as_sf(coords = c("lng", "lat"), crs = 4326)

# Find the nearest Starbucks location from the web scraping data (starbucks)
# for each point in google_data
inds <- st_nearest_feature(google_data, starbucks)

# Extract the metadata of the nearest points from the web scraping data and
# convert to a tibble
metadata <- starbucks %>%
  slice(inds) %>%
  st_drop_geometry() %>% # Remove spatial geometry
  as_tibble()

# Rename the columns in google_data and bind the metadata from the web scraping data
google_data <- google_data |>
  rename(google_name = name, google_address = formatted_address) |>
  bind_cols(metadata) # Combine google_data with the corresponding metadata
Final Map
The interactive map below shows all Starbucks locations in São Paulo. The color of each circle represents its rating, and the size of the circle represents the number of reviews. The units along the Avenida Paulista corridor, for example, have high average ratings and a large number of reviews. One of the worst-rated units seems to be the one near Mackenzie University, which has a rating of 2.1 and 15 reviews. In the Eastern Zone, the store at Shopping Aricanduva also has a slightly lower rating, 3.9 with 158 reviews.
sp <- filter(google_data, name_muni == "São Paulo")

sp <- sp |>
  mutate(
    rad = findInterval(user_ratings_total, c(25, 100, 1000, 2500, 5000)) * 2 + 5
  )

pal <- colorNumeric("RdBu", domain = sp$rating)

labels <- stringr::str_glue(
  "<b> {sp$name} </b> <br>
   <b> Rating </b>: {sp$rating} <br>
   <b> No Ratings </b> {sp$user_ratings_total}"
)

labels <- lapply(labels, htmltools::HTML)

leaflet(sp) |>
  addTiles() |>
  addCircleMarkers(
    radius = ~rad,
    color = ~pal(rating),
    label = labels,
    stroke = FALSE,
    fillOpacity = 0.5
  ) |>
  addLegend(pal = pal, values = ~rating) |>
  addProviderTiles("CartoDB")
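As a quick cross-check of the ratings mentioned above, one could also list the lowest-rated units directly. A sketch, assuming the columns created in the previous steps:

# Lowest-rated São Paulo units by Google rating (illustrative check)
sp |>
  st_drop_geometry() |>
  arrange(rating) |>
  select(google_name, rating, user_ratings_total) |>
  head(5)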
Conclusion
Web scraping is a popular data extraction technique that allows us to quickly build interesting datasets. In this case, the process was quite simple, but as I mentioned, it can be much more complex.
Footnotes
1. https://starbucks.com.br/public/img/icons/starbucks-nav-logo.svg
2. Here, we are assuming that the “name” tag always includes the word “shopping” if the store is located inside a mall. This number might be underestimated if there are stores inside malls that don’t have the word “shopping” in their name. Strictly speaking, we also haven’t verified whether the “shopping” tag is always associated with an active shopping mall.