Finding All Starbucks in Brazil

In this post, I show how to find all Starbucks in Brazil using web scraping in R. I also show how to use the Google Places API to add ratings information for all Starbucks units.
Categories: starbucks, web-scraping, data-science, brazil, tutorial-R, finding, english
Author: Vinicius Oike

Published: July 14, 2024

Starbucks

Web scraping

Web scraping consists of extracting information from a web page. The difficulty or ease of extracting this information depends on how well the page is constructed. In more complex cases, the information may be behind a captcha or within an interactive panel that depends on user input.

In this simple example, I will show how to find the locations of all Starbucks stores in Brazil, using the list of active stores published on the Starbucks Brasil website. As usual, we will use the tidyverse along with the rvest and xml2 packages.

library(rvest)   # Used for web scraping and extracting HTML content
library(xml2)    # Helps in working with XML and HTML data
library(tidyverse)

The website

The full list of active Starbucks stores can be found on the Starbucks Brasil website. To read the page, we use read_html.

url = "https://starbucks.com.br/lojas"
page = xml2::read_html(url)   # Fetches the HTML content of the page

An XPath describes the path to a specific element on the page. For example, to find the Starbucks logo in the top-left corner of the page, we can use the following code:

# Extracts the HTML element for the logo image
page %>%
  html_element(xpath = "/html/body/div[1]/div[1]/header/nav/div/div[1]/a/img")
{html_node}
<img alt="Starbucks Logo" src="/public/img/icons/starbucks-nav-logo.svg">

To learn more about xpaths, you can consult this cheatsheet.
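Absolute XPaths like the one above are fragile, since any change in the page structure breaks them. As a quick aside (a sketch I'm adding here for illustration), the same element can usually be reached with a shorter, relative XPath that matches on an attribute instead:

# Relative XPath matching the logo by its "alt" attribute
page %>%
  html_element(xpath = '//img[@alt="Starbucks Logo"]')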

In general, on well-constructed pages, the names of elements will be quite self-explanatory. In the case above, the “alt” attribute already indicates that the object is the Starbucks logo, and the “src” points to an SVG image file called starbucks-nav-logo. Unfortunately, this won’t always be the case: on some pages, elements can be quite confusing.

To extract a specific attribute, we use the html_attr function.

page %>%
  html_element(
    xpath = "/html/body/div[1]/div[1]/header/nav/div/div[1]/a/img"
    ) %>%
  # Extracts the "src" attribute (URL to the image)
  html_attr("src")  
[1] "/public/img/icons/starbucks-nav-logo.svg"

If you combine this last link with “www.starbucks.com.br”, you should arrive at an image of the company’s logo¹.
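If you prefer to build this full link in R, xml2 has url_absolute, which resolves a relative path against a base URL (a minimal sketch of the idea):

# Resolve the relative "src" path against the site's base URL
logo_src <- page %>%
  html_element(xpath = "/html/body/div[1]/div[1]/header/nav/div/div[1]/a/img") %>%
  html_attr("src")

xml2::url_absolute(logo_src, "https://www.starbucks.com.br")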

Starbucks logo

To find the big list of stores in the left panel, we will take advantage of the fact that the div holding this list has a unique class called "place-list". It’s easy to verify this directly in your browser. If you use Chrome, for instance, just right-click on the panel and click on Inspect.
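Since rvest also accepts CSS selectors, we can run a quick sanity check on this class before writing the full extraction (a small sketch; it simply counts the store entries inside the "place-list" div):

# Count how many store entries sit inside the "place-list" div
page %>%
  html_elements(css = "div.place-list > div") %>%
  length()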

As I mentioned above, things aren’t always this well organized. Since we want to extract multiple elements and all of their attributes, we use the plural variants html_elements and html_attrs.

list_attr <- page %>%
  # Selects all divs under "place-list" that hold store info
  html_elements(xpath = '//div[@class="place-list"]/div')  %>%
  # Extracts all attributes of the selected elements
  html_attrs()  

The extracted object is a list in which each element is a character vector containing the store name, the latitude/longitude, and the address.

# Extracts the first store's information from the list
pluck(list_attr, 1)
                  class           data-latitude          data-longitude 
"place-item r-place-15"           "-23.5658059"           "-46.6508012" 
              data-name             data-street              data-index 
  "Shopping Top Center" "Avenida Paulista, 854"                     "0" 

At this point, the web scraping process is complete. Once again, the process was easy because the data is well structured on the Starbucks page. Now, we just need to clean the data.

Data Cleaning

I won’t go in depth about the data cleaning process. Basically, we need to convert each element of the list into a data.frame, stack the results, and then adjust the data types of each column.

# Convert the elements into data.frame
dat <- map(list_attr, \(x) as.data.frame(t(x)))
# Stack the results
dat <- bind_rows(dat)

clean_dat <- dat %>%
  as_tibble() %>%
  # Rename the columns
  rename_with(~str_remove(.x, "data-")) %>%
  rename(lat = latitude, lng = longitude) %>%
  # Select the columns of interest
  select(index, name, street, lat, lng) %>%
  # Convert lat/lng to numeric
  mutate(
    lat = as.numeric(lat),
    lng = as.numeric(lng),
    index = as.numeric(index),
    name = str_trim(name)
    )

Map

The table above is already in a pretty satisfactory format. We can check the data by building a simple map.

library(sf)
library(leaflet)

starbucks <- st_as_sf(clean_dat, coords = c("lng", "lat"), crs = 4326, remove = FALSE)

leaflet(starbucks) %>%
  addTiles() %>%
  addMarkers(label = ~name) %>%
  addProviderTiles("CartoDB") %>%
  setView(lng = -46.65590, lat = -23.561197, zoom = 12)

It’s worth noting that data extracted via web scraping almost always contains some noise. In this case, the data seems relatively clean after a bit of processing. The addresses are not always very informative, as in the case of “Rodovia Hélio Smidt, S/N,” but this happens because many stores are located inside hospitals, shopping malls, or airports.
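If you are curious how common these uninformative addresses are, a quick filter on the street column does the job (a small sketch that only checks for the “S/N”, i.e. “sem número”, pattern):

# List stores whose address has no street number ("S/N")
starbucks %>%
  st_drop_geometry() %>%
  filter(str_detect(street, "S/N")) %>%
  select(name, street)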

With this data, we can already perform interesting analyses. For example, we can find out that there are five Starbucks stores just on Avenida Paulista.

starbucks %>%
  filter(str_detect(street, "Avenida Paulista"))
Simple feature collection with 5 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -46.65895 ymin: -23.56784 xmax: -46.64809 ymax: -23.55785
Geodetic CRS:  WGS 84
# A tibble: 5 × 6
  index name                      street                   lat   lng
* <dbl> <chr>                     <chr>                  <dbl> <dbl>
1     0 Shopping Top Center       Avenida Paulista, 854  -23.6 -46.7
2     1 Shopping Cidade São Paulo Avenida Paulista, 1154 -23.6 -46.7
3     2 Paulista Trianon          Avenida Paulista, 1499 -23.6 -46.7
4     3 Paulista 500              Avenida Paulista, 500  -23.6 -46.6
5     6 Shopping Center 3         Avenida Paulista, 2064 -23.6 -46.7
               geometry
*           <POINT [°]>
1  (-46.6508 -23.56581)
2  (-46.65438 -23.5631)
3  (-46.6558 -23.56226)
4 (-46.64809 -23.56784)
5 (-46.65895 -23.55785)

We can also count the number of stores in each airport. Apparently, there are eight stores at Guarulhos Airport, which seems like quite a high number to me.

starbucks %>%
  st_drop_geometry() %>%
  filter(str_detect(name, "Aeroporto")) %>%
  mutate(
    name_airport = str_remove(name, "de "),
    name_airport = str_extract(name_airport, "(?<=Aeroporto )\\w+"),
    name_airport = if_else(is.na(name_airport), "Confins", name_airport),
    .before = "name"
  ) %>%
  count(name_airport, sort = TRUE)
# A tibble: 9 × 2
  name_airport      n
  <chr>         <int>
1 GRU               8
2 Brasília          3
3 Florianópolis     3
4 Confins           2
5 Galeão            2
6 Viracopos         2
7 Congonhas         1
8 Curitiba          1
9 Santos            1

Finally, we can note that many Starbucks stores are located inside shopping malls. A simple calculation shows that around 75 stores are located inside malls, close to 50% of the total units².

starbucks %>%
  st_drop_geometry() %>%
  filter(str_detect(name, "Shopping|shopping")) %>%
  nrow()
[1] 75
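To express this as a share of all stores, just divide by the total number of rows; this is the quick calculation behind the “close to 50%” figure above:

# Share of stores with "shopping" in their name
n_malls <- starbucks %>%
  st_drop_geometry() %>%
  filter(str_detect(name, "Shopping|shopping")) %>%
  nrow()

n_malls / nrow(starbucks)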

Building

Finding each city

From this data, we can add more information. Using the geobr package, we can identify in which cities the stores are located.

dim_city = geobr::read_municipality(showProgress = FALSE)
dim_city = st_transform(dim_city, crs = 4326)
sf::sf_use_s2(FALSE)

starbucks = starbucks %>%
  st_join(dim_city) %>%
  relocate(c(name_muni, abbrev_state), .before = lat)

Now we can see which cities have the most Starbucks locations.

starbucks %>%
  st_drop_geometry() %>%
  count(name_muni, abbrev_state, sort = TRUE) 
# A tibble: 43 × 3
   name_muni      abbrev_state     n
   <chr>          <chr>        <int>
 1 São Paulo      SP              45
 2 Rio De Janeiro RJ              11
 3 Guarulhos      SP               9
 4 Curitiba       PR               8
 5 Brasília       DF               6
 6 Campinas       SP               6
 7 Florianópolis  SC               5
 8 Jundiaí        SP               4
 9 Porto Alegre   RS               4
10 Ribeirão Preto SP               3
# ℹ 33 more rows

In other words, Avenida Paulista alone has more Starbucks stores than almost every other city in Brazil.

Google Places

Adding information using Google Places API

The Google Places API gives access to data from Google Maps, and the googleway package makes this data available in R in a tidy format.

library(googleway)
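The Places API requires an API key linked to a Google Cloud project. googleway registers it with set_key; the key below is just a placeholder, not a real credential:

# Register your Google Places API key for this session
set_key("YOUR_GOOGLE_API_KEY")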

I’ll create a simple search to return all Starbucks locations in Brazil. A full search across the entire country would take too long, so I will use the coordinates I found via web scraping as a starting point.

The function below searches for the term “starbucks” at all the points I provide. To simplify, the function returns only a few of the columns.

# Function to grab starbucks info
get_starbucks_info <- function(lat, lng) {
  
  # Search for 'Starbucks' using the provided latitude and longitude.
  places = google_places(
    search_string = "starbucks",   # Search term "starbucks"
    location = c(lat, lng)         # Coordinates (lat and lng) for the search
  )
  
  # Define the columns of interest to keep from the results
  sel_cols = c(
    "name",                        # Store name
    "formatted_address",           # Store address
    "lat",                         # Latitude
    "lng",                         # Longitude
    "rating",                      # Store rating
    "user_ratings_total",          # Number of user ratings
    "business_status"              # Business status (e.g., operational or closed)
  )
  
  # Process the results and select the relevant columns
  places$results %>%
    tidyr::unnest("geometry") %>%   # Extract the nested 'geometry' field
    tidyr::unnest("location") %>%   # Extract the nested 'location' field (lat and lng)
    dplyr::select(dplyr::all_of(sel_cols))   # Select only the columns of interest
}

The code below runs the function above on all 142 stores found on the official Starbucks Brasil page.

# Remove geometry and keep only coordinates
coords_starbucks <- starbucks %>%
  st_drop_geometry() %>%
  as_tibble() %>%
  select(index, name, lat, lng)

starbucks_info = purrr::map2(
  coords_starbucks$lat,
  coords_starbucks$lng,
  get_starbucks_info
  )

dat <- starbucks_info %>%
  bind_rows(.id = "search_id") %>%
  distinct()

To clean the data, I will keep only the active stores that contain “Starbucks” in their name. Additionally, I will pair the data with my web scraping dataset using st_nearest_feature(x, y). This function finds the nearest point in y for each point in x.

dat <- dat |> 
  # Keep only stores with "Starbucks" in the name and that are operational
  filter(str_detect(name, "Starbucks"), business_status == "OPERATIONAL") |> 
  # Arrange the results by address
  arrange(formatted_address)

# Convert to a spatial data frame using longitude and latitude
google_data <- dat %>%
  # Set coordinate reference system to WGS 84 (EPSG:4326)
  st_as_sf(coords = c("lng", "lat"), crs = 4326)  

# Find the nearest Starbucks locations from the web scraping data (starbucks)
# for each point in google_data
inds <- st_nearest_feature(google_data, starbucks)

# Extract the metadata of the nearest points from the web scraping data and
# convert to a tibble
metadata <- starbucks %>%
  slice(inds) %>%
  st_drop_geometry() %>%  # Remove spatial geometry
  as_tibble()

# Rename the columns in google_data and bind the metadata from the web scraping data
google_data <- google_data |> 
  rename(google_name = name, google_address = formatted_address) |> 
  bind_cols(metadata)  # Combine google_data with the corresponding metadata

Final Map

The interactive map below shows all Starbucks locations in São Paulo. The color of each circle represents its rating, and the size of the circle represents the number of reviews. The units along the Avenida Paulista corridor, for example, have high average ratings and a large number of reviews. One of the worst-rated units seems to be the one near Mackenzie University, which has a rating of 2.1 and 15 reviews. In the Eastern Zone, the store at Shopping Aricanduva also has a slightly lower rating, 3.9 with 158 reviews.

sp <- filter(google_data, name_muni == "São Paulo")

sp <- sp |> 
  mutate(
    rad = findInterval(user_ratings_total, c(25, 100, 1000, 2500, 5000))*2 + 5
  )

pal <- colorNumeric("RdBu", domain = sp$rating)

labels <- stringr::str_glue(
  "<b> {sp$name} </b> <br>
   <b> Rating </b>: {sp$rating} <br>
   <b> No Ratings </b> {sp$user_ratings_total}"
)

labels <- lapply(labels, htmltools::HTML)

leaflet(sp) |> 
  addTiles() |> 
  addCircleMarkers(
    radius = ~rad,
    color = ~pal(rating),
    label = labels,
    stroke = FALSE,
    fillOpacity = 0.5
  ) |> 
  addLegend(pal = pal, values = ~rating) |> 
  addProviderTiles("CartoDB")

Conclusion

Web scraping is a popular data extraction technique that allows us to quickly build interesting datasets. In this case, the process was quite simple, but as I mentioned, it can be much more complex.

Footnotes

  1. https://starbucks.com.br/public/img/icons/starbucks-nav-logo.svg↩︎

  2. Here, we are assuming that the “name” tag always includes the word “shopping” if the store is located inside a mall. This number may be underestimated if there are stores inside malls that don’t have the word “shopping” in their name. Strictly speaking, we also haven’t verified whether the “shopping” tag is always associated with an active shopping mall.↩︎