Census Tracts in Brazil

Census tracts are the smallest administrative divisions that provide socio-economic and demographic data. They are highly useful for statistical and spatial analysis, offering a robust and reliable set of information at a fine geographical scale.
census
brazil
sao-paulo
data-science
data-visualization
maps
leaflet
ggplot2
Author

Vinicius Oike

Published

June 14, 2024

Understanding Census Tracts in Brazil

Census tracts are the smallest administrative areas for which socioeconomic and demographic data are available. Broadly speaking, census tracts are small areas that exhibit similar socioeconomic and demographic patterns. In another post, I presented all of the Brazilian administrative and statistical subdivisions.

Census tracts are the strata used by National Bureau of Statistics and Geography (IBGE) in their decennial Census. The shape of each census tract usually respects administrative borders, land barriers, public spaces (e.g. parks, beaches, etc.), and follows the shape of roads, highways, or city blocks.

Though they are typically small, census tracts size vary. Large plots of uninhabited land are commonly grouped into a single tract. In dense urban areas, census tracts are very small.

Census tracts exhibit relatively homogeneous socioeconomic and demographic characteristics. This makes census tracts a very useful statistical tool in regression analysis and classification.

The map below shows the 2022 census tracts in Curitiba, a major city in the Southern part of Brazil.

Code
curitiba_setores <- cur_setores %>%
  mutate(across(pop:dom_prt_ocup, as.numeric)) %>%
  mutate(
    area_km2 = as.numeric(area_km2),
    pop_dens_area = pop / area_km2,
    pop_dens_hh = pop / dom_prt_ocup
  )

labels <- sprintf(
  "<b>Code Tract<b/>: %s <br>
  <b>Code Subdistrict<b/>: %s <br>
  <b>Name Subdistrict<b/>: %s",
  curitiba_setores$code_tract,
  curitiba_setores$code_subdistrict,
  curitiba_setores$name_subdistrict
)

labels <- lapply(labels, htmltools::HTML)

# center <- curitiba_setores %>%
#   summarise(geom = st_union(.)) %>%
#   st_centroid() %>%
#   st_coordinates()

curitiba_center <- c("lng" = -49.28824, "lat" = -25.47788)

leaflet(curitiba_setores) |> 
  addTiles() |> 
  addPolygons(weight = 2, fillColor = "gray80", color = "#feb24c", label = labels) |> 
  addProviderTiles(provider = "CartoDB.Positron") |> 
  setView(curitiba_center[[1]], curitiba_center[[2]], zoom = 15)

Getting the data

IBGE

The data can be downloaded from IBGE’s website.

geobr

The package geobr (available for both Python and R) provides a convenient way to import census tracts directly into a session. The example code below shows how to import the most recent shape file for São Paulo’s census tracts1.

# Import census tracts for São Paulo
spo_tract <- geobr::read_census_tract(3550308, year = 2022)

Using census tracts

Census tracts come with a large variety of socioeconomic and demographic data. The 2022 Census currently only has a limited set of variables, that include total population and the total number of households. More data should be released in the future.

The 2010 census tracts offers a much richer set of variables including demographic information on age, sex, and race as well as income and education. There is a clear trade-off however as this data is almost 15 years old.

Demographic data

The maps below show basic demographic information from the 2022 Census highlighting the central district in Curitiba. For simplicity, I omit the color legend but darker shades of blue represent higher values, while lighter/whiter shades of blue represent lower values. Technically, these are quintile maps, where the underlying numerical data was ordered and binned into five equal-sized groups.

Code
setores_centro <- cur_setores |> 
  # Administração Regional da Matriz
  filter(code_subdistrict == 41069020501) |> 
  mutate(
    pop = as.numeric(pop),
    pop_quintile = ntile(pop, 5),
    area_km2 = as.numeric(area_km2),
    pop_dens = pop / area_km2 * 10,
    pop_dens = ntile(pop_dens, 5),
    pop_dom = ntile(pop_dom, 5)
    )

theme_map <- theme_void() +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 14, hjust = 0.5)
  )

m1 <- ggplot(setores_centro) +
  geom_sf() +
  ggtitle("Census tracts") +
  theme_map

m2 <- ggplot(setores_centro) +
  geom_sf(aes(fill = pop_quintile), lwd = 0.15, color = "white") +
  scale_fill_fermenter(direction = 1, breaks = 1:5) +
  ggtitle("Population") +
  theme_map

m3 <- ggplot(setores_centro) +
  geom_sf(aes(fill = pop_dens), lwd = 0.15, color = "white") +
  ggtitle("Pop. Density") +
  scale_fill_fermenter(direction = 1, breaks = 1:5) +
  theme_map

m4 <- ggplot(setores_centro) +
  geom_sf(aes(fill = pop_dom), lwd = 0.15, color = "white") +
  ggtitle("Persons per Household") +
  scale_fill_fermenter(direction = 1, breaks = 1:5) +
  theme_map

Household income per capita

More sophisticated analysis can be made using census tracts, but - currently - this is only possible using 2010 data. The map below shows household income data at a census tract level for the entire city of Curitiba. The map shows higher levels of income in the city center and lower levels of income in the city’s periphery.

While the absolute values of income are certainly out of date, the overall spatial distribution of the data might still be similar. Since this maps uses deciles of income, instead of the actual income value, it can still communicate valuable information. This is, of course, a strong hypothesis that might be more or less valid in different contexts.

Code
curitiba_income <- curitiba_income |> 
  mutate(decile = ntile(income_pc, 10))

ggplot(curitiba_income) +
  geom_sf(aes(fill = decile, color = decile)) +
  scale_fill_fermenter(
    name = "HH Income per capita (deciles)",
    palette = "Spectral",
    breaks = 1:10,
    labels = c("Bottom 10%", rep("", 8), "Top 10%"),
    na.value = "gray50",
    direction = 1) +
  scale_color_fermenter(
    name = "HH Income per capita (deciles)",
    palette = "Spectral",
    breaks = 1:10,
    labels = c("Bottom 10%", rep("", 8), "Top 10%"),
    na.value = "gray50",
    direction = 1) +
  labs(title = "Curitiba: Household Income per capita (deciles)") +
  ggthemes::theme_map() +
  theme(
    plot.title = element_text(size = 18, hjust = 0.5),
    legend.position = c(0.9, 0.05)
  )

The 2022 Census

Brazil’s most recent demographic Census was delayed due to Covid-19 and governmental budget cuts. I wrote a post (in Portuguese) that describes this turbulent context and how it might have affected the quality of the information collected.

There was a significant increase in the number of census tracts in Brazil. While the total number of cities barely changed, the total number of census tracts increased by more than 40%, reaching over 454 thousand.

The table below summarizes the changes in the past 3 editions of the Census.

Year Num. Cities Num. Census Tracts Population
2000 1.058 254.855 169.590.693
2010 5.565 316.545 190.755.799
2022 5.571 454.047 203.080.756

Higher Resolution

An obvious improvement from the 2010 census tracts is the gain in spatial resolution. The maps below show the central area of Curitiba where there was a 33% increase in the number of total census tracts.

Important considerations

Some important considerations to keep in mind when working with census tracts.

  1. Shape is not time consistent.
  2. Number of census tracts varies.
  3. Area of census tracts can be very different.

Census tracts follow the evolution of cities. As cities grow in size and complexity, so do the shape and number of the tracts. In the 2010 Census, São Paulo had nearly 19.000 census tracts, in the most recent Census it had 27.600 tracts, a 45% increase. Note that the borders of the city remained unchanged during this period.

New census tracts aren’t necessarily subsets of previously existing tracts. This means that comparing census tracts through time is not so straightforward. A way to standardize census tracts is to “dissolve” them into a statistical grid and either (1) use this higher resolution grid directly; or (2) use the grid as an intermediary step to interpolate different census data into a common standard.

Census tracts can vary a lot in size. It’s a common practice to include large chunks of uninhabited land (e.g. public parks) into a single huge census tract. This means that creating “intensive” variables such as population density can be tricky in some cases.

While the size of census tracts varies, the underlying demographic data can be much more well-behaved. The histograms below show the distribution of (1) total population and the (2) average number of persons per household among all census tracts in Curitiba. Population data is relatively normal-shaped, with a right skew; note that the distribution also has a small spike at zero, since there are several non-populated census tracts. To improve this visualization, I remove outliers from the second plot2.

Dealing with spatial inconsistency

Spatial interpolation

A simple strategy to deal with census tracts’ spatial inconsistency is to define a common spatial grid. This means choosing either a (1) grid of triangles; (2) grid of squares; or (3) grid of hexagons. In this example, I choose a simple squared grid.

The process of converting one set of data (stored in a particular shape) into another shape is called spatial interpolation. Spatial interpolation is also referred to as areal interpolation or dasymetric interpolation.

Essentially, the problem we are tryting to solve is: we have some data stored in a (big) shape and we wish to estimate the same data in another (smaller) shape. In this case, we will dissolve population count data, stored in the shape of the 2010 census tracts and 2020 census tracts, into a finer squared grid.

To convert the data from one shape to the other we implicitly assume that the variable (population) is uniformly distributed over the shape’s space. That is, we assume that every single person is evenly distributed across each census tract. This assumption works well in small densely populated tracts, but doesn’t hold as well in larger tracts.

Spatial grid for Census Data

The maps below show the same Census household data in a squared 500x500m grid. While I omit the color legend of the plots, I scaled them equally as to make them directly comparable; also, darker shades of green indicate higher values, while lighter shades of green indicate lower values.

The 2010 Census data was directly imported using the censobr R package. To estimate a simple areal interpolation I use the areal package. Executing the areal interpolation involves a few intermediary steps such as choosing a valid UTM CRS.

The final result shows the number of households in 2010 and 2022 in a common spatial grid. This process can be replicated across the entire city to allow an easier comparison of the data.

Code
library(censobr)
library(areal)
# Get demographic data for 2010 census
basico <- censobr::read_tracts(dataset = "Basico")
# Get number of households (private)
basico <- basico %>%
  filter(code_muni == 4106902) %>%
  select(code_tract, dom = V001) %>%
  collect()
# Join with census tract shape
setores_centro_antigo <- left_join(setores_centro_antigo, basico, by = "code_tract")

# Create a grid
grid_centro <- setores_centro %>%
  st_union() %>%
  st_make_valid() %>%
  st_transform(crs = 32722) %>%
  st_make_grid(cellsize = 500) %>%
  st_as_sf() %>%
  st_transform(crs = 4674) %>%
  mutate(gid = row_number(.))

# Join census data with the new grid
grid_2010 <- setores_centro_antigo %>%
  select(dom) %>%
  st_interpolate_aw(grid_centro, extensive = TRUE) 
# Join 2020 census data with the new grid
grid_2020 <- setores_centro %>%
  select(dom_prt) %>%
  st_interpolate_aw(grid_centro, extensive = TRUE)

d1 <- grid_2010 %>%
  st_centroid() %>%
  st_join(grid_centro) %>%
  st_drop_geometry() %>%
  as_tibble() %>%
  rename(dom_2010 = dom)

d2 <- grid_2020 %>%
  st_centroid() %>%
  st_join(grid_centro) %>%
  st_drop_geometry() %>%
  as_tibble() %>%
  rename(dom_2020 = dom_prt)

full_grid <- grid_centro %>%
  left_join(d1) %>%
  left_join(d2)

bbox <- setores_centro |> 
  st_union() %>%
  st_transform(crs = 32722) %>%
  st_buffer(dist = 100) %>%
  st_transform(crs = 4674) %>%
  st_bbox()

m3 <- full_grid %>%
  mutate(
    chg = (dom_2020 / dom_2010 - 1) * 100,
    chg_bin = cut(chg, breaks = c(-Inf, 0, 50, 100, Inf))
  )  %>%
  st_make_valid() %>%
  filter(!is.na(chg_bin)) %>%
  ggplot() +
  geom_sf(aes(fill = chg_bin), color = "white") +
  scale_fill_brewer(palette = "Greens") +
  guides(fill = "none") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5))

Conclusion

Census tracts are small statistical areas that exhibit similar socioeconomic and demographic patterns. They are useful for statistical and spatial analysis, offering a robust and reliable set of information at a fine geographical scale.

Census tracts change through time, making it difficult to compare them. A workaround is to dissolve them into common spatial grid to facilitate comparisons.

Footnotes

  1. As of June 2024, the shape↩︎

  2. Census tracts where there either 0 persons per household or more than 4 persons per household are considered outliers.↩︎