library(PNADcIBGE)
library(dplyr)
library(srvyr)
library(survey)
library(kableExtra)
format_number <- function(x, digits = 0) {
x <- round(x, digits = digits)
x <- format(x, big.mark = ".", decimal.mark = ",")
return(x)
}
print_table <- function(data, col_names = NA) {
knitr::kable(data, digits = 2, align = "c", col.names = col_names) |>
kableExtra::kable_styling(
full_width = FALSE,
bootstrap_options = c("striped", "hover", "condensed")
)
}
pnad19 <- PNADcIBGE::get_pnadc(year = 2019, interview = 1)Introduction
When dealing with survey and census data we usually also encounter microdata. Microdata is simply the disaggregated data that is collected in these surveys usually at the individual, household, or establishment level. A “special” kind of microdata comes from complex surveys (sometimes refered simply as surveys). Complex surveys differ from conventional surveys in their methodology: most importantly, complex surveys are not random! This is a design choice that helps make research in the field more efficient and economically feasible.
Since this microdata is not randomly collected, conventional statistical measures will be biased. Even simple statistics like the mean or the median must be properly weighted to accurately reflect population totals.
Identifying the survey
The PNADc survey can be identified both at the individual and the household level. To identify a household we construct the following key UPA + V1008 + V1014. Likewise, to identify an individual we construct the key UPA + V1008 + V1014 + V2003.
The survey package facilitates working with complex surveys. To make the data wrangling process smoother we suggest using the svryr package that emulates dplyr functionality for complex survey data frames.
The pnad19 object is a survey object but the underlying data frame can be accessed via pnad19$variables.
# Convert to tbl_svy
#> as_survey is needed to manipulate the survey object with dplyr functions
pnad19 <- as_survey(pnad19)pnad19 <- pnad19 |>
mutate(
id_household = paste0(UPA, V1008, V1014),
id_individual = paste0(UPA, V1008, V1014, V2003))As of more recent releases of the PNADcIBGE package there already is a variable called ID_DOMICILIO that identifies each household. There isn’t an explicit variable to identify each individual, since each line implicitlty is a unique individual.
all.equal(pnad19$variables$ID_DOMICILIO, pnad19$variables$id_household)#> Remove duplicate rows
household <- pnad19$variables |>
distinct(id_household, .keep_all = TRUE)
#> Create with survey::svydesign without bootstraping (slightly faster)
#> also convert to column names to lowercase
household_simple <- pnad19$variables |>
distinct(id_household, .keep_all = TRUE) |>
janitor::clean_names()
svyhh <- survey::svydesign(
ids = ~id_household,
strata = ~estrato,
weights = ~v1032,
nest = TRUE,
data = household_simple
)
# Using PNADcIBGE::pnadc_design (must have original names for columns!)
#> If precise confidence intervals are needed this is the recommended function
svyhh <- PNADcIBGE::pnadc_design(household)
#> as_survey is needed to manipulate the survey object with dplyr functions
svyhh <- as_survey(svyhh)
svyhh$variables <- rename_with(svyhh$variables, tolower)Check results
#> Estimate the average number of dwellers per household by state
hhdwellers <- svyhh |>
group_by(uf) |>
summarise(avg = survey_mean(vd2003, na.rm = TRUE))
print_table(hhdwellers)The code below gets the official IBGE table from the SIDRA API using the sidrar package. Note that IBGE rounds its numbers to the first decimal. This makes sense since standard errors are about the first or second decimal case.
#> Get the official table generated by SIDRA
table_ibge <- sidrar::get_sidra(
x = 6578,
geo = "State",
variable = 10163,
period = "2019")
table_ibge |>
janitor::clean_names() |>
dplyr::select(unidade_da_federacao, valor) |>
print_table()Only point estimates
Survey functions can take a very long time to compute, mainly due to the processing time needed to calibrate estimations of standard-errors. If one only needs point estimates for simple statistics it is much easier to use xtabs and weighted sums and averages.
The code below compares computing time and shows that results are equal so there is no lost in precision of points estimates. We use the simple tictoc package to time the code but a more thorough analysis should use a more advanced package such as microbenchmark.
library(tictoc)
tic("slow method")
tab_slower <- svyhh |>
group_by(uf) |>
summarise(avg = survey_mean(vd2003, na.rm = TRUE))
toc()
tic("fast method")
tab_faster <- svyhh$variables |>
group_by(uf) |>
summarise(avg = weighted.mean(vd2003, v1032))
toc()
all.equal(tab_slower$avg, tab_faster$avg)The same is true when counting categorical variables. The code below counts the number of household by type (apartment, house, other) in each metropolitan region.
tic("slow method")
tab_slower <- svyhh |>
group_by(rm_ride, s01001) |>
survey_count()
toc()
tic("fast method")
tab_faster <- xtabs(v1032 ~ rm_ride + s01001, data = svyhh$variables)
toc()Comparing the results is not as straightforward since each function outputs a different object. The slower method outputs a grouped data.frame while the faster xtabs method outputs a table object.
head(tab_slower)The outout from xtabs can be converted to a data.frame and is coerced into a convenient long format. The output is a contingency table from cross-classifying factors. Each row is a group of variables and the Freq column is the estimate of the total number of times that group appears in the population.
hhtype <- as.data.frame(tab_faster)
head(hhtype)In the example above we estimate a total of … houses in the metropolitan region of Manaus.
Comparing both outputs we see that they differ in the last rows. This happens because of missing values in the rm_ride column. The slower survey method does not drop these missing by default.
all.equal(tab_slower$n, hhtype$Freq)To see this we can filter the missing values in the slower method.
tab_slower |>
filter(is.na(rm_ride))Survey basics
Computing statistics
Computing summary statistics such as means, totals, and quantiles is straightforward using survey functions. However, some knowledge on the survey’s methodology is needed to guarantee meaningful and accurate results. It will also be important to note that some cases requires thoughtful choices by the researcher.
To compute the total number of households by type we use svytotal
svytotal(~s01001, svyhh)Or, using the srvyr syntax
svyhh |>
group_by(s01001) |>
survey_tally()Or, finally using xtabs.
xtabs(v1032 ~ s01001, data = svyhh$variables)In the following examples we will only use srvyr syntax since it imitates dplyr syntax which is more commonly known. To compute the average value of total family income we use survey_mean.
svyhh |>
summarise(income = survey_mean(vd5010))When dealing with income data one might find it useful to remove households without income. This will naturally lead to higher estimates.
svyhh |>
# remove household with 0 income
filter(vd5010 > 0) |>
summarise(income = survey_mean(vd5010))We can also estimate grouped averages. The code below estimates the average household income by family type. In the following, Unipessoal represents someone living alone, Nuclear represents all 2 or more people family arrangements, Composta represents a nuclear family + one or more non-relatives that live together in the same household (e.g. maid), and Estendida represents all other cases.
svyhh |>
# remove household with 0 income
filter(vd5010 > 0) |>
# group by family type
group_by(vd2004) |>
summarise(income = survey_mean(vd5010))Quantile estimation is made using survey_quantile. Technical aspects can become more complex since quantiles do not depend smoothly on the underlying variable. Ties are also possible, though survey_mean will linearly interpolate the values by default.
svyhh |>
summarise(income = survey_quantile(vd5010, seq(0.1, 0.9, 0.1)))Since the median is a very commonly used quantile a wrapper survey_median is available:
svyhh |>
# remove household with 0 income
filter(vd5010 > 0) |>
# group by family type
group_by(vd2004) |>
summarise(
avg = survey_mean(vd5010),
med = survey_median(vd5010))Estimating proportions and ratios is accomplished using survey_prop and survey_ratio. At the national level, we estimate that 85.7% of households are houses and only 14.3% are apartments. Note that we consider only houses and apartments since these are much more common.
svyhh |>
filter(s01001 %in% c("Casa", "Apartamento")) |>
group_by(s01001) |>
summarise(prop = survey_prop())We could arrive at the same estimate without confidence intervals using tally and then computing the proportion. This is useful since we can quickly tally using xtabs.
svyhh |>
filter(s01001 %in% c("Casa", "Apartamento")) |>
group_by(s01001) |>
survey_tally() |>
mutate(prop = n / sum(n))The code below relates homeownership to household types. Using survey_prop we find grouped proportions, so each homeownership group sums to 100%.
svyhh <- svyhh |>
mutate(
ownership = case_when(
stringr::str_detect(s01017, "^Próprio") ~ "Home Owner",
s01017 == "Alugado" ~ "Rent",
TRUE ~ "Other"),
ownership = factor(ownership))
svyhh |>
filter(s01001 %in% c("Casa", "Apartamento")) |>
group_by(ownership, s01001) |>
summarise(prop = survey_prop())Alternatively, if we wish to compute the proportion over the entire data we use the interact function. This way we find that the most common group are homeowners of houses (62.5%) followed by renters of houses (14.6%).
svyhh |>
filter(s01001 %in% c("Casa", "Apartamento")) |>
group_by(interact(ownership, s01001)) |>
summarise(prop = survey_prop())As can be seen in the examples above, both by using filter and group_by we restrict our estimation to specific subpopulations of the sample. More formally this is domain estimating and while the technical aspects are dealt with by the survey functions one should keep in mind that too much restrictions/interactions will lead to imprecise estimations.
The code below computes the ownership x household types proportions for three major metropolitan regions in Brazil. Note how the standard errors are much larger than when we estimated this for the entire country.
tab_rm <- svyhh |>
filter(
stringr::str_detect(
rm_ride,
"(São Paulo)|(Rio de Janeiro)|(Belo Horizonte)"),
s01001 %in% c("Casa", "Apartamento"),
vd5010 > 0) |>
group_by(rm_ride, ownership, s01001) |>
summarise(prop = survey_prop())
tab_rmThe table below shows the proportion of houses x apartments that are rented in each of the metro regions.
tab_rm |>
filter(ownership == "Rent") |>
tidyr::pivot_wider(
id_cols = "rm_ride",
names_from = "s01001",
values_from = "prop",
values_fn = ~ round(. * 100, 1)) |>
print_table()Linear regression and hypothesis tests
Linear models in general are implemented via the svyglm function where glm is short for Generalized Linear Model. Default specifications result in a linear model. For other types of linear models, such as binomial models, we can use family = quasibinomial().
The linear regression model stated below serves only as an example and is broadly based on Acolin & Green (2017), Measuring housing affordability in São Paulo metropolitan region: Incorporating location. In their paper, the authors create a simple hedonic model to decompose house rents. Below we propose an even simpler hedonic model to explain house rent variation.
svyhh_reg <- svyhh |>
filter(
# Consider only houses and apartments
s01001 %in% c("Casa", "Apartamento"),
# Remove 0 income families
vd5010 > 0,
!is.na(s01012a)) |>
mutate(
# (Respose variable) Natural logarithm of effective or implied monthly rent
rent = log(s01019),
# Truncate number of dwellers
ndwellers = ifelse(v2001 >= 5, 5, v2001),
# Truncate number of rooms
rooms = ifelse(s01005 >= 5, 5, s01005),
# "One-hot encoding" for number of bathrooms
# Important note: there exist households without bathrooms in the sample
bath1 = ifelse(s01011a == 1, 1L, 0L),
bath2 = ifelse(s01011a == 2, 1L, 0L),
bath3 = ifelse(s01011a >= 3, 1L, 0L),
# Natural logarithm
income = log(vd5010),
# Group roof material categorical variable
roof_material = case_when(
stringr::str_detect(s01003, "^Telha") ~ "Tile",
s01003 == "Somente laje de concreto" ~ "Concrete",
TRUE ~ "Other"),
# Group type of sewage categorical variable
sewage = case_when(
s01012a == "Rede geral, rede pluvial" ~ "Sewer",
stringr::str_detect(s01012a, "^Fossa") ~ "Sceptic Tank",
TRUE ~ "Other"),
# Dummy indicating if garbage is collected
is_garbage_collected = ifelse(
stringr::str_detect(s01013, "^Coletado"), 1L, 0L),
# Dummy indicating if house has electricity
has_electricity = ifelse(
stringr::str_detect(s01015, "^Diária"), 1L, 0L),
# Dummy indicating if household is an apartment
is_apto = factor(ifelse(s01001 == "Apartamento", 1L, 0L)))
model.lm <- svyglm(
rent ~ income + is_apto + ndwellers + rooms + bath1 + bath2 + bath3 +
roof_material + sewage + is_garbage_collected + has_electricity +
as.factor(rm_ride),
svyhh_reg
)Below I present the results of the estimation. Two crucial informations are lacking in this model: (1) the house floor area, and (2) the precise location of the house. Even so, only using external quality variables (roof material, type of sewage system, etc.) and proxies for floor area (number of rooms) we have a decent fit.
model.lm |>
broom.helpers::tidy_and_attach() |>
broom.helpers::tidy_select_variables(
include = -starts_with("as")
)The logistic regression below is stated only for the sake of an example. It creates a dummy variable is_rent to identify if a household is for rent. As explanatory variables we use the (1) truncated number of dwellers, (2) the truncated number of rooms, (3) the type of family arrangement, (4) the natural logarithm of total family income, and (5) a dummy variable indicating if the house is an apartment. I also use metro-regions as a fixed-effect variable by wrapping it with as.factor. The code for the model is presented below.
svyhh_reg <- svyhh |>
mutate(
# (Response variable) dummy indicating if household is rental
is_rented = factor(ifelse(s01017 == "Alugado", 1L, 0L)),
# Truncate the number of dwellers
ndwellers = ifelse(v2001 >= 5, 5, v2001),
# Truncate the number of rooms
rooms = ifelse(s01005 >= 5, 5, s01005),
# Natural logarithm of income
income = log(vd5010),
# Lump alternative family arrangements into "Other"
family_type = ifelse(
vd2004 %in% c("Estendida", "Composta"), "Other", vd2004),
# Dummy indicating if home is an apartment
is_apto = factor(ifelse(s01001 == "Apartamento", 1L, 0L))) |>
filter(
# Consider only houses and apartments
s01001 %in% c("Casa", "Apartamento"),
# Remove 0 income families
vd5010 > 0)
# Estimate logisitc model
model.logistic <- svyglm(
is_rented ~ income + is_apto + ndwellers + rooms + family_type +
as.factor(rm_ride),
svyhh_reg,
family = quasibinomial())Results are summarized in the following table. We see that variables are statistically significant and have intuitive signs. The binary variable for apartment indicates a higher likelihood of a household being rented given that is an apartment.
model.logistic |>
broom.helpers::tidy_and_attach() |>
broom.helpers::tidy_select_variables(include = -starts_with("as"))We can compute confidence intervals for these estimates:
confint(model.logistic, parm = "is_apto1")And compute an odds-ratio for the estimates. In the following we find that an apartment is associated with a 2.13 times increase in the average chance of a household being rented.
betas <- coef(model.logistic)
exp(betas[names(betas) == "is_apto1"])Interpreting standard-errors
Standard errors give an idea of the precision of the parameter estimates. For broad analysis we may want to ignore this and only consider point estimates but when comparing different groups we should take them into account.
More generally, one should not use population estimates from survey data absolutely. Take the estimate of the total number of household by type.
xtabs(v1032 ~ s01001, data = svyhh$variables)According to estimator, there are 60,418,462 houses in Brazil. For practical purposes we seldom are interested in the total number of houses with such precision and it would be naive to assume that the survey has this amount of precision. A broad estimate of 60,418 thousand would, generally speaking, be enough. Moreover, if we compute the standard errors we can estimate a 95% confidence interval.
The interval below states that the true number of houses in Brazil is between 60188669 and 60648255, or 60,188 - 60,648 thousand.
hhtotal <- svytotal(~s01001, svyhh)
confint(hhtotal)Standard errors are more useful when we wish to compare groups in a meaningful manner. In the example of the average number of dwellers we calculated that in Paraíba there are 3.11 persons per household and that in Rio Grande do Norte there are 3.13 persons per household. Can we safely conclude that the average number of dwellers per household is smaller in Paraíba?
Since both estimates have an approximate standard error of 0.03 we can construct a 95% confidence interval around these estimates and compare them. Comparing all states visually we see that some of them cannot be distinguished
hhdwellers <- hhdwellers |>
mutate(
ci_lower = avg - 1.96 * avg_se,
ci_upper = avg + 1.96 * avg_se,
uf = forcats::fct_reorder(uf, avg)
)
ggplot(hhdwellers) +
geom_col(aes(x = uf, y = avg), fill = "#344e41") +
geom_errorbar(
aes(x = uf, ymin = ci_lower, ymax = ci_upper),
color = "orange",
width = 0.5) +
scale_x_discrete(labels = function(x) stringr::str_wrap(x, 12)) +
coord_flip() +
labs(
x = NULL,
y = "Average dwellers per household",
title = "North and northeastern household are larger",
caption = "Source: PNADC/A.") +
theme_benvi +
theme(
axis.text.x = element_text(angle = 0),
axis.text.y = element_text(hjust = 0.5)
)Finally, we output these results more clearly. The code below calculates the share of each type of household in each metropolitan region. For clarity we only consider apartments and houses so shares will not sum to 100%.
hhtype <- as.data.frame(tab_faster)
tab_results <- hhtype |>
group_by(rm_ride) |>
mutate(share = Freq / sum(Freq) * 100) |>
filter(s01001 %in% c("Casa", "Apartamento")) |>
tidyr::pivot_wider(
id_cols = "rm_ride",
names_from = "s01001",
values_from = c("Freq", "share")) |>
mutate(across(where(is.numeric), round, 1)) |>
mutate(
rm_ride = as.character(rm_ride),
rm_ride = stringr::str_replace(rm_ride, "Região Metropolitana de", "RM"),
rm_ride = stringr::str_replace(
rm_ride,
"Região Administrativa Integrada\nde Desenvolvimento da",
"RIDE"))
print_table(tab_results)Results
Functions
#> Gets the city name from the rm_ride column
clean_string <- function(x) {
pattern <- c(
"Região Metropolitana d[a-z] ",
"Região Administrativa Integrada de Desenvolvimento da ",
"Grande",
" \\([A-Z][A-Z]\\)",
"Vale do Rio"
)
pattern <- paste(stringr::str_glue("({pattern})"), collapse = "|")
x <- stringr::str_replace_all(x, "\n", " ")
x <- stringr::str_remove_all(x, pattern)
x <- stringr::str_trim(x)
x <- stringr::str_to_title(x)
return(x)
}
#> Adds geographic dimension to a table based on the rm_ride column
add_geo_dimension <- function(data) {
dim_state <- readr::read_csv(here("data/dimension/dim_city_state.csv"))
data <- data |>
dplyr::mutate(
name_muni = clean_string(rm_ride),
abbrev_state = stringr::str_extract(rm_ride, "(?<=\\().+?(?=\\))")
) |>
dplyr::left_join(dim_state, by = c("name_muni", "abbrev_state")) |>
filter(!is.na(code_muni))
return(data)
}
#> Computes mean and 1st, 2nd and 3rd quantiles.
summarize_stat <- function(data, x, ...) {
data <- data |>
dplyr::group_by(...) |>
dplyr::summarize(
avg = srvyr::survey_mean({{ x }}, na.rm = TRUE),
med = srvyr::survey_median({{ x }}, na.rm = TRUE),
q25 = srvyr::survey_quantile({{ x }}, 0.25, na.rm = TRUE),
q75 = srvyr::survey_quantile({{ x }}, 0.75, na.rm = TRUE)
)
return(data)
}
#> Alternative spelling fot summarise function above
summarise_stat <- summarize_statFind unique family arrangements
# Count total family compositions per household
tab_family <- pnad19$variables |>
group_by(id_household, V1032) |>
summarise(family = paste(VD2002, collapse = ",")) |>
count(family, wt = V1032) |>
group_by(family) |>
summarise(count = sum(n)) |>
arrange(desc(count))
#> Find all unique family compositions
s <- stringr::str_split(paste(tab_family$family, collapse = ","), pattern = ",")
s <- stringr::str_trim(unique(unlist(s)))
#> Get only blood relatives excluding "filho" (son)
rm <- "(Pessoa responsável)|(Cônjuge)|(Filho)|(Empregado)|(empregado)|(Pensionista)"
s <- s[!stringr::str_detect(s, rm)]
#> Escape all brackets
s <- stringr::str_replace_all(s, "\\(a\\)", "\\\\(a\\\\)")
# Collapse into a single pattern to detect relatives
relatives <- paste(stringr::str_glue("({s})"), collapse = "|")
#> Create dummies and count variables to detect if family (1) is composed of a
#> single person; (2) is married; (3) has children; (4) lives with relatives -
#> excluding spouse and children; (5) lives with non-relatives
#> Note: married includes partners living together
#> relatives includes "agregados" and "conviventes"
tab_family <- tab_family |>
mutate(
alone = dplyr::if_else(family == "Pessoa responsável", 1L, 0L),
married = stringr::str_count(family, "Cônjuge"),
children = stringr::str_count(family, "Filho"),
relative = as.integer(stringr::str_detect(family, relatives)),
other = as.integer(stringr::str_detect(family, "(mpregado)|(Pensionista)")),
units = 1 + stringr::str_count(family, "\\,")
)
#> Classify family types into 10 most common groups + other
key_ftype <- tab_family |>
mutate(
family_type = case_when(
alone == 1L ~ "One-person household",
married == 1L & units == 2L ~ "Married without children",
married == 1L & children == 1 & units == 3L ~ "Married with 1 child",
married == 1L & children == 2 & units == 4L ~ "Married with 2 children",
married == 1L & children >= 3L & relative == 0L ~ "Married with 3 or more children",
married == 1L & children == 0L & relative > 0L ~ "Married without children and with relatives",
married == 1L & children > 0L & relative > 0L ~ "Married with children and with relatives",
married == 0L & children > 0L & relative == 0L ~ "Single father/mother with children",
married == 0L & children == 0L & relative > 0L ~ "Single without children and with relatives",
married == 0L & children > 0L & relative > 0L ~ "Single with children and with relatives",
TRUE ~ "Other")) |>
select(family, family_type, alone, married, children, relative, other)
#> Group original data by household (id) and join with family types
id_ftype <- pnad19$variables |>
dplyr::group_by(id_household) |>
dplyr::summarise(family = paste(VD2002, collapse = ",")) |>
dplyr::left_join(key_ftype, by = "family")
#> Join the household identified survey data.frame with family types
household <- dplyr::left_join(household, id_ftype)
#> Convert to srvyr object again
svyhh <- PNADcIBGE::pnadc_design(household)
svyhh <- as_survey(svyhh)
svyhh$variables <- dplyr::rename_with(svyhh$variables, tolower)
#> Recreate home-ownership column
svyhh <- svyhh |>
mutate(
ownership = case_when(
stringr::str_detect(s01017, "^Próprio") ~ "Home Owner",
s01017 == "Alugado" ~ "Rent",
TRUE ~ "Other"),
ownership = factor(ownership))Main tables
dir <- here::here("data/br-ibge-pnad/microdata")
dict <- readxl::read_excel(
here(dir, "documentation/dicionario_PNADC_microdados_2019_visita1_20220224.xls"),
skip = 4,
col_names = c(
"init_pos", "char_length", "code", "num", "description", "cat_type",
"cat_desc", "period"))
#> Fill columns downwards
dict <- tidyr::fill(dict, names(dict))#> V5007 - Received rent?
#> S01001 - Type of house ownership (e.g. owned, rented, etc.).
#> S01006 - Number of rooms in the house (1 to 30).
#> S01018 - What was the amount that was paid or that should have been paid as
#> payment for mortgage.
#> S01019 - What was the amount that was paid or that should have been paid as
#> rent service.
#> VD2002 - Status in the household (e.g. responsible, father, mother, son, etc.).
#> VD2003 - Number of people living in the household (1 to 30).
#> VD2004 - Type of the household arrangement (e.g. single, nuclear, etc.).
#> VD3004 - Highest level of education achieved.
#> VD4020 - Effective personal income from all types of work.
#> VD4022 - Effective personal income from all sources.
#> VD5007 - Effective family income from all sources.
#> VD5008 - Per capita family income from all sources.
#> VD5009 - Per capita family income from all sources (binned).
#> Obs: Income variables include individuals with 0 income.
codes <- c("v5007a", "s01018", "s01019", "vd2002", "vd2003", "vd2004",
"vd3004", "vd4020", "vd4022", "vd5007", "vd5008", "vd5009")Household Ownership
#> Estimate total number of households by ownership in all Metro Regions
tab_ownership <- svyhh |>
group_by(rm_ride, ownership, s01001) |>
survey_tally() |>
ungroup() |>
add_geo_dimension()tab_ownership |>
filter(code_muni == 3550308) |>
tidyr::pivot_wider(
id_cols = "ownership",
names_from = "s01001",
values_from = "n") |>
mutate(across(is.numeric, format_number)) |>
print_table(col_names = c("Ownership", "House", "Apartament", "Other"))#> Estimate the proportion of households by ownership in all Metro Regions
tab_ownership_prop <- svyhh |>
group_by(rm_ride, ownership, s01001) |>
summarise(prop = survey_prop()) |>
ungroup() |>
add_geo_dimension()tab_ownership_prop |>
filter(code_muni == 3550308, s01001 %in% c("Casa", "Apartamento")) |>
select(name_muni, ownership, house_type = s01001, prop, prop_se) |>
tidyr::pivot_wider(
id_cols = "ownership",
names_from = "house_type",
values_from = "prop") |>
print_table()Rooms per Household
#> Estimate the number of households by number of rooms (truncated at 4) in
#> all Metro Regions
tab_rooms <- svyhh |>
mutate(rooms = ifelse(s01006 >= 4, 4, s01006)) |>
group_by(rm_ride, rooms, s01001) |>
survey_tally() |>
ungroup() |>
add_geo_dimension()#> Estimate the proportion of households by number of rooms (truncated at 4) in
#> all Metro Regions
tab_rooms_prop <- svyhh |>
mutate(rooms = ifelse(s01006 >= 4, 4, s01006)) |>
group_by(rm_ride, rooms, s01001) |>
summarise(prop = survey_prop()) |>
ungroup() |>
add_geo_dimension()tab_rooms_prop |>
filter(s01001 == "Apartamento") |>
tidyr::pivot_wider(
id_cols = "name_muni",
names_from = "rooms",
values_from = "prop",
values_fn = ~. * 100) |>
mutate(across(where(is.numeric), format_number, digits = 2)) |>
print_table()Dwellers per household
#> Estimate the number of household by number of dwellers (truncated at 5) in
#> all Metro Regions.
tab_dwellers <- svyhh |>
mutate(dwellers = ifelse(vd2003 >= 5, 5, vd2003)) |>
group_by(rm_ride, dwellers, vd2003) |>
survey_tally() |>
ungroup() |>
add_geo_dimension()#> Estimate the proportion of household by number of dwellers (truncated at 5)
#> in all Metro Regions.
tab_dwellers_prop <- svyhh |>
mutate(dwellers = ifelse(vd2003 >= 5, 5, vd2003)) |>
group_by(rm_ride, dwellers, vd2003) |>
summarise(prop = survey_prop()) |>
ungroup() |>
add_geo_dimension()Family arrangement by household
#> Estimate the number of households by family arrangement in all MRs
tab_arrangement <- svyhh |>
group_by(rm_ride, vd2004) |>
survey_tally() |>
ungroup() |>
add_geo_dimension()#> Estimate the proportion of households by family arrangement in all MRs
tab_arrangement_prop <- svyhh |>
group_by(rm_ride, vd2004) |>
summarise(prop = survey_prop()) |>
ungroup() |>
add_geo_dimension()#> Estimate the number of the most common family types
tab_ftypes <- svyhh |>
group_by(rm_ride, family_type) |>
survey_tally() |>
ungroup() |>
add_geo_dimension()#> Estimate the proportion of the most common family types
tab_ftypes_prop <- svyhh |>
group_by(rm_ride, family_type) |>
summarise(prop = survey_prop()) |>
ungroup() |>
add_geo_dimension()Mortgage and rents
#> Estimate how much families pay for mortgage on their homes in all MRs
tab_mortgage <- svyhh |>
filter(s01018 > 0) |>
group_by(rm_ride) |>
summarise(
avg = survey_mean(s01018, na.rm = TRUE),
med = survey_median(s01018, na.rm = TRUE),
q25 = survey_quantile(s01018, 0.25, na.rm = TRUE),
q75 = survey_quantile(s01018, 0.75, na.rm = TRUE))
#> Estimates how much families pay for rent in all MRs
tab_rent <- svyhh |>
filter(s01019 > 0, s01017 == "Alugado") |>
group_by(rm_ride) |>
summarise(
avg = survey_mean(s01019, na.rm = TRUE),
med = survey_median(s01019, na.rm = TRUE),
q25 = survey_quantile(s01019, 0.25, na.rm = TRUE),
q75 = survey_quantile(s01019, 0.75, na.rm = TRUE))Income
# Estimate total family income in all Metro Regions (excludes 0 income families)
tab_income <- svyhh |>
filter(vd5010 > 0) |>
summarize_stat(vd5010, rm_ride)
#> Estimate per capital family income in all Metro Regions (excludes 0 income
#> families).
tab_income_pc <- svyhh |>
filter(vd5010 > 0) |>
summarize_stat(vd5011, rm_ride)
#> Estimates the proportion of families by per capita income in all Metro
#> Regions (excludes 0 income families)
tab_income_pc_binned <- svyhh |>
filter(vd5010 > 0) |>
group_by(rm_ride, vd5009) |>
summarise(prop = survey_prop())tab_income_pc |>
select(rm_ride, avg, avg_se) |>
arrange(desc(avg)) |>
mutate(across(is.numeric, format_number)) |>
print_table(col_names = c("Metro Region", "Average", "Std. Err."))Appendix
Base code to execute analysis
get_pnadc_identified <- function(
year = 2019,
interview = 1,
simplified = FALSE,
is_srvyr = TRUE,
aux_variables = TRUE,
unit = "household") {
pnad <- PNADcIBGE::get_pnadc(year = year, interview = interview)
if (isTRUE(aux_variables)) {
pnad$variables <- dplyr::mutate(
pnad$variables,
# Lump house ownership into broader groups
ownership = dplyr::case_when(
stringr::str_detect(s01017, "^Próprio") ~ "Home Owner",
S01017 == "Alugado" ~ "Rent",
TRUE ~ "Other"),
ownership = factor(ownership),
# Group roof material categorical variable
roof_material = dplyr::case_when(
stringr::str_detect(s01003, "^Telha") ~ "Tile",
s01003 == "Somente laje de concreto" ~ "Concrete",
TRUE ~ "Other"),
# Group type of sewage categorical variable
sewage = dplyr::case_when(
s01012a == "Rede geral, rede pluvial" ~ "Sewer",
stringr::str_detect(s01012a, "^Fossa") ~ "Sceptic Tank",
TRUE ~ "Other"),
# Dummy indicating if garbage is collected
is_garbage_collected = ifelse(
stringr::str_detect(s01013, "^Coletado"), 1L, 0L),
# Dummy indicating if house has electricity
has_electricity = ifelse(
stringr::str_detect(s01015, "^Diária"), 1L, 0L),
# Dummy indicating if household is an apartment
is_apto = factor(ifelse(s01001 == "Apartamento", 1L, 0L))
)
}
if (unit == "household") {
#> Remove duplicate rows
household <- dplyr::distinct(pnad$variables, .keep_all = TRUE)
if (isFALSE(simplified)) {
# If precise confidence intervals are needed this is recommended
svy <- PNADcIBGE::pnadc_design(household)
} else {
# Simple (and quicker) approach
svy <- survey::svydesign(
ids = ~ID_DOMICILIO,
strata = ~Estrato,
weights = ~V1032,
nest = TRUE,
data = pnad
)
}
}
if (unit == "individual") {
if (isFALSE(simplified)) {
svy <- pnad
} else {
# Simple (and quicker) approach
svy <- survey::svydesign(
ids = ~1,
strata = ~Estrato,
weights = ~V1032,
nest = TRUE,
data = pnad
)
}
}
if (isTRUE(is_srvyr)) {
svy <- srvyr::as_survey(svy)
svy$variables <- dplyr::rename_with(svy$variables, tolower)
}
return(svy)
}A bit on the theory of complex surveys
Suppose we take a random sample of 1000 people from Rio Grande do Sul (with a total population of 10,000,000). Any given individual has a \(\pi_{i} = 0.00001\) chance of being selected into this sample. If 200 people from this sample have a dog as a pet we can expect that \(200 \times 10000 = 2.000.000\) people, in fact, have a dog as a pet.
The idea behind this is that any individual \(i\), sampled with probability \(\pi_{i}\) represents \(\frac{1}{\pi}\) individuals in the population. In the example, the total number of dog owners \(X\) is approximated by the fraction of individuals in the sample that have a dog multiplied by their relative weight in the total population. Since we assume random sampling, every individual has the same weight, the same sampling weight. In other words, \(\pi_{i} = \pi \, \forall \, i\).
Formally we have the following estimator
\[ \hat{X} = \frac{1}{\pi_{i}}x_{i} \]
where \(x_{i}\) is the observed value of variable \(x\) for individual \(i\) sampled with probability \(\pi_{i}\). For the simplest case, suppose only a single individual is sampled and that \(X\) is the total wealth of the population.
Given a sample of size \(N\) we define the Horwitz-Thompson estimator \(\hat{T}_{X}\) for the population total \(T_{X}\) of \(X\) as:
\[ \hat{T}_{X} = \sum_{i = 1}^{N}\frac{1}{\pi_{i}}x_{i} \]