---
title: "Introductio to geobr (R)"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to geobr (R)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = identical(tolower(Sys.getenv("NOT_CRAN")), "true"),
  out.width = "100%"
)


```

The [**geobr**](https://github.com/ipeaGIT/geobr) package provides quick and 
easy access to official spatial data sets of Brazil. The package offers a wide 
range of spatial data sets available at various geographic scales and for various 
years with harmonized attributes, projection and fixed topology. All **geobr** 
functions follow a simple and consistent syntax that allows users to seamlessly 
download data and work with it either in memory using sf or out of memory using 
DuckDB and Arrow. This vignette presents a quick intro to **geobr**.


## Installation

You can install **geobr** from CRAN or the development version to use the latest features.

```{r eval=FALSE, message=FALSE, warning=FALSE}
# From CRAN
install.packages("geobr")

# Development version
utils::remove.packages('geobr')
devtools::install_github("ipeaGIT/geobr", subdir = "r-package")

```

Now let's load the libraries we'll use in this vignette.

```{r message=FALSE, warning=FALSE, results='hide'}
library(geobr)
library(sf)
library(dplyr)
library(ggplot2)
```


## General usage

### Available data sets

The geobr package currently covers 30 spatial data sets, including a variety of 
political-administrative and statistical areas used in Brazil. You can view what 
data sets are available using the `list_geobr()` function.

```{r message=FALSE, warning=FALSE}
# Available data sets
datasets <- list_geobr(wide = TRUE)

head(datasets)

```


### Basic syntax

The syntax of all *geobr* functions operate on the same simple logic, so the 
code to download the data becomes intuitive for the user. Here are a few examples.

Download an specific geographic area at a given year:

```{r message=FALSE, warning=FALSE}
# State of Sergipe
state <- read_state(
  year = 2022,
  code_state = "SE",
  showProgress = FALSE
  )

# Municipality of Sao Paulo
muni <- read_municipality(
  year = 2022, 
  code_muni = 3550308, 
  showProgress = FALSE
  )

ggplot() + 
  geom_sf(data = muni, color=NA, fill = '#1ba185') +
  theme_void()
```


Download all geographic areas within a state at a given year:

```{r message=FALSE, warning=FALSE, results='hide'}
# All municipalities in the state of Minas Gerais
muni <- read_municipality(
  year = 2022,
  code_muni = "MG", 
  showProgress = FALSE
  )

head(muni)
```

If the parameter `code_` is not passed to the function, geobr returns the data 
for the whole country by default.

```{r message=FALSE, warning=FALSE}
# read all schools
inter <- read_schools(
  year = 2022,
  showProgress = FALSE
  )

# read all states
states <- read_state(
  year = 2025, 
  showProgress = FALSE
  )

head(states)
```


### Important note about data resolution

All functions to download polygon data such as states, municipalities etc. have 
a `simplified` argument. When `simplified = FALSE`, geobr returns the 
original data set with high resolution at detailed geographic scale (see 
documentation). By default, however, `simplified = TRUE` and geobr returns data
geometries with simplified borders to improve speed of downloading and 
plotting the data.


## Plot the data

Once you've downloaded the data, it is really simple to plot maps using `ggplot2`.

```{r message=FALSE, warning=FALSE, fig.height = 8, fig.width = 8, fig.align = "center"}
# Remove plot axis
no_axis <- theme(axis.title=element_blank(),
                 axis.text=element_blank(),
                 axis.ticks=element_blank())

# Plot all Brazilian states
ggplot() +
  geom_sf(data=states, fill="#2D3E50", color="#FEBF57", size=.15, show.legend = FALSE) +
  labs(subtitle="States", size=8) +
  theme_minimal() +
  no_axis

```



Plot all the municipalities of a particular state, such as Rio de Janeiro:

```{r message=FALSE, warning=FALSE, fig.height = 8, fig.width = 8, fig.align = "center"}

# Download all municipalities of Rio
all_muni <- read_municipality(
  year= 2022,
  code_muni = "RJ", 
  showProgress = FALSE
  )

# plot
ggplot() +
  geom_sf(data=all_muni, fill="#2D3E50", color="#FEBF57", size=.15, show.legend = FALSE) +
  labs(subtitle="Municipalities of Rio de Janeiro, 2000", size=8) +
  theme_minimal() +
  no_axis

```

## Lazy evaluation with DuckDB and Arrow

By default, all functions in geobr use `output = "sf"` and return `sf` objects 
loaded into memory. In some cases, however, it may be preferable to process data 
out of memory for faster and more memory-efficient computation, particularly when 
working with large spatial data sets.

To support these workflows, users can set `output = "duckdb"` to return a lazy 
`duckspatial_df` object. This allows data to be analyzed with DuckDB using the [{duckspatial}](https://cidree.github.io/duckspatial/) package, enabling
efficient out-of-memory spatial operations using a syntax similar to {sf}.

Alternatively, users can set `output = "arrow"` to return an Arrow dataset, which 
can be integrated with the Arrow ecosystem for scalable analytical workflows.


```{r message=FALSE, warning=FALSE}
# return duckdb duckspatial_df
muni_duck <- geobr::read_municipality(
  year = 2022, 
  output = "duckdb"
  )

# return arrow table
muni_arrow <- geobr::read_municipality(
  year = 2022, 
  output = "arrow"
  )

```

## Thematic maps

The next step is to combine data from **geobr** package with other data sets 
to create thematic maps. In this first example, we will be using data from the 
(Atlas of Human Development (by Ipea/FJP and UNPD) to create a choropleth map 
showing the spatial variation of **Life Expectancy at birth** across Brazilian 
states.

#### Merge external data

First, we need a `data.frame` with estimates of Life Expectancy. We then need to 
merge this table to our spatial database. The two-digit abbreviation of state name 
is our key column to join these two data sets.

```{r message=FALSE, warning=FALSE, results='hide'}
# Read data.frame with life expectancy data
df <- data.table::fread(
  system.file("extdata/br_states_lifexpect2017.csv", package = "geobr")
  )

# join the databases
states <- dplyr::left_join(
  x = states, 
  y = df, 
  by = c("name_state" = "uf")
  )

```


#### Plot thematic map

```{r message=FALSE, warning=FALSE, fig.height = 8, fig.width = 8, fig.align = "center" }
ggplot() +
  geom_sf(data=states, aes(fill=ESPVIDA2017), color= NA, size=.15) +
    labs(subtitle="Life Expectancy at birth, Brazilian States, 2014", size=8) +
    scale_fill_distiller(palette = "Blues", name="Life Expectancy", limits = c(65,80)) +
    theme_minimal() +
    no_axis

```

# Using **geobr** together with **censobr**

Following the same steps as above, we can use together **geobr** with our sister 
package [**censobr**](https://ipeagit.github.io/censobr/index.html) to map the
proportion of households connected to a sewage network in Brazilian municipalities 

First, we need to download households data from the Brazilian census using the 
`read_households()` function.

```{r}
library(censobr)
library(arrow)

hs <- read_households(
  year = 2010, 
  showProgress = FALSE
  )

```

Now we're going to (a) group observations by municipality, (b) get the number of 
households connected to a sewage network, (c) calculate the proportion of 
households connected, and (d) collect the results.

```{r, warning = FALSE}
esg <- hs |> 
        collect() |>
        group_by(code_muni) |>                                             # (a)
        summarize(rede = sum(V0010[which(V0207=='1')]),                    # (b)
                  total = sum(V0010)) |>                                   # (b)
        mutate(cobertura = rede / total) |>                                # (c)
        collect()                                                          # (d)

head(esg)
```
Now we only need to download the geometries of Brazilian municipalities from 
**geobr**, merge the spatial data with our estimates and map the results.

```{r, warning = FALSE}
# download municipality geometries
muni_sf <- geobr::read_municipality(
  year = 2010,
  showProgress = FALSE
  )

# merge data
esg_sf <- left_join(muni_sf, esg, by = 'code_muni')

# plot map
ggplot() +
  geom_sf(data = esg_sf, aes(fill = cobertura), color=NA) +
  labs(title = "Share of households connected to a sewage network") +
  scale_fill_distiller(palette = "Greens", direction = 1, 
                       name='Share of\nhouseholds', 
                       labels = scales::percent) +
  theme_void()

```
