--- title: "Codebook and cookbook for v1 (2020-2021) Spanish mobility data" vignette: > %\VignetteIndexEntry{Codebook and cookbook for v1 (2020-2021) Spanish mobility data} %\VignetteEngine{quarto::html} %\VignetteEncoding{UTF-8} bibliography: references.bib format: html: toc: true toc-depth: 2 code-overflow: wrap execute: eval: false --- You can view this vignette any time by running: ```{r} spanishoddata::spod_codebook(ver = 1) ``` The mobility data v1 (2020-2021) was originally released by the Ministerio de Transportes, Movilidad y Agenda Urbana (MITMA) , now [Ministerio de Transportes y Movilidad Sostenible (MITMS)](https://www.transportes.gob.es/) [@mitma_data_v1]. The dataset is produced by [Nommon](https://www.nommon.es/){target="_blank"} using the raw data from [Orange España](https://www.orange.es/){target="_blank"}. Even though the raw data is only from one mobile phone operator, the resulting flows and other counts of number of individuals in the data set are already resampled to be representative of the total population of Spain (see details in the official methodology). The tables in the data set provide hourly flows between zones across Spain for every day of the observation period (2020-02-14 to 2021-05-09), and the number of individuals making trips for each zone. This document will introduce you to the available data and provide brief code snippets on how to access it using the `{spanishoddata}` R package. Key sources for this codebook/cookbook include: - [original data collection methodology in Spanish](https://cdn.mitma.gob.es/portal-web-drupal/covid-19/bigdata/mitma_-_estudio_movilidad_covid-19_informe_metodologico_v3.pdf){target="_blank"} + [automatically translated English version of methodology](https://rOpenSpain.github.io/spanishoddata/codebooks/mitma_-_estudio_movilidad_covid-19_informe_metodologico_v3_en.pdf){target="_blank"} - [original data codebook in Spanish](https://opendata-movilidad.mitma.es/README%20-%20formato%20ficheros%20movilidad%20MITMA%2020201228.pdf){target="_blank"} + [automatically translated English version of codebook](https://rOpenSpain.github.io/spanishoddata/codebooks/README_-_formato_ficheros_movilidad_MITMA_20201228_en.pdf){target="_blank"} - [original data download page](https://www.transportes.gob.es/ministerio/proyectos-singulares/estudios-de-movilidad-con-big-data/estudios-de-movilidad-anteriores/covid-19/opendata-movilidad){target="_blank"} - [original data license](https://opendata-movilidad.mitma.es/LICENCIA%20de%20datos%20abiertos%20del%20MITMA%2020201203.pdf){target="_blank"} - [homepage of the v1 study](https://www.transportes.gob.es/ministerio/proyectos-singulares/estudios-de-movilidad-con-big-data/estudios-de-movilidad-anteriores/covid-19){target="_blank"} - [homepage of the open mobility data project of the Ministry of Transport and Sustainable Mobility of Spain](https://www.transportes.gob.es/ministerio/proyectos-singulares/estudio-de-movilidad-con-big-data){target="_blank"} **Note: Kindly consult the documents above for any specific details on the methodology. The codebook here is only a simplified summary.** To access the data we reference in this codebook, please follow these steps: {{< include ../inst/vignette-include/install-package.qmd >}} Load it as follows: ```{r} library(spanishoddata) ``` Using the instructions below, set the data folder for the package to download the files into. You may need up to 30 GB to download all data and another 30 GB if you would like to convert the downloaded data into analysis ready format (a `DuckDB` database file, or a folder of `parquet` files). You can find more info on this conversion in the [Download and convert OD datasets](convert.html) vignette. {{< include ../inst/vignette-include/setup-data-directory.qmd >}} {{< include ../inst/vignette-include/overall-approach.qmd >}} ![The overview of package functions to get the data](../man/figures/package-functions-overview.svg){#fig-overall-flow width="78%"} # 1. Spatial data with zoning boundaries The boundary data is provided at two geographic levels: [`Distrtics`](#districts) and [`Municipalities`](#municipalities). It's important to note that these do not always align with the official Spanish census districts and municipalities. To comply with data protection regulations, certain aggregations had to be made to districts and municipalities". ## 1.1 `Districts` {#districts} `Districts` correspond to official census districts in cities; however, in those with lower population density, they are grouped together. In rural areas, one district is often equal to a municipality, but municipalities with low population are combined into larger units to preserve privacy of individuals in the dataset. Therefore, there are 2850 'districts' compared to the 10494 official census districts on which they are based. To access it: ```{r} districts_v1 <- spod_get_zones("dist", ver = 1) ``` The `districts_v1` object is of class `sf` consisting of polygons. Data structure: | Variable Name | **Description** | |------------------------------------|------------------------------------| | `id` | District `id` assigned by the data provider. Matches with `id_origin`, `id_destination`, and `id` in district level [origin-destination data](#od-data) and [number of trips data](#nt-data). | | `census_districts` | A string with semicolon-separated list of census district semicolon-separated identifiers as classified by the Spanish Statistical Office (INE) that are spatially bound within polygons with `id` above. | | `municipalities_mitma` | A string with semicolon-separated list of municipality identifiers as assigned by the data provider in municipality zones spatial dataset that correspond to a given district `id` . | | `municipalities` | A string with semicolon-separated list of municipality identifiers as classified by the Spanish Statistical Office (INE) that correspond to polygons with `id` above. | | `district_names_in_v2` | A string with semicolon-separated list of names of district polygons defined in the [v2 version of this data](v2-2022-onwards-mitma-data-codebook.html) that covers the year 2022 and onwards that correspond to polygons with `id` above. | | `district_ids_in_v2` | A string with semicolon-separated list of identifiers of district polygons defined in the [v2 version of this data](v2-2022-onwards-mitma-data-codebook.html) that covers the year 2022 and onwards that correspond to polygons with `id` above. | ## 1.2 `Municipalities` {#municipalities} `Municipalities` are made up of official municipalities in those of a certain size; however, they have also been aggregated in cases of lower population density. As a result, there are 2,205 municipalities compared to the 8,125 official municipalities on which they are based. To access it: ```{r} municipalities_v1 <- spod_get_zones("muni", ver = 1) ``` The resulting `municipalities_v1` object is type `sf` consisting of polygons. Data structure: | Variable Name | **Description** | |------------------------------------|------------------------------------| | `id` | District `id` assigned by the data provider. Matches with `id_origin`, `id_destination`, and `id` in municipality level [origin-destination data](#od-data) and [number of trips data](#nt-data). | | `municipalities` | A list of municipality identifiers as classified by the Spanish Statistical Office (INE) that correspond to polygons with `id` above. | | `districts_mitma` | A list of district identifiers as assigned by the data provider in districts zones spatial dataset that correspond to a given municipality `id` . | | `census_districts` | A list of census district identifiers as classified by the Spanish Statistical Office (INE) that are spatially bound within polygons with `id` above. | | `municipality_names_in_v2` | A list of names of municipality polygons defined in the [v2 version of this data](v2-2022-onwards-mitma-data-codebook.html) that covers the year 2022 and onwards that correspond to polygons with `id` above. | | `municipality_ids_in_v2` | A list of identifiers of municipality polygons defined in the [v2 version of this data](v2-2022-onwards-mitma-data-codebook.html) that covers the year 2022 and onwards that correspond to polygons with `id` above. | The spatial data you get via `spanishoddata` package is downloaded directly from the source, the geometries of polygons are automatically fixed if there are any invalid geometries. The zone identifiers are stored in `id` column. Apart from that `id` column, the original zones files do not have any metadata. However, as seen above, using the `spanishoddata` package you get many additional columns that provide a semantic connection between official statistical zones used by the Spanish government and the zones you can get for the v2 data (for 2022 onward). # 2. Mobility data All mobility data is referenced via `id_origin`, `id_destination`, or other location identifiers (mostly labelled as `id`) with the two sets of zones described above. ## 2.1. Origin-destination data {#od-data} The origin-destination data contain the number of trips between `districts` or `municipalities` in Spain for every hour of every day between 2020-02-14 and 2021-05-09. Each flow also has attributes such as the trip purpose (composed of the type of activity (`home`/`work_or_study`/`other`) at both the origin and destination), province of residence of individuals making this trip, distance covered while making the trip. See the detailed attributes below in a table. @fig-flows-barcelona shows an example of total flows in the province of Barcelona on Feb 14th, 2020. ![Origin destination flows in Barcelona on 2020-02-14](media/flows_plot.svg){#fig-flows-barcelona width="70%"} Here are the variables you can find in both the `district` and `municipality` level origin-destination data: | **English Variable Name** | **Original Variable Name** | **Type** | **Description** | |----------------|----------------|----------------|------------------------| | `date` | `fecha` | `Date` | The date of the recorded data, formatted as `YYYY-MM-DD`. | | `id_origin` | `origen` | `factor` | The origin zone `id` of `district` or `municipalitity`. | | `id_destination` | `destino` | `factor` | The destination zone `id` of `district` or `municipalitity`. | | `activity_origin` | `actividad_origen` | `factor` | The type of activity at the origin zone, recoded from `casa`, `otros`, `trabajo_estudio` to `home`, `other`, `work_or_study` respectively. | | `activity_destination` | `actividad_destino` | `factor` | The type of activity at the destination zone, similarly recoded as for `activity_origin` above. | | `residence_province_ine_code` | `residencia` | `factor` | The province code of residence if individuals who were making the trips in `n_trips`, encoded as province codes as classified by the Spanish Statistical Office (INE). | | `residence_province_name` | Derived from `residencia` | `factor` | The full name of the residence province, derived from the province code above. | | `time_slot` | `periodo` | `integer` | The time slot during which the trips occurred. | | `distance` | `distancia` | `factor` | The distance range of the trip, categorized into specific intervals such as `0005-002` (500 m to 2 km), `002-005` (2-5 km), `005-010` (5-10km), `010-050` (10-50 km), `050-100` (50-100 km), and `100+` (more than 100 km). | | `n_trips` | `viajes` | `numeric` | The number of trips for that specific origin-destination pair and time slot. | | `trips_total_length_km` | `viajes_km` | `numeric` | The total length of trips in kilometers, summing up all trips between the origin and destination zones. | | `year` | `year` | `integer` | The year of the recorded data, extracted from the date. | | `month` | `month` | `integer` | The month of the recorded data, extracted from the date. | | `day` | `day` | `integer` | The day of the recorded data, extracted from the date. |
Data transformation note The original data is stored in the [`maestra-2`](https://www.transportes.gob.es/ministerio/proyectos-singulares/estudios-de-movilidad-con-big-data/estudios-de-movilidad-anteriores/covid-19/opendata-movilidad){target="_blank"} folder with suffixes `distritos` (for district zoning) and `municipios` (for municipality zoning). We only use the `district` level data because of several data issues with the `municipality` data documented [here](http://www.ekotov.pro/mitma-data-issues/issues/011-v1-tpp-mismatch-zone-ids-in-table-and-spatial-data.html){target="_blank"} and [here](http://www.ekotov.pro/mitma-data-issues/issues/012-v1-tpp-district-files-in-municipality-folders.html){target="_blank"}, but also because the distric level data contains more columns with useful origin-destination flow characteristics. As a result, you get both the `district` level data and the `municipality` level data with the same columns. `Municipality` level data is simply a re-aggregation of `district` level data using the official relations file where `district` identifiers are mapped to `municipality` identifiers (orginal file is [`relaciones_distrito_mitma.csv`](https://opendata-movilidad.mitma.es/relaciones_distrito_mitma.csv){target="_blank"}).
**Getting the data** To access the data, use the `spod_get()` function. In this example we will use a short interval of dates: ```{r} dates <- c(start = "2020-02-14", end = "2020-02-17") od_dist <- spod_get(type = "od", zones = "dist", dates = dates) od_muni <- spod_get(type = "od", zones = "muni", dates = dates) ``` The data for the specified dates will be automatically downloaded and cached in the `SPANISH_OD_DATA_DIR` directory. Existing files will not be re-downloaded. **Working with the data** The resulting objects `od_dist` and `od_muni` are of class `tbl_duckdb_connection`[^1]. Basically, you can treat these as regular `data.frame`s or `tibble`s. One important difference is that the data is not actually loaded into memory, because if you requested more dates, e.g. a whole month or a year, all that data would most likely not fit into your computer's memory. A `tbl_duckdb_connection` is mapped to the downloaded CSV files that are cached on disk and the data is only loaded in small chunks as needed at the time of computation. You can manipulate `od_dist` and `od_muni` using `{dplyr}` functions such as `select()`, `filter()`, `mutate()`, `group_by()`, `summarise()`, etc. In the end of any sequence of commands you will need to add `collect()` to execute the whole chain of data manipulations and load the results into memory in an R `data.frame`/`tibble` like so: [^1]: For reference: this object also has classes: `tbl_dbi` ,`tbl_sql`, `tbl_lazy` ,and `tbl` . ```{r} library(dplyr) od_mean_hourly_trips_over_the_4_days <- od_dist |> group_by(time_slot) |> summarise( mean_hourly_trips = mean(n_trips, na.rm = TRUE), .groups = "drop") |> collect() od_mean_hourly_trips_over_the_4_days ``` ``` # A tibble: 24 × 2 time_slot mean_hourly_trips 1 18 21.4 2 10 19.3 3 2 14.8 4 15 19.8 5 11 19.9 6 16 19.6 7 22 20.9 8 0 18.6 9 13 21.1 10 19 22.5 # ℹ 14 more rows # ℹ Use `print(n = ...)` to see more rows ``` In this example above, we calculated mean hourly flows over the 4 days of the requested period. The full data for all 4 days was probably never loaded into memory all at once. Rather the available memory of the computer was used up to its maximum limit to make that calculation happen, without ever exceeding the available memory limit. If you were doing the same opearation on 100 or even more days, it would work in the same way and would be possible even with limited memory. This is done transparantly to the user with the help of [`DuckDB`](https://duckdb.org/){target="_blank"} (specifically, with [{duckdb} R package](https://r.duckdb.org/index.html){target="_blank"} @duckdb-r). The same summary operation as provided in the example above can be done with the entire dataset for the full 18 month on a regular laptop with 8-16 GB memory. It will take a bit of time to complete, but it will be done. To speed things up, please also see the [vignette on converting the data](convert.qmd) into formats that will increase the analsysis performance. {{< include ../inst/vignette-include/csv-date-filter-note.qmd >}} ## 2.2. Number of trips data {#nt-data} The "number of trips" data shows the number of individuals in each district or municipality who made trips categorised by the number of trips. | **English Variable Name** | **Original Variable Name** | **Type** | **Description** | |----------------|----------------|----------------|------------------------| | `date` | `fecha` | `Date` | The date of the recorded data, formatted as `YYYY-MM-DD`. | | `id` | `distrito` | `factor` | The identifier of the `district` or `municipality` zone. | | `n_trips` | `numero_viajes` | `factor` | The number of individuals who made trips, categorized by `0`, `1`, `2`, or `2+` trips. | | `n_persons` | `personas` | `factor` | The number of persons making the trips from `district` or `municipality` with zone `id`. | | `year` | `year` | `integer` | The year of the recorded data, extracted from the date. | | `month` | `month` | `integer` | The month of the recorded data, extracted from the date. | | `day` | `day` | `integer` | The day of the recorded data, extracted from the date. |
Data transformation note The original data is stored in the [`maestra-2`](https://www.transportes.gob.es/ministerio/proyectos-singulares/estudios-de-movilidad-con-big-data/estudios-de-movilidad-anteriores/covid-19/opendata-movilidad){target="_blank"} folder with suffixes `distritos` (for district zoning) and `municipios` (for municipality zoning). We only use the `district` level data because of several data issues with the `municipality` data documented [here](http://www.ekotov.pro/mitma-data-issues/issues/011-v1-tpp-mismatch-zone-ids-in-table-and-spatial-data.html){target="_blank"} and [here](http://www.ekotov.pro/mitma-data-issues/issues/012-v1-tpp-district-files-in-municipality-folders.html){target="_blank"}, but also because the distric level data contains more columns with useful origin-destination flow characteristics. As a result, you get both the `district` level data and the `municipality` level data with the same columns. `Municipality` level data is simply a re-aggregation of `district` level data using the official relations file where `district` identifiers are mapped to `municipality` identifiers (orginal file is [`relaciones_distrito_mitma.csv`](https://opendata-movilidad.mitma.es/relaciones_distrito_mitma.csv){target="_blank"}).
**Getting the data** To access it use `spod_get()` with `type` set to "number_of_trips", or just "nt". We can also set `dates` to the maximum possible date range `2020-02-14` to `2021-05-09` to get all the data, as this data is relatively small (under 200 Mb). ```{r} dates <- c(start = "2020-02-14", end = "2021-05-09") nt_dist <- spod_get(type = "number_of_trips", zones = "dist", dates = dates) ``` Because this data is small, we can actually load it completely into memory: ```{r} nt_dist_tbl <- nt_dist |> dplyr::collect() ``` # Advanced use For more advanced use, especially for analysing longer periods (months or even years), please see [Download and convert mobility datasets](convert.html).