---
title: "Data selection"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data selection}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette describes how to retrieve data from a coin. The main functions for this are `get_dset()` and the more flexible `get_data()`. These functions are important to understand, because many COINr functions use them to retrieve data for plotting, analysis and other purposes. Both functions are *generics*, which means that they have methods for coins and purses.

# Data sets

Every time a "building" operation is applied to a coin, such as `Treat()`, `Screen()`, `Normalise()` and so on, a new data set is created. Data sets live in the `.$Data` sub-list of the coin. We can retrieve a data set at any time using the `get_dset()` function:

```{r}
library(COINr)

# build full example coin
coin <- build_example_coin(quietly = TRUE)

# retrieve normalised data set
dset_norm <- get_dset(coin, dset = "Normalised")

# view first few rows and cols
head(dset_norm[1:5], 5)
```

By default, a data set in the coin consists of indicator columns plus the "uCode" column, which is the unique identifier of each row. You can also ask to attach unit metadata columns, such as unit names, groups, and anything else that was input when building the coin, using the `also_get` argument:

```{r}
# retrieve normalised data set, with unit names and GDP group attached
dset_norm2 <- get_dset(coin, dset = "Normalised", also_get = c("uName", "GDP_group"))

# view first few rows and cols
head(dset_norm2[1:5], 5)
```

# Data subsets

While `get_dset()` is a quick way to retrieve an entire data set and metadata, the `get_data()` function is a generalisation: it can return a whole data set, but also subsets of data, selected by indicators and indicator groups (columns) as well as by units and unit groups (rows).

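
As a minimal sketch of the "whole data set" case, the chunk below calls `get_data()` without `iCodes` and compares the result with `get_dset()`. It assumes that leaving `iCodes` unspecified selects every column in the data set; the object names `x_all` and `x_dset` are illustrative only.

```{r}
library(COINr)

# reuse the example coin if it already exists in the session
if (!exists("coin")) coin <- build_example_coin(quietly = TRUE)

# no iCodes specified: assumed to return every indicator in the data set
x_all <- get_data(coin, dset = "Raw")

# same data set retrieved with get_dset(), for comparison
x_dset <- get_dset(coin, dset = "Raw")

# compare dimensions and check whether the column names match
dim(x_all)
dim(x_dset)
setequal(names(x_all), names(x_dset))
```

If the two results agree, `get_dset()` is the shorter call when no subsetting is needed, and `get_data()` is worth using once rows or columns have to be narrowed down.
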

## Indicators/columns

A simple example is to extract one or more named indicators from a target data set:

```{r}
x <- get_data(coin, dset = "Raw", iCodes = c("Flights", "LPI"))

# see first few rows
head(x, 5)
```

By default, `get_data()` returns the requested indicators, plus the `uCode` identifier column. We can also set `also_get = "none"` to return only the indicator columns.

The `iCodes` argument can also accept groups of indicators, based on the structure of the index. In our example, indicators are aggregated into "pillars" (level 2) within groups. We can name an aggregation group and extract the underlying indicators:

```{r}
x <- get_data(coin, dset = "Raw", iCodes = "Political", Level = 1)
head(x, 5)
```

Here we have requested all the indicators in level 1 (the indicator level) that belong to the group called "Political" (one of the pillars). Specifying the level becomes more relevant when we look at the aggregated data set, which also includes the pillar, sub-index and index scores. Here, for example, we can ask for all the pillar scores (level 2) which belong to the sustainability sub-index (level 3):

```{r}
x <- get_data(coin, dset = "Aggregated", iCodes = "Sust", Level = 2)
head(x, 5)
```

If this isn't clear, look at the structure of the example index using e.g. `plot_framework(coin)`. If we wanted to select all the indicators within the "Sust" sub-index we would set `Level = 1`. If we wanted to select the sub-index scores themselves we would set `Level = 3`, and so on.

The idea of selecting indicators and aggregates based on the structure of the index is useful in many places in COINr, for example when examining correlations within aggregation groups using `plot_corr()`.

## Units/rows

Units (rows) of the data set can also be selected, including in combination with indicator selection.
Starting with a simple example, let's select specified units for a specific indicator:

```{r}
get_data(coin, dset = "Raw", iCodes = "Goods", uCodes = c("AUT", "VNM"))
```

Rows can also be subsetted using groups, i.e. unit groupings that are defined using variables input with `iMeta$Type = "Group"` when building the coin. Recall that for our example coin we have several groups (a reminder that you can see some details about the coin using its print method):

```{r}
coin
```

The first way to subset by unit group is to name a grouping variable, and a group within that variable to select. For example, say we want to know the values of the "Goods" indicator for all the countries in the "XL" GDP group:

```{r}
get_data(coin, dset = "Raw", iCodes = "Goods", use_group = list(GDP_group = "XL"))
```

Since we have subsetted by group, this also returns the group column which was used.

Another way of subsetting is to combine `uCodes` and `use_group`. When both arguments are specified, the result is the full group(s) to which the specified `uCodes` belong. This can be used to put a unit in context with its peers within a group. For example, we might want to see the values of the "Flights" indicator for a specific unit, as well as all other units within the same population group:

```{r}
get_data(coin, dset = "Raw", iCodes = "Flights", uCodes = "MLT", use_group = "Pop_group")
```

Here, we have to specify `use_group` simply as a string rather than a list. Since MLT is in the "S" population group, it returns all units within that group.

Overall, the idea of `get_data()` is to flexibly return subsets of indicator data, based on the structure of the index and unit groups.

# Manual selection

As a final point, it's worth pointing out that a coin is simply a list of R objects such as data frames, other lists, vectors and so on. It has a particular format which allows things to be easily accessed by COINr functions. But other than that, it's an ordinary R object.
This means that even without the helper functions mentioned, you can get at the data simply by exploring the coin yourself. The data sets live in the `.$Data` sub-list of the coin:

```{r}
names(coin$Data)
```

And we can access any of these directly:

```{r}
data_raw <- coin$Data$Raw
head(data_raw[1:5], 5)
```

The metadata lives in the `.$Meta` sub-list. For example, the unit metadata, which includes groups, names etc:

```{r}
str(coin$Meta$Unit)
```

The point is that if COINr tools don't get you where you want to go, knowing your way around the coin allows you to access the data exactly how you want.
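
As a rough illustration of this kind of manual access, the chunk below selects two units and one indicator using only base R indexing on `coin$Data$Raw`, and then attaches a group column from the unit metadata. It assumes that `coin$Meta$Unit` is a data frame containing `uCode` and `GDP_group` columns; the object name `manual` is illustrative only.

```{r}
library(COINr)

# reuse the example coin if it already exists in the session
if (!exists("coin")) coin <- build_example_coin(quietly = TRUE)

# raw data set, accessed directly from the coin
data_raw <- coin$Data$Raw

# rows: two units; columns: identifier plus one indicator
manual <- data_raw[data_raw$uCode %in% c("AUT", "VNM"), c("uCode", "Goods")]
manual

# attach a group column by merging with the unit metadata on uCode
# (assumes coin$Meta$Unit holds "uCode" and "GDP_group" columns)
merge(coin$Meta$Unit[c("uCode", "GDP_group")], manual, by = "uCode")
```

This is a sketch of one possible manual route rather than a replacement for `get_data()`, which performs the same kind of selection while also handling index levels and unit groups.
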