--- title: "Getting started with DescriptiveStats.OBeu" author: "Aikaterini Chatzopoulou, Kleanthis Koupidis, Charalampos Bratsas" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting started with DescriptiveStats.OBeu} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- `DescriptiveStats.OBeu` estimates the descriptive statistical measures, needed at [OpenBudgets.eu](http://openbudgets.eu/). You can measure central tendency and dispersion of numeric variables along with their distributions and correlations and the frequencies of categorical variables for a given dataset on [OpenBudgets.eu data mining tool platform](http://openbudgets.eu/tools/). The vignette provides an effective way to use functions of `DescriptiveStats.OBeu` with datasets including datasets of OpenBudgets.eu. `tojson` parameter is used in `ds.analysis`, `ds.statistics`, `ds.hist`, `ds.boxplot`, `ds.correlation`, `ds.frequency`, `ds.kurtosis`, `ds.skewness` functions in order to specify if the resulted object should be in json format. First you have to load the library ```{r load, warning=FALSE, include=TRUE} # load DescriptiveStats.OBeu library(DescriptiveStats.OBeu) ``` ## Data in the package The data in the package include the budget of Wuppertal for 2009 to 2020, as a data frame `Wuppertal_df` and as a json link `Wuppertal_openspending` as well as a sample json link `sample_json_link_openspending`, which you can access them using `fromJSON` of `jsonlite` package or copy paste the link to a browser. Wuppertal internal structure ```{r, echo=FALSE, include = TRUE} str(Wuppertal_df) ``` ## Descriptive Statistics in a call `ds.analysis` is used to estimate *minimum*, *maximum*, *range*, *mean*, *median*, *first and third quantiles*, *variance*, *standart deviation*, *skewness* and *kurtosis*, *boxplot*, *histogram parameters* needed for visualization of numeric variables and *frequencies* of factor variables of a given vector, matrix or data frame of data. +---------------+------------------------+-----------------------------------------------------------+ | Component | Output | Description | +===============+========================+===========================================================+ | `statistics` | - Min | - The minimum observed value of the input data | | | - Max | - The maximum observed value of the input data | | | - Range | - The difference between maximum and minimum | | | - Mean | - The average value of the input data | | | - Median | - The median value of the input data | | | - Quantiles | - The 25%, 75% percentiles | | | - Variance | - The variance of the input data | | | - StandardDeviation | - The standard deviation of the input data | | | - Skewness | - The Skewness of the input data | | | - Kurtosis | - The Kurtosis of the input data | +---------------+------------------------+-----------------------------------------------------------+ | `boxplot` | - lo.whisker | - Lower horizontal line out of the box | | | - lo.hinge | - Lower horizontal line of the box | | | - median | - Horizontal line in the box | | | - up.hinge | - Upper horizontal line of the box | | | - up.whisker | - Upper horizontal line out of the box | | | - box.width | - The box width of each variable | | | - lo.out | - Lower outliers | | | - up.out | - Upper outliers | | | - n | - The number of non-NA observations | +---------------+------------------------+-----------------------------------------------------------+ | `histogram` | - cuts | - The boundaries of the histogram classes | | | - counts | - The frequency of each histogram class | | | - mean | - The average value of the input vector | | | - median | - The median value of the input data | +---------------+------------------------+-----------------------------------------------------------+ | `frequencies` | - Variable name | - The name of the calculated variable | | | - frequencies | - The frequency value | | | - "_row" | - Name of the categories of the variable | | | - relative.frequencies | - Relative frequency values | +---------------+------------------------+-----------------------------------------------------------+ | `correlation` | - Variable name | - The name of the calculated variable | | | - Correlation value | - The correlation value | | | - "_row" | - The corresponding correlation variable | +---------------+------------------------+-----------------------------------------------------------+ Table: *`ds.analysis`* components `ds.analysis` returns by default a list object, we set `tojson` parameter `TRUE`, `outliers` parameter `FALSE`, `fr.select = "Produktbereich"`. Correlation component is empty because there is one numeric variable. ```{r analysis} wuppertalanalysis = ds.analysis(Wuppertal_df,outliers=FALSE, fr.select = "Produktbereich", tojson=TRUE) # json string format jsonlite::prettify(wuppertalanalysis) # use prettify of jsonlite library to add indentation to the returned JSON string ``` `ds.analysis` uses internally the functions `ds.statistics`,`ds.hist`,`ds.boxplot`,`ds.correlation` and `ds.frequency`. However, these functions can be used independently and depends on the user requirements. ## Statistical measures `ds.statistics` is used to estimate *minimum*, *maximum*, *range*, *mean*, *median*, *first and third quantiles*, *variance*, *standart deviation*, *skewness* and *kurtosis* values of a given vector, matrix or data frame of data. `ds.statistics` returns by default a list object: ```{r descriptive stats} ds.statistics(Wuppertal_df) # list format ``` The results can be extracted in json format for further use you should set the parameter `tojson` to `TRUE`: ```{r descriptive stats to json} wuppertalstats = ds.statistics(Wuppertal_df, tojson = TRUE) # json format jsonlite::prettify(wuppertalstats) # use prettify of jsonlite library to add indentation to the returned JSON string ``` ## Histogram `ds.hist` computes the parameters needed to visualize a histogram of a numeric input vector, specifying the `breaks` as in base `hist` function. ```{r hist} ds.hist(Wuppertal_df$Amount, breaks= "Sturges") # list format ``` Return the results as json string: ```{r hist json} wuppertalhist = ds.hist(Wuppertal_df$Amount, breaks= "Sturges", tojson=TRUE) # json format jsonlite::prettify(wuppertalhist) # use prettify of jsonlite library to add indentation to the returned JSON string ``` ## Boxplot The `ds.boxplot` returns the parameters needed for a boxplot visualization of an input vector, matrix or data frame. If `outl` is `TRUE` the outliers will be computed at the selected `out.level` level (default is `1.5` times the *Interquartile Range*) and the width level is determined 0.15 times the square root of the size of the input data. `ds.boxplot` uses the numeric variables of the input data, you do not have to exclude factor or character variables. ```{r boxplot json} wuppertalbox = ds.boxplot(Wuppertal_df, width = 0.15 , outl = FALSE, tojson=TRUE) # json format jsonlite::prettify(wuppertalbox) # use prettify of jsonlite library to add indentation to the returned JSON string ``` ## Correlation `ds.correlation` estimate the correlation coefficient (default is `"pearson"`) of the input vectors, matrix or data frame. In this example `iris` dataset is used. Factor or character variables in the input matrix or data frame will be filtered out by default. ```{r correlation json} iriscorr = ds.correlation(iris, cor.method="pearson", tojson=TRUE) # json format jsonlite::prettify(iriscorr) # use prettify of jsonlite library to add indentation to the returned JSON string ``` ## Frequency Frequencies and relative frequencies of factors/characters of the input dataset using `ds.frequency` for `Produktbereich` from `Wuppertal_df` dataset and return as json string. ```{r frequency} wuppertalfreq = ds.frequency(Wuppertal_df$Produktbereich, tojson = TRUE) jsonlite::prettify(wuppertalfreq) # use prettify of jsonlite library to add indentation to the returned JSON string ``` If the input is a dataframe and the `select` parameter is not specified, all the factor variables will be returned. All the numeric variables of the input data are filtered out of the estimations internally. ## Kurtosis This function calculates kurtosis of the input vector, matrix or data frame. Factor or character variables that may be included in the input matrix or data frame, will be omitted in the estimations. ```{r kurtosis} ds.kurtosis(Wuppertal_df$Amount, tojson=TRUE) ``` ## Skewness This function calculates skewness of the input vector, matrix or data frame. Factor or character variables that may be included in the input matrix or data frame, will be omitted in the estimations. ```{r skewness} ds.skewness(Wuppertal_df$Amount, tojson=TRUE) ```