---
title: "Explore penguins"
author: "Roland Krasser"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Explore penguins}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## How to explore the penguins dataset using the explore package

The explore package simplifies Exploratory Data Analysis (EDA). Get faster insights with less code!

We will use fewer than 10 lines of code and just 6 function names to explore penguins:

| function         | package   | description                             |
|------------------|-----------|-----------------------------------------|
| `library()`      | {base}    | load a package                          |
| `filter()`       | {dplyr}   | subset rows using column values         |
| `describe()`     | {explore} | describe variables of the table         |
| `explore()`      | {explore} | explore a variable graphically          |
| `explore_all()`  | {explore} | explore all variables of the table      |
| `explain_tree()` | {explore} | explain a target using a decision tree  |

The `penguins` dataset comes with the palmerpenguins package. It has 344 observations and 8 variables.

Furthermore, we use the packages {dplyr} for `filter()` and `%>%` and {explore} for data exploration.

```{r message=FALSE, warning=FALSE}
library(dplyr)
library(explore)
penguins <- use_data_penguins()
# equivalent to
# penguins <- palmerpenguins::penguins
```

### Describe variables

```{r message=FALSE, warning=FALSE}
penguins %>% describe()
```

There are some `NA` values (unknown values) in the data. The variable containing the most `NA` values is `sex`. `flipper_length_mm` and some other variables contain only 2 observations with `NA`.

### Data cleaning

We use only penguins with known flipper length for the data exploration! Because `filter()` keeps only rows where the condition is `TRUE`, the comparison `flipper_length_mm > 0` also drops the rows where flipper length is `NA`.

```{r}
data <- penguins %>%
  filter(flipper_length_mm > 0)
```

This reduces the dataset from 344 to 342 penguins.
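
The following is a minimal sketch of why a numeric comparison inside `filter()` also removes missing values. The data frame `toy` and its values are made up for illustration and are not part of the penguins data; the second call shows a more explicit way to state the same intent.

```{r}
library(dplyr)

# Toy data with one missing flipper length (illustrative values only)
toy <- data.frame(
  id = 1:4,
  flipper_length_mm = c(181, NA, 195, 210)
)

# NA > 0 evaluates to NA, and filter() keeps only rows where the condition is TRUE
kept <- toy %>%
  filter(flipper_length_mm > 0)

# A more explicit way to express "known flipper length"
kept_explicit <- toy %>%
  filter(!is.na(flipper_length_mm))

kept
kept_explicit
```

Both calls keep the same rows here. The `is.na()` form may be easier to read when the goal is only to drop missing values.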

### Explore variables

```{r message=FALSE, warning=FALSE, fig.width=8, fig.height=total_fig_height(data, size = 2.5)}
data %>%
  explore_all(color = "skyblue")
```

### Which species?

What is the relationship between all the variables and species?

```{r message=FALSE, warning=FALSE, fig.width=8, fig.height=total_fig_height(data, var_name_target = "species", size = 2.2)}
data %>%
  explore_all(
    target = species,
    color = c("darkorange", "purple", "lightseagreen"))
```

We already see some strong patterns in the data. `flipper_length_mm` separates Gentoo from the other species, and `bill_length_mm` separates Adelie from Chinstrap. We also see that Chinstrap and Gentoo are located on separate islands.

Now we explain species using a decision tree:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=4}
data %>% explain_tree(target = species)
```

We found a simple way to determine the species using just `flipper_length_mm` and `bill_length_mm`:

* If `flipper_length_mm >= 207`, it is a Gentoo penguin (95% right)
* If `flipper_length_mm < 207` and `bill_length_mm < 43`, it is an Adelie penguin (97% right)
* If `flipper_length_mm < 207` and `bill_length_mm >= 43`, it is a Chinstrap penguin (92% right)

Now let's take a closer look at these variables:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
data %>%
  explore(
    flipper_length_mm, bill_length_mm,
    target = species,
    color = c("darkorange", "purple", "lightseagreen")
  )
```

The plot shows a good, though not perfect, separation between the 3 species!
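
The three rules read off the decision tree above can also be written out by hand and compared with the actual species. This is only a sketch: it assumes the rounded thresholds 207 and 43 quoted above (the fitted tree may use slightly different split values), and the names `rules_check` and `predicted` are illustrative choices.

```{r}
library(dplyr)
library(explore)

# Same preparation as earlier in this vignette
data <- use_data_penguins() %>%
  filter(flipper_length_mm > 0)

# Hand-written version of the three rules (thresholds are the rounded values from the text)
rules_check <- data %>%
  mutate(
    predicted = case_when(
      flipper_length_mm >= 207 ~ "Gentoo",
      bill_length_mm < 43 ~ "Adelie",
      TRUE ~ "Chinstrap"
    )
  )

# Cross-tabulate the actual species against the rule-based guess
rules_check %>%
  count(species, predicted)
```

Rows where `species` and `predicted` differ are the penguins the simple rules get wrong.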