forstringr

R-CMD-check CRAN_Status_Badge metacran downloads

Overview

The string (or character) data type typically requires more manipulation to be helpful for data analysts. Thus, there is a need for a robust package that is up to the task. forstringr is a new package built on top of ‘stringr’ to execute various string manipulations in R programming. The main aim of ‘forstringr’ is to simplify string manipulation for R beginners. This package combines its power with the adaptability of other manipulation tools such as tidyr and dplyr. Like in the stringr package, most functions in forstringr begin with str_. For a quick video tutorial, I gave a talk at Africa R users meetup, which you can find here.

Installation

You can install forstringr package from CRAN with:

install.packages("forstringr")

or the development version from GitHub with

if(!require("devtools")){
 install.packages("devtools")
}

devtools::install_github("gbganalyst/forstringr")

Usage

This section provides a concise overview of the different functions available in the forstringr package. These functions serve various purposes and are designed to aid in string manipulation tasks.

length_omit_na()

length_omitna() counts only non-missing elements of a vector.

library(forstringr)
#> Loading required package: stringr

ethnicity <- c("Hausa", NA, "Yoruba", "Ijaw", "Igbo", NA, "Ibibio", "Tiv", "Fulani", "Kanuri", "Others")

# count all the observations, including NAs.

length(ethnicity) 
#> [1] 11


# count all the observations, without NAs.

length_omit_na(ethnicity)
#> [1] 9

str_title_case()

str_title_case() converts string to title case, capitalizing only the first letter of each word while ignoring articles, prepositions, and conjunctions.

Please note that str_title_case() is different from stringr::str_to_title() which converts to title case, where only the first letter of each word is capitalized.


words <- "the quick brown fox jumps over a lazy dog"

str_title_case(words) # from forstringr package
#> [1] "The Quick Brown Fox Jumps over a Lazy Dog"

str_to_title(words) # from stringr package
#> [1] "The Quick Brown Fox Jumps Over A Lazy Dog"

str_left()

Given a character vector, str_left() returns the left side of a string. For examples:


str_left("Nigeria")
#> [1] "N"

str_left("Nigeria", n = 3)
#> [1] "Nig"

str_left(c("Female", "Male", "Male", "Female"))
#> [1] "F" "M" "M" "F"

str_right()

Given a character vector, str_right() returns the right side of a string. For examples:


str_right("July 20, 2022", 4)
#> [1] "2022"

str_right("Sale Price", n = 5)
#> [1] "Price"

str_mid()

Like in Microsoft Excel, the str_mid()returns a specific number of characters from a text string, starting at the position you specify, based on the number of characters you select.

str_mid("Super Eagle", 7, 5)
#> [1] "Eagle"

str_mid("Oyo Ibadan", 5, 6)
#> [1] "Ibadan"

str_split_extract()

If you want to split up a string into pieces and extract the results using a specific index position, then, you will use str_split_extract(). You can interpret it as follows:

Given a character string, S, extract the element at a given position, k, from the result of splitting S by a given pattern, m. For example:

top_10_richest_nig <- c("Aliko Dangote", "Mike Adenuga", "Femi Otedola", "Arthur Eze", "Abdulsamad Rabiu", "Cletus Ibeto", "Orji Uzor Kalu", "ABC Orjiakor", "Jimoh Ibrahim", "Tony Elumelu")

first_name <- str_split_extract(top_10_richest_nig, " ", 1)

first_name
#>  [1] "Aliko"      "Mike"       "Femi"       "Arthur"     "Abdulsamad"
#>  [6] "Cletus"     "Orji"       "ABC"        "Jimoh"      "Tony"

str_extract_part()

Extract strings before or after a given pattern. For example:

first_name <- str_extract_part(top_10_richest_nig,  pattern = " ", before = TRUE)

first_name
#>  [1] "Aliko"      "Mike"       "Femi"       "Arthur"     "Abdulsamad"
#>  [6] "Cletus"     "Orji Uzor"  "ABC"        "Jimoh"      "Tony"

revenue <- c("$159", "$587", "$891", "$207", "$793")

str_extract_part(revenue, pattern = "$", before = FALSE)
#> [1] "159" "587" "891" "207" "793"

str_englue()

You can dynamically label ggplot2 plots with the glue operators {} or {{}} using str_englue(). For example, any value wrapped in { } will be inserted into the string and you automatically inserts a given variable name using {{ }}.

It is important to note that str_englue() must be used inside a function. str_englue("{{ var }}") defuses the argument var and transforms it to a string using the default name operation.

library(ggplot2)

histogram_plot <- function(df, var, binwidth) {
 df %>%  
   ggplot(aes(x = {{ var }})) +
   geom_histogram(binwidth = binwidth) +
   labs(title = str_englue("A histogram of {{var}} with binwidth {binwidth}"))
}

iris %>% 
  histogram_plot(Sepal.Length, binwidth = 0.1)

str_rm_whitespace_df()

Extra spaces are accidentally entered when working with survey data, and problems can arise when evaluating such data because of extra spaces. Therefore, the function str_rm_whitespace_df() eliminates your data frame unnecessary leading, trailing, or other whitespaces.

# a dataframe with whitespaces

richest_in_nigeria
#> # A tibble: 10 × 5
#>     Rank Name                   `Net worth`         Age `Source of Wealth`      
#>    <dbl> <chr>                  <chr>             <dbl> <chr>                   
#>  1     1 " Aliko Dangote"       "$14 Billion"        64 "  Cement and Sugar "   
#>  2     2 "Mike Adenuga"         "$7.9  Billion "     68 "Telecommunication,    …
#>  3     3 "Femi   Otedola"       "$5.9   Billion"     59 "Oil  and Gas"          
#>  4     4 " Arthur Eze"          "$5 Billion"         73 "Oil and Gas"           
#>  5     5 "Abdulsamad     Rabiu" "$3.7 Billion"       61 "Cement   and Sugar"    
#>  6     6 " Cletus Ibeto "       " $3.5 Billion"      69 "Automobile, Real Estat…
#>  7     7 "Orji Uzor Kalu"       "$3.2 Billion"       61 "Furniture,    Publishi…
#>  8     8 "ABC Orjiakor "        "  $1.2 Billion"     63 "Oil and Gas"           
#>  9     9 "  Jimoh Ibrahim"      "$1 Billion "        54 "Insurance, Oil and Gas…
#> 10    10 "Tony   Elumelu"       "$900    Million"    58 "  Banking  "
# a dataframe with no whitespaces

str_rm_whitespace_df(richest_in_nigeria)
#> # A tibble: 10 × 5
#>     Rank Name             `Net worth`    Age `Source of Wealth`                 
#>    <dbl> <chr>            <chr>        <dbl> <chr>                              
#>  1     1 Aliko Dangote    $14 Billion     64 Cement and Sugar                   
#>  2     2 Mike Adenuga     $7.9 Billion    68 Telecommunication, Oil, and Gas    
#>  3     3 Femi Otedola     $5.9 Billion    59 Oil and Gas                        
#>  4     4 Arthur Eze       $5 Billion      73 Oil and Gas                        
#>  5     5 Abdulsamad Rabiu $3.7 Billion    61 Cement and Sugar                   
#>  6     6 Cletus Ibeto     $3.5 Billion    69 Automobile, Real Estate            
#>  7     7 Orji Uzor Kalu   $3.2 Billion    61 Furniture, Publishing              
#>  8     8 ABC Orjiakor     $1.2 Billion    63 Oil and Gas                        
#>  9     9 Jimoh Ibrahim    $1 Billion      54 Insurance, Oil and Gas, Real Estate
#> 10    10 Tony Elumelu     $900 Million    58 Banking