--- title: "Converting dates, times or date-times to ISO 8601" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Converting dates, times or date-times to ISO 8601} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(sdtm.oak) ``` An SDTM DTC variable may include data that is represented in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format as a complete date/time, a partial date/time, or an incomplete date/time. `{sdtm.oak}` provides the `create_iso8601()` function that allows flexible mapping of date and time values in various formats to a single date-time ISO 8601 format. ## Introduction To perform conversion to the ISO 8601 format you need to pass two key arguments: - At least one vector of dates, times, or date-times of `character` type; - A date/time format via the `.format` parameter that instructs `create_iso8601()` on which date/time components to expect. ```{r} create_iso8601("2000 01 05", .format = "y m d") create_iso8601("22:35:05", .format = "H:M:S") ``` By default the `.format` parameter understands a few reserved characters: - `"y"` for year - `"m"` for month - `"d"` for day - `"H"` for hours - `"M"` for minutes - `"S"` for seconds Besides character vectors of dates and times, you may also pass a single vector of date-times, provided you adjust the format: ```{r} create_iso8601("2000-01-05 22:35:05", .format = "y-m-d H:M:S") ``` ## Multiple inputs If you have dates and times in separate vectors then you will need to pass a format for each vector: ```{r} create_iso8601("2000-01-05", "22:35:05", .format = c("y-m-d", "H:M:S")) ``` In addition, like most R functions that take vectors as input, `create_iso8601()` is vectorized: ```{r} date <- c("2000-01-05", "2001-12-25", "1980-06-18", "1979-09-07") time <- c("00:12:21", "22:35:05", "03:00:15", "07:09:00") create_iso8601(date, time, .format = c("y-m-d", "H:M:S")) ``` But the number of elements in each of the inputs has to match or you will get an error: ```{r} date <- c("2000-01-05", "2001-12-25", "1980-06-18", "1979-09-07") time <- "00:12:21" try(create_iso8601(date, time, .format = c("y-m-d", "H:M:S"))) ``` You can combine individual date and time components coming in as separate inputs; here is a contrived example of year, month and day together, hour, and minute: ```{r} year <- c("99", "84", "00", "80", "79", "1944", "1953") month_and_day <- c("jan 1", "apr 04", "mar 06", "jun 18", "sep 07", "sep 13", "sep 14") hour <- c("12", "13", "05", "23", "16", "16", "19") min <- c("0", "60", "59", "42", "44", "10", "13") create_iso8601(year, month_and_day, hour, min, .format = c("y", "m d", "H", "M")) ``` The `.format` argument must be always named; otherwise, it will be treated as if it were one of the inputs and interpreted as missing. ```{r} try(create_iso8601("2000-01-05", "y-m-d")) ``` ## Format variations The `.format` parameter can easily accommodate variations in the format of the inputs: ```{r} create_iso8601("2000-01-05", .format = "y-m-d") create_iso8601("2000 01 05", .format = "y m d") create_iso8601("2000/01/05", .format = "y/m/d") ``` Individual components may come in a different order, so adjust the format accordingly: ```{r} create_iso8601("2000 01 05", .format = "y m d") create_iso8601("05 01 2000", .format = "d m y") create_iso8601("01 05, 2000", .format = "m d, y") ``` All other individual characters given in the format are taken strictly, e.g. the number of spaces matters: ```{r} date <- c("2000 01 05", "2000 01 05", "2000 01 05", "2000 01 05") create_iso8601(date, .format = "y m d") create_iso8601(date, .format = "y m d") create_iso8601(date, .format = "y m d") create_iso8601(date, .format = "y m d") ``` The format can include regular expressions though: ```{r} create_iso8601(date, .format = "y\\s+m\\s+d") ``` By default, a streak of the reserved characters is treated as if only one was provided, so these formats are equivalent: ```{r} date <- c("2000-01-05", "2001-12-25", "1980-06-18", "1979-09-07") time <- c("00:12:21", "22:35:05", "03:00:15", "07:09:00") create_iso8601(date, time, .format = c("y-m-d", "H:M:S")) create_iso8601(date, time, .format = c("yyyy-mm-dd", "HH:MM:SS")) create_iso8601(date, time, .format = c("yyyyyyyy-m-dddddd", "H:MMMMM:SSSS")) ``` ## Multiple alternative formats When an input vector contains values with varying formats, a single format may not be adequate to encompass all variations. In such situations, it's advisable to list multiple alternative formats. This approach ensures that each format is tried sequentially until one matches the data in the vector. ```{r} date <- c("2000/01/01", "2000-01-02", "2000 01 03", "2000/01/04") create_iso8601(date, .format = "y-m-d") create_iso8601(date, .format = "y m d") create_iso8601(date, .format = "y/m/d") create_iso8601(date, .format = list(c("y-m-d", "y m d", "y/m/d"))) ``` Consider the order in which you supply the formats, as it can be significant. If multiple formats could potentially match, the sequence determines which format is applied first. ```{r} create_iso8601("07 04 2000", .format = list(c("d m y", "m d y"))) create_iso8601("07 04 2000", .format = list(c("m d y", "d m y"))) ``` Note that if you are passing alternative formats, then the `.format` argument must be a list whose length matches the number of inputs. ## Parsing of date or time components By default, date or time components are parsed as follows: - year: either parsed from a two- or four-digit year; - month: either as a numeric month (single or two-digit number) or as an English abbreviated month name (e.g. Jan, Jun or Dec) regardless of case; - month day: are parsed from two-digit numbers; - hour and minute: are parsed from single or two-digit numbers; - second: is parsed from single or two-digit numbers with an optional fractional part. ```{r} # Years: two-digit or four-digit numbers. years <- c("0", "1", "00", "01", "15", "30", "50", "68", "69", "80", "99") create_iso8601(years, .format = "y") # Adjust the point where two-digits years are mapped to 2000's or 1900's. create_iso8601(years, .format = "y", .cutoff_2000 = 20L) # Both numeric months (two-digit only) and abbreviated months work out of the box months <- c("0", "00", "1", "01", "Jan", "jan") create_iso8601(months, .format = "m") # Month days: single or two-digit numbers, anything else results in NA. create_iso8601(c("1", "01", "001", "10", "20", "31"), .format = "d") # Hours create_iso8601(c("1", "01", "001", "10", "20", "31"), .format = "H") # Minutes create_iso8601(c("1", "01", "001", "10", "20", "60"), .format = "M") # Seconds create_iso8601(c("1", "01", "23.04", "001", "10", "20", "60"), .format = "S") ``` ## Allowing alternative date or time values If date or time component values include special values, e.g. values encoding missing values, then you can indicate those values as possible alternatives such that the parsing will tolerate them; use the `.na` argument: ```{r} create_iso8601("U DEC 2019 14:00", .format = "d m y H:M") create_iso8601("U DEC 2019 14:00", .format = "d m y H:M", .na = "U") create_iso8601("U UNK 2019 14:00", .format = "d m y H:M") create_iso8601("U UNK 2019 14:00", .format = "d m y H:M", .na = c("U", "UNK")) ``` In this case you could achieve the same result using regexps: ```{r} create_iso8601("U UNK 2019 14:00", .format = "(d|U) (m|UNK) y H:M") ``` ## Changing reserved format characters There might be cases when the reserved characters --- `"y"`, `"m"`, `"d"`, `"H"`, `"M"`, `"S"` --- might get in the way of specifying an adequate format. For example, you might be tempted to use format `"HHMM"` to try to parse a time such as `"14H00M"`. You could assume that the first "H" codes for parsing the hour, and the second "H" to be a literal "H" but, actually, `"HH"` will be taken to mean parsing hours, and `"MM"` to parse minutes. You can use the function `fmt_cmp()` to specify alternative format regexps for the format, replacing the default characters. In the next example, we reassign new format strings for the hour and minute components, thus freeing the `"H"` and `"M"` patterns from being interpreted as hours and minutes, and to be taken literally: ```{r} create_iso8601("14H00M", .format = "HHMM") create_iso8601("14H00M", .format = "xHwM", .fmt_c = fmt_cmp(hour = "x", min = "w")) ``` Note that you need to make sure that the format component regexps are mutually exclusive, i.e. they don't have overlapping matches; otherwise `create_iso8601()` will fail with an error. In the next example both months and minutes could be represented by an `"m"` in the format resulting in an ambiguous format specification. ```{r} fmt_cmp(hour = "h", min = "m") try(create_iso8601("14H00M", .format = "hHmM", .fmt_c = fmt_cmp(hour = "h", min = "m"))) ```