Working with messy dates

Henrique Sposito

2024-04-19

Why {messydates}?

Dates are often messy. Whether an event is historical (or ancient), recent, or still in the future, we sometimes know only approximately when it occurred, or only that it happened within a particular period. An unreliable source may mean that a date should be flagged as uncertain, and different sources may offer multiple, competing dates.

The goal of {messydates} is to help with this problem by retaining and working with various kinds of date imprecision. The package contains a set of tools for constructing objects of the mdate class and for coercing to and from it. This date class implements ISO 8601-2:2019(E), which allows regular dates to be annotated to express unspecified date components, approximate or uncertain date components, date ranges, and sets of dates.

Working with messydates: 2001 Battles

Take, for example, the names and dates of battles in 2001 according to Wikipedia, included in the battles dataset that ships with the package. The dates of these battles are often uncertain or approximate, and they are reported with different levels of precision.

library(messydates)
battles <- messydates::battles
battles
## # A tibble: 20 × 5
##    Battle                               Date           Parties US_party N_actors
##    <chr>                                <mdate>        <chr>      <dbl>    <dbl>
##  1 Operation MH-2                       2001-03-08   … MK-Nat…        0        2
##  2 2001 Bangladesh–India border clashes 2001-04-16..2… BD-ID          0        2
##  3 Operation Vaksince                   2001-05-25   … MK-Nat…        0        2
##  4 Alkhan-Kala operation                2001-06-22..2… RU-Che…        0        2
##  5 Battle of Vedeno                     2001-08-13..2… RU-Che…        0        2
##  6 Operation Crescent Wind              2001-10-07..2… US/UK-…        1        3
##  7 Operation Rhino                      2001-10-19..2… US-Tal…        1        2
##  8 Battle of Mazar-e-Sharif             2001-11-09   … US/Nor…        1        4
##  9 Siege of Kunduz                      2001-11-11..2… US/Nor…        1        4
## 10 Battle of Herat                      2001-11-12   … US/Nor…        1        4
## 11 Battle of Kabul                      2001-11-13..2… US/Nor…        1        3
## 12 Battle of Tarin Kowt                 2001-11-13..2… US/Eas…        1        3
## 13 Operation Trent                      2001-11-~15..… US/UK-…        1        4
## 14 Battle of Kandahar                   2001-11-22..2… US/AU/…        1        4
## 15 Battle of Qala-i-Jangi               2001-11-25..2… US/UK/…        1        5
## 16 Battle of Tora Bora                  2001-12-12..2… US/Nor…        1        4
## 17 Battle of Shawali Kowt               2001-12-03   … US/Eas…        1        3
## 18 Battle of Sayyd Alma Kalay           2001-12-04   … US/Eas…        1        3
## 19 Battle of Amami-Oshima               2001-12-22   … JP-KP          0        2
## 20 Tsotsin-Yurt operation               2001-12-30..2… RU-Che…        0        2

Coerce to messydates

Previously, researchers had to remove all types of imprecision from date variables and create multiple variables to deal with date ranges. {messydates} makes it much easier to retain and work with various kinds of date imprecision. In the 2001 battles dataset, for example, dates are not reported consistently, but as_messydate() still handles the coercion to the mdate class.

battles$Date <- as_messydate(battles$Date)
battles$Date
##  'mdate' chr [1:20] "2001-03-08" "2001-04-16..2001-04-20" "2001-05-25" ...

Annotate

The annotate functions in {messydates} help annotate censored, uncertain, and approximate dates according to the ISO 8601-2:2019(E) standard. Some datasets have an arbitrary cut-off point for start and end dates, that is, they are censored. These cut-offs are often coded as precise dates even though they are not necessarily the real start or end dates. Such inexact start or end dates can be marked with a “..” affix, which indicates “on or before” when used as a prefix and “on or after” when used as a suffix. In the 2001 battles data, if we are not sure that the “Battle of Herat” began on the 12th of November or that “Operation Vaksince” actually ended on the same day it began, we can use on_or_before() and on_or_after() to annotate these dates.

battles$Date <- as_messydate(ifelse(battles$Battle == "Battle of Herat", on_or_before(battles$Date), battles$Date))
battles$Date <- as_messydate(ifelse(battles$Battle == "Operation Vaksince", on_or_after(battles$Date), battles$Date))
tibble::tibble(battles)
## # A tibble: 20 × 5
##    Battle                               Date           Parties US_party N_actors
##    <chr>                                <mdate>        <chr>      <dbl>    <dbl>
##  1 Operation MH-2                       2001-03-08   … MK-Nat…        0        2
##  2 2001 Bangladesh–India border clashes 2001-04-16..2… BD-ID          0        2
##  3 Operation Vaksince                   2001-05-25.. … MK-Nat…        0        2
##  4 Alkhan-Kala operation                2001-06-22..2… RU-Che…        0        2
##  5 Battle of Vedeno                     2001-08-13..2… RU-Che…        0        2
##  6 Operation Crescent Wind              2001-10-07..2… US/UK-…        1        3
##  7 Operation Rhino                      2001-10-19..2… US-Tal…        1        2
##  8 Battle of Mazar-e-Sharif             2001-11-09   … US/Nor…        1        4
##  9 Siege of Kunduz                      2001-11-11..2… US/Nor…        1        4
## 10 Battle of Herat                      ..2001-11-12 … US/Nor…        1        4
## 11 Battle of Kabul                      2001-11-13..2… US/Nor…        1        3
## 12 Battle of Tarin Kowt                 2001-11-13..2… US/Eas…        1        3
## 13 Operation Trent                      2001-11-~15..… US/UK-…        1        4
## 14 Battle of Kandahar                   2001-11-22..2… US/AU/…        1        4
## 15 Battle of Qala-i-Jangi               2001-11-25..2… US/UK/…        1        5
## 16 Battle of Tora Bora                  2001-12-12..2… US/Nor…        1        4
## 17 Battle of Shawali Kowt               2001-12-03   … US/Eas…        1        3
## 18 Battle of Sayyd Alma Kalay           2001-12-04   … US/Eas…        1        3
## 19 Battle of Amami-Oshima               2001-12-22   … JP-KP          0        2
## 20 Tsotsin-Yurt operation               2001-12-30..2… RU-Che…        0        2

Approximate dates are indicated by adding a ~ to the year, month, or day component, or to the whole date, to mark values that are estimated and possibly correct. Uncertainty about a day, month, or year can likewise be indicated by adding a ? to a dubious date or date component. If we doubt the reliability of the sources for the “Battle of Shawali Kowt”, and we think the reported date for the “Battle of Sayyd Alma Kalay” is only approximate, we can use as_uncertain() and as_approximate() respectively to annotate these dates.

battles$Date <- as_messydate(ifelse(battles$Battle == "Battle of Shawali Kowt", as_uncertain(battles$Date), battles$Date))
battles$Date <- as_messydate(ifelse(battles$Battle == "Battle of Sayyd Alma Kalay", as_approximate(battles$Date), battles$Date))
tibble::tibble(battles)
## # A tibble: 20 × 5
##    Battle                               Date           Parties US_party N_actors
##    <chr>                                <mdate>        <chr>      <dbl>    <dbl>
##  1 Operation MH-2                       2001-03-08   … MK-Nat…        0        2
##  2 2001 Bangladesh–India border clashes 2001-04-16..2… BD-ID          0        2
##  3 Operation Vaksince                   2001-05-25.. … MK-Nat…        0        2
##  4 Alkhan-Kala operation                2001-06-22..2… RU-Che…        0        2
##  5 Battle of Vedeno                     2001-08-13..2… RU-Che…        0        2
##  6 Operation Crescent Wind              2001-10-07..2… US/UK-…        1        3
##  7 Operation Rhino                      2001-10-19..2… US-Tal…        1        2
##  8 Battle of Mazar-e-Sharif             2001-11-09   … US/Nor…        1        4
##  9 Siege of Kunduz                      2001-11-11..2… US/Nor…        1        4
## 10 Battle of Herat                      ..2001-11-12 … US/Nor…        1        4
## 11 Battle of Kabul                      2001-11-13..2… US/Nor…        1        3
## 12 Battle of Tarin Kowt                 2001-11-13..2… US/Eas…        1        3
## 13 Operation Trent                      2001-11-~15..… US/UK-…        1        4
## 14 Battle of Kandahar                   2001-11-22..2… US/AU/…        1        4
## 15 Battle of Qala-i-Jangi               2001-11-25..2… US/UK/…        1        5
## 16 Battle of Tora Bora                  2001-12-12..2… US/Nor…        1        4
## 17 Battle of Shawali Kowt               2001-12-03?  … US/Eas…        1        3
## 18 Battle of Sayyd Alma Kalay           2001-12-04~  … US/Eas…        1        3
## 19 Battle of Amami-Oshima               2001-12-22   … JP-KP          0        2
## 20 Tsotsin-Yurt operation               2001-12-30..2… RU-Che…        0        2

Expand

Expand functions transform date ranges (annotated with ‘..’), sets of dates (annotated with ‘{ , }’), unspecified dates (with missing date components or components annotated with ‘XX’), and approximate dates (annotated with ‘~’) into lists of dates. Because such values may refer to several possible dates, expand() “opens” them to include all the dates they imply. Let’s expand the dates in the battles dataset.

expand(battles$Date)
## [[1]]
## [1] "2001-03-08"
## 
## [[2]]
## [1] "2001-04-16" "2001-04-17" "2001-04-18" "2001-04-19" "2001-04-20"
## 
## [[3]]
## [1] "2001-05-25"
## 
## [[4]]
## [1] "2001-06-22" "2001-06-23" "2001-06-24" "2001-06-25" "2001-06-26"
## [6] "2001-06-27" "2001-06-28"
## 
## [[5]]
##  [1] "2001-08-13" "2001-08-14" "2001-08-15" "2001-08-16" "2001-08-17"
##  [6] "2001-08-18" "2001-08-19" "2001-08-20" "2001-08-21" "2001-08-22"
## [11] "2001-08-23" "2001-08-24" "2001-08-25" "2001-08-26"
## 
## [[6]]
##  [1] "2001-10-07" "2001-10-08" "2001-10-09" "2001-10-10" "2001-10-11"
##  [6] "2001-10-12" "2001-10-13" "2001-10-14" "2001-10-15" "2001-10-16"
## [11] "2001-10-17" "2001-10-18" "2001-10-19" "2001-10-20" "2001-10-21"
## [16] "2001-10-22" "2001-10-23" "2001-10-24" "2001-10-25" "2001-10-26"
## [21] "2001-10-27" "2001-10-28" "2001-10-29" "2001-10-30" "2001-10-31"
## [26] "2001-11-01" "2001-11-02" "2001-11-03" "2001-11-04" "2001-11-05"
## [31] "2001-11-06" "2001-11-07" "2001-11-08" "2001-11-09" "2001-11-10"
## [36] "2001-11-11" "2001-11-12" "2001-11-13" "2001-11-14" "2001-11-15"
## [41] "2001-11-16" "2001-11-17" "2001-11-18" "2001-11-19" "2001-11-20"
## [46] "2001-11-21" "2001-11-22" "2001-11-23" "2001-11-24" "2001-11-25"
## [51] "2001-11-26" "2001-11-27" "2001-11-28" "2001-11-29" "2001-11-30"
## [56] "2001-12-01" "2001-12-02" "2001-12-03" "2001-12-04" "2001-12-05"
## [61] "2001-12-06" "2001-12-07" "2001-12-08" "2001-12-09" "2001-12-10"
## [66] "2001-12-11" "2001-12-12" "2001-12-13" "2001-12-14" "2001-12-15"
## [71] "2001-12-16" "2001-12-17" "2001-12-18" "2001-12-19" "2001-12-20"
## [76] "2001-12-21" "2001-12-22" "2001-12-23" "2001-12-24" "2001-12-25"
## [81] "2001-12-26" "2001-12-27" "2001-12-28" "2001-12-29" "2001-12-30"
## [86] "2001-12-31"
## 
## [[7]]
## [1] "2001-10-19" "2001-10-20"
## 
## [[8]]
## [1] "2001-11-09"
## 
## [[9]]
##  [1] "2001-11-11" "2001-11-12" "2001-11-13" "2001-11-14" "2001-11-15"
##  [6] "2001-11-16" "2001-11-17" "2001-11-18" "2001-11-19" "2001-11-20"
## [11] "2001-11-21" "2001-11-22" "2001-11-23"
## 
## [[10]]
## [1] "2001-11-12"
## 
## [[11]]
## [1] "2001-11-13" "2001-11-14"
## 
## [[12]]
## [1] "2001-11-13" "2001-11-14"
## 
## [[13]]
##  [1] "2001-11-15" "2001-11-16" "2001-11-17" "2001-11-18" "2001-11-19"
##  [6] "2001-11-20" "2001-11-21" "2001-11-22" "2001-11-23" "2001-11-24"
## [11] "2001-11-25" "2001-11-26" "2001-11-27" "2001-11-28" "2001-11-29"
## [16] "2001-11-30"
## 
## [[14]]
##  [1] "2001-11-22" "2001-11-23" "2001-11-24" "2001-11-25" "2001-11-26"
##  [6] "2001-11-27" "2001-11-28" "2001-11-29" "2001-11-30" "2001-12-01"
## [11] "2001-12-02" "2001-12-03" "2001-12-04" "2001-12-05" "2001-12-06"
## [16] "2001-12-07"
## 
## [[15]]
## [1] "2001-11-25" "2001-11-26" "2001-11-27" "2001-11-28" "2001-11-29"
## [6] "2001-11-30" "2001-12-01"
## 
## [[16]]
## [1] "2001-12-12" "2001-12-13" "2001-12-14" "2001-12-15" "2001-12-16"
## [6] "2001-12-17"
## 
## [[17]]
## [1] "2001-12-03"
## 
## [[18]]
## [1] "2001-12-04"
## 
## [[19]]
## [1] "2001-12-22"
## 
## [[20]]
## [1] "2001-12-30" "2001-12-31" "2002-01-01" "2002-01-02" "2002-01-03"

Note that to expand approximate dates, one needs to declare the range to expand over using the ‘approx_range’ argument in expand():

expand(battles$Date, approx_range = 1)
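
To make the idea of expansion concrete, the following base R sketch shows what expanding a single range such as 2001-04-16..2001-04-20 implies. It is only an illustration and not the {messydates} implementation: the helper name expand_range_sketch is hypothetical, and it handles only precise single dates and simple "start..end" ranges.

```r
# Hypothetical helper (not part of {messydates}): expand a simple
# "start..end" range, or a single precise date, into daily dates.
expand_range_sketch <- function(x) {
  # Split on the literal ".." separator; a single date yields one element
  ends <- strsplit(x, "..", fixed = TRUE)[[1]]
  start <- as.Date(ends[1])
  end <- as.Date(ends[length(ends)])
  # Enumerate every day from start to end and return them as strings
  format(seq(start, end, by = "day"))
}

expand_range_sketch("2001-04-16..2001-04-20")
```

For this input the sketch returns the same five days that expand() lists for the second battle above.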

Contract

The contract() function operates as the opposite of expand(). It contracts a list of dates into the abbreviated annotation of messydates, picking the most succinct representation of dates possible. We can contract back the dates in the Battles data previously expanded.

tibble::tibble(contract = contract(battles$Date))
## # A tibble: 20 × 1
##    contract              
##    <mdate>               
##  1 2001-03-08            
##  2 2001-04-16..2001-04-20
##  3 2001-05-25            
##  4 2001-06-22..2001-06-28
##  5 2001-08-13..2001-08-26
##  6 2001-10-07..2001-12-31
##  7 2001-10-19..2001-10-20
##  8 2001-11-09            
##  9 2001-11-11..2001-11-23
## 10 2001-11-12            
## 11 2001-11-13..2001-11-14
## 12 2001-11-13..2001-11-14
## 13 2001-11-15..2001-11-30
## 14 2001-11-22..2001-12-07
## 15 2001-11-25..2001-12-01
## 16 2001-12-12..2001-12-17
## 17 2001-12-03            
## 18 2001-12-04            
## 19 2001-12-22            
## 20 2001-12-30..2002-01-03
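
The reverse step can be sketched in base R as well. The snippet below is a minimal illustration, assuming the input is an unbroken run of consecutive days; the helper name contract_range_sketch is hypothetical, and the package's own contract() also handles sets and other annotations that this sketch does not.

```r
# Hypothetical helper (not part of {messydates}): collapse a run of
# consecutive days into the "start..end" notation.
contract_range_sketch <- function(dates) {
  d <- sort(as.Date(dates))
  # A single date needs no range notation
  if (length(d) == 1) return(format(d))
  # This sketch only covers unbroken runs of consecutive days
  stopifnot(all(as.numeric(diff(d)) == 1))
  paste0(format(d[1]), "..", format(d[length(d)]))
}

contract_range_sketch(c("2001-10-19", "2001-10-20"))
```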

Coerce from messydates

Coercion functions coerce objects of mdate class to common date classes such as Date, POSIXct, and POSIXlt. Since mdate objects can hold multiple individual dates, an additional function must be passed as an argument so that multiple dates are “resolved” into a single date.

For example, one might wish to use the earliest possible date in any range of dates (min), the latest possible date (max), some notion of a central tendency (mean, median, or modal), or even a random selection from among the candidate dates. These functions are particularly useful when working with existing methods and models, especially for checking the robustness of results.

tibble::tibble(min = as.Date(battles$Date, min),
               max = as.Date(battles$Date, max),
               median = as.Date(battles$Date, median),
               mean = as.Date(battles$Date, mean),
               modal = as.Date(battles$Date, modal),
               random = as.Date(battles$Date, random))
## # A tibble: 20 × 6
##    min        max        median     mean       modal      random    
##    <date>     <date>     <date>     <date>     <date>     <date>    
##  1 2001-03-08 2001-03-08 2001-03-08 2001-03-08 2001-03-08 2001-03-08
##  2 2001-04-16 2001-04-20 2001-04-18 2001-04-18 2001-04-16 2001-04-16
##  3 2001-05-25 2001-05-25 2001-05-25 2001-05-25 2001-05-25 2001-05-25
##  4 2001-06-22 2001-06-28 2001-06-25 2001-06-25 2001-06-22 2001-06-24
##  5 2001-08-13 2001-08-26 2001-08-20 2001-08-19 2001-08-13 2001-08-26
##  6 2001-10-07 2001-12-31 2001-11-19 2001-11-18 2001-10-07 2001-10-08
##  7 2001-10-19 2001-10-20 2001-10-20 2001-10-19 2001-10-19 2001-10-20
##  8 2001-11-09 2001-11-09 2001-11-09 2001-11-09 2001-11-09 2001-11-09
##  9 2001-11-11 2001-11-23 2001-11-17 2001-11-17 2001-11-11 2001-11-23
## 10 2001-11-12 2001-11-12 2001-11-12 2001-11-12 2001-11-12 2001-11-12
## 11 2001-11-13 2001-11-14 2001-11-14 2001-11-13 2001-11-13 2001-11-13
## 12 2001-11-13 2001-11-14 2001-11-14 2001-11-13 2001-11-13 2001-11-13
## 13 2001-11-15 2001-11-30 2001-11-23 2001-11-22 2001-11-15 2001-11-20
## 14 2001-11-22 2001-12-07 2001-11-30 2001-11-29 2001-11-22 2001-11-30
## 15 2001-11-25 2001-12-01 2001-11-28 2001-11-28 2001-11-25 2001-11-28
## 16 2001-12-12 2001-12-17 2001-12-15 2001-12-14 2001-12-12 2001-12-13
## 17 2001-12-03 2001-12-03 2001-12-03 2001-12-03 2001-12-03 2001-12-03
## 18 2001-12-04 2001-12-04 2001-12-04 2001-12-04 2001-12-04 2001-12-04
## 19 2001-12-22 2001-12-22 2001-12-22 2001-12-22 2001-12-22 2001-12-22
## 20 2001-12-30 2002-01-03 2002-01-01 2002-01-01 2001-12-30 2001-12-30
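
As a rough illustration of what “resolving” means, the base R sketch below takes the candidate days implied by the range 2001-04-16..2001-04-20 and reduces them to a single date in several ways. It assumes that resolution works on the expanded set of days; the variable names are illustrative and the package's internals may differ.

```r
# Candidate days implied by the range 2001-04-16..2001-04-20
candidates <- seq(as.Date("2001-04-16"), as.Date("2001-04-20"), by = "day")

# Reduce the candidates to one date per resolution strategy
resolved <- c(
  min = format(min(candidates)),
  max = format(max(candidates)),
  median = format(median(candidates)),
  mean = format(mean(candidates))
)
resolved
```

The min, max, median, and mean obtained this way agree with the second row of the table above.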

Additional functionality

Several other functions are offered in {messydates}.

For example, we can run several logical tests on mdate variables. is_messydate() tests whether the object inherits the mdate class. is_intersecting() tests whether there is any intersection between two messy dates. is_subset() similarly tests whether one or more messy dates can be found within a messy date range or set. is_similar() tests whether two dates contain similar components. is_precise() tests whether a date is precise.

is_messydate(battles$Date)
## [1] TRUE
is_intersecting(as_messydate(battles$Date[1]), as_messydate(battles$Date[2]))
## [1] FALSE
is_subset(as_messydate("2001-04-17"), as_messydate(battles$Date[2]))
## [1] TRUE
is_similar(as_messydate("2001-08-03"), as_messydate(battles$Date[1]))
## [1] TRUE
is_precise(as_messydate(battles$Date[2]))
## [1] FALSE

Additionally, one can take the intersection or union of messy dates.

as_messydate(battles$Date[9]) %intersect% as_messydate(battles$Date[10])
## [1] "2001-11-12"
as_messydate(battles$Date[17]) %union% as_messydate(battles$Date[18])
## [1] "2001-12-03" "2001-12-04"
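
Conceptually, these operators behave like set operations on the days each messy date implies. The base R sketch below illustrates that reading, under the assumption that the “Siege of Kunduz” range is expanded to days and the “Battle of Herat” date is treated as the single day 2001-11-12; the variable names are illustrative only.

```r
# Days implied by the "Siege of Kunduz" range 2001-11-11..2001-11-23
kunduz <- format(seq(as.Date("2001-11-11"), as.Date("2001-11-23"), by = "day"))
# "Battle of Herat", treated here as a single day (assumption)
herat <- "2001-11-12"

# Days the two battles have in common
shared_days <- base::intersect(kunduz, herat)
# All days covered by two single-day battles
all_days <- base::union("2001-12-03", "2001-12-04")

shared_days
all_days
```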

We can also perform some arithmetic operations on mdate variables.

tibble::tibble("one day more" = battles$Date + 1,
               "one day less" = battles$Date - "1 day")
## # A tibble: 20 × 2
##    `one day more`         `one day less`        
##    <mdate>                <mdate>               
##  1 2001-03-09             2001-03-07            
##  2 2001-04-17..2001-04-21 2001-04-15..2001-04-19
##  3 2001-05-26..           2001-05-24..          
##  4 2001-06-23..2001-06-29 2001-06-21..2001-06-27
##  5 2001-08-14..2001-08-27 2001-08-12..2001-08-25
##  6 2001-10-08..2002-01-01 2001-10-06..2001-12-30
##  7 2001-10-20..2001-10-21 2001-10-18..2001-10-19
##  8 2001-11-10             2001-11-08            
##  9 2001-11-12..2001-11-24 2001-11-10..2001-11-22
## 10 ..2001-11-13           ..2001-11-11          
## 11 2001-11-14..2001-11-15 2001-11-12..2001-11-13
## 12 2001-11-14..2001-11-15 2001-11-12..2001-11-13
## 13 2001-11-16..2001-12-01 2001-11-14..2001-11-29
## 14 2001-11-23..2001-12-08 2001-11-21..2001-12-06
## 15 2001-11-26..2001-12-02 2001-11-24..2001-11-30
## 16 2001-12-13..2001-12-18 2001-12-11..2001-12-16
## 17 2001-12-04             2001-12-02            
## 18 2001-12-05             2001-12-03            
## 19 2001-12-23             2001-12-21            
## 20 2001-12-31..2002-01-04 2001-12-29..2002-01-02

Finally, one can run logical and proportional comparisons on mdate objects.

as_messydate("2012-06-03") < as.Date("2012-06-02")
## [1] FALSE
as_messydate("2012-06-03") > as.Date("2012-06-02")
## [1] TRUE
as_messydate("2012-06-03") >= as.Date("2012-06-02")
## [1] TRUE
as_messydate("2012-06-03") <= as.Date("2012-06-02")
## [1] FALSE
as_messydate("2012-06") %g% as_messydate("2012-06-02") # proportion greater than
## [1] 0.9333333
as_messydate("2012-06") %l% as_messydate("2012-06-02") # proportion smaller than
## [1] 0.03333333
as_messydate("2012-06") %ge% "2012-06-02"  # proportion greater than or equal to
## [1] 0.9666667
as_messydate("2012-06") %le% "2012-06-02" # proportion smaller than or equal to
## [1] 0.06666667
as_messydate("2012-06") %><% as_messydate("2012-06-15..2012-07-15")  # proportion of dates in the first vector and in the second vector (exclusive)
## [1] 0.516129
as_messydate("2012-06") %>=<% as_messydate("2012-06-15..2012-07-15")  # proportion of dates in the first vector and in the second vector (inclusive)
## [1] 0.5333333
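
The first four proportions can be reproduced by hand, which may help in interpreting them. The base R sketch below assumes that "2012-06" stands for the 30 days of June 2012 and simply counts the share of those days on each side of 2012-06-02. This is an interpretation of the output above, not the package's implementation, and the variable names are illustrative.

```r
# The 30 days implied by the unspecified date "2012-06"
june <- seq(as.Date("2012-06-01"), as.Date("2012-06-30"), by = "day")
ref <- as.Date("2012-06-02")

# Share of June days on each side of the reference date
props <- c(
  greater = mean(june > ref),            # 28 of 30 days
  smaller = mean(june < ref),            # 1 of 30 days
  greater_or_equal = mean(june >= ref),  # 29 of 30 days
  smaller_or_equal = mean(june <= ref)   # 2 of 30 days
)
props
```

These shares (28/30, 1/30, 29/30, and 2/30) correspond to the values printed for %g%, %l%, %ge%, and %le% above.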