--- title: "sacRebleu" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{sacRebleu} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This package aims to provide metrics to evaluate generated text. To this point, only the BLEU (bilingual evaluation understudy) score, introduced by [Papineni et al., 2002](https://aclanthology.org/P02-1040/), is available. The library is implemented in 'R' and 'C++'. The metrics are implemented on the base of previous tokenization, so that lists with tokenized sequences are evaluated. This package is inspired by the ['NLTK'](https://www.nltk.org/) and ['sacrebleu'](https://github.com/mjpost/sacrebleu) implementation for 'Python'. # BLEU Score The BLEU-score is a metric used to evaluate the quality of machine-generated texts by comparing them to reference texts. It is calculated based on the precision of n-grams, which are contiguous sequences of n items, typically words. Mathematically, BLEU can be expressed as follows: \[ BLEU = \text{{BP}} \times \exp\left(\sum_{n=1}^{N} \frac{1}{N} \log \text{{precision}}_n\right) \] Where: - \(\text{{BP}}\) is the brevity penalty, which penalizes if the candidate text is shorter than the reference texts. It is defined as \(\exp(1 - \frac{{\text{{reference length}}}}{{\text{{output length}}}})\). - \(N\) is the maximum n-gram order considered in the calculation. - \(\text{{precision}}_n\) is the precision of n-grams, calculated as the ratio of the number of n-grams in the candidate text that appear in any of the reference texts to the total number of n-grams in the candidate text. \(\text{{precision}}_n\) is defined as the following: \[ precision_n = \frac{\sum_{c \in \text{Cand}} ngram_{\text{clip}}(c)}{\sum_{r \in \text{Ref}_{\text{Cand}}} ngram(r)} \] Where $ngram_{\text{clip}}$ represents the count of n-grams in the candidate text that appear in any of the reference texts, while $ngram$ stands for the total number of n-grams in the candidate sentence, ensuring they do not exceed the count of the reference n-grams. This procedure is repeated for all 1 to N-grams. In summary, the BLEU score provides a single numerical value indicating the quality of a candidate text, with higher scores indicating better quality. # Smoothing This package provides two smoothing techniques from [Chen et al., 2014](https://aclanthology.org/W14-3346/). The methods available in this package are `floor` and `add-k`. ## `floor` The precision of BLEU is calculated by dividing the sum of the n-grams. However, in some cases, the count of certain n-grams may be zero. To address this issue, a small value (denoted as $\epsilon$) is added to the numerator of the precision calculation when the count is zero. ## `add-k` Similar to the motivation behind the `floor` method, the `add-k` smoothing technique involves adding an integer value ($k$) to the overall sum of the numerator and the denominator of the precision calculation for each 1..N-gram. # Example ```{r} library(sacRebleu) cand_corpus <- list(c(1,2,3), c(1,2)) ref_corpus <- list(list(c(1,2,3), c(2,3,4)), list(c(1,2,6), c(781, 21, 9), c(7, 3))) bleu_corpus_ids_standard <- bleu_corpus_ids(ref_corpus, cand_corpus) ``` Here, the text is already tokenized and represented through integers in the 'cand_corpus' and 'ref_corpus' lists. For tokenization, the ['tok'](https://cran.r-project.org/package=tok) package is recommended.