---
title: "ids"
author: "Rich FitzJohn"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{ids}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

``` {r echo = FALSE, results = "hide"}
knitr::opts_chunk$set(error = FALSE)
human_no <- function(x) {
  s <- log10(floor(x + 1))
  p <- c(0, thousand = 3, million = 6, billion = 9, trillion = 12)
  i <- s > p
  j <- max(which(i))
  str <- names(p)[j]
  if (nzchar(str)) {
    paste(signif(x / 10^p[[j]], 3), str)
  } else {
    as.character(x)
  }
}
set.seed(1)
```

The `ids` package provides randomly generated ids in a number of
different forms with different readability and sizes.

## Random bytes

The `random_id` function generates random identifiers by generating
`bytes` random bytes and converting to hexadecimal (so each byte
becomes a pair of characters).  Rather than use R's random number
stream we use the `openssl` package here.
``` {r }
ids::random_id()
```

All `ids` functions take `n` as the first argument to be the number
of identifiers generated:
``` {r }
ids::random_id(5)
```

The default here is 16 bytes, each of which has 256 values (so
256^16 = 2^128 = 3.4e38 combinations).  You can make these larger or
smaller with the `bytes` argument:
``` {r }
ids::random_id(5, 8)
```

If `NULL` is provided as `n`, then a generating function is
returned (all ids functions do this):
``` {r }
f <- ids::random_id(NULL, 8)
f
```

This function sets all arguments except for `n`
``` {r }
f()
f(4)
```

## UUIDs

The above look a lot like UUIDs but they are not actually UUIDs. The `ids::uuid` function generates "version 4" UUIDs which are almost entirely random but contain dashes and have particular bits set:

``` {r }
ids::uuid()
```

As above, generate more than one UUID:
``` {r }
ids::uuid(4)
```

## Adjective animal

Generate (somewhat) human readable identifiers by combining one or
more adjectives with an animal name.

``` {r }
ids::adjective_animal()
```

The list of adjectives and animals comes from
[gfycat.com](http://gfycat.com), via
https://github.com/a-type/adjective-adjective-animal

Generate more than one identifier:
``` {r }
ids::adjective_animal(4)
```

Use more than one adjective for very long identifier
``` {r }
ids::adjective_animal(4, 3)
```

``` {r echo = FALSE, results = "hide"}
n1 <- length(ids:::gfycat_animals)
n2 <- length(ids:::gfycat_adjectives)
```

There are `r n1` animal names and `r n2` adjectives so each one you
add increases the identifier space by a factor of `r n2`.  So for 1,
2, and 3 adjectives there are about `r human_no(n1 * n2)`,
`r human_no(n1 * n2^2)` and `r human_no(n1 * n2^3)` possible combinations.

This is a much smaller space than the random identifiers above, but
these are more readable and memorable.

By default the random numbers come from R's random
number stream so are affected by `set.seed()`; see below for details.

Because some of the animal and adjective names are very long
(e.g. a _quasiextraterritorial hexakosioihexekontahexaphobic
queenalexandrasbirdwingbutterfly_), in order to generate more
readable/memorable identifiers it may be useful to restrict the
length.  Pass `max_len` in to do this.
``` {r }
ids::adjective_animal(4, max_len = 6)
```

A vector of length 2 here can be used to apply to the adjectives
and animal respectively:
``` {r }
ids::adjective_animal(20, max_len = c(5, Inf))
```

Note that this decreases the pool size and so increases the chance
of collisions.

In addition to snake_case, the default, the punctuation between
words can be changed to:

kebab-case:
``` {r }
ids::adjective_animal(1, 2, style = "kebab")
```

dot.case:
``` {r }
ids::adjective_animal(1, 2, style = "dot")
```

camelCase:
``` {r }
ids::adjective_animal(1, 2, style = "camel")
```

PascalCase:
``` {r }
ids::adjective_animal(1, 2, style = "pascal")
```

CONSTANT_CASE (aka SHOUTY_CASE)
``` {r }
ids::adjective_animal(1, 2, style = "constant")
```

or with spaces, lower case:
``` {r }
ids::adjective_animal(1, 2, style = "lower")
```

UPPER CASE
``` {r }
ids::adjective_animal(1, 2, style = "upper")
```

Sentence case
``` {r }
ids::adjective_animal(1, 2, style = "sentence")
```

Title Case
``` {r }
ids::adjective_animal(1, 2, style = "title")
```

MocKiNg sPoNgEbOb CaSe

``` {r }
ids::adjective_animal(1, 2, style = "spongemock")
```

Again, pass `n = NULL` here to create a generating function:
``` {r }
aa3 <- ids::adjective_animal(NULL, 3, style = "kebab", max_len = c(6, 8))
```

...which can be used to generate ids on demand.
``` {r }
aa3()
aa3(4)
```

## Random sentences

The `sentence` function creates a sentence style identifier.  This
uses the approach described by Asana on [their
blog](https://blog.asana.com/2011/09/6-sad-squid-snuggle-softly).
This approach encodes 32 bits of information (so 2^32 ~= 4 billion
possibilities) and in theory can be remapped to an integer if you
really wanted to.
``` {r }
ids::sentence()
```

As with `adjective_animal`, the case can be changed:
``` {r }
ids::sentence(2, "dot")
ids::sentence(2, "spongemock")
```

If you would rather past tense for the verbs, then pass `past = TRUE`:
``` {r }
ids::sentence(4, past = TRUE)
```

## proquints

"proquints" are an identifier that tries to be information dense
but still human readable and (somewhat) pronounceable; "proquint"
stands for *PRO*-nouncable *QUINT*-uplets.  They are introduced in
https://arxiv.org/html/0901.4016

`ids` can generate proquints:

``` {r }
ids::proquint(10)
```

By default it generates two-word proquints but that can be changed:

``` {r }
ids::proquint(5, 1)
ids::proquint(2, 4)
```

Proquints are formed by alternating
consonant/vowel/consonant/vowel/consonant using a subset of both
(16 consonants and 4 vowels).  This yields 2^16 (65,536)
possibilities per word.  Words are always lower case and always
separated by a hyphen.  So with 4 words there are 2^64 combinations
in 23 characters.

Proquints are also useful in that they can be translated with
integers.  The proquint `kapop` has integer value 25258
``` {r }
ids::proquint_to_int("kapop")
ids::int_to_proquint(25258)
```

This makes proquints suitable for creating human-pronounceable
identifiers out of things like ip addresses, integer primary keys,
etc.

The function `ids::int_to_proquint_word` will translate between
proquint words and integers (and are vectorised)
``` {r }
w <- ids::int_to_proquint_word(sample(2^16, 10) - 1L)
w
```

and `ids::proquint_word_to_int` does the reverse
``` {r }
ids::proquint_word_to_int(w)
```

while `ids::proquint_to_int` and `ids::int_to_proquint` allows
translation of multi-word proquints.  Overflow is a real
possibility; the maximum integer representable is only about `r
human_no(.Machine$integer.max)` and the maximum floating point
number of accuracy of 1 is about `r human_no(2 /
.Machine$double.eps)` -- these are big numbers but fairly small
proquints:
``` {r }
ids::int_to_proquint(.Machine$integer.max - 1)
ids::int_to_proquint(2 / .Machine$double.eps)
```

But if you had a 6 word proquint this would not work!
``` {r }
p <- ids::proquint(1, 6)
```

Too big for an integer:
``` {r error = TRUE}
ids::proquint_to_int(p)
```

And too big for an numeric number:
``` {r error = TRUE}
ids::proquint_to_int(p, as = "numeric")
```

To allow this, we use `openssl`'s `bignum` support:
``` {r }
ids::proquint_to_int(p, as = "bignum")
```

This returns a *list* with one bignum (this is required to allow
vectorisation).

## Roll your own identifiers

The `ids` functions can build identifiers in the style of
`adjective_animal` or `sentence`.  It takes as input a list of
strings.  This works particularly well with the `rcorpora` package
which includes lists of strings.

Here is a list of Pokemon names:
``` {r }
pokemon <- tolower(rcorpora::corpora("games/pokemon")$pokemon$name)
length(pokemon)
```

...and here is a list of adjectives
``` {r }
adjectives <- tolower(rcorpora::corpora("words/adjs")$adjs)
length(adjectives)
```

So we have a total pool size of about `r human_no(length(adjectives) *
length(pokemon))`, which is not huge, but it is at least topical.

To generate one identifier:
``` {r }
ids::ids(1, adjectives, pokemon)
```

All the style-changing code is available:
``` {r }
ids::ids(10, adjectives, pokemon, style = "dot")
```

Better would be to wrap this so that the constants are not passed
around the whole time:
``` {r }
adjective_pokemon <- function(n = 1, style = "snake") {
  pokemon <- tolower(rcorpora::corpora("games/pokemon")$pokemon$name)
  adjectives <- tolower(rcorpora::corpora("words/adjs")$adjs)
  ids::ids(n, adjectives, pokemon, style = style)
}

adjective_pokemon(10, "kebab")
```

As a second example we can use the word lists in rcorpora to
generate identifiers in the form `<mood>_<scientist>`, like
"melancholic_darwin".  These are similar to the names of
docker containers.

First the lists of names themselves:
``` {r }
moods <- tolower(rcorpora::corpora("humans/moods")$moods)
scientists <- tolower(rcorpora::corpora("humans/scientists")$scientists)
```

Moods include:
``` {r }
sample(moods, 10)
```

The scientists names contain spaces which is not going to work for
us because `ids` won't correctly translate all internal spaces to
the requested style.
``` {r }
sample(scientists, 10)
```

To hack around this we'll just take the last name from the list and
remove all hyphens:
``` {r }
scientists <- vapply(strsplit(sub("(-|jr\\.$)", "", scientists), " "),
                     tail, character(1), 1)
```

Which gives strings that are just letters (though there are a few
non-ASCII characters here that may cause problems because string
handling is just a big pile of awful)
``` {r }
sample(scientists, 10)
```

With the word lists, create an identifier:
``` {r }
ids::ids(1, moods, scientists)
```

Or pass `NULL` for `n` and create a function:
``` {r }
sci_id <- ids::ids(NULL, moods, scientists, style = "kebab")
```

which takes just the number of identifiers to generate as an argument
``` {r }
sci_id(10)
```

## Relationship to random number generation

Creating *random* identifiers requires some source of random numbers. There are a couple of competing needs here:

(1) You might want the identifiers to be repeatable, so that setting the seed will ensure the same stream of random identifiers:

```{r}
set.seed(1)
ids::adjective_animal(3)
set.seed(1)
ids::adjective_animal(3)
```

(2) Alternatively you might want to use identifiers as "globally unique" keys, avoiding any collisions

```{r}
set.seed(1)
ids::random_id(3, 4)
set.seed(1)
ids::random_id(3, 4)
```

To support this, `ids` uses two different types of sources of random numbers:

* **Global random numbers** come from R's random number stream and respond to `set.seed`, as above. These are used by default in `ids::adjective_animal`, `ids::proquint`, `ids::sentence` (i.e., the human-readable ids functions), and can be selected if required for `ids::random_id` and `ids::uuid` by passing the option `global = TRUE`.
* **Non-global random numbers** come from "some other source", either `openssl::rand_num` or an internal source if `openssl` is unavailable (see below).  These will not be affected by `set.seed` and we provide no way of resetting or reproducing the stream.  This is the default for `ids::random_id` and `ids::uuid` (i.e., the non-reproducible ids functions), and can be selected if required by `ids::proquint`, `ids::adjective_animal` and `ids::sentence` by passing the option `global = FALSE`.

For non-global random numbers, all else being equal, you should prefer to use the numbers provided by `openssl`.  The default for all functions is to use `openssl` if available.  You can *force* `openssl` by passing `use_openssl = TRUE` (perhaps along with `global = FALSE`); if the package is not installed then we will error rather than continuing with the internal random number stream.

The internal random number stream is a pure-R implementation of the `xoroshiro128+` algorithm.  As such it is quite slow (about 50x slower per byte of output than the cryptographically secure numbers from `openssl`) but it always available.  We seed the stream by hashing (via `utils::md5sum`) the current process id and the system time to the maximum precision available on that platform.