--- title: "ids" author: "Rich FitzJohn" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{ids} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ``` {r echo = FALSE, results = "hide"} knitr::opts_chunk$set(error = FALSE) human_no <- function(x) { s <- log10(floor(x + 1)) p <- c(0, thousand = 3, million = 6, billion = 9, trillion = 12) i <- s > p j <- max(which(i)) str <- names(p)[j] if (nzchar(str)) { paste(signif(x / 10^p[[j]], 3), str) } else { as.character(x) } } set.seed(1) ``` The `ids` package provides randomly generated ids in a number of different forms with different readability and sizes. ## Random bytes The `random_id` function generates random identifiers by generating `bytes` random bytes and converting to hexadecimal (so each byte becomes a pair of characters). Rather than use R's random number stream we use the `openssl` package here. ``` {r } ids::random_id() ``` All `ids` functions take `n` as the first argument to be the number of identifiers generated: ``` {r } ids::random_id(5) ``` The default here is 16 bytes, each of which has 256 values (so 256^16 = 2^128 = 3.4e38 combinations). You can make these larger or smaller with the `bytes` argument: ``` {r } ids::random_id(5, 8) ``` If `NULL` is provided as `n`, then a generating function is returned (all ids functions do this): ``` {r } f <- ids::random_id(NULL, 8) f ``` This function sets all arguments except for `n` ``` {r } f() f(4) ``` ## UUIDs The above look a lot like UUIDs but they are not actually UUIDs. The `ids::uuid` function generates "version 4" UUIDs which are almost entirely random but contain dashes and have particular bits set: ``` {r } ids::uuid() ``` As above, generate more than one UUID: ``` {r } ids::uuid(4) ``` ## Adjective animal Generate (somewhat) human readable identifiers by combining one or more adjectives with an animal name. ``` {r } ids::adjective_animal() ``` The list of adjectives and animals comes from [gfycat.com](http://gfycat.com), via https://github.com/a-type/adjective-adjective-animal Generate more than one identifier: ``` {r } ids::adjective_animal(4) ``` Use more than one adjective for very long identifier ``` {r } ids::adjective_animal(4, 3) ``` ``` {r echo = FALSE, results = "hide"} n1 <- length(ids:::gfycat_animals) n2 <- length(ids:::gfycat_adjectives) ``` There are `r n1` animal names and `r n2` adjectives so each one you add increases the identifier space by a factor of `r n2`. So for 1, 2, and 3 adjectives there are about `r human_no(n1 * n2)`, `r human_no(n1 * n2^2)` and `r human_no(n1 * n2^3)` possible combinations. This is a much smaller space than the random identifiers above, but these are more readable and memorable. By default the random numbers come from R's random number stream so are affected by `set.seed()`; see below for details. Because some of the animal and adjective names are very long (e.g. a _quasiextraterritorial hexakosioihexekontahexaphobic queenalexandrasbirdwingbutterfly_), in order to generate more readable/memorable identifiers it may be useful to restrict the length. Pass `max_len` in to do this. ``` {r } ids::adjective_animal(4, max_len = 6) ``` A vector of length 2 here can be used to apply to the adjectives and animal respectively: ``` {r } ids::adjective_animal(20, max_len = c(5, Inf)) ``` Note that this decreases the pool size and so increases the chance of collisions. In addition to snake_case, the default, the punctuation between words can be changed to: kebab-case: ``` {r } ids::adjective_animal(1, 2, style = "kebab") ``` dot.case: ``` {r } ids::adjective_animal(1, 2, style = "dot") ``` camelCase: ``` {r } ids::adjective_animal(1, 2, style = "camel") ``` PascalCase: ``` {r } ids::adjective_animal(1, 2, style = "pascal") ``` CONSTANT_CASE (aka SHOUTY_CASE) ``` {r } ids::adjective_animal(1, 2, style = "constant") ``` or with spaces, lower case: ``` {r } ids::adjective_animal(1, 2, style = "lower") ``` UPPER CASE ``` {r } ids::adjective_animal(1, 2, style = "upper") ``` Sentence case ``` {r } ids::adjective_animal(1, 2, style = "sentence") ``` Title Case ``` {r } ids::adjective_animal(1, 2, style = "title") ``` MocKiNg sPoNgEbOb CaSe ``` {r } ids::adjective_animal(1, 2, style = "spongemock") ``` Again, pass `n = NULL` here to create a generating function: ``` {r } aa3 <- ids::adjective_animal(NULL, 3, style = "kebab", max_len = c(6, 8)) ``` ...which can be used to generate ids on demand. ``` {r } aa3() aa3(4) ``` ## Random sentences The `sentence` function creates a sentence style identifier. This uses the approach described by Asana on [their blog](https://blog.asana.com/2011/09/6-sad-squid-snuggle-softly). This approach encodes 32 bits of information (so 2^32 ~= 4 billion possibilities) and in theory can be remapped to an integer if you really wanted to. ``` {r } ids::sentence() ``` As with `adjective_animal`, the case can be changed: ``` {r } ids::sentence(2, "dot") ids::sentence(2, "spongemock") ``` If you would rather past tense for the verbs, then pass `past = TRUE`: ``` {r } ids::sentence(4, past = TRUE) ``` ## proquints "proquints" are an identifier that tries to be information dense but still human readable and (somewhat) pronounceable; "proquint" stands for *PRO*-nouncable *QUINT*-uplets. They are introduced in https://arxiv.org/html/0901.4016 `ids` can generate proquints: ``` {r } ids::proquint(10) ``` By default it generates two-word proquints but that can be changed: ``` {r } ids::proquint(5, 1) ids::proquint(2, 4) ``` Proquints are formed by alternating consonant/vowel/consonant/vowel/consonant using a subset of both (16 consonants and 4 vowels). This yields 2^16 (65,536) possibilities per word. Words are always lower case and always separated by a hyphen. So with 4 words there are 2^64 combinations in 23 characters. Proquints are also useful in that they can be translated with integers. The proquint `kapop` has integer value 25258 ``` {r } ids::proquint_to_int("kapop") ids::int_to_proquint(25258) ``` This makes proquints suitable for creating human-pronounceable identifiers out of things like ip addresses, integer primary keys, etc. The function `ids::int_to_proquint_word` will translate between proquint words and integers (and are vectorised) ``` {r } w <- ids::int_to_proquint_word(sample(2^16, 10) - 1L) w ``` and `ids::proquint_word_to_int` does the reverse ``` {r } ids::proquint_word_to_int(w) ``` while `ids::proquint_to_int` and `ids::int_to_proquint` allows translation of multi-word proquints. Overflow is a real possibility; the maximum integer representable is only about `r human_no(.Machine$integer.max)` and the maximum floating point number of accuracy of 1 is about `r human_no(2 / .Machine$double.eps)` -- these are big numbers but fairly small proquints: ``` {r } ids::int_to_proquint(.Machine$integer.max - 1) ids::int_to_proquint(2 / .Machine$double.eps) ``` But if you had a 6 word proquint this would not work! ``` {r } p <- ids::proquint(1, 6) ``` Too big for an integer: ``` {r error = TRUE} ids::proquint_to_int(p) ``` And too big for an numeric number: ``` {r error = TRUE} ids::proquint_to_int(p, as = "numeric") ``` To allow this, we use `openssl`'s `bignum` support: ``` {r } ids::proquint_to_int(p, as = "bignum") ``` This returns a *list* with one bignum (this is required to allow vectorisation). ## Roll your own identifiers The `ids` functions can build identifiers in the style of `adjective_animal` or `sentence`. It takes as input a list of strings. This works particularly well with the `rcorpora` package which includes lists of strings. Here is a list of Pokemon names: ``` {r } pokemon <- tolower(rcorpora::corpora("games/pokemon")$pokemon$name) length(pokemon) ``` ...and here is a list of adjectives ``` {r } adjectives <- tolower(rcorpora::corpora("words/adjs")$adjs) length(adjectives) ``` So we have a total pool size of about `r human_no(length(adjectives) * length(pokemon))`, which is not huge, but it is at least topical. To generate one identifier: ``` {r } ids::ids(1, adjectives, pokemon) ``` All the style-changing code is available: ``` {r } ids::ids(10, adjectives, pokemon, style = "dot") ``` Better would be to wrap this so that the constants are not passed around the whole time: ``` {r } adjective_pokemon <- function(n = 1, style = "snake") { pokemon <- tolower(rcorpora::corpora("games/pokemon")$pokemon$name) adjectives <- tolower(rcorpora::corpora("words/adjs")$adjs) ids::ids(n, adjectives, pokemon, style = style) } adjective_pokemon(10, "kebab") ``` As a second example we can use the word lists in rcorpora to generate identifiers in the form `_`, like "melancholic_darwin". These are similar to the names of docker containers. First the lists of names themselves: ``` {r } moods <- tolower(rcorpora::corpora("humans/moods")$moods) scientists <- tolower(rcorpora::corpora("humans/scientists")$scientists) ``` Moods include: ``` {r } sample(moods, 10) ``` The scientists names contain spaces which is not going to work for us because `ids` won't correctly translate all internal spaces to the requested style. ``` {r } sample(scientists, 10) ``` To hack around this we'll just take the last name from the list and remove all hyphens: ``` {r } scientists <- vapply(strsplit(sub("(-|jr\\.$)", "", scientists), " "), tail, character(1), 1) ``` Which gives strings that are just letters (though there are a few non-ASCII characters here that may cause problems because string handling is just a big pile of awful) ``` {r } sample(scientists, 10) ``` With the word lists, create an identifier: ``` {r } ids::ids(1, moods, scientists) ``` Or pass `NULL` for `n` and create a function: ``` {r } sci_id <- ids::ids(NULL, moods, scientists, style = "kebab") ``` which takes just the number of identifiers to generate as an argument ``` {r } sci_id(10) ``` ## Relationship to random number generation Creating *random* identifiers requires some source of random numbers. There are a couple of competing needs here: (1) You might want the identifiers to be repeatable, so that setting the seed will ensure the same stream of random identifiers: ```{r} set.seed(1) ids::adjective_animal(3) set.seed(1) ids::adjective_animal(3) ``` (2) Alternatively you might want to use identifiers as "globally unique" keys, avoiding any collisions ```{r} set.seed(1) ids::random_id(3, 4) set.seed(1) ids::random_id(3, 4) ``` To support this, `ids` uses two different types of sources of random numbers: * **Global random numbers** come from R's random number stream and respond to `set.seed`, as above. These are used by default in `ids::adjective_animal`, `ids::proquint`, `ids::sentence` (i.e., the human-readable ids functions), and can be selected if required for `ids::random_id` and `ids::uuid` by passing the option `global = TRUE`. * **Non-global random numbers** come from "some other source", either `openssl::rand_num` or an internal source if `openssl` is unavailable (see below). These will not be affected by `set.seed` and we provide no way of resetting or reproducing the stream. This is the default for `ids::random_id` and `ids::uuid` (i.e., the non-reproducible ids functions), and can be selected if required by `ids::proquint`, `ids::adjective_animal` and `ids::sentence` by passing the option `global = FALSE`. For non-global random numbers, all else being equal, you should prefer to use the numbers provided by `openssl`. The default for all functions is to use `openssl` if available. You can *force* `openssl` by passing `use_openssl = TRUE` (perhaps along with `global = FALSE`); if the package is not installed then we will error rather than continuing with the internal random number stream. The internal random number stream is a pure-R implementation of the `xoroshiro128+` algorithm. As such it is quite slow (about 50x slower per byte of output than the cryptographically secure numbers from `openssl`) but it always available. We seed the stream by hashing (via `utils::md5sum`) the current process id and the system time to the maximum precision available on that platform.