The ids package provides randomly generated ids in a number of different forms, with different readability and sizes. The random_id function generates random identifiers by generating a number of random bytes (given by the bytes argument) and converting them to hexadecimal, so each byte becomes a pair of characters. Rather than use R's random number stream, we use the openssl package here.
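The identifier below was presumably produced by a plain call:
ids::random_id()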
## [1] "f633f9ebdef187b7a0e17f2694963a3c"
All ids functions take n as their first argument, giving the number of identifiers to generate:
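For example, a call along these lines likely produced the five ids below:
ids::random_id(5)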
## [1] "6eba5c177fffd0d361071daadcb38030" "24ec7e11cf4ba7b2d4c75f7ec9d5214a"
## [3] "669666996c4c26b61ea3ee593cf2546b" "80829389d3181bef68c55e7b14ce164b"
## [5] "c3407dd7851b656f2d7da8e89e5f8657"
The default here is 16 bytes, each of which has 256 possible values (so 256^16 = 2^128 ≈ 3.4e38 combinations). You can make the identifiers larger or smaller with the bytes argument:
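For example, 8-byte ids (the byte count is inferred from the length of the output below):
ids::random_id(5, bytes = 8)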
## [1] "c840a9679fa8f0ad" "e412ebc1e914b871" "ca9163b42e85ce0d" "cc28735bc42aa193"
## [5] "eadfa303c5e6e6c5"
If NULL is provided as n, then a generating function is returned (all ids functions do this):
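Something like the following would produce the printed function below (the variable name id8 is hypothetical):
id8 <- ids::random_id(NULL, bytes = 8)
id8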
## function (n = 1)
## {
## random_id(n, bytes, use_openssl, global)
## }
## <bytecode: 0x55c77ef19288>
## <environment: 0x55c77d676ed8>
This function sets all arguments except for n:
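Assuming the generator was stored as id8 above, the two outputs below correspond to calls like:
id8()
id8(4)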
## [1] "97f7cbb991565f7b"
## [1] "aa658232570558c9" "ffcfd338bb219ab8" "acab9f5f89b7059e" "b15517756107fcdd"
The above look a lot like UUIDs but they are not actually UUIDs. The ids::uuid function generates "version 4" UUIDs, which are almost entirely random but contain dashes and have particular bits set:
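A single UUID, presumably from:
ids::uuid()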
## [1] "b89730aa-3b94-4b37-9d30-70d8e48d34bf"
As above, generate more than one UUID:
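Likely something like:
ids::uuid(4)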
## [1] "0806f04c-7850-48bc-8e1d-9fb5392e013e"
## [2] "4417f2dd-1f3a-4e73-a74a-bdb1e294776d"
## [3] "b56d8e9e-f93a-47db-93b8-ac19d1dfd229"
## [4] "31fa0084-9c0c-4414-9105-6b3b9d7eab58"
Generate (somewhat) human readable identifiers by combining one or more adjectives with an animal name.
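Presumably via the adjective_animal function with its defaults:
ids::adjective_animal()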
## [1] "careful_gaur"
The list of adjectives and animals comes from gfycat.com, via https://github.com/a-type/adjective-adjective-animal
Generate more than one identifier:
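For example:
ids::adjective_animal(4)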
## [1] "unterrestrial_takin" "illfated_bonobo"
## [3] "wet_vixen" "classless_africanwildcat"
Use more than one adjective for a very long identifier:
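The output below likely came from a call like the following (the argument name n_adjectives is an assumption, not confirmed here):
ids::adjective_animal(4, n_adjectives = 3)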
## [1] "changeable_tricolour_ethnological_shrimp"
## [2] "zealous_preagricultural_snobbish_drafthorse"
## [3] "brilliant_rhombohedral_locustal_tattler"
## [4] "waiting_necessary_dermatic_earwig"
There are 1748 animal names and 8946 adjectives so each one you add increases the identifier space by a factor of 8946. So for 1, 2, and 3 adjectives there are about 15.6 million, 140 billion and 1250 trillion possible combinations.
This is a much smaller space than the random identifiers above, but these are more readable and memorable.
By default the random numbers come from R's random number stream, so they are affected by set.seed(); see below for details.
Because some of the animal and adjective names are very long (e.g. a quasiextraterritorial hexakosioihexekontahexaphobic queenalexandrasbirdwingbutterfly), it may be useful to restrict the length in order to generate more readable/memorable identifiers. Pass max_len in to do this.
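For example, restricting both words to a few characters each (the exact limit used below is a guess):
ids::adjective_animal(4, max_len = 6)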
## [1] "neon_beetle" "dermal_quagga" "gluey_kakapo" "slim_grison"
A vector of length 2 applies the limits to the adjectives and the animal, respectively:
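For example (the specific limits here are a guess based on the output):
ids::adjective_animal(20, max_len = c(5, 25))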
## [1] "legal_watussi" "baggy_husky"
## [3] "mirky_cranefly" "lithe_langur"
## [5] "glass_tigerbeetle" "testy_octopus"
## [7] "puny_robin" "suede_mussel"
## [9] "bare_americanindianhorse" "lowly_mammal"
## [11] "antsy_newfoundlanddog" "wavy_basil"
## [13] "wood_grayfox" "dying_firecrest"
## [15] "lazy_quoll" "peat_trumpeterbird"
## [17] "choky_easternglasslizard" "human_wuerhosaurus"
## [19] "pagan_bittern" "solid_fieldspaniel"
Note that this decreases the pool size and so increases the chance of collisions.
In addition to snake_case, the default, the punctuation between words can be changed to:
kebab-case:
## [1] "residential-sizy-whiterhino"
dot.case:
## [1] "brave.selfdestroying.elephantseal"
camelCase:
## [1] "antiallergenicBioclimaticQueenalexandrasbirdwing"
PascalCase:
## [1] "QuickMilitantKarakul"
CONSTANT_CASE (aka SHOUTY_CASE):
## [1] "DISCUSSIBLE_RELISHABLE_XOLOITZCUINTLI"
or with spaces, lower case:
## [1] "asphaltic savoury queenslandgrouper"
UPPER CASE:
## [1] "ATMOSPHERIC SERENE RINGWORM"
Sentence case:
## [1] "Supersecure hardheaded mare"
Title Case:
## [1] "Absent Cycadaceous Huemul"
MocKiNg sPoNgEbOb CaSe:
## [1] "cREdItAbLe-DiNKy-QueEnbEE"
Again, pass n = NULL here to create a generating function:
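For example (aa3 is a hypothetical name; n_adjectives is assumed to be the argument controlling the number of adjectives):
aa3 <- ids::adjective_animal(NULL, n_adjectives = 3, style = "kebab")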
…which can be used to generate ids on demand.
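For example:
aa3()
aa3(4)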
## [1] "blank-frowsy-dozing-johndory"
## [1] "sand-bored-slow-anaconda" "nylon-scared-sepia-flicker"
## [3] "fervid-dismal-sulfur-kid" "wacky-binary-stingy-myna"
The sentence function creates a sentence-style identifier. This uses the approach described by Asana on their blog. It encodes 32 bits of information (so 2^32 ≈ 4 billion possibilities) and in theory can be mapped back to an integer if you really wanted to.
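Presumably:
ids::sentence()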
## [1] "4_nutty_goats_hopping_slowly"
As with adjective_animal, the case can be changed:
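For example, for the first pair below (the style value "dot" is an assumption; the second pair uses a mocking-spongebob style):
ids::sentence(2, style = "dot")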
## [1] "12.dazzling.hedgehogs.wandering.boastfully"
## [2] "17.awesome.platypuses.jumping.optimistically"
## [1] "21-UnKeMpt-rEiNdeErS-pRaNcInG-qUIrKiLy"
## [2] "6-cOoRdINatEd-FlAmInGoS-sAUnTeRiNg-brAvElY"
If you would rather use the past tense for the verbs, pass past = TRUE:
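For example:
ids::sentence(4, past = TRUE)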
## [1] "26_amazing_moose_laughed_zestfully"
## [2] "7_curious_sheep_pranced_quietly"
## [3] "31_waggish_iguanas_squiggled_youthfully"
## [4] "7_nutty_llamas_sauntered_shakily"
"Proquints" are identifiers that try to be information dense while remaining human readable and (somewhat) pronounceable; "proquint" stands for PRO-nounceable QUINT-uplets. They are introduced in https://arxiv.org/html/0901.4016. ids can generate proquints:
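Presumably:
ids::proquint(10)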
## [1] "zabir-fagug" "zosut-mavig" "ribis-liduv" "jihon-tobur" "dizol-figab"
## [6] "kosir-bunat" "buhiv-vumar" "tadis-vunaz" "jahor-kagan" "tutod-hisin"
By default it generates two-word proquints but that can be changed:
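Something like the following (the argument name n_words is an assumption):
ids::proquint(5, n_words = 1)
ids::proquint(2, n_words = 4)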
## [1] "tomuh" "vomul" "ralon" "zuzok" "sogak"
## [1] "mapig-kunod-pijoh-zujis" "vodap-bumad-notug-rirup"
Proquints are formed by alternating consonant/vowel/consonant/vowel/consonant using a subset of both (16 consonants and 4 vowels). This yields 2^16 (65,536) possibilities per word. Words are always lower case and always separated by a hyphen. So with 4 words there are 2^64 combinations in 23 characters.
Proquints are also useful in that they can be translated to and from integers. The proquint kapop has integer value 25258:
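For example:
ids::proquint_to_int("kapop")
ids::int_to_proquint(25258)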
## [1] 25258
## [1] "kapop"
This makes proquints suitable for creating human-pronounceable identifiers out of things like ip addresses, integer primary keys, etc.
The function ids::int_to_proquint_word will translate between proquint words and integers (and is vectorised):
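For example, applied to ten random 16-bit integers (the actual inputs are not shown; i and w are hypothetical names):
i <- sample(2^16, 10) - 1
w <- ids::int_to_proquint_word(i)
w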
## [1] "vugir" "fivoh" "pogop" "jomab" "pofos" "zidif" "zakik" "juraz" "barip"
## [10] "zunok"
and ids::proquint_word_to_int does the reverse:
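For example, with w as above:
ids::proquint_word_to_int(w)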
## [1] 60635 10148 43242 23040 43180 62546 61846 24271 730 65126
while ids::proquint_to_int and ids::int_to_proquint allow translation of multi-word proquints. Overflow is a real possibility; the maximum representable integer is only about 2.1 billion (.Machine$integer.max, i.e. 2^31 - 1) and the largest floating point number with integer accuracy is about 9007 trillion (2^53). These are big numbers but fairly small proquints:
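The two proquints below presumably correspond to roughly those two limits, via calls such as:
ids::int_to_proquint(.Machine$integer.max)
ids::int_to_proquint(2^53)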
## [1] "luzuz-zuzuv"
## [1] "babob-babab-babab-babab"
But if you had a 6-word proquint this would not work!
Too big for an integer:
## Error in proquint_combine(idx, len, as): Numeric overflow: cannot represent proquint as numeric
And too big for a numeric:
## Error in proquint_combine(idx, len, as): Numeric overflow: cannot represent proquint as numeric
To allow this, we use openssl's bignum support:
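A sketch of what the call might look like (p, a 6-word proquint, and the value of the as argument are assumptions based on the error messages above):
p <- ids::proquint(1, n_words = 6)  # hypothetical 6-word proquint
ids::proquint_to_int(p, as = "bignum")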
## [[1]]
## [b] 68240143167322810041142066081
This returns a list with one bignum element (a list is required to allow vectorisation).
The ids::ids function can build identifiers in the style of adjective_animal or sentence. It takes as input vectors of strings, one for each word in the identifier. This works particularly well with the rcorpora package, which includes many lists of strings.
Here is a list of Pokemon names:
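Matching the wrapper function further down, something like:
pokemon <- tolower(rcorpora::corpora("games/pokemon")$pokemon$name)
length(pokemon)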
## [1] 663
…and here is a list of adjectives:
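And similarly:
adjectives <- tolower(rcorpora::corpora("words/adjs")$adjs)
length(adjectives)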
## [1] 961
So we have a total pool size of about 637 thousand, which is not huge, but it is at least topical.
To generate one identifier:
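Presumably:
ids::ids(1, adjectives, pokemon)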
## [1] "conceptual_venomoth"
All the style-changing code is available:
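For example, for the dot-style output below (the style value "dot" is an assumption):
ids::ids(10, adjectives, pokemon, style = "dot")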
## [1] "lacklustre.cradily" "retiring.cradily" "quick.mankey"
## [4] "exponential.pikachu" "unwary.magnemite" "primer.gliscor"
## [7] "eaten.vibrava" "indiscriminate.victini" "sentient.vibrava"
## [10] "flammable.uxie"
Better would be to wrap this so that the constants are not passed around the whole time:
adjective_pokemon <- function(n = 1, style = "snake") {
pokemon <- tolower(rcorpora::corpora("games/pokemon")$pokemon$name)
adjectives <- tolower(rcorpora::corpora("words/adjs")$adjs)
ids::ids(n, adjectives, pokemon, style = style)
}
adjective_pokemon(10, "kebab")
## [1] "boundary-poliwag" "bohemian-lucario"
## [3] "contemporaneous-cresselia" "process-ampharos"
## [5] "commuting-liepard" "healthiest-eevee"
## [7] "inert-stunfisk" "cardinal-lopunny"
## [9] "employed-whimsicott" "running-regirock"
As a second example, we can use the word lists in rcorpora to generate identifiers in the form <mood>_<scientist>, like "melancholic_darwin". These are similar to the names of docker containers.
First the lists of names themselves:
moods <- tolower(rcorpora::corpora("humans/moods")$moods)
scientists <- tolower(rcorpora::corpora("humans/scientists")$scientists)
Moods include:
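Perhaps sampled like this:
sample(moods, 10)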
## [1] "great" "haughty" "threatened" "jaded" "sinful"
## [6] "thoughtful" "delighted" "dependent" "conventional" "dirty"
The scientists' names contain spaces, which is not going to work for us because ids won't correctly translate all internal spaces to the requested style:
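Perhaps:
sample(scientists, 10)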
## [1] "emil fischer" "frederick gowland hopkins"
## [3] "pierre-simon laplace" "alexander fleming"
## [5] "b. f. skinner" "stephen hawking"
## [7] "ernest rutherford" "james dwight dana"
## [9] "antoine lavoisier" "bill nye"
To hack around this we’ll just take the last name from the list and remove all hyphens:
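A sketch of that clean-up (variable names here are hypothetical):
last <- vapply(strsplit(scientists, " "), function(x) x[[length(x)]], character(1))
scientists <- gsub("-", "", last)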
Which gives strings that are just letters (though there are a few non-ASCII characters here that may cause problems because string handling is just a big pile of awful)
## [1] "gamow" "planck" "nye" "haxel" "fermi" "binet"
## [7] "halley" "drexler" "tombaugh" "wundt"
With the word lists, create an identifier:
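Presumably:
ids::ids(1, moods, scientists)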
## [1] "enraged_sagan"
Or pass NULL for n and create a function:
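For example (mood_scientist is a hypothetical name; the kebab style is inferred from the output below):
mood_scientist <- ids::ids(NULL, moods, scientists, style = "kebab")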
…which takes just the number of identifiers to generate as an argument:
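For example:
mood_scientist(10)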
## [1] "giddy-carver" "splendid-dirac" "reminiscent-hirase"
## [4] "private-tyson" "uplifted-buffon" "sabotaged-lucretius"
## [7] "contempt-czerny" "outraged-humboldt" "dignified-napier"
## [10] "awake-bosch"
Creating random identifiers requires some source of random numbers. There are a couple of competing needs here: reproducibility (being able to regenerate the same identifiers, e.g. after set.seed) and independence from R's global random number stream (identifiers that are unaffected by the seed and cannot be reproduced).
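The demonstration below was presumably produced by something along these lines (the seed value is a guess):
set.seed(1)
ids::adjective_animal(3)
set.seed(1)
ids::adjective_animal(3)        # identical to the previous call
ids::random_id(3, bytes = 4)
ids::random_id(3, bytes = 4)    # different every time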
## [1] "careful_annelida" "uncalorific_ivorygull" "marginal_tegus"
## [1] "careful_annelida" "uncalorific_ivorygull" "marginal_tegus"
## [1] "f57298cc" "09fe5b94" "a737c746"
## [1] "7c7d6193" "877c11c8" "c9d9f626"
To support this, ids uses two different sources of random numbers:

1. R's global random number stream, controlled by set.seed, as above. This is used by default in ids::adjective_animal, ids::proquint and ids::sentence (i.e., the human-readable ids functions), and can be selected if required for ids::random_id and ids::uuid by passing the option global = TRUE.

2. openssl::rand_num, or an internal source if openssl is unavailable (see below). These numbers are not affected by set.seed and we provide no way of resetting or reproducing the stream. This is the default for ids::random_id and ids::uuid (i.e., the non-reproducible ids functions), and can be selected if required for ids::proquint, ids::adjective_animal and ids::sentence by passing the option global = FALSE.

For non-global random numbers, all else being equal, you should prefer the numbers provided by openssl. The default for all functions is to use openssl if it is available. You can force openssl by passing use_openssl = TRUE (perhaps along with global = FALSE); if the package is not installed then we will error rather than continuing with the internal random number stream.
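For example, to insist on the openssl stream for a human-readable id (a sketch; this will error if openssl is not installed):
ids::adjective_animal(global = FALSE, use_openssl = TRUE)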
The internal random number stream is a pure-R implementation of the xoroshiro128+ algorithm. As such it is quite slow (about 50x slower per byte of output than the cryptographically secure numbers from openssl) but it is always available. We seed the stream by hashing (via utils::md5sum) the current process id and the system time at the maximum precision available on that platform.