If a model encounters a word it doesn't know, it breaks it into smaller chunks it does recognize. For example:

- The word "rarity" might be split into rar + ##ity.
- The word "unrar" might become un + ##rar.
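Here is a minimal sketch of that behaviour, assuming the Hugging Face `transformers` library and the pretrained `bert-base-uncased` vocabulary; the exact pieces it produces depend on that vocabulary and may differ from the illustrative splits above.

```python
from transformers import AutoTokenizer

# Load BERT's pretrained WordPiece tokenizer (downloads the vocabulary
# on first use).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Words that are not in the vocabulary get broken into known pieces;
# a "##" prefix marks a piece that continues the previous word.
for word in ["rarity", "unrar"]:
    print(word, "->", tokenizer.tokenize(word))
```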

In the world of BERT, each of those pieces is looked up by number: a token ID isn't just a digit, it's the vocabulary index of a subword such as "rar".
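A quick sketch of that lookup, under the same assumptions as above (`transformers` and `bert-base-uncased`; the specific ID assigned to any subword depends on the vocabulary file):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Each subword piece is replaced by its integer ID from the vocabulary;
# those integers, not the text, are what the model actually receives.
tokens = tokenizer.tokenize("unrar")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(list(zip(tokens, ids)))
```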


This way, the model doesn't need to memorize every single version of a word.