Byte pair encoding

Byte pair encoding[1][2] (also known as digram coding)[3] is an algorithm first described in 1994 by Philip Gage.[4] A modified version of the algorithm has recently become notable as a tokenizer for transformer models such as GPT-3:[5][6] the most frequent words are encoded as single tokens, while less frequent words that contain more frequent words as subwords are represented by multiple tokens, each standing for a word part.[7]

For example, the frequent word 'transform' appears in 'transformers' as its first subword, with 'ers' as the second, so 'transformers' can be represented by two tokens. All the unique tokens found in a corpus are listed in a token vocabulary, whose size in the case of GPT-3 is 50,257. Before tokenization can start, the frequency of each word in the training corpus has to be calculated; GPT performs this step with pre-tokenizers such as spaCy and ftfy.
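
As an illustration, the following is a minimal Python sketch of how such subword merges can be learned from word frequencies, in the spirit of the method of Sennrich et al.;[7] the function name, the toy word counts and the number of merges are purely illustrative and do not correspond to the actual GPT tokenizer.

    from collections import Counter

    def learn_bpe_merges(word_freqs, num_merges):
        """Learn merge rules from a {word: frequency} dict (illustrative sketch)."""
        # Represent each word as a tuple of symbols, initially single characters.
        vocab = {tuple(word): freq for word, freq in word_freqs.items()}
        merges = []
        for _ in range(num_merges):
            # Count adjacent symbol pairs, weighted by word frequency.
            pairs = Counter()
            for symbols, freq in vocab.items():
                for pair in zip(symbols, symbols[1:]):
                    pairs[pair] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)  # most frequent pair
            merges.append(best)
            # Merge the chosen pair wherever it occurs.
            new_vocab = {}
            for symbols, freq in vocab.items():
                merged, i = [], 0
                while i < len(symbols):
                    if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                        merged.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        merged.append(symbols[i])
                        i += 1
                new_vocab[tuple(merged)] = freq
            vocab = new_vocab
        return merges

    # Toy word frequencies; a real vocabulary would come from a pre-tokenized corpus.
    print(learn_bpe_merges({"transform": 10, "transformers": 6, "formers": 3}, 5))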

The original algorithm uses a simple and robust form of data compression in which the most common pair of contiguous bytes in a sequence is replaced with a byte that does not occur within the sequence. A lookup table of the replacements is required to rebuild the original data. The algorithm is effective for tokenization because it does not require a large computational overhead and remains consistent and reliable.

The original algorithm

The original algorithm operates by iteratively replacing the most common contiguous pair of characters in a target piece of text with an unused 'placeholder' byte. The iteration ends when no pair occurs more than once, leaving the target text effectively compressed. Decompression can be performed by reversing this process: each placeholder is looked up in the table and replaced by the sequence it denotes. In the original paper, this lookup table is encoded and stored alongside the compressed text.
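
A minimal Python sketch of the compression step, under the assumptions that the data is a byte string and that at least one byte value is unused, might look as follows (the function name and structure are illustrative, not Gage's original C implementation):

    def bpe_compress(data: bytes):
        """Illustrative sketch of byte pair encoding compression."""
        table = {}  # placeholder byte value -> pair of byte values it replaces
        while True:
            # Count all adjacent byte pairs in the current data.
            counts = {}
            for pair in zip(data, data[1:]):
                counts[pair] = counts.get(pair, 0) + 1
            if not counts:
                break
            best, freq = max(counts.items(), key=lambda kv: kv[1])
            if freq < 2:
                break  # no pair occurs more than once; stop
            # Find a byte value that does not occur in the data to act as placeholder.
            unused = next((b for b in range(256) if b not in data), None)
            if unused is None:
                break  # no free byte value left
            table[unused] = best
            data = data.replace(bytes(best), bytes([unused]))
        return data, table

    compressed, table = bpe_compress(b"aaabdaaabac")
    print(compressed, table)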

Example

Suppose the data to be encoded is

aaabdaaabac

The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, such as "Z". Now there is the following data and replacement table:

ZabdZabac
Z=aa

Then the process is repeated with byte pair "ab", replacing it with "Y":

ZYdZYac
Y=ab
Z=aa

The only literal byte pair left occurs only once, and the encoding might stop here. Alternatively, the process could continue with recursive byte pair encoding, replacing "ZY" with "X":

XdXac
X=ZY
Y=ab
Z=aa

This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once.

To decompress the data, simply perform the replacements in the reverse order.
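
A minimal Python sketch of this decompression step, using the replacement table from the example above (names are illustrative):

    def bpe_decompress(data: str, table: dict) -> str:
        # Expand placeholders in the reverse of the order they were introduced
        # (dicts preserve insertion order in Python 3.7+).
        for placeholder, expansion in reversed(list(table.items())):
            data = data.replace(placeholder, expansion)
        return data

    # Replacement table from the example, in the order the merges were made.
    table = {"Z": "aa", "Y": "ab", "X": "ZY"}
    print(bpe_decompress("XdXac", table))  # prints 'aaabdaaabac'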

References

  1. ^ Gage, Philip (1994). "A New Algorithm for Data Compression". The C User Journal.
  2. ^ "A New Algorithm for Data Compression". Dr. Dobb's Journal. 1 February 1994. Retrieved 10 August 2020.
  3. ^ Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1994). Managing Gigabytes. New York: Van Nostrand Reinhold. ISBN 978-0-442-01863-4.
  4. ^ "Byte Pair Encoding". Archived from the original on 2016-03-26.
  5. ^ Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini (2020-06-04). "Language Models are Few-Shot Learners". arXiv:2005.14165 [cs.CL].
  6. ^ "google/sentencepiece". Google. 2021-03-02. Retrieved 2021-03-02.
  7. ^ Sennrich, Rico; Haddow, Barry; Birch, Alexandra (2016-08-12). "Neural Machine Translation of Rare Words with Subword Units". Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (ACL): 1715–1725. doi:10.18653/v1/P16-1162.