A Fast WordPiece Tokenization System

Tokenization is a fundamental pre-processing step for many natural language processing (NLP) applications. It involves splitting text into smaller units called tokens (e.g., words or word segments) in order to turn an unstructured input string into a sequence of discrete elements that is suitable for a machine learning (ML) model. In deep learning-based models (e.g., BERT), each token is mapped to an embedding vector to be fed into the model.

Tokenization in a typical deep learning model, like BERT.

A fundamental tokenization approach is to break text into words. However, using this approach, words that are not included in the vocabulary are treated as “unknown”. Modern NLP models address this issue by tokenizing text into subword units, which often retain linguistic meaning (e.g., morphemes). So, even though a word may be unknown to the model, individual subword tokens may retain enough information for the model to infer the meaning to some extent. One such subword tokenization technique that is commonly used and can be applied to many other NLP models is called WordPiece. Given text, WordPiece first pre-tokenizes the text into words (by splitting on punctuation and whitespace) and then tokenizes each word into subword units, called wordpieces.

The WordPiece tokenization process with an example sentence.

In “Fast WordPiece Tokenization”, presented at EMNLP 2021, we developed an improved end-to-end WordPiece tokenization system that speeds up the tokenization process, reducing the overall model latency and saving computing resources. In comparison to traditional algorithms that have been used for decades, this approach reduces the complexity of the computation by an order of magnitude, resulting in significantly improved performance, up to 8x faster than standard approaches. The system has been applied successfully in a number of systems at Google and has been publicly released in TensorFlow Text.
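For readers who want to try the released tokenizer, the snippet below is a brief usage sketch assuming the FastWordpieceTokenizer class available in recent TensorFlow Text releases; exact argument names and defaults may vary between versions, so treat it as illustrative rather than authoritative.

```python
# Usage sketch for TensorFlow Text's FastWordpieceTokenizer (assumed API;
# argument names may differ across TensorFlow Text versions).
import tensorflow as tf
import tensorflow_text as tf_text

# A toy vocabulary; a real model would load the vocabulary released with BERT.
vocab = ["a", "abcd", "##b", "##bc", "##z", "[UNK]"]

tokenizer = tf_text.FastWordpieceTokenizer(
    vocab=vocab,
    token_out_type=tf.string,  # return string pieces instead of vocabulary ids
)

# By default the tokenizer also performs pre-tokenization, i.e., it splits on
# punctuation and whitespace before producing wordpieces (end-to-end mode).
print(tokenizer.tokenize(["abcz abcd"]))
```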

Single-Word WordPiece Tokenization

WordPiece uses a greedy longest-match-first strategy to tokenize a single word; that is, it iteratively picks the longest prefix of the remaining text that matches a word in the model’s vocabulary. This approach is known as maximum matching or MaxMatch, and has also been used for Chinese word segmentation since the 1980s. Yet despite its wide use in NLP for decades, it is still relatively computation intensive, with the commonly adopted MaxMatch approaches’ computation being quadratic with respect to the input word length (n). This is because two pointers are needed to scan over the input: one to mark a start position, and the other to search for the longest substring matching a vocabulary token at that position.
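As a point of reference, a straightforward quadratic-time MaxMatch can be written as follows. This is a simplified Python sketch of the baseline behavior, not the code of any particular library.

```python
# A simple quadratic-time MaxMatch reference (illustrative sketch only).
# For each start position it scans candidate end positions from longest to
# shortest, so the work is O(n^2) in the input word length n.
def max_match(word, vocab, suffix_indicator="##"):
    tokens, start = [], 0
    while start < len(word):
        end, current = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = suffix_indicator + piece  # mark word-internal pieces
            if piece in vocab:
                current = piece  # longest match at this start position
                break
            end -= 1
        if current is None:
            return ["[UNK]"]  # no vocabulary token matches: unknown word
        tokens.append(current)
        start = end
    return tokens

vocab = {"a", "abcd", "##b", "##bc", "##z"}
print(max_match("abcz", vocab))  # ['a', '##bc', '##z']
```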

We propose an alternative to the MaxMatch algorithm for WordPiece tokenization, called LinMaxMatch, which has a tokenization time that is strictly linear with respect to n. First, we organize the vocabulary tokens in a trie (also called a prefix tree), where each trie edge is labeled by a character, and a tree path from the root to some node represents a prefix of some token in the vocabulary. In the figure below, nodes are depicted as circles and tree edges are black solid arrows. Given a trie, a vocabulary token can be matched against an input text by traversing from the root and following the trie edges to match the input character by character; this process is referred to as trie matching.

The figure below shows the trie created from the vocabulary consisting of “a”, “abcd”, “##b”, “##bc”, and “##z”. An input text “abcd” can be matched to a vocabulary token by walking from the root (upper left) and following the trie edges with labels “a”, “b”, “c”, “d” one by one. (The leading “##” symbols are special characters used in WordPiece tokenization that are described in more detail below.)

Trie diagram of the vocabulary [“a”, “abcd”, “##b”, “##bc”, “##z”]. Circles and arrows represent nodes and edges of the trie, respectively.
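To make trie matching concrete, here is a minimal Python sketch (not the released implementation) that builds a trie over this toy vocabulary and walks it to find the longest vocabulary token matching a prefix of the input.

```python
# Minimal trie sketch: each node is a dict of child edges; the "$" key marks
# that the path from the root to this node spells a complete vocabulary token.
def build_trie(vocab):
    root = {}
    for token in vocab:
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node["$"] = token
    return root

def longest_trie_match(text, trie):
    """Follow trie edges character by character, remembering the longest token seen."""
    node, longest = trie, None
    for ch in text:
        if ch not in node:
            break  # no edge for this character: stop the walk
        node = node[ch]
        if "$" in node:
            longest = node["$"]
    return longest

trie = build_trie(["a", "abcd", "##b", "##bc", "##z"])
print(longest_trie_match("abcd", trie))  # 'abcd'
print(longest_trie_match("abcz", trie))  # 'a'
```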

Second, inspired by the Aho-Corasick algorithm, a classical string-searching algorithm invented in 1975, we introduce a method that breaks out of a trie branch that fails to match the given input and skips directly to an alternative branch to continue matching. As in standard trie matching, during tokenization, we follow the trie edges to match the input characters one by one. When trie matching cannot match an input character for a given node, a standard algorithm would backtrack to the last character where a token was matched and then restart the trie matching procedure from there, which results in repetitive and wasteful iterations. Instead of backtracking, our method triggers a failure transition, which is done in two steps: (1) it collects the precomputed tokens stored at that node, which we call failure pops; and (2) it then follows the precomputed failure link to a new node, from which the trie matching process continues.

For example, given a model with the vocabulary described above (“a”, “abcd”, “##b”, “##bc”, and “##z”), WordPiece tokenization distinguishes subword tokens matching at the start of the input word from subword tokens starting in the middle (the latter being marked with two leading hashes “##”). Hence, for the input text “abcz”, the expected tokenization output is [“a”, “##bc”, “##z”], where “a” matches at the beginning of the input while “##bc” and “##z” match in the middle. For this example, the figure below shows that, after successfully matching three characters ‘a’, ‘b’, ‘c’, trie matching cannot match the next character ‘z’ because “abcz” is not in the vocabulary. In this situation, LinMaxMatch conducts a failure transition by outputting the first recognized token (using the failure pop token “a”) and following the failure link to a new node to continue the matching process (in this case, the node with “##bc” as its failure pops). The process then repeats from the new node.

Trie structure for the same vocabulary as shown in the example above, now illustrating the approach taken by our new Fast WordPiece Tokenizer algorithm. Failure pops are bracketed and shown in purple. Failure links between nodes are indicated with dashed red arrows.
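The sketch below illustrates the failure-transition idea on the same toy vocabulary. To keep it short, the failure links and failure pops are written out by hand to mirror the figure rather than produced by the precomputation procedure described in the paper, so it is an illustration of the matching loop, not the released implementation.

```python
# Hand-built LinMaxMatch tables for the toy vocabulary
# ["a", "abcd", "##b", "##bc", "##z"]. Node names are the matched prefixes;
# "##" is the suffix root used for word-internal pieces. The failure links
# and failure pops below are transcribed by hand from the figure.
CHILDREN = {
    "root": {"a": "a"},
    "a": {"b": "ab"}, "ab": {"c": "abc"}, "abc": {"d": "abcd"}, "abcd": {},
    "##": {"b": "##b", "z": "##z"}, "##b": {"c": "##bc"}, "##bc": {}, "##z": {},
}
FAIL_LINK = {"a": "##", "ab": "##b", "abc": "##bc", "abcd": "##",
             "##b": "##", "##bc": "##", "##z": "##"}
FAIL_POPS = {"a": ["a"], "ab": ["a"], "abc": ["a"], "abcd": ["abcd"],
             "##b": ["##b"], "##bc": ["##bc"], "##z": ["##z"]}

def lin_max_match(word):
    """Tokenize one word using failure transitions instead of backtracking."""
    tokens, node, i = [], "root", 0
    while i < len(word):
        child = CHILDREN[node].get(word[i])
        if child is not None:        # normal trie step
            node, i = child, i + 1
        elif node in FAIL_LINK:      # failure transition: emit pops, jump
            tokens.extend(FAIL_POPS[node])
            node = FAIL_LINK[node]
        else:                        # stuck at a root: the word is unmappable
            return ["[UNK]"]
    # At the end of the input, flush the remaining failure pops.
    while node not in ("root", "##"):
        tokens.extend(FAIL_POPS[node])
        node = FAIL_LINK[node]
    return tokens

print(lin_max_match("abcz"))  # ['a', '##bc', '##z']
print(lin_max_match("abcd"))  # ['abcd']
```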

Since at least n operations are required to read the entire input, the LinMaxMatch algorithm is asymptotically optimal for the MaxMatch problem.

End-to-End WordPiece Tokenization

While existing systems pre-tokenize the input text (splitting it into words by punctuation and whitespace characters) and then call WordPiece tokenization on each resulting word, we propose an end-to-end WordPiece tokenizer that combines pre-tokenization and WordPiece into a single, linear-time pass. It uses the LinMaxMatch trie matching and failure transitions as much as possible and only checks for punctuation and whitespace characters among the relatively few input characters that are not handled by the loop. It is more efficient because it traverses the input only once, performs fewer punctuation / whitespace checks, and skips the creation of intermediate words.

End-to-End WordPiece Tokenization.
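To show the single-pass structure, the sketch below extends the hand-built tables from the single-word sketch above with a crude notion of punctuation and whitespace. Real systems handle Unicode boundaries and look up punctuation tokens in the vocabulary, which this simplified illustration skips.

```python
import string

# Same hand-built tables as in the single-word sketch (toy vocabulary
# ["a", "abcd", "##b", "##bc", "##z"]).
CHILDREN = {
    "root": {"a": "a"},
    "a": {"b": "ab"}, "ab": {"c": "abc"}, "abc": {"d": "abcd"}, "abcd": {},
    "##": {"b": "##b", "z": "##z"}, "##b": {"c": "##bc"}, "##bc": {}, "##z": {},
}
FAIL_LINK = {"a": "##", "ab": "##b", "abc": "##bc", "abcd": "##",
             "##b": "##", "##bc": "##", "##z": "##"}
FAIL_POPS = {"a": ["a"], "ab": ["a"], "abc": ["a"], "abcd": ["abcd"],
             "##b": ["##b"], "##bc": ["##bc"], "##z": ["##z"]}

def _is_boundary(ch):
    return ch in string.whitespace or ch in string.punctuation

def _flush(node, out):
    """At a word boundary, follow failure links to emit the remaining pops."""
    while node not in ("root", "##"):
        out.extend(FAIL_POPS[node])
        node = FAIL_LINK[node]

def tokenize_end_to_end(text):
    """One pass over the text: boundaries and wordpieces handled in one loop."""
    out, node, i = [], "root", 0
    word_start_len = 0            # number of output tokens when this word began
    while i < len(text):
        ch = text[i]
        if _is_boundary(ch):
            _flush(node, out)     # finish the current word
            if ch in string.punctuation:
                # Punctuation becomes its own token here; a real system would
                # also look it up in the vocabulary.
                out.append(ch)
            node, i = "root", i + 1
            word_start_len = len(out)
        else:
            child = CHILDREN[node].get(ch)
            if child is not None:          # normal trie step
                node, i = child, i + 1
            elif node in FAIL_LINK:        # failure transition
                out.extend(FAIL_POPS[node])
                node = FAIL_LINK[node]
            else:                          # whole word is unmappable
                del out[word_start_len:]   # drop partial pieces for this word
                out.append("[UNK]")
                while i < len(text) and not _is_boundary(text[i]):
                    i += 1                 # skip to the next boundary
                node = "root"
    _flush(node, out)
    return out

print(tokenize_end_to_end("abcz abcd, aq"))
# ['a', '##bc', '##z', 'abcd', ',', '[UNK]']
```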

Benchmark Results

We benchmark our method against two widely adopted WordPiece tokenization implementations: HuggingFace Tokenizers, from the HuggingFace Transformers library, one of the most popular open-source NLP tools, and TensorFlow Text, the official library of text utilities for TensorFlow. We use the WordPiece vocabulary released with the BERT-Base, Multilingual Cased model.

We compared our algorithms with HuggingFace and TensorFlow Text on a large corpus (several million words) and found that the way the strings are split into tokens is identical to the other implementations for both single-word and end-to-end tokenization.

To generate the test data, we sample 1,000 sentences from the multilingual Wikipedia dataset, covering 82 languages. On average, each word has four characters, and each sentence has 82 characters or 17 words. We found this dataset large enough because a much larger dataset (consisting of hundreds of thousands of sentences) generated similar results.

We compare the average runtime when tokenizing a single word or general text (end-to-end) for each system. The Fast WordPiece tokenizer is 8.2x faster than HuggingFace and 5.1x faster than TensorFlow Text, on average, for general text end-to-end tokenization.

Average runtime of each system. Note that, for better visualization, single-word tokenization and end-to-end tokenization are shown in different scales.

We also examine how the runtime grows with respect to the input length for single-word tokenization. Because of its linear-time complexity, the runtime of LinMaxMatch increases at most linearly with the input length, growing much more slowly than the quadratic-time approaches.

The average runtime of each system with respect to the input length for single-word tokenization.

Conclusion

We proposed LinMaxMatch for single-word WordPiece tokenization, which solves the decades-old MaxMatch problem in asymptotically optimal time with respect to the input length. LinMaxMatch extends the Aho-Corasick algorithm, and the idea can be applied to more string search and transducer problems. We also proposed an End-to-End WordPiece algorithm that combines pre-tokenization and WordPiece tokenization into a single, linear-time pass for even higher efficiency.

Acknowledgements

We gratefully acknowledge the key contributions and useful advice from other team members and colleagues, including Abbas Bazzi, Alexander Frömmgen, Alex Salcianu, Andrew Hilton, Bradley Green, Ed Chi, Chen Chen, Dave Dopson, Eric Lehman, Fangtao Li, Gabriel Schubiner, Gang Li, Greg Billock, Hong Wang, Jacob Devlin, Jayant Madhavan, JD Chen, Jifan Zhu, Jing Li, John Blitzer, Kirill Borozdin, Kristina Toutanova, Majid Hadian-Jazi, Mark Omernick, Max Gubin, Michael Fields, Michael Kwong, Namrata Godbole, Nathan Lintz, Pandu Nayak, Pew Putthividhya, Pranav Khaitan, Robby Neale, Ryan Doherty, Sameer Panwar, Sundeep Tirumalareddy, Terry Huang, Thomas Strohmann, Tim Herrmann, Tom Small, Tomer Shani, Wenwei Yu, Xiaoxue Zang, Xin Li, Yang Guo, Yang Song, Yiming Xiao, Yuan Shen, and many more.
