gpt-tokenizer/text.py


2024-02-24 15:33:58 +00:00
# Text taken from https://www.reedbeta.com/blog/programmers-intro-to-unicode/.
text = """A Programmers Introduction to Unicode March 3, 2017 · Coding · 22 Comments ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺\u200c🇳\u200c🇮\u200c🇨\u200c🇴\u200c🇩\u200c🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I dont blame programmers for still finding the whole thing mysterious, even 30 years after Unicodes inception. A few months ago, I got interested in Unicode and decided to spend some time learning more about it in detail. In this article, Ill give an introduction to it from a programmers point of view. Im going to focus on the character set and whats involved in working with strings and files of Unicode text. However, in this article Im not going to talk about fonts, text layout/shaping/rendering, or localization in detail—those are separate issues, beyond my scope (and knowledge) here. Diversity and Inherent Complexity The Unicode Codespace Codespace Allocation Scripts Usage Frequency Encodings UTF-8 UTF-16 Combining Marks Canonical Equivalence Normalization Forms Grapheme Clusters And More… Diversity and Inherent Complexity As soon as you start to study Unicode, it becomes clear that it represents a large jump in complexity over character sets like ASCII that you may be more familiar with. Its not just that Unicode contains a much larger number of characters, although thats part of it. Unicode also has a great deal of internal structure, features, and special cases, making it much more than what one might expect a mere “character set” to be. Well see some of that later in this article. When confronting all this complexity, especially as an engineer, its hard not to find oneself asking, “Why do we need all this? Is this really necessary? 
Couldnt it be simplified?” However, Unicode aims to faithfully represent the entire worlds writing systems. The Unicode Consortiums stated goal is “enabling people around the world to use computers in any language”. And as you might imagine, the diversity of written languages is immense! To date, Unicode supports 135 different scripts, covering some 1100 languages, and theres still a long tail of over 100 unsupported scripts, both modern and historical, which people are still working to add. Given this enormous diversity, its inevitable that representing it is a complicated project. Unicode embraces that diversity, and accepts the complexity inherent in its mission to include all human writing systems. It doesnt make a lot of trade-offs in the name of simplification, and it makes exceptions to its own rules where necessary to further its mission. Moreover, Unicode is committed not just to supporting texts in any single language, but also to letting multiple languages coexist within one text—which introduces even more complexity. Most programming languages have libraries available to handle the gory low-level details of text manipulation, but as a programmer, youll still need to know about certain Unicode features in order to know when and how to apply them. It may take some time to wrap your head around it all, but dont be discouraged—think about the billions of people for whom your software will be more accessible through supporting text in their language. Embrace the complexity! The Unicode Codespace Lets start with some general orientation. The basic elements of Unicode—its “characters”, although that term isnt quite right—are called code points. Code points are identified by number, customarily written in hexadecimal with the prefix “U+”, such as U+0041 “A” latin capital letter a or U+03B8 “θ” greek small letter theta. Each code point also has a short name, and quite a few
"""
# Encode the text as UTF-8 and store each byte as an integer in the range 0..255.
tokens = list(map(int, text.encode('utf-8')))
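
A minimal sketch of what the byte-level conversion above produces, using a short hypothetical sample string (`sample` and the other names below are illustrative and not part of text.py). It shows that one character can become several byte tokens, and that the bytes decode back to the original string.

```python
# Hypothetical sample (not from text.py): an ASCII letter, a Greek letter, an emoji.
sample = "Aθ😄"

# Same conversion as in text.py: one integer (0..255) per UTF-8 byte.
sample_tokens = list(map(int, sample.encode("utf-8")))

# UTF-8 uses 1 to 4 bytes per code point, so tokens outnumber characters here.
bytes_per_char = [len(ch.encode("utf-8")) for ch in sample]

# Decoding the byte values recovers the original string exactly.
round_trip = bytes(sample_tokens).decode("utf-8")

print(sample_tokens)         # [65, 206, 184, 240, 159, 152, 132]
print(bytes_per_char)        # [1, 2, 4]
print(round_trip == sample)  # True
```

Since `bytes` already iterates as integers, `list(sample.encode("utf-8"))` would give the same list; the `map(int, ...)` form simply makes the conversion explicit.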