How does Python 2 represent Unicode internally?
When I read Python 2's official page on Unicode, it says:
"Under the hood, Python represents Unicode strings as either 16- or 32-bit integers, depending on how the Python interpreter was compiled."
What does the above sentence mean? Does it mean Python 2 has its own special encoding of Unicode? If so, why not just use UTF-8?
This statement simply means that the underlying C code supports both of these representations and that one of them is chosen when the interpreter is built, depending on the circumstances. Those circumstances are typically user choice, the compiler, and the operating system.
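One way to see which variant a given interpreter was built with is to look at sys.maxunicode from the standard library. The following is only a minimal sketch, and the helper name unicode_build is made up for illustration. On Python 2, a 16-bit ("narrow") build reports 0xFFFF and a 32-bit ("wide") build reports 0x10FFFF. Python 3.3 and later always report 0x10FFFF, so there the check always answers "wide".

```python
import sys

def unicode_build():
    """Hypothetical helper: classify the interpreter as 'narrow' or 'wide'.

    A narrow build stores text as 16-bit units, so the largest value that
    fits in a single unit is 0xFFFF. Anything else is treated as wide.
    """
    if sys.maxunicode == 0xFFFF:
        return "narrow"
    return "wide"

print(unicode_build())
```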
Now, a possible rationale for that. There are some reasons not to use UTF-8:
- First and foremost, indexing into a UTF-8 string is O(n) in complexity, while it is O(1) for UTF-32/UCS-4. That is irrelevant for streamed data, and UTF-8 can save space for transmission or storage, but in-memory handling is more convenient with one character per Unicode code point.
- Secondly, using one character per code point translates well to the API that Python provides in the language itself, so it is a natural choice.
- On MS Windows platforms, the native encoding for the UI and the filesystem is UTF-16, so using that encoding provides seamless integration with the platform.
- On some compilers, wchar_t is a 16-bit type, so if you wanted to use a 32-bit type there, you would have to reimplement all kinds of functions for your self-invented character type. Dropping support for anything above the Unicode BMP, or leaking surrogate sequences into the Python API, is then a reasonable compromise (but unfortunately one that sticks).
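The indexing argument in the first bullet can be sketched in a few lines. This is an illustration written for Python 3, not a description of what CPython does internally; the function names utf8_code_point_at and utf32_code_point_at are invented for the example, and both assume a non-negative index and well-formed input. The UTF-8 version has to walk the bytes from the start, while the UTF-32 version can jump straight to the right offset.

```python
def utf8_code_point_at(data, index):
    """Return the character at code point position `index` in UTF-8 bytes.

    Cost is O(n): every byte before the target has to be inspected,
    because characters occupy between one and four bytes.
    """
    count = -1      # position of the most recently started code point
    start = None    # byte offset where the wanted code point begins
    for pos, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:   # not a continuation byte: a new code point starts
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is not None:           # the wanted code point was the last one
        return data[start:].decode("utf-8")
    raise IndexError(index)


def utf32_code_point_at(data, index):
    """Return the character at position `index` in UTF-32-LE bytes.

    Cost is O(1): every code point occupies exactly four bytes.
    """
    chunk = data[4 * index:4 * index + 4]
    if len(chunk) != 4:
        raise IndexError(index)
    return chunk.decode("utf-32-le")


text = "a\u00e9\u20ac\U0001F600z"   # 1-, 2-, 3- and 4-byte UTF-8 sequences
utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")
print(utf8_code_point_at(utf8, 3) == utf32_code_point_at(utf32, 3))
```

The trade-off is visible in the sizes as well: for this sample the UTF-8 form is shorter than the UTF-32 form, which is why UTF-8 remains attractive for storage and transmission.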
Note that these are possible reasons; I don't claim that they apply to Python's implementation.
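To make "leaking surrogate sequences" more concrete: with 16-bit units, a code point above U+FFFF cannot fit in one unit and is split into two surrogates using the standard UTF-16 arithmetic. On a narrow Python 2 build this is why a single character outside the BMP shows up with a length of 2. The sketch below only reproduces that arithmetic; the name to_surrogate_pair is hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
```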