How does Python 2 represent Unicode internally?

January 15, 2012

When I read Python 2's official page on Unicode, it says:

"Under the hood, Python represents Unicode strings as either 16- or 32-bit integers, depending on how the Python interpreter was compiled."

What does the above sentence mean? Does it mean that Python 2 has its own special encoding of Unicode? If so, why not just use UTF-8?

This statement means that the underlying C code can use either of these encodings, and which variant is chosen depends on the circumstances and is fixed when the interpreter is built. Those circumstances are typically user choice, the compiler, and the operating system.
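As a rough way to see which variant a given interpreter uses, one can look at `sys.maxunicode`, which is 0xFFFF on a 16-bit ("narrow") Python 2 build and 0x10FFFF on a 32-bit ("wide") one. The sketch below wraps that check in a helper; the name `build_kind` is made up for illustration and is not part of Python.

```python
import sys


def build_kind(max_code_point=sys.maxunicode):
    """Classify a build by the largest code point one string element can hold.

    `build_kind` is a hypothetical helper; only sys.maxunicode comes from Python.
    """
    if max_code_point == 0xFFFF:
        return "narrow (16-bit code units)"
    return "wide (32-bit code units)"


print(build_kind())
```

Note that on Python 3.3 and later `sys.maxunicode` is always 0x10FFFF, so this distinction is only meaningful on Python 2 and early Python 3 interpreters.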

Now, as a possible rationale for that, there are reasons not to use UTF-8:

First and foremost, indexing into a UTF-8 string is O(n) in complexity, while it is O(1) for UTF-32/UCS-4. While that is irrelevant for streamed data, and UTF-8 can save space for transmission or storage, in-memory handling is more convenient with one character per Unicode code point.
Secondly, using one character per code point translates well to the API that Python provides in the language itself, so it is a natural choice.
On MS Windows platforms, the native encoding for the UI and filesystem is UTF-16, so using that encoding provides seamless integration with the platform.
On some compilers wchar_t is a 16-bit type, so if you wanted to use a 32-bit type there, you would have to reimplement all kinds of functions for your self-invented character type. Dropping support for anything above the Unicode BMP, or leaking surrogate sequences into the Python API, is a reasonable compromise then (but one that sticks, unfortunately).
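The indexing and surrogate points above can be illustrated with a small sketch. It does not show how CPython itself stores strings; it only encodes one sample string three ways and compares how the n-th code point is located in each. The helper names `utf32_code_point` and `utf8_code_point_offset` are invented for this example.

```python
import struct

# One character each of 1, 2, 3 and 4 UTF-8 bytes: a, e-acute, euro sign, U+1F600.
text = u"a\u00e9\u20ac\U0001F600"

utf8 = text.encode("utf-8")        # 1 + 2 + 3 + 4 = 10 bytes
utf16 = text.encode("utf-16-le")   # 5 code units: U+1F600 needs a surrogate pair
utf32 = text.encode("utf-32-le")   # 4 code units of 4 bytes each = 16 bytes


def utf32_code_point(data, index):
    """Fixed width: the byte offset is just index * 4, so lookup is O(1)."""
    return struct.unpack_from("<I", data, index * 4)[0]


def utf8_code_point_offset(data, index):
    """Variable width: scan from the start, counting lead bytes, so lookup is O(n)."""
    seen = -1
    for offset, byte in enumerate(bytearray(data)):
        if byte & 0xC0 != 0x80:    # not a continuation byte (10xxxxxx)
            seen += 1
            if seen == index:
                return offset
    raise IndexError(index)


utf16_units = len(utf16) // 2
print(len(utf8), utf16_units, len(utf32))
print(hex(utf32_code_point(utf32, 3)), utf8_code_point_offset(utf8, 3))
```

The UTF-16 count of 5 units for 4 characters is the surrogate "leak" mentioned above: on a narrow Python 2 build, `len(u"\U0001F600")` is 2 rather than 1.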

Note that these are possible reasons; I don't claim that any of them apply to Python's implementation.
