UTF-32
== Utility of fixed width ==
A fixed number of bytes per code point has theoretical advantages, but each of these has problems in reality:
* Truncation becomes easier, but not significantly so compared to [[UTF-8]] and [[UTF-16]] (both of which can search backwards for the point to truncate by looking at 2–4 code units at most).{{efn|For UTF-8: Select point to truncate at. If the byte before it is 0-0x7F, or the byte after it is anything other than the continuation bytes 0x80-0xBF, the string can be truncated at that point. Otherwise search up to 3 bytes backwards for such a point and truncate at that. If not found, truncate at the original position. This works even if there are encoding errors in the UTF-8. UTF-16 is trivial and only has to back up one word at most.}}{{citation needed|date=January 2023}}
* Finding the ''Nth'' character in a string. For fixed width, this is simply an [[Big O notation|O(1)]] problem, while it is an [[Big O notation|O(n)]] problem in a variable-width encoding. Novice programmers often vastly overestimate how useful this is.<ref name=manishearth>{{Cite web|title=Let's Stop Ascribing Meaning to Code Points |website=In Pursuit of Laziness |first1=Manish |last1=Goregaokar |date=January 14, 2017 |url=https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/|access-date=2020-06-14|quote=Folks start implying that code points mean something, and that O(1) indexing or slicing at code point boundaries is a useful operation. }}</ref> Also, what a user might call a "character" is still variable-width: for instance, the [[combining character]] sequence {{char|á}} could be 2 code points, the emoji {{char|👨‍🦲}} is three,<ref>{{Cite web|title=👨‍🦲 Man: Bald Emoji|url=https://emojipedia.org/man-bald/|access-date=2021-10-12|website=Emojipedia|language=en}}</ref> and the ligature {{char|ﬀ}} is one.
* Quickly knowing the "width" of a string. However, even [[Duospaced font|"fixed width" fonts]] have varying widths: [[CJK characters|CJK ideographs]] are often twice as wide,<ref name=manishearth/> in addition to the already-mentioned problem that the number of code points does not equal the number of characters.
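
The points above can be illustrated with a minimal Python sketch. It is not taken from any cited source: the function names (<code>truncate_utf8</code>, <code>nth_code_point_utf8</code>, <code>nth_code_point_utf32</code>) are hypothetical, the truncation routine follows the procedure described in the note above, and the indexing helpers assume well-formed input (UTF-32 little-endian without a byte order mark).

```python
import unicodedata


def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (0x80-0xBF)."""
    return 0x80 <= byte <= 0xBF


def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut data to at most `limit` bytes without splitting a code point.

    A cut at `pos` is accepted if the byte before it is 0x00-0x7F or the
    byte after it is not a continuation byte. Otherwise up to 3 earlier
    positions are tried; if none qualifies, the original position is used.
    """
    limit = max(limit, 0)
    if limit >= len(data):
        return data
    for back in range(4):  # the chosen position plus up to 3 bytes back
        pos = limit - back
        if pos <= 0:
            return data[:0]
        if data[pos - 1] <= 0x7F or not is_continuation(data[pos]):
            return data[:pos]
    return data[:limit]  # malformed input: fall back to the original cut


def nth_code_point_utf32(data: bytes, n: int) -> str:
    """O(1): every code point occupies exactly 4 bytes (assumes UTF-32LE)."""
    return data[4 * n:4 * n + 4].decode("utf-32-le")


def nth_code_point_utf8(data: bytes, n: int) -> str:
    """O(n): bytes must be scanned to count code point boundaries."""
    start = None
    count = -1
    for i, byte in enumerate(data):
        if is_continuation(byte):
            continue
        count += 1
        if count == n:
            start = i
        elif count == n + 1:
            return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")


# What a user perceives as one "character" can span several code points.
# Python's len() on a str counts code points.
combining_a = "a\u0301"                  # a + combining acute accent
bald_man = "\U0001F468\u200D\U0001F9B2"  # man + zero-width joiner + bald
ligature = "\uFB00"                      # single-code-point ligature
code_point_counts = [len(combining_a), len(bald_man), len(ligature)]

# Display width is not uniform either: the East Asian Width property
# marks CJK ideographs as wide and basic Latin letters as narrow.
widths = [unicodedata.east_asian_width(c) for c in ("a", "\u6F22")]
```

In this sketch, <code>truncate_utf8("aé€😀".encode("utf-8"), 8)</code> backs up two bytes and returns the encoding of "aé€", so the fixed-width advantage for truncation amounts to skipping a short backward scan. The two indexing helpers return the same code point, but neither returns a user-perceived character when the text contains combining sequences or joined emoji.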