== Implementations and adoption == [[File:UTF-8 takes over.png|thumb|400px|Declared character set for the 10 million most popular websites from 2010 to 2021.]] [[File:Utf8webgrowth.svg|thumb|400px|Use of the main encodings on the web from 2001 to 2012 as recorded by Google,<ref name=MarkDavis2012>{{ cite web | author-last=Davis |author-first=Mark |author-link=Mark Davis (Unicode) | date=2012-02-03 | title=Unicode over 60 percent of the web | website=Official Google blog | url=https://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html | url-status=live |access-date=2020-07-24 | archive-url=https://web.archive.org/web/20180809152828/https://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html | archive-date=2018-08-09 }}</ref> with UTF-8 overtaking all others in 2008 and over 60% of the web in 2012 (since then approaching 100%). UTF-8 is the only encoding of Unicode (explicitly) listed there, and the rest only provide subsets of Unicode. The ASCII-only figure includes all web pages that only contain ASCII characters, regardless of the declared header.]] {{See also|Popularity of text encodings}} UTF-8 has been the most common encoding for the [[World Wide Web]] since 2008.<ref name=markdavis>{{cite web | first=Mark |last=Davis |author-link=Mark Davis (Unicode) | date=2008-05-05 | title=Moving to Unicode 5.1 | website=Official Google blog |language=en | url=https://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html | access-date=2023-03-13 }}</ref> {{As of|2025|01}}, UTF-8 is used by 98.5% of surveyed web sites.<ref name=W3TechsWebEncoding>{{Cite web|url=https://w3techs.com/technologies/cross/character_encoding/ranking|title=Usage Survey of Character Encodings broken down by Ranking |website=W3Techs |language = en | date = January 2025 |access-date=2025-01-07}}</ref> Although many pages only use ASCII characters to display content, very few websites now declare their encoding to only be ASCII instead of UTF-8.<ref>{{cite web 
|url=https://w3techs.com/technologies/details/en-usascii |title = Usage statistics and market share of ASCII for websites | date = January 2025 | website = W3Techs | access-date = 2025-01-07 }}</ref> For virtually all countries and <!-- over 97% all of the tracked --> languages, UTF-8 accounts for 95% or more of the encodings used on the web. <!-- Over 61% of the languages tracked have <!- currently 61.4% have at least 99.5% UTF-8 support which rounds up to 100% (44.5% have "100.0%" which means 99.95+%) -> 100% UTF-8 use. --> Many standards support only UTF-8; for example, [[JSON]] exchange requires it (without a byte-order mark (BOM)).<ref name=rfc8259>{{ cite IETF | last = Bray | first = Tim | editor-last = Bray | editor-first = Tim | date = December 2017 | title = The JavaScript Object Notation (JSON) Data Interchange Format | publisher = IETF | doi = 10.17487/RFC8259 | access-date = 16 February 2018 | rfc = 8259 }}</ref> The [[WHATWG]] also recommends UTF-8 for the HTML and [[Document Object Model|DOM]] specifications, stating that "UTF-8 encoding is the most appropriate encoding for interchange of [[Unicode]]",<ref name=whatwg>{{ cite web | title = Encoding Standard | website = encoding.spec.whatwg.org | url = https://encoding.spec.whatwg.org/#preface | access-date = 2020-04-15 }}</ref> and the [[Internet Mail Consortium]] recommends that all e‑mail programs be able to display and create mail using UTF-8.<ref name=IMC>{{ cite web | url = https://www.imc.org/mail-i18n.html | title = Using International Characters in Internet Mail | publisher = Internet Mail Consortium | date = 1998-08-01 | access-date = 2007-11-08 | url-status = dead | archive-url = https://web.archive.org/web/20071026103104/https://www.imc.org/mail-i18n.html | archive-date = 2007-10-26}}</ref><ref name=mandatory>{{ cite web | title = Encoding Standard | website = encoding.spec.whatwg.org |language = en | url = https://encoding.spec.whatwg.org/#security-background | access-date = 2018-11-15 }}</ref> The 
[[World Wide Web Consortium]] recommends UTF-8 as the default encoding in XML and HTML (not only using UTF-8 but also declaring it in metadata), "even when all characters are in the ASCII range ... Using non-UTF-8 encodings can have unexpected results".<ref name=html5charset>{{ cite report | section = Specifying the document's character encoding | title = HTML 5.2 | date = 14 December 2017 | publisher = [[World Wide Web Consortium]] | url = https://www.w3.org/TR/html5/document-metadata.html | section-url = https://www.w3.org/TR/html5/document-metadata.html#charset | access-date = 2018-06-03 | mode = cs1 }}</ref> Many software programs can read and write UTF-8, although doing so may require the user to change options from the default settings, or may require a BOM (byte-order mark) as the first character for the file to be read correctly. Examples of software supporting UTF-8 include [[Microsoft Word]],<!-- "Unicode (UTF-8)", "Unicode (Big-Endian)" and "Unicode (UTF-7)" --><ref>{{ cite web | title=Choose text encoding when you open and save files | website=Microsoft Support (support.microsoft.com) | url=https://support.microsoft.com/en-us/office/choose-text-encoding-when-you-open-and-save-files-60d59c21-88b5-4006-831c-d536d42fd861 | access-date=2021-11-01 }}</ref><ref>{{ cite web | title=UTF-8 - Character encoding of Microsoft ''Word'' <code>DOC</code> and <code>DOCX</code> files? 
| website=Stack Overflow | url=https://stackoverflow.com/questions/28172022/character-encoding-of-microsoft-word-doc-and-docx-files | access-date=2021-11-01 }}</ref><!-- <ref>{{ cite web | last=Gao |first=Ivy | title=How to fix corrupted character encoding (corrupted text) in Microsoft ''Word'' | website=TurboFuture | url=https://turbofuture.com/computers/3-Easy-Ways-To-Fix-Corrupted-Character-Encoding-In-Plain-Text-Documents | access-date=2021-11-01 | lang=en }}</ref> --><ref>{{ cite web | title = Exporting a UTF-8 <code>.txt</code> file from ''Word'' | website = support.3playmedia.com | date = 14 March 2023 | url = https://support.3playmedia.com/hc/en-us/articles/227730088-Exporting-a-UTF-8-txt-file-from-Word }}</ref> [[Microsoft Excel]] (2016 and later),<ref>{{ cite web | title = Are <code>XLSX</code> files UTF-8 encoded, by definition? | series = Excel | website = Stack Overflow | url = https://stackoverflow.com/questions/45194771/are-xlsx-files-utf-8-encoded-by-definition | access-date = 2021-11-01 }}</ref><ref>{{ cite web | author1 = Abhinav, Ankit | author2 = Xu, Jazlyn | date = April 13, 2020 | title = How to open UTF-8 <code>CSV</code> file in ''Excel'' without mis-conversion of characters in Japanese and Chinese language for both Mac and Windows? | website = Microsoft Support Community | language = en-US | url = https://answers.microsoft.com/en-us/msoffice/forum/all/how-to-open-utf-8-csv-file-in-excel-without-mis/1eb15700-d235-441e-8b99-db10fafff3c2 | access-date = 2021-11-01 }}</ref> [[Google Drive]], [[LibreOffice]],<ref>{{ cite web | title = Save a CSV file as UTF-8 | series = LibreOffice | website = RO CSVI | url = https://rolandd.com/documentation/ro-csvi/save-a-csv-file-as-utf-8 | access-date = 2025-05-20 }}</ref> and most databases. 
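
The BOM handling described above can be sketched in Python. This is a minimal illustration, not a description of how any of the listed programs work: the helper names <code>decode_utf8_optional_bom</code> and <code>encode_utf8</code> are hypothetical, while <code>utf-8-sig</code> and <code>codecs.BOM_UTF8</code> are part of the Python standard library.

```python
import codecs


def decode_utf8_optional_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM (EF BB BF) if one is present."""
    # The "utf-8-sig" codec accepts input both with and without a BOM.
    return data.decode("utf-8-sig")


def encode_utf8(text: str, with_bom: bool = False) -> bytes:
    """Encode text as UTF-8, optionally prefixing the BOM that some readers expect."""
    payload = text.encode("utf-8")
    return codecs.BOM_UTF8 + payload if with_bom else payload


if __name__ == "__main__":
    plain = encode_utf8("naïve")
    marked = encode_utf8("naïve", with_bom=True)
    # Both byte strings decode to the same text; only the first three bytes differ.
    print(decode_utf8_optional_bom(plain) == decode_utf8_optional_bom(marked))
```

A reader written this way accepts files from software that writes a BOM as well as from software that omits it; whether to write one depends on what the consuming program requires.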
Software that "defaults" to UTF-8 (meaning it writes it without the user changing settings, and it reads it without a BOM) has become more common since 2010.<ref>{{ cite web | last=Galloway |first=Matt | date=October 2012 | title=Character encoding for iOS developers; or, UTF-8 what now? | website=www.galloway.me.uk | language=en-UK | url=https://www.galloway.me.uk/2012/10/character-encoding-for-ios-developers-utf8/ | access-date=2021-01-02 | quote = ... in reality, you usually just assume UTF-8 since that is by far the most common encoding. }}</ref> [[Windows Notepad]], in all currently supported versions of Windows, defaults to writing UTF-8 without a BOM (a change from {{nobr|[[Windows 7]]}} ''Notepad''), bringing it into line with most other text editors.<ref>{{ cite web | title=Windows 10 Notepad is getting better UTF-8 encoding support | website=BleepingComputer | url=https://www.bleepingcomputer.com/news/microsoft/windows-10-notepad-is-getting-better-utf-8-encoding-support/ | access-date=2021-03-24 | quote=Microsoft is now defaulting to saving new text files as UTF-8 without BOM, as shown below. | language=en-us }}</ref> Some system files on [[Windows 11|Windows 11]] require UTF-8<ref>{{ cite web | title = Customize the Windows 11 ''Start'' menu | url=https://docs.microsoft.com/en-us/windows-hardware/customize/desktop/customize-the-windows-11-start-menu | access-date=2021-06-29 | website=docs.microsoft.com | language=en-us | quote=Make sure your LayoutModification.json uses UTF-8 encoding. 
}}</ref> with no requirement for a BOM, and almost all files on macOS and Linux are required to be UTF-8 without a BOM.{{citation needed|date=June 2021}} Programming languages that default to UTF-8 for [[input/output|I/O]] include [[Ruby (programming language)|Ruby]] 3.0,<ref>{{ cite web | title = Set default for Encoding.default_external to UTF-8 on Windows | series = Ruby master | id = Feature #16604 | website = Ruby Issue Tracking System (bugs.ruby-lang.org) | url = https://bugs.ruby-lang.org/issues/16604 | access-date = 2022-08-01 }}</ref><ref>{{ cite web | title = Feature #12650: Use UTF-8 encoding for ENV on Windows | series = Ruby master | website = Ruby Issue Tracking System (bugs.ruby-lang.org) | url = https://bugs.ruby-lang.org/issues/12650 | access-date = 2022-08-01 }}</ref> [[R (programming language)|R]] 4.2.2,<ref>{{ cite web | title = New features in R 4.2.0 | date = 2022-04-01 | website = R bloggers (r-bloggers.com) | series = The Jumping Rivers Blog | url = https://www.r-bloggers.com/2022/04/new-features-in-r-4-2-0/ | access-date = 2022-08-01 | language = en-US }}</ref> [[Raku (programming language)|Raku]] and [[Java (programming language)|Java]] 18.<ref name=Java_UTF-8_and_UTF-16>{{ cite web | title = UTF-8 by default | id = JEP 400 | website = openjdk.java.net | url = https://openjdk.java.net/jeps/400 | access-date=2022-03-30 }}</ref> Although the current version of [[Python (programming language)|Python]] requires an option to <code>open()</code> to read/write UTF-8,<ref>{{ cite web | title = add a new UTF-8 mode | website = peps.python.org | id = PEP 540 | url = https://peps.python.org/pep-0540/ | access-date = 2022-09-23 }}</ref> plans exist to make UTF-8 I/O the default in Python 3.15.<ref>{{ cite web | title = Make UTF-8 mode default | website = peps.python.org | id = PEP 686 | url = https://peps.python.org/pep-0686/ | access-date=2023-07-26 }}</ref> [[C++23]] adopts UTF-8 as the only portable source code file format.<ref>{{ cite report | 
title = Support for UTF-8 as a portable source file encoding | year = 2022 | id = p2295r6 | website = open-std.org | url = https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2295r6.pdf }}</ref> Backwards compatibility is a serious impediment to changing code and APIs using [[UTF-16]] to use UTF-8, but this is happening. {{As of|2019|05}}, Microsoft [[Unicode in Microsoft Windows#UTF-8|added the capability]] for an application to set UTF-8 as the "code page" for the Windows API, removing the need to use UTF-16; and more recently has recommended programmers use UTF-8,<ref name=Microsoft-UTF-8>{{ cite web | title=Use UTF-8 code pages in Windows apps | website=[[Microsoft Learn]] | date=20 August 2024 |language=en-us | url=https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page | access-date=2024-09-24}}</ref> and even states "UTF-16 [...] is a unique burden that Windows places on code that targets multiple platforms".<ref name="Microsoft GDK">{{ cite web | title=UTF-8 support in the Microsoft GDK | series = Microsoft Game Development Kit (GDK) | website = [[Microsoft Learn]] |language=en-us | url=https://learn.microsoft.com/en-us/gaming/gdk/_content/gc/system/overviews/utf-8 | access-date = 2023-03-05 }}</ref> The default string primitive in [[Go (programming language)|Go]],<ref>{{ cite report | section=Source code representation | title=The ''Go'' Programming Language Specification | website=golang.org | section-url=https://golang.org/ref/spec#Source_code_representation | access-date=2021-02-10 }}</ref> [[Julia (programming language)|Julia]], [[Rust (programming language)|Rust]], [[Swift (programming language)#String support|Swift]] (since version 5),<ref>{{ cite web | last=Tsai |first=Michael J. 
| date=21 March 2019 | title=UTF-8 string in Swift 5 | type=blog post |language=en | url=https://mjtsai.com/blog/2019/03/21/utf-8-string-in-swift-5/ | access-date=2021-03-15 }}</ref> and [[PyPy]]<ref>{{ cite web | title=PyPy v7.1 released; now uses UTF-8 internally for Unicode strings | department=Mattip | date=2019-03-24 | website=PyPy status blog | url=https://morepypy.blogspot.com/2019/03/pypy-v71-released-now-uses-utf-8.html | access-date=2020-11-21 }}</ref> uses UTF-8 internally in all cases. Python (since version 3.3) uses UTF-8 internally for Python C API extensions<ref name=PEP393>{{ cite web | title = Flexible String Representation | id = PEP 393 | website = Python.org |language=en | url = https://peps.python.org/pep-0393 | access-date = 2022-05-18 }}</ref><ref>{{Cite web |title=Common Object Structures |url=https://docs.python.org/3/c-api/structures.html |access-date=2024-05-29 |website=Python documentation |language=en}}</ref> and sometimes for strings<ref name=PEP393/><ref>{{ cite web | title=Unicode objects and codecs | url=https://docs.python.org/3/c-api/unicode.html | access-date=2023-08-19 |website=Python documentation | quote=UTF-8 representation is created on demand and cached in the Unicode object.}}</ref> and a future version of Python is planned to store strings as UTF-8 by default.<ref>{{ cite web | title=PEP 623 – remove wstr from Unicode | website=Python.org |language=en | url=https://www.python.org/dev/peps/pep-0623/ | access-date=2020-11-21 }}</ref><ref>{{ cite web | last=Wouters |first=Thomas | date=2023-07-11 | title=Python 3.12.0 beta 4 released | website = Python Insider (pythoninsider.blogspot.com) | type = blog post | url=https://pythoninsider.blogspot.com/2023/07/pleased-to-announce-release-of-python-3.html | access-date=2023-07-26 | quote=The deprecated <code>wstr</code> and <code>wstr_length</code> members of the C implementation of unicode objects were removed, per PEP 623. 
}}</ref> Modern versions of [[Microsoft Visual Studio]] use UTF-8 internally.<ref>{{ cite web | title=validate-charset (validate for compatible characters) | website=docs.microsoft.com |language=en-us | url=https://docs.microsoft.com/en-us/cpp/build/reference/validate-charset-validate-for-compatible-characters | access-date=2021-07-19 | quote=Visual Studio uses UTF-8 as the internal character encoding during conversion between the source character set and the execution character set. }}</ref> Microsoft's SQL Server 2019 added support for UTF-8; according to Microsoft, using it results in a 35% speed increase and "nearly 50% reduction in storage requirements."<ref>{{ cite web | title = Introducing UTF-8 support for SQL Server | date = 2019-07-02 | website = techcommunity.microsoft.com | url = https://techcommunity.microsoft.com/t5/sql-server/introducing-utf-8-support-for-sql-server/ba-p/734928 | access-date = 2021-08-24 | language = en-US }}</ref> {{anchor|Modified UTF-8}} [[Java (programming language)|Java]] internally uses UTF-16 for the ''char'' data type and, consequently, for the ''Character'', ''String'', and ''StringBuffer'' classes,<ref>{{cite web |title=Character (Java SE 24 & JDK 24) |url=https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/Character.html#unicode |year=2025 |publisher=[[Oracle Corporation]] |access-date=2025-04-08}}</ref> but for certain I/O interfaces uses ''Modified UTF-8'' (MUTF-8), in which the [[null character]] {{tt|U+0000}} uses the two-byte overlong encoding {{tt|0xC0}}, {{tt|0x80}}, instead of just {{tt|0x00}}.<ref>{{cite web |title=Java SE documentation for Interface java.io.DataInput, subsection on Modified UTF-8 |url=https://docs.oracle.com/javase/8/docs/api/java/io/DataInput.html#modified-utf-8 |year=2015 |publisher=[[Oracle Corporation]] |access-date=2015-10-16}}</ref> Modified UTF-8 strings never contain any actual null bytes but can contain all Unicode code points including {{tt|U+0000}},<ref>{{cite web 
|url=https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html#jvms-4.4.7 |title=The Java Virtual Machine Specification, section 4.4.7: "The CONSTANT_Utf8_info Structure" |publisher=[[Oracle Corporation]] |year=2015 |access-date=2015-10-16}}</ref> which allows such strings (with a null byte appended) to be processed by traditional [[null-terminated string]] functions. Java reads and writes normal UTF-8 to files and streams,<ref>{{Javadoc:SE|java/io|InputStreamReader}} and {{Javadoc:SE|java/io|OutputStreamWriter}}</ref> but it uses Modified UTF-8 for object [[Java serialization|serialization]],<ref>{{cite web |title=Java Object Serialization Specification, chapter 6: Object Serialization Stream Protocol, section 2: Stream Elements |url=https://docs.oracle.com/javase/8/docs/platform/serialization/spec/protocol.html#a8299 |year=2010 |publisher=[[Oracle Corporation]] |access-date=2015-10-16}}</ref><ref>{{Javadoc:SE|java/io|DataInput}} and {{Javadoc:SE|java/io|DataOutput}}</ref> for the [[Java Native Interface]],<ref>{{cite web |url=https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/types.html#modified_utf_8_strings |title=Java Native Interface Specification, chapter 3: JNI Types and Data Structures, section: Modified UTF-8 Strings |publisher=[[Oracle Corporation]] |year=2015 |access-date=2015-10-16}}</ref> and for embedding constant strings in [[Class (file format)|class files]].<ref>{{cite web |title=The Java Virtual Machine Specification, section 4.4.7: "The CONSTANT_Utf8_info Structure" |url=https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html#jvms-4.4.7 |publisher=[[Oracle Corporation]] |year=2015 |access-date=2015-10-16}}</ref> The dex format defined by [[Dalvik (software)|Dalvik]] also uses the same modified UTF-8 to represent string values.<ref>{{cite web |url=https://source.android.com/tech/dalvik/dex-format.html |title=ART and Dalvik |work=Android Open Source Project |access-date=2013-04-09 |url-status=dead 
|archive-url=https://web.archive.org/web/20130426010617/https://source.android.com/tech/dalvik/dex-format.html |archive-date=2013-04-26 }}</ref> [[Tcl]] also uses the same modified UTF-8<ref>{{cite web |title=UTF-8 bit by bit |date=2001-02-28 |url=https://wiki.tcl-lang.org/page/UTF-8+bit+by+bit |access-date=2022-09-03 |website=Tcler's Wiki}}</ref> as Java for internal representation of Unicode data, but uses strict CESU-8 for external data. All known Modified UTF-8 implementations also treat the surrogate pairs as in [[CESU-8]]. The [[Raku (programming language)|Raku]] programming language (formerly Perl 6) uses <code>utf-8</code> encoding by default for I/O ([[Perl]] 5 also supports it<!-- "utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code" -->)<!-- "Raku applies normalization by default to all input and output except for file names, which are read and written as UTF8-C8" -->; though that choice in Raku also implies "normalization into Unicode [[Unicode equivalence#Normal forms|NFC (normalization form canonical)]]. In some cases the user will want to ensure no normalization is done; for this <code>utf8-c8</code>" can be used.<ref>{{Cite web |title=encoding {{!}} Raku Documentation |url=https://docs.raku.org/routine/encoding |access-date=2024-10-06 |website=docs.raku.org}}</ref> That ''UTF-8 Clean-8'' variant, implemented by Raku, is an encoder/decoder <!-- that primarily works as the UTF-8 one. However, upon encountering a byte sequence that will either not decode as valid UTF-8, or that would not round-trip due to normalization, it will use NFG synthetics to keep track of the original bytes involved. This means that encoding back to UTF-8 Clean-8 will be able to recreate the bytes as they originally existed. The synthetics contain four codepoints: ... 
--> that preserves bytes as is (even illegal UTF-8 sequences) and allows for Normal Form Grapheme synthetics.<ref>{{Cite web |title=Unicode {{!}} Raku Documentation |url=https://docs.raku.org/language/unicode#UTF8-C8 |access-date=2024-10-06 |website=docs.raku.org}}</ref> Version 3 of the [[Python (programming language)|Python]] programming language treats each byte of an invalid UTF-8 bytestream as an error (see also changes with new UTF-8 mode in Python 3.7<ref>{{Cite web|title=PEP 540 -- Add a new UTF-8 Mode|url=https://www.python.org/dev/peps/pep-0540/|access-date=2021-03-24|website=Python.org|language=en}}</ref>); this gives 128 different possible errors. Extensions have been created to allow any byte sequence that is assumed to be UTF-8 to be losslessly transformed to UTF-16 or UTF-32, by translating the 128 possible error bytes to 128 reserved code points, and transforming those code points back to error bytes to output UTF-8. The most common approach is to translate the codes to {{tt|U+DC80}}...{{tt|U+DCFF}} which are low (trailing) surrogate values and thus "invalid" UTF-16, as used by [[Python (programming language)|Python]]'s [[Python Enhancement Proposal|PEP]] 383 (or "surrogateescape") approach.<ref name="pep383">{{cite web |id=PEP 383 |title=Non-decodable Bytes in System Character Interfaces |url=https://www.python.org/dev/peps/pep-0383 |publisher=[[Python Software Foundation]] |language=en |first=Martin |last=von Löwis |date=2009-04-22}}</ref> Another encoding called [[MirBSD]] OPTU-8/16 converts them to {{tt|U+EF80}}...{{tt|U+EFFF}} in a [[Private Use Area]].<ref>{{cite web |title=RTFM optu8to16(3), optu8to16vis(3) |url=https://www.mirbsd.org/htman/i386/man3/optu8to16.htm |website=www.mirbsd.org}}</ref> In either approach, the byte value is encoded in the low eight bits of the output code point. 
These encodings are needed if invalid UTF-8 is to survive a round trip to and from the UTF-16 used internally by Python; because Unix filenames can contain invalid UTF-8, this round trip must be lossless.<ref name="davis383">{{cite web |url=https://www.unicode.org/reports/tr36/#EnablingLosslessConversion |last1=Davis |first1=Mark |author-link1=Mark Davis (Unicode) |first2=Michel |last2=Suignard |title=3.7 Enabling Lossless Conversion |work=Unicode Security Considerations |id=Unicode Technical Report #36 |year=2014}}</ref>
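
The PEP 383 approach described above can be sketched with Python's built-in <code>surrogateescape</code> error handler. This is an illustrative sketch only: the helper names <code>escape_invalid_bytes</code> and <code>restore_bytes</code> and the sample filename bytes are made up for the example.

```python
def escape_invalid_bytes(raw: bytes) -> str:
    """Decode as UTF-8, mapping each undecodable byte 0x80..0xFF to U+DC80..U+DCFF."""
    return raw.decode("utf-8", errors="surrogateescape")


def restore_bytes(text: str) -> bytes:
    """Encode as UTF-8, turning escaped surrogates back into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    # Hypothetical filename bytes: valid UTF-8 for "café_" followed by two
    # bytes (0xFF and a lone continuation byte 0x80) that are not valid UTF-8.
    raw = b"caf\xc3\xa9_\xff\x80.txt"
    text = escape_invalid_bytes(raw)

    # Each invalid byte became one low surrogate whose low eight bits hold the byte.
    escaped = [ch for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]
    print([hex(ord(ch)) for ch in escaped])
    print([hex(ord(ch) & 0xFF) for ch in escaped])

    # The round trip is lossless.
    print(restore_bytes(text) == raw)
```

Because the escaped code points are lone surrogates, a strict <code>text.encode("utf-8")</code> without the error handler raises <code>UnicodeEncodeError</code>, so the escaped bytes cannot silently leak into output that is expected to be valid UTF-8.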