== Terminology ==
Informally, the terms "character encoding", "character map", "character set" and "code page" are often used interchangeably.<ref name="SteeleMSDN">{{cite web|url=https://docs.microsoft.com/en-us/archive/blogs/shawnste/whats-the-difference-between-an-encoding-code-page-character-set-and-unicode|author=Shawn Steele|title=What's the difference between an Encoding, Code Page, Character Set and Unicode?|date=15 March 2005|website=Microsoft Docs}}</ref> Historically, the same standard would specify a repertoire of characters and how they were to be encoded into a stream of code units, usually with a single character per code unit. However, due to the emergence of more sophisticated character encodings, the distinction between these terms has become important.

* A ''[[Character (computing)|character]]'' is a minimal unit of text that has semantic value.<ref name="SteeleMSDN"/><ref name="Unicode glossary">{{cite web |title=Glossary of Unicode Terms |url=https://unicode.org/glossary/ |publisher=Unicode Consortium}}</ref>
* A ''character set'' is a collection of elements used to represent text.<ref name="SteeleMSDN"/><ref name="Unicode glossary"/> For example, the [[Latin alphabet]] and [[Greek alphabet]] are both character sets.
* {{anchor|CCS}}A ''coded character set'' is a character set mapped to a set of unique numbers.<ref name="Unicode glossary"/> For historical reasons, this is also often referred to as a [[code page]].<ref name="SteeleMSDN"/>
* {{anchor|repertoire}}A ''character repertoire'' is the set of characters that can be represented by a particular coded character set.<ref name="Unicode glossary"/><ref name="unicode15">{{cite book |title=The Unicode Standard Version 15.0 – Core Specification |date=September 2022 |publisher=Unicode Consortium |isbn=978-1-936213-32-0 |chapter=Chapter 3: Conformance |url=https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf}}</ref> The repertoire may be closed, meaning that no additions are allowed without creating a new standard (as is the case with ASCII and most of the ISO-8859 series); or it may be open, allowing additions (as is the case with Unicode and to a limited extent [[Windows code page]]s).<ref name="unicode15"/>
* A ''code point'' is a value or position of a character in a coded character set.<ref name="Unicode glossary"/>
* A ''code space'' is the range of numerical values spanned by a coded character set.<ref name="Unicode glossary"/><ref name="utr17"/>
* A ''code unit'' is the minimum bit combination that can represent a character in a character encoding (in [[computer science]] terms, it is the [[Word (computer architecture)|word]] size of the character encoding).<ref name="Unicode glossary"/><ref name="utr17"/> For example, common code units include 7-bit, 8-bit, 16-bit, and 32-bit. In some encodings, some characters are encoded using multiple code units; such an encoding is referred to as a [[variable-width encoding]].

===Code pages===
{{main|Code page}}
"Code page" is a historical name for a coded character set.
Originally, a code page referred to a specific [[page number]] in the IBM standard character set manual, which would define a particular character encoding.<ref name="DEC_VT510">{{cite web |title=VT510 Video Terminal Programmer Information |at=7.1. Character Sets - Overview |publisher=[[Digital Equipment Corporation]] (DEC) |url=http://www.vt100.net/docs/vt510-rm/chapter7.html#S7.1 |access-date=2017-02-15 |quote=In addition to traditional [[Digital Equipment Corporation|DEC]] and [[ISO]] character sets, which conform to the structure and rules of [[ISO 2022]], the [[VT510]] supports a number of IBM PC code pages ([[page number]]s in IBM's standard character set manual) in [[PCTerm]] mode to emulate the [[console terminal]] of industry-standard PCs. |archive-date=2016-01-26 |archive-url=https://web.archive.org/web/20160126192029/http://www.vt100.net/docs/vt510-rm/chapter7.html#S7.1 |url-status=live }}</ref> Other vendors, including [[Microsoft]], [[SAP AG|SAP]], and [[Oracle Corporation]], also published their own sets of code pages; the best-known code page suites are "[[Windows code page|Windows]]" (based on Windows-1252) and "IBM"/"DOS" (based on [[code page 437]]).

Although they no longer refer to specific page numbers in a standard, many character encodings are still identified by their code page number; likewise, the term "code page" is often still used to refer to character encodings in general.

The term "code page" is not used in Unix or Linux, where "charmap" is preferred, usually in the larger context of locales.
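
The practical consequence of having several code page suites is that the same byte value can stand for different characters. The following is a minimal illustrative sketch in Python, not part of any cited source: the helper name <code>decode_with</code> is hypothetical, and the codec names <code>cp437</code> and <code>cp1252</code> are the ones Python's standard library uses for code page 437 and Windows-1252.

```python
def decode_with(raw, codecs):
    """Decode the same bytes under each named codec (hypothetical helper)."""
    return {name: raw.decode(name) for name in codecs}


# A single 8-bit code unit with the value 0xE9.
raw = bytes([0xE9])
results = decode_with(raw, ["cp437", "cp1252"])

for name, text in results.items():
    # Show which Unicode code point each code page assigns to byte 0xE9.
    print(f"{name}: U+{ord(text):04X} {unicodedata_name}" if False else
          f"{name}: U+{ord(text):04X} {text}")
```

Under these assumptions, code page 437 maps the byte to the Greek capital letter theta (U+0398), while Windows-1252 maps it to "é" (U+00E9), which is why text must be decoded with the code page it was written in.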
IBM's Character Data Representation Architecture (CDRA) designates entities with coded character set identifiers ([[CCSID]]s), each of which is variously called a "charset", "character set", "code page", or "CHARMAP".<ref name=utr17/>

===Code units===
The code unit size is equivalent to the bit measurement for the particular encoding:
* A code unit in [[ASCII]] consists of 7 bits;
* A code unit in [[UTF-8]], [[EBCDIC]] and [[GB 18030]] consists of 8 bits;
* A code unit in [[UTF-16]] consists of 16 bits;
* A code unit in [[UTF-32]] consists of 32 bits.

===Code points===
A code point is represented by a sequence of code units. The mapping is defined by the encoding. Thus, the number of code units required to represent a code point depends on the encoding:
* UTF-8: code points map to a sequence of one, two, three or four code units.
* UTF-16: code units are twice as long as 8-bit code units. Therefore, any code point with a scalar value less than U+10000 is encoded with a single code unit. Code points with a value U+10000 or higher require two code units each. These pairs of code units have a unique term in UTF-16: [[UTF-16#Code points from U+010000 to U+10FFFF|"Unicode surrogate pairs".]]
* UTF-32: the 32-bit code unit is large enough that every code point is represented as a single code unit.
* GB 18030: multiple code units per code point are common, because of the small code units. Code points are mapped to one, two, or four code units.<ref>{{cite web | url=https://docs.oracle.com/javase/tutorial/i18n/text/terminology.html | title=Terminology (The Java Tutorials) | publisher=Oracle | access-date=25 March 2018 }}</ref>

===Characters===
{{main|Character (computing)}}
Exactly what constitutes a character varies between character encodings.
For example, for letters with [[diacritic]]s, there are two distinct approaches that can be taken to encode them: they can be encoded either as a single unified character (known as a precomposed character), or as separate characters that combine into a single [[glyph]]. The former simplifies the text handling system, but the latter allows any letter/diacritic combination to be used in text. [[Typographic ligature|Ligatures]] pose similar problems. Exactly how to handle [[glyph]] variants is a choice that must be made when constructing a particular character encoding. Some writing systems, such as Arabic and Hebrew, need to accommodate things like [[grapheme]]s that are joined in different ways in different contexts, but represent the same semantic character.
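
The difference between a precomposed character and a combining sequence, and its effect on the number of code points and code units, can be sketched in Python. This is an illustrative example under stated assumptions rather than part of any cited source: the helper name <code>code_unit_counts</code> is hypothetical, and the Unicode normalization form NFD (not discussed in this section) is used only as a convenient way to obtain the decomposed sequence.

```python
import unicodedata


def code_unit_counts(text):
    """Count code points and UTF-8/16/32 code units (hypothetical helper)."""
    return {
        "code points": len(text),
        "UTF-8": len(text.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


precomposed = "\u00e9"                                   # single code point U+00E9
decomposed = unicodedata.normalize("NFD", precomposed)   # "e" followed by U+0301
astral = "\U0001F600"                                    # a code point above U+FFFF

for label, text in [("precomposed", precomposed),
                    ("decomposed", decomposed),
                    ("above U+FFFF", astral)]:
    print(label, code_unit_counts(text))
```

Both forms of the accented letter are displayed as the same glyph, yet the precomposed form is one code point and the decomposed form is two, so software that compares or measures text has to decide which form it treats as "one character". The third string shows a code point that needs two UTF-16 code units (a surrogate pair) but only one UTF-32 code unit.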