{{Short description|Character encoding for Unicode compatible with EBCDIC}}{{Infobox character encoding
| name = UTF-EBCDIC
| encodes = [[Unicode]]
| basedon = [[UTF-8]]
| by = [[IBM]]
| definitions = [https://www.unicode.org/reports/tr16/tr16-8.html Unicode Technical Report #16]
}}

'''UTF-EBCDIC''' is a [[character encoding]] capable of encoding all 1,112,064 valid character [[code point]]s in [[Unicode]] using 1 to 5 [[byte]]s (in contrast to a maximum of 4 for [[UTF-8]]).<ref>{{Cite web|title=UTR #16: UTF-EBCDIC|url=https://www.unicode.org/reports/tr16/tr16-8.html|quote=You need to search at most five bytes (seven bytes, if the full range of 31 bits of ISO/IEC 10646 is considered) backwards|access-date=2021-02-23|website=www.unicode.org}}</ref> It is meant to be [[EBCDIC]]-friendly, so that legacy EBCDIC applications on [[Mainframe computer|mainframes]] may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing [[ASCII]]-based systems. UTF-EBCDIC is defined in Unicode Technical Report #16.

To produce the UTF-EBCDIC encoded version of a series of Unicode code points, an encoding based on UTF-8 (known in the specification as UTF-8-Mod) is applied first, creating what the specification calls an I8 sequence. The main difference between this encoding and UTF-8 is that it allows Unicode code points {{tt|U+0080}} through {{tt|U+009F}} (the [[C1 control code]]s) to be represented as a single byte and therefore later mapped to corresponding EBCDIC control codes. To achieve this, UTF-8-Mod uses {{tt|101xxxxx}} instead of {{tt|10xxxxxx}} as the format for trailing bytes in a multi-byte sequence. As each trailing byte can hold only 5 bits rather than 6, the UTF-8-Mod encoding of a code point above {{tt|U+03FF}} is often longer than its UTF-8 encoding (for example, {{tt|U+0400}} takes three bytes rather than two).
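
The UTF-8-Mod step described above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation from the technical report: the function name <code>i8_encode</code> is hypothetical, the byte-length boundaries are derived from the 5-bit trailing-byte format, and only the Unicode range up to {{tt|U+10FFFF}} is handled (the longer forms for the full 31-bit ISO/IEC 10646 range are omitted).

```python
def i8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as a UTF-8-Mod (I8) byte sequence.

    Hypothetical helper for illustration; covers U+0000..U+10FFFF only.
    """
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a valid Unicode scalar value: {cp:#x}")

    # ASCII, and the C1 controls U+0080..U+009F, stay single bytes.
    if cp < 0xA0:
        return bytes([cp])

    # Pick the lead-byte marker and the number of 5-bit trailing bytes.
    if cp <= 0x3FF:
        lead, trailing = 0xC0, 1   # 110yyyyy 101xxxxx
    elif cp <= 0x3FFF:
        lead, trailing = 0xE0, 2   # 1110zzzz + 2 trailing bytes
    elif cp <= 0x3FFFF:
        lead, trailing = 0xF0, 3   # 11110www + 3 trailing bytes
    else:
        lead, trailing = 0xF8, 4   # 111110ww + 4 trailing bytes

    # Peel off 5 bits at a time, lowest bits first; each becomes 101xxxxx.
    tail = []
    for _ in range(trailing):
        tail.append(0xA0 | (cp & 0x1F))
        cp >>= 5

    # Whatever remains fits in the payload bits of the lead byte.
    return bytes([lead | cp] + tail[::-1])


if __name__ == "__main__":
    for cp in (0x41, 0x85, 0xA0, 0x0400, 0x0800, 0x4000, 0x10FFFF):
        i8 = i8_encode(cp)
        utf8 = chr(cp).encode("utf-8")
        print(f"U+{cp:04X}  I8={i8.hex(' ')}  UTF-8={utf8.hex(' ')}")
```

Running the script prints the I8 and UTF-8 forms side by side, which makes the length difference visible: {{tt|U+0400}} needs three I8 bytes against two in UTF-8, while {{tt|U+0800}} needs three in both.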

The UTF-8-Mod transformation leaves the data in an ASCII-based format (for example, {{tt|U+0041}} "A" is still encoded as {{tt|0x41}}), so each byte is fed through a reversible (one-to-one) lookup table to produce the final UTF-EBCDIC encoding. For example, {{tt|0x41}} in this table maps to {{tt|0xC1}}; thus the UTF-EBCDIC encoding of {{tt|U+0041}} (Unicode's "A") is {{tt|0xC1}} (EBCDIC's "A").
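
The byte-for-byte lookup step can be sketched with Python's <code>bytes.translate</code>. The table below is a placeholder and not the table from the technical report: it contains only the one mapping quoted above ({{tt|0x41}} to {{tt|0xC1}}), swaps {{tt|0xC1}} back to {{tt|0x41}} purely so that the placeholder stays one-to-one, and leaves every other byte unchanged. All names are hypothetical; a real implementation would fill in all 256 entries from the specification.

```python
# Placeholder 256-entry table: identity everywhere except one swapped pair.
# Only 0x41 -> 0xC1 is taken from the text; 0xC1 -> 0x41 is an assumption
# made solely to keep this stand-in table a permutation of 0..255.
_table = bytearray(range(256))
_table[0x41], _table[0xC1] = 0xC1, 0x41
I8_TO_EBCDIC = bytes(_table)

# Because the table is one-to-one, the inverse is found by position lookup.
EBCDIC_TO_I8 = bytes(I8_TO_EBCDIC.index(b) for b in range(256))


def i8_to_utf_ebcdic(i8: bytes) -> bytes:
    """Map each I8 byte through the (placeholder) lookup table."""
    return i8.translate(I8_TO_EBCDIC)


def utf_ebcdic_to_i8(data: bytes) -> bytes:
    """Undo the mapping using the inverse table."""
    return data.translate(EBCDIC_TO_I8)
```

With this stand-in table, <code>i8_to_utf_ebcdic(b"\x41")</code> yields the single byte {{tt|0xC1}}, and applying <code>utf_ebcdic_to_i8</code> afterwards restores the original bytes, which is the reversibility the one-to-one table provides.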

UTF-EBCDIC is rarely used, even on the EBCDIC-based mainframes for which it was designed. [[IBM]] EBCDIC-based mainframe operating systems, such as [[z/OS]], usually use [[UTF-16]] for complete Unicode support. For example, [[IBM Db2]], [[COBOL]], [[PL/I]], [[Java (programming language)|Java]] and the [[IBM]] [[XML]] toolkit support UTF-16 on IBM mainframes.