Binary Ordered Compression for Unicode
{{Short description|MIME compatible Unicode compression scheme}}
{{Redirect|BOCU}}
'''Binary Ordered Compression for Unicode''' ('''BOCU''') is a [[MIME]]-compatible Unicode compression scheme. BOCU-1 combines the wide applicability of [[UTF-8]] with the compactness of the [[Standard Compression Scheme for Unicode]] (SCSU). This [[Unicode]] [[character encoding|encoding]] is designed to be useful for compressing short strings, and it maintains code point order. BOCU-1 is specified in a Unicode Technical Note.<ref>{{cite web |url=https://www.unicode.org/notes/tn6/#Introduction |title=UTN #6: BOCU-1|date=2006-02-04 |author=Markus Scherer, [[Mark Davis (Unicode)|Mark Davis]] |access-date=2008-05-18}}</ref>

By comparison, SCSU was adopted as a standard Unicode compression scheme, with a byte/code point ratio similar to that of language-specific [[code page]]s. SCSU has not seen wide use, however, because it is not suitable for MIME "text" media types: for example, it cannot be used directly in emails and similar protocols. SCSU also requires a complicated encoder design for good performance. For larger amounts of Unicode text, [[ZIP file format|zip]], [[bzip2]], and other industry-standard algorithms usually compress more efficiently.<ref>{{cite web |url=http://unicode.org/notes/tn14 |title=UTN #14: A survey of Unicode compression |date=2004-01-30 |first=Doug |last=Ewell |access-date=2008-06-13 |format=PDF }}</ref>

Both SCSU<ref>[https://www.iana.org/assignments/charset-reg/SCSU IANA registration record for SCSU]</ref> and BOCU-1<ref>[https://www.iana.org/assignments/charset-reg/BOCU-1 IANA registration record for BOCU-1]</ref> are [[Internet Assigned Numbers Authority|IANA]]-registered charsets.

== Details ==
All numbers in this section are [[hexadecimal]], and all ranges are inclusive.

Code points from <code>U+0000</code> to <code>U+0020</code> are encoded in BOCU-1 as the corresponding byte value.
All other code points (that is, <code>U+0021</code> through <code>U+D7FF</code> and <code>U+E000</code> through <code>U+10FFFF</code>) are encoded as the difference between the code point and a normalized version of the most recently encoded code point that was not an ASCII space (<code>U+0020</code>). The initial state is <code>U+0040</code>. The normalization mapping is as follows:

{| class="wikitable"
! style="width: auto;" | Code range
! style="width: auto;" | Normalized code point
! style="width: auto;" | Notes
|-
| <code>U+3040</code> to <code>U+309F</code>
| <code>U+3070</code>
| [[Hiragana (Unicode block)|Hiragana]]
|-
| <code>U+4E00</code> to <code>U+9FA5</code>
| <code>U+7711</code>
| [[CJK Unified Ideographs (Unicode block)|Unihan]]
|-
| <code>U+AC00</code> to <code>U+D7A3</code>
| <code>U+C1D1</code>
| [[Hangul Syllables|Hangul]]
|-
| <code>U+0020</code>
! <small>encoder state kept as is</small>
| Space
|-
| <code>U+''hhhh''00</code> to <code>U+''hhhh''7F</code><br /><small>(excluding ranges above)</small>
| <code>U+''hhhh''40</code>
| middle<br />of 128
|-
| <code>U+''hhhh''80</code> to <code>U+''hhhh''FF</code><br /><small>(excluding ranges above)</small>
| <code>U+''hhhh''C0</code>
| middle<br />of 128
|}

The difference between the current code point and the normalized previous code point is encoded as follows:

{| class="wikitable"
! style="width: auto;" | Difference range
! style="width: auto;" | Byte sequence range<br><small>(see below)</small>
|-
| <code>-10FF9F</code> to <code>-2DD0D</code>
| <code>21</code> <code>F0</code> <code>58</code> <code>D9</code> to <code>21</code> <code>FF</code> <code>FF</code> <code>FF</code>
|-
| <code>-2DD0C</code> to <code>-2912</code>
| <code>22</code> <code>01</code> <code>01</code> to <code>24</code> <code>FF</code> <code>FF</code>
|-
| <code>-2911</code> to <code>-41</code>
| <code>25</code> <code>01</code> to <code>4F</code> <code>FF</code>
|-
| <code>-40</code> to <code>3F</code>
| <code>50</code> to <code>CF</code>
|-
| <code>40</code> to <code>2910</code>
| <code>D0</code> <code>01</code> to <code>FA</code> <code>FF</code>
|-
| <code>2911</code> to <code>2DD0B</code>
| <code>FB</code> <code>01</code> <code>01</code> to <code>FD</code> <code>FF</code> <code>FF</code>
|-
| <code>2DD0C</code> to <code>10FFBF</code>
| <code>FE</code> <code>01</code> <code>01</code> <code>01</code> to <code>FE</code> <code>19</code> <code>B4</code> <code>54</code>
|}

Each byte range is [[lexicographical order|lexicographically ordered]], with the following thirteen byte values excluded: <code>00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20</code>. For example, the byte sequence <code>FC 06 FF</code>, coding for a difference of <code>1156B</code>, is immediately followed by the byte sequence <code>FC 10 01</code>, coding for a difference of <code>1156C</code>.

Any ASCII input <code>U+0000</code> to <code>U+007F</code> excluding space <code>U+0020</code> resets the encoder to <code>U+0040</code>. Because this range includes the line-end code points <code>U+000D</code> and <code>U+000A</code>, which are encoded ''as is'' (<code>0D 0A</code>), the encoder is in a known state at the beginning of each line. The corruption of a single byte therefore affects at most one line.
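
The normalization table and the reset rule described above can be sketched in a few lines of Python. This is only an illustration built from the table values in this article, not the reference implementation from the Technical Note; the names <code>normalize</code> and <code>differences</code> are hypothetical, and the sketch assumes the input contains no surrogate code points.

```python
def normalize(cp):
    """Return the normalized state for a code point, following the table above."""
    if 0x3040 <= cp <= 0x309F:      # Hiragana
        return 0x3070
    if 0x4E00 <= cp <= 0x9FA5:      # Unihan
        return 0x7711
    if 0xAC00 <= cp <= 0xD7A3:      # Hangul
        return 0xC1D1
    # Everything else: middle of the enclosing block of 128 code points.
    # For ASCII (except space, handled by the caller) this yields 0x40,
    # which matches the "reset to U+0040" rule.
    return (cp & ~0x7F) + 0x40


def differences(text):
    """Yield (code point, difference) pairs for a string.

    The difference is None for U+0000..U+0020, which are written
    directly as their byte value instead of as a difference.
    """
    state = 0x40                    # initial state
    for ch in text:
        cp = ord(ch)
        if cp <= 0x20:
            if cp != 0x20:          # controls reset the state; space keeps it
                state = 0x40
            yield cp, None
        else:
            yield cp, cp - state
            state = normalize(cp)


if __name__ == "__main__":
    for cp, diff in differences("a b\n\u3042\u3044"):
        print(f"U+{cp:04X}", "direct" if diff is None else f"{diff:+X}")
```

Running the sketch on two consecutive Hiragana letters shows the intended effect: the first letter produces a large difference from the initial state, while the second produces a small difference relative to <code>U+3070</code>.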
For comparison, the corruption of a single byte in [[UTF-8]] affects at most one code point, and for [[Standard Compression Scheme for Unicode|SCSU]] it can affect the entire document.

BOCU-1 can offer similar robustness for input texts that lack such line ends by means of the special reset byte <code>0xFF</code>. When a decoder finds this octet, it resets its state to <code>U+0040</code>, as for a line end. The use of <code>0xFF</code> reset bytes is not recommended in the BOCU-1 specification, because it conflicts with other BOCU-1 design goals, notably the ''binary order''.

The optional use of a signature [[Byte-order mark|<code>U+FEFF</code>]] at the beginning of BOCU-1 encoded texts, i.e. the BOCU-1 byte sequence <code>FB EE 28</code>, changes the initial state <code>U+0040</code> to <code>U+FEC0</code>. In other words, the signature cannot simply be stripped as in most other Unicode encoding schemes. Adding a reset byte after the signature (<code>FB EE 28 FF</code>) could avoid this effect, but the BOCU-1 specification does not recommend this practice.

In theory, [[UTF-1]] and [[UTF-8]] could encode the original [[Universal Character Set|UCS-4]] set with 31 bits up to <code>7FFFFFFF</code>. BOCU-1 and [[UTF-16]] can encode the modern [[Unicode]] set from <code>U+0000</code> to <code>U+10FFFF</code>. Excluding the thirteen ''protected'' code points encoded as single octets, BOCU-1 can use <math>256 - 13 = 243</math> octets in multi-byte encodings. BOCU-1 needs at most four bytes, consisting of a lead byte and one to three trail bytes. The trail bytes encode a remaining "[[modulo operation|modulo]] 243" (base 243) difference; the lead byte determines the number of trail bytes and an initial difference. Note that the reset byte <code>0xFF</code> is not ''protected'' and can occur as a trail byte.
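
The lead byte and base-243 trail byte scheme can be illustrated with a short Python sketch that turns a difference into its byte sequence. The constants are taken from the difference table and the list of excluded byte values in this article; the names <code>TRAIL_BYTES</code>, <code>RANGES</code> and <code>encode_difference</code> are hypothetical, and the code is an illustration under those assumptions rather than the specification's own algorithm.

```python
# The thirteen byte values that never occur in multi-byte sequences.
EXCLUDED = bytes.fromhex("00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20")
# Base-243 digit d maps to the d-th allowed byte value in ascending order.
TRAIL_BYTES = [b for b in range(0x100) if b not in EXCLUDED]
BASE = len(TRAIL_BYTES)  # 256 - 13 = 243

# (first, last, lead, trail_count): "first" is the difference that would be
# written as "lead" followed by trail bytes 01; "last" is the highest
# difference of the range. Only the lowest range starts above its "first",
# because differences below -10FF9F cannot occur.
RANGES = [
    (-0x2DD0C - BASE**3, -0x2DD0D, 0x21, 3),
    (-0x2DD0C, -0x2912, 0x22, 2),
    (-0x2911, -0x41, 0x25, 1),
    (-0x40, 0x3F, 0x50, 0),
    (0x40, 0x2910, 0xD0, 1),
    (0x2911, 0x2DD0B, 0xFB, 2),
    (0x2DD0C, 0x10FFBF, 0xFE, 3),
]


def encode_difference(diff):
    """Encode one difference as a lead byte plus zero to three trail bytes."""
    if not -0x10FF9F <= diff <= 0x10FFBF:
        raise ValueError("difference out of range")
    # Pick the first range whose upper bound covers the difference.
    for first, last, lead, trail_count in RANGES:
        if diff <= last:
            break
    rest = diff - first          # non-negative offset inside the range
    trail = []
    for _ in range(trail_count):
        rest, digit = divmod(rest, BASE)
        trail.append(TRAIL_BYTES[digit])
    # Whatever is left after removing the trail digits adjusts the lead byte.
    return bytes([lead + rest, *reversed(trail)])


if __name__ == "__main__":
    for d in (0x1156B, 0x1156C, 0xFEFF - 0x40):
        print(f"{d:X}", encode_difference(d).hex(" ").upper())
```

Because each range uses ascending lead bytes and ascending trail byte values, comparing the resulting byte strings gives the same order as comparing the differences, which is the ''binary order'' property.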

== Patent ==
Prior to 16 November 2022, the general BOCU algorithm was covered by [[United States patent law|United States Patent]] #6,737,994, which also mentions the specific BOCU-1 implementation.<ref>{{cite web |url=https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/6737994 |title=United States Patent #6,737,994, "Binary-ordered compression for unicode" |date=2004-05-18 |author=Davis |author-link=Mark Davis (Unicode) |access-date=2022-12-28|display-authors=etal}}</ref> This patent has now expired. [[IBM]], which employed both of the inventors of BOCU-1 at the time it was created, stated in the Unicode Technical Note that implementers of a "fully compliant version of BOCU-1" had to contact IBM to request a royalty-free license.<ref>{{cite web |url=https://www.unicode.org/notes/tn6/#Intellectual_Property |title=UTN #6: BOCU-1|date=2006-02-04 |author=Markus Scherer, [[Mark Davis (Unicode)|Mark Davis]] |access-date=2014-02-05}}</ref>

BOCU-1 is the only Unicode compression scheme described on the Unicode Web site that is known to have been encumbered with [[intellectual property]] restrictions. By contrast, IBM also filed for a patent on [[UTF-EBCDIC]], but in that case it chose to make the documentation and [[Character encoding#Modern encoding model|encoding scheme]] "freely available to anyone concerned towards making the transformation format as part of the UCS standards", instead of requiring implementers to request a license.<ref>{{cite web |url=https://www.unicode.org/reports/tr16/#Bibliography |title=UTR #16: UTF-EBCDIC|date=2002-04-16 |author=V.S. Umamaheswaran |access-date=2008-11-16}}</ref>

== References ==
{{Reflist}}

== See also ==
* [[UTF-1]], which contains a comparison of the UTF-1, [[UTF-8]], and BOCU-1 designs
* [[International Components for Unicode]], a library that can convert between BOCU-1 and other Unicode encodings

{{Unicode navigation}}
{{character encoding}}

{{DEFAULTSORT:Binary Ordered Compression For Unicode}}
[[Category:Data compression]]
[[Category:Unicode Transformation Formats]]