A Study of the BOM Header

BOM is short for Byte Order Mark. It denotes a single Unicode character, U+FEFF.

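Many Windows programs use the BOM character as a magic number to identify a file's character encoding and byte order.
The design is clever, but it makes processing text files considerably more awkward for developers.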
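As one illustration of that awkwardness, here is a minimal sketch, assuming a UTF-8 CSV-like file saved with a BOM. The sample content `id,name` and the variable names are made up for the example. Decoding with plain `utf-8` keeps U+FEFF as an invisible first character, so the first field no longer compares equal to `id`; Python's `utf-8-sig` codec drops a leading BOM if one is present.

```python
import codecs

# Bytes of a small UTF-8 text file that starts with a BOM (hypothetical content)
raw = codecs.BOM_UTF8 + 'id,name\n'.encode('utf-8')

# Plain 'utf-8' keeps the BOM as an invisible first character
text_plain = raw.decode('utf-8')
# 'utf-8-sig' removes a leading BOM if there is one
text_sig = raw.decode('utf-8-sig')

header_plain = text_plain.splitlines()[0].split(',')
header_sig = text_sig.splitlines()[0].split(',')

print(header_plain[0] == 'id')  # False: the field still starts with U+FEFF
print(header_sig[0] == 'id')    # True
```

The same codec name can be passed to `open(path, encoding='utf-8-sig')`, which reads files both with and without a BOM.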

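| Encoding    | Hexadecimal   | Decimal           | CP1252   |
| ----------- | ------------- | ----------------- | -------- |
| UTF-8       | `EF BB BF`    | `239 187 191`     | `ï»¿`    |
| UTF-16 (BE) | `FE FF`       | `254 255`         | `þÿ`     |
| UTF-16 (LE) | `FF FE`       | `255 254`         | `ÿþ`     |
| UTF-32 (BE) | `00 00 FE FF` | `0 0 254 255`     | `^@^@þÿ` |
| UTF-32 (LE) | `FF FE 00 00` | `255 254 0 0`     | `ÿþ^@^@` |
| UTF-7       | `2B 2F 76`    | `43 47 118`       | `+/v`    |
| UTF-1       | `F7 64 4C`    | `247 100 76`      | `÷dL`    |
| UTF-EBCDIC  | `DD 73 66 73` | `221 115 102 115` | `Ýsfs`   |
| SCSU        | `0E FE FF`    | `14 254 255`      | `^Nþÿ`   |
| BOCU-1      | `FB EE 28`    | `251 238 40`      | `ûî(`    |
| GB-18030    | `84 31 95 33` | `132 49 149 51`   | `„1•3`   |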

PS: ^@ is the null character
PS: ^N is the "shift out" character
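The table above can be turned into a simple BOM sniffer. The following is a minimal sketch, not a complete encoding detector: it only covers the UTF-8, UTF-16 and UTF-32 rows, and the names `BOMS`, `sniff_bom` and `decode_with_bom` are hypothetical helpers introduced for this example. The byte constants come from Python's standard `codecs` module.

```python
import codecs

# (BOM bytes, codec name) pairs. UTF-32 comes first because the UTF-32 (LE)
# BOM FF FE 00 00 begins with the UTF-16 (LE) BOM FF FE.
BOMS = [
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
]


def sniff_bom(data):
    """Return (codec_name, bom_length), or (None, 0) if no known BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data, default='utf-8'):
    """Decode bytes with the codec named by the BOM, falling back to `default`."""
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or default)


print(sniff_bom(b'\xff\xfeh\x00i\x00'))         # UTF-16 (LE) BOM followed by "hi"
print(decode_with_bom(b'\xff\xfeh\x00i\x00'))
```

Checking the longer BOMs first is a design choice with a known limitation: UTF-16 (LE) text whose first character after the BOM is U+0000 starts with the same four bytes as the UTF-32 (LE) BOM, so the result is a guess rather than a proof.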

```python
# Print a Markdown table showing how U+FEFF (the BOM) is encoded by each codec.
bom = '\ufeff'
encodings = 'utf-8', 'utf-16-le', 'utf-16-be', 'utf-32-le', 'utf-32-be', 'utf-7', 'gb18030'
row = '| %-15s | %-22s | %-16s | %-15s |'
print(row % ('Encoding', 'Bytes', 'Hexadecimal', 'CP1252'))
print(row % ('-' * 15, '-' * 22, '-' * 16, '-' * 15))
for encoding in encodings:
    data = bom.encode(encoding)
    print(row % (
        '**%s**' % encoding,
        '`%s`' % data,                                # bytes literal
        '`%s`' % ' '.join('%02x' % b for b in data),  # hexadecimal values
        '`%r`' % data.decode('cp1252'),               # the same bytes read as CP1252
    ))
```

| Encoding        | Bytes                  | Hexadecimal      | CP1252          |
| --------------- | ---------------------- | ---------------- | --------------- |
| **utf-8**       | `b'\xef\xbb\xbf'`      | `ef bb bf`       | `'ï»¿'`         |
| **utf-16-le**   | `b'\xff\xfe'`          | `ff fe`          | `'ÿþ'`          |
| **utf-16-be**   | `b'\xfe\xff'`          | `fe ff`          | `'þÿ'`          |
| **utf-32-le**   | `b'\xff\xfe\x00\x00'`  | `ff fe 00 00`    | `'ÿþ\x00\x00'`  |
| **utf-32-be**   | `b'\x00\x00\xfe\xff'`  | `00 00 fe ff`    | `'\x00\x00þÿ'`  |
| **utf-7**       | `b'+/v8-'`             | `2b 2f 76 38 2d` | `'+/v8-'`       |
| **gb18030**     | `b'\x841\x953'`        | `84 31 95 33`    | `'„1•3'`        |