Python 字符编码
2015-08-25
突然来了兴致,想看看 Unicode 中有多少个中文,查了一下,很多人都是说 4e00 至 9fff 段。
# -*- coding: utf-8 -*-
all_chinese_in_unicode = range(0x4e00, 0x9fff)
def transfer(u_char_num):
if isinstance(u_char_num, int):
u_char_hex = '%x' % u_char_num
u_char_str = '\u' + u_char_hex
else:
if isinstance(u_char_num, str) and len(u_char_num) == 4:
u_char_str = '\u' + u_char_num
else:
raise Exception
u_char = u_char_str.decode('raw_unicode_escape')
# print u_char_hex, u_char
return u_char
def test_transfer():
# repr(u"国") -> u'\u56fd' -> 22269
print transfer(0x56fd)
# 打印所有中文字符
# for i in all_chinese_in_unicode:
# print transfer(i),
# print
# 打印最后一个中文字符及前、后各一个字符
last_chinese_char = 0x9fbb
last_chinese_char_index = all_chinese_in_unicode.index(last_chinese_char)
start, end = last_chinese_char_index - 1, last_chinese_char_index + 2
for i in all_chinese_in_unicode[start:end]:
print transfer(i),
print
Ubuntu 下的 zsh 中运行,只能显示到这个字符:龻,后面的都是乱码,这个字符对应的十六进制数是 9fbb
。
结果又意外发现,最后一个字符似乎不是 9fbb,而是 9fcc(改 URL 一个一个试出来的)。
来源:https://www.fileformat.info/info/unicode/char/9fcc/index.htm
对应文字图片:https://www.fileformat.info/info/unicode/char/9fcc/sample.png
当然,这个来源并不保证权威,也可能有错误,涉及中文字符的范围还是使用 4e00 - 9fff 比较保险。
比如正则表达式:[\u4e00-\u9fff]
。
标点符号
看来是需要抽半天空闲时间,仔细研究研究编码问题了。
参考
个人
2015-08-24
主要是说明一下为什么博客为什么突然中断了,以及现在打算怎么弄。
软件设计 设计模式
2015-08-10
最经典的是 SOLID 五个,但也有说六个、七个的。
SSH 网络代理
2015-07-15
网络代理是一种常见的网络技术,其中 SSH 由于其普遍性,用来代理是最顺手不过的。
助记:
DLR
独立日, 分别对应三种最常见的代理模式 Dynamic, Local, Remote
ProxyPort:DestHost:DestPort User@ProxyHost
比如有三台机器 A, B, C, 其中 A 是本地机器,B 是代理机器,C 是目标机器(注意:A,B,C 可以是同一台机器,或者其中任意两个是同一台机器):
- 在本地机器上建立隧道,代理机器做跳板,将本地端口绑定到目的机器端口:
// A:80 => B => C:80
ssh -L 80:C:80 root@B
// B:80 => C:80
ssh -R 80:C:80 root@B
上面两种都是端口映射模式,应该能够满足所有端口对端口的代理。
最后的 -D
就是动态代理,类似 NAT,将发往指定端口的数据转发到出去。
ssh -D 80 root@B
这个时候,如果设置系统网络代理为 socks5://B:80
,然后本地的所有网络请求都会经过 B 中转,比如我访问 C:80,C 服务器接受到的是来自 B 的访问。
PS: 这个 socks5 要是展开带另起一篇,这里就只需要知道,这种代理的方式有一个专门的名字叫 socks 就行了。
如果在 Windows 上,PuTTY 就是一个很好的 SSH 客户端,也可以从来建立代理,这就不展开讲了。
下次我研究研究 PuTTY,另外再发一篇文章。
Update @ 2021-10-01:
- 有一些 Linux 命令直接支持 socks5 代理,比如 wget, curl。
- 编程时,发起网络请求也可以设置 socks5 代理。
Update @ 2016-09-16:
Update @ 2017-06-09:
DB MySQL 字符编码
2015-05-11
各种类型的编码
mysql> SHOW VARIABLES LIKE 'character%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | gbk |
| character_set_connection | gbk |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | gbk |
| character_set_server | utf8mb4 |
| character_set_system | utf8mb3 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
5.7 及更早版本默认字符集和 Collation 是 latin1 和 latin1_swedish_ci
8.0 之后改成 utf8mb4 和 utf8mb4_0900_ai_ci
上面就有 6 种 全局的编码:
- client 客户端编码
- connection 连接编码
- database 数据库编码,创建表时的默认编码
load data, 以及创建
- filesystem 文件系统编码
- results 结果编码
- server 服务端编码,创建数据库时的默认编码
- system 系统编码
还有两种局部的编码:表的编码和列(字段)的编码。
列主要是 char, varchar, text 类型。
字符集
# 都可以 LIKE 搜索
SHOW CHARACTER SET;
SHOW CHARSET;
SHOW CHAR SET;
Charset |
Description |
Default collation |
Maxlen |
armscii8 |
ARMSCII-8 Armenian |
armscii8_general_ci |
1 |
ascii |
US ASCII |
ascii_general_ci |
1 |
big5 |
Big5 Traditional Chinese |
big5_chinese_ci |
2 |
binary |
Binary pseudo charset |
binary |
1 |
cp1250 |
Windows Central European |
cp1250_general_ci |
1 |
cp1251 |
Windows Cyrillic |
cp1251_general_ci |
1 |
cp1256 |
Windows Arabic |
cp1256_general_ci |
1 |
cp1257 |
Windows Baltic |
cp1257_general_ci |
1 |
cp850 |
DOS West European |
cp850_general_ci |
1 |
cp852 |
DOS Central European |
cp852_general_ci |
1 |
cp866 |
DOS Russian |
cp866_general_ci |
1 |
cp932 |
SJIS for Windows Japanese |
cp932_japanese_ci |
2 |
dec8 |
DEC West European |
dec8_swedish_ci |
1 |
eucjpms |
UJIS for Windows Japanese |
eucjpms_japanese_ci |
3 |
euckr |
EUC-KR Korean |
euckr_korean_ci |
2 |
gb18030 |
China National Standard GB18030 |
gb18030_chinese_ci |
4 |
gb2312 |
GB2312 Simplified Chinese |
gb2312_chinese_ci |
2 |
gbk |
GBK Simplified Chinese |
gbk_chinese_ci |
2 |
geostd8 |
GEOSTD8 Georgian |
geostd8_general_ci |
1 |
greek |
ISO 8859-7 Greek |
greek_general_ci |
1 |
hebrew |
ISO 8859-8 Hebrew |
hebrew_general_ci |
1 |
hp8 |
HP West European |
hp8_english_ci |
1 |
keybcs2 |
DOS Kamenicky Czech-Slovak |
keybcs2_general_ci |
1 |
koi8r |
KOI8-R Relcom Russian |
koi8r_general_ci |
1 |
koi8u |
KOI8-U Ukrainian |
koi8u_general_ci |
1 |
latin1 |
cp1252 West European |
latin1_swedish_ci |
1 |
latin2 |
ISO 8859-2 Central European |
latin2_general_ci |
1 |
latin5 |
ISO 8859-9 Turkish |
latin5_turkish_ci |
1 |
latin7 |
ISO 8859-13 Baltic |
latin7_general_ci |
1 |
macce |
Mac Central European |
macce_general_ci |
1 |
macroman |
Mac West European |
macroman_general_ci |
1 |
sjis |
Shift-JIS Japanese |
sjis_japanese_ci |
2 |
swe7 |
7bit Swedish |
swe7_swedish_ci |
1 |
tis620 |
TIS620 Thai |
tis620_thai_ci |
1 |
ucs2 |
UCS-2 Unicode |
ucs2_general_ci |
2 |
ujis |
EUC-JP Japanese |
ujis_japanese_ci |
3 |
utf16 |
UTF-16 Unicode |
utf16_general_ci |
4 |
utf16le |
UTF-16LE Unicode |
utf16le_general_ci |
4 |
utf32 |
UTF-32 Unicode |
utf32_general_ci |
4 |
utf8mb3 |
UTF-8 Unicode |
utf8mb3_general_ci |
3 |
utf8mb4 |
UTF-8 Unicode |
utf8mb4_0900_ai_ci |
4 |
SELECT * FROM information_schema.character_sets;
SELECT * FROM information_schema.character_sets WHERE CHARACTER_SET_NAME LIKE "%utf%";
CHARACTER_SET_NAME |
DEFAULT_COLLATE_NAME |
DESCRIPTION |
MAXLEN |
utf8mb3 |
utf8mb3_general_ci |
UTF-8 Unicode |
3 |
utf16 |
utf16_general_ci |
UTF-16 Unicode |
4 |
utf16le |
utf16le_general_ci |
UTF-16LE Unicode |
4 |
utf32 |
utf32_general_ci |
UTF-32 Unicode |
4 |
utf8mb4 |
utf8mb4_0900_ai_ci |
UTF-8 Unicode |
4 |
排序规则
SHOW COLLATION;
这就多了,两百多。
SHOW COLLATION LIKE "%ascii%";
Collation |
Charset |
Id |
Default |
Compiled |
Sortlen |
Pad_attribute |
ascii_bin |
ascii |
65 |
|
Yes |
1 |
PAD SPACE |
ascii_general_ci |
ascii |
11 |
Yes |
Yes |
1 |
PAD SPACE |
SELECT * FROM information_schema.collations WHERE CHARACTER_SET_NAME = "utf8mb4" AND COLLATION_NAME LIKE "%zh%";
COLLATION_NAME |
CHARACTER_SET_NAME |
ID |
IS_DEFAULT |
IS_COMPILED |
SORTLEN |
PAD_ATTRIBUTE |
utf8mb4_zh_0900_as_cs |
utf8mb4 |
308 |
|
Yes |
0 |
NO PAD |
命名规则
字符集名称,语言,通用后缀
ai
Accent-insensitive 重音不敏感
as
Accent-sensitive 重音敏感
ci
Case-insensitive 大小写不敏感
cs
Case-sensitive 大小写敏感
ks
Kana-sensitive 假名敏感(日语)
bin
二进制
8.0 之后,很多编码中多了 0900 字样,表示 Unicode 9.0 规范。
参考资料与拓展阅读
Python 设计模式
2015-04-06
使用全局变量
我个人比较喜欢的方式,简单高效。
class Application:
pass
APP = Application()
装饰器(使用特定方法来获取对象实例)
import threading
class singleton:
_instance_lock = threading.Lock()
def __init__(self, decorated):
self._decorated = decorated
def instance(self):
if not hasattr(self, '_instance'):
with singleton._instance_lock:
if not hasattr(self, '_instance'):
self._instance = self._decorated()
return self._instance
def __call__(self):
raise TypeError('Singletons must be accessed through `instance()`.')
def __instancecheck__(self, inst):
return isinstance(inst, self._decorated)
def __getattr__(self, name):
if name == '_instance':
raise AttributeError("'singleton' object has no attribute '_instance'")
return getattr(self.instance(), name)
@singleton
class Application:
pass
Application.instance()
装饰器
def singleton(cls):
_instances = {}
def _inner(*args, **kargs):
if cls not in _instances:
_instances[cls] = cls(*args, **kargs)
return _instances[cls]
return _inner
元类
class Application:
_instance_lock = threading.Lock()
def __new__(cls, *args, **kwargs):
if not hasattr(cls, "_instance"):
with cls._instance_lock:
if not hasattr(cls, "_instance"):
cls._instance = object.__new__(cls)
return cls._instance
改成通用方式:
import threading
class SingletonMeta(type):
_instances = {}
_instance_lock = threading.Lock()
def __call__(cls, *args, **kwargs):
if cls not in cls._instances:
with cls._instance_lock:
if cls not in cls._instances:
instance = super().__call__(*args, **kwargs)
cls._instances[cls] = instance
return cls._instances[cls]
class Application(metaclass=SingletonMeta):
pass
Django 密码学 加密 PBKDF2
2015-04-06
实现
在 Python 2.7.8 (July 2, 2014),或 Python 3.4 (March 17, 2014) 以上版本,标准库 hashlib 有 pbkdf2_hmac
:
import base64
import hashlib
import random
import secrets
def pbkdf2(password, salt, iterations, dklen=0, digest=None):
if digest is None:
digest = hashlib.sha256
if not dklen:
dklen = None
password = password.encode('utf-8')
salt = salt.encode('utf-8')
return hashlib.pbkdf2_hmac(
digest().name, password, salt, iterations, dklen)
### pbkdf2_sha256 ###
ALGORITHM = 'pbkdf2_sha256'
ITERATIONS = 30000
DIGEST = hashlib.sha256
def encode(password, salt, iterations=None):
assert password is not None
assert salt and '$' not in salt
if not iterations:
iterations = ITERATIONS
hash = pbkdf2(password, salt, iterations, digest=DIGEST)
hash = base64.b64encode(hash).decode('ascii').strip()
return "%s$%d$%s$%s" % (ALGORITHM, iterations, salt, hash)
def decode(encoded):
algorithm, iterations, salt, hash = encoded.split('$', 3)
assert algorithm == ALGORITHM
return {
'algorithm': algorithm,
'hash': hash,
'iterations': int(iterations),
'salt': salt,
}
def verify(password, encoded):
algorithm, iterations, salt, hash = encoded.split('$', 3)
assert algorithm == ALGORITHM
encoded_2 = encode(password, salt, int(iterations))
return secrets.compare_digest(encoded.encode('utf-8'),
encoded_2.encode('utf-8'))
salt_length = 32
# salt = bytes([random.randint(0, 255) for i in range(salt_length + 20)])
salt = random.randbytes(salt_length + 20)
salt = salt.replace(b'$', b'').decode('latin')[:salt_length]
# print(repr(salt))
a = encode('123456', salt)
print(repr(a))
print(decode(a))
# 'pbkdf2_sha256$30000$x\x82¨\x07Ó[Ám¬9V¬\x13·sÉ\x1eEܱ3\x04Ü\x07\x05BÁfM\x9fV×$/jqvasjfZX6c9xCytBVqddAJFve2DXcLWClTAsZvl48='
print(verify('234567', a))
print(verify('123456', a))
参考资料与拓展阅读
DB SQL MySQL
2015-03-22

Django 信息安全 密码学
2015-02-05
https://docs.djangoproject.com/en/2.0/topics/auth/passwords/
在 Django 中,密码哈希存储在 auth_user 表中的 password 字段中。该字段的值包含了算法名称、迭代次数、盐值和哈希值,它们都是使用特定的格式进行编码的。例如,一个密码哈希的值可能如下所示:
pbkdf2_sha256$150000$V7v0fjbhMIhq$Gd/0XuOqK3ib6NBNGVwIKGXcFTUiNbTzdTNN8RiW24E=
用美元符号切割,得到 4 部分:
pbkdf2_sha256
算法名称
150000
迭代次数
V7v0fjbhMIhq
盐
Gd/0XuOqK3ib6NBNGVwIKGXcFTUiNbTzdTNN8RiW24E=
支持的哈希算法
PASSWORD_HASHERS = [
'django.contrib.auth.hashers.PBKDF2PasswordHasher',
'django.contrib.auth.hashers.PBKDF2SHA1PasswordHasher',
'django.contrib.auth.hashers.Argon2PasswordHasher',
'django.contrib.auth.hashers.BCryptSHA256PasswordHasher',
'django.contrib.auth.hashers.BCryptPasswordHasher',
]
默认使用的是 PBKDF2PasswordHasher
:
PBKDF2 是基于密钥的密码派生函数,它使用一个伪随机函数(PRF)和一个盐值来从给定密码派生出一个密钥。
PBKDF2 算法的核心是迭代的使用 PRF 函数,将每一次迭代的输出与前一次迭代的输出进行异或,生成最终的派生密钥。
大概逻辑如下:
import hashlib
import binascii
import os
class PBKDF2PasswordHasher:
algorithm = 'pbkdf2_sha256'
iterations = 100000
def salt(self):
return binascii.hexlify(os.urandom(16)).decode()
def encode(self, password, salt):
dk = hashlib.pbkdf2_hmac('sha256', password.encode(), salt.encode(), self.iterations)
return '%d$%s$%s' % (self.iterations, salt, binascii.hexlify(dk).decode())
def verify(self, password, encoded):
iterations, salt, dk = encoded.split('$')
iterations = int(iterations)
hashed_password = self.encode(password, salt)
return hashed_password == encoded
生成密码
from django.contrib.auth.hashers import make_password
password = '123456'
hashed_password = make_password(password)
验证密码
from django.contrib.auth.hashers import check_password
is_correct_password = check_password(password, hashed_password)
Web JavaScript
2015-01-30