Exploring Python's encoding issues through an error caused by unicode_literals in ``__future__``

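In a Python 2.7 project I used unicode_literals from ``__future__`` to prepare for Python 3.x compatibility. Today I hit a UnicodeEncodeError, traced it, and found a small pitfall worth noting. What kind of pitfall? Let's walk through the code and dig into the underlying cause along the way.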

1. Without unicode_literals

.. code:: python

    # coding: utf-8
    from datetime import datetime

    now = datetime.now()
    # Without unicode_literals this literal is a byte string (str).
    print now.strftime('%m月%d日 %H:%M')

This code runs fine and prints: 03月12日 21:53

2. With unicode_literals

.. code:: python

    # coding: utf-8
    from __future__ import unicode_literals
    from datetime import datetime

    now = datetime.now()
    # With unicode_literals this literal is now a unicode object.
    print now.strftime('%m月%d日 %H:%M')

This raises the following error::

    Traceback (most recent call last):
      File "unicode_error_demo2.py", line 7, in <module>
        print now.strftime('%m月%d日 %H:%M')
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u6708' in position 2: ordinal not in range(128)

3. Solution 1: set the runtime default encoding to utf-8

.. code:: python

    # coding: utf-8
    from __future__ import unicode_literals
    import sys
    from datetime import datetime

    # Change the interpreter-wide default encoding from ascii to utf-8.
    reload(sys)
    sys.setdefaultencoding('utf-8')

    now = datetime.now()
    print now.strftime('%m月%d日 %H:%M')

This runs without error.

4. Solution 2: use a byte string

.. code:: python

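    # coding: utf-8
    from __future__ import unicode_literals
    from datetime import datetime

    now = datetime.now()
    # The b prefix marks the literal as a byte string (str in Python 2).
    print now.strftime(b'%m月%d日 %H:%M')

    # Alternatively: encode the unicode literal into a bytearray,
    # then convert that to a str.
    t = bytearray('%m月 %d %H:%M', 'utf-8')
    print now.strftime(str(t))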

5. Summary

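This is mainly about encoding in Python, a topic that gives many people a headache when they first pick up the language. The reference links below cover the fundamentals; here I will only analyze the snippets above.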

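Start with the first snippet. It runs fine, as expected, because datetime's strftime takes a str argument (note: in Python 2, str holds bytes and unicode holds characters; see reference 1). strftime receives a str, formats it, and returns the result.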
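The byte/character distinction can be made concrete. The following is a minimal sketch, written so that it runs on both Python 2.7 and Python 3; the variable names are illustrative and not part of the original snippets.

```python
# U+6708 is the "month" character used in the format string above.
month_char = u'\u6708'

# As text it is a single character (one code point) ...
char_count = len(month_char)

# ... but encoded as UTF-8 it occupies three bytes.
month_bytes = month_char.encode('utf-8')
byte_count = len(month_bytes)

print('%d character, %d bytes' % (char_count, byte_count))
print(repr(month_bytes))
```

In Python 2 the encoded value is a str, which is the type strftime expects.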

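The second snippet imports unicode_literals from ``__future__``. This future feature turns every string literal in the current module into unicode. With that in mind, the argument we pass to now.strftime looks the same but is fundamentally different: before it was a str (bytes), now it is a unicode (characters). strftime expects a str, so when handed a unicode it has to convert it, and that conversion is what fails with UnicodeEncodeError.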

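This deserves a closer look. We handed over the unicode character "月", which has to be turned into a str. How? With ASCII, the default encoding of the Python 2.7 runtime. Python uses the ASCII codec to encode the unicode into a str, and the error is raised.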

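The Traceback states the cause precisely: the u'\u6708' we passed in (the character "月") cannot be encoded by ASCII, because it is not among ASCII's 128 characters. For more on character encodings, see reference 5.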
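The failing conversion can be reproduced without strftime by encoding the character explicitly. This is a small sketch that runs on Python 2.7 and Python 3; ``try_encode`` is a hypothetical helper introduced only for this illustration.

```python
import sys

month_char = u'\u6708'  # the character from the Traceback

# 'ascii' on a stock Python 2.7, 'utf-8' on Python 3.
default_encoding = sys.getdefaultencoding()


def try_encode(text, encoding):
    """Return the encoded bytes, or the UnicodeEncodeError that was raised."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc


# ASCII has no code for U+6708, so this yields the exception object.
ascii_result = try_encode(month_char, 'ascii')
# UTF-8 can encode it, so this yields the encoded bytes.
utf8_result = try_encode(month_char, 'utf-8')

print(default_encoding)
print(repr(ascii_result))
print(repr(utf8_result))
```

The ASCII attempt produces the same kind of error as the Traceback, while the UTF-8 attempt succeeds.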

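Now the third snippet. We reset the interpreter's default encoding to utf-8, so the implicit conversion uses UTF-8 instead of ASCII and the problem disappears; put simply, utf-8 can represent far more characters. (The reload(sys) is needed because setdefaultencoding is removed from the sys module at startup.)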

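Finally, the fourth snippet. Passing the format as a byte string also avoids the error, because strftime then receives a str and no implicit conversion takes place. The snippet shows two ways to do it: prefix the literal with b, which declares a bytes literal (in Python 2, bytes is the same type as str) rather than unicode; or encode the unicode object into a bytearray via utf-8 and convert that to a str. See references 4 and 6.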
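To check that both routes end up with the same bytes, here is a minimal sketch that runs on Python 2.7 and Python 3. It uses ``bytes(...)`` instead of ``str(...)`` for the bytearray conversion so that it also works on Python 3, and it adds a third route, plain ``encode``, which is not in the original snippet.

```python
import sys

# The format string as text; \u6708 and \u65e5 are the month and day characters.
fmt = u'%m\u6708%d\u65e5 %H:%M'

# Route 1: a byte literal holding the UTF-8 bytes directly.
from_literal = b'%m\xe6\x9c\x88%d\xe6\x97\xa5 %H:%M'

# Route 2: encode the text into a bytearray, then make it an immutable byte string.
from_bytearray = bytes(bytearray(fmt, 'utf-8'))

# Route 3 (an addition, not in the original post): encode the text directly.
from_encode = fmt.encode('utf-8')

same_bytes = from_literal == from_bytearray == from_encode

# On Python 2, bytes is simply another name for str.
bytes_is_str = bytes is str

print(same_bytes)
print(bytes_is_str)
```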

All of the above is what the5fire concluded from the references; corrections are welcome.

PS: The same problem also applies to Python's built-in getattr function.
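One possible way to guard against this is to encode the attribute name explicitly before the lookup. The sketch below assumes UTF-8 is an acceptable encoding for names; ``getattr_compat`` and ``Config`` are hypothetical names made up for this example.

```python
import sys

PY2 = sys.version_info[0] == 2


def getattr_compat(obj, name, default=None, encoding='utf-8'):
    """Look up an attribute, accepting a non-ASCII text name on Python 2."""
    if PY2 and isinstance(name, type(u'')):
        # Encode explicitly instead of relying on the ASCII default encoding.
        name = name.encode(encoding)
    return getattr(obj, name, default)


class Config(object):
    timeout = 30


value = getattr_compat(Config(), u'timeout')
missing = getattr_compat(Config(), u'\u6708', default='n/a')

print('%s %s' % (value, missing))
```

On Python 3 the helper just delegates to getattr, since attribute names are already text there.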

References:

1. 黄聪 (Huang Cong): To fix garbled Chinese text in Python, first understand the difference between "characters" and "bytes"
2. http://docs.python.org/2/library/datetime.html#datetime.date.strftime
3. http://docs.python.org/2.7/library/functions.html#getattr
4. http://docs.python.org/2/whatsnew/2.6.html?highlight=bytestring#pep-3112-byte-literals
5. http://www.cnblogs.com/huxi/articles/1897271.html
6. http://stackoverflow.com/questions/6269765/what-does-the-b-character-do-in-front-of-a-string-literal
