I'm parsing HTML with Python and the page contains a date string: [ 24-Янв-17 07:24 ]. "Янв" is the Russian abbreviation for "Jan". I want to convert it into a datetime object.
# Some beautifulsoup parsing
timeData = data.find('div', {'id' : 'time'}).text
import datetime
import locale
locale.setlocale(locale.LC_TIME, 'ru_RU.UTF-8')
result = datetime.datetime.strptime(timeData, u'[ %d-%b-%y  %H:%M ]')
The error is:
ValueError: time data '[ 24-\xd0\xaf\xd0\xbd\xd0\xb2-17 07:24 ]' does not match format '[ %d-%b-%y %H:%M ]'
type(timeData) returns unicode. Encoding timeData from utf-8 returns UnicodeEncodeError. What's wrong?
chardet returns {'confidence': 0.87625, 'encoding': 'utf-8'} and when I write: datetime.datetime.strptime(timeData.encode('utf-8'), ...) it returns error as above.
The original page is encoded in windows-1251.
print type(timeData)
print timeData
timeData = timeData.encode('cp1251')
print type(timeData)
print timeData
returns
<type 'unicode'>
[ 24-Янв-17 07:24 ]
<type 'str'>
[ 24-???-17 07:24 ]
For background: strptime() ("string parse time") converts a string to a datetime according to a format. For example, the format '%Y/%m/%d' parses a date written as year/month/day.
time.strptime() uses the same format directives as strftime(); its format parameter defaults to "%a %b %d %H:%M:%S %Y", which matches the formatting returned by ctime(). If the string cannot be parsed according to the format, or if it has excess data after parsing, ValueError is raised.
datetime.datetime.strptime() is a class method of the datetime class that converts a string representation of a date/time into a datetime object.
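As a minimal illustration of the '%Y/%m/%d' format mentioned above, the sketch below parses a purely numeric date (so no locale is involved) and shows the ValueError raised when data is left over. The sample strings and the variable names are just assumptions for the example.

```python
import datetime

# Numeric directives only, so the result does not depend on the locale.
parsed = datetime.datetime.strptime('2017/01/24', '%Y/%m/%d')
print(parsed)

# Trailing data that the format does not account for raises ValueError.
raised = False
try:
    datetime.datetime.strptime('2017/01/24 extra', '%Y/%m/%d')
except ValueError as exc:
    raised = True
    print('ValueError:', exc)
```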
Got it!  янв has to be lower-case in CPython 2.7.12.  Code (works in CPy 2.7.12 and CPy 3.4.5 on cygwin):
# coding=utf8
#timeData='[ 24-Янв-17 07:24 ]'
timeData='[ 24-янв-17 07:24 ]'    ### lower-case
import datetime
import locale
locale.setlocale(locale.LC_TIME, 'ru_RU.UTF-8')
result = datetime.datetime.strptime(timeData, u'[ %d-%b-%y  %H:%M ]')
print(result)
result:
2017-01-24 07:24:00
If I use the upper-case Янв, it works in Py 3, but in Py 2 it gives
ValueError: time data '[ 24-\xd0\xaf\xd0\xbd\xd0\xb2-17 07:24 ]' does not match format '[ %d-%b-%y  %H:%M ]'
To handle this in general in Python 2, lower-case first (see this answer):
# coding=utf8
timeData=u'[ 24-Янв-17 07:24 ]'
       # ^ unicode data
import datetime
import locale
locale.setlocale(locale.LC_TIME, 'ru_RU.UTF-8')
print(timeData.lower())     # works OK
result = datetime.datetime.strptime(
    timeData.lower().encode('utf8'), u'[ %d-%b-%y  %H:%M ]')
    ##               ^^^^^^^^^^^^^^ back to a string
    ##       ^^^^^^^ lowercase
print(result)
Result:
[ 24-янв-17 07:24 ]
2017-01-24 07:24:00
I can't test it with your beautifulsoup code, but, in general, get Unicode data and then use the above.
Or, if at all possible, switch to Python 3 :) .
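If changing the process-wide locale is undesirable, or ru_RU.UTF-8 is not installed, one alternative is to skip %b entirely and map the month abbreviation by hand. This is only a sketch for Python 3, not what strptime does internally: the name parse_ru_timestamp, the regular expression, and the assumption that two-digit years mean 20xx are all mine.

```python
import datetime
import re

# Lower-case Russian month abbreviations; the list index is the month number.
RU_MONTHS = ['', 'янв', 'фев', 'мар', 'апр', 'май', 'июн',
             'июл', 'авг', 'сен', 'окт', 'ноя', 'дек']

# Matches strings shaped like "[ 24-Янв-17 07:24 ]".
PATTERN = re.compile(r'\[\s*(\d{1,2})-(\w+)-(\d{2})\s+(\d{1,2}):(\d{2})\s*\]')

def parse_ru_timestamp(text):
    """Parse '[ DD-Mon-YY HH:MM ]' with a Russian month abbreviation."""
    match = PATTERN.match(text.strip())
    if match is None:
        raise ValueError('unexpected timestamp: %r' % text)
    day, month_name, year, hour, minute = match.groups()
    # Case-insensitive lookup; list.index raises ValueError for unknown names.
    month = RU_MONTHS.index(month_name.lower())
    # Assumption: every two-digit year belongs to the 2000s.
    return datetime.datetime(2000 + int(year), month, int(day),
                             int(hour), int(minute))

result = parse_ru_timestamp('[ 24-Янв-17 07:24 ]')
print(result)
```

Because the lookup lower-cases the month name itself, it accepts both Янв and янв and does not touch the locale at all.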
So how did I figure this out?  I went looking in the CPython source for the code behind strptime.  I found the handy _strptime module, containing the class LocaleTime.  I also found a mention of LocaleTime.  To print the available month names, add this to the end of the code above:
from _strptime import LocaleTime
lt = LocaleTime()
print(lt.a_month)    
a_month has the abbreviated month names per the source.
On Py3, that yields:
['', 'янв', 'фев', 'мар', 'апр', 'май', 'июн', 'июл', 'авг', 'сен', 'окт', 'ноя', 'дек']
      ^ lowercase!
On Py2, that yields:
['', '\xd1\x8f\xd0\xbd\xd0\xb2',
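and a bunch more.  Note that the first character there is \xd1\x8f, the UTF-8 encoding of lower-case я, whereas your error message has \xd0\xaf, the UTF-8 encoding of upper-case Я.  The bytes differ, so the month name doesn't match.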
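The two byte sequences can be checked directly. This short Python 3 sketch (the variable names are only illustrative) encodes the upper-case and lower-case month abbreviations to UTF-8 and compares them with the bytes seen in the error message and in a_month:

```python
# Bytes that appear in the ValueError message (upper-case Я).
upper_bytes = 'Янв'.encode('utf-8')
# Bytes that LocaleTime().a_month reports on Python 2 (lower-case я).
lower_bytes = 'янв'.encode('utf-8')

print(upper_bytes)
print(lower_bytes)

# Lower-casing the text first makes the two encodings identical.
print('Янв'.lower().encode('utf-8') == lower_bytes)
```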