Python Japanese encoding: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 0: invalid start byte

I use the following code (Python 3.6) to insert CSV data into the database:

    # -*- coding: UTF-8 -*-
    import pandas as pd
    import pymssql

    def insert_report_pn_dictionary(server, user, password, database):
        dict_list = []
        tw_df = pd.read_csv(r'D:/dict.csv', encoding='utf-8')
        word_list = list(tw_df['WORD'])
        pn_list = list(tw_df['PN'])
        pn_dict = dict(zip(word_list, pn_list))
        for key, value in pn_dict.items():
            dict_list.append((key, value))
        try:
            conn = pymssql.connect(server, user, password, database)
            cur = conn.cursor()
            sql = ' insert into report_pn_dictionary (dict_keyword,dict_pn) ' ' values(%s, %s) '
            cur.executemany(sql, dict_list)
            conn.commit()
        except pymssql.Error as ex:
            raise ex
        except Exception as ex:
            raise ex
        finally:
            conn.close()

    if __name__=="__main__":
        server = '10.10.10.10'
        user = 'test'
        password = 'test'
        database = 'DBTest'
        insert_report_pn_dictionary(server, user, password, database)

But there is an error. The error message is:

    E:\Anaconda3\python.exe E:/test_opencv/TRkeywordpn/set_dictionary.py
    Traceback (most recent call last):
      File "E:/test_opencv/TRkeywordpn/set_dictionary.py", line 33, in <module>
        insert_report_pn_dictionary(server, user, password, database)
      File "E:/test_opencv/TRkeywordpn/set_dictionary.py", line 7, in insert_report_pn_dictionary
        tw_df = pd.read_csv(r'D:/dict.csv', encoding='utf-8')
      File "E:\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 646, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "E:\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 401, in _read
        data = parser.read()
      File "E:\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 939, in read
        ret = self._engine.read(nrows)
      File "E:\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1508, in read
        data = self._reader.read(nrows)
      File "pandas\parser.pyx", line 848, in pandas.parser.TextReader.read (pandas\parser.c:10415)
      File "pandas\parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:10691)
      File "pandas\parser.pyx", line 947, in pandas.parser.TextReader._read_rows (pandas\parser.c:11728)
      File "pandas\parser.pyx", line 1049, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:13162)
      File "pandas\parser.pyx", line 1108, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:14116)
      File "pandas\parser.pyx", line 1206, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:16172)
      File "pandas\parser.pyx", line 1222, in pandas.parser.TextReader._string_convert (pandas\parser.c:16400)
      File "pandas\parser.pyx", line 1458, in pandas.parser._string_box_utf8 (pandas\parser.c:22072)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 0: invalid start byte

My CSV content looks like this:

    WORD PN
    優れる 1
    良い 0.999995
    喜ぶ 0.999979
    褒める 0.999979
    めでたい 0.999645
    賢い 0.999486
    善い 0.999314

I have more than 50 thousand lines of data.

How should I modify my code? I think this is a problem with the Japanese encoding.

Your CSV is not in UTF-8. Try to guess which encoding it uses; SHIFT_JIS may be the one.
– bobrobbob
Jun 29 at 9:03

@bobrobbob I modified it as you said, it worked, thanks!
– MadFrog
12 hours ago
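
For anyone who lands here with the same error, the "guess which encoding it uses" step from the comment can be sketched as below. This is only an illustration using the standard library, not the code the asker ended up with. The helper names `detect_encoding` and `read_rows`, the candidate list, and the comma-separated sample bytes are all assumptions made for the example; the real file is D:/dict.csv and its exact layout is not shown in the question.

```python
import csv
import io

# Candidate encodings are an assumption for this sketch. Order matters:
# the first codec that decodes the bytes without error is returned.
CANDIDATE_ENCODINGS = ("utf-8", "shift_jis", "cp932", "euc_jp")


def detect_encoding(raw, candidates=CANDIDATE_ENCODINGS):
    """Return the first candidate encoding that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("none of the candidate encodings fit the data")


def read_rows(raw, encoding):
    """Parse CSV bytes into a list of (WORD, PN) tuples."""
    reader = csv.DictReader(io.StringIO(raw.decode(encoding), newline=""))
    return [(row["WORD"], float(row["PN"])) for row in reader]


# Hypothetical stand-in for the bytes of the real file, assumed to be
# comma-separated and saved as Shift_JIS.
sample = "WORD,PN\n優れる,1\n良い,0.999995\n".encode("shift_jis")

encoding = detect_encoding(sample)
rows = read_rows(sample, encoding)
print(encoding, rows)
```

Once the encoding is known, the same name can be passed to pandas, for example `pd.read_csv(r'D:/dict.csv', encoding='shift_jis')`. Note that a successful decode only shows the bytes are valid in that codec, not that the codec is the right one, so it is worth checking a few decoded rows by eye.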
