I’m running a socket server that should accept genuine UTF-8 from clients.
The issue is that some clients (mostly attackers) are sending all kinds of malformed data through it.
I can easily tell who is a genuine client; however, I am logging all transmitted data to files so that I can study it afterwards.
Bytes such as 0x9c trigger a UnicodeDecodeError.
I need to be able to decode the data as UTF-8, either keeping or discarding the offending characters.
In my particular case the socket service is an MTA, so I only expect to receive ASCII commands such as:
EHLO example.com
MAIL FROM: <email@example.com>
...
All of this was logged in JSON.
Then people with bad intentions decided to mail all kinds of garbage.
That is why, in my case, stripping the non-ASCII characters is entirely acceptable.
Asked by transilvlad
str = unicode(str, errors='replace')
or
str = unicode(str, errors='ignore')
Note that errors='ignore' will remove the characters in question, returning the string without them, while errors='replace' substitutes each bad byte with the replacement character U+FFFD. (This is Python 2; the unicode built-in does not exist in Python 3.)
For me this is the ideal case, since I’m using it as protection against non-ASCII input, which my application does not allow.
Alternatively, to read in a file, use the open function from the codecs module:

import codecs
with codecs.open(file_name, 'r', encoding='utf-8', errors='ignore') as fdata:
    # work on fdata here
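In Python 3, where the unicode built-in is gone, the same idea can be sketched with bytes.decode; the sample input below is illustrative, not from the original question:

```python
# Python 3 sketch of the same approach: decode raw bytes, choosing how
# to handle sequences that are not valid UTF-8.
raw = b"EHLO example.com \x9c\xff"  # illustrative input with bad bytes

ignored = raw.decode("utf-8", errors="ignore")    # drops the bad bytes
replaced = raw.decode("utf-8", errors="replace")  # substitutes U+FFFD

print(ignored)   # EHLO example.com
print(replaced)  # EHLO example.com ??  (two U+FFFD replacement chars)
```

errors='ignore' silently shortens the string, so for logging it can be worth preferring errors='replace', which at least preserves where the bad bytes were.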
Answered by transilvlad
I was able to solve the problem by switching the engine from C to Python.
Engine is C:
pd.read_csv(gdp_path, sep='\t', engine='c')
Engine is Python:
pd.read_csv(gdp_path, sep='\t', engine='python')
For me, this produced no errors.
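A minimal, self-contained sketch of the two calls, assuming pandas is installed and substituting an in-memory table for gdp_path; on clean input both engines agree, and they differ mainly in speed and in how strictly they handle malformed lines:

```python
import io

import pandas as pd

# In-memory stand-in for gdp_path so the sketch is self-contained.
tsv = "country\tgdp\nA\t1\nB\t2\n"

df_c = pd.read_csv(io.StringIO(tsv), sep="\t", engine="c")
df_py = pd.read_csv(io.StringIO(tsv), sep="\t", engine="python")

# Both engines parse the same data into identical frames.
print(df_c.equals(df_py))
```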
Answered by Doğuş
This is a problem I ran into after switching to Python 3. I had no idea Python 2 was simply steamrolling over any file-encoding issues.
After none of the above worked, I came across this helpful description of the differences and how to find a solution.
In short, to make Python 3 behave as similarly as possible to Python 2 use:
with open(filename, encoding="latin-1") as datafile:
    # work on datafile here
However, as you’ll see from the rest of the article, there is no one-size-fits-all approach.
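This works because latin-1 maps every possible byte value to a code point, so decoding can never raise; a quick sketch:

```python
# Every byte 0x00-0xFF maps to exactly one code point in latin-1,
# so decoding arbitrary binary data never raises UnicodeDecodeError.
data = bytes(range(256))
text = data.decode("latin-1")

print(len(text))  # 256

# The mapping is lossless: re-encoding recovers the original bytes.
print(text.encode("latin-1") == data)  # True
```

The trade-off is that bytes which were really UTF-8 come out as mojibake rather than the intended characters.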
Answered by James McCormac
>>> '\x9c'.decode('cp1252')
u'\u0153'
>>> print '\x9c'.decode('cp1252')
œ
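The transcript above is Python 2. A sketch of the equivalent in Python 3, where the literal must be a bytes object:

```python
# Byte 0x9c is LATIN SMALL LIGATURE OE in Windows-1252, which is why
# decoding it as cp1252 succeeds where UTF-8 fails.
ch = b"\x9c".decode("cp1252")
print(ch)  # œ
```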
Answered by Ignacio Vazquez-Abrams
The first step is to determine the file's encoding with a helper built on chardet:

from chardet import detect

# get file encoding type
def get_encoding_type(file):
    with open(file, 'rb') as f:
        rawdata = f.read()
    return detect(rawdata)['encoding']
The second step is to open the file with the detected encoding:

open(current_file, 'r', encoding=get_encoding_type(current_file), errors='ignore')
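If chardet is not available, a stdlib-only alternative is to try a few likely encodings in order and keep the first that decodes cleanly; the function name and candidate list below are assumptions for illustration, not part of the original answer:

```python
# Stdlib-only fallback: try candidate encodings in order and return the
# first successful decode. Adjust the candidate list for your data.
def read_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # latin-1 accepts any byte sequence, so this is only reached if it
    # was removed from the candidate list.
    return raw.decode("latin-1", errors="replace")
```

Ordering matters: utf-8 is tried first because it is strict and rarely accepts non-UTF-8 data by accident, while latin-1 goes last because it accepts everything.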
Answered by Ivan Lee
Post is based on https://stackoverflow.com/questions/12468179/unicodedecodeerror-utf8-codec-cant-decode-byte-0x9c