Problem
I’m running a socket server that is supposed to receive valid UTF-8 characters from clients.
The problem is that some clients (mostly hackers) send all kinds of invalid data over it.
I can easily tell genuine clients apart, but I log all transmitted data to files so that I can analyze it later.
Characters such as œ (byte 0x9c) trigger a UnicodeDecodeError.
I need to be able to turn the data into a valid UTF-8 string, with or without those characters.
Update:
In my particular case the socket service was an MTA, so I only expect to receive ASCII commands such as:
EHLO example.com
MAIL FROM: <john.doe@example.com>
...
I was logging all of this in JSON.
Then some people with bad intentions decided to send all kinds of junk.
That is why, in my case, stripping the non-ASCII characters is entirely acceptable.
Asked by transilvlad
Solution #1
http://docs.python.org/howto/unicode.html#the-unicode-type
str = unicode(str, errors='replace')
or
str = unicode(str, errors='ignore')
Note that errors='ignore' strips out (ignores) the characters in question and returns the string without them, while errors='replace' substitutes a replacement character for each of them.
For me this is the ideal case, since I’m using it as protection against non-ASCII input, which my application does not allow.
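The `unicode()` built-in exists only in Python 2. On Python 3 the same error handlers are available through `bytes.decode`. The following is a minimal sketch, assuming the data arrives from the socket as `bytes`; the sample values and variable names are made up for illustration:

```python
# Hypothetical bytes as they might arrive from a socket: an SMTP command
# followed by a byte (0x9c) that is not valid UTF-8 on its own.
raw = b"EHLO example.com\x9c"

# errors="replace" substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# errors="ignore" silently drops each undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")

# Decoding as ASCII with errors="ignore" strips every non-ASCII byte,
# which fits the "ASCII commands only" case described in the question.
ascii_only = b"MAIL FROM: <j\xc3\xb6hn@example.com>".decode("ascii", errors="ignore")

# ascii() keeps the output printable on any console encoding.
print(ascii(replaced))
print(ascii(ignored))
print(ascii(ascii_only))
```

Whether to replace or ignore depends on whether the log should show that something was dropped.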
Alternatively, to read in the file, use the open method from the codecs module:
import codecs
with codecs.open(file_name, 'r', encoding='utf-8',
                 errors='ignore') as fdata:
    # undecodable bytes are dropped while reading
    data = fdata.read()
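On Python 3 the built-in `open` accepts the same `encoding` and `errors` arguments, so `codecs.open` is not strictly needed there. A small sketch, assuming a throwaway temporary file stands in for the real log file:

```python
import os
import tempfile

# Write a small file containing one byte (0x9c) that is invalid UTF-8.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"EHLO example.com\x9c\n")
    file_name = tmp.name

# Built-in open() takes encoding and errors directly in Python 3.
with open(file_name, "r", encoding="utf-8", errors="ignore") as fdata:
    content = fdata.read()

os.remove(file_name)
print(ascii(content))
```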
Answered by transilvlad
Solution #2
I was able to solve the problem by switching the engine from C to Python.
Engine is C:
pd.read_csv(gdp_path, sep='\t', engine='c')
Engine is Python:
pd.read_csv(gdp_path, sep='\t', engine='python')
With the Python engine, I no longer get any errors.
Answered by Doğuş
Solution #3
I ran into this problem after switching to Python 3. I had no idea Python 2 was simply steamrolling over any file encoding issues.
After none of the above worked, I came across this helpful explanation of the differences and how to find a solution.
http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html
In short, to make Python 3 behave as similarly as possible to Python 2 use:
with open(filename, encoding="latin-1") as datafile:
    # work on datafile here
However, as you’ll see from the rest of the article, there is no one-size-fits-all approach.
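The reason latin-1 never raises is that it maps each of the 256 byte values to the code point with the same number. A short sketch of that property and of its cost, using made-up variable names:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# latin-1 maps byte N to code point N, so decoding never raises.
text = all_bytes.decode("latin-1")

# The mapping is reversible: encoding gives back the original bytes.
round_trip = text.encode("latin-1")

# The catch: UTF-8 input decodes without error but to the wrong characters.
mojibake = "\u0153".encode("utf-8").decode("latin-1")
print(ascii(mojibake))
```

So latin-1 avoids the exception, but it does not tell you whether the text was decoded correctly.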
Answered by James McCormac
Solution #4
>>> '\x9c'.decode('cp1252')
u'\u0153'
>>> print '\x9c'.decode('cp1252')
œ
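The session above is Python 2, where a plain string literal holds bytes. A rough Python 3 equivalent, which also shows why the UTF-8 codec rejects the same byte:

```python
raw = b"\x9c"

# In Windows-1252 (cp1252), byte 0x9c is the ligature U+0153.
decoded = raw.decode("cp1252")
print(ascii(decoded))

# In UTF-8, 0x9c can only appear as a continuation byte, never on its
# own, so decoding it raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print(exc)
```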
Answered by Ignacio Vazquez-Abrams
Solution #5
First, use a get_encoding_type function to detect the encoding of the file:
from chardet import detect

# detect the encoding of a file from its raw bytes
def get_encoding_type(file):
    with open(file, 'rb') as f:
        rawdata = f.read()
    return detect(rawdata)['encoding']
Second, open the file with the detected encoding (note that the function has to be called with the file name):
open(current_file, 'r', encoding=get_encoding_type(current_file), errors='ignore')
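Detection is a guess, and the reported encoding can be wrong or missing. As a different technique that needs no third-party package, one could try a fixed list of encodings in order. This is only a sketch: `decode_with_fallback` and its candidate list are hypothetical and not part of chardet or the answer above.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached if the caller supplied candidates without a catch-all
    # such as latin-1; fall back to lossy UTF-8 decoding.
    return raw.decode("utf-8", errors="replace"), "utf-8"


text, used = decode_with_fallback(b"\x9c")
print(ascii(text), used)
```

The order matters: latin-1 accepts any input, so it only makes sense as the last candidate.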
Answered by Ivan Lee
Post is based on https://stackoverflow.com/questions/12468179/unicodedecodeerror-utf8-codec-cant-decode-byte-0x9c