Coder Perfect

UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0x9c

Problem

I’m running a socket server that should be able to accept genuine UTF-8 characters from clients.

The issue is that some clients (mostly hackers) are sending all kinds of malformed data over it.

I can easily tell who is a genuine client, but I log all received data to files so that I can analyze it later.

Sometimes the data contains bytes such as 0x9c (the one in the title) that trigger the UnicodeDecodeError.

I need to be able to turn the data into a valid UTF-8 string, with or without those characters.

Update:

For my particular case the socket service was an MTA and thus I only expect to receive ASCII commands such as:

EHLO example.com
MAIL FROM: <john.doe@example.com>
...

All of this was logged in JSON.

Then some people with bad intentions decided to send all kinds of garbage.

That is why, in my case, stripping the non-ASCII characters is entirely acceptable.
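
A minimal sketch of that stripping step in Python 3, assuming each command arrives as a bytes object. The helper name log_entry and the JSON field name "command" are invented for illustration and are not taken from the original service.

```python
import json


def log_entry(raw: bytes) -> str:
    """Return a JSON log line for one received command (hypothetical helper)."""
    # Bytes above 0x7f are not ASCII; errors="ignore" drops them silently.
    text = raw.decode("ascii", errors="ignore")
    return json.dumps({"command": text.rstrip("\r\n")})


print(log_entry(b"EHLO example.com\r\n"))
print(log_entry(b"EHLO \x9cexample.com\r\n"))
```

Both calls produce the same log line, because the stray 0x9c byte is discarded before the text reaches the JSON encoder.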

Asked by transilvlad

Solution #1

http://docs.python.org/howto/unicode.html#the-unicode-type

str = unicode(str, errors='replace')

or

str = unicode(str, errors='ignore')

Note that errors='ignore' strips out (ignores) the characters in question and returns the string without them, while errors='replace' substitutes a replacement character for each of them.

For me this is the ideal case, since I use it as protection against non-ASCII input, which my application does not allow.
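
The unicode() built-in used above exists only in Python 2. As a rough Python 3 equivalent, assuming the received data is a bytes object, the same error handlers can be passed to bytes.decode(). The sample bytes are made up for illustration.

```python
raw = b"EHLO \x9cexample.com"  # 0x9c cannot start a UTF-8 sequence

# errors="replace" substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# errors="ignore" drops the undecodable bytes.
ignored = raw.decode("utf-8", errors="ignore")

# ascii() keeps the output printable on any terminal encoding.
print(ascii(replaced))
print(ascii(ignored))
```

The 'replace' handler keeps a visible marker where data was lost, which can be useful in logs; 'ignore' leaves no trace of the bad byte.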

Alternatively, to read from a file, use the open function from the codecs module:

import codecs
with codecs.open(file_name, 'r', encoding='utf-8',
                 errors='ignore') as fdata:
    data = fdata.read()  # undecodable bytes are dropped
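
In Python 3 the built-in open() accepts the same encoding and errors arguments, so codecs.open is not required there. The sketch below writes a throwaway temporary file so that it is self-contained; the file contents are an assumption made for the demonstration.

```python
import os
import tempfile

# Write a small file containing one byte (0x9c) that is invalid UTF-8 here.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"MAIL FROM: <john.doe\x9c@example.com>\n")
    file_name = tmp.name

# Built-in open() takes encoding= and errors= just like codecs.open().
with open(file_name, "r", encoding="utf-8", errors="ignore") as fdata:
    content = fdata.read()

os.remove(file_name)
print(content)
```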

Answered by transilvlad

Solution #2

I was able to solve the problem by switching the pandas read_csv parser engine from C to Python.

Engine is C:

pd.read_csv(gdp_path, sep='\t', engine='c')

Engine is Python:

pd.read_csv(gdp_path, sep='\t', engine='python')

With the Python engine, I get no errors.

Answered by Doğuş

Solution #3

I ran into this problem after switching to Python 3. I had no idea Python 2 was simply steamrolling over any file encoding issues.

After none of the above worked, I came across this helpful description of the differences and how to find a solution.

http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html

In short, to make Python 3 behave as similarly as possible to Python 2 use:

with open(filename, encoding="latin-1") as datafile:
    # work on datafile here

However, as the rest of the article explains, there is no one-size-fits-all solution.
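
A short sketch of why latin-1 never raises: it assigns a character to every one of the 256 possible byte values, so any byte sequence decodes, and encoding the result restores the original bytes. The variable names are illustrative only.

```python
every_byte = bytes(range(256))

# Latin-1 maps each byte value to one code point, so this cannot raise.
text = every_byte.decode("latin-1")

# The mapping is one-to-one, so encoding again restores the original bytes.
round_trip = text.encode("latin-1")

try:
    every_byte.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(len(text), round_trip == every_byte, utf8_ok)
```

The absence of an error does not mean the text is right. Under latin-1, byte 0x9c becomes the control character U+009C rather than œ, so a file that is really in another encoding will decode to the wrong characters without any warning.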

Answered by James McCormac

Solution #4

>>> '\x9c'.decode('cp1252')
u'\u0153'
>>> print '\x9c'.decode('cp1252')
œ
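
The session above is Python 2. A rough Python 3 equivalent, where decode() is a method of bytes and print is a function:

```python
# 0x9c is the character U+0153 (the oe ligature) in Windows code page 1252.
char = b"\x9c".decode("cp1252")

# ascii() shows the escaped form, which prints safely on any terminal.
print(ascii(char))
print(hex(ord(char)))
```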

Answered by Ignacio Vazquez-Abrams

Solution #5

First, use get_encoding_type to detect the file's encoding:

from chardet import detect  # third-party package: pip install chardet

# get file encoding type
def get_encoding_type(file):
    with open(file, 'rb') as f:
        rawdata = f.read()
    return detect(rawdata)['encoding']

Second, open the file using the detected encoding:

open(current_file, 'r', encoding=get_encoding_type(current_file), errors='ignore')
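
If installing chardet is not an option, a different and much cruder technique is to try a short list of candidate encodings in order. This is not what the answer does. The helper name decode_with_fallback and the default candidate list are assumptions chosen for illustration.

```python
def decode_with_fallback(raw: bytes, candidates=("utf-8", "cp1252")) -> str:
    """Decode raw with the first candidate that succeeds (hypothetical helper)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value, so this last resort cannot fail.
    return raw.decode("latin-1")


# ascii() keeps the output printable on any terminal encoding.
print(ascii(decode_with_fallback(b"c\x9cur")))
```

The order of candidates matters: strict encodings such as UTF-8 should come first, because permissive single-byte encodings accept almost any input and would mask a better match.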

Answered by Ivan Lee

Post is based on https://stackoverflow.com/questions/12468179/unicodedecodeerror-utf8-codec-cant-decode-byte-0x9c