Coder Perfect

how to tell whether a text file contains invalid utf8 unicode/binary


I need to find invalid (non-ASCII) utf-8, Unicode, or binary characters in a corrupted text file.


I’ve tried the following:

iconv -f utf-8 -t utf-8 -c file.csv 

This converts the file from UTF-8 to UTF-8, with -c telling iconv to discard invalid UTF-8 characters. Even so, the offending characters ended up being printed. Are there other solutions, in bash on Linux or in other languages?

Asked by user121196

Solution #1

This works well to spot erroneous UTF-8 sequences if your locale is set to UTF-8 (see locale output):

grep -axv '.*' file.txt

An explanation (the option descriptions are taken from the grep man page):

-a, --text            Process a binary file as if it were text.
-x, --line-regexp     Select only matches that exactly match the whole line.
-v, --invert-match    Select non-matching lines.

In a UTF-8 locale, the pattern '.*' matches any line made up entirely of valid characters, so the inverted match (-v) outputs exactly the lines that contain an invalid UTF-8 byte sequence.
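A minimal demonstration (a sketch; sample.txt is a made-up file name, and the locale is forced to C.UTF-8, which most Linux systems provide):

```shell
# Create a file whose second line contains a stray 0x80 byte (octal \200),
# which can never appear on its own in valid UTF-8.
printf 'good line\nbad \200 byte\n' > sample.txt

# -a treat as text, -x whole-line match, -v invert:
# prints only the line with the invalid byte sequence.
LC_ALL=C.UTF-8 grep -axv '.*' sample.txt
```

The exit status is useful too: grep exits 0 when it printed something, so the same command doubles as a yes/no test for invalid bytes.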

Answered by Blaf

Solution #2

I’d use grep to look for non-ASCII characters.

With GNU grep built with PCRE support (needed for -P, which is not always available; on FreeBSD you can use pcregrep from the pcre2 package), you can do the following:

grep -P "[\x80-\xFF]" file

This is a reference to How Do I grep For all non-ASCII Characters in UNIX. So, if you just want to check whether the file contains non-ASCII characters, you can simply say:

if grep -qP "[\x80-\xFF]" file ; then echo "file contains non-ascii characters"; fi
#        ^
#        silent grep

You can use the following commands to get rid of these characters:

sed -i.bak 's/[\d128-\d255]//g' file

This will create a file.bak backup file and strip the non-ASCII characters from the original file. This is a reference to Remove non-ascii characters from csv.
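Note that the \d128-style escapes are a GNU sed extension. A POSIX-portable alternative is tr, which here deletes every byte outside the 7-bit ASCII range (a sketch; demo.csv is a made-up file name):

```shell
# "café, ok" with é as the UTF-8 bytes C3 A9 (octal \303\251)
printf 'caf\303\251, ok\n' > demo.csv

# -c complements the set, -d deletes: drop every byte not in 0-127.
tr -cd '\0-\177' < demo.csv > demo.clean
cat demo.clean   # caf, ok
```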

Answered by fedorqui ‘SO stop harming’

Solution #3

If you want to find non-ASCII characters in the shell, try this.


$ perl -ne 'print "$. $_" if m/[\x80-\xFF]/'  utf8.txt


2 Pour être ou ne pas être
4 Byť či nebyť
5 是或不

Answered by Bouramas

Solution #4

By definition, what you’re looking at is corrupted. You appear to be viewing the file in Latin-1 mode, where the three characters ï¿½ represent the three byte values 0xEF 0xBF 0xBD. However, that is the UTF-8 encoding of the Unicode REPLACEMENT CHARACTER U+FFFD, which is the result of attempting to convert bytes from an unknown or undefined encoding into UTF-8, and which would properly be displayed as � (if you have a browser from this century, you should see something like a black diamond with a question mark in it; but this also depends on the font you are using, etc.).

So the answer to your query concerning “how to detect” this occurrence is simple: U+FFFD is a dead giveaway, and the only possible symptom of the mechanism you’re proposing.

This isn’t “illegal Unicode” or “invalid UTF-8” in the sense that it’s a legitimate UTF-8 sequence encoding a valid Unicode code point; rather, the semantics of this particular code point are “this is a replacement character for a character that couldn’t be represented properly,” i.e. invalid input.

The solution to how to prevent it in the first place is easy, but also uninformative: you must determine when and how the wrong encoding occurred, and then repair the procedure that resulted in the improper output.

Try something like this to just get rid of the U+FFFD characters.

perl -CSD -pe 's/\x{FFFD}//g' file

However, the best remedy is to avoid producing these erroneous outputs in the first place.

(In other words, someone took UTF-8 material that had previously been distorted as described above and commanded the computer to convert it from Latin-1 to UTF-8. It’s simple to undo this by converting it “back” to Latin-1, which should recover the original UTF-8 data from before the unnecessary, erroneous conversion.)
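That repair can be sketched as a round trip with iconv (file names here are made up):

```shell
printf '\303\251\n' > orig.txt                   # "é" as UTF-8 (bytes C3 A9)

# The mistaken step: re-encode the UTF-8 bytes as if they were Latin-1,
# producing the mojibake "Ã©".
iconv -f ISO-8859-1 -t UTF-8 orig.txt > mojibake.txt

# The repair: convert it "back" to Latin-1, recovering the original bytes.
iconv -f UTF-8 -t ISO-8859-1 mojibake.txt > restored.txt

cmp -s orig.txt restored.txt && echo restored    # prints "restored"
```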

Answered by tripleee

Solution #5

All non-ASCII characters should be removed using this Perl program:

 #!/usr/bin/perl
 foreach $file (@ARGV) {
   open(IN, $file);
   open(OUT, "> super-temporary-utf8-replacement-file-which-should-never-be-used-EVER");
   while (<IN>) {
     s/[^[:ascii:]]//g;   # delete every non-ASCII character
     print OUT "$_";
   }
   close(IN);
   close(OUT);
   rename "super-temporary-utf8-replacement-file-which-should-never-be-used-EVER", $file;
 }

This works by reading the files given on the command line, as in perl foo bar baz. For each line it replaces every instance of a non-ASCII character with nothing (deletion). The amended line is then written to super-temporary-utf8-replacement-file-which-should-never-be-used-EVER (named so that it doesn’t clobber any other file). It then renames the temporary file to the same name as the original. This accepts ALL ASCII characters (including DEL, NUL, CR, and so on), in case you need them for something specific. If you want only printable characters, simply replace :ascii: with :print: in the s/[^[:ascii:]]//g substitution. I hope this information is useful! If this isn’t what you were searching for, please let me know.

Answered by ASCIIThenANSI
