Coder Perfect

How do I figure out what encoding/codepage a text file uses?

Problem

Our application receives text files (.txt, .csv, etc.) from a variety of sources. Because these files were produced in a different or unknown codepage, they sometimes contain garbage characters when read.

Is it possible to detect the codepage of a text file (automatically)?

The StreamReader constructor’s detectEncodingFromByteOrderMarks option works for UTF-8 and other Unicode files that carry a byte order mark, but I’m looking for a way to detect code pages such as IBM850 and Windows-1252.
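As a sketch of what BOM-based detection amounts to (the table and function name below are my own illustration, not the StreamReader internals), a leading byte order mark maps directly to an encoding:

```python
# Known BOMs, longest first so a UTF-32 BOM isn't mistaken for UTF-16.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding
    return None
```

Note that this can only ever identify Unicode encodings: legacy codepages like IBM850 and Windows-1252 write no BOM, which is exactly why BOM sniffing cannot answer the question above.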

Thank you for your responses; I’ve completed the task.

The files we get are from end-users who have no idea what codepages are. The receivers are also end-users, and they already know the following regarding codepages: Codepages exist, and they are inconvenient.

Solution:

Asked by GvS

Solution #1

You can’t figure out what codepage to use; you have to be told. You can try to guess it by analyzing the bytes, but this can lead to some strange (and occasionally humorous) results. I can’t remember where I saw it, but I’m sure Notepad can be tricked into displaying English text as Chinese.
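To see why guessing from bytes alone can go wrong, note that the very same byte sequence often decodes without error under several encodings. A short Python illustration (the variable names are mine):

```python
# One byte sequence, two "valid" readings: this ambiguity is why
# codepage guessing can silently produce mojibake.
data = "é".encode("utf-8")          # the bytes b'\xc3\xa9'

as_utf8 = data.decode("utf-8")      # the intended text: 'é'
as_cp1252 = data.decode("cp1252")   # decodes without error too, but yields 'Ã©'
```

Neither decode raises an error, so without out-of-band information there is no way to know which reading was intended.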

Anyway, here’s what you should read: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Specifically, Joel says: it does not make sense to have a string without knowing what encoding it uses.

Answered by JV.

Solution #2

If you want to detect non-UTF encodings (i.e., those without a BOM), you’ll have to rely on heuristics and statistical text analysis. You might be interested in reading Mozilla’s paper on universal charset detection (same link, with better formatting via the Wayback Machine).
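Mozilla’s detector builds on language-specific character-distribution models. As a much simpler taste of the statistical idea (this toy scorer and its character set are my own sketch, not Mozilla’s algorithm), one can rank candidate codepages by how "ordinary" their decoded output looks:

```python
def guess_codepage(data, candidates=("cp1252", "cp850")):
    """Toy statistical detector: try strict UTF-8 first, then score each
    legacy codepage by how plausible its decoded text looks."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Characters we expect to see often in Western-European text.
    common = set(
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789 .,;:!?'\"-\n"
        "éèêëàâäçîïôöùûüñ"
    )

    def score(enc):
        # Count decoded characters that fall in the "common" set.
        return sum(ch in common for ch in data.decode(enc, errors="replace"))

    return max(candidates, key=score)
```

For example, guess_codepage("café".encode("cp850")) prefers cp850, because the byte 0x82 decodes to 'é' there but to a low quotation mark under cp1252. Real detectors use far richer models (byte-pair frequencies per language), but the scoring principle is the same.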

Answered by Tomer Gabel

Solution #3

Have you tried the Mozilla Universal Charset Detector’s C# port?

Example from http://code.google.com/p/ude/

using System;
using System.IO;

public static void Main(string[] args)
{
    string filename = args[0];
    using (FileStream fs = File.OpenRead(filename)) {
        // Feed the whole stream to the detector, then signal end of data.
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null) {
            Console.WriteLine("Charset: {0}, confidence: {1}",
                cdet.Charset, cdet.Confidence);
        } else {
            Console.WriteLine("Detection failed.");
        }
    }
}

Answered by ITmeze

Solution #4

This (the claim that you can’t detect the codepage) is obviously untrue. Every web browser features a universal charset detector to deal with pages that have no indication of encoding at all. One can be found in Firefox. You can look at the code to see how it works, and you can find some documentation here. It’s just a heuristic, but one that works exceptionally well.

It is even possible to discern the language given a sufficient amount of text.

Here’s another one I came across when searching on Google:

Answered by shoosh

Solution #5

I realize it’s late to answer this question, and this approach won’t appeal to everyone (due to its English-centric slant and its lack of statistical/empirical testing), but it has worked great for me, especially when processing CSV data:

http://www.architectshack.com/TextFileEncodingDetector.ashx

Advantages:

Note that this class was written by me, so take it with a grain of salt! 🙂

Answered by Tao

Post is based on https://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file