Problem
I have an application that deals with clients from all over the world, so anything that goes into my databases must be encoded in UTF-8.
The main problem for me is that I don’t know what encoding the source of any string is going to be: it could be from a text box (using <form accept-charset="utf-8"> is only useful if the user has actually submitted the form), or it could be from an uploaded text file, so I really have no control over the input.
What I need is a function or class that ensures that everything coming into my database is, as far as possible, encoded in UTF-8. I tried iconv(mb_detect_encoding($text), "UTF-8", $text); but it doesn’t work (it returns ‘fianc’ if the input is ‘fiancée’). =/ I’ve tried a lot of different things.
For file uploads, I like the idea of asking the end user to specify the encoding they use, and show them previews of what the output will look like, but this doesn’t help against nasty hackers (in fact, it could make their life a little easier).
I’ve looked through the other SO questions on the topic, but they all seem to differ subtly, such as “I need to parse RSS feeds” or “I scrape data from websites” (or, more importantly, “You can’t”).
But there has to be something that at least tries!
Asked by Grim…
Solution #1
What you’re requesting is really difficult. Getting the user to specify the encoding is the ideal option if at all possible. It shouldn’t make preventing an attack either easier or harder.
You could, however, try this:
iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);
Setting it to strict (the third argument, true) might help you get a better result.
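As a minimal sketch of the strict-mode idea, the helper below first checks whether the input is already valid UTF-8 and only then falls back to strict detection. The function name toUtf8OrNull and the candidate lists are assumptions for illustration, and the up-front mb_check_encoding() test is an addition that is not part of the answer above.

```php
<?php
// Hypothetical helper (not from the answer): strict detection with a safe fallback.
function toUtf8OrNull(string $text, ?array $candidates = null): ?string
{
    // Valid UTF-8 is passed through untouched, so it is never converted twice.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Strict mode (third argument true) rejects any candidate encoding
    // in which $text contains invalid byte sequences.
    $from = mb_detect_encoding($text, $candidates ?? mb_detect_order(), true);
    if ($from === false) {
        return null; // no candidate matched; the caller decides what to do
    }
    return mb_convert_encoding($text, 'UTF-8', $from);
}

// "fianc\xE9e" is 'fiancée' in ISO-8859-1; the list of candidates is an assumption.
echo bin2hex(toUtf8OrNull("fianc\xE9e", ['ISO-8859-1'])), "\n";
```

Returning null instead of guessing keeps the decision about undetectable input (reject it, ask the user, log it) with the calling code.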
Answered by Jeff Day
Solution #2
We have four popular encodings in motherland Russia, so your question is in great demand here.
Because code pages overlap, you can’t detect an encoding only by looking at the char codes of symbols. Some code pages for different languages even overlap completely. So we need another approach.
The only way to deal with unknown encodings is to work with probabilities. So, rather than answering the question “what is the encoding of this text?”, we try to answer “what is the most likely encoding of this text?”
This method was devised by a reader of a popular Russian tech blog:
For each encoding you want to support, build a probability table of char codes. You can build it from some large texts in your own language (e.g. some fiction: use Shakespeare for English and Tolstoy for Russian, lol). You’ll get something like this:
encoding_1:
190 => 0.095249209893009,
222 => 0.095249209893009,
...
encoding_2:
239 => 0.095249209893009,
207 => 0.095249209893009,
...
encoding_N:
charcode => probability
Next, take the text in the unknown encoding and, for each encoding in your “probability dictionary”, look up the probability of every symbol in the text and add those probabilities together. The encoding with the highest score is most likely the winner. Larger texts give better results.
If you’re interested, I’d be happy to help you with this task. Using a two-charcode probability list would improve the accuracy considerably.
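The scoring idea can be sketched roughly as follows. This is only an illustration under assumptions: the function names buildByteTable and guessEncoding are made up, the one-sentence sample stands in for a large corpus, only single-byte encodings are considered, and the script file itself is assumed to be saved as UTF-8.

```php
<?php
// Build a byte => relative frequency table for one single-byte encoding
// from a UTF-8 sample text (a real table would use a large corpus).
function buildByteTable(string $utf8Sample, string $encoding): array
{
    $bytes = mb_convert_encoding($utf8Sample, $encoding, 'UTF-8');
    $total = max(1, strlen($bytes));
    $table = [];
    foreach (count_chars($bytes, 1) as $byte => $count) {
        if ($byte >= 0x80) { // ASCII bytes are the same in these code pages
            $table[$byte] = $count / $total;
        }
    }
    return $table;
}

// Add up the table probability of every byte in $text for each encoding;
// the encoding with the highest total is the most likely one.
function guessEncoding(string $text, array $tables): ?string
{
    $best = null;
    $bestScore = 0.0;
    $counts = count_chars($text, 1);
    foreach ($tables as $encoding => $table) {
        $score = 0.0;
        foreach ($counts as $byte => $count) {
            $score += $count * ($table[$byte] ?? 0.0);
        }
        if ($score > $bestScore) {
            $best = $encoding;
            $bestScore = $score;
        }
    }
    return $best; // null when no byte matched any table
}

// Tiny stand-in corpus (the opening sentence of Anna Karenina).
$sample = 'Все счастливые семьи похожи друг на друга, '
        . 'каждая несчастливая семья несчастлива по-своему.';
$tables = [
    'Windows-1251' => buildByteTable($sample, 'Windows-1251'),
    'KOI8-R'       => buildByteTable($sample, 'KOI8-R'),
];

$unknown = mb_convert_encoding('Привет, как дела?', 'KOI8-R', 'UTF-8');
echo guessEncoding($unknown, $tables), "\n";
```

With such a small sample the tables are crude; as the answer notes, larger training texts and larger inputs give better results.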
By the way, mb_detect_encoding does not work. Yes, at all. Please take a look at the mb_detect_encoding source code in “ext/mbstring/libmbfl/mbfl/mbfl_ident.c”.
Answered by Oroboros102
Solution #3
You’ve probably tried it already, but why not use the mb_convert_encoding function instead? If you pass it a list of candidate source encodings, it will try to detect which one applies; without one, it assumes the internal encoding.
I also attempted to run:
$text = "fiancée";
echo mb_convert_encoding($text, "UTF-8");
echo "<br/><br/>";
echo iconv(mb_detect_encoding($text), "UTF-8", $text);
and the outcomes are the same in both cases. What makes you think your content was reduced to ‘fianc’? Is it in a database or a browser?
Answered by Alexey Gerasimov
Solution #4
There is no foolproof method for determining the charset of a string. There are a few methods for attempting to guess it. mb_detect_encoding is one of these methods, and it’s probably/currently the best in PHP. It scans your string for byte sequences that only occur in particular charsets. Depending on your string, there may or may not be such distinguishing occurrences.
Consider the differences between ISO-8859-1 and ISO-8859-15 (http://en.wikipedia.org/wiki/ISO/IEC_8859-15#Changes_from_ISO-8859-1).
Only a few characters differ, and to make matters worse, they’re represented by the same bytes. Without knowing the encoding, there’s no way to tell whether byte 0xA4 in your string is supposed to represent ¤ or €, so there’s no way to determine the exact charset of a string you’re handed without that information.
(Note: you could add a human factor, or use a more complex scanning technique (as Oroboros102 suggests) to try to figure out whether the character should be ¤ or € depending on the surrounding context, but this seems like a stretch.)
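To make the ambiguity concrete, here is a small illustration (not part of the original answer) that decodes the same byte under both charsets. The variable names are arbitrary.

```php
<?php
$byte = "\xA4"; // one and the same input byte

// Interpreted as ISO-8859-1 it is the currency sign (U+00A4).
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
// Interpreted as ISO-8859-15 it is the euro sign (U+20AC).
$asLatin9 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-15');

echo bin2hex($asLatin1), "\n"; // c2a4
echo bin2hex($asLatin9), "\n"; // e282ac
```

Both conversions succeed without any error, which is exactly why byte-level inspection alone cannot tell the two charsets apart.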
There are more discernible distinctions between UTF-8 and ISO-8859-1, for example, so it’s still worth attempting to figure it out if you’re unsure, though you can’t and shouldn’t rely on it.
Interesting read: http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html#how-do-i-determine-the-charset-encoding-of-a-string
However, there are various techniques to ensure that the correct charset is used. When it comes to forms, try to enforce UTF-8 as much as possible (check out Snowman to ensure your submission is UTF-8 in all browsers: http://intertwingly.net/blog/2010/07/29/Rails-and-Snowmen). After that, you can be reasonably sure that all text submitted through your forms is UTF-8. For uploaded files, consider running the Unix ‘file -i’ command on the file via e.g. exec() (if possible on your server) to help with detection (using the document’s BOM). When scraping data, you can read the HTTP headers, which normally specify the charset. When parsing XML files, look for a charset definition in the XML meta-data.
Rather than trying to guess the charset automatically, you should try to ensure a certain charset yourself, or acquire a definition from the source you’re getting it from (if relevant), before resorting to detection.
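One way to express that order of preference in code is sketched below, assuming the caller can supply a charset declared by the source (an HTTP header, an XML declaration, or a user's choice). The function name toUtf8WithDeclaredCharset and its parameters are hypothetical.

```php
<?php
// Hypothetical helper: trust explicit signals first, never guess silently.
function toUtf8WithDeclaredCharset(string $bytes, ?string $declared = null): string
{
    // A UTF-8 byte order mark is an explicit signal; strip it before validating.
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        $bytes = substr($bytes, 3);
    }
    // Valid UTF-8 is accepted as is.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Otherwise rely on the charset the source declared, if mbstring knows it.
    $known = array_map('strtoupper', mb_list_encodings());
    if ($declared !== null && in_array(strtoupper($declared), $known, true)) {
        return mb_convert_encoding($bytes, 'UTF-8', $declared);
    }
    throw new InvalidArgumentException('Input is not UTF-8 and no usable charset was declared.');
}

// Example: a Latin-1 payload whose charset came from an HTTP header.
echo bin2hex(toUtf8WithDeclaredCharset("fianc\xE9e", 'ISO-8859-1')), "\n";
```

Throwing when nothing is declared leaves room to fall back to detection as a deliberate, separate step rather than a hidden default.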
Answered by matthiasmullie
Solution #5
Here you’ll find some excellent responses and attempts to answer your question. I’m no encoding expert, but I get your goal for a UTF-8 stack that extends all the way to your database. For tables, columns, and connections, I’ve been using MySQL’s utf8mb4 encoding.
My problem boils down to: “I just want my sanitizers, validators, business logic, and prepared statements to cope with UTF-8 when data comes through HTML forms or e-mail registration links.” So, in my humble opinion, I began with the following concept:
From my abstract composition class Sanitizer
private function isUTF8($encoding, $value)
{
    // Round-trips through ISO-8859-1, so this only succeeds for text whose
    // characters all exist in ISO-8859-1. Note that utf8_encode() and
    // utf8_decode() are deprecated as of PHP 8.2.
    return (($encoding === 'UTF-8') && (utf8_encode(utf8_decode($value)) === $value));
}

private function utf8tify(&$value)
{
    $encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];
    mb_internal_encoding('UTF-8');
    mb_substitute_character(0xfffd); // REPLACEMENT CHARACTER
    mb_detect_order($encodings);
    // Strict detection: returns false if no candidate encoding fits.
    $stringEncoding = mb_detect_encoding($value, $encodings, true);
    if (!$stringEncoding) {
        $value = null;
        throw new \RuntimeException("Unable to identify character encoding in sanitizer.");
    }
    if ($this->isUTF8($stringEncoding, $value)) {
        return; // Already UTF-8; nothing to convert.
    } else {
        // Convert, then verify that the result is now detected as UTF-8.
        $value = mb_convert_encoding($value, 'UTF-8', $stringEncoding);
        $stringEncoding = mb_detect_encoding($value, $encodings, true);
        if ($this->isUTF8($stringEncoding, $value)) {
            return;
        } else {
            $value = null;
            throw new \RuntimeException("Unable to convert character encoding from ISO-8859-1, or ASCII, to UTF-8 in Sanitizer.");
        }
    }
    return;
}
One could argue that encoding issues should be separated from my abstract Sanitizer class, and that I should simply inject an Encoder object into a specific child instance of Sanitizer. However, the primary flaw in my method is that, without more knowledge, I simply reject encoding types that I don’t want (and rely on the PHP mb_* functions to do so). Without more research, I can’t say whether this hurts some users (or whether I am losing important information). So I need to learn more. I came across this article:
What every programmer should know about encodings and character sets in order to work with text
What happens if I add encrypted data (using OpenSSL or mcrypt) to my email registration links? Could that interfere with decoding? What about Windows-1252, for example? What about the security implications? The use of utf8_decode() and utf8_encode() in Sanitizer::isUTF8 is questionable.
The PHP mb_* functions have been criticized for having flaws. I’ve never looked into iconv, but if it’s better than the mb_* functions, please let me know.
Answered by Anthony Rutledge
Post is based on https://stackoverflow.com/questions/7979567/php-convert-any-string-to-utf-8-without-knowing-the-original-character-set-or