Problem
What is the best way to remove non-ASCII characters from a string in C#?
Asked by philcruz
Solution #1
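// Requires: using System.Text.RegularExpressions;
string s = "søme string";
// Remove every run of characters outside the ASCII range U+0000..U+007F.
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty); // s is now "sme string"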
Answered by philcruz
Solution #2
Here’s a pure .NET solution that doesn’t use regular expressions:
string inputString = "Räksmörgås";
string asAscii = Encoding.ASCII.GetString(
    Encoding.Convert(
        Encoding.UTF8,
        Encoding.GetEncoding(
            Encoding.ASCII.EncodingName,
            new EncoderReplacementFallback(string.Empty),
            new DecoderExceptionFallback()
        ),
        Encoding.UTF8.GetBytes(inputString)
    )
);
It may look complicated, but it is simple to use. It converts the string with the .NET ASCII encoding. UTF8 is used as the source encoding during the conversion because it can represent any of the original characters. The EncoderReplacementFallback replaces every non-ASCII character with an empty string.
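A minimal sketch of the same idea wrapped in a reusable helper, assuming the UTF8 round trip can be skipped by encoding and decoding with one custom ASCII encoding. The class name AsciiStripper and the method name Strip are hypothetical and not part of the answer above; the encoding name "us-ascii" is passed as a literal.

```csharp
using System.Text;

// Hypothetical helper, named for illustration only.
public static class AsciiStripper
{
    // ASCII encoding whose encoder drops unrepresentable characters
    // instead of substituting '?', which is what Encoding.ASCII does by default.
    private static readonly Encoding StrictAscii = Encoding.GetEncoding(
        "us-ascii",
        new EncoderReplacementFallback(string.Empty),
        new DecoderExceptionFallback());

    public static string Strip(string input)
    {
        // Encoding to bytes discards every non-ASCII character;
        // decoding the surviving bytes yields the stripped string.
        byte[] asciiBytes = StrictAscii.GetBytes(input);
        return StrictAscii.GetString(asciiBytes);
    }
}
```

With this sketch, AsciiStripper.Strip("Räksmörgås") should return "Rksmrgs".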
Answered by bzlm
Solution #3
MonsCamus, I assume, meant:
parsememo = Regex.Replace(parsememo, @"[^\u0020-\u007E]", string.Empty);
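The range here differs from the one in Solution #1: \u0020-\u007E covers only printable ASCII (space through tilde), so control characters such as tabs and newlines are removed as well. A small side-by-side sketch, with hypothetical class and method names chosen for illustration:

```csharp
using System.Text.RegularExpressions;

// Hypothetical names, for illustration only.
public static class AsciiRegexDemo
{
    // Solution #1 range: keeps all 128 ASCII code points, control characters included.
    public static string KeepAllAscii(string input) =>
        Regex.Replace(input, @"[^\u0000-\u007F]+", string.Empty);

    // Solution #3 range: keeps only printable ASCII (space through tilde).
    public static string KeepPrintableAscii(string input) =>
        Regex.Replace(input, @"[^\u0020-\u007E]", string.Empty);
}
```

For the input "søme\tstring\n", KeepAllAscii should return "sme\tstring\n", while KeepPrintableAscii should return "smestring".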
Answered by Josh
Solution #4
If you want to convert Latin accented characters to their unaccented equivalents rather than strip them, take a look at this question: What is the best way to convert 8-bit characters to 7-bit characters? (For example, Ü to U)
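One common way to do that conversion is Unicode decomposition followed by removal of the combining marks. The sketch below shows that approach under that assumption; it is not taken from the linked question, and the class and method names are hypothetical.

```csharp
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class DiacriticsRemover
{
    public static string RemoveDiacritics(string input)
    {
        // FormD splits a character such as 'Ü' into 'U' plus a combining diaeresis (U+0308).
        string decomposed = input.Normalize(NormalizationForm.FormD);

        // Drop the combining marks and keep the base characters.
        var baseChars = decomposed.Where(c =>
            CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);

        return new string(baseChars.ToArray()).Normalize(NormalizationForm.FormC);
    }
}
```

With this sketch, RemoveDiacritics("Räksmörgås") should return "Raksmorgas". Letters that have no canonical decomposition, such as 'ø', pass through unchanged, so a stripping step may still be needed afterwards.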
Answered by sinelaw
Solution #5
I’ve created a pure LINQ solution, inspired by philcruz’s Regular Expression solution.
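// Requires: using System.Collections.Generic; using System.Linq; using System.Text;
// Extension methods must live in a static class; the class name is a placeholder.
public static class AsciiExtensions
{
    // Replaces every character above U+007F with nil (a space by default).
    public static string PureAscii(this string source, char nil = ' ')
    {
        var max = '\u007F';
        // A char can never be below '\u0000', so only the upper bound is checked.
        return source.Select(c => c > max ? nil : c).ToText();
    }

    // Concatenates a sequence of characters into a string.
    public static string ToText(this IEnumerable<char> source)
    {
        var buffer = new StringBuilder();
        foreach (var c in source)
            buffer.Append(c);
        return buffer.ToString();
    }
}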
This code has not been tested.
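Note that PureAscii substitutes nil (a space by default) for each non-ASCII character instead of removing it. If removal is what you want, a shorter LINQ variant might look like the sketch below; the class and method names are hypothetical.

```csharp
using System.Linq;

// Hypothetical helper, named for illustration only.
public static class AsciiLinq
{
    // Keeps only characters in U+0000..U+007F and drops the rest.
    public static string StripNonAscii(string source) =>
        new string(source.Where(c => c <= '\u007F').ToArray());
}
```

With this sketch, AsciiLinq.StripNonAscii("søme string") should return "sme string".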
Answered by Bent Rasmussen
Post is based on https://stackoverflow.com/questions/123336/how-can-you-strip-non-ascii-characters-from-a-string-in-c