I need to compare 2 strings in C# and treat accented letters the same as non-accented letters. For example:
string s1 = "hello";
string s2 = "héllo";
s1.Equals(s2, StringComparison.InvariantCultureIgnoreCase);
s1.Equals(s2, StringComparison.OrdinalIgnoreCase);
These 2 strings need to be the same (as far as my application is concerned), but both of these statements evaluate to false. Is there a way in C# to do this?
-
try this overload on the String.Compare Method.
String.Compare Method (String, String, Boolean, CultureInfo)
It produces a int value based on the compare operations including cultureinfo. the example in the page compares "Change" in en-US and en-CZ. CH in en-CZ is a single "letter".
example from the link
using System; using System.Globalization; class Sample { public static void Main() { String str1 = "change"; String str2 = "dollar"; String relation = null; relation = symbol( String.Compare(str1, str2, false, new CultureInfo("en-US")) ); Console.WriteLine("For en-US: {0} {1} {2}", str1, relation, str2); relation = symbol( String.Compare(str1, str2, false, new CultureInfo("cs-CZ")) ); Console.WriteLine("For cs-CZ: {0} {1} {2}", str1, relation, str2); } private static String symbol(int r) { String s = "="; if (r < 0) s = "<"; else if (r > 0) s = ">"; return s; } } /* This example produces the following results. For en-US: change < dollar For cs-CZ: change > dollar */
therefor for accented languages you will need to get the culture then test the strings based on that.
-
The following method
CompareIgnoreAccents(...)
works on your example data. Here is the article where I got my background information: http://www.codeproject.com/KB/cs/EncodingAccents.aspxprivate static bool CompareIgnoreAccents(string s1, string s2) { return string.Compare( RemoveAccents(s1), RemoveAccents(s2), StringComparison.InvariantCultureIgnoreCase) == 0; } private static string RemoveAccents(string s) { Encoding destEncoding = Encoding.GetEncoding("iso-8859-8"); return destEncoding.GetString( Encoding.Convert(Encoding.UTF8, destEncoding, Encoding.UTF8.GetBytes(s))); }
I think an extension method would be better:
public static string RemoveAccents(this string s) { Encoding destEncoding = Encoding.GetEncoding("iso-8859-8"); return destEncoding.GetString( Encoding.Convert(Encoding.UTF8, destEncoding, Encoding.UTF8.GetBytes(s))); }
Then the use would be this:
if(string.Compare(s1.RemoveAccents(), s2.RemoveAccents(), true) == 0) { ...
-
Write a method normalize(String s) that takes in a string and turns accented letters into non accented one. Then instead of comparing string x to string y, compare normalize(string x) to normalize(string y).
-
Here's a function that strips diacritics from a string:
static string RemoveDiacritics(string sIn) { string sFormD = sIn.Normalize(NormalizationForm.FormD); StringBuilder sb = new StringBuilder(); foreach (char ch in sFormD) { UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch); if (uc != UnicodeCategory.NonSpacingMark) { sb.Append(ch); } } return (sb.ToString().Normalize(NormalizationForm.FormC)); }
More details here.
The principle is that is it turns 'é' into 2 successive chars 'e', acute. It then iterates through the chars and skips the diacritics.
"héllo" becomes "he<acute>llo", which in turn becomes "hello".
Debug.Assert("hello"==RemoveDiacritics("héllo"));
This line doesn't assert, which is what you want.
0 comments:
Post a Comment