Sunday, March 6, 2011

Ignoring accented letters in string comparison

I need to compare 2 strings in C# and treat accented letters the same as non-accented letters. For example:

string s1 = "hello";
string s2 = "héllo";

s1.Equals(s2, StringComparison.InvariantCultureIgnoreCase);
s1.Equals(s2, StringComparison.OrdinalIgnoreCase);

These 2 strings need to be the same (as far as my application is concerned), but both of these statements evaluate to false. Is there a way in C# to do this?

From stackoverflow
  • Try this overload of the String.Compare method:

    String.Compare Method (String, String, Boolean, CultureInfo)

    It returns an int based on a comparison that takes the CultureInfo into account. The example on that page compares "change" and "dollar" in en-US and cs-CZ; in cs-CZ, "ch" is a single "letter" that sorts after "d".

    Example from the link:

    using System;
    using System.Globalization;
    
    class Sample {
        public static void Main() {
            String str1 = "change";
            String str2 = "dollar";
            String relation = null;
    
            relation = symbol( String.Compare(str1, str2, false, new CultureInfo("en-US")) );
            Console.WriteLine("For en-US: {0} {1} {2}", str1, relation, str2);
    
            relation = symbol( String.Compare(str1, str2, false, new CultureInfo("cs-CZ")) );
            Console.WriteLine("For cs-CZ: {0} {1} {2}", str1, relation, str2);
        }
    
        private static String symbol(int r) {
            String s = "=";
            if      (r < 0) s = "<";
            else if (r > 0) s = ">";
            return s;
        }
    }
    /*
    This example produces the following results.
    For en-US: change < dollar
    For cs-CZ: change > dollar
    */
    

    Therefore, for accented languages you will need to get the culture and then compare the strings based on it.

    http://msdn.microsoft.com/en-us/library/hyxc48dt.aspx
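
    Note that a culture-aware comparison on its own still treats 'e' and 'é' as different letters. As a minimal sketch (the class and method names here are hypothetical, not taken from the linked page), CompareInfo.Compare accepts a CompareOptions value, and CompareOptions.IgnoreNonSpace asks it to ignore nonspacing combining characters such as diacritics. The invariant culture is an assumption; a specific CultureInfo could be used instead.

    ```csharp
    using System.Globalization;

    public static class AccentInsensitive
    {
        // Hypothetical helper: returns true when the two strings differ
        // only by letter case and/or diacritics.
        public static bool AreEqual(string s1, string s2)
        {
            // Assumption: the invariant culture is good enough for this comparison.
            CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;

            // IgnoreNonSpace skips combining marks; IgnoreCase skips case differences.
            return compareInfo.Compare(
                s1, s2, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase) == 0;
        }
    }
    ```

    With the strings from the question, AccentInsensitive.AreEqual("hello", "héllo") should return true, with no need to build stripped copies of the strings first.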

  • The following method CompareIgnoreAccents(...) works on your example data. Here is the article where I got my background information: http://www.codeproject.com/KB/cs/EncodingAccents.aspx

    // Requires: using System; using System.Text;
    private static bool CompareIgnoreAccents(string s1, string s2)
    {
        return string.Compare(
            RemoveAccents(s1), RemoveAccents(s2), StringComparison.InvariantCultureIgnoreCase) == 0;
    }
    
    private static string RemoveAccents(string s)
    {
        // ISO-8859-8 has no accented Latin letters, so this relies on the
        // encoding's best-fit fallback to map them to plain letters.
        Encoding destEncoding = Encoding.GetEncoding("iso-8859-8");
    
        return destEncoding.GetString(
            Encoding.Convert(Encoding.UTF8, destEncoding, Encoding.UTF8.GetBytes(s)));
    }
    

    I think an extension method would be better:

    // Extension methods must be declared inside a static class.
    public static string RemoveAccents(this string s)
    {
        Encoding destEncoding = Encoding.GetEncoding("iso-8859-8");
    
        return destEncoding.GetString(
            Encoding.Convert(Encoding.UTF8, destEncoding, Encoding.UTF8.GetBytes(s)));
    }
    

    Then the use would be this:

    if (string.Compare(s1.RemoveAccents(), s2.RemoveAccents(), true) == 0) {
       ...
    }
    
  • Write a method normalize(String s) that takes a string and turns accented letters into non-accented ones. Then, instead of comparing string x to string y, compare normalize(x) to normalize(y).

  • Here's a function that strips diacritics from a string:

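    // Requires: using System.Globalization; using System.Text;
    static string RemoveDiacritics(string sIn)
    {
      // Decompose each accented letter into a base letter plus combining marks.
      string sFormD = sIn.Normalize(NormalizationForm.FormD);
      StringBuilder sb = new StringBuilder();
    
      foreach (char ch in sFormD)
      {
        // Keep every char except the combining marks (the diacritics).
        UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch);
        if (uc != UnicodeCategory.NonSpacingMark)
        {
          sb.Append(ch);
        }
      }
    
      // Recompose whatever is left.
      return sb.ToString().Normalize(NormalizationForm.FormC);
    }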
    

    The principle is that it turns 'é' into two successive chars: 'e' followed by a combining acute accent. It then iterates through the chars and skips the diacritics.

    "héllo" becomes "he<acute>llo", which in turn becomes "hello".

    Debug.Assert("hello"==RemoveDiacritics("héllo"));
    

    This assertion passes (it does not fire), which is what you want.
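
    To see the decomposition step for yourself, here is a small sketch (the class and method names are hypothetical) that lists every char of the FormD string with its code point and Unicode category. The combining acute accent shows up as its own char in the NonSpacingMark category, which is exactly what the loop above filters out.

    ```csharp
    using System.Collections.Generic;
    using System.Globalization;
    using System.Text;

    public static class DecompositionDemo
    {
        // Hypothetical helper: describes each char of the decomposed (FormD)
        // string as "U+XXXX Category".
        public static List<string> Describe(string s)
        {
            var lines = new List<string>();
            foreach (char ch in s.Normalize(NormalizationForm.FormD))
            {
                lines.Add(string.Format(
                    "U+{0:X4} {1}", (int)ch, CharUnicodeInfo.GetUnicodeCategory(ch)));
            }
            return lines;
        }
    }
    ```

    DecompositionDemo.Describe("hé") gives three entries: "U+0068 LowercaseLetter", "U+0065 LowercaseLetter" and "U+0301 NonSpacingMark". Letters that have no canonical decomposition, such as 'ø' or 'ł', stay a single char and therefore pass through RemoveDiacritics unchanged.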
