Thursday, March 31, 2011

UTF-8 text is garbled when form is posted as multipart/form-data

I'm uploading a file to the server. The file upload HTML form has 2 fields:

  1. File name - An HTML text box where the user can give a name in any language.
  2. File upload - An HTML 'file' input where the user can specify a file from disk to upload.

When the form is submitted, the file contents are received properly. However, when the file name (point 1 above) is read, it is garbled. ASCII characters are displayed properly, but non-ASCII characters (for example in German or French names) come out wrong.

In the servlet method, the request's character encoding is set to UTF-8. I also tried the filter described at http://stackoverflow.com/questions/29751/problems-while-submitting-a-utf-8-form-textarea-with-jquery-ajax, but it doesn't seem to help. Only the file name seems to be garbled.

The MySQL table where the file name goes supports UTF-8. I gave random non-English characters & they are stored/displayed properly.

Using Fiddler, I monitored the request & all the POST data is passed correctly. I'm trying to identify how/where the data could get garbled. Any help will be greatly appreciated.

From stackoverflow
  • The filter is key for IE. A few other things to check:

    What is the page encoding and character set? Both should be UTF-8

    <%@ page language="java" contentType="text/html; charset=UTF-8" pageEncoding="UTF-8"%>
    

    What is the character set in the meta tag?

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    

    Does your MySQL connection string specify UTF-8? e.g.

    jdbc:mysql://127.0.0.1/dbname?requireSSL=false&useUnicode=true&characterEncoding=UTF-8
    
  • I had the same problem using Apache commons-fileupload. I never found out what causes it, especially since I already set UTF-8 in the following places:

      1. HTML meta tag
      2. Form accept-charset attribute
      3. Tomcat filter on every request that sets the "UTF-8" encoding

    -> My solution was to explicitly convert Strings from ISO-8859-1 (or whatever the default encoding of your platform is) to UTF-8:

    new String (s.getBytes ("iso-8859-1"), "UTF-8");
    

    hope that helps

    David García González : What could happen if commons-fileupload fixed this and the request is in UTF-8? Perhaps when you execute s.getBytes ("iso-8859-1") the bytes are not in the iso-8859-1 encoding.
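
To make the workaround above and the concern in the comment concrete, here is a minimal sketch using only the JDK. The class and method names are made up for illustration. It assumes the container decoded UTF-8 bytes as ISO-8859-1; if the string was already decoded correctly, the same conversion damages it, which is the risk the comment points at.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are not from the original answer.
final class MojibakeRepair {

    // Undo a wrong ISO-8859-1 decode of bytes that were really UTF-8.
    static String redecode(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "M\u00fcller"; // "Müller"

        // Simulate the bug: UTF-8 bytes on the wire, decoded as ISO-8859-1.
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        String garbled = new String(wire, StandardCharsets.ISO_8859_1);

        // Re-decoding repairs the garbled value.
        System.out.println(redecode(garbled).equals(original));  // true

        // Applied to an already correct value, it corrupts the non-ASCII character.
        System.out.println(redecode(original).equals(original)); // false
    }
}
```

So the conversion is only safe while you know the value was mis-decoded; once the decoding is fixed upstream, the extra step has to go.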
  • The filter and the Tomcat setup for UTF-8 URIs only matter if you're passing the data via the URL's query string, as you would with an HTTP GET. If you're using a POST, with the query string in the HTTP message's body, what matters is the content-type of the request; it is up to the browser to set the content-type to UTF-8 and send the content with that encoding.

    The only way to really do this is by telling the browser that you can only accept UTF-8 by setting the Accept-Charset header on every response to "UTF-8;q=1,ISO-8859-1;q=0.6". This will put UTF-8 as the best quality and the default charset, ISO-8859-1, as acceptable, but a lower quality.

    When you say the file name is garbled, is it garbled in the HttpServletRequest.getParameter's return value?

  • You do not use UTF-8 to encode text data for HTML forms. The HTML standard defines two encodings, and the relevant part of that standard is here. The "old" encoding, that handles ASCII, is application/x-www-form-urlencoded. The new one, that works properly, is multipart/form-data.

    Specifically, the form declaration looks like this:

     <FORM action="http://server.com/cgi/handle"
           enctype="multipart/form-data"
           method="post">
       <P>
       What is your name? <INPUT type="text" name="submit-name"><BR>
       What files are you sending? <INPUT type="file" name="files"><BR>
       <INPUT type="submit" value="Send"> <INPUT type="reset">
     </FORM>
    

    And I think that's all you have to worry about - the webserver should handle it. If you are writing something that directly reads the InputStream from the web client, then you will need to read RFC 2045 and RFC 2046.
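
If you do end up reading the body yourself, the following is a rough sketch of where the charset decision is made. It pulls the first part out of a tiny multipart/form-data body and decodes it as UTF-8. The class name, method name, and boundary are made up, and the sample part carries no charset of its own, as is typical for text fields, so the receiving code has to choose one. Real bodies need a proper parser.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: class, method, and boundary are made up for illustration.
final class MultipartFieldSketch {

    // Return the value of the first part of a multipart/form-data body, decoded as UTF-8.
    // Does not handle multiple parts, binary content, or header parsing.
    static String firstPartValue(byte[] body, String boundary) {
        // ISO-8859-1 maps each byte to one char, so char indexes equal byte offsets.
        String raw = new String(body, StandardCharsets.ISO_8859_1);
        int headerEnd = raw.indexOf("\r\n\r\n");            // blank line after the part headers
        int start = headerEnd + 4;
        int end = raw.indexOf("\r\n--" + boundary, start);  // next boundary delimiter
        if (headerEnd < 0 || end < 0) {
            throw new IllegalArgumentException("not a multipart body");
        }
        // This is the step where the encoding is chosen.
        return new String(body, start, end - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String boundary = "XyZ";
        String body = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"submit-name\"\r\n"
                + "\r\n"
                + "Gr\u00fc\u00dfe.txt\r\n"
                + "--" + boundary + "--\r\n";
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        System.out.println(firstPartValue(bytes, boundary).equals("Gr\u00fc\u00dfe.txt")); // true
    }
}
```

Decoding the same slice with ISO-8859-1 instead would give exactly the kind of garbled name described in the question.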

  • I had the same problem and it turned out that in addition to specifying the encoding in the Filter

    request.setCharacterEncoding("UTF-8");
    response.setCharacterEncoding("UTF-8");
    

    it is necessary to add "accept-charset" to the form

    <form method="post" enctype="multipart/form-data" accept-charset="UTF-8" >
    

    and run the JVM with

    -Dfile.encoding=UTF-8
    

    The HTML meta tag is not necessary if you send it in the HTTP header using response.setCharacterEncoding().
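
A quick way to see what the -Dfile.encoding flag affects is to compare decoding with the JVM default charset against decoding with an explicit one. This is only a sketch with made-up names. Note that on JDK 18 and later the default charset is already UTF-8, so the flag mostly matters on older JVMs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are for illustration only.
final class DefaultCharsetCheck {

    // Decoding with an explicit charset does not depend on -Dfile.encoding.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9}; // UTF-8 bytes for "é"

        // The charset used by calls that do not name one, e.g. new String(bytes).
        System.out.println("default charset: " + Charset.defaultCharset());

        // Whether the implicit decode agrees with the explicit one depends on the default.
        System.out.println("implicit matches explicit: "
                + new String(bytes).equals(decodeUtf8(bytes)));
    }
}
```

If the second line prints false, some code path that relies on the platform default will garble UTF-8 input unless the flag is set or the charset is passed explicitly.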
