Encoding issues. Solutions for Linux and within Java apps.

Phew, encoding! Have you ever had trouble with it? No? You must be lucky!

Even non-developers can run into problems with it, e.g. if they use different operating systems.

What is encoding? And what’s so difficult about it?

First: encoding (better: character encoding) defines how characters are stored as bytes so that they can be displayed correctly, e.g. in your editor. For a precise wording look at the Wikipedia definition. Update: here is a nice introduction.

For example, if your editor only reads ASCII files, everything is very simple: it takes every 8 bits of the bitstream as a number (ASCII itself only defines the values 0 to 127). Then it interprets this number according to the ASCII table. So if it finds a 97 (this is 0x61 in hexadecimal), it prints ‘a’.

(BTW: look at this nice ASCII-art.)

But what if the file uses a different encoding? Or if the bitstream has to be split into 16-bit packages instead of 8-bit packages?

Then the user won’t see the correct information!
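To make this concrete, here is a minimal sketch that decodes the same bytes with different encodings. The class name MojibakeDemo and the sample character are only illustrative assumptions, and the sketch assumes Java 7 or later for StandardCharsets. Unicode escapes are used so the example does not depend on the encoding of the source file itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    // Decodes the given bytes with the charset chosen by the caller.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // A single byte with the value 97 (0x61) is 'a' in ASCII.
        byte[] ascii = {0x61};
        System.out.println(decode(ascii, StandardCharsets.US_ASCII));

        // U+00E9 (e with acute accent) takes two bytes in UTF-8: 0xC3 0xA9.
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);

        // Right encoding: one character, U+00E9.
        String correct = decode(utf8, StandardCharsets.UTF_8);
        // Wrong 8-bit encoding: each byte becomes its own character (U+00C3, U+00A9).
        String latin1 = decode(utf8, StandardCharsets.ISO_8859_1);
        // Wrong 16-bit encoding: both bytes merge into one character, U+C3A9.
        String utf16 = decode(utf8, StandardCharsets.UTF_16BE);

        System.out.println(correct.length() + " " + latin1.length() + " " + utf16.length());
    }
}
```

The bytes never change in this sketch; only the rule used to interpret them does, which is why the user ends up seeing different characters.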

Second: on Linux nearly everything is UTF-8 by default, Windows (with Western European settings) uses CP1252, and so on. Not good!

(With everything I mean: clipboard, default file encoding, …)

How can you (as an end user) handle this under Linux?

There are at least 4 programs that help you with encoding issues under Linux:

  • There are command line utilities under Linux that try to detect the encoding of a file automatically: enca (its companion enconv converts the file as well). Or open the file in Firefox and go to View -> Encoding to see the detected encoding!
  • To change the encoding of a file's content the editor Kate is really great:
    Go to Extras -> Encoding and try it out.
  • To change the encoding of the content of several files that come from Windows and that you want to use under Linux, use recode (pick the line that matches the source encoding, do not run both):
    recode CP1252..UTF-8 *
    recode ISO-8859-1..UTF-8 *

    recode overwrites the files in place, so do the following first to work on a copy of the original files:

    mkdir ../test && cp * ../test/ && cd ../test
  • Another command line utility is iconv (or here)
  • Change the encoding of file names (e.g. of files that come from Windows) with convmv.
    To preview the change do:

    convmv -f cp1252 -t utf8 *

    To do the change:

    convmv --notest -f cp1252 -t utf8 *
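If none of these tools is at hand, the same kind of content conversion that recode does can be sketched in a few lines of Java. This is only an illustration under some assumptions: the class and method names (Transcoder, transcode) are made up, the whole file is read into memory, and the charset name windows-1252 has to be available on your JVM (it is on common JDKs, but the specification only guarantees a handful of charsets). Unlike recode it writes to a separate target file, so the original stays untouched.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Transcoder {

    // Reads the source file with charset 'from' and writes it to target with charset 'to'.
    static void transcode(Path source, Path target, Charset from, Charset to) throws IOException {
        String text = new String(Files.readAllBytes(source), from);
        Files.write(target, text.getBytes(to));
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("cp1252-", ".txt");
        Path out = Files.createTempFile("utf8-", ".txt");

        // 0x80 is the euro sign in CP1252 (in ISO-8859-1 it is a control character).
        Files.write(in, new byte[] {(byte) 0x80});

        transcode(in, out, Charset.forName("windows-1252"), StandardCharsets.UTF_8);

        // The euro sign (U+20AC) needs three bytes in UTF-8: E2 82 AC.
        System.out.println(Files.readAllBytes(out).length); // prints 3
    }
}
```

The euro sign is a handy test character because it is one of the places where CP1252 and ISO-8859-1 differ.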

How does Java handle encoding?

One would think Java is platform independent, but with regard to encoding it isn't.

For example: code that reads a file correctly under Linux can fail elsewhere if you don't specify the encoding explicitly, because Java then falls back to the platform default. Under Linux that is typically UTF-8, under Windows it will be another default (e.g. CP1252)!

To override the default use 'java -Dfile.encoding=UTF-8', or be explicit about the encoding! E.g. read characters from a stream with the following lines:

BufferedInputStream iStream = new BufferedInputStream(urlConn.getInputStream());
InputStreamReader reader = new InputStreamReader(iStream, "UTF-8");
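A self-contained variant of the two lines above, as a sketch: it prints the default the JVM would otherwise fall back to and reads from an in-memory stream instead of a URL connection. The class name ExplicitCharsetRead and the helper readFirstLine are assumptions made for this illustration, and it assumes Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetRead {

    // Reads the first line of the stream, decoding it with the given charset.
    static String readFirstLine(InputStream in, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // This is what Java uses whenever no encoding is given; it differs between platforms.
        System.out.println("JVM default charset: " + Charset.defaultCharset());

        // Three non-ASCII letters (U+00E5, U+00E4, U+00F6), encoded as UTF-8.
        String original = "\u00e5\u00e4\u00f6";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Being explicit gives the same result on every platform.
        String line = readFirstLine(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        System.out.println(line.equals(original)); // prints true
    }
}
```

Passing a Charset object instead of a charset name also avoids the checked UnsupportedEncodingException.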

Another issue could be Java source files. They can have different encodings. You should use UTF-8 for them: it can represent every Unicode character, just like Java's Strings (which are UTF-16 internally), and it is the default on Linux.

In NetBeans 6.1 change it in the project properties (right-click on the project->properties)->Source->Encoding

In Eclipse 3.4 go to the preferences (menu Window) -> General ->Workspace->text file encoding

But this is only useful for desktop applications like my open source timetabler. What if you do web development? All fine there? No, not really. Then you might get additional problems with URL encoding or XML parsing. For the latter the fix is simple:

  • XML: <?xml version="1.0" encoding="UTF-8"?>

But for URL encoding the following does not really work:

  • JSP: <%@page contentType="text/html; charset=UTF-8" language="java"%>
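Why URL encoding is sensitive to the charset can be seen with the JDK's own URLEncoder and URLDecoder. The sketch below is only an illustration (the class name UrlEncodingDemo and the sample value are assumptions): the same character turns into different percent-escapes depending on the charset, and decoding with the wrong one yields garbage.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class UrlEncodingDemo {

    // Percent-encodes a value using the named charset.
    static String encode(String value, String charsetName) throws UnsupportedEncodingException {
        return URLEncoder.encode(value, charsetName);
    }

    // Decodes a percent-encoded value using the named charset.
    static String decode(String encoded, String charsetName) throws UnsupportedEncodingException {
        return URLDecoder.decode(encoded, charsetName);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String value = "\u00e4"; // a with diaeresis, U+00E4

        System.out.println(encode(value, "UTF-8"));      // prints %C3%A4
        System.out.println(encode(value, "ISO-8859-1")); // prints %E4

        // A browser sends %C3%A4 (UTF-8), the server decodes it as ISO-8859-1:
        // the result is two characters (U+00C3, U+00A4) instead of one.
        String misdecoded = decode("%C3%A4", "ISO-8859-1");
        System.out.println(misdecoded.length()); // prints 2
    }
}
```

So both sides have to agree on the charset; the page directive alone only describes the response.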

Apropos JSP - I had an encoding issue with the request. Try the following:

<% out.print("RESPONSE character encoding=" + response.getCharacterEncoding() + " ");
out.print("REQUEST character encoding=" + request.getCharacterEncoding() + " ");
out.print("JVM encoding " + System.getProperty("file.encoding") + " ");

//EVEN here we get request parameter in wrong encoding
bean.setRequest(request);
%>

You will see that the request character encoding is null, if I am not wrong. And then Java will use UTF-8? NO!

It will use ISO-8859-1! Why? It is written in the servlet standard!

A simple request.setCharacterEncoding("UTF-8"); (called before the first parameter is read) would help if all browsers sent their requests according to the header of the JSP. But this didn't actually work for my use case. So I grabbed the strings from the request via this helper method:

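// Repairs a string that was decoded as ISO-8859-1 although the bytes were UTF-8:
// get the original bytes back and decode them again, this time as UTF-8.
private String toUTF8(String str) {
        try {
            return new String(str.getBytes("ISO-8859-1"), "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // should not happen: every JVM has to support both charsets
            return str;
        }
}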
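Why this helper works, and where it breaks, can be tried without a servlet container. The following sketch simulates the container's wrong decoding step; the class name RequestRepairDemo and the sample text are assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;

final class RequestRepairDemo {

    // Same idea as the helper above: ISO-8859-1 maps every byte to exactly one
    // character, so encoding with it gives the original bytes back unchanged.
    static String toUtf8(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String typed = "\u00e5\u00e4\u00f6";

        // The browser sends UTF-8 bytes, the container decodes them as ISO-8859-1.
        byte[] sent = typed.getBytes(StandardCharsets.UTF_8);
        String seenByServlet = new String(sent, StandardCharsets.ISO_8859_1);
        System.out.println(seenByServlet.length()); // prints 6, not 3

        // The helper restores the text the user typed.
        System.out.println(toUtf8(seenByServlet).equals(typed)); // prints true

        // Caveat: applied to a string that was already decoded correctly,
        // the helper damages it instead of repairing it.
        System.out.println(toUtf8(typed).equals(typed)); // prints false
    }
}
```

Because of that caveat the helper should only be applied when you know the container decoded the request with ISO-8859-1.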

Update 1: Read this or this to get a better workaround with a javax.servlet.Filter, webserver parameters and jsp configs.

Update 2: The following snippets could be useful if you are using Maven and want to make the application UTF-8 aware:

<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.6</source>
<target>1.6</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-resources-plugin</artifactId>
<configuration>
<encoding>UTF-8</encoding>
</configuration>
</plugin>

Update 3:

A good site with a lookup table for Unicode characters:

http://unicode.coeurlumiere.com/

Summary

I invite you to post all your experiences with encoding problems in Java.
E.g. how to force JBoss or Jetty to use UTF-8?

10 thoughts on “Encoding issues. Solutions for Linux and within Java apps.”

  1. If your project uses a javax.servlet.Filter to filter your requests…

    In the method doFilter(), call
    request.setCharacterEncoding("UTF-8");

    Which seems to fix the encoding problem… Well, it worked for me.

  2. I am writing an XML file using Java. There is a tag
    åäö.
    The value (åäö) is taken from the DB. This works fine on Windows, but on Linux (Ubuntu Server Edition) it comes out like ???.
    The file's encoding is set to ISO-8859-1.
    On the Linux machine env | grep LANG shows LANG=en_US.UTF-8.

  3. I guess the XML parser throws something like "invalid char …"; you need to know which one!

    Why file encoding if it comes from the DB?

    How are you reading the ‘file’ exactly?

  4. Hi all!

    I am importing a Cp1252 encoded file (created on Windows, ISO-8859-1) into a Cassandra database (UTF-8 encoding) through my Java application running on Linux.
    I force Java to read the file in the ISO-8859-1 Windows format.
    Here is my solution that works perfectly:

    public void importItems(File importFile) {
        try {
            FileInputStream fis = new FileInputStream(importFile);
            BufferedInputStream iStream = new BufferedInputStream(fis);
            // decode the Windows file explicitly instead of relying on the platform default
            InputStreamReader reader = new InputStreamReader(iStream, "ISO-8859-1");

            BufferedReader bufferedReader = new BufferedReader(reader);

            // note: FileWriter encodes with the platform default charset (UTF-8 on this Linux machine)
            FileWriter fileWriter = new FileWriter(new File("/tmp/csv_out/User_Provisioning.csv"));

            String nextLine, newLine;
            String[] splitLine;

            while ((nextLine = bufferedReader.readLine()) != null) {
                System.out.println("Before : " + nextLine);

                // encodes to UTF-8 and decodes again with the platform default;
                // this changes nothing as long as that default is UTF-8
                newLine = new String(nextLine.getBytes(Charset.forName("UTF-8")));

                System.out.println("After: " + newLine);

                splitLine = nextLine.split(SEPARATOR);

                fileWriter.write(newLine);
                fileWriter.write('\n');
            }

            bufferedReader.close();
            fileWriter.close();

        } catch (IOException e) {
            System.out.println("" + e.getMessage());
        }
    }

  5. Hi everyone!!
    ————
    My “conf”, [C:\Program Files\NetBeans 7.0\etc\netbeans.conf]
    ————
    netbeans_default_options="-J-client -J-Xss2m -J-Xms32m -J-XX:PermSize=32m -J-XX:MaxPermSize=384m -J-Dapple.laf.useScreenMenuBar=true -J-Dapple.awt.graphics.UseQuartz=true -J-Dsun.java2d.noddraw=true
    -J-Dfile.encoding=UTF-8"

    My Help>About
    ————
    Product Version: NetBeans IDE 7.0 (Build 201104080000)
    Java: 1.6.0_24; Java HotSpot(TM) Client VM 19.1-b02
    System: Windows 7 version 6.1 running on x86; Cp1252; en_US (nb)
    Userdir: C:\Users\shri\.netbeans\7.0

    My swing GUI
    ————-
    *. showing correctly for Chinese Unicode characters
    but not for Tamil Unicode (only boxes).

    What's going wrong???
    It seems the .conf file is not affecting the NetBeans runtime… but how could that be??
    Am I editing the wrong .conf file??

  6. Thanks for the reply, karussell.
    Yes, the font is already there. It's working perfectly without NetBeans (through CMD).
    So the problem should be in NB.

  7. I am getting a strange issue. I have a small Java method that simply reads in a result set from a DB query and writes it to a comma separated file. I am surrounding each field with " so any embedded commas will not skew the CSV format. To make this work properly, I am replacing any " in each field with a space. This works fine on several test environments (Linux) except when run by a particular Linux user. When run by this user (the application account user) it returns two sets of " rather than replacing them with a space. I have tried comparing the env and nothing stands out. I can run this successfully as one user on the same Linux server, but get different results when running as another user.

    The Java code is:

    String aSpace = " ";
    writer.append(rs.getObject(i).toString().replaceAll("\"", aSpace));

    Any ideas would be greatly appreciated.
