Converting a file's encoding from ANSI to UTF-8 in Java

5

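I have a requirement to convert a file's encoding from ANSI (windows-1252) to UTF-8. I wrote the Java program below to do it. The program converts the characters to UTF-8, but when I open the file in Notepad++ the encoding is displayed as "ANSI as UTF-8". This causes an error when I import the file into an Access database. What I need is a file that is UTF-8 encoded only. The conversion also has to happen without opening any editor.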

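import java.io.*;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class ConvertFromAnsiToUtf8 {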

    private static final char BYTE_ORDER_MARK = '\uFEFF';
    private static final String ANSI_CODE = "windows-1252";
    private static final String UTF_CODE = "UTF8";
    private static final Charset ANSI_CHARSET = Charset.forName(ANSI_CODE);

    public static void main(String[] args) {

        List<File> fileList;
        File inputFolder = new File(args[0]);
        if (!inputFolder.isDirectory()) {
            return;
        }
        File parentDir = new File(inputFolder.getParent() + "\\"
                    + inputFolder.getName() + "_converted");

        if (parentDir.exists()) {
            return;
        }
        if (parentDir.mkdir()) {

        } else {
            return;
        }

        fileList = new ArrayList<File>();
        for (final File fileEntry : inputFolder.listFiles()) {
            fileList.add(fileEntry);
        }

        InputStream in;

        Reader reader = null;
        Writer writer = null;
        try {
            for (File file : fileList) {
                in = new FileInputStream(file.getAbsoluteFile());
                reader = new InputStreamReader(in, ANSI_CHARSET);

                OutputStream out = new FileOutputStream(
                            parentDir.getAbsoluteFile() + "\\"
                                            + file.getName());
                writer = new OutputStreamWriter(out, UTF_CODE);
                writer.write(BYTE_ORDER_MARK);
                char[] buffer = new char[10];
                int read;
                while ((read = reader.read(buffer)) != -1) {
                    System.out.println(read);
                    writer.write(buffer, 0, read);
                }
            }
            reader.close();
            writer.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Any pointers would be helpful.
Thanks, Ashish

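Have you tried iconv? There is also a link to Windows binaries here. - 9000
What does "the encoding type was displayed as ANSI as UTF8" mean? Do you think the program is not converting the file from Windows-1252 to UTF-8? - David Conrad
Which version of Java are you using? - anstarovoyt
I am using Java 6. "ANSI as UTF-8" is what an editor that shows the encoding displays when you open the file. In my case that is Notepad++, which shows it in the bottom-right corner. - Ashish
The same result can be achieved with two calls to Java's native2ascii: native2ascii -encoding windows-1252 in.txt tmp.txt, followed by native2ascii -reverse -encoding UTF-8 tmp.txt out.txt - Joop Eggen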
2 Answers

5

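This code correctly transcodes from windows-1252 to UTF-8.

The Notepad++ message is confusing because "ANSI as UTF-8" has no obvious meaning; it appears to be a defect in Notepad++. I believe Notepad++ means UTF-8 without BOM (see the Encoding menu).

As a Windows program, Microsoft Access probably expects UTF-8 files to start with a byte order mark (BOM).

You can inject a BOM into the document by writing the code point U+FEFF at the start of the file: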

import java.io.*;
import java.nio.charset.*;

public class Ansi1252ToUtf8 {
  private static final char BYTE_ORDER_MARK = '\uFEFF';

  public static void main(String[] args) throws IOException {
    Charset windows1252 = Charset.forName("windows-1252");
    try (InputStream in = new FileInputStream(args[0]);
        Reader reader = new InputStreamReader(in, windows1252);
        OutputStream out = new FileOutputStream(args[1]);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
      writer.write(BYTE_ORDER_MARK);
      char[] buffer = new char[1024];
      int read;
      while ((read = reader.read(buffer)) != -1) {
        writer.write(buffer, 0, read);
      }
    }
  }
}
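
Since the requirement is to avoid opening an editor, the presence of the BOM can also be checked programmatically. The following is only a sketch: the class name Utf8BomCheck and the method startsWithUtf8Bom are made up for illustration. It relies on the fact that U+FEFF encoded as UTF-8 is the three bytes EF BB BF.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of the answer above.
public class Utf8BomCheck {
    // U+FEFF encoded as UTF-8
    private static final int[] UTF8_BOM = { 0xEF, 0xBB, 0xBF };

    // Returns true if the file begins with the three UTF-8 BOM bytes.
    static boolean startsWithUtf8Bom(File file) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            for (int expected : UTF8_BOM) {
                // read() returns -1 at end of file, which never matches
                if (in.read() != expected) {
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("Usage: java Utf8BomCheck <file>");
            return;
        }
        File file = new File(args[0]);
        System.out.println(file + (startsWithUtf8Bom(file)
                ? " starts with a UTF-8 BOM"
                : " has no UTF-8 BOM"));
    }
}
```

Running it against a converted file should report a BOM if the writer.write(BYTE_ORDER_MARK) call was executed for that file.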

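Thank you very much, McDowell, this is exactly the solution I was looking for. David, Andrew, Joop, thanks for your time. - Ashish
I noticed another problem with the above solution. When I convert a folder containing many files, some files get truncated, and one very small file comes out completely empty. I tried converting the files with a smaller buffer size, but that did not help. Does anyone have any idea about this? - Ashish
It sounds like the output stream is not being closed properly. Are you using exactly the code above? - McDowell
The code has been updated in the question itself. That is the exact code I am using. I am closing the streams as well. Please take a look. - Ashish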

1
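On Windows 7 (64-bit) running Java 8, I had to close every file. Otherwise the files were truncated to a multiple of 4 kB. Closing only the last pair of streams is not enough; I had to close each file to get the desired result. Here is my modified version, with error messages added: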
import java.io.*;
import java.nio.charset.*;
import java.util.ArrayList;

public class ConvertFromAnsiToUtf8 {

    private static final char BYTE_ORDER_MARK = '\uFEFF';
    private static final String ANSI_CODE = "windows-1252";
    private static final String UTF_CODE = "UTF8";
    private static final Charset ANSI_CHARSET = Charset.forName(ANSI_CODE);
    private static final String PATH_SEP = "\\";
    private static final boolean WRITE_BOM = false;

    public static void main(String[] args) 
    {
        if (args.length != 2) {
            System.out.println("Please name a source and a target directory");
            return;
        }

        File inputFolder = new File(args[0]);
        if (!inputFolder.isDirectory()) {
            System.out.println("Input folder " + inputFolder + " does not exist");
            return;
        }
        File outputFolder = new File(args[1]);

        if (outputFolder.exists()) {
            System.out.println("Folder " + outputFolder + " exists - aborting");
            return;
        }
        if (outputFolder.mkdir()) {
            System.out.println("Placing converted files in " + outputFolder);
        } else {
            System.out.println("Output folder " + outputFolder + " could not be created - aborting");
            return;
        }

        ArrayList<File> fileList = new ArrayList<File>();
        for (final File fileEntry : inputFolder.listFiles()) {
            fileList.add(fileEntry);
        }

        int converted = 0;

        try {
            for (File file : fileList) {
                if (!file.isFile()) {
                    continue; // skip subdirectories, which cannot be opened as streams
                }
                File target = new File(outputFolder, file.getName());
                // try-with-resources closes reader and writer after every file,
                // which also flushes the writer's buffered output to disk
                try (Reader reader = new InputStreamReader(new FileInputStream(file), ANSI_CHARSET);
                     Writer writer = new OutputStreamWriter(new FileOutputStream(target), UTF_CODE)) {
                    if (WRITE_BOM) {
                        writer.write(BYTE_ORDER_MARK);
                    }
                    char[] buffer = new char[1024];
                    int read;
                    while ((read = reader.read(buffer)) != -1) {
                        writer.write(buffer, 0, read);
                    }
                    ++converted;
                }
            }
        } catch (IOException e) {
            // also covers FileNotFoundException and UnsupportedEncodingException
            e.printStackTrace();
        }

        System.out.println(converted + " files converted");
    }

}
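
The truncation described in this answer is consistent with how an OutputStreamWriter buffers encoded bytes: they are only guaranteed to reach the file once the writer is flushed or closed. The sketch below is an illustration only (the class and method names are made up). It writes a short string and reports the file length before and after close(). How much is buffered before close is a JDK implementation detail, so only the length after close is something to rely on.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo, not part of the answer above.
public class CloseEachFileDemo {

    // Writes text as UTF-8 and returns { length before close, length after close }.
    static long[] lengthsBeforeAndAfterClose(File target, String text) throws IOException {
        long before;
        Writer writer = new OutputStreamWriter(new FileOutputStream(target), StandardCharsets.UTF_8);
        try {
            writer.write(text);
            // Bytes may still sit in the writer's internal buffer at this point.
            before = target.length();
        } finally {
            writer.close(); // flushes the buffer and releases the file handle
        }
        return new long[] { before, target.length() };
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("close-demo", ".txt");
        tmp.deleteOnExit();
        long[] lengths = lengthsBeforeAndAfterClose(tmp, "caf\u00e9");
        System.out.println("before close: " + lengths[0]
                + " bytes, after close: " + lengths[1] + " bytes");
    }
}
```

If the first number is smaller than the second, the difference is the data that would be lost when a writer is overwritten by the next loop iteration without being closed.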
