Programmatically converting a text file from ANSI to UTF-8

Question

9

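Please help me. I am developing a Java application that converts data from a txt file into a database format. The problem is that the file is ANSI-encoded, and I cannot change that because it comes from outside my application. When I write the data to the database, some "???" symbols appear. My question is: how can I convert the data read from the file from ANSI to UTF-8 so that these strange symbols are handled correctly? I tried converting byte[] to String, but it did not work.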

- wlegend

1

If I understand correctly, you should use UTF-8 when opening the input stream, e.g. new InputStreamReader(inputStream, "UTF-8"). - MByD

@MByD, thank you very much for the suggestion. Unfortunately I have already tried that and it did not work for me; the result is always the same. - wlegend

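Hi all, I found the answer, many thanks to MByD. Instead of using UTF-8 as the encoding, I should use the encoding of the input, namely "windows-1252". The strange symbols no longer appear. - wlegend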
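The outcome of this comment thread can be reproduced with a minimal sketch. It assumes the file contains a byte such as 0xE9 ("é" in windows-1252); the class name DecodeDemo and the sample bytes are made up for illustration. Decoding those bytes as UTF-8 produces the replacement character U+FFFD, which is presumably what later shows up as "???", while decoding them with the charset the file was actually written in recovers the text.

```java
import java.nio.charset.Charset;

public class DecodeDemo {

    // Decodes raw bytes using the named charset. Malformed input is
    // silently replaced with U+FFFD, as the String constructor does.
    public static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "café" as written by a windows-1252 ("ANSI") editor: é is the single byte 0xE9.
        byte[] ansiBytes = {'c', 'a', 'f', (byte) 0xE9};

        String wrong = decode(ansiBytes, "UTF-8");        // 0xE9 alone is not valid UTF-8
        String right = decode(ansiBytes, "windows-1252"); // matches how the bytes were written

        System.out.println("decoded as UTF-8 contains U+FFFD: " + (wrong.indexOf('\uFFFD') >= 0));
        System.out.println("decoded as windows-1252 equals café: " + right.equals("caf\u00e9"));
    }
}
```

Once the text has been decoded correctly into a Java String, it can be written out in any encoding, including UTF-8.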

2 Answers

0

1. What is ANSI?

https://www.cnblogs.com/malecrab/p/5300486.html

2. Required libraries

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
</dependency>
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
</dependency>
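<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
</dependency>
<!-- Also imported by the example below: Guava (Sets) and Lombok (@Slf4j) -->
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
</dependency>
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
</dependency>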

3. Java example

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Set;

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.tika.Tika;
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

import com.google.common.collect.Sets;
import lombok.extern.slf4j.Slf4j;

/**
 *
 * @author wang.qingsong
 * Created on 2021/09/16
 */
@Slf4j
public class FileUtil {

    public static boolean isFileEncodingUtf8(File inputFile) throws IOException {
        return isUtf8(getFileEncoding(inputFile));
    }

    public static String getFileEncoding(File file) throws IOException {
        try (FileInputStream fileInputStream = new FileInputStream(file);) {
            return getInputStreamEncoding(fileInputStream);
        }
    }

    public static String getInputStreamEncoding(InputStream input) throws IOException {
        CharsetDetector charsetDetector = new CharsetDetector();
        // Non-null only when this method wraps the input itself; closed in finally,
        // which also closes the wrapped stream.
        BufferedInputStream buffInput = null;
        try {
            charsetDetector.setText(
                input instanceof BufferedInputStream ? input : (buffInput = new BufferedInputStream(input)));
            charsetDetector.enableInputFilter(true);
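            CharsetMatch cm = charsetDetector.detect();
            // detect() returns null when no charset matches; callers treat null as "unknown".
            return cm == null ? null : cm.getName();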
        } finally {
            IOUtils.closeQuietly(buffInput);
        }
    }

    public static void convertFileToUtf8(File inputFile, File outputFile) throws IOException {
        final String encoding = getFileEncoding(inputFile);
        if (StringUtils.isEmpty(encoding)) {
            throw new RuntimeException("inputFile encoding could not be detected!");
        }
        if (isUtf8(encoding)) {
            throw new RuntimeException("inputFile is already UTF-8, no need to convert.");
        }

        if (!outputFile.exists()) {
            outputFile.createNewFile();
        }

        try (FileInputStream inputStream = new FileInputStream(inputFile);
             InputStreamReader inputReader = new InputStreamReader(inputStream, encoding);
             // output
             FileOutputStream outputStream = new FileOutputStream(outputFile);
             OutputStreamWriter outputWriter = new OutputStreamWriter(outputStream, StandardCharsets.UTF_8)) {
            IOUtils.copy(inputReader, outputWriter);
        }
    }

    private static boolean isUtf8(String encoding) {
        final Set<String> aliases = Sets.newHashSet("utf-8", "utf_8", "utf8");
        for (String utf8 : aliases) {
            if (StringUtils.equalsIgnoreCase(utf8, encoding)) {
                return true;
            }
        }
        return false;
    }
}

- greatwqs
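If pulling in a detection library is not an option, the "is this file already UTF-8?" part of the answer above can be approximated with the JDK alone. The following is a minimal sketch, not the answer author's method: the class name Utf8Check is hypothetical, and it only checks whether the bytes are well-formed UTF-8. It cannot tell which other encoding a file uses, and pure ASCII input also passes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {

    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    public static boolean isWellFormedUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] ansi = {'c', 'a', 'f', (byte) 0xE9}; // "café" in windows-1252
        System.out.println(isWellFormedUtf8(utf8));
        System.out.println(isWellFormedUtf8(ansi));
    }
}
```

A file that fails this check still needs its real encoding supplied by whoever produces it, or guessed by a detector such as the one used above.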


- McDowell · Accepted Answer

You can open a decoding reader like this:

Reader reader = 
   new InputStreamReader(inputStream, Charset.forName(encodingName));

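Which encoding name you should use depends on which "ANSI" encoding the text file was written in. You can find a list of the encodings supported by Java 6 here. On an English-language system it is probably windows-1252.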
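Putting the reader together with a UTF-8 writer gives a complete file conversion. The sketch below assumes the source charset is known in advance (windows-1252 is used only as an example); the class name AnsiToUtf8 and the command-line arguments are illustrative.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class AnsiToUtf8 {

    // Decodes `source` with `sourceCharset` and re-encodes the text as UTF-8 into `target`.
    public static void convert(Path source, Charset sourceCharset, Path target) throws IOException {
        try (Reader reader = Files.newBufferedReader(source, sourceCharset);
             Writer writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, read);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: AnsiToUtf8 <input> <output>");
            return;
        }
        // Assumption: the input was written on a Western European Windows system.
        convert(Paths.get(args[0]), Charset.forName("windows-1252"), Paths.get(args[1]));
    }
}
```

Files.newBufferedReader throws an exception on bytes that are not valid in the given charset rather than replacing them, so a wrong guess about the source encoding tends to fail loudly instead of silently corrupting the output.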

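Writing the data to the database correctly depends on configuring the database correctly and (sometimes) on providing the right configuration to the JDBC driver.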

You can learn more about character encoding handling here and here.