How to determine whether a file is doc or docx in POI

Question


The title may be a little confusing. The simplest approach is to check the file extension, like this:

// is represents the InputStream   
if (filePath.endsWith("doc")) {
    WordExtractor ex = new WordExtractor(is);
    text = ex.getText();
    ex.close();
} else if(filePath.endsWith("docx")) {
    XWPFDocument doc = new XWPFDocument(is);
    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
    text = extractor.getText();
    extractor.close();
}

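This works in most cases. However, I have found some files with a .doc extension that are actually docx files: if you open one with WinRAR, you will see XML files inside. As is well known, a docx file is a zip file made up of XML files.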
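Because a docx is a zip and a binary doc is an OLE2 compound file, the two can usually be told apart by their first few bytes, without trusting the extension. A minimal sketch using only the JDK, assuming the stream supports mark/reset; `WordFormatSniffer`, `Kind`, and `sniff` are hypothetical names for illustration, not POI API:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class WordFormatSniffer {

    // Hypothetical result type for this sketch; not part of POI
    enum Kind { OLE2, ZIP, UNKNOWN }

    // OLE2 compound document signature, used by binary .doc files
    private static final int[] OLE2_SIG = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local file header signature "PK\3\4", used by .docx and any other non-empty zip
    private static final int[] ZIP_SIG = {0x50, 0x4B, 0x03, 0x04};

    // Peeks at up to 8 bytes and rewinds, so the caller can still parse from the start
    static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[8];
        in.mark(head.length);
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break; // stream is shorter than 8 bytes
            n += r;
        }
        in.reset();
        if (startsWith(head, n, OLE2_SIG)) return Kind.OLE2;
        if (startsWith(head, n, ZIP_SIG)) return Kind.ZIP;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] head, int len, int[] sig) {
        if (len < sig.length) return false;
        for (int i = 0; i < sig.length; i++) {
            if ((head[i] & 0xFF) != sig[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Only the zip signature followed by filler bytes, not a real docx
        byte[] fakeDocx = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(fakeDocx));
        System.out.println(sniff(in));
    }
}
```

Note that `ZIP` here only means "some zip file"; confirming that it is really a Word document would still require looking at the entries inside it.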

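I am sure this problem is not rare, but I could not find any information about it. Clearly, deciding whether to read a file as doc or docx based on its extension is not reliable.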

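In my case I have to read a lot of files, including doc or docx files inside compressed archives (zip, 7z, or rar). So I have to read the content through an InputStream rather than through a File or anything similar. That is why "How to know a file is .docx or .doc format from Apache POI" does not work for my case at all, since I am using ZipInputStream.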

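What is the best way to determine whether a file is doc or docx? I want a solution for reading the content of a file that may be either doc or docx, not one that merely decides which of the two it is. Clearly, ZipInputStream is not a good approach for my case, and I believe it is not suitable for others either. Why should I have to rely on an exception to tell whether a file is doc or docx?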

- neal

https://dev59.com/N57ha4cB1Zd3GeqPhU5L - user2080225

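I also don't see why @ClayFerguson's link doesn't answer your question. The solution mentioned there gives a simple way to test whether the file is a Zip file ... and thus to tell doc from docx. - lockcmpxchg8b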

@STaefi Please read my question carefully!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! - neal


@neal, so once you have detected that it is a zip file, you would still try to treat it as a "doc" file? Yes, that would "cause problems". - user2080225

@ClayFerguson You run into problems when you read a normal doc file. - neal


2 Answers


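// Opening the file as a zip archive only succeeds for zip-based formats such as .docx.
// try-with-resources closes the ZipFile again so the file handle is not leaked.
try (ZipFile zip = new ZipFile(new File("/Users/giang/Documents/a.doc"))) {
    System.out.println("this file is .docx");
} catch (ZipException e) {
    System.out.println("this file is not .docx");
    e.printStackTrace();
}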

- yelliver

The comments on the question reveal some additional requirements that this answer does not meet. - lockcmpxchg8b


- Axel Richter · Accepted Answer

With the current stable version, apache poi 3.17, you can use FileMagic. Internally, of course, this also has to look into the file.

For example:

import java.io.InputStream;
import java.io.FileInputStream;
import java.io.BufferedInputStream;

import org.apache.poi.poifs.filesystem.FileMagic;

import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class ReadWord {

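 static String read(InputStream is) throws Exception {

  // FileMagic.valueOf needs a stream that supports mark/reset; this wraps it if necessary
  is = FileMagic.prepareToCheckMagic(is);

  // Peek at the file header once; the stream is reset to the start afterwards
  FileMagic fm = FileMagic.valueOf(is);
  System.out.println(fm);

  String text = "";

  if (fm == FileMagic.OLE2) {
   // binary Word format (*.doc)
   WordExtractor ex = new WordExtractor(is);
   text = ex.getText();
   ex.close();
  } else if (fm == FileMagic.OOXML) {
   // Office Open XML format (*.docx)
   XWPFDocument doc = new XWPFDocument(is);
   XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
   text = extractor.getText();
   extractor.close();
  }

  return text;

 }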

 public static void main(String[] args) throws Exception {

  InputStream is = new BufferedInputStream(new FileInputStream("ExampleOLE.doc")); //really a binary OLE2 Word file
  System.out.println(read(is));
  is.close();

  is = new BufferedInputStream(new FileInputStream("ExampleOOXML.doc")); //a OOXML Word file named *.doc
  System.out.println(read(is));
  is.close();

  is = new BufferedInputStream(new FileInputStream("ExampleOOXML.docx")); //really a OOXML Word file
  System.out.println(read(is));
  is.close();

 }
}
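
The question also mentions documents stored inside archives. The following is only a sketch of how that case might be approached with `java.util.zip`, under the assumption that each entry is small enough to copy into memory; `ArchiveWordScanner`, `scan`, `sniff`, and `readAll` are hypothetical names, and `sniff` is a simplified JDK-only stand-in for `FileMagic.valueOf`:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ArchiveWordScanner {

    // Maps each entry name to "OLE2", "ZIP" or "UNKNOWN", based on the entry's first bytes
    static Map<String, String> scan(InputStream archive) throws IOException {
        Map<String, String> kinds = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                // Copy the entry so that closing its stream later cannot close the archive
                InputStream entryStream = new ByteArrayInputStream(readAll(zin));
                kinds.put(entry.getName(), sniff(entryStream));
                // entryStream is back at offset 0 and could now be passed to read(InputStream)
            }
        }
        return kinds;
    }

    // Simplified header check; only the first four bytes of each signature are compared
    static String sniff(InputStream in) throws IOException {
        byte[] head = new byte[8];
        in.mark(head.length);
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break; // entry is shorter than 8 bytes
            n += r;
        }
        in.reset();
        if (n >= 4 && (head[0] & 0xFF) == 0xD0 && (head[1] & 0xFF) == 0xCF
                && head[2] == 0x11 && (head[3] & 0xFF) == 0xE0) return "OLE2";
        if (n >= 4 && head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4) return "ZIP";
        return "UNKNOWN";
    }

    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int r;
        while ((r = in.read(buf)) != -1) out.write(buf, 0, r);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny archive in memory; the entry holds only a zip signature, not a real docx
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("really-docx.doc"));
            zos.write(new byte[] {'P', 'K', 3, 4, 0, 0, 0, 0});
            zos.closeEntry();
        }
        System.out.println(scan(new ByteArrayInputStream(bytes.toByteArray())));
    }
}
```

Copying each entry into a byte array is a deliberate choice here: a stream wrapped directly around the `ZipInputStream` would close the whole archive if the consumer closed it. 7z and rar archives are not covered by `java.util.zip` and would need a third-party library.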