Determine if text is in English?

Question

26

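I am using both NLTK and scikit-learn to do some text processing. However, my list of documents contains some documents that are not in English. For example, the following could be true: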

[ "this is some text written in English", 
  "this is some more text written in English", 
  "Ce n'est pas en anglais" ]

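For the purposes of my analysis, I want all sentences that are not in English to be removed as part of pre-processing. Is there a good way to do this? I have been Googling, but cannot find anything specific that would let me recognize whether a string is in English. Is this functionality not offered by either NLTK or scikit-learn?

Edit: I have seen questions like this one and this one, but both are about individual words, not a "document". Would I have to loop through every word in a sentence to check whether the whole sentence is in English?

I'm using Python, so Python libraries are preferred, but I can switch languages if needed; I just thought Python would be the best choice for this.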

- ocean800

7 Answers

23

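You might be interested in my paper "The WiLI benchmark dataset for written language identification". I also benchmarked a couple of tools.

TL;DR:

CLD-2 is very good and extremely fast
lang-detect is slightly better than CLD-2, but much slower
langid is good, but CLD-2 and lang-detect are better
NLTK's Textcat is neither efficient nor effective

You can install lidtk and classify languages: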

$ lidtk cld2 predict --text "this is some text written in English"
eng
$ lidtk cld2 predict --text "this is some more text written in English"
eng
$ lidtk cld2 predict --text "Ce n'est pas en anglais"                  
fra
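
For calling this from Python, one option is to wrap the exact command shown above with `subprocess`. This is only a minimal sketch: the function names are hypothetical, it assumes the `lidtk` executable is on your PATH, and it assumes the command prints nothing but the language code (as in the sample output above). Starting one process per document is slow, so it suits small batches only.

```python
import subprocess


def lidtk_command(text):
    """Build the same CLI call shown above for one piece of text."""
    return ["lidtk", "cld2", "predict", "--text", text]


def predict_language(text, run=subprocess.run):
    """Return the code printed by lidtk (e.g. "eng"), stripped of whitespace."""
    # Assumption: stdout contains only the language code.
    completed = run(lidtk_command(text), capture_output=True, text=True, check=True)
    return completed.stdout.strip()


def keep_english(documents, run=subprocess.run):
    """Keep only the documents whose predicted code is "eng"."""
    return [doc for doc in documents if predict_language(doc, run=run) == "eng"]
```

The `run` parameter exists only so the process call can be swapped out, for example in tests.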

- Martin Thoma

1

Does cld2 support Python 3? - jeffry copps

3

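If you want a more detailed answer, I recommend reading the paper. - Martin Thoma

Martin Thoma, I want a fast language detection tool for Python. There is Python 3 support, but how do I call the langDetect or predict function? Do you have any examples? Thanks. - tursunWali

Thanks for sharing. Very useful. I am reading your paper. Has it been published at a conference or in a journal, so that I can cite it accordingly? - Simone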

1

@Simone Thank you! If you go to https://arxiv.org/abs/1801.07779 you can see a BibTeX export with all citation details. The paper was not published in a peer-reviewed journal or conference. - Martin Thoma

5

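A pretrained fastText model worked best for my similar needs

I had a very similar need and appreciated Martin Thoma's answer. However, I found the most help in part 7 of Rabash's answer (linked here).

After experimenting, I found fasttext to be an excellent tool that best met my needs, which were to make sure that the text in 60,000+ text files was in English.

With a little work, I had a tool that ran quickly over many files. Below is the code with comments. I believe that you and others will be able to modify this code for your more specific needs.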

import fasttext


class English_Check:
    def __init__(self):
        # Don't need to train a model to detect languages. A model exists
        #    that is very good. Let's use it.
        pretrained_model_path = 'location of your lid.176.ftz file from fasttext'
        self.model = fasttext.load_model(pretrained_model_path)

    def predict_languages(self, text_file):
        this_D = {}
        with open(text_file, 'r') as f:
            fla = f.readlines()  # fla = file line array.
            # fasttext doesn't like newline characters, but it can take
            #    an array of lines from a file. The two list comprehensions
            #    below, just clean up the lines in fla
            fla = [line.rstrip('\n').strip(' ') for line in fla]
            fla = [line for line in fla if len(line) > 0]

            for line in fla:  # Language predict each line of the file
                language_tuple = self.model.predict(line)
                # The next two lines simply get at the top language prediction
                #    string AND the confidence value for that prediction.
                prediction = language_tuple[0][0].replace('__label__', '')
                value = language_tuple[1][0]

                # Each top language prediction for the lines in the file
                #    becomes a unique key for the this_D dictionary.
                #    Everytime that language is found, add the confidence
                #    score to the running tally for that language.
                if prediction not in this_D.keys():
                    this_D[prediction] = 0
                this_D[prediction] += value

        self.this_D = this_D

    def determine_if_file_is_english(self, text_file):
        self.predict_languages(text_file)

        # Find the max tallied confidence and the sum of all confidences.
        max_value = max(self.this_D.values())
        sum_of_values = sum(self.this_D.values())
        # calculate a relative confidence of the max confidence to all
        #    confidence scores. Then find the key with the max confidence.
        confidence = max_value / sum_of_values
        max_key = [key for key in self.this_D.keys()
                   if self.this_D[key] == max_value][0]

        # Only want to know if this is english or not.
        return max_key == 'en'

Below is the instantiation and use of the class above for my needs.

file_list = []  # fill in with your own list of files to check for English

en_checker = English_Check()
for file in file_list:
    check = en_checker.determine_if_file_is_english(file)
    if not check:
        print(file)
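
The per-line tallying idea used in the class above can also be looked at in isolation. The sketch below is not part of the original code: the function names are hypothetical, and it assumes you already have one `(label, confidence)` pair per line, with labels in the `__label__xx` form that the class strips. It needs no model file, so the aggregation logic can be tried on its own.

```python
from collections import defaultdict


def tally_languages(line_predictions):
    """Sum per-line confidence scores by language code."""
    totals = defaultdict(float)
    for label, confidence in line_predictions:
        totals[label.replace("__label__", "")] += confidence
    return dict(totals)


def dominant_language(line_predictions):
    """Return (language, share of total confidence), or (None, 0.0) if empty."""
    totals = tally_languages(line_predictions)
    if not totals:
        return None, 0.0
    language = max(totals, key=totals.get)
    return language, totals[language] / sum(totals.values())
```

Returning the share alongside the language makes it possible to reject files where English wins only narrowly, which the `max_key == 'en'` test alone does not do.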

- Thom Ives

4

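Here is what I used some time ago. With the settings below (max_error_count = 4, min_text_length = 3), it accepts texts of at least three words that contain no more than four unrecognized words. Of course, you can adjust the settings as needed, but for my use case (website scraping) these worked quite well.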

from enchant.checker import SpellChecker

max_error_count = 4
min_text_length = 3

def is_in_english(quote):
  d = SpellChecker("en_US")
  d.set_text(quote)
  errors = [err.word for err in d]
  return False if ((len(errors) > max_error_count) or len(quote.split()) < min_text_length) else True

print(is_in_english('“中文”'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))

> False
> True

- grizmin

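Does the code need both the single and the double quotes in the is_in_english calls? - David Medinets

If you look closely, you'll see that it's not actually a regular double quote. It's just a character that looks like one. - grizmin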

2

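If you want something lightweight, letter trigrams are a popular approach. Every language has a different "profile" of common and uncommon trigrams. You can google around for it, or code your own. Here's a sample implementation I came across, which uses "cosine similarity" as a measure of distance between the sample text and the reference data: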

http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/

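If you know the common non-English languages in your corpus, it's pretty easy to turn this into a yes/no test. If you don't, you need to anticipate sentences from languages for which you have no trigram statistics. I would do some testing to see the normal range of similarity scores for single-sentence texts in your documents, and choose a suitable threshold for the English cosine score.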
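
The trigram-plus-cosine idea can be sketched in a few lines of standard-library Python. This is an illustration written for this thread, not the linked recipe: the function names are hypothetical, the reference profile must be built from a large sample of English text that you supply, and the threshold is something to tune on your own data as described above.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count overlapping three-character windows in lowercased text."""
    normalized = " ".join(text.lower().split())  # collapse runs of whitespace
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def cosine_similarity(profile_a, profile_b):
    """Cosine of the angle between two trigram count vectors (0.0 to 1.0)."""
    dot = sum(count * profile_b[gram] for gram, count in profile_a.items())
    norm_a = math.sqrt(sum(c * c for c in profile_a.values()))
    norm_b = math.sqrt(sum(c * c for c in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def looks_english(text, english_profile, threshold):
    """Yes/no test: is the text close enough to the English reference profile?"""
    return cosine_similarity(trigram_profile(text), english_profile) >= threshold
```

Build `english_profile` once with `trigram_profile(reference_text)` and reuse it for every document; only the per-document profile has to be recomputed.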

- alexis

Thanks for your answer! Just wondering, do you know how this performs on large datasets? - ocean800

1

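Trigram models are fast... there's not much to compute. But what do you mean by "large dataset"? If each of your documents is in a single language, and you have so many that counting trigrams over entire documents slows you down, just stop counting after a few hundred words. - alexis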

2

Use the enchant library

import enchant

dictionary = enchant.Dict("en_US") #also available are en_GB, fr_FR, etc

dictionary.check("Hello")  # returns True
dictionary.check("Helo")  # returns False

The example above is taken directly from their website.

- lordingtar

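Thanks, this library looks interesting too. Do you know how it performs on documents with long strings? - ocean800

I haven't used it on very long document strings; I trained my own model for that. Try it out and see if the library is robust enough! It also has its own spell checker (the library's main purpose). - lordingtar

Will try it out and see which library works better, thanks :) - ocean800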

4

"enchant" can only check individual English words, not phrases. For example, "Hello" returns True, but "hello world" returns False. Also, it is no longer actively maintained. - yuqli
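
Since the check is per word, a phrase has to be split first. A minimal sketch of that, under stated assumptions: the function names and the 0.8 cut-off are hypothetical, and `is_known_word` is any callable that takes one word and returns True or False, for example the `check` method of an `enchant.Dict("en_US")` object.

```python
def recognized_word_ratio(text, is_known_word):
    """Fraction of whitespace-separated words accepted by is_known_word."""
    # Strip common surrounding punctuation; empty strings are dropped
    # so the word checker is never called with an empty word.
    words = [word.strip(".,;:!?\"'()") for word in text.split()]
    words = [word for word in words if word]
    if not words:
        return 0.0
    return sum(1 for word in words if is_known_word(word)) / len(words)


def is_mostly_english(text, is_known_word, min_ratio=0.8):
    """Treat the text as English if enough of its words are recognized."""
    return recognized_word_ratio(text, is_known_word) >= min_ratio
```

Using a ratio rather than requiring every word to pass leaves room for names and typos, which a dictionary will not recognize.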

0

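import enchant

def check(text):
    # Returns "True" only if every whitespace-separated word is in the dictionary.
    dictionary = enchant.Dict("en_US")  # also available are en_GB, fr_FR, etc
    o = "True"
    for word in text.split():
        if not dictionary.check(word):
            o = "False"
            break
    # Return after the loop so that every word is checked, not just the first.
    return o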

- Mohammad Khaddam

- salehinejad · Accepted Answer

29

There is a library called langdetect. It is a port of Google's language-detection and is available here:

https://pypi.python.org/pypi/langdetect

It supports 55 languages out of the box.
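
A minimal sketch of filtering a list of documents with it, assuming `langdetect` is installed. The helper names are hypothetical; `detect`, `DetectorFactory.seed` and `LangDetectException` are the library's own. The library is imported inside the factory function so that the filtering helper itself has no hard dependency on it.

```python
def make_langdetect_detector(seed=0):
    """Return a function mapping text to a language code, or None on failure."""
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    # langdetect is non-deterministic unless a seed is set.
    DetectorFactory.seed = seed

    def detect_language(text):
        try:
            return detect(text)
        except LangDetectException:
            return None  # raised, for example, for empty text

    return detect_language


def keep_english(documents, detect_language):
    """Keep only the documents whose detected language code is "en"."""
    return [doc for doc in documents if detect_language(doc) == "en"]
```

Usage would look like `keep_english(documents, make_langdetect_detector())`. Very short inputs are unreliable with any statistical detector, so results for single words should be treated with caution.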

- salehinejad

4

Thank you very much, this is just what I needed! :) Just one question: do you know how this library performs on long documents? - ocean800

4

I haven't used it. It would be great if you shared your experience here. - salehinejad

4

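Unfortunately it is slow on long documents, but thanks anyway! - ocean800

langdetect sometimes fails to detect the language correctly. I tried it on the word "DRIVE" and it said German. - Pravin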

2

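@ocean800 Why do you care about long documents? If a document is written in English, then all of its sentences are in English. That means it's enough to analyze just one sentence. - ceving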
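
That suggestion can be sketched as follows, assuming each document really is in a single language. The function names and the 200-word default are hypothetical, and `detect_language` stands for whichever detector you use: any callable that takes a string and returns a language code such as "en".

```python
def leading_words(text, limit=200):
    """Return at most the first `limit` whitespace-separated words."""
    return " ".join(text.split()[:limit])


def is_english_document(text, detect_language, limit=200):
    """Run the detector on a short leading sample instead of the whole text."""
    sample = leading_words(text, limit)
    # Empty or whitespace-only documents are treated as not English.
    return bool(sample) and detect_language(sample) == "en"
```

A sample of a few hundred words is usually more robust than a single sentence, which may happen to be a quotation, a title or a name.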