How can I determine whether a string is an English sentence or code?

Question

7

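Consider the following two strings: the first is code, and the second is an English sentence (a phrase, to be precise). How can I detect that the first is code and the second is not?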

1. for (int i = 0; i < b.size(); i++) {
2. do something in English (not necessary to be a sentence).

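I am thinking of counting special characters (such as "=", ";", "++", etc.) and comparing the count against some threshold. Is there a better way to do this? Are there any Java libraries available?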

Note that the code may not be parsable, because it is not necessarily a complete method, statement, or expression.

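My assumption is that English sentences are fairly regular: they most likely contain only ",", ".", "_", "(", ")", and so on. They do not contain something like this: write("the whole lot of text");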
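The threshold idea above could be sketched roughly as follows. This is only an illustration: the class name, the set of "code-like" characters, and the 0.05 threshold are all assumptions chosen for the two example strings, not tuned values.

```java
class SpecialCharRatio {
    // Characters treated as code-like; this set is an assumption for illustration.
    private static final String CODE_CHARS = "=;{}[]<>+*&|^~";

    // Assumed cut-off: more than 5% code-like characters means "code".
    private static final double THRESHOLD = 0.05;

    /** Fraction of characters in s that belong to CODE_CHARS. */
    static double ratio(String s) {
        if (s.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (int i = 0; i < s.length(); i++) {
            if (CODE_CHARS.indexOf(s.charAt(i)) >= 0) {
                hits++;
            }
        }
        return (double) hits / s.length();
    }

    static boolean looksLikeCode(String s) {
        return ratio(s) > THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeCode("for (int i = 0; i < b.size(); i++) {"));
        System.out.println(looksLikeCode("do something in English (not necessary to be a sentence)."));
    }
}
```

With these assumed values, a call such as write("the whole lot of text"); falls below the threshold, because its only code-like character is the trailing semicolon. That is one reason a plain character count is a crude signal.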

- Ryan

My, that is going to be a tough one. Honestly, I suggest you do some research and write some code before coming back here. - DreadHeadedDeveloper

@ElliottFrisch I only care about Java code. - Ryan

1

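I believe you need to do a bit more than solve the halting problem. Good luck! Maybe you could "cheat" and manually tag the text with a marker such as "text:". - Elliott Frisch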

@ElliottFrisch Thanks! I will take a look. - Ryan

1

Is the code guaranteed to be Java? In some languages, code can also be valid English. http://en.wikipedia.org/wiki/Shakespeare_(programming_language) - Louis Wasserman

7 Answers

3

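Look into lexical analysis and parsing (as you would when writing a compiler). If you do not need complete statements, you may not even need a parser.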

- Platinum Azure

Your answer gave me some hints, and now I have a few ideas. +1! - Ryan

2

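The basic idea is to convert the string into a sequence of tokens. For example, the line of code above becomes "KEY, SEPARATOR, ID, ASSIGN, NUMBER, SEPARATOR, ...". We can then use simple rules to separate code from English. See the code here.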

- user2250367

1

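There is no need to reinvent the wheel; the compiler already does this work for you. The first phase of any compilation checks whether the tokens in the file belong to the language. That clearly does not help us, because at the token level English and Java look much the same: most English words are also valid Java identifiers. The second phase, syntax analysis, is different. It reports an error for any English sentence, however well-formed, just as it does for syntactically incorrect Java. So why not use the Java compiler you already have, instead of pulling in external libraries and trying alternative approaches?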

You could use a wrapper class such as:

public class Test {

    public static void main(String[] args) {

        /* Insert the code to check here */

    }

}

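If the code compiles successfully, all is well and you know it is valid code. Of course, this does not work for incomplete snippets, such as the for loop in your example, which has no closing brace.

If compilation fails, you can handle the string in several ways. For example, you could try to parse it with a home-made pseudo-English parser built with flex and bison, the tools GNU has used to build tools such as GCC. I am not sure what kind of program you are trying to write, but this way you can tell whether the string is code, a hand-crafted English sentence, or garbage you do not need to care about.

Parsing natural language is very hard. Current approaches rely on statistical methods that are not always correct, which may not be what you want in your program.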
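The wrapper idea can be tried without leaving Java by using the standard javax.tools compiler API. The following is a minimal sketch under some assumptions: it needs a JDK (not a bare JRE), the class name SnippetCompiler and the helper StringSource are made up for this example, and the generated class files are written to a temporary directory that is not cleaned up here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

class SnippetCompiler {

    /** In-memory source file, so the wrapped snippet never has to be saved to disk. */
    static class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Wraps the snippet in Test.main and reports whether javac accepts it. */
    static boolean compiles(String snippet) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler found; run on a JDK");
        }
        String source = "public class Test {\n"
                + "  public static void main(String[] args) {\n"
                + snippet + "\n"
                + "  }\n"
                + "}\n";
        try {
            // Class files go to a throwaway directory instead of the working directory.
            Path outDir = Files.createTempDirectory("snippet-check");
            // Collect diagnostics instead of printing them to stderr.
            DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
            JavaCompiler.CompilationTask task = compiler.getTask(
                    null, null, diagnostics,
                    Arrays.asList("-d", outDir.toString(), "-proc:none"),
                    null,
                    Collections.singletonList(new StringSource("Test", source)));
            return task.call();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("int x = 1; x++;"));
        System.out.println(compiles("do something in English"));
    }
}
```

Because this runs a full compile, a snippet that refers to undeclared names (such as b in the question's for loop) is rejected even when its syntax is fine, which matches the limitation described above.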

- Álvaro Gómez

This assumes the code is not a complete class. It also assumes there are no programming errors. - Joseph K. Strauss

1

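Here is a very simple approach that seems to work reasonably well on a few samples. Remove the System.out calls; they are there only for illustration. As the sample output shows, code comments look like text, so you may get wrong results if large non-Javadoc block comments are mixed into the code. The hard-coded thresholds are my own estimates. Feel free to fine-tune them.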

public static void main(String[] args) {
    for(String arg : args){
        System.out.println(arg);
        System.out.println(codeStatus(arg));
    }
}

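static CodeStatus codeStatus (String string) {
    // Split at word boundaries, so words and the runs between them become separate pieces
    String[] words = string.split("\\b");
    int nonText = 0;
    for(String word: words){
        // A piece counts as text if it is a plain word, a number (the unescaped '.'
        // matches any character), a single space, period or comma, or any one
        // character followed by a space. Everything else counts as non-text.
        if(!word.matches("^[A-Za-z][a-z]*|[0-9]+(.[0-9]+)?|[ .,]|. $")){
            nonText ++;
        }
    }
    System.out.print("\n");
    // Fraction of pieces that do not look like ordinary text
    double percentage = ((double) nonText) / words.length;
    System.out.println(percentage);
    // Thresholds are estimates: above 20% is code, below 10% is text
    if(percentage > .2){
        return CodeStatus.CODE;
    }
    if(percentage < .1){
        return CodeStatus.TEXT;
    }
    return CodeStatus.INDETERMINATE;
}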

enum CodeStatus {
    CODE, TEXT, INDETERMINATE
}

Sample output:

You can try the OpenNLP sentence parser. It returns the n best parses for a sentence. For most English sentences it returns at least one. I believe, that for most code snippets it won't return any and hence you can be quite sure it is not an English sentence.

0.0297029702970297
TEXT
Use this code for parsing:

0.18181818181818182
INDETERMINATE
    // Initialize the sentence detector

0.125
INDETERMINATE
    final SentenceDetectorME sdetector = EasyParserUtils
            .getOpenNLPSentDetector(Constants.SENTENCE_DETECTOR_DATA);

0.6
CODE
    // Initialize the parser

0.16666666666666666
INDETERMINATE
    final Parser parser = EasyParserUtils
            .getOpenNLPParser(Constants.PARSER_DATA_LOC);

0.5333333333333333
CODE
    // Get sentences of the text

0.1
INDETERMINATE
    final String sentences[] = sdetector.sentDetect(essay);

0.38461538461538464
CODE
    // Go through the sentences and parse each

0.07142857142857142
TEXT
    for (final String sentence : sentences) {
        // Parse the sentence, produce only 1 parse
        final Parse[] parses = ParserTool.parseLine(sentence, parser, 10);
        if (parses.length == 0) {
            // Most probably this is code
        }
        else {
            // An English sentence
        }
    }

0.2537313432835821
CODE
and these are the two helper methods (from EasyParserUtils) used in the code:

0.14814814814814814
INDETERMINATE
public static Parser getOpenNLPParser(final String parserDataURL) {
    try (final InputStream isParser = new FileInputStream(parserDataURL);) {
        // Get model for the parser and initialize it
        final ParserModel parserModel = new ParserModel(isParser);
        return ParserFactory.create(parserModel);
    }
    catch (final IOException e) {

0.3835616438356164
CODE

- Joseph K. Strauss

1

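You could use a Java parser, or build one from a BNF grammar, but the problem is that, as you said, the code may not be parsable, so that would fail.

My suggestion: use some custom regular expressions to detect patterns that are specific to code. Use as many as you can to get a good success rate.

Some examples: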

for\s*\( (for loop)
while\s*\( (while loop)
[a-zA-Z_$][a-zA-Z\d_$]*\s*\( (method or constructor call)
\)\s*\{ (start of a block or method)
...

Yes, it is a long shot, but given what you are asking for, you do not have many options.
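A small sketch of how the patterns above might be combined. The class name and the rule "at least two distinct patterns must match" are assumptions for illustration. Some such rule is needed because the call pattern on its own also matches ordinary English such as "English (not", where a word is followed by an opening parenthesis.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PatternHeuristic {
    // The four example patterns from the list above.
    private static final List<Pattern> CODE_PATTERNS = Arrays.asList(
            Pattern.compile("for\\s*\\("),                        // for loop
            Pattern.compile("while\\s*\\("),                      // while loop
            Pattern.compile("[a-zA-Z_$][a-zA-Z\\d_$]*\\s*\\("),   // method or constructor call
            Pattern.compile("\\)\\s*\\{"));                       // start of a block or method

    /** Number of distinct patterns that occur somewhere in s. */
    static int score(String s) {
        int matched = 0;
        for (Pattern p : CODE_PATTERNS) {
            if (p.matcher(s).find()) {
                matched++;
            }
        }
        return matched;
    }

    // Assumed rule: two or more matching patterns means "code".
    static boolean looksLikeCode(String s) {
        return score(s) >= 2;
    }

    public static void main(String[] args) {
        System.out.println(score("for (int i = 0; i < b.size(); i++) {"));
        System.out.println(score("do something in English (not necessary to be a sentence)."));
    }
}
```

A single method call on its own line would score only 1 here, so in practice more patterns and some weighting would probably be needed.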

- ToYonos

0

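Here is a solution that is safe and works well. The basic idea is to first collect all the available keywords and special characters, and then use those sets to build a tokenizer. For example, the line of code in the question becomes "KEY, SEPARATOR, ID, ASSIGN, NUMBER, SEPARATOR, ...". We can then use simple rules to separate code from English.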
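The tokenizer idea could look something like the sketch below. The token names KEY, SEPARATOR, ID, ASSIGN and NUMBER come from the description above; everything else is an assumption for illustration, including the partial keyword list, the extra OPERATOR and OTHER categories, and the "simple rule" (more than 40% of the tokens are symbols).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenClassifier {
    enum Type { KEY, ID, NUMBER, ASSIGN, OPERATOR, SEPARATOR, OTHER }

    // Partial keyword list, for illustration only.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "for", "while", "do", "if", "else", "int", "double", "boolean", "return",
            "new", "class", "public", "private", "static", "void", "final", "try", "catch"));

    // One alternative per token class; whitespace matches nothing and is skipped.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<word>[A-Za-z_$][A-Za-z0-9_$]*)"
            + "|(?<num>\\d+(?:\\.\\d+)?)"
            + "|(?<op>\\+\\+|--|==|!=|<=|>=|&&|\\|\\||[-+*/%<>!&|^~?:])"
            + "|(?<assign>=)"
            + "|(?<sep>[(){}\\[\\];,.])"
            + "|(?<other>\\S)");

    static List<Type> tokenize(String s) {
        List<Type> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            if (m.group("word") != null) {
                tokens.add(KEYWORDS.contains(m.group("word")) ? Type.KEY : Type.ID);
            } else if (m.group("num") != null) {
                tokens.add(Type.NUMBER);
            } else if (m.group("op") != null) {
                tokens.add(Type.OPERATOR);
            } else if (m.group("assign") != null) {
                tokens.add(Type.ASSIGN);
            } else if (m.group("sep") != null) {
                tokens.add(Type.SEPARATOR);
            } else {
                tokens.add(Type.OTHER);
            }
        }
        return tokens;
    }

    // Assumed rule: code if more than 40% of the tokens are symbols rather than words or numbers.
    static boolean looksLikeCode(String s) {
        List<Type> tokens = tokenize(s);
        if (tokens.isEmpty()) {
            return false;
        }
        long symbols = tokens.stream()
                .filter(t -> t != Type.ID && t != Type.KEY && t != Type.NUMBER)
                .count();
        return (double) symbols / tokens.size() > 0.4;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("for (int i = 0; i < b.size(); i++) {"));
        System.out.println(looksLikeCode("do something in English (not necessary to be a sentence)."));
    }
}
```

Note that this sketch classifies int as KEY, so its token sequence for the question's line differs slightly from the one quoted above.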

- Ryan

- Augustin · Accepted Answer

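You can try the OpenNLP sentence parser. It returns the n best parses for a sentence. For most English sentences it returns at least one. I believe that for most code snippets it will not return any, and hence you can be quite sure the string is not an English sentence.

Use this code for parsing: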

    // Initialize the sentence detector
    final SentenceDetectorME sdetector = EasyParserUtils
            .getOpenNLPSentDetector(Constants.SENTENCE_DETECTOR_DATA);

    // Initialize the parser
    final Parser parser = EasyParserUtils
            .getOpenNLPParser(Constants.PARSER_DATA_LOC);

    // Get sentences of the text
    final String sentences[] = sdetector.sentDetect(essay);

    // Go through the sentences and parse each
    for (final String sentence : sentences) {
        // Parse the sentence, requesting up to 10 parses
        final Parse[] parses = ParserTool.parseLine(sentence, parser, 10);
        if (parses.length == 0) {
            // Most probably this is code
        }
        else {
            // An English sentence
        }
    }

These are the two helper methods (from EasyParserUtils) used in the code:

public static Parser getOpenNLPParser(final String parserDataURL) {
    try (final InputStream isParser = new FileInputStream(parserDataURL);) {
        // Get model for the parser and initialize it
        final ParserModel parserModel = new ParserModel(isParser);
        return ParserFactory.create(parserModel);
    }
    catch (final IOException e) {
        e.printStackTrace();
        return null;
    }
}

and

public static SentenceDetectorME getOpenNLPSentDetector(
        final String sentDetDataURL) {
    try (final InputStream isSent = new FileInputStream(sentDetDataURL)) {
        // Get models for sentence detector and initialize it
        final SentenceModel sentDetModel = new SentenceModel(isSent);
        return new SentenceDetectorME(sentDetModel);
    }
    catch (final IOException e) {
        e.printStackTrace();
        return null;
    }
}