Detecting syllables in a word

Question

nlp spell-checking hyphenation

156

I need to find a fairly efficient way to detect syllables in a word. For example,

Invisible -> in-vi-sib-le

There are a few syllabification rules that could be used:

V CV VC CVC CCV CCCV CVCC

*where V is a vowel and C is a consonant. For example,

Pronunciation (5 Pro-nun-ci-a-tion; CV-CVC-CV-V-CVC)
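A minimal Python sketch of the V/C notation above, assuming a purely letter-based vowel set (a, e, i, o, u). The helper name cv_pattern is hypothetical. Because it looks at letters rather than sounds, clusters such as "Pr" or "io" come out as two symbols, so its output can differ from a pattern written by ear.

```python
VOWELS = set("aeiou")  # assumption: letter-based vowels only; 'y' counts as a consonant

def cv_pattern(hyphenated):
    """Map each letter of a hyphenated word to C or V, keeping the hyphens."""
    parts = []
    for syllable in hyphenated.lower().split("-"):
        parts.append("".join("V" if ch in VOWELS else "C" for ch in syllable))
    return "-".join(parts)

print(cv_pattern("in-vi-sib-le"))       # VC-CV-CVC-CV
print(cv_pattern("Pro-nun-ci-a-tion"))  # CCV-CVC-CV-V-CVVC
```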

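I've tried a few methods, among them regular expressions (which help only if you want to count syllables), hard-coded rule definitions (a brute-force approach that proved very inefficient), and finally a finite state automaton (which did not yield anything useful).

The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell-checking applications (using Bayesian classifiers) and text-to-speech synthesis.

I would appreciate any tips on alternative ways to solve this problem besides my previous approaches.

I work in Java, but any tip in C/C++, C#, Python, Perl... would work for me.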

- user50705

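Do you want the actual division points or just the number of syllables in a word? If the latter, consider looking up the words in a text-to-speech dictionary and counting the phonemes that encode vowel sounds. - Adrian McCarthy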

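The most efficient way (computation-wise, not storage-wise) would, I'd guess, be a Python dictionary with words as keys and syllable counts as values. However, you'd still need a fallback for words that aren't in the dictionary. Let me know if you ever find such a dictionary! - Brōtsyorfuzthrāx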
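A minimal sketch of that lookup-with-fallback idea. The table contents, the function names, and the choice of fallback (one syllable per run of vowel letters) are all assumptions made for illustration; the comment above does not specify any of them.

```python
import re

# Hypothetical hand-made table; a real one would be built from a pronouncing dictionary.
SYLLABLE_COUNTS = {"invisible": 4, "pronunciation": 5}

def heuristic_count(word):
    """Fallback assumption: one syllable per run of vowel letters, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def count_syllables(word):
    """Use the dictionary when the word is known, the heuristic otherwise."""
    key = word.lower()
    if key in SYLLABLE_COUNTS:
        return SYLLABLE_COUNTS[key]
    return heuristic_count(key)

print(count_syllables("Invisible"))  # 4, from the table
print(count_syllables("banana"))     # 3, from the fallback
```

The lookup is a constant-time hash access, so the cost of the fallback only matters for out-of-dictionary words.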

17 Answers

50

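I stumbled across this page looking for the same thing, and found a few implementations of the Liang paper here:

https://github.com/mnater/hyphenator or its successor: https://github.com/mnater/Hyphenopoly

That is, unless you're the type who enjoys reading a 60-page thesis instead of using freely available code for a non-unique problem. :)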

- Sean

Agreed - it's much more convenient to just use an existing implementation. - hoju

43

Here is a solution using NLTK:

from nltk.corpus import cmudict
d = cmudict.dict()
def nsyl(word):
  return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]]

- hoju

Hey, thanks. There was a tiny error in the function; it should be def nsyl(word): return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]] - Gourneau

6

What would you suggest as a fallback for words that aren't in that corpus? - Dan Gayle

4

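@Pureferret cmudict is a pronouncing dictionary for North American English words. It splits words into phonemes, which are shorter than syllables (e.g. the word "cat" is split into three phonemes: K-AE-T). Vowels also carry a "stress marker": 0, 1, or 2, depending on how the word is pronounced (so the AE in "cat" becomes AE1). The code in the answer counts the stress markers, and therefore the vowels, which effectively gives the number of syllables (notice how in the OP's examples each syllable has exactly one vowel). - billy_chapters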
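The stress-marker counting described above can be sketched without NLTK. The two pronunciations below are written by hand in the CMUdict style purely for illustration, and the function name is hypothetical.

```python
def syllables_from_phonemes(phonemes):
    """Count the phonemes that end in a stress digit (0, 1 or 2), i.e. the vowels."""
    return sum(1 for p in phonemes if p[-1].isdigit())

# Hand-written pronunciations in the CMUdict style (illustrative, not loaded from the corpus)
cat = ["K", "AE1", "T"]
invisible = ["IH0", "N", "V", "IH1", "Z", "AH0", "B", "AH0", "L"]

print(syllables_from_phonemes(cat))        # 1
print(syllables_from_phonemes(invisible))  # 4
```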

6

This returns the number of syllables, not the syllabification. - Adam Michael Wood

21

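I'm trying to tackle this problem for a program that calculates the Flesch-Kincaid and Flesch reading scores of a block of text. My algorithm uses what I found on this website: http://www.howmanysyllables.com/howtocountsyllables.html and it gets reasonably close. It still has trouble with complicated words like invisible and hyphenation, but I've found it gets close enough for my purposes.

It has the upside of being easy to implement. I found that "es" can be either syllabic or not. It's a gamble, but I decided to remove the "es" in my algorithm.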

private int CountSyllables(string word)
    {
        char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
        string currentWord = word;
        int numVowels = 0;
        bool lastWasVowel = false;
        foreach (char wc in currentWord)
        {
            bool foundVowel = false;
            foreach (char v in vowels)
            {
                //don't count diphthongs
                if (v == wc && lastWasVowel)
                {
                    foundVowel = true;
                    lastWasVowel = true;
                    break;
                }
                else if (v == wc && !lastWasVowel)
                {
                    numVowels++;
                    foundVowel = true;
                    lastWasVowel = true;
                    break;
                }
            }

            //if full cycle and no vowel found, set lastWasVowel to false;
            if (!foundVowel)
                lastWasVowel = false;
        }
        //remove es, it's _usually_ silent
        if (currentWord.Length > 2 && 
            currentWord.Substring(currentWord.Length - 2) == "es")
            numVowels--;
        // remove silent e
        else if (currentWord.Length > 1 &&
            currentWord.Substring(currentWord.Length - 1) == "e")
            numVowels--;

        return numVowels;
    }
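For context, a short Python sketch of how a syllable count feeds into the two readability scores mentioned above. It uses the standard published formulas; the function names and the sample counts are made up for illustration.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Made-up counts: 100 words, 5 sentences, 150 syllables
print(round(flesch_reading_ease(100, 5, 150), 3))   # 59.635
print(round(flesch_kincaid_grade(100, 5, 150), 2))  # 9.91
```

Both scores depend on syllables only through the average syllables per word, which is why a counter that is off by one on some words can still be close enough over a whole text.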

- Joe Basirico

1

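This seems to work well so far for my simple scenario of finding syllables in proper names. Thanks for posting it here. - Norman H

It's a decent try, but even after some simple testing it doesn't seem very accurate. For example, "anyone" returns 1 syllable instead of 3, "Minute" returns 3 instead of 2, and "Another" returns 2 instead of 3. - Aidan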

9

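This is a particularly difficult problem that is not completely solved by the LaTeX hyphenation algorithm. A good summary of some available methods and the challenges involved can be found in the paper Evaluating Automatic Syllabification Algorithms for English (Marchand, Adsett, and Damper 2007).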

- Chris

8

Why calculate it? Every online dictionary has this information, e.g. http://dictionary.reference.com/browse/invisible for "invisible".

- Cerin

5

Maybe it has to work for words that don't appear in dictionaries, such as names? - Wouter Lievens

4

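I don't think names are regular enough for automatic syllabification. An English syllabifier would fail miserably on names of Welsh or Scottish origin, let alone names of Indian and Nigerian origin, yet you might find all of these in a single room somewhere in, say, London. - Jean-François Corbett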

1

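One must keep in mind that it's not reasonable to expect better performance than a human could provide, considering this is a purely heuristic approach to a fuzzy domain. - Darren Ringer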

6

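Today I found texhyphj, a Java implementation of Frank Liang's hyphenation algorithm with patterns for English or German. It works quite well and is available on Maven Central.

Caveat: it is important to remove the last lines of the .tex pattern files, because otherwise those files cannot be loaded with the current version available on Maven Central.

To load and use the hyphenator, you can use the following Java code snippet. texTable is the name of the .tex file containing the needed patterns. Those files are available on the project's GitHub site.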

 private Hyphenator createHyphenator(String texTable) {
        Hyphenator hyphenator = new Hyphenator();
        hyphenator.setErrorHandler(new ErrorHandler() {
            public void debug(String guard, String s) {
                logger.debug("{},{}", guard, s);
            }

            public void info(String s) {
                logger.info(s);
            }

            public void warning(String s) {
                logger.warn("WARNING: " + s);
            }

            public void error(String s) {
                logger.error("ERROR: " + s);
            }

            public void exception(String s, Exception e) {
                logger.error("EXCEPTION: " + s, e);
            }

            public boolean isDebugged(String guard) {
                return false;
            }
        });

        BufferedReader table = null;

        try {
            table = new BufferedReader(new InputStreamReader(Thread.currentThread().getContextClassLoader()
                    .getResourceAsStream((texTable)), Charset.forName("UTF-8")));
            hyphenator.loadTable(table);
        } catch (Utf8TexParser.TexParserException e) {
            logger.error("error loading hyphenation table: {}", e.getLocalizedMessage(), e);
            throw new RuntimeException("Failed to load hyphenation table", e);
        } finally {
            if (table != null) {
                try {
                    table.close();
                } catch (IOException e) {
                    logger.error("Closing hyphenation table failed", e);
                }
            }
        }

        return hyphenator;
    }

Afterwards the Hyphenator is ready to use. To detect syllables, the basic idea is to split the term at the provided hyphens.

    String hyphenedTerm = hyphenator.hyphenate(term);

    String hyphens[] = hyphenedTerm.split("\u00AD");

    int syllables = hyphens.length;

You need to split on "\u00AD" (the soft hyphen), since the API does not return a normal "-".
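The same splitting step, sketched in Python for readers not using Java. The input string is hand-made and only stands in for a hyphenator's output; it is not produced by texhyphj.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, invisible in most renderers

def syllables_from_hyphenated(hyphenated):
    """Split a soft-hyphenated term into its parts."""
    return hyphenated.split(SOFT_HYPHEN)

# Hand-made example input (assumption: the hyphenator marks breaks with U+00AD)
parts = syllables_from_hyphenated("in\u00advis\u00adi\u00adble")
print(parts)       # ['in', 'vis', 'i', 'ble']
print(len(parts))  # 4
```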

This approach outperforms Joe Basirico's answer, since it supports many different languages and detects German hyphenation more accurately.

- rzo1

6

Bumping @Tihamer and @joe-basirico. A very useful function, not perfect, but good for most small-to-medium projects. Joe, I have rewritten your code in Python:

def countSyllables(word):
    vowels = "aeiouy"
    numVowels = 0
    lastWasVowel = False
    for wc in word:
        foundVowel = False
        for v in vowels:
            if v == wc:
                if not lastWasVowel: numVowels+=1   #don't count diphthongs
                foundVowel = lastWasVowel = True
                break
        if not foundVowel:  #If full cycle and no vowel found, set lastWasVowel to false
            lastWasVowel = False
    if len(word) > 2 and word[-2:] == "es": #Remove es - it's "usually" silent (?)
        numVowels-=1
    elif len(word) > 1 and word[-1:] == "e":    #remove silent e
        numVowels-=1
    return numVowels

Hope someone finds this useful!

- Tersosauros

6

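I ran into this exact same issue a little while ago.

I ended up using the CMU Pronouncing Dictionary for quick and accurate lookups of most words. For words not in the dictionary, I fell back to a machine learning model that is about 98% accurate at predicting syllable counts.

I wrapped the whole thing up in an easy-to-use Python module: https://github.com/repp/big-phoney
Install: pip install big-phoney
Count syllables: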

from big_phoney import BigPhoney
phoney = BigPhoney()
phoney.count_syllables('triceratops')  # --> 4

If you aren't using Python and want to try an ML-model-based approach, I did a pretty detailed write-up on Kaggle of how the syllable counting model works.

- Ryan Epp

The OP is looking for syllabification of letters, not phonemes. - rednoyz

5

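Thanks, Joe Basirico, for sharing your quick and dirty implementation in C#. I've used the big libraries, and they work, but they're usually a bit slow, and for quick projects your method works fine.

Here is your code in Java, along with test cases: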

public static int countSyllables(String word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    char[] currentWord = word.toCharArray();
    int numVowels = 0;
    boolean lastWasVowel = false;
    for (char wc : currentWord) {
        boolean foundVowel = false;
        for (char v : vowels)
        {
            //don't count diphthongs
            if ((v == wc) && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }
        // If full cycle and no vowel found, set lastWasVowel to false;
        if (!foundVowel)
            lastWasVowel = false;
    }
    // Remove es, it's _usually_ silent
    // (endsWith compares contents; == on Strings compares references in Java)
    if (word.length() > 2 && word.endsWith("es"))
        numVowels--;
    // remove silent e
    else if (word.length() > 1 && word.endsWith("e"))
        numVowels--;
    return numVowels;
}

public static void main(String[] args) {
    String txt = "what";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "super";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Maryland";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "American";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "disenfranchized";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Sophia";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
}

The result was as expected (it works well enough for Flesch-Kincaid):

txt=what countSyllables=1
txt=super countSyllables=2
txt=Maryland countSyllables=3
txt=American countSyllables=3
txt=disenfranchized countSyllables=5
txt=Sophia countSyllables=2

- Tihamer

- Jason · Accepted Answer

136

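Read about the TeX approach to this problem for the purposes of hyphenation. In particular, see Frank Liang's thesis Word Hy-phen-a-tion by Com-put-er. His algorithm is very accurate, and it includes a small exceptions dictionary for cases where the algorithm does not work.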
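A minimal Python sketch of the pattern-matching idea behind Liang's algorithm, under stated assumptions: the nine patterns are a tiny hand-picked subset (the commonly cited ones for the word "hyphenation"), whereas a real pattern file contains thousands; the function names are hypothetical; and the exceptions dictionary and minimum-fragment limits are left out. Digits in a pattern score the gap at that position, the highest score per gap wins, and an odd score allows a break.

```python
def parse_pattern(pattern):
    """Split e.g. 'hy3ph' into letters 'hyph' and gap scores [0, 0, 3, 0, 0]."""
    letters = "".join(ch for ch in pattern if not ch.isdigit())
    scores = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            scores[pos] = int(ch)  # score of the gap before letters[pos]
        else:
            pos += 1
    return letters, scores

# Assumption: a tiny illustrative subset, not a full pattern file
PATTERNS = [parse_pattern(p) for p in
            ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]]

def hyphenate(word, patterns=PATTERNS):
    text = "." + word.lower() + "."   # dots mark the word boundaries
    points = [0] * (len(text) + 1)    # points[i] scores the gap before text[i]
    for letters, scores in patterns:
        start = text.find(letters)
        while start != -1:
            for offset, score in enumerate(scores):
                points[start + offset] = max(points[start + offset], score)
            start = text.find(letters, start + 1)
    pieces, current = [], ""
    for i in range(1, len(text) - 1):        # walk the word's own letters
        if current and points[i] % 2 == 1:   # odd score: break allowed here
            pieces.append(current)
            current = ""
        current += text[i]
    pieces.append(current)
    return "-".join(pieces)

print(hyphenate("hyphenation"))  # hy-phen-ation
```

With so few patterns the sketch only handles this one word sensibly; the accuracy of the real algorithm comes from the size and tuning of the pattern set.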

- Jason

64

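I like that you've cited a thesis; it's a little hint to the original poster that this might not be an easy question. - Karl

I wrote up a quick post testing this approach, including statistics: http://allenporter.tumblr.com/post/9776954743/syllables. The hyphenation approach is promising, but an ad hoc method of counting vowel letters seems to be more accurate, since the hyphenation algorithm errs on the side of finding too few syllable breaks. As far as I can tell, this is definitely not a solved problem. - allenporter

@allenporter I looked at your page. According to your statistics, the hyphenation approach is not accurate enough. I also read two papers, http://eprints.soton.ac.uk/264285/1/MarchandAdsettDamper_ISCA07.pdf and http://web.cs.dal.ca/~adsett/publications/AdsMar_CompSyllMeth_2009.pdf. Do you know the SbA method from those papers? They claim an accuracy of around 95% for hyphenation. Could you tell me where and how to get the large dictionary (1m in size) you used for evaluation, so I can test with it? - Warren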

11

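Note that the TeX algorithm is for finding legitimate hyphenation points, which is not exactly the same as finding syllable divisions. It's true that hyphenation points fall on syllable divisions, but not all syllable divisions are valid hyphenation points. For example, hyphens aren't (usually) used within a letter or two of either end of a word. I also believe the TeX patterns were tuned to trade false negatives for false positives (never put a hyphen where it doesn't belong, even if that means missing some legitimate hyphenation opportunities). - Adrian McCarthy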
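The distinction can be illustrated with a small Python sketch: given a syllabification, keep only the boundaries that leave enough letters on each side. The function name is hypothetical, and the defaults of 2 and 3 are an assumption borrowed from plain TeX's \lefthyphenmin and \righthyphenmin settings.

```python
def legal_break_points(syllables, left_min=2, right_min=3):
    """Return the syllable boundaries (as letter offsets) that are also usable hyphenation points."""
    word_len = sum(len(s) for s in syllables)
    points, pos = [], 0
    for syllable in syllables[:-1]:
        pos += len(syllable)  # offset of the boundary after this syllable
        if pos >= left_min and word_len - pos >= right_min:
            points.append(pos)
    return points

print(legal_break_points(["in", "vi", "sib", "le"]))  # [2, 4]  (the break before "le" is dropped)
print(legal_break_points(["a", "bout"]))              # []      (one syllable boundary, no legal break)
```

So a hyphenator can report fewer breaks than there are syllable boundaries even when every break it does report is correct.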

1

I don't believe hyphenation is the answer either. - Ezequiel
