有人有一个好的Proper Case算法吗？

Question

有人有一个好的Proper Case算法吗？

26

有没有一种可信的 Proper Case 或 PCase 算法（类似于 UCase 或 Upper）？我正在寻找一些可以将像 "GEORGE BURDELL" 或 "george burdell" 这样的值转换为 "George Burdell" 的方法。

我有一个简单的算法可以处理简单的情况。理想情况下，希望有一种方法可以处理像 "O'REILLY" 这样的情况，并将其转换为 "O'Reilly"，但我知道这更困难。

我主要关注英语，如果只考虑英语，是否会简化事情？

更新： 我使用 C# 作为编程语言，但我可以从几乎任何语言进行转换（假定存在类似功能）。

我同意 McDonald's 场景是一个棘手的问题。我原本打算将它与我的 O'Reilly 示例一起提及，但在原始帖子中没有提到。

- jttraino

13个回答

11

@zwol：我会单独回复发布它。

这是一个基于ljs的帖子的示例。

void Main()
{
    List<string> names = new List<string>() {
        "bill o'reilly", 
        "johannes diderik van der waals", 
        "mr. moseley-williams", 
        "Joe VanWyck", 
        "mcdonald's", 
        "william the third", 
        "hrh prince charles", 
        "h.r.m. queen elizabeth the third",
        "william gates, iii", 
        "pope leo xii",
        "a.k. jennings"
    };
    
    names.Select(name => name.ToProperCase()).Dump();
}

// https://dev59.com/wXVD5IYBdhLWcg3wRpeX
public static class ProperCaseHelper
{
    public static string ToProperCase(this string input)
    {
        if (IsAllUpperOrAllLower(input))
        {
            // fix the ALL UPPERCASE or all lowercase names
            return string.Join(" ", input.Split(' ').Select(word => wordToProperCase(word)));
        }
        else
        {
            // leave the CamelCase or Propercase names alone
            return input;
        }
    }

    public static bool IsAllUpperOrAllLower(this string input)
    {
        return (input.ToLower().Equals(input) || input.ToUpper().Equals(input));
    }

    private static string wordToProperCase(string word)
    {
        if (string.IsNullOrEmpty(word)) return word;

        // Standard case
        string ret = capitaliseFirstLetter(word);

        // Special cases:
        ret = properSuffix(ret, "'");   // D'Artagnon, D'Silva
        ret = properSuffix(ret, ".");   // ???
        ret = properSuffix(ret, "-");       // Oscar-Meyer-Weiner
        ret = properSuffix(ret, "Mc", t => t.Length > 4);      // Scots
        ret = properSuffix(ret, "Mac", t => t.Length > 5);     // Scots except Macey

        // Special words:
        ret = specialWords(ret, "van");     // Dick van Dyke
        ret = specialWords(ret, "von");     // Baron von Bruin-Valt
        ret = specialWords(ret, "de");
        ret = specialWords(ret, "di");
        ret = specialWords(ret, "da");      // Leonardo da Vinci, Eduardo da Silva
        ret = specialWords(ret, "of");      // The Grand Old Duke of York
        ret = specialWords(ret, "the");     // William the Conqueror
        ret = specialWords(ret, "HRH");     // His/Her Royal Highness
        ret = specialWords(ret, "HRM");     // His/Her Royal Majesty
        ret = specialWords(ret, "H.R.H.");  // His/Her Royal Highness
        ret = specialWords(ret, "H.R.M.");  // His/Her Royal Majesty

        ret = dealWithRomanNumerals(ret);   // William Gates, III

        return ret;
    }

    private static string properSuffix(string word, string prefix, Func<string, bool> condition = null)
    {
        if (string.IsNullOrEmpty(word)) return word;
        if (condition != null && ! condition(word)) return word;
        
        string lowerWord = word.ToLower();
        string lowerPrefix = prefix.ToLower();

        if (!lowerWord.Contains(lowerPrefix)) return word;

        int index = lowerWord.IndexOf(lowerPrefix);

        // If the search string is at the end of the word ignore.
        if (index + prefix.Length == word.Length) return word;

        return word.Substring(0, index) + prefix +
            capitaliseFirstLetter(word.Substring(index + prefix.Length));
    }

    private static string specialWords(string word, string specialWord)
    {
        if (word.Equals(specialWord, StringComparison.InvariantCultureIgnoreCase))
        {
            return specialWord;
        }
        else
        {
            return word;
        }
    }

    private static string dealWithRomanNumerals(string word)
    {
        // Roman Numeral parser thanks to [djk](https://stackoverflow.com/users/785111/djk)
        // Note that it excludes the Chinese last name Xi
        return new Regex(@"\b(?!Xi\b)(X|XX|XXX|XL|L|LX|LXX|LXXX|XC|C)?(I|II|III|IV|V|VI|VII|VIII|IX)?\b", RegexOptions.IgnoreCase).Replace(word, match => match.Value.ToUpperInvariant());
    }

    private static string capitaliseFirstLetter(string word)
    {
        return char.ToUpper(word[0]) + word.Substring(1).ToLower();
    }

}

- Colin

2

我们将这个加入了我们的生产系统。仅仅过了15分钟，就有一个客户问为什么“Macey”被设置为“MacEy”...所以我们删除了那一行代码并保留其他所有内容。谢谢！ - NotMe

谢谢！其实我也认识一个名叫梅西的人。嗯...等我有时间了，我会爬取苏格兰盖尔语名字维基百科页面上所有大小写混合的单词，并将其添加进来。http://en.wikipedia.org/wiki/List_of_Scottish_Gaelic_surnames - Colin

1

如果罗马数字旁边有符号，比如“Sir William III.”，这个方法就会失败。我将DealWithRomanNumerals更改为这个一行代码，效果非常好：

return new Regex(@"\b(?!Xi\b)(X|XX|XXX|XL|L|LX|LXX|LXXX|XC|C)?(I|II|III|IV|V|VI|VII|VIII|IX)?\b", RegexOptions.IgnoreCase).Replace(word, match => match.Value.ToUpperInvariant());

-- 这个方法还过滤了常见的中文名字“Xi”。 - djk

5

我快速完成了基于Lingua::EN::NameCase的https://github.com/tamtamchik/namecase的C#移植。

public static class CIQNameCase
{
    static Dictionary<string, string> _exceptions = new Dictionary<string, string>
        {
            {@"\bMacEdo"     ,"Macedo"},
            {@"\bMacEvicius" ,"Macevicius"},
            {@"\bMacHado"    ,"Machado"},
            {@"\bMacHar"     ,"Machar"},
            {@"\bMacHin"     ,"Machin"},
            {@"\bMacHlin"    ,"Machlin"},
            {@"\bMacIas"     ,"Macias"},
            {@"\bMacIulis"   ,"Maciulis"},
            {@"\bMacKie"     ,"Mackie"},
            {@"\bMacKle"     ,"Mackle"},
            {@"\bMacKlin"    ,"Macklin"},
            {@"\bMacKmin"    ,"Mackmin"},
            {@"\bMacQuarie"  ,"Macquarie"}
        };

    static Dictionary<string, string> _replacements = new Dictionary<string, string>
        {
            {@"\bAl(?=\s+\w)"         , @"al"},        // al Arabic or forename Al.
            {@"\b(Bin|Binti|Binte)\b" , @"bin"},       // bin, binti, binte Arabic
            {@"\bAp\b"                , @"ap"},        // ap Welsh.
            {@"\bBen(?=\s+\w)"        , @"ben"},       // ben Hebrew or forename Ben.
            {@"\bDell([ae])\b"        , @"dell$1"},    // della and delle Italian.
            {@"\bD([aeiou])\b"        , @"d$1"},       // da, de, di Italian; du French; do Brasil
            {@"\bD([ao]s)\b"          , @"d$1"},       // das, dos Brasileiros
            {@"\bDe([lrn])\b"         , @"de$1"},      // del Italian; der/den Dutch/Flemish.
            {@"\bEl\b"                , @"el"},        // el Greek or El Spanish.
            {@"\bLa\b"                , @"la"},        // la French or La Spanish.
            {@"\bL([eo])\b"           , @"l$1"},       // lo Italian; le French.
            {@"\bVan(?=\s+\w)"        , @"van"},       // van German or forename Van.
            {@"\bVon\b"               , @"von"}        // von Dutch/Flemish
        };

    static string[] _conjunctions = { "Y", "E", "I" };

    static string _romanRegex = @"\b((?:[Xx]{1,3}|[Xx][Ll]|[Ll][Xx]{0,3})?(?:[Ii]{1,3}|[Ii][VvXx]|[Vv][Ii]{0,3})?)\b";

    /// <summary>
    /// Case a name field into its appropriate case format 
    /// e.g. Smith, de la Cruz, Mary-Jane,  O'Brien, McTaggart
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    public static string NameCase(string nameString)
    {
        // Capitalize
        nameString = Capitalize(nameString);
        nameString = UpdateIrish(nameString);

        // Fixes for "son (daughter) of" etc
        foreach (var replacement in _replacements.Keys)
        {
            if (Regex.IsMatch(nameString, replacement))
            {
                Regex rgx = new Regex(replacement);
                nameString = rgx.Replace(nameString, _replacements[replacement]);
            }                    
        }

        nameString = UpdateRoman(nameString);
        nameString = FixConjunction(nameString);

        return nameString;
    }

    /// <summary>
    /// Capitalize first letters.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string Capitalize(string nameString)
    {
        nameString = nameString.ToLower();
        nameString = Regex.Replace(nameString, @"\b\w", x => x.ToString().ToUpper());
        nameString = Regex.Replace(nameString, @"'\w\b", x => x.ToString().ToLower()); // Lowercase 's
        return nameString;
    }

    /// <summary>
    /// Update for Irish names.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string UpdateIrish(string nameString)
    {
        if(Regex.IsMatch(nameString, @".*?\bMac[A-Za-z^aciozj]{2,}\b") || Regex.IsMatch(nameString, @".*?\bMc"))
        {
            nameString = UpdateMac(nameString);
        }            
        return nameString;
    }

    /// <summary>
    /// Updates irish Mac & Mc.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string UpdateMac(string nameString)
    {
        MatchCollection matches = Regex.Matches(nameString, @"\b(Ma?c)([A-Za-z]+)");
        if(matches.Count == 1 && matches[0].Groups.Count == 3)
        {
            string replacement = matches[0].Groups[1].Value;
            replacement += matches[0].Groups[2].Value.Substring(0, 1).ToUpper();
            replacement += matches[0].Groups[2].Value.Substring(1);
            nameString = nameString.Replace(matches[0].Groups[0].Value, replacement);

            // Now fix "Mac" exceptions
            foreach (var exception in _exceptions.Keys)
            {
                nameString = Regex.Replace(nameString, exception, _exceptions[exception]);
            }
        }
        return nameString;
    }

    /// <summary>
    /// Fix roman numeral names.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string UpdateRoman(string nameString)
    {
        MatchCollection matches = Regex.Matches(nameString, _romanRegex);
        if (matches.Count > 1)
        {
            foreach(Match match in matches)
            {
                if(!string.IsNullOrEmpty(match.Value))
                {
                    nameString = Regex.Replace(nameString, match.Value, x => x.ToString().ToUpper());
                }
            }
        }
        return nameString;
    }

    /// <summary>
    /// Fix Spanish conjunctions.
    /// </summary>
    /// <param name=""></param>
    /// <returns></returns>
    private static string FixConjunction(string nameString)
    {            
        foreach (var conjunction in _conjunctions)
        {
            nameString = Regex.Replace(nameString, @"\b" + conjunction + @"\b", x => x.ToString().ToLower());
        }
        return nameString;
    }
}

使用方法

string name_cased = CIQNameCase.NameCase("McCarthy");

这是我的测试方法，一切看起来都通过了：

[TestMethod]
public void Test_NameCase_1()
{
    string[] names = {
        "Keith", "Yuri's", "Leigh-Williams", "McCarthy",
        // Mac exceptions
        "Machin", "Machlin", "Machar",
        "Mackle", "Macklin", "Mackie",
        "Macquarie", "Machado", "Macevicius",
        "Maciulis", "Macias", "MacMurdo",
        // General
        "O'Callaghan", "St. John", "von Streit",
        "van Dyke", "Van", "ap Llwyd Dafydd",
        "al Fahd", "Al",
        "el Grecco",
        "ben Gurion", "Ben",
        "da Vinci",
        "di Caprio", "du Pont", "de Legate",
        "del Crond", "der Sind", "van der Post", "van den Thillart",
        "von Trapp", "la Poisson", "le Figaro",
        "Mack Knife", "Dougal MacDonald",
        "Ruiz y Picasso", "Dato e Iradier", "Mas i Gavarró",
        // Roman numerals
        "Henry VIII", "Louis III", "Louis XIV",
        "Charles II", "Fred XLIX", "Yusof bin Ishak",
    };

    foreach(string name in names)
    {
        string name_upper = name.ToUpper();
        string name_cased = CIQNameCase.NameCase(name_upper);
        Console.WriteLine(string.Format("name: {0} -> {1}  -> {2}", name, name_upper, name_cased));
        Assert.IsTrue(name == name_cased);
    }

}

- user326608

1

看起来很不错！顺便说一下，当我粘贴C#7时，我会得到错误[成员名称不能与其封闭类型相同]（https://i.imgur.com/RG20zF9.png）。我只需将类重命名即可解决问题。 - KyleMit

1

我对这个进行了一些重构：https://gist.github.com/dystopiandev/df1f4a496f395643557346dfd93fe6d8 - dystopiandev

4

还有一个很棒的Perl脚本可以将文本转换成标题格式。

http://daringfireball.net/2008/08/title_case_update

#!/usr/bin/perl

#     This filter changes all words to Title Caps, and attempts to be clever
# about *un*capitalizing small words like a/an/the in the input.
#
# The list of "small words" which are not capped comes from
# the New York Times Manual of Style, plus 'vs' and 'v'. 
#
# 10 May 2008
# Original version by John Gruber:
# http://daringfireball.net/2008/05/title_case
#
# 28 July 2008
# Re-written and much improved by Aristotle Pagaltzis:
# http://plasmasturm.org/code/titlecase/
#
#   Full change log at __END__.
#
# License: http://www.opensource.org/licenses/mit-license.php
#


use strict;
use warnings;
use utf8;
use open qw( :encoding(UTF-8) :std );


my @small_words = qw( (?<!q&)a an and as at(?!&t) but by en for if in of on or the to v[.]? via vs[.]? );
my $small_re = join '|', @small_words;

my $apos = qr/ (?: ['’] [[:lower:]]* )? /x;

while ( <> ) {
  s{\A\s+}{}, s{\s+\z}{};

  $_ = lc $_ if not /[[:lower:]]/;

  s{
      \b (_*) (?:
          ( (?<=[ ][/\\]) [[:alpha:]]+ [-_[:alpha:]/\\]+ |   # file path or
            [-_[:alpha:]]+ [@.:] [-_[:alpha:]@.:/]+ $apos )  # URL, domain, or email
          |
          ( (?i: $small_re ) $apos )                         # or small word (case-insensitive)
          |
          ( [[:alpha:]] [[:lower:]'’()\[\]{}]* $apos )       # or word w/o internal caps
          |
          ( [[:alpha:]] [[:alpha:]'’()\[\]{}]* $apos )       # or some other word
      ) (_*) \b
  }{
      $1 . (
        defined $2 ? $2         # preserve URL, domain, or email
      : defined $3 ? "\L$3"     # lowercase small word
      : defined $4 ? "\u\L$4"   # capitalize word w/o internal caps
      : $5                      # preserve other kinds of word
      ) . $6
  }xeg;


  # Exceptions for small words: capitalize at start and end of title
  s{
      (  \A [[:punct:]]*         # start of title...
      |  [:.;?!][ ]+             # or of subsentence...
      |  [ ]['"“‘(\[][ ]*     )  # or of inserted subphrase...
      ( $small_re ) \b           # ... followed by small word
  }{$1\u\L$2}xig;

  s{
      \b ( $small_re )      # small word...
      (?= [[:punct:]]* \Z   # ... at the end of the title...
      |   ['"’”)\]] [ ] )   # ... or of an inserted subphrase?
  }{\u\L$1}xig;

  # Exceptions for small words in hyphenated compound words
  ## e.g. "in-flight" -> In-Flight
  s{
      \b
      (?<! -)                 # Negative lookbehind for a hyphen; we don't want to match man-in-the-middle but do want (in-flight)
      ( $small_re )
      (?= -[[:alpha:]]+)      # lookahead for "-someword"
  }{\u\L$1}xig;

  ## # e.g. "Stand-in" -> "Stand-In" (Stand is already capped at this point)
  s{
      \b
      (?<!…)                  # Negative lookbehind for a hyphen; we don't want to match man-in-the-middle but do want (stand-in)
      ( [[:alpha:]]+- )       # $1 = first word and hyphen, should already be properly capped
      ( $small_re )           # ... followed by small word
      (?! - )                 # Negative lookahead for another '-'
  }{$1\u$2}xig;

  print "$_";
}

__END__

但是听起来你指的是“正确大小写”只适用于人名。

- Jeff Atwood

2

我今天写了这段代码，用于我正在开发的应用程序中。我认为这段代码很容易理解，并带有注释。它并不在所有情况下都是100％准确的，但它可以轻松处理大多数西方姓名。

例如：

mary-jane => Mary-Jane o'brien => O'Brien

Joël VON WINTEREGG => Joël von Winteregg

jose de la acosta => Jose de la Acosta

该代码可扩展，因为您可以将任何字符串值添加到顶部的数组中以满足您的需求。请仔细研究并添加任何可能需要的特殊功能。

function name_title_case($str)
{
  // name parts that should be lowercase in most cases
  $ok_to_be_lower = array('av','af','da','dal','de','del','der','di','la','le','van','der','den','vel','von');
  // name parts that should be lower even if at the beginning of a name
  $always_lower   = array('van', 'der');

  // Create an array from the parts of the string passed in
  $parts = explode(" ", mb_strtolower($str));

  foreach ($parts as $part)
  {
    (in_array($part, $ok_to_be_lower)) ? $rules[$part] = 'nocaps' : $rules[$part] = 'caps';
  }

  // Determine the first part in the string
  reset($rules);
  $first_part = key($rules);

  // Loop through and cap-or-dont-cap
  foreach ($rules as $part => $rule)
  {
    if ($rule == 'caps')
    {
      // ucfirst() words and also takes into account apostrophes and hyphens like this:
      // O'brien -> O'Brien || mary-kaye -> Mary-Kaye
      $part = str_replace('- ','-',ucwords(str_replace('-','- ', $part)));
      $c13n[] = str_replace('\' ', '\'', ucwords(str_replace('\'', '\' ', $part)));
    }
    else if ($part == $first_part && !in_array($part, $always_lower))
    {
      // If the first part of the string is ok_to_be_lower, cap it anyway
      $c13n[] = ucfirst($part);
    }
    else
    {
      $c13n[] = $part;
    }
  }

  $titleized = implode(' ', $c13n);

  return trim($titleized);
}

- gillytech

1

这里是一个可能有点幼稚的C#实现：

public class ProperCaseHelper {
  public string ToProperCase(string input) {
    string ret = string.Empty;

    var words = input.Split(' ');

    for (int i = 0; i < words.Length; ++i) {
      ret += wordToProperCase(words[i]);
      if (i < words.Length - 1) ret += " ";
    }

    return ret;
  }

  private string wordToProperCase(string word) {
    if (string.IsNullOrEmpty(word)) return word;

    // Standard case
    string ret = capitaliseFirstLetter(word);

    // Special cases:
    ret = properSuffix(ret, "'");
    ret = properSuffix(ret, ".");
    ret = properSuffix(ret, "Mc");
    ret = properSuffix(ret, "Mac");

    return ret;
  }

  private string properSuffix(string word, string prefix) {
    if(string.IsNullOrEmpty(word)) return word;

    string lowerWord = word.ToLower(), lowerPrefix = prefix.ToLower();
    if (!lowerWord.Contains(lowerPrefix)) return word;

    int index = lowerWord.IndexOf(lowerPrefix);

    // If the search string is at the end of the word ignore.
    if (index + prefix.Length == word.Length) return word;

    return word.Substring(0, index) + prefix +
      capitaliseFirstLetter(word.Substring(index + prefix.Length));
  }

  private string capitaliseFirstLetter(string word) {
    return char.ToUpper(word[0]) + word.Substring(1).ToLower();
  }
}

- kronoz

1

@Colin：请将您的版本作为独立答案发表，不要过度修改其他人的答案。 - zwol

1

你使用哪种编程语言？许多语言允许回调函数用于正则表达式匹配。这些可以用于轻松地将匹配的内容转换为 Propercase。要使用的正则表达式非常简单，只需匹配所有单词字符，如下所示：

/\w+/

或者，您可以提取第一个字符作为额外的匹配项：

/(\w)(\w*)/

现在您可以分别访问匹配中的第一个字符和后续字符。回调函数可以简单地返回命中的连接。在伪 Python 中（我实际上不知道 Python）：

def make_proper(match):
    return match[1].to_upper + match[2]

顺便提一下，这种方法也可以处理“O'Reilly”这种情况，因为“O”和“Reilly”会分别匹配，并且都是propercased。然而，还有其他一些特殊情况，算法处理得不是很好，比如“麦当劳”或者通常任何带撇号的单词。后者算法会生成“Mcdonald'S”。可以实现一个特殊的撇号处理，但那会影响到前一个情况。理论上找到一个完美的解决方案是不可能的。在实践中，考虑撇号后面部分的长度可能会有所帮助。

- Konrad Rudolph

1

我知道这个帖子已经开了一段时间，但是当我在为这个问题做研究时，我发现了这个非常方便的网站，它可以让你快速地将名字大写：https://dialect.ca/code/name-case/。我想把它包括在这里，供其他进行类似研究/项目的人参考。

他们在这个链接中发布了他们用php编写的算法：https://dialect.ca/code/name-case/name_case.phps

对他们的代码进行初步测试和阅读表明，他们非常细致。

- Michael Pinkard

嗨，来自2023年 - 所提及的代码已不再在线，但我通过archive.org找到了一份副本，并在这里发布：https://github.com/alexdunae/name_case - Alex Dunae

0

这里有很多好的答案。我的解决方案非常简单，只考虑我们组织中拥有的名称。您可以根据需要进行扩展。这不是一个完美的解决方案，并且会将温哥华更改为VanCouver，这是错误的。因此，如果您使用它，请进行微调。

以下是我在C#中的解决方案。这将硬编码名称到程序中，但是通过一些工作，您可以保留程序外部的文本文件并读取名称异常（即Van、Mc、Mac），并循环遍历它们。

public static String toProperName(String name)
{
    if (name != null)
    {
        if (name.Length >= 2 && name.ToLower().Substring(0, 2) == "mc")  // Changes mcdonald to "McDonald"
            return "Mc" + Regex.Replace(name.ToLower().Substring(2), @"\b[a-z]", m => m.Value.ToUpper());

        if (name.Length >= 3 && name.ToLower().Substring(0, 3) == "van")  // Changes vanwinkle to "VanWinkle"
            return "Van" + Regex.Replace(name.ToLower().Substring(3), @"\b[a-z]", m => m.Value.ToUpper());

        return Regex.Replace(name.ToLower(), @"\b[a-z]", m => m.Value.ToUpper());  // Changes to title case but also fixes 
                                                                                   // appostrophes like O'HARE or o'hare to O'Hare
    }

    return "";
}

- Klyphtn

0

Kronoz，谢谢。我在你的函数中发现了这一行：

`if (!lowerWord.Contains(lowerPrefix)) return word`;

必须说

if (!lowerWord.StartsWith(lowerPrefix)) return word;

所以，“información”不应该被改成“InforMacIón”。

祝好，

恩里克

- Enrique Martinez

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Markus Olsson · Accepted Answer

除非我误解了你的问题，否则我认为你不需要自己编写代码，TextInfo类可以为你完成。

using System.Globalization;

CultureInfo.InvariantCulture.TextInfo.ToTitleCase("GeOrGE bUrdEll")

将返回"George Burdell". 如果有特殊规则涉及，您可以使用自己的文化。

更新: Michael（在此回答的评论中）指出，如果输入是全部大写，则该方法将假定其为首字母缩写词而无法正常工作。这种情况下的简单解决方案是在将文本提交给ToTitleCase之前将其转换为小写。