Identify which strings in a list are human names and which are not

Question

7

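I have a vector like the one below and want to determine which elements are human names and which are not. I found the humaniformat package, which formats names, but unfortunately it does not determine whether a string is actually a name. I also found a few entity extraction packages, but they seem to need actual text for part-of-speech tagging rather than a single name. Example: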

pkd.names.quotes <- c("Mr. Rick Deckard", # Name
                      "Do Androids Dream of Electric Sheep", # Not a name
                      "Roy Batty", # Name 
                      "How much is an electric ostrich?", # Not a name
                      "My schedule for today lists a six-hour self-accusatory depression.", # Not a name
                      "Upon him the contempt of three planets descended.", # Not a name
                      "J.F. Sebastian", # Name
                      "Harry Bryant", # Name
                      "goat class", # Not a name
                      "Holden, Dave", # Name
                      "Leon Kowalski", # Name
                      "Dr. Eldon Tyrell") # Name

- Henry David Thorough

6

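My friend Electric Ostrich is going to be very upset when he sees that his name is not actually a name. So you would need to know what determines a name, right? But these days people name their kids just about anything (in the U.S. at least). Take Kanye West's kid, for example: the child's name is North West. Kanye is silly, but it is still a fact. So how would a name like that pass a name test? - Rich Scriven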

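Ha, fair point. I suppose I will get the names of Kanye's kids wrong. That is fine, though; some errors are acceptable. I am just hoping to do better than relying only on string length, number of spaces, and capitalization. - Henry David Thorough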

1

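The Stanford Named Entity Recognition "modules" are available for R. https://rpubs.com/lmullen/nlp-chapter has an introduction to NLP. http://nlp.stanford.edu/software/CRF-NER.shtml is the official source for the Java library, from which a solution could be built. - hrbrmstr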

1 Answer

- jlhoward · Accepted Answer

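Here is one approach. The US Census Bureau tabulates a list of surnames occurring more than 100 times (with frequencies): about 152,000 of them. If you use the full list, every one of your strings will contain a name. For instance, "class", "him", and "the" are names in certain languages (not sure which ones). Similarly, there are many lists of first names (see this post).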
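To illustrate why the full list over-matches, here is a minimal base R sketch. The two mini-dictionaries and the `contains_name` helper are hypothetical stand-ins made up for illustration (they are not census data and not part of tm), and the tokenization is a rough approximation of what a text-mining package would do.

```r
# Hypothetical mini-dictionaries standing in for the census lists (not real census data).
common_names <- c("rick", "deckard", "roy", "batty", "dave", "holden")
full_names   <- c(common_names, "class", "him", "the")  # rare names that are also ordinary words

# Hypothetical helper: TRUE when any lowercase word of a string is in the dictionary.
contains_name <- function(strings, dict) {
  words <- strsplit(tolower(strings), "[^a-z]+")        # split on anything that is not a letter
  vapply(words, function(w) any(w %in% dict), logical(1))
}

samples <- c("Mr. Rick Deckard", "Roy Batty", "goat class",
             "Upon him the contempt of three planets descended.")

contains_name(samples, common_names)  # TRUE TRUE FALSE FALSE
contains_name(samples, full_names)    # TRUE TRUE TRUE TRUE
```

With the broader dictionary, "goat class" and the sentence containing "him" and "the" are flagged as well, which is the reason for restricting the lists to their most common entries.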

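The code below grabs all surnames from the 2000 census and a list of first names from the post cited, subsets each list to its 10,000 most common entries, combines and cleans up the lists, and uses the result as a dictionary in the tm package to identify which strings contain names. You can control the "sensitivity" by changing the freq variable (freq = 10,000 seems to generate the result you want).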

url <- "http://www2.census.gov/topics/genealogy/2000surnames/names.zip"
tf <- tempfile()
download.file(url,tf, mode="wb")                     # download archive of surname data
files    <- unzip(tf, exdir=tempdir())               # unzips and returns a vector of file names
surnames <- read.csv(files[grepl("\\.csv$",files)])  # 152,000 surnames occurring >100 times
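first_url  <- "http://deron.meranda.us/data/census-derived-all-first.txt"
firstnames <- read.table(first_url, header=FALSE)    # first names in column V1
freq <- 10000                                        # how many of the most common names to keep from each list
dict  <- unique(c(tolower(surnames$name[1:freq]), tolower(firstnames$V1[1:freq])))
library(tm)
corp <- Corpus(VectorSource(pkd.names.quotes))       # one document per string
tdm  <- TermDocumentMatrix(corp, control=list(tolower=TRUE, dictionary=dict))  # count dictionary terms only
m    <- as.matrix(tdm)
m    <- m[rowSums(m)>0, , drop=FALSE]                # keep terms found in at least one string
m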
#            Docs
# Terms       1 2 3 4 5 6 7 8 9 10 11 12
#   bryant    0 0 0 0 0 0 0 1 0  0  0  0
#   dave      0 0 0 0 0 0 0 0 0  1  0  0
#   deckard   1 0 0 0 0 0 0 0 0  0  0  0
#   eldon     0 0 0 0 0 0 0 0 0  0  0  1
#   harry     0 0 0 0 0 0 0 1 0  0  0  0
#   kowalski  0 0 0 0 0 0 0 0 0  0  1  0
#   leon      0 0 0 0 0 0 0 0 0  0  1  0
#   rick      1 0 0 0 0 0 0 0 0  0  0  0
#   roy       0 0 1 0 0 0 0 0 0  0  0  0
#   sebastian 0 0 0 0 0 0 1 0 0  0  0  0
#   tyrell    0 0 0 0 0 0 0 0 0  0  0  1
which(colSums(m)>0)
#  1  3  7  8 10 11 12
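
As a possible follow-up, the document indices printed above can be turned into one label per string. The sketch below hard-codes those indices so it runs without downloading anything; in the real pipeline `name_docs` would presumably be `which(colSums(m) > 0)`. The variable names `name_docs` and `is_name` are introduced here for illustration only.

```r
pkd.names.quotes <- c("Mr. Rick Deckard",
                      "Do Androids Dream of Electric Sheep",
                      "Roy Batty",
                      "How much is an electric ostrich?",
                      "My schedule for today lists a six-hour self-accusatory depression.",
                      "Upon him the contempt of three planets descended.",
                      "J.F. Sebastian",
                      "Harry Bryant",
                      "goat class",
                      "Holden, Dave",
                      "Leon Kowalski",
                      "Dr. Eldon Tyrell")

# Indices copied from the output above; with the matrix available,
# use name_docs <- which(colSums(m) > 0) instead.
name_docs <- c(1, 3, 7, 8, 10, 11, 12)

# One logical label per input string.
is_name <- seq_along(pkd.names.quotes) %in% name_docs

data.frame(string = pkd.names.quotes, is_name = is_name)
pkd.names.quotes[is_name]    # strings flagged as containing a name
pkd.names.quotes[!is_name]   # strings with no dictionary match
```

Seven of the twelve strings are flagged, which agrees with the `# Name` comments in the question.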