How can I extract only the uppercase part of a string in Stata?

3
Here is a sample of the data:
part1
"Cambridge, Maryland TEST MODEL SEADROME"
"L.B. MAYER HONORED"
"A TOWN MOVES"
"U.S. SAVINGS BONDS RALLY"
"N.D. NOSES OUT S.M.U. BY 27 TO 20"
"Philadelphia, Pa. BURN 2,300 SQUEALERS"
"Odd Bits In To-day's News"
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING"
"Risk Death in Daring Race"
"Philadelphia, PA. IT'S HIGHER EDUCATION"
"806 DECORATIONS"
"Snow Hauled 20 Miles For Skiers"
"F.D.R. ASKS VICTORY EFFORT"

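Each string either has both an uppercase part and a lowercase part, or is entirely uppercase. I have been trying to use regular expressions to extract only the uppercase part of each string, without success. So far I can only identify strings that start or end with a specific number of uppercase characters: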
generate title = regexs(0) if regexm(part1, "^[A-Z][A-Z][A-Z].*[A-Z][A-Z][A-Z]$")

I also tried the following, which comes from another question on the forum:
generate title = regexs(0) if(regexm(part1, "\b[A-Z]{2,}\b"))

It should look for words with at least two consecutive uppercase letters, but for me it returns only missing values. I am using Stata 13.1 for Mac.
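As far as I know, regexm() in Stata 13 does not support the \b word-boundary anchor or {2,} interval quantifiers, which would explain the missing values. Stata 14 added the Unicode regular expression functions ustrregexm() and ustrregexs(), which do accept this pattern. A minimal sketch, assuming Stata 14 or later (so it will not run on 13.1), with title2 as a hypothetical variable name:

```stata
* Requires Stata 14 or later; title2 is a hypothetical variable name.
* Returns only the first run of two or more uppercase letters in each string.
generate title2 = ustrregexs(0) if ustrregexm(part1, "\b[A-Z]{2,}\b")
list part1 title2
```

Because only the first match is returned, this alone does not recover the whole uppercase title.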


1
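Not sure what you want: to get the passages that are all uppercase? Try ^[^a-z]+$. However, negated classes may not be supported. If that does not work, you will have to try a workaround, such as ^[A-Z][0-9A-Z~\@#$%^&*()_+ '=][{}\|'";:/?,.><-]+$`. - Wiktor Stribiżew
3 Answers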

0
As @stribizhev points out, negation may be one way to do it:
clear
set more off

input ///
str70 myvar
"Cambridge, Maryland TEST MODEL SEADROME"
"L.B. MAYER HONORED"
"A TOWN MOVES"
"U.S. SAVINGS BONDS RALLY"
"N.D. NOSES OUT S.M.U. BY 27 TO 20"
"Philadelphia, Pa. BURN 2,300 SQUEALERS"
"Odd Bits In To-day's News"
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING"
"Risk Death in Daring Race"
"Philadelphia, PA. IT'S HIGHER EDUCATION"
"806 DECORATIONS"
"Snow Hauled 20 Miles For Skiers"
"F.D.R. ASKS VICTORY EFFORT"
end

gen title = trim(regexs(2)) if regexm(myvar, "([,.]*)([^a-z]*$)")

list title

The result is

. list title

     +-----------------------------------------------+
     |                                         title |
     |-----------------------------------------------|
  1. |                           TEST MODEL SEADROME |
  2. |                            L.B. MAYER HONORED |
  3. |                                  A TOWN MOVES |
  4. |                      U.S. SAVINGS BONDS RALLY |
  5. |             N.D. NOSES OUT S.M.U. BY 27 TO 20 |
     |-----------------------------------------------|
  6. |                          BURN 2,300 SQUEALERS |
  7. |                                               |
  8. | N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING |
  9. |                                               |
 10. |                     PA. IT'S HIGHER EDUCATION |
     |-----------------------------------------------|
 11. |                               806 DECORATIONS |
 12. |                                               |
 13. |                    F.D.R. ASKS VICTORY EFFORT |
     +-----------------------------------------------+

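I think this is close to what you want, but it is not perfect. Unless the strings have some regular structure, it is hard to imagine a simple way to clean them. For example, compare the input and output for observations 6 and 10: "Pa." is dropped, but the all-caps "PA." is kept.

If you have a database of titles, you can compare and match against it after this initial cleaning. See, for example, ssc describe strgroup.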


0
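The question seems to imply that you expect a single regular expression specification to extract all instances. However reasonable that idea is, regular expressions do not work that way in Stata: you need to loop over the instances. This uses moss (ssc install moss), whose main purpose is exactly that. (If the second author of the program is reading this, the feeble pun implies that gathering moss is typical of his concerns.)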
clear 
input str100 part1
"Cambridge, Maryland TEST MODEL SEADROME"
"L.B. MAYER HONORED"
"A TOWN MOVES"
"U.S. SAVINGS BONDS RALLY"
"N.D. NOSES OUT S.M.U. BY 27 TO 20"
"Philadelphia, Pa. BURN 2,300 SQUEALERS"
"Odd Bits In To-day's News"
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING"
"Risk Death in Daring Race"
"Philadelphia, PA. IT'S HIGHER EDUCATION"
"806 DECORATIONS"
"Snow Hauled 20 Miles For Skiers"
"F.D.R. ASKS VICTORY EFFORT"
end 
compress 

moss part1, match("([A-Z]+)") regex 
egen wanted = concat(_match*), p(" ")
l wanted

     +--------------------------------------------------+
     |                                           wanted |
     |--------------------------------------------------|
  1. |                          C M TEST MODEL SEADROME |
  2. |                                L B MAYER HONORED |
  3. |                                     A TOWN MOVES |
  4. |                          U S SAVINGS BONDS RALLY |
  5. |                        N D NOSES OUT S M U BY TO |
     |--------------------------------------------------|
  6. |                               P P BURN SQUEALERS |
  7. |                                        O B I T N |
  8. | S S N Y DIAVOLO IS STAR AT BRILLIANT SPA OPENING |
  9. |                                          R D D R |
 10. |                       P PA IT S HIGHER EDUCATION |
     |--------------------------------------------------|
 11. |                                      DECORATIONS |
 12. |                                        S H M F S |
 13. |                        F D R ASKS VICTORY EFFORT |
     +--------------------------------------------------+

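I assumed you want spaces between the results; otherwise the output is hard to read. You did not say what to do with punctuation between the uppercase letters; if you need it, modify the regular expression accordingly.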
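One possible modification, as a sketch only: allow periods and apostrophes to follow an uppercase letter, so that pieces such as L.B. and IT'S stay together. It assumes moss is installed, that the variables from the earlier moss call may still be in memory, and uses wanted2 as a hypothetical variable name.

```stata
* Sketch: keep periods and apostrophes attached to uppercase runs.
* Assumes -moss- is installed (ssc install moss); wanted2 is a hypothetical name.
capture drop _count _match* _pos*   // clear results of any earlier -moss- call
moss part1, match("([A-Z][A-Z.']*)") regex
egen wanted2 = concat(_match*), p(" ")
list wanted2
```

Initial capitals of mixed-case words (the C in Cambridge, for example) are still picked up, and digits are still dropped, so further cleaning would be needed.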


0

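I cannot think of a single rule that would parse this kind of data cleanly. Usually the best strategy is to target the easy cases first and then move on to the harder ones, until diminishing returns make further attempts unattractive.

When working with regular expressions, especially with a large number of observations, watch out for unintended matches. I use listsome (from SSC) for this kind of work.

It looks like part1 often starts with a city name followed by a state name or abbreviation. Here is code that handles the easy cases and the city/state case: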

clear
input str60 part1
"Cambridge, Maryland TEST MODEL SEADROME" 
"L.B. MAYER HONORED" 
"A TOWN MOVES" 
"U.S. SAVINGS BONDS RALLY" 
"N.D. NOSES OUT S.M.U. BY 27 TO 20" 
"Philadelphia, Pa. BURN 2,300 SQUEALERS" 
"Odd Bits In To-day's News" 
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPEN" 
"Risk Death in Daring Race" 
"Philadelphia, PA. IT'S HIGHER EDUCATION" 
"806 DECORATIONS" 
"Snow Hauled 20 Miles For Skiers" 
"F.D.R. ASKS VICTORY EFFORT" 
end

* take care of the easy cases where there are no lowercase letters
gen title = part1 if !regexm(part1,"[a-z]")

* this type of string work is easier if text is aligned to the left
leftalign   // (from SSC)

* target cases of City, State at the start of part1.
* with complex patterns, it's easy to miss unintended matches when
* lots of obs are involved so use -listsome- (from SSC) to track changes
gen title0 = title
replace title = trim(regexs(3)) if regexm(part1,"^([A-Z][a-z ]*)+, ([A-Z][a-z]*\.?)+([^a-z]+$)")
listsome if title != title0

list part1 title
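To judge whether another round of rules is worth the effort, it can help to look at what is still unmatched. A minimal sketch, assuming the title variable created above:

```stata
* Sketch: review the observations the rules so far have not handled.
count if missing(title)
list part1 if missing(title), noobs
```

If the remaining strings share a visible pattern, that pattern is the next case to target; if they do not, this may be the point of diminishing returns.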
