Split a long string into three variables

3

I have a data frame that looks like this:

df<-structure(list(string = c(" Thermionic, cold and photo-cathode valves, tubes, and parts .................................. E ....................... 16.3", 
" Automatic data processing machines and units thereof ............................................ E ....................... 15.0", 
" Parts of and accessories suitable for 751, 752 .......................................................... E ....................... 14.6", 
" Optical instruments and apparatus .............................................................................. E ....................... 14.1", 
" Perfumery, cosmetics and toilet preparations ............................................................. E ....................... 13.3", 
" Silk .................................................................................................................................. A ....................... 13.2", 
" Undergarments, knitted or crocheted .......................................................................... B ....................... 13.1", 
" Articles of materials described in division 58 ............................................................. D ....................... 13.1"
), id = c("1 ", "2 ", "3 ", "4 ", "5 ", "6 ", "7 ", "8 "), SH3 = c("776 ", 
"752 ", "759 ", "871 ", "553 ", "261 ", "846 ", "893 ")), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))


# that looks like this

  string                                                                                                                                                                    id    SH3   
  <chr>                                                                                                                                                                     <chr> <chr> 
1 " Thermionic, cold and photo-cathode valves, tubes, and parts .................................. E ....................... 16.3"                                          "1 "  "776 "
2 " Automatic data processing machines and units thereof ............................................ E ....................... 15.0"                                       "2 "  "752 "
3 " Parts of and accessories suitable for 751, 752 .......................................................... E ....................... 14.6"                               "3 "  "759 "
4 " Optical instruments and apparatus .............................................................................. E ....................... 14.1"                        "4 "  "871 "
5 " Perfumery, cosmetics and toilet preparations ............................................................. E ....................... 13.3"                              "5 "  "553 "
6 " Silk .................................................................................................................................. A ....................... 13.2" "6 "  "261 "
7 " Undergarments, knitted or crocheted .......................................................................... B ....................... 13.1"                          "7 "  "846 "
8 " Articles of materials described in division 58 ............................................................. D ....................... 13.1"                            "8 "  "893 "


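I want to split the string variable into three separate variables. Each string consists of three parts separated by runs of dots (...):
1) The first part is text: e.g. "Thermionic, cold and photo-cathode valves, tubes, and parts" in the first row
2) The second part is a single capital letter: e.g. "E" in the first row
3) The last part is a number: e.g. "16.3" in the first row
I would like to split the string and create three variables. The problem is that the number of dots differs from row to row. Does anyone know an efficient way to do this?
Reliably extracting just the capital letter (part 2) would already be enough for my needs.
Many thanks in advance for your help.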
1 Answer

3
You can split on a regular expression that matches runs of two or more dots, [.]{2,}:
strsplit(df$string, "[.]{2,}")[1:3]
# [[1]]
# [1] " Thermionic, cold and photo-cathode valves, tubes, and parts "
# [2] " E "                                                          
# [3] " 16.3"                                                        
# [[2]]
# [1] " Automatic data processing machines and units thereof " " E "                                                   
# [3] " 15.0"                                                 
# [[3]]
# [1] " Parts of and accessories suitable for 751, 752 " " E "                                             
# [3] " 14.6"                                           
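
The {2,} quantifier matters because the last field contains a single dot as its decimal point. As a minimal sketch on one shortened sample line (the names x, split_any and split_runs are only for illustration), compare splitting on one or more dots with splitting on two or more:

```r
x <- " Silk .......... A ....... 13.2"

# One or more dots: the decimal point in "13.2" is treated as a separator too,
# so the number is cut into two pieces
split_any <- strsplit(x, "[.]+")[[1]]
length(split_any)

# Two or more dots: only the dotted leaders act as separators,
# so the number stays in one piece
split_runs <- strsplit(x, "[.]{2,}")[[1]]
trimws(split_runs)
```

This assumes the descriptive text itself never contains two consecutive dots; if it does, the split would produce more than three pieces for that row.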

From the strsplit result you can build a data frame:

data.frame(do.call(rbind, strsplit(df$string, "[.]{2,}")), stringsAsFactors = FALSE)
#                                                              X1  X2    X3
# 1  Thermionic, cold and photo-cathode valves, tubes, and parts   E   16.3
# 2         Automatic data processing machines and units thereof   E   15.0
# 3               Parts of and accessories suitable for 751, 752   E   14.6
# 4                            Optical instruments and apparatus   E   14.1
# 5                 Perfumery, cosmetics and toilet preparations   E   13.3
# 6                                                         Silk   A   13.2
# 7                          Undergarments, knitted or crocheted   B   13.1
# 8               Articles of materials described in division 58   D   13.1

You will need to rename the columns and most likely apply trimws and as.numeric to some of them, since strsplit does not trim the strings.
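
A possible version of that cleanup, sketched on two shortened sample lines rather than the full df. The column names description, letter and value are placeholders, not something from the question:

```r
# Shortened sample lines in the same layout as df$string
string <- c(" Silk ............ A ......... 13.2",
            " Optical instruments and apparatus ..... E ......... 14.1")

# One row per input line, one column per piece
parts <- do.call(rbind, strsplit(string, "[.]{2,}"))
out <- data.frame(parts, stringsAsFactors = FALSE)

# Replace the default X1, X2, X3 names (placeholder names, pick your own)
names(out) <- c("description", "letter", "value")

# Strip the leftover spaces and convert the last column to numeric
out$description <- trimws(out$description)
out$letter <- trimws(out$letter)
out$value <- as.numeric(trimws(out$value))

str(out)
```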
If you only need the second column, then:
trimws(sapply(strsplit(df$string, "[.]{2,}"), `[[`, 2))
# [1] "E" "E" "E" "E" "E" "A" "B" "D"

Many thanks, this does the job! One follow-up: is it possible to keep the other variables in df, i.e. id and SH3? Thanks again, I'm new to string parsing! - Alex
Do you mean df$middleletter <- trimws(sapply(strsplit(df$string, "[.]{2,}"), `[[`, 2)) ? - r2evans
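
Along the lines of that comment, all three pieces can be added as new columns while the existing ones stay in place. This is only a sketch: the df below is a small stand-in with shortened strings, and the new column names are hypothetical.

```r
# Small stand-in for df with the same three columns (strings shortened)
df <- data.frame(
  string = c(" Silk ............ A ......... 13.2",
             " Undergarments, knitted or crocheted ..... B ......... 13.1"),
  id = c("1 ", "2 "),
  SH3 = c("261 ", "846 "),
  stringsAsFactors = FALSE
)

# Character matrix with one row per input line and three columns
parts <- do.call(rbind, strsplit(df$string, "[.]{2,}"))

# Assigning new columns leaves string, id and SH3 untouched
df$description <- trimws(parts[, 1])
df$letter <- trimws(parts[, 2])
df$value <- as.numeric(trimws(parts[, 3]))

names(df)
```

This relies on every row splitting into exactly three pieces; it may be worth checking lengths(strsplit(df$string, "[.]{2,}")) on the real data first.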
