Split a long string into three variables

Question

I have a data frame that looks like this:

df<-structure(list(string = c(" Thermionic, cold and photo-cathode valves, tubes, and parts .................................. E ....................... 16.3", 
" Automatic data processing machines and units thereof ............................................ E ....................... 15.0", 
" Parts of and accessories suitable for 751, 752 .......................................................... E ....................... 14.6", 
" Optical instruments and apparatus .............................................................................. E ....................... 14.1", 
" Perfumery, cosmetics and toilet preparations ............................................................. E ....................... 13.3", 
" Silk .................................................................................................................................. A ....................... 13.2", 
" Undergarments, knitted or crocheted .......................................................................... B ....................... 13.1", 
" Articles of materials described in division 58 ............................................................. D ....................... 13.1"
), id = c("1 ", "2 ", "3 ", "4 ", "5 ", "6 ", "7 ", "8 "), SH3 = c("776 ", 
"752 ", "759 ", "871 ", "553 ", "261 ", "846 ", "893 ")), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))


# that looks like this

  string                                                                                                                                                                    id    SH3   
  <chr>                                                                                                                                                                     <chr> <chr> 
1 " Thermionic, cold and photo-cathode valves, tubes, and parts .................................. E ....................... 16.3"                                          "1 "  "776 "
2 " Automatic data processing machines and units thereof ............................................ E ....................... 15.0"                                       "2 "  "752 "
3 " Parts of and accessories suitable for 751, 752 .......................................................... E ....................... 14.6"                               "3 "  "759 "
4 " Optical instruments and apparatus .............................................................................. E ....................... 14.1"                        "4 "  "871 "
5 " Perfumery, cosmetics and toilet preparations ............................................................. E ....................... 13.3"                              "5 "  "553 "
6 " Silk .................................................................................................................................. A ....................... 13.2" "6 "  "261 "
7 " Undergarments, knitted or crocheted .......................................................................... B ....................... 13.1"                          "7 "  "846 "
8 " Articles of materials described in division 58 ............................................................. D ....................... 13.1"                            "8 "  "893 "

I want to split the string variable into three separate variables. Each string consists of three parts separated by runs of dots (...).

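1) The first part is text: for example, "Thermionic, cold and photo-cathode valves, tubes, and parts" in the first row

2) The second part is a single capital letter: for example, "E" in the first row

3) The last part is a number: for example, "16.3" in the first row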

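I would like to split the string and create three variables. The difficulty is that the number of dots differs from row to row. Does anyone know an efficient way to do this?

Being able to isolate the capital letter (part 2) would already be enough for my needs.

Thank you very much in advance for your help.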

- Alex

1 Answer

- r2evans · Accepted Answer

You can split on a regular expression that matches a run of two or more literal dots, [.]{2,}:

strsplit(df$string, "[.]{2,}")[1:3]
# [[1]]
# [1] " Thermionic, cold and photo-cathode valves, tubes, and parts "
# [2] " E "                                                          
# [3] " 16.3"                                                        
# [[2]]
# [1] " Automatic data processing machines and units thereof " " E "                                                   
# [3] " 15.0"                                                 
# [[3]]
# [1] " Parts of and accessories suitable for 751, 752 " " E "                                             
# [3] " 14.6"
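
The {2,} quantifier matters because the numeric part contains a single dot of its own. As a small sketch (the input string below is made up for illustration, not taken from the question's data), compare splitting on runs of dots with splitting on every dot:

```r
# Hypothetical one-element input, shortened for readability
x <- " Silk ..... A ... 13.2"

# Runs of two or more dots: the decimal point in "13.2" is left alone,
# so the result has exactly three pieces: " Silk ", " A ", " 13.2"
parts_runs <- strsplit(x, "[.]{2,}")[[1]]
print(parts_runs)

# Every single dot: the number is cut at its decimal point and the
# consecutive dots produce empty strings, so there are far more pieces
parts_any <- strsplit(x, "[.]")[[1]]
print(length(parts_any))
```

This only holds as long as the descriptive text never contains two dots in a row, which is true for the sample rows shown in the question.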

From there, you can convert the result into a data frame:

data.frame(do.call(rbind, strsplit(df$string, "[.]{2,}")), stringsAsFactors = FALSE)
#                                                              X1  X2    X3
# 1  Thermionic, cold and photo-cathode valves, tubes, and parts   E   16.3
# 2         Automatic data processing machines and units thereof   E   15.0
# 3               Parts of and accessories suitable for 751, 752   E   14.6
# 4                            Optical instruments and apparatus   E   14.1
# 5                 Perfumery, cosmetics and toilet preparations   E   13.3
# 6                                                         Silk   A   13.2
# 7                          Undergarments, knitted or crocheted   B   13.1
# 8               Articles of materials described in division 58   D   13.1

You will need to rename the columns and most likely apply trimws and as.numeric to some of them, because strsplit does not trim the strings.
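
That cleanup might look roughly like the following. This is only a sketch: the column names description, category and value are placeholders I picked, and the two input rows are shortened copies of rows from the question.

```r
# Two rows adapted from the question, with shorter dot runs
string <- c(" Silk ........ A ........ 13.2",
            " Undergarments, knitted or crocheted ..... B ..... 13.1")

# Split on runs of dots and bind the pieces into a 3-column data frame
parts <- do.call(rbind, strsplit(string, "[.]{2,}"))
out <- data.frame(parts, stringsAsFactors = FALSE)

# Placeholder column names (not from the original post)
names(out) <- c("description", "category", "value")

# strsplit keeps the surrounding spaces, so trim each piece,
# then convert the last column from character to numeric
out$description <- trimws(out$description)
out$category    <- trimws(out$category)
out$value       <- as.numeric(trimws(out$value))

str(out)
```

To attach the result to the original data, something like cbind(df[c("id", "SH3")], out) should work, assuming the rows are still in the same order.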

If you only need the second column, then:

trimws(sapply(strsplit(df$string, "[.]{2,}"), `[[`, 2))
# [1] "E" "E" "E" "E" "E" "A" "B" "D"
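
A different route to the same letter, not part of the approach above, is to skip the split and capture the letter directly with sub. The sketch below assumes every string contains exactly one capital letter sitting between two runs of dots; a string that does not match the pattern is returned unchanged, so it is worth checking the result.

```r
# Two rows adapted from the question, with shorter dot runs
string <- c(" Silk ........ A ........ 13.2",
            " Articles of materials described in division 58 ..... D ..... 13.1")

# Match the whole string and keep only the captured capital letter:
#   [.]{2,} *   a run of dots followed by optional spaces
#   ([A-Z])     the single capital letter to keep
#    *[.]{2,}   optional spaces followed by another run of dots
letter <- sub("^.*[.]{2,} *([A-Z]) *[.]{2,}.*$", "\\1", string)
print(letter)

# Sanity check: every result should be exactly one capital letter
all_single_letters <- all(grepl("^[A-Z]$", letter))
print(all_single_letters)
```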