How can I split a complex string using regular expressions or string operations?

3

I have the following list of ingredients:

Ingredients <- "Starch (Corn | Potato | Wheat) | Vegetables (27%) [Pea (23%) (Flakes | Pieces) | Carrot Pieces | Onion Powder | Spinach Powder] | Croutons (10%) (Wheat Flour | Vegetable Oil | Salt | Yeast) | Maltodextrin | Natural Flavours (Contain Milk and Soybeans) | Creamer [Contains Milk | Mineral Salts (339 or 340 | 450 or 451)] | Salt | Mineral Salt (Potassium Chloride) | Sugar | Flavour Enhancer (621) | Vegetable Oil | Bacon Powder (0.5%) | Parsley | Natural Colour (Turmeric) | Burnt Sugar | Food Acid (Lactic) | Pepper Extract"

I would like to split them into values in a data frame, stored under the variable ingredients.

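However, I am having trouble writing the code because the delimiter | is used in several different ways within the list. I want to split on | only where it is not enclosed in parentheses () or square brackets [], and I really don't know how to approach this.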

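That is, we should end up with one ingredient value Starch (Corn | Potato | Wheat), another being Vegetables (27%) [Pea (23%) (Flakes | Pieces) | Carrot Pieces | Onion Powder | Spinach Powder], and another that is simply Salt (there are other ingredients too, but the first two are the tricky cases for me).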

2 Answers

6

The regex is adapted from this answer.

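The idea is to first replace the | characters that sit inside brackets (() or []) with some other character (my example uses @). The | characters that remain should then be the true delimiters of the string. Next, use strsplit() to split on |, and replace the @ symbols with | again. Finally, use trimws() to strip the unwanted whitespace from both ends of each string.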

library(dplyr)

strsplit(gsub("\\|(?=[^()]*\\))", "@", Ingredients, perl=TRUE) %>% 
           gsub("\\|(?=[^\\[\\]]*\\])", "@", ., perl=TRUE), "\\|") %>% 
  unlist() %>% 
  gsub("@", "\\|", .) %>% 
  trimws()

 [1] "Starch (Corn | Potato | Wheat)"                                                                
 [2] "Vegetables (27%) [Pea (23%) (Flakes | Pieces) | Carrot Pieces | Onion Powder | Spinach Powder]"
 [3] "Croutons (10%) (Wheat Flour | Vegetable Oil | Salt | Yeast)"                                   
 [4] "Maltodextrin"                                                                                  
 [5] "Natural Flavours (Contain Milk and Soybeans)"                                                  
 [6] "Creamer [Contains Milk | Mineral Salts (339 or 340 | 450 or 451)]"                             
 [7] "Salt"                                                                                          
 [8] "Mineral Salt (Potassium Chloride)"                                                             
 [9] "Sugar"                                                                                         
[10] "Flavour Enhancer (621)"                                                                        
[11] "Vegetable Oil"                                                                                 
[12] "Bacon Powder (0.5%)"                                                                           
[13] "Parsley"                                                                                       
[14] "Natural Colour (Turmeric)"                                                                     
[15] "Burnt Sugar"                                                                                   
[16] "Food Acid (Lactic)"                                                                            
[17] "Pepper Extract" 
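
Since the question asks for a data frame with a column named ingredients, the steps above can be sketched in base R as follows. This is only a sketch: the helper name split_top_level, its placeholder argument, and the shortened sample_ingredients string are made up for illustration, and it assumes the placeholder character does not already occur in the input.

```r
# Hypothetical helper: the logic mirrors the gsub/strsplit steps above,
# written in base R so that no pipe operator is needed.
split_top_level <- function(x, placeholder = "@") {
  # Assumption: the placeholder must not already occur in the input.
  stopifnot(!grepl(placeholder, x, fixed = TRUE))
  # Protect every "|" that is followed by ")" with no parenthesis in between.
  x <- gsub("\\|(?=[^()]*\\))", placeholder, x, perl = TRUE)
  # Protect every "|" that is followed by "]" with no square bracket in between.
  x <- gsub("\\|(?=[^\\[\\]]*\\])", placeholder, x, perl = TRUE)
  # Split on the remaining top-level "|" (expects a single input string).
  parts <- strsplit(x, "|", fixed = TRUE)[[1]]
  # Restore the protected "|" and trim surrounding whitespace.
  trimws(gsub(placeholder, "|", parts, fixed = TRUE))
}

# Shortened sample string, used here only to keep the example small.
sample_ingredients <- "Starch (Corn | Potato | Wheat) | Vegetables (27%) [Pea (23%) (Flakes | Pieces) | Carrot Pieces] | Salt"

df <- data.frame(ingredients = split_top_level(sample_ingredients),
                 stringsAsFactors = FALSE)
print(df)
```

The same call on the full Ingredients string should give one row per element of the output shown above.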

6
You can use a recursive regex:
pat <- r"(([^\[\]|]*[\[(](?:[^\[)(\]]*(?1)?)+[\])])| ([^|]+))"
regmatches(Ingredients, gregexpr(pat, Ingredients, perl = TRUE))

[[1]]
 [1] "Starch (Corn | Potato | Wheat)"                                                                 
 [2] " Vegetables (27%) [Pea (23%) (Flakes | Pieces) | Carrot Pieces | Onion Powder | Spinach Powder]"
 [3] " Croutons (10%) (Wheat Flour | Vegetable Oil | Salt | Yeast)"                                   
 [4] " Maltodextrin "                                                                                 
 [5] " Natural Flavours (Contain Milk and Soybeans)"                                                  
 [6] " Creamer [Contains Milk | Mineral Salts (339 or 340 | 450 or 451)]"                             
 [7] " Salt "                                                                                         
 [8] " Mineral Salt (Potassium Chloride)"                                                             
 [9] " Sugar "                                                                                        
[10] " Flavour Enhancer (621)"                                                                        
[11] " Vegetable Oil "                                                                                
[12] " Bacon Powder (0.5%)"                                                                           
[13] " Parsley "                                                                                      
[14] " Natural Colour (Turmeric)"                                                                     
[15] " Burnt Sugar "                                                                                  
[16] " Food Acid (Lactic)"                                                                            
[17] " Pepper Extract"        
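
The matches above keep the spaces that surround each delimiter, so a follow-up step is probably wanted before storing them. A minimal sketch, assuming R 4.0.0 or later (for the r"(...)" raw string syntax) and using a shortened, made-up sample_ingredients string; the names matches and ingredients_df are illustrative only:

```r
# Same recursive pattern as above; (?1) recurses into the first capture group.
pat <- r"(([^\[\]|]*[\[(](?:[^\[)(\]]*(?1)?)+[\])])| ([^|]+))"

# Shortened sample string, used here only to keep the example small.
sample_ingredients <- "Starch (Corn | Potato | Wheat) | Vegetables (27%) [Pea (23%) (Flakes | Pieces) | Carrot Pieces] | Salt"

# regmatches() returns a list with one element per input string; take the first.
matches <- regmatches(sample_ingredients,
                      gregexpr(pat, sample_ingredients, perl = TRUE))[[1]]

# Trim the leading and trailing spaces and store the result in a data frame.
ingredients_df <- data.frame(ingredients = trimws(matches),
                             stringsAsFactors = FALSE)
print(ingredients_df)
```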

1
Nice use of raw strings. +1 - Maël
