Removing spaces and delimiters

Question


4

I'm new to R. I have a vector with the following contents:

> head(sampleVector)

[1] "| txt01 |   100 |         200 |       123.456 |           0.12345 |"
[2] "| txt02 |   300 |         400 |       789.012 |           0.06789 |"

I want to take each line and split it into separate pieces, one data value per piece. I'd like to end up with a list called resultList that prints as follows:

> head(resultList)

[[1]]
[1] ""   "txt01"    "100"       "200"     "123.456"        "0.12345" 

[[2]]
[1] ""   "txt02"    "300"       "400"     "789.012"        "0.06789"

I'm struggling with strsplit(). So far I have tried the following:

resultList  <- strsplit(sampleVector,"\\s+[|] | [|]\\s+ | [\\s+]")
# would give me the following output

# [[1]]
# [1] "| txt01"    "100"       "200"     "123.456"        "0.12345 |"

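Can I get the desired output with a single strsplit call? I suspect the way I wrote the pattern to tell the delimiter apart from the whitespace is wrong. Any help would be appreciated.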

- 12341234

3 Answers

4

Another strsplit option that I nearly overlooked:

strsplit(test,"[| ]+")
#[[1]]
#[1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
# 
#[[2]]
#[1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"

...and my original answer, which uses regmatches, my favourite function of late:

regmatches(test,gregexpr("[^| ]+",test))
#[[1]]
#[1] "txt01"   "100"     "200"     "123.456" "0.12345"
#
#[[2]]
#[1] "txt02"   "300"     "400"     "789.012" "0.06789"

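A simplified explanation, as requested:

[| ]+ is a regular expression that matches one or more consecutive spaces or pipe characters |.
[^| ]+ is a regular expression that matches one or more consecutive characters that are neither a space nor a pipe |.
gregexpr finds every match of the pattern and returns the start position and length of each match.
regmatches extracts from test all of the substrings that gregexpr matched.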
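
To see the two steps from that explanation separately, here is a minimal sketch on a short made-up string. The variable x and its contents are illustrative assumptions, not data from the question. It prints what gregexpr reports before regmatches does the extraction, then repeats the extraction by hand.

```r
# Hypothetical one-row input, shaped like the rows in the question
x <- "| a |  10 |"

# gregexpr returns a list with one element per input string; each element
# holds the start positions, with the lengths in the "match.length" attribute
m <- gregexpr("[^| ]+", x)
starts <- as.integer(m[[1]])
lens   <- attr(m[[1]], "match.length")
print(starts)  # 3 8
print(lens)    # 1 2

# regmatches uses those positions and lengths to pull out the substrings
fields <- regmatches(x, m)[[1]]
print(fields)  # "a" "10"

# The same extraction done by hand with substring()
by_hand <- substring(x, starts, starts + lens - 1)
print(by_hand) # "a" "10"
```

Because the pattern only ever matches the data values, this route never produces the leading empty string that the strsplit version returns.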

- thelatemail

1

Did you see my "scan" alternative? Thinking outside the box. - Rich Scriven

@RichardScriven - scan works nicely - you could probably cook something up with read.table too. - thelatemail

1

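That strsplit solution is rock solid. @MagnumOpus - I think you should accept this answer. - Rich Scriven

OK, so any symbol inside the square brackets [| ] is one of the characters we are matching here. What does the + after the brackets mean? - 12341234

@MagnumOpus - + means one or more repetitions of whatever comes immediately before it. - thelatemail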
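
A small sketch of what the + changes, using a made-up string (x is a hypothetical input, not taken from the question). Without the quantifier every single delimiter character splits on its own, so adjacent delimiters leave empty fields behind.

```r
# Hypothetical input with two pipes in a row
x <- "a||b"

# Without +, each pipe is its own delimiter, so an empty field appears
one_at_a_time <- strsplit(x, "[|]")[[1]]
print(one_at_a_time)  # "a" ""  "b"

# With +, a run of pipes counts as a single delimiter
as_a_run <- strsplit(x, "[|]+")[[1]]
print(as_a_run)       # "a" "b"
```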


0

You could start with strsplit and gsub:

sapply(strsplit(xx, '\\|'), function (x) gsub("^\\s+|\\s+$", "", x))
     [,1]     
[1,] ""       
[2,] "txt01"  
[3,] "100"    
[4,] "200"    
[5,] "123.456"
[6,] "0.12345"

- rnso

I wanted a list returned, with each list component holding the split strings. Thanks anyway! - 12341234
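
If a list is what's needed, one possible adjustment (a sketch, not something rnso posted) is to use lapply in place of sapply, since lapply never simplifies its result to a matrix. The vector xx below is a stand-in rebuilt from the two rows shown in the question, and trimws assumes R 3.2.0 or later.

```r
# Stand-in input: the two rows shown in the question
xx <- c("| txt01 |   100 |         200 |       123.456 |           0.12345 |",
        "| txt02 |   300 |         400 |       789.012 |           0.06789 |")

# Split on the literal pipe, then trim the whitespace around each field;
# lapply always returns a list, one character vector per input row
res <- lapply(strsplit(xx, "|", fixed = TRUE), trimws)
str(res)
```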


- Rich Scriven · Accepted Answer

Here's one way. First remove the | characters from the vector with gsub, then use strsplit on whitespace (any amount of it). That might be a little easier.

strsplit(gsub("|", "", sampleVector, fixed=TRUE), "\\s+")
# [[1]]
# [1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
#
# [[2]]
# [1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"

Here's a fun alternative using scan that might be useful, and could be quite fast.

lapply(sampleVector, function(y) {
    s <- scan(text = y, what = character(), sep = "|", quiet = TRUE)
    (g <- gsub("\\s+", "", s))[-length(g)]
})
# [[1]]
# [1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
#
# [[2]]
# [1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"
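
As a quick sanity check, the sketch below rebuilds sampleVector from the two rows shown in the question (an assumption, since the full vector was never posted) and compares the gsub-then-strsplit approach with the single-call "[| ]+" pattern. It also shows one way to drop the leading empty string, in case it turns out not to be wanted.

```r
# Assumed input: only the two rows visible in the question
sampleVector <- c(
  "| txt01 |   100 |         200 |       123.456 |           0.12345 |",
  "| txt02 |   300 |         400 |       789.012 |           0.06789 |"
)

# Approach 1: strip the pipes first, then split on runs of whitespace
a <- strsplit(gsub("|", "", sampleVector, fixed = TRUE), "\\s+")

# Approach 2: treat any run of pipes and spaces as one delimiter
b <- strsplit(sampleVector, "[| ]+")

# Check that both approaches produce the same list
print(identical(a, b))

# Each row starts with a delimiter, so element 1 is always ""; drop it
trimmed <- lapply(b, function(p) p[-1])
print(trimmed[[1]])
```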