Remove punctuation but keep emoticons?

Question

string, r, text, gsub, emoticons

10

Is it possible to remove all punctuation but keep emoticons, such as

:-(

:)

:D

:p

structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label =     c("ãããæããããéãããæãããInappropriate announce:-(", 
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something     you are working to fix?", 
"@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. Kudos :)", 
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D", 
"xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...", 
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L, 
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54", 
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text", 
"created"), class = "data.frame", row.names = c(NA, -6L))

- user3456230

1

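First tokenize the emoticons (replace them with markers such as $SMILEY1), then remove the punctuation, and finally replace the markers with the corresponding emoticons. - hrbrmstr

Hi @RichardScriven, this is for sentiment analysis of tweets. Some emoticons carry positive sentiment, while others are negative :-) - user3456230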

2

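@Richard, they are very important. They represent a person's attempt to put emotion, gesture, and body language back into a virtual space. They carry a great deal of information. If anyone here runs an online course and discourages emoticons as unscholarly, I would challenge you on that, because it severely limits the conversation and ensures you won't get real dialogue. It would be like telling students in a classroom not to use facial expressions, eye contact, their bodies, or hand gestures. - Tyler Rinker

@TylerRinker, I meant no disrespect. The topic is highly subjective and may well depend on generation. - Rich Scriven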

1

@RichardScriven Sorry, I meant no disrespect. This shows how hard it is to convey tone in written language. You mistook my excitement and enthusiasm for offense :-) - Tyler Rinker

4 Answers

5

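Here is a less sophisticated, and likely slower, approach than @gagolews's solution. It requires you to supply a dictionary of emoticons. You can create one or use the one in the qdapDictionaries package. The basic approach converts the emoticons to text that cannot be mistaken for anything else (I use the prefix EMOTICONREPLACE to ensure this). Then strip the punctuation with qdap::strip, and convert the placeholders back into emoticons with mgsub: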

library(qdap)
#reps <- emoticon
emos <- c(":-(", ":)", ":D", ":p", "X-(")
reps <- data.frame(seq_along(emos), emos)

reps[, 1] <- paste0("EMOTICONREPLACE", reps[, 1])
dat$Temp <- mgsub(as.character(reps[, 2]), reps[, 1], dat[, 1])
dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]), 
    strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE))

View it:

truncdf(left_just(dat[, 3, drop=F]), 50)

##   Temp                                              
## 1 RT AirAsia ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í No
## 2 You know there is a problem when customer service 
## 3 ãããæããããéãããæãããInappropriate announce:-(         
## 4 AirAsia your direct debit Maybank payment gateways
## 5 xdek ke flight AirAsia Malaysia to LA hahah:p bagi
## 6 AirAsia Apart from the slight delay and shortage o

Edit: to keep ? and ! as requested, pass the char.keep argument to strip:

dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]), 
    strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))
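For readers without qdap, the same hold, strip, restore idea can be sketched in base R. This is only an illustration: `hold_strip_restore()` and the `END` token suffix are hypothetical names, and the sketch assumes the placeholder tokens never occur in the input text.

```r
# Illustrative base R sketch; `hold_strip_restore()` is a hypothetical helper
# and is not part of qdap.
hold_strip_restore <- function(x, emos) {
  x <- as.character(x)                                 # factors -> character
  emos <- emos[order(nchar(emos), decreasing = TRUE)]  # longest emoticons first
  # Assumption: these tokens never occur in the input text.
  tokens <- paste0("EMOTICONREPLACE", seq_along(emos), "END")
  for (i in seq_along(emos)) {
    x <- gsub(emos[i], tokens[i], x, fixed = TRUE)     # 1. hold the emoticons
  }
  x <- gsub("[[:punct:]]", "", x)                      # 2. strip punctuation
  for (i in seq_along(emos)) {
    x <- gsub(tokens[i], emos[i], x, fixed = TRUE)     # 3. restore the emoticons
  }
  x
}

emos <- c(":-(", ":)", ":D", ":p", "X-(")
held <- hold_strip_restore(c("Kudos :)", "hahah..:p bagi, la!", "queue. X-("), emos)
print(held)
```

Unlike qdap::strip, this sketch leaves letter case and whitespace untouched and has no equivalent of char.keep.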

- Tyler Rinker

Nice solution, but what about the :-( in the third string...? - gagolews

I think he wants to keep it. That's why it's there. - Tyler Rinker

@RichardScriven Thanks. I really appreciate the feedback. - Tyler Rinker

1

The OP's original data set - Tyler Rinker

@TylerRinker Thanks. - user1828605

1

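I added this functionality to qdap (version > 2.0.0) as the sub_holder function. Essentially, the function uses the approach from my earlier answer but lightens the coding burden. sub_holder takes a text vector and the items you want to hold out (e.g., emoticons) and returns a list containing:

The text vector with the items replaced by placeholders
A function (called unhold) that replaces the placeholders with the original terms

Here is the code: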

emos <- c(":-(", ":)", ":D", ":p", "X-(")
(m <- sub_holder(emos, dat[,1]))
m$unhold(strip(m$output, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))

- Tyler Rinker

0

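Using rex may make this type of task a little simpler. It automatically escapes characters as needed and, when a vector is passed to or(), ORs all of its elements together. Calling re_matches() with the global argument gives you a list of all the emoticons found in each line.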

x = structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label =     c("ãããæããããéãããæãããInappropriate announce:-(", 
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something     you are working to fix?", 
"@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. Kudos :)", 
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D", 
"xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...", 
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L, 
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54", 
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text", 
"created"), class = "data.frame", row.names = c(NA, -6L))

emots <- as.character(outer(c(":", ";", ":-", ";-"), c(")", "(", "]", "[", "D", "o", "O", "P", "p"), paste0))

library(rex)
re_matches(x$text,
  rex(
    capture(name = 'emoticon',
      or(emots)
    )
  ),
  global = TRUE)

#>[[1]]
#>  emoticon
#>1       :D
#>2       :D
#>
#>[[2]]
#>  emoticon
#>1     <NA>
#>
#>[[3]]
#>  emoticon
#>1      :-(
#>
#>[[4]]
#>  emoticon
#>1     <NA>
#>
#>[[5]]
#>  emoticon
#>1       :p
#>
#>[[6]]
#>  emoticon
#>1       :)
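A rough base R counterpart, for comparison, might look like the sketch below. It escapes only the bracket characters that occur in this particular emoticon set and uses `gregexpr()` with `regmatches()`; the sample vector `txt` is made up for illustration. Note that strings without a match yield `character(0)` here rather than `NA`.

```r
# Base R sketch: extract every emoticon from each string.
emots <- as.character(outer(c(":", ";", ":-", ";-"),
                            c(")", "(", "]", "[", "D", "o", "O", "P", "p"), paste0))

# Escape the regex metacharacters present in this set: ( ) [ ]
escaped <- gsub("([()\\[\\]])", "\\\\\\1", emots, perl = TRUE)
pattern <- paste(escaped, collapse = "|")

# Hypothetical sample input, not the OP's data
txt <- c("Kudos :)", "no emoticon here", "a #great :D breakfast! :D")
found <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))
print(found)
```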

- Jim

- gagolews · Accepted Answer

1. A working regex-only solution (a.k.a. Edit #2)

This task can be done entirely with regular expressions (many thanks to @Mike Samuel).

First, we build a database of emoticons:

library(stringi)
(emots <- as.character(outer(c(":", ";", ":-", ";-"),
               c(")", "(", "]", "[", "D", "o", "O", "P", "p"), stri_paste)))
## [1] ":)"  ";)"  ":-)" ";-)" ":("  ";("  ":-(" ";-(" ":]"  ";]"  ":-]" ";-]" ":["  ";["  ":-[" ";-[" ":D"  ";D"  ":-D" ";-D"
## [21] ":o"  ";o"  ":-o" ";-o" ":O"  ";O"  ":-O" ";-O" ":P"  ";P"  ":-P" ";-P" ":p"  ";p"  ":-p" ";-p"

A sample input text:

text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"

A helper function that escapes some special characters so that they can be used in a regex pattern (using the stringi package):

library(stringi)
escape_regex <- function(r) {
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}
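The helper above escapes only round and square brackets, which is enough for this emoticon set. If the dictionary also contains emoticons such as `^_^` or `:*`, more metacharacters need escaping. A possible base R variant is sketched below; `escape_regex_all()` is a hypothetical name, not a stringi function.

```r
# Sketch: escape every common regex metacharacter, not just brackets.
# `escape_regex_all()` is a hypothetical helper name.
escape_regex_all <- function(r) {
  # Character class: \ ^ $ . | ? * + ( ) { } [ ]
  gsub("([\\\\^$.|?*+(){}\\[\\]])", "\\\\\\1", r, perl = TRUE)
}

escaped_demo <- escape_regex_all(c(":-)", "^_^", ":*"))
print(escaped_demo)

# The escaped pattern matches the emoticon literally
grepl(escape_regex_all("^_^"), "nice ^_^", perl = TRUE)
```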

A regex that matches the emoticons:

(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"

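Now, as @Mike Samuel suggested below, we just match (emoticon)|punctuation (note that the emoticons are in a capture group) and replace each match with the content of capture group 1. If the match is an emoticon, the replacement is that emoticon; if it is a punctuation character, the replacement is empty. This works because in ICU Regex (the engine used by stri_replace_all_regex) alternation with | is left-biased: alternatives are tried from left to right, so an emoticon is matched before a bare punctuation character.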

stri_replace_all_regex(text, stri_c(regex1, "|\\p{P}"), "$1")
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-)  and the salesperson said Oh boy"

By the way, if you only want to get rid of a selected set of characters, use e.g. [.,] instead of \\p{P} above.
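The same capture-group trick can be tried without stringi. The sketch below uses base R `gsub()` with `perl = TRUE`, where `\\1` expands to an empty string when group 1 did not take part in the match. One assumption to keep in mind: `[[:punct:]]` is used as a stand-in for `\\p{P}`, and it also covers ASCII symbols such as `$` or `+`, so the two are not strictly equivalent.

```r
# Base R sketch of the "(emoticon)|punctuation" replacement.
emots <- as.character(outer(c(":", ";", ":-", ";-"),
                            c(")", "(", "]", "[", "D", "o", "O", "P", "p"), paste0))
escaped <- gsub("([()\\[\\]])", "\\\\\\1", emots, perl = TRUE)
regex1 <- paste0("(", paste(escaped, collapse = "|"), ")")

text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"

# Emoticons are replaced by themselves, other punctuation by nothing
no_punct <- gsub(paste0(regex1, "|[[:punct:]]"), "\\1", text, perl = TRUE)
print(no_punct)

# Removing only a selected set of characters, here "." and ","
only_dots <- gsub(paste0(regex1, "|[.,]"), "\\1", text, perl = TRUE)
print(only_dots)
```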

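2. Regex solution hint - my first (not-so-wise) attempt (a.k.a. the original answer)

My first idea (left here mainly for "historical reasons") was to tackle this with lookaheads and lookbehinds but, as you will see, that is far from perfect.

To remove every : and ; that is not followed by ), (, P, D, X, 8, [ or ], use a negative lookahead: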

stri_replace_all_regex(text, "[:;](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P -) --- and the salesperson said Oh boy!"

Now we can add some old-school emoticons (the ones with noses, e.g. :-), ;-D, etc.):

stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-) --- and the salesperson said Oh boy!"

Now the removal of hyphens (negative lookbehind and negative lookahead):

stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])|(?<![:;])[-](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-)  and the salesperson said Oh boy!"

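Of course, you should first build your own database of emoticons (to keep) and of punctuation characters (to remove). The regex depends heavily on these two sets, so it is hard to add new emoticons later; it is definitely not worth trying (and may give you a headache).

3. Second attempt (more readable regexes, a.k.a. Edit #1)

On the other hand, if you are allergic to complicated regular expressions, try this approach. It has some "didactic benefits": we can see exactly what is being done at each of the following steps: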

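1. Locate all the emoticons in text;
2. Locate all the punctuation characters in text;
3. Find the positions of punctuation characters that are not part of an emoticon;
4. Remove the characters located in step 3.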
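As a compact preview, the four steps can be sketched in base R on a per-character basis. This is not the stringi code that follows; `remove_punct_keep()` and the tiny emoticon regex are hypothetical, and the sketch assumes match positions are counted in characters (true for the ASCII sample used here).

```r
# Sketch of the four steps for a single string; hypothetical helper name.
remove_punct_keep <- function(s, emot_regex) {
  chars <- strsplit(s, "")[[1]]

  # Step 1: locate emoticons and mark the characters they cover
  m <- gregexpr(emot_regex, s, perl = TRUE)[[1]]
  in_emot <- rep(FALSE, length(chars))
  if (m[1] != -1) {
    len <- attr(m, "match.length")
    for (k in seq_along(m)) in_emot[m[k]:(m[k] + len[k] - 1)] <- TRUE
  }

  # Step 2: locate punctuation characters
  is_punct <- grepl("^[[:punct:]]$", chars)

  # Steps 3 and 4: drop punctuation that is not inside an emoticon
  paste(chars[!(is_punct & !in_emot)], collapse = "")
}

cleaned <- remove_punct_keep("Oh boy! :-) ... (ok) ;P", ":-\\)|;P")
print(cleaned)
```

Wrapping the call in `vapply(x, remove_punct_keep, character(1), emot_regex = ...)` would be one way to handle a whole character vector.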

A sample input text, just one string; the general case is left as an exercise ;)

text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"

A helper function that escapes some special characters so that they can be used in a regex:

escape_regex <- function(r) {
   library("stringi")
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}

A regex that matches the emoticons:

(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"

Find the start and end positions of all the emoticons (i.e. locate the first OR the second OR ... emoticon):

where_emots <- stri_locate_all_regex(text, regex1)[[1]] # only for the first string of text
print(where_emots)
##       start end
##  [1,]     1   2
##  [2,]     4   5
##  [3,]     7   8
##  [4,]    10  11
##  [5,]    13  14
##  [6,]    16  17
##  [7,]    23  24
##  [8,]    64  65
##  [9,]    67  69

Locate all the punctuation characters (here \\p{P} is the Unicode character class for punctuation characters):

where_punct <- stri_locate_all_regex(text, "\\p{P}")[[1]]
print(where_punct)
##       start end
##  [1,]     1   1
##  [2,]     2   2
##  [3,]     4   4
##  [4,]     7   7
##  [5,]     8   8
## ...
## [26,]    72  72
## [27,]    73  73
## [28,]    99  99
## [29,]   107 107

Since some punctuation characters occur inside emoticons, we should not mark those for removal:

which_punct_omit <- sapply(1:nrow(where_punct), function(i) {
   any(where_punct[i,1] >= where_emots[,1] &
        where_punct[i,2] <= where_emots[,2]) })
where_punct <- where_punct[!which_punct_omit, , drop=FALSE] # update where_punct; drop=FALSE keeps it a matrix
print(where_punct)
##       start end
##  [1,]    27  27
##  [2,]    38  38
##  [3,]    39  39
##  [4,]    40  40
##  [5,]    46  46
##  [6,]    54  54
##  [7,]    58  58
##  [8,]    60  60
##  [9,]    71  71
## [10,]    72  72
## [11,]    73  73
## [12,]    99  99
## [13,]   107 107

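Each punctuation mark is exactly one character long, so where_punct[,1]==where_punct[,2] always holds.

Now the final part. As you can see, where_punct[,1] contains the positions of the characters to remove. In my opinion, the easiest way to do that (without loops) is to convert the string to UTF-32 (each character == 1 integer), remove the unwanted elements, and convert back to a textual representation: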

text_tmp <- stri_enc_toutf32(text)[[1]]
print(text_tmp) # here - just ASCII codes...
## [1]  58  41  32  59  80  32  58  93  32  58....
text_tmp <- text_tmp[-where_punct[,1]] # removal, but be sure that where_punct is not empty!

The result is:

stri_enc_fromutf32(text_tmp)
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-)  and the salesperson said Oh boy"
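The warning about an empty where_punct deserves a small illustration: in R, indexing with a negated empty vector selects nothing, so `x[-integer(0)]` returns an empty vector instead of `x`. The sketch below shows the pitfall and one possible guard in base R; `drop_positions()` is a hypothetical helper, with `utf8ToInt()` and `intToUtf8()` standing in for the stringi UTF-32 conversions.

```r
# Pitfall: a negated empty index drops everything
codes <- utf8ToInt("ab")
lost <- codes[-integer(0)]
print(length(lost))   # 0, not 2

# One possible guard: select what to keep instead of negating an index
drop_positions <- function(s, idx) {
  codes <- utf8ToInt(s)
  keep <- setdiff(seq_along(codes), idx)
  intToUtf8(codes[keep])
}

print(drop_positions("a,b.", c(2, 4)))
print(drop_positions("ab", integer(0)))
```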

Here you are.