How can I optimize string detection for speed?

Question

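When I do text analysis, I often want to know whether each of a large number of documents contains any element of a set of strings. With millions of documents (e.g., tweets) and a long list of patterns, this can take a long time.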

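I usually use the following packages to optimize for speed: data.table, dtplyr, and stringr. What are the best practices for optimizing string detection and analysis? Are there packages that would let me optimize this kind of code?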

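library(data.table)
library(dplyr)    # provides the filter() generic that dtplyr translates
library(dtplyr)
library(stringr)

my_dt <- data.table(text = c("this is some text", "this is some more text")) # imagine many more strings
my_string <- paste(stringr::words, collapse = "|") # one regex alternation built from stringr's word list

lazy_dt(my_dt, immutable = FALSE) %>%
  filter(str_detect(text, my_string)) %>% # unnamed condition; filter() does not take named arguments
  as.data.table()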

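I assume that using data.table directly instead of going through dtplyr would improve speed. Are there other ways to improve the performance of this kind of application?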

I have looked at this question and hope to get similar guidance. Hopefully the question is now specific enough.

- Tea Tree

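str_detect(text, my_string) is the bottleneck in your code. Switching to pure data.table / stringi will only improve speed slightly. Once the question is reopened, I will post an answer that is faster than just using data.table. On 30000 records, I get roughly an 8x speedup compared with your original code. - phiver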

Awesome, I really appreciate it. Is there a way to speed up getting my question reopened? - Tea Tree

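If performance is the goal, my first question is: why use R? Why not C/C++? Second, which input changes least often? The pattern list? Why not preprocess the pattern list into specialized C++ code? That kind of code is hard to beat. - Mike Dunlavey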

1 Answer

- phiver · Accepted Answer

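As I mentioned in the comments, str_detect(text, my_string) is the bottleneck in your code. Also note that it does not do quite what you expect. It performs a regular-expression search, and because "a" is one of the entries in stringr::words, every text that contains the letter "a" anywhere is matched as well. See the example below.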

library(data.table)
library(dtplyr)
library(stringr)
library(dplyr)


my_dt <- data.table(id = 1:300000,
                    text = rep(c("this is some text", "this is some more text", 
                             "text palabras"), 100000)) #imagine many more strings
my_string <- paste(stringr::words, collapse = "|")

# start timing (note: system.time() is an alternative, but it doesn't print the result of the code)
timing <- Sys.time()

# run code
lazy_dt(my_dt, immutable = FALSE) %>%
  filter(str_detect(text, my_string)) %>%
  as.data.table()

            id                   text
     1:      1      this is some text
     2:      2 this is some more text
     3:      3          text palabras
     4:      4      this is some text
     5:      5 this is some more text
    ---                              
299996: 299996 this is some more text
299997: 299997          text palabras
299998: 299998      this is some text
299999: 299999 this is some more text
300000: 300000          text palabras

Sys.time() - timing
Time difference of 6.708245 secs
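
The over-matching can be reproduced on a tiny input. The following is a minimal sketch, not part of the timed code: the two-element `patterns` vector is a hypothetical stand-in for stringr::words, and the word-boundary pattern is shown only for contrast (it assumes the patterns contain no regex metacharacters).

```r
library(stringr)

texts <- c("this is some text", "text palabras")
patterns <- c("a", "some")  # hypothetical short pattern list

# Plain alternation: "a" also matches inside "palabras"
substring_pattern <- paste(patterns, collapse = "|")
substring_hits <- str_detect(texts, substring_pattern)

# Word boundaries restrict the match to whole words
word_pattern <- paste0("\\b(?:", paste(patterns, collapse = "|"), ")\\b")
word_hits <- str_detect(texts, word_pattern)

substring_hits  # TRUE TRUE
word_hits       # TRUE FALSE
```

A boundary-anchored regex fixes the correctness problem, but it is still a regex scan over every text, so it should not be expected to remove the bottleneck.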

Note: the data.table equivalent of your code above is:

my_dt[str_detect(text, my_string), ]

This times at about 6.52 seconds, so it is not much of an improvement.

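As the 300,000-row result above shows, the filter returns every sentence, including "text palabras", because "palabras" contains an "a". That row should not be there. Now, data.table has an operator called %chin%, which works like %in% but is specialized for character vectors and much faster. To get whole-word matches, we only need to tokenize all the text, which can be done with the unnest_tokens function from tidytext. This function respects the data.table format. After that I filter the data on matching words, drop the word column, and take the unique rows of the data.table. The deduplication is needed because the result can contain duplicate rows when several words in the same text match. Even with the extra function calls, this is about 3 times faster.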

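library(tidytext)

timing <- Sys.time()
# one row per (id, word); drop = FALSE keeps the original text column
my_dt <- unnest_tokens(my_dt, word, text, drop = FALSE)
# keep rows whose token is in the word list, then deduplicate per document
my_dt <- unique(my_dt[word %chin% stringr::words, ], by = c("id", "text"))[, c("id", "text")]
my_dt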

            id                   text
     1:      1      this is some text
     2:      2 this is some more text
     3:      4      this is some text
     4:      5 this is some more text
     5:      7      this is some text
    ---                              
199996: 299993 this is some more text
199997: 299995      this is some text
199998: 299996 this is some more text
199999: 299998      this is some text
200000: 299999 this is some more text

Sys.time() - timing
Time difference of 2.380911 secs
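
The tokenize, filter, and deduplicate steps can be traced on a three-row table. This is a minimal sketch using only data.table: the whitespace split via strsplit() is a simplified stand-in for unnest_tokens (which also handles punctuation), and `vocab` is a hypothetical short pattern list.

```r
library(data.table)

dt <- data.table(
  id = 1:3,
  text = c("this is some text", "this is some more text", "text palabras")
)
vocab <- c("a", "some", "more")  # hypothetical pattern list

# One row per (id, word): lower-case, then split on whitespace
tokens <- dt[, .(word = unlist(strsplit(tolower(text), "\\s+"))), by = .(id, text)]

# Exact whole-word lookup, then one row per matching document
matched <- unique(tokens[word %chin% vocab, .(id, text)])
matched
```

Row 2 matches both "some" and "more", which is why the unique() step is needed; row 3 is dropped because "palabras" is not in `vocab` as a whole word.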

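To speed things up further, you can set the number of threads that data.table uses. By default (on my system) this is set to 2. You can check the current value with getDTthreads(). When I add one thread with setDTthreads(3), the new code returns in about 1.6 seconds. Perhaps someone can speed this up even more by doing the work in the .SD part of data.table.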
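
Checking and changing the thread count might look like the sketch below. The value 3 is simply the one used above; the effective count can be capped by the number of available cores, and whether extra threads help depends on which data.table operations dominate the run.

```r
library(data.table)

old_threads <- getDTthreads()  # thread count currently in use
setDTthreads(3)                # value from the answer; tune for your machine
new_threads <- getDTthreads()  # may be lower than 3 if fewer cores are available

# ... run the tokenize / %chin% code here and time it ...

setDTthreads(old_threads)      # restore the previous setting
```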