I have two dataframes, formatted as follows:
df_search
SEARCH
part1
anotherpart
onemorepart
df_all
FILE EXTENSION PATH
part1_1 .prt //server/folder1/part1_1
part1_2 .prt //server/folder2/part1_2
part1_2 .pdf //server/folder3/part1_2
part1_3 .prt //server/folder2/part1_3
anotherpart_1 .prt //server/folder1/anotherpart_1
anotherpart_2 .prt //server/folder3/anotherpart_2
anotherpart_3 .prt //server/folder2/anotherpart_3
anotherpart_3 .cgm //server/folder1/anotherpart_3
anotherpart_4 .prt //server/folder3/anotherpart_4
onemorepart_1 .prt //server/folder2/onemorepart_1
onemorepart_2 .prt //server/folder1/onemorepart_2
onemorepart_2 .dwg //server/folder2/onemorepart_2
onemorepart_3 .prt //server/folder1/onemorepart_3
onemorepart_4 .prt //server/folder1/onemorepart_4
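For anyone who wants to reproduce the problem, the two sample frames above could be built roughly like this. This is only a sketch: the column names are taken from the tables, and the helper name `rows` is introduced here for illustration.

```python
import pandas as pd

# Search terms, one per row (column name taken from the table above).
df_search = pd.DataFrame({"SEARCH": ["part1", "anotherpart", "onemorepart"]})

# (FILE, EXTENSION, PATH) tuples copied from the df_all table above.
rows = [
    ("part1_1", ".prt", "//server/folder1/part1_1"),
    ("part1_2", ".prt", "//server/folder2/part1_2"),
    ("part1_2", ".pdf", "//server/folder3/part1_2"),
    ("part1_3", ".prt", "//server/folder2/part1_3"),
    ("anotherpart_1", ".prt", "//server/folder1/anotherpart_1"),
    ("anotherpart_2", ".prt", "//server/folder3/anotherpart_2"),
    ("anotherpart_3", ".prt", "//server/folder2/anotherpart_3"),
    ("anotherpart_3", ".cgm", "//server/folder1/anotherpart_3"),
    ("anotherpart_4", ".prt", "//server/folder3/anotherpart_4"),
    ("onemorepart_1", ".prt", "//server/folder2/onemorepart_1"),
    ("onemorepart_2", ".prt", "//server/folder1/onemorepart_2"),
    ("onemorepart_2", ".dwg", "//server/folder2/onemorepart_2"),
    ("onemorepart_3", ".prt", "//server/folder1/onemorepart_3"),
    ("onemorepart_4", ".prt", "//server/folder1/onemorepart_4"),
]
df_all = pd.DataFrame(rows, columns=["FILE", "EXTENSION", "PATH"])
```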
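The full df_search has 15,000 items and df_all has 550,000 items. I am trying to merge the two dataframes by matching each search term as a substring of the FILE string. The output I expect looks like this: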
SEARCH FILE EXTENSION PATH
part1 part1_1 .prt //server/folder1/part1_1
part1 part1_2 .prt //server/folder2/part1_2
part1 part1_2 .pdf //server/folder3/part1_2
part1 part1_3 .prt //server/folder2/part1_3
anotherpart anotherpart_1 .prt //server/folder1/anotherpart_1
anotherpart anotherpart_2 .prt //server/folder3/anotherpart_2
anotherpart anotherpart_3 .prt //server/folder2/anotherpart_3
anotherpart anotherpart_3 .cgm //server/folder1/anotherpart_3
anotherpart anotherpart_4 .prt //server/folder3/anotherpart_4
onemorepart onemorepart_1 .prt //server/folder2/onemorepart_1
onemorepart onemorepart_2 .prt //server/folder1/onemorepart_2
onemorepart onemorepart_2 .dwg //server/folder2/onemorepart_2
onemorepart onemorepart_3 .prt //server/folder1/onemorepart_3
onemorepart onemorepart_4 .prt //server/folder1/onemorepart_4
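A simple dataframe merge does not work because the strings never match exactly (the search term is always a substring of the file name). I also tried the following, based on other questions on Stack Overflow: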
df_all[df_all.FILE.str.contains('|'.join(df_search.SEARCH))]
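This gives me the complete list of all matching rows in df_all, but I cannot tell which search string returned which result.
I managed to get it working with a for loop, but it is slow on my dataset (67 minutes):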
import pandas as pd

super_df = []
for search_item in df_search.SEARCH:
    # rows of df_all whose FILE contains this search term
    mask = df_all.FILE.str.contains(search_item)
    temp_df = df_all[mask].copy()
    temp_df['SEARCH'] = search_item  # record which term matched
    super_df.append(temp_df)
super_df = pd.concat(super_df, axis=0, ignore_index=True)
Is it possible to use vectorization to improve performance?
Thanks
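ValueError: Wrong number of items passed 12, placement implies 1. Do you know how to handle this? - ah bon
pat = "|".join([re.escape(x) for x in df_search.SEARCH]), if the column contains special characters, the values need to be escaped. - jezrael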
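One way the escaped pattern from the comment above might be used is with `Series.str.extract` and a single capture group, so that each FILE value is tagged with the search term it contains in one vectorized pass. This is a minimal sketch on a small made-up subset of the data, not necessarily the approach the commenters had in mind; the names `matched` and `result` are introduced here for illustration. A plausible (unconfirmed) cause of the ValueError quoted above is a pattern with several capture groups, which makes `extract` return several columns that cannot be assigned to one.

```python
import re
import pandas as pd

# Small illustrative subset; "unrelated_9" is a made-up non-matching row.
df_search = pd.DataFrame({"SEARCH": ["part1", "anotherpart", "onemorepart"]})
df_all = pd.DataFrame({
    "FILE": ["part1_1", "anotherpart_3", "onemorepart_2", "unrelated_9"],
    "EXTENSION": [".prt", ".cgm", ".dwg", ".prt"],
})

# Escape every term so regex metacharacters are matched literally.
pat = "|".join(re.escape(x) for x in df_search["SEARCH"])

# Wrap the whole alternation in ONE capture group; with expand=False
# str.extract then returns a single Series (NaN where nothing matched).
matched = df_all["FILE"].str.extract("(" + pat + ")", expand=False)

# Attach the matched term, drop rows without a match, and reorder columns.
result = (
    df_all.assign(SEARCH=matched)
    .dropna(subset=["SEARCH"])
    .reset_index(drop=True)[["SEARCH", "FILE", "EXTENSION"]]
)
```

Note that this assigns at most one search term per file (the first alternative that matches at the leftmost position), so overlapping terms such as a term that is a prefix of another would need the terms ordered carefully or a different approach.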