Let me try to explain my problem with an example. I have a large corpus and a substring, as follows:
corpus = """very quick service, polite workers(cory, i think that's his name), i basically just drove there and got a quote(which seems to be very fair priced), then dropped off my car 4 days later(because they were fully booked until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now."""
substring = """until then then i dropped off my car on my appointment day then the same day the shop called me and notified me that the the job is done i can go pickup my car when i go checked out my car i was amazed by the job they ve done to it and they even gave that dirty car a wash prob even waxed it or coated it cuz it was shiny as hell tires shine mats were vacuumed too i gave them a dirty broken car they gave me back a what seems like a brand new car i m happy with the result and i will def have all my car s work done by this place from now"""
The substring is very similar to the corresponding text in the corpus, but not identical.
If I do something like this,
import re
re.search(substring, corpus, flags=re.I)  # returns None: the substring is very similar to the corpus text but not an exact match
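In the corpus, the substring appears as shown below. It differs slightly from the substring I have, so the regex search fails. Can anyone suggest a good alternative for finding similar substrings?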
until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now
I tried the difflib library, but it did not fit my use case.
Some background:
The substring I have now was produced some time ago by preprocessing the corpus with this regular expression:
re.sub("[^a-zA-Z]", " ", corpus)
Now I need to use the substring I have to do a reverse lookup in the original corpus text and find its start and end indices there.
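One way to picture the reverse lookup, offered only as a sketch: since the preprocessing turned every non-letter character into a space, the words of the substring should still occur in the corpus in the same order, separated by runs of non-letter characters. The helper name find_span and the short sample strings below are hypothetical, and the sketch assumes the substring was changed only by that re.sub call plus whitespace collapsing.

```python
import re

def find_span(substring, corpus):
    """Return (start, end) indices in corpus of the first region whose
    words match the words of substring, or None if nothing matches.

    Assumption: substring came from re.sub("[^a-zA-Z]", " ", corpus),
    possibly with runs of spaces collapsed, and was not edited otherwise.
    """
    words = substring.split()
    if not words:
        return None
    # Between consecutive words, allow one or more non-letter characters,
    # which is what the preprocessing replaced with spaces.
    body = r"[^a-zA-Z]+".join(re.escape(w) for w in words)
    # Do not let the match start or end in the middle of a longer word.
    pattern = r"(?<![a-zA-Z])" + body + r"(?![a-zA-Z])"
    match = re.search(pattern, corpus, flags=re.I)
    return match.span() if match else None

# Shortened, hypothetical sample in the style of the real data.
corpus = "fully booked until then), then i dropped off my car. i'm happy!"
substring = "until then then i dropped off my car i m happy"

span = find_span(substring, corpus)
if span is not None:
    start, end = span
    print(start, end, repr(corpus[start:end]))
```

The returned pair can be used directly as corpus[start:end]. If the substring was altered in other ways (for example, words removed or reordered), this pattern would not match and a fuzzy matcher would be needed instead.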
The substring was produced with re.sub("[^a-zA-Z]", " ", corpus). I need to do the reverse lookup. - user_12