I need to search a very large text file for a particular string. It's a build log with about 5,000 lines of text. What's the best way to do that? Using a regex shouldn't cause any problems, right? I'll read the file block by block and use a simple find.
with open('largeFile', 'r') as inF:
    for line in inF:
        if 'myString' in line:
            pass  # do_something
You can do a simple find:
f = open('file.txt', 'r')
lines = f.read()
answer = lines.find('string')
If a simple find is enough for your needs, it will be considerably faster than a regular expression.
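For reference, str.find returns the starting index of the first match, or -1 when there is no match (the sample string here is made up):

```python
log = "error: build failed at step 3"

# find reports where the substring starts, counting from 0
print(log.find('build'))    # 7
# a missing substring is signalled by -1, not an exception
print(log.find('success'))  # -1
```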
find returns the index of the first match: -1 means no match, any other value is the starting index. – Chen A.
f.read() loads the entire file into memory, which is slow and pointless for very large files; it is better to iterate over the file line by line instead (with a generator or a plain for loop). – Chen A.

import os

def fnd(fname, s, start=0):
    # the file is opened in binary mode, so s should be a bytes object
    with open(fname, 'rb') as f:
        fsize = os.path.getsize(fname)
        bsize = 4096
        buffer = None
        if start > 0:
            f.seek(start)
        overlap = len(s) - 1
        while True:
            # step back by overlap so a match straddling two buffers is still seen
            if f.tell() >= overlap and f.tell() < fsize:
                f.seek(f.tell() - overlap)
            buffer = f.read(bsize)
            if buffer:
                pos = buffer.find(s)
                if pos >= 0:
                    return f.tell() - (len(buffer) - pos)
            else:
                return -1
The idea behind this: I once used a similar approach to find the signatures of files inside larger ISO9660 images; it was quite fast and did not use much memory. You can also use a bigger buffer to speed things up.
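When the file fits in the address space, an alternative to manual buffering is mmap, which exposes the file as a searchable byte sequence and lets the OS page it in lazily. A minimal sketch under that assumption (the function name is made up):

```python
import mmap

def find_in_file(fname, needle):
    """Return the byte offset of the first match of needle, or -1."""
    with open(fname, 'rb') as f:
        # length=0 maps the whole file; ACCESS_READ keeps the mapping read-only
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle)
```

Note that mmap.find takes bytes, not str, and that mapping an empty file raises ValueError, so this sketch assumes a non-empty file.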
Here is a multiprocessing example of searching a file for text. TODO: how do you stop the processes once the text has been found, and reliably report the line number?
import multiprocessing, os, time
NUMBER_OF_PROCESSES = multiprocessing.cpu_count()

def FindText(host, file_name, text):
    file_size = os.stat(file_name).st_size
    m1 = open(file_name, "r")
    # work out file size to divide up to farm out line counting
    chunk = (file_size // NUMBER_OF_PROCESSES) + 1
    lines = 0
    line_found_at = -1
    seekStart = chunk * host
    seekEnd = chunk * (host + 1)
    if seekEnd > file_size:
        seekEnd = file_size
    if host > 0:
        m1.seek(seekStart)
        m1.readline()
    line = m1.readline()
    while len(line) > 0:
        lines += 1
        if text in line:
            # found the line
            line_found_at = lines
            break
        if m1.tell() > seekEnd or len(line) == 0:
            break
        line = m1.readline()
    m1.close()
    return host, lines, line_found_at

# Function run by worker processes
def worker(input, output):
    for host, file_name, text in iter(input.get, 'STOP'):
        output.put(FindText(host, file_name, text))

def main(file_name, text):
    t_start = time.time()
    # Create queues
    task_queue = multiprocessing.Queue()
    done_queue = multiprocessing.Queue()
    # submit file to open and text to find
    print('Starting', NUMBER_OF_PROCESSES, 'searching workers')
    for h in range(NUMBER_OF_PROCESSES):
        t = (h, file_name, text)
        task_queue.put(t)
    # Start worker processes
    for _i in range(NUMBER_OF_PROCESSES):
        multiprocessing.Process(target=worker, args=(task_queue, done_queue)).start()
    # Get and print results
    results = {}
    for _i in range(NUMBER_OF_PROCESSES):
        host, lines, line_found = done_queue.get()
        results[host] = (lines, line_found)
    # Tell child processes to stop
    for _i in range(NUMBER_OF_PROCESSES):
        task_queue.put('STOP')
        # print("Stopping Process #%s" % _i)
    total_lines = 0
    for h in range(NUMBER_OF_PROCESSES):
        if results[h][1] > -1:
            print(text, 'Found at', total_lines + results[h][1], 'in', time.time() - t_start, 'seconds')
            break
        total_lines += results[h][0]

if __name__ == "__main__":
    main(file_name='testFile.txt', text='IPI1520')
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> # keyword_processor.add_keyword(<unclean name>, <standardised name>)
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
>>> keywords_found
>>> # ['New York', 'Bay Area']
When extracting the offsets as well:
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.', span_info=True)
>>> keywords_found
>>> # [('New York', 7, 16), ('Bay Area', 21, 29)]
Limitation: I'd like to point out that this is not the best solution to the question as asked. For the given question, the in approach from eumiro's answer (with the caveat raised by @bfontaine in the comments) is certainly the best answer.
flashtext is a powerful solution if you want to find all occurrences of a set of keywords in a given text. That is something in cannot do (and was not designed to do).
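For comparison, finding every occurrence of one plain string without flashtext can be done with a repeated str.find loop; a minimal sketch (the helper name is made up):

```python
def all_occurrences(text, needle):
    """Return the start index of every (possibly overlapping) match of needle."""
    positions = []
    i = text.find(needle)
    while i != -1:
        positions.append(i)
        # resume just past the last start so overlapping matches are found too
        i = text.find(needle, i + 1)
    return positions

print(all_occurrences('I love Big Apple and Big Apple.', 'Big Apple'))  # [7, 21]
```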
A match starting at offset 0 is an edge case in laurasia's answer, which returns -1 for it. The version below fixes that, and also stops at end of file instead of re-reading the tail forever when the string is absent:

def fnd(fname, goal, start=0, bsize=4096):
    if bsize < len(goal):
        raise ValueError("The buffer size must be larger than the string being searched for.")
    with open(fname, 'rb') as f:
        if start > 0:
            f.seek(start)
        overlap = len(goal) - 1
        while True:
            buffer = f.read(bsize)
            pos = buffer.find(goal)
            if pos >= 0:
                return f.tell() - len(buffer) + pos
            if len(buffer) < bsize:
                # a short read means we hit end of file: no match
                return -1
            # step back by overlap so a match straddling two buffers is still seen
            f.seek(f.tell() - overlap)
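A quick standalone check of the buffered search, deliberately using a tiny buffer so the match straddles a buffer boundary (the function is restated so the snippet runs on its own; the file contents are throwaway):

```python
import os
import tempfile

def fnd(fname, goal, start=0, bsize=4096):
    """Buffered binary search: return the offset of goal in fname, or -1."""
    if bsize < len(goal):
        raise ValueError("The buffer size must be larger than the string being searched for.")
    with open(fname, 'rb') as f:
        if start > 0:
            f.seek(start)
        overlap = len(goal) - 1
        while True:
            buffer = f.read(bsize)
            pos = buffer.find(goal)
            if pos >= 0:
                return f.tell() - len(buffer) + pos
            if len(buffer) < bsize:
                return -1  # short read: end of file, no match
            f.seek(f.tell() - overlap)

with tempfile.NamedTemporaryFile(delete=False) as tf:
    # with bsize=16 below, the needle spans bytes 12..17, crossing a buffer edge
    tf.write(b'x' * 12 + b'needle' + b'y' * 30)
    path = tf.name

print(fnd(path, b'needle', bsize=16))  # 12, despite the match crossing buffers
print(fnd(path, b'absent', bsize=16))  # -1
os.unlink(path)
```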
5,000 lines isn't big (well, that depends on how long the lines are...).
Anyway: assuming the string is a word, delimited by whitespace...

lines = open(file_path, 'r').readlines()
str_wanted = "whatever_youre_looking_for"
for i in range(len(lines)):
    l1 = lines[i].split()  # split the i-th line into whitespace-delimited words
    for p in range(len(l1)):
        if l1[p] == str_wanted:
            # found it: i is the file line, lines[i] is the full line, etc.
            break
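Under the same whitespace-delimited assumption, the scan reads more idiomatically with enumerate. A self-contained sketch working on an in-memory list of lines (the helper name and sample lines are made up):

```python
def find_word(lines, wanted):
    """Return (line_index, line) for the first line containing the word, else None."""
    for i, line in enumerate(lines):
        # split() tokenises on any whitespace, so 'FOO' matches only as a whole word
        if wanted in line.split():
            return i, line
    return None

log = ["building target", "error: missing FOO", "done"]
print(find_word(log, "FOO"))     # (1, 'error: missing FOO')
print(find_word(log, "absent"))  # None
```

The same function works directly on an open file object, since iterating a file yields its lines.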