If you want to group the sections, you can use the itertools.groupby function, with blank lines as the delimiter:
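from itertools import groupby

with open("in.txt") as f:
    # The key is True for lines with text and False for blank lines,
    # so each run of non-blank lines comes out as one group.
    for k, sec in groupby(f, key=lambda x: bool(x.strip())):
        if k:
            print(list(sec))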
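To try the same grouping without a file, here is a minimal sketch that runs on an in-memory list of lines. The helper name split_paragraphs and the sample lines are made up for illustration; the grouping key is the same one used above.

```python
from itertools import groupby


def split_paragraphs(lines):
    """Return each run of non-blank lines joined into one string."""
    paragraphs = []
    # groupby starts a new group whenever the key flips between
    # True (the line has text) and False (the line is blank).
    for has_text, group in groupby(lines, key=lambda line: bool(line.strip())):
        if has_text:
            paragraphs.append("".join(group).rstrip("\n"))
    return paragraphs


# Hypothetical sample input, not taken from the question.
sample = ["TITLE\n", "\n", "\n", "first line\n", "second line\n", "\n", "last paragraph\n"]
print(split_paragraphs(sample))
# ['TITLE', 'first line\nsecond line', 'last paragraph']
```

Since an open file is also an iterable of lines, passing the file object in place of sample should behave the same way.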
With a bit more itertools trickery, we can use the all-uppercase titles as delimiters and pull out the first paragraph of each section:
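from itertools import groupby, takewhile

with open("in.txt") as f:
    grps = groupby(f, key=lambda x: x.isupper())
    for k, sec in grps:
        if k:
            # The group right after a title holds the section body.
            v = next(grps)[1]
            # Skip the two blank lines that follow the title.
            next(v, ""), next(v, "")
            # Take lines up to the next blank line: the first paragraph.
            print(list(takewhile(lambda x: bool(x.strip()), v)))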
Which gives you:
['There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.\n']
['What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.']
Each section starts with an all-uppercase title, so once we hit one we know there are two blank lines, then the first paragraph, and the pattern repeats.
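To see why that pattern can be relied on, it may help to look at the groups groupby produces when keyed on str.isupper. The sketch below uses made-up sample lines and a hypothetical helper, group_shapes. A blank line has no cased characters, so isupper() is False for it and it lands in the same group as the paragraphs that follow.

```python
from itertools import groupby


def group_shapes(lines):
    """Return (is_title, number_of_lines) for each consecutive group."""
    return [(is_title, len(list(group)))
            for is_title, group in groupby(lines, key=str.isupper)]


# Hypothetical sample laid out like the text in the question:
# title, two blank lines, paragraphs separated by blank lines.
sample = [
    "FIRST TITLE\n", "\n", "\n",
    "First paragraph.\n", "\n",
    "Second paragraph.\n", "\n", "\n",
    "SECOND TITLE\n", "\n", "\n",
    "Only paragraph.\n",
]
print(group_shapes(sample))
# [(True, 1), (False, 7), (True, 1), (False, 3)]
```

Each True group is a title, and the False group after it is the whole section body, beginning with the two blank lines. One caveat under this key: any body line whose letters are all uppercase would also be treated as a title.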
To break it down using loops:
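from itertools import groupby


def parse_sec(bk):
    with open(bk) as f:
        grps = groupby(f, key=lambda x: x.isupper())
        for k, sec in grps:
            if k:
                print("First paragraph from section titled :{}".format(next(sec).rstrip()))
                # The group right after the title holds the section body.
                v = next(grps)[1]
                # Skip the two blank lines after the title.
                next(v, ""), next(v, "")
                for line in v:
                    # Stop at the blank line that ends the first paragraph.
                    if not line.strip():
                        break
                    # Strip the trailing newline so print does not double it.
                    print(line.rstrip())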
Run against your text:
In [11]: cat -E in.txt
THE LAY OF THE LAND$
$
$
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.$
$
Of all the kinds of interest attaching to the study of the world's wild animals, there are none that surpass the study of their minds, their morals, and the acts that they perform as the results of their mental processes.$
$
$
WILD ANIMAL TEMPERAMENT & INDIVIDUALITY$
$
$
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.
The dollar signs mark the newlines. The output is:
In [12]: parse_sec("in.txt")
First paragraph from section titled :THE LAY OF THE LAND
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.
First paragraph from section titled :WILD ANIMAL TEMPERAMENT & INDIVIDUALITY
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.
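If you would rather collect the results than print them, the same idea can be wrapped in a generator. This is only a sketch under a few assumptions: the name first_paragraphs is made up, leading blank lines are skipped with dropwhile instead of counting exactly two, and a title on the very last line yields an empty paragraph instead of raising StopIteration.

```python
from itertools import dropwhile, groupby, takewhile


def first_paragraphs(lines):
    """Yield (title, first_paragraph) for each all-uppercase title line."""
    groups = groupby(lines, key=str.isupper)
    for is_title, group in groups:
        if not is_title:
            continue
        title = next(group).rstrip()
        # The next group is the section body; fall back to an empty
        # body if the title happens to be the last line of the input.
        _, body = next(groups, (False, iter(())))
        # Drop the blank lines after the title, however many there are.
        body = dropwhile(lambda line: not line.strip(), body)
        # Keep lines until the blank line that ends the first paragraph.
        paragraph = "".join(takewhile(lambda line: bool(line.strip()), body))
        yield title, paragraph.rstrip()


# Hypothetical sample input, not the text from the question.
sample = [
    "FIRST TITLE\n", "\n", "\n",
    "First paragraph.\n", "\n",
    "Second paragraph.\n", "\n", "\n",
    "SECOND TITLE\n", "\n", "\n",
    "Only paragraph.\n",
]
for title, paragraph in first_paragraphs(sample):
    print(title, "->", paragraph)
```

Because it accepts any iterable of lines, something like dict(first_paragraphs(open("in.txt"))) should map each title to its first paragraph.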