How do I match \r\n with a regular expression in Python?

3

I have a text that looks like this:

1
00:00:01,860 --> 00:00:31,210
Affil of fifth at fat at all the social ball and said, with all this little in the

2
00:00:31,210 --> 00:01:03,060
mid limited and will cost a lot, for want of a lot of it is I never do this or below are the innocent of fat in the annual own none will bit less often were a little the earth the oven for the area of some of them some of the atom in the long will recall the law, will cost you the ball a little less of Odessa and coal rule the Vikings in at a loss

3
00:01:03,980 --> 00:01:33,150
of our lady of one of the will of the wall routing visiting little sign of the limited use of a lot of wind up with a loss of 14 and uncivil will find a site to lop off call them into solid, a London, can we stop go to work as a gay sailor kissing a lot of that scene of the law that on them in this case

4
00:01:33,950 --> 00:02:03,190
will almost a kind wilkinson's, and that a settlement, or the fog collared of the unknown, some would call and all of this was a little, some of us up a lot of letters, union would quit them or not will be or will lend money to zoning and will open the door to that of the novel opens in

5
00:02:04,240 --> 00:02:24,180
it and solidity can cut later with boats can die to only see not open only to six and 0:50 and world go back a at the fat of that at that

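I want to extract only the sentences from the text, for example: "Affil of fifth at fat at all the social ball and said, with all this little in the mid limited and will cost a lot, for want of ..."
The raw text is as follows: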
  "1\r\n00:00:01,860 --> 00:00:31,210\r\nAffil of fifth at fat at all the social ball and said, with all this little in the\r\n\r\n2\r\n00:00:31,210 --> 00:01:03,060\r\nmid limited and will cost a lot, for want of a lot of it is I never do this or below are the innocent of fat in the annual own none will bit less often were a little the earth the oven for the area of some of them some of the atom in the long will recall the law, will cost you the ball a little less of Odessa and coal rule the Vikings in at a loss\r\n\r\n3\r\n00:01:03,980 --> 00:01:33,150\r\nof our lady of one of the will of the wall routing visiting little sign of the limited use of a lot of wind up with a loss of 14 and uncivil will find a site to lop off call them into solid, a London, can we stop go to work as a gay sailor kissing a lot of that scene of the law that on them in this case\r\n\r\n4\r\n00:01:33,950 --> 00:02:03,190\r\nwill almost a kind wilkinson's, and that a settlement, or the fog collared of the unknown, some would call and all of this was a little, some of us up a lot of letters, union would quit them or not will be or will lend money to zoning and will open the door to that of the novel opens in\r\n\r\n5\r\n00:02:04,240 --> 00:02:24,180\r\nit and solidity can cut later with boats can die to only see not open only to six and 0:50 and world go back a at the fat of that at that\r\n\r\n"

Inspecting the raw text shows that it can be split on the "\r\n" delimiter, but I don't know how to write the regular expression.


text.split('\n').strip()? - TigerhawkT3
2
Actually, text.splitlines()[2::4] looks more like it. - TigerhawkT3
2 Answers

4
Why not just take every fourth line, starting from the third? You can then join them with spaces.
text = '''1
00:00:01,860 --> 00:00:31,210
Affil of fifth at fat at all the social ball and said, with all this little in the

2
00:00:31,210 --> 00:01:03,060
mid limited and will cost a lot, for want of a lot of it is I never do this or below are the innocent of fat in the annual own none will bit less often were a little the earth the oven for the area of some of them some of the atom in the long will recall the law, will cost you the ball a little less of Odessa and coal rule the Vikings in at a loss

3
00:01:03,980 --> 00:01:33,150
of our lady of one of the will of the wall routing visiting little sign of the limited use of a lot of wind up with a loss of 14 and uncivil will find a site to lop off call them into solid, a London, can we stop go to work as a gay sailor kissing a lot of that scene of the law that on them in this case

4
00:01:33,950 --> 00:02:03,190
will almost a kind wilkinson's, and that a settlement, or the fog collared of the unknown, some would call and all of this was a little, some of us up a lot of letters, union would quit them or not will be or will lend money to zoning and will open the door to that of the novel opens in

5
00:02:04,240 --> 00:02:24,180
it and solidity can cut later with boats can die to only see not open only to six and 0:50 and world go back a at the fat of that at that'''
t = ' '.join(text.splitlines()[2::4])

Result:

>>> import textwrap
>>> for line in textwrap.wrap(t, width=50):
...     print(line)
...
Affil of fifth at fat at all the social ball and
said, with all this little in the mid limited and
will cost a lot, for want of a lot of it is I
never do this or below are the innocent of fat in
the annual own none will bit less often were a
little the earth the oven for the area of some of
them some of the atom in the long will recall the
law, will cost you the ball a little less of
Odessa and coal rule the Vikings in at a loss of
our lady of one of the will of the wall routing
visiting little sign of the limited use of a lot
of wind up with a loss of 14 and uncivil will find
a site to lop off call them into solid, a London,
can we stop go to work as a gay sailor kissing a
lot of that scene of the law that on them in this
case will almost a kind wilkinson's, and that a
settlement, or the fog collared of the unknown,
some would call and all of this was a little, some
of us up a lot of letters, union would quit them
or not will be or will lend money to zoning and
will open the door to that of the novel opens in
it and solidity can cut later with boats can die
to only see not open only to six and 0:50 and
world go back a at the fat of that at that
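
The [2::4] slice relies on every block being exactly four lines long (counter, timestamps, one line of text, blank line). As a hedged sketch for input where a caption may span several lines, the text can instead be split on blank lines first. The helper name extract_captions and the shortened sample string are illustrative assumptions, not part of the original answer, and the regular expression accepts both \r\n and \n line endings.

```python
import re

def extract_captions(raw):
    """Return the caption text of each block in an SRT-like string."""
    captions = []
    # Blocks are separated by one or more blank lines (\r\n or \n endings).
    for block in re.split(r'(?:\r?\n){2,}', raw.strip()):
        lines = block.splitlines()
        # lines[0] is the counter, lines[1] the timestamps, the rest is text.
        if len(lines) > 2:
            captions.append(' '.join(lines[2:]))
    return captions

# Shortened, hypothetical sample; the second caption spans two lines.
raw = ("1\r\n00:00:01,860 --> 00:00:31,210\r\nAffil of fifth at fat\r\n\r\n"
       "2\r\n00:00:31,210 --> 00:01:03,060\r\nmid limited and\r\nwill cost a lot\r\n\r\n")
captions = extract_captions(raw)
print(captions)
```

For the five-block text in the question, where every caption is a single line, ' '.join(extract_captions(text)) should give the same string as the slice.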

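@vks - So what makes you think this text was human-generated? It's a subtitle file, almost certainly generated automatically via OCR. - TigerhawkT3
May I ask you a question? For example, how can I split the text into sentences? The transcript has no real punctuation. - dd90p
@dd90p Don't use join - vks
@dd90p - Each individual line is still there in the raw data before the ' '.join(), but if that information isn't in the data, the actual sentence structure can't be recovered. - TigerhawkT3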
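
To illustrate the point about keeping the individual lines, here is a minimal sketch that skips the join and pairs each caption with its timestamp line. The shortened sample string and the variable names are assumptions made for illustration, and the slices assume the regular four-line block layout shown in the question.

```python
# Shortened, hypothetical sample using the same \r\n layout as the question.
sample = ("1\r\n00:00:01,860 --> 00:00:31,210\r\nAffil of fifth at fat\r\n\r\n"
          "2\r\n00:00:31,210 --> 00:01:03,060\r\nmid limited and will cost a lot\r\n\r\n")

# str.splitlines() treats \r\n as a single line break, so no regex is needed.
rows = sample.splitlines()
timestamps = rows[1::4]   # second line of each four-line block
captions = rows[2::4]     # third line of each four-line block

# Keep each caption next to its timestamp instead of joining everything.
for stamp, caption in zip(timestamps, captions):
    print(stamp, '|', caption)
```

This keeps the caption boundaries, but it cannot recover sentence boundaries that the transcript never contained.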

4

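There's no need for a regular expression here at all. - TigerhawkT3
@TigerhawkT3 No need for the drive-by downvote, is there? I find this more robust than yours. - vks
This particular expression is also very fragile: it will break if a relevant line starts with a digit. - TigerhawkT3
Even if you could improve it, it's still a fragile hack for the Y of an XY problem. In its current form it's so fragile that I'd almost *expect* it to fail... but the failures would be silent, because relevant lines that don't start with one of the hard-coded characters are simply dropped. This is clearly the wrong approach. - TigerhawkT3
Let us continue this discussion in chat: http://chat.stackoverflow.com/rooms/131120/discussion-between-tigerhawkt3-and-vks. - TigerhawkT3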
