Parsing two-dimensional text

Question

I need to parse text files in which the relevant information is often spread across multiple lines in a non-linear way. For example:

 1         IN THE SUPERIOR COURT OF THE STATE OF SOME STATE           
 2              IN AND FOR THE COUNTY OF SOME COUNTY                
 3                      UNLIMITED JURISDICTION                        
 4                            --o0o--                                 
 5                                                                    
 6   JOHN SMITH AND JILL SMITH,         )                             
                                        )                             
 7                  Plaintiffs,         )                             
                                        )                             
 8        vs.                           )     No. 12345
                                        )                             
 9   ACME CO, et al.,                   )                             
                                        )                             
10                  Defendants.         )                             
     ___________________________________)

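I need to extract the identities of the plaintiffs and the defendants.

The format of these transcripts varies widely, so I can't always count on those nice parentheses being there, or on the plaintiff and defendant information being neatly boxed off together. For example: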

 1        SUPREME COURT OF THE STATE OF SOME OTHER STATE
                      COUNTY OF COUNTYVILLE
 2                  First Judicial District
                     Important Litigation
 3  --------------------------------------------------X
    THIS DOCUMENT APPLIES TO:
 4
    JOHN SMITH,
 5                            Plaintiff,          Index No.
                                                  2000-123
 6
                                            DEPOSITION
 7                  - against -             UNDER ORAL
                                            EXAMINATION
 8                                              OF
                                            JOHN SMITH,
 9                                           Volume I

10  ACME CO,
    et al,
11                            Defendants.

12  --------------------------------------------------X

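The two constants are:

"Plaintiff" will appear after the plaintiff's name, but not necessarily on the same line.
The names of the plaintiffs and defendants will be in uppercase.

Any ideas?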

- alexbw

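What are the numbers on the left? Did you add them, or are they part of the source? You say the plaintiffs will be in uppercase, but "JOHN SMITH and JILL SMITH" contains lowercase letters. What characters can appear between the plaintiff's name and the text "Plaintiff"? Purely whitespace, parentheses, and commas? - Martin Smith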

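Those are line numbers from the source document. I've corrected the capitalization of the plaintiff names. Anything can appear between the plaintiff's name and "Plaintiff"; it is not guaranteed to be only non-alphabetic characters and whitespace. - alexbw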

You could try using a neural network. They work great for text parsing: http://thedailywtf.com/Articles/No,_We_Need_a_Neural_Network.aspx - Robert Fraser

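I think I may need to apply some machine learning: it's hard to convey just how inconsistent these transcripts are. You've all submitted very good solutions for the examples I posted, but for every special case you handle, I can find three more transcripts (prepared by different transcription companies, of course) that violate and break your solution. I'm considering boosting a simple lexical analyzer. - alexbw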

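People look at your two examples and say "I can do that", rather than seeing the genuinely hard general problem (at least N^3 hard!). My immediate thought: the text has to be put into a two-dimensional array so that you can detect "islands". Anyway, if only for my own future reference, I'll point to this Microsoft Research paper: http://research.microsoft.com/pubs/69347/docgeom_icdar2005.pdf - Ron Burk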
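
One possible reading of the "islands" idea in the comment above, sketched in Python. This is only an illustration and not necessarily what the commenter or the linked paper has in mind. The names `find_islands` and `max_row_gap` are hypothetical, and the rule used here (runs of text on nearby rows belong together when their column spans overlap) is an assumption:

```python
import re

def find_islands(text, max_row_gap=1):
    """Group runs of text into 'islands': runs on nearby rows whose columns overlap."""
    # 1. Horizontal runs: words separated by single spaces; two or more spaces end a run.
    runs = []  # (row, start_col, end_col, text)
    for row, line in enumerate(text.splitlines()):
        for m in re.finditer(r'\S+(?: \S+)*', line):
            runs.append((row, m.start(), m.end(), m.group()))

    # 2. Union-find: merge runs on different, nearby rows with overlapping column spans.
    parent = list(range(len(runs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (r1, s1, e1, _) in enumerate(runs):
        for j in range(i + 1, len(runs)):
            r2, s2, e2, _ = runs[j]
            if r2 - r1 > max_row_gap:
                break  # runs are stored in row order, so the rest are even farther away
            if r2 != r1 and s1 < e2 and s2 < e1:
                parent[find(j)] = find(i)

    # 3. Collect the text of each island in reading order.
    islands = {}
    for i, run in enumerate(runs):
        islands.setdefault(find(i), []).append(run[3])
    return [' '.join(parts) for parts in islands.values()]

# Lines 6 to 11 of the second example in the question.
sample = """\
 6
                                            DEPOSITION
 7                  - against -             UNDER ORAL
                                            EXAMINATION
 8                                              OF
                                            JOHN SMITH,
 9                                           Volume I

10  ACME CO,
    et al,
11                            Defendants.
"""

for island in find_islands(sample):
    print(repr(island))
```

On this excerpt the right-hand deposition caption (including its "JOHN SMITH,") forms one island, separate from "ACME CO, et al," on the left. Matching a name island to its "Plaintiff" or "Defendants" label would still need a further step.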

1 Answer

- mechanical_meat · Accepted Answer

I like Martin's answer.
Here is perhaps a more general approach using Python:

import re

# load file into memory 
# (if large files, provide some limit to how much of the file gets loaded)
with open('paren.txt','r') as f:
  paren = f.read() # example doc with parens

# match all sequences of one or more alphanumeric (or underscore) characters 
# when followed by the word `Plaintiff`; this is intentionally general
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)', paren, 
    re.DOTALL|re.MULTILINE)

# join the list separating by whitespace
str_of_matches = ' '.join(list_of_matches)

# split string by digits (line numbers)
tokens = re.split(r'\d',str_of_matches)

# plaintiffs will be in 2nd-to-last group
plaintiff = tokens[-2].strip()

Test:

with open('paren.txt','r') as f:
  paren = f.read() # example doc with parens
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)',paren,
  re.DOTALL|re.MULTILINE)
str_of_matches = ' '.join(list_of_matches)
tokens = re.split(r'\d', str_of_matches)
plaintiff = tokens[-2].strip()
print(plaintiff)
# prints JOHN SMITH AND JILL SMITH

with open('no_paren.txt','r') as f:
  no_paren = f.read() # example doc with no parens
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)',no_paren,
  re.DOTALL|re.MULTILINE)
str_of_matches = ' '.join(list_of_matches)
tokens = re.split(r'\d', str_of_matches)
plaintiff = tokens[-2].strip()
print(plaintiff)
# prints JOHN SMITH
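
The question also asks for the defendants. The same steps could be wrapped in a helper that takes the role label as a parameter. This is a sketch rather than part of the original answer: `extract_party` is a hypothetical name, and it splits on `\d+` instead of `\d` so that a two-digit line number such as 10 acts as a single separator. Like the code above, it assumes the role label is the first word on a numbered line, and it drops punctuation from the name.

```python
import re

def extract_party(text, role):
    """Return the words between the last line number and the last `role` label."""
    # Every word that has `role` somewhere after it (same pattern as above).
    pattern = r'(\w+)(?=.*' + re.escape(role) + r')'
    words = re.findall(pattern, text, re.DOTALL)
    # Split on whole line numbers; \d+ keeps "10" as one separator.
    tokens = re.split(r'\d+', ' '.join(words))
    # The piece after the last line number is empty, so the name is second to last.
    return tokens[-2].strip() if len(tokens) >= 2 else ''

# Lines 6 to 10 of the first example in the question.
sample = """\
 6   JOHN SMITH AND JILL SMITH,         )
                                        )
 7                  Plaintiffs,         )
                                        )
 8        vs.                           )     No. 12345
                                        )
 9   ACME CO, et al.,                   )
                                        )
10                  Defendants.         )
"""

print(extract_party(sample, 'Plaintiff'))  # JOHN SMITH AND JILL SMITH
print(extract_party(sample, 'Defendant'))  # ACME CO et al
```

As the comments on the question point out, transcripts that put the label elsewhere on the line, or that have no line numbers at all, would break this just as they break the original.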