Extracting information from a string and converting it to a list

Question

3

I have a string like the following:

[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,

[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue

[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)

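I want to extract each "X" value together with its associated text and convert them into a list. See the expected output below:

Expected output: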

['X=250.44','DECEMBER 31,']
['X=307.5','respectively. The net decrease in the revenue']
['X=49.5','(US$ in millions)']

How can we do this in Python?

My approach:

mylist = []
for line in data.split("\n"):
    if line.strip():
        x_coord = re.findall('^(X=.*)\,$', line)
        text = re.findall('^(]\w +)', line)
        mylist.append([x_coord, text])

My approach does not find any values for x_coord or text.

- Crusader

1

Take a look at the str.split function or Python's regular expression library. - D Malan

2

I would go with re. - Jan Stránský

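str.split() probably won't work, but a regular expression definitely should. I was unable to extract the information with a regular expression and convert it to a list, so I turned to SO for help. - Crusader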

3 Answers

2

re solution:

import re

input = [
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,",
    "[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue",
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)",
]

def extract(s):
    # Raw string so that \d, \. and \] reach the regex engine unchanged
    match = re.search(r"(X=\d+(?:\.\d*)?).*?\](.*?)$", s)
    return match.groups()

output = [extract(item) for item in input]
print(output)

Output:

[
    ('X=250.44', 'DECEMBER 31,'),
    ('X=307.5', 'respectively. The net decrease in the revenue'),
    ('X=49.5', '(US$ in millions)'),
]

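Explanation:

\d ... a digit
\d+ ... one or more digits
(?:...) ... non-capturing group ("normal" parentheses)
\.\d* ... a dot followed by zero or more digits
(?:\.\d*)? ... optional (zero or one) "decimal part"
(X=\d+(?:\.\d*)?) ... first group, X=number
.*? ... zero or more of any character (non-greedy)
\] ... the ] character
$ ... end of string
\](.*?)$ ... second group, anything between ] and the end of the string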
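The question asks for a list of lists built from one multi-line string, so here is a minimal sketch of how the pattern explained above could be applied line by line. The names `PATTERN`, `extract_all` and `data` are illustrative assumptions and are not part of the original answer; lines that do not match (such as the blank lines between records) are simply skipped.

```python
import re

# Same pattern as explained above, compiled once and written as a raw string.
PATTERN = re.compile(r"(X=\d+(?:\.\d*)?).*?\](.*?)$")


def extract_all(data):
    """Return a [x_coord, text] list for every line of `data` that matches."""
    result = []
    for line in data.splitlines():
        match = PATTERN.search(line)
        if match is None:  # blank line or unexpected format: skip it
            continue
        result.append(list(match.groups()))
    return result


# Hypothetical variable holding the string from the question.
data = """[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,

[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue

[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)"""

for pair in extract_all(data):
    print(pair)
```

Checking for `None` before calling `groups()` avoids an `AttributeError` when a line has no `X=` entry.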

- Jan Stránský

2

Use a regular expression with named groups to capture the relevant parts:

>>> import re
>>> line = "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,"
>>> m = re.search(r'(?:\(X=)(?P<x_coord>.*?)(?:,.*])(?P<text>.*)$', line)
>>> m.groups()
('250.44', 'DECEMBER 31,')
>>> m['x_coord']
'250.44'
>>> m['text']
'DECEMBER 31,'

- tzaman

- jupiterbjy · Accepted Answer

Try this:

(X=[^,]*)(?:.*])(.*)

import re

source = """[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,
[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue
[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)""".split('\n')

pattern = r"(X=[^,]*)(?:.*])(.*)"

for line in source:
    print(re.search(pattern, line).groups())

Output:

('X=250.44', 'DECEMBER 31,')
('X=307.5', 'respectively. The net decrease in the revenue')
('X=49.5', '(US$ in millions)')

Every expected capture includes X=, so I simply made it part of one capturing group; feel free to add a non-capturing group if you need one.
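For illustration, here is one way the non-capturing variant mentioned above could look: `X=` is moved into a `(?:...)` group so that only the number is captured. The names `bare_pattern`, `sample` and `rows` are hypothetical and chosen only for this sketch.

```python
import re

# Variant of the pattern above: `X=` sits in a non-capturing group,
# so the first captured group holds only the numeric part.
bare_pattern = r"(?:X=)([^,]*)(?:.*])(.*)"

# Two of the sample lines from the question.
sample = [
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,",
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)",
]

rows = []
for line in sample:
    match = re.search(bare_pattern, line)
    if match:  # ignore lines without an X= entry
        rows.append(list(match.groups()))

print(rows)
```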