Parse origin and destination cities from a string


I have a Pandas DataFrame where one column contains strings with certain travel details. My goal is to parse each string to extract the origin city and destination city (I ultimately want two new columns named "origin" and "destination").

The data looks like this:

df_col = [
    'new york to venice, italy for usd271',
    'return flights from brussels to bangkok with etihad from €407',
    'from los angeles to guadalajara, mexico for usd191',
    'fly to australia new zealand from paris from €422 return including 2 checked bags'
]

This should produce:

Origin: New York, USA; Destination: Venice, Italy
Origin: Brussels, BEL; Destination: Bangkok, Thailand
Origin: Los Angeles, USA; Destination: Guadalajara, Mexico
Origin: Paris, France; Destination: Australia / New Zealand (this is a complicated case given two countries)

What I've tried so far: various NLTK approaches, but the one that got me closest was using the nltk.pos_tag method to tag each word in the string. The result is a list of tuples of each word and its associated tag. Here's an example...

[('Fly', 'NNP'), ('to', 'TO'), ('Australia', 'NNP'), ('&', 'CC'), ('New', 'NNP'), ('Zealand', 'NNP'), ('from', 'IN'), ('Paris', 'NNP'), ('from', 'IN'), ('€422', 'NNP'), ('return', 'NN'), ('including', 'VBG'), ('2', 'CD'), ('checked', 'VBD'), ('bags', 'NNS'), ('!', '.')]

I'm stuck at this stage and am unsure how best to implement this. Can anyone point me in the right direction? Thanks.


I think what you're asking for here is magic =) - alvas
1 Answer


Short version

At first glance it seems impossible, unless you have access to an API with fairly sophisticated components.

Long version

At first glance, it looks like you're asking to magically solve a natural language problem. But let's break it down and scope it to a point where something is buildable.

First, to identify countries and cities, you need data that enumerates them, so let's try: https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json

At the top of the search results, we find https://datahub.io/core/world-cities, which leads to the world-cities.json file. Now let's load it into sets of countries and cities.

import requests
import json

cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))

countries = set([city['country'] for city in cities_json])
cities = set([city['name'] for city in cities_json])

Now that we have the data, let's try to build component one:

  • Task: detect whether a text contains any substring that matches a city/country.
  • Tool: https://github.com/vi3k6i5/flashtext (fast string search/matching)
  • Metric: the number of correctly identified cities/countries in the string

Let's put them together.

import requests
import json
from flashtext import KeywordProcessor

cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))

countries = set([city['country'] for city in cities_json])
cities = set([city['name'] for city in cities_json])


keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))


texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']
keyword_processor.extract_keywords(texts[0])

[out]:

['York', 'Venice', 'Italy']

Hey, what went wrong?!

Doing due diligence first, the initial hunch is that "new york" isn't in the data:

>>> "New York" in cities
False

What the?! #$%^&* For sanity's sake, let's check:

>>> len(countries)
244
>>> len(cities)
21940

Yes, you can't trust a single data source, so let's try to fetch all of them.

From https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json, you can find another link, https://github.com/dr5hn/countries-states-cities-database. Let's munge that too...

import requests
import json

cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities1_json = json.loads(requests.get(cities_url).content.decode('utf8'))

countries1 = set([city['country'] for city in cities1_json])
cities1 = set([city['name'] for city in cities1_json])

dr5hn_cities_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/cities.json"
dr5hn_countries_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/countries.json"

cities2_json = json.loads(requests.get(dr5hn_cities_url).content.decode('utf8'))
countries2_json = json.loads(requests.get(dr5hn_countries_url).content.decode('utf8'))

countries2 = set([c['name'] for c in countries2_json])
cities2 = set([c['name'] for c in cities2_json])

countries = countries2.union(countries1)
cities = cities2.union(cities1)

Now we're getting neurotic, so we need sanity checks.

>>> len(countries)
282
>>> len(cities)
127793

Whoa, that's a lot more cities than before.

Let's try the flashtext code again.

from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']

keyword_processor.extract_keywords(texts[0])

[out]:

['York', 'Venice', 'Italy']

Seriously?! Still no New York?! $%^&*

Okay, for more checks, let's search the city list for "york".

>>> [c for c in cities if 'york' in c.lower()]
['Yorklyn',
 'West York',
 'West New York',
 'Yorktown Heights',
 'East Riding of Yorkshire',
 'Yorke Peninsula',
 'Yorke Hill',
 'Yorktown',
 'Jefferson Valley-Yorktown',
 'New York Mills',
 'City of York',
 'Yorkville',
 'Yorkton',
 'New York County',
 'East York',
 'East New York',
 'York Castle',
 'York County',
 'Yorketown',
 'New York City',
 'York Beach',
 'Yorkshire',
 'North Yorkshire',
 'Yorkeys Knob',
 'York',
 'York Town',
 'York Harbor',
 'North York']

Oh! It's called "New York City" and not "New York"! The penny drops!

You: What kind of prank is this?!

Linguist: Welcome to the world of natural language processing, where natural language is a social construct, subject to communal and idiolectal variation.

You: Cut the crap. Tell me how to solve it.

NLP practitioner (a real one who deals with noisy user-generated text): You just have to add to the list. But before that, check the metric you get with the existing list.

For each text in a sample "test set", you should provide some gold labels to make sure you can "measure the metric".

from itertools import zip_longest
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris'))]

# No. of correctly extracted terms.
true_positives = 0
false_positives = 0
total_truth = 0

for text, label in texts_labels:
    extracted = keyword_processor.extract_keywords(text)

    # We're making some assumptions here that the order of 
    # extracted and the truth must be the same.
    true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
    false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
    total_truth += len(label)

    # Just visualization candies.
    print(text)
    print(extracted)
    print(label)
    print()

Actually, it doesn't look that bad. We get 90% accuracy:

>>> true_positives / total_truth
0.9

But I want 100% extraction!!

Alright, alright, so look at the "only" error the approach above is making: it's simply that "New York" isn't in the list of cities.

You: Why don't we just add "New York" to the city list? i.e.

keyword_processor.add_keyword('New York')

print(texts[0])
print(keyword_processor.extract_keywords(texts[0]))

[out]:

['New York', 'Venice', 'Italy']

You: See, I did it! Now I deserve a beer.

Linguist: How about 'I live in Marawi'?

>>> keyword_processor.extract_keywords('I live in Marawi')
[]

NLP practitioner (chiming in): How about 'I live in Jeju'?

>>> keyword_processor.extract_keywords('I live in Jeju')
[]

A Raymond Hettinger fan from afar: "There must be a better way!"

Yes, there is! What if we try something silly, like adding the stripped forms of city keywords that end with "City" to our keyword_processor?

for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])
            print(c[:-5])

It works!

Now let's rerun the regression-test examples:

from itertools import zip_longest
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])

texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris')),
('I live in Florida', ('Florida',)), 
('I live in Marawi', ('Marawi',)), 
('I live in jeju', ('Jeju',))]

# No. of correctly extracted terms.
true_positives = 0
false_positives = 0
total_truth = 0

for text, label in texts_labels:
    extracted = keyword_processor.extract_keywords(text)

    # We're making some assumptions here that the order of 
    # extracted and the truth must be the same.
    true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
    false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
    total_truth += len(label)

    # Just visualization candies.
    print(text)
    print(extracted)
    print(label)
    print()

[out]:

new york to venice, italy for usd271
['New York', 'Venice', 'Italy']
('New York', 'Venice', 'Italy')

return flights from brussels to bangkok with etihad from €407
['Brussels', 'Bangkok']
('Brussels', 'Bangkok')

from los angeles to guadalajara, mexico for usd191
['Los Angeles', 'Guadalajara', 'Mexico']
('Los Angeles', 'Guadalajara')

fly to australia new zealand from paris from €422 return including 2 checked bags
['Australia', 'New Zealand', 'Paris']
('Australia', 'New Zealand', 'Paris')

I live in Florida
['Florida']
('Florida',)

I live in Marawi
['Marawi']
('Marawi',)

I live in jeju
['Jeju']
('Jeju',)

100%! No kidding, NLP-bunga!!!

But seriously, this is only the tip of the problem iceberg. What happens if you have a sentence like this:

>>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')
['Adam', 'Bangkok', 'Singapore', 'China']

Why is Adam extracted as a city?!

Then you do some more neurotic checks:

>>> 'Adam' in cities
True

Congratulations, you've jumped into another NLP rabbit hole: polysemy. Even setting the polysemy issue aside, Adam in this sentence most probably refers to a person, but it also happens to be the name of a city (according to the data you pulled).

I see what you're doing... Even if we ignore this polysemy issue, you're still not giving me the desired output:

[in]:

['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags'
]

[out]:

Origin: New York, USA; Destination: Venice, Italy
Origin: Brussels, BEL; Destination: Bangkok, Thailand
Origin: Los Angeles, USA; Destination: Guadalajara, Mexico
Origin: Paris, France; Destination: Australia / New Zealand (this is a complicated case given two countries)
Linguist: Even assuming that the preposition preceding a city (e.g. from or to) gives it an "origin"/"destination" tag, how are you going to handle the case of "multi-leg" flights, e.g.:
>>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')

What's the desired output for this sentence:

> Adam flew to Bangkok from Singapore and then to China

Perhaps something like this? What's the spec? How (un)structured is your input text?

> Origin: Singapore
> Departure: Bangkok
> Departure: China

Let's try to build component two to detect prepositions.

We can exploit the same flashtext approach and try some hackery to fulfill your assumption.

What if we add to and from to the list?

from itertools import zip_longest
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])

keyword_processor.add_keyword('to')
keyword_processor.add_keyword('from')

texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']


for text in texts:
    extracted = keyword_processor.extract_keywords(text)
    print(text)
    print(extracted)
    print()

[out]:

new york to venice, italy for usd271
['New York', 'to', 'Venice', 'Italy']

return flights from brussels to bangkok with etihad from €407
['from', 'Brussels', 'to', 'Bangkok', 'from']

from los angeles to guadalajara, mexico for usd191
['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']

fly to australia new zealand from paris from €422 return including 2 checked bags
['to', 'Australia', 'New Zealand', 'from', 'Paris', 'from']

Urgh, these to/from rules are so brittle!

  1. What if the "from" refers to the ticket price?
  2. What if the country/city isn't preceded by "to"/"from"?

Okay, let's work with the output above and tackle problem 1. Maybe check whether the term after "from" is a city; if not, drop the to/from?

from itertools import zip_longest
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])

keyword_processor.add_keyword('to')
keyword_processor.add_keyword('from')

texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']


for text in texts:
    extracted = keyword_processor.extract_keywords(text)
    print(text)

    new_extracted = []
    extracted_next = extracted[1:]
    for e_i, e_iplus1 in zip_longest(extracted, extracted_next):
        if e_i == 'from' and e_iplus1 not in cities and e_iplus1 not in countries:
            print(e_i, e_iplus1)
            continue
        elif e_i == 'from' and e_iplus1 is None: # last word in the list.
            continue
        else:
            new_extracted.append(e_i)

    print(new_extracted)
    print()

Looks like that does the trick, removing the from that doesn't precede a city/country.

[out]:

new york to venice, italy for usd271
['New York', 'to', 'Venice', 'Italy']

return flights from brussels to bangkok with etihad from €407
from None
['from', 'Brussels', 'to', 'Bangkok']

from los angeles to guadalajara, mexico for usd191
['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']

fly to australia new zealand from paris from €422 return including 2 checked bags
from None
['to', 'Australia', 'New Zealand', 'from', 'Paris']

But the "from New York" problem still isn't solved!
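One heuristic patch (my own sketch, not part of the recipe above): treat a leading location that has no preceding preposition as the origin. Given an extracted list that keeps the to/from markers, a naive pairing pass could look like this:

```python
def assign_roles(extracted):
    """Pair 'from'/'to' markers with the locations that follow them.

    Assumption (mine): a location seen before any marker is the origin.
    """
    roles = {'origin': [], 'destination': []}
    current = None
    for token in extracted:
        if token == 'from':
            current = 'origin'
        elif token == 'to':
            current = 'destination'
        elif current is None:
            # e.g. 'new york to venice...': no 'from', so assume origin.
            roles['origin'].append(token)
        else:
            roles[current].append(token)
    return roles

print(assign_roles(['New York', 'to', 'Venice', 'Italy']))
# {'origin': ['New York'], 'destination': ['Venice', 'Italy']}
```

Of course this inherits every brittleness discussed above (multi-leg trips, "from €422", countries tacked onto city names), so it's a sketch of the pairing idea, not a robust parser.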

Linguist: Think about it carefully: should ambiguity be resolved by making informed decisions? If so, what is the "information" in the informed decision? Should a certain template be followed first to detect the information, before filling in the ambiguity?

You: I'm losing my patience with you... You keep leading me around in circles. Where's that AI that can understand human language that I keep hearing about in the news, from Google, Facebook, and all of those places?

You: Everything you've given me is rule-based. Where's the AI?

NLP practitioner: Didn't you want 100%? Writing "business logic" or rule-based systems is the only way to really achieve "100%" on a specific dataset when there's no preset dataset you can use to "train an AI".

You: What do you mean by training an AI? Why can't I just use AI from Google, Facebook, Amazon, Microsoft, or even IBM?

NLP practitioner: Let me introduce you to...

Welcome to the world of computational linguistics and natural language processing!

In short

Yes, there's no real ready-made magical solution. If you want to use "AI" or a machine-learning algorithm, you'll most likely need a lot more training data, like the texts_labels pairs shown in the example above.
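To tie this back to the original pandas goal, the extraction-plus-pairing idea can be wrapped in a function and applied to the DataFrame column. A minimal, self-contained sketch (the tiny hardcoded gazetteer, the greedy matcher standing in for flashtext, and the ' / ' joining are all illustrative assumptions of mine):

```python
import pandas as pd

# Tiny hardcoded gazetteer so the sketch runs without flashtext or a download.
KNOWN = {'new york', 'venice', 'italy', 'los angeles', 'guadalajara',
         'mexico', 'brussels', 'bangkok', 'paris', 'australia',
         'new zealand', 'to', 'from'}

def extract(text):
    """Greedy longest-match keyword extraction (a stand-in for flashtext)."""
    tokens = text.lower().replace(',', ' ').split()
    found, i = [], 0
    while i < len(tokens):
        for n in (2, 1):  # try bigrams first, then unigrams
            cand = ' '.join(tokens[i:i + n])
            if cand in KNOWN:
                found.append(cand if cand in ('to', 'from') else cand.title())
                i += n
                break
        else:
            i += 1  # no keyword starts at this position
    return found

def parse_trip(text):
    """Split the extracted keywords into origin/destination strings."""
    origin, destination, current = [], [], None
    for tok in extract(text):
        if tok == 'from':
            current = origin
        elif tok == 'to':
            current = destination
        elif current is None:
            origin.append(tok)  # assume a leading bare location is the origin
        else:
            current.append(tok)
    return ' / '.join(origin), ' / '.join(destination)

df = pd.DataFrame({'raw': [
    'new york to venice, italy for usd271',
    'from los angeles to guadalajara, mexico for usd191',
]})
df[['origin', 'destination']] = df['raw'].apply(
    lambda t: pd.Series(parse_trip(t)))
print(df[['origin', 'destination']])
```

It only finds what's in the gazetteer, and the from/to pairing is as brittle as discussed above; for anything real, swap in the flashtext keyword_processor and the full city/country sets.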


An excellent answer to a question that, in hindsight, may not have been a great one. Bravo @alvas. - Merv Merzoug
Came here to scrape data, stayed for the info and the laughs! - Oussama Essamadi
Great answer, Alvas, thanks for the tutorial, you should blog this somewhere. - Umar.H
Best answer. Wow, Alvas, you hit the nail on the head. Loved reading your answer. - Joish
For all its flaws, errors, and questionable directions, this is where StackOverflow still shines: watching the magicians at work. ++ - Jan