如何从文本中解析自由格式的街道/邮政地址,并将其分解成组件

172
我们主要在美国经营业务,并试图通过将所有地址字段合并到单个文本区域中来改善用户体验。但是有几个问题:
  • 用户输入的地址可能不正确或不符合标准格式
  • 必须将地址分成部分(街道,城市,州等)以处理信用卡付款
  • 用户可能会输入更多内容(例如他们的姓名或公司名称)
  • Google可以做到这一点,但是服务条款和查询限制是禁止性的,尤其是在紧张的预算情况下
显然,这是一个常见的问题: 有没有一种方法可以从周围文本中隔离出地址并将其拆分成部分?是否有用于解析地址的正则表达式?
6个回答

335
我曾在一家地址验证公司工作时经常看到这个问题。我在这里发布答案,以便程序员更容易地找到相同的问题。我所在的公司处理了数十亿个地址,我们在这个过程中学到了很多东西。
首先,我们需要了解一些关于地址的事情。

地址不是常规

这意味着正则表达式不适用。我见过各种正则表达式,从简单的匹配特定格式地址的表达式,到这个:

/\s+(\d{2,5}\s+)(?![a|p]m\b)(([a-zA-Z|\s+]{1,5}){1,2})?([\s|,|.]+)?(([a-zA-Z|\s+]{1,30}){1,4})(court|ct|street|st|drive|dr|lane|ln|road|rd|blvd)([\s|,|.|;]+)?(([a-zA-Z|\s+]{1,30}){1,2})([\s|,|.]+)?\b(AK|AL|AR|AZ|CA|CO|CT|DC|DE|FL|GA|GU|HI|IA|ID|IL|IN|KS|KY|LA|MA|MD|ME|MI|MN|MO|MS|MT|NC|ND|NE|NH|NJ|NM|NV|NY|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VA|VI|VT|WA|WI|WV|WY)([\s|,|.]+)?(\s+\d{5})?([\s|,|.]+)/i

...到this,其中一个900多行的类文件即时生成超级大的正则表达式以进行更多匹配。我不建议使用这些(例如,这里有上面正则表达式的fiddle,它会犯很多错误)。没有简单的魔法公式可以让这个工作。理论上来说,使用正则表达式无法匹配地址。

USPS出版物28记录了可能出现的所有地址格式及其关键字和变化。最糟糕的是,地址经常是模糊的。单词可能有多种含义(“St”既可以是“Saint”也可以是“Street”),还有一些我很确定他们是发明出来的词语。(谁知道“Stravenue”是街道后缀?)

你需要一些真正理解地址的代码,如果这样的代码存在,那就是商业机密。但如果你真的喜欢这个,你可能可以自己编写。

地址形态千奇百怪

以下是一些虚构但完整的地址:

1)  102 main street
    Anytown, state

2)  400n 600e #2, 52173

3)  p.o. #104 60203

甚至这些也可能是有效的:

4)  829 LKSDFJlkjsdflkjsdljf Bkpw 12345

5)  205 1105 14 90210

显然,这些并不是标准化的。标点和换行不能保证。以下是发生的情况:

  1. 第一条已经完成,因为它包含了街道地址和城市州名字。有了这些信息,就足以识别地址,并且可以被视为“可投递的”(需要一些标准化)。

  2. 第二条也已经完成,因为它包含了街道地址(带有次要/单元号码)和5位数字的邮政编码,足以识别地址。

  3. 第三条是一个完整的邮政信箱格式,因为它包含了邮政编码。

  4. 第四条同样也已经完成,因为该邮政编码是唯一的,意味着私人实体或公司已经购买了该地址空间。唯一的邮政编码用于高容量或集中交付空间。任何寄到邮政编码12345的信件都会寄到纽约州Schenectady的通用电气公司。这个例子不会到达特定的人,但是美国邮政服务仍然能够投递它。

  5. 第五条也是完整的,信不信由你。只需使用这些数字,就可以在与所有可能的地址数据库进行解析时发现完整的地址。当您将每个数字视为一个组件时,填写缺失的方向指示器、次要设计器和ZIP+4代码是微不足道的。这就是它完全展开和标准化的样子:

205 N 1105 W Apt 14 Beverly Hills CA 90210-5221

地址数据不属于您自己

在大多数向持牌供应商提供正式地址数据的国家中,地址数据本身属于管理机构。在美国,USPS拥有地址。对于加拿大邮政、皇家邮政和其他公司也是如此,尽管每个国家对所有权的执行或定义略有不同。了解这一点非常重要,因为通常禁止反向工程地址数据库。您必须小心地获取、存储和使用数据。

Google Maps是快速解决地址问题的常用工具,但TOS相当严格;例如,您不能在不显示Google地图的情况下使用它们的数据或API,仅限于非商业目的(除非您付费),并且您不能存储数据(除了临时缓存)。这很有道理。Google的数据是全球最好的之一。然而,Google Maps并没有验证地址。如果地址不存在,它仍会向您显示该地址的位置,如果存在的话(请在自己的街道上尝试一下;使用您知道不存在的房屋号码)。有时这很有用,但请注意这一点。
Nominatim的usage policy同样受到限制,特别是对于高频和商业用途,数据主要来自免费来源,因此维护得不太好(这是开放项目的本质)--但这可能仍能满足您的需求。它得到了伟大社区的支持。

美国邮政服务局本身有一个API,但它经常崩溃且没有任何保证和支持。它可能也很难使用。一些人会适度使用而没有任何问题。但很容易忽略的是,美国邮政要求您仅使用他们的API来确认通过其进行运输的地址。

人们认为地址应该很难

不幸的是,我们的社会已经习惯了认为地址很复杂。互联网上有许多关于良好用户体验的文章,但事实是,如果您有一个带有单独字段的地址表单,那就是用户所期望的,即使这使得对于不符合表单预期格式的边缘案例地址更加困难,或者表单需要某个不应该存在的字段。或者用户不知道在哪里放置他们地址的某个部分。

我可以继续谈论现今结账表单的糟糕用户体验,但我想说的是将地址合并到单个字段中将是一个受欢迎的变化--人们将能够按照自己的方式输入地址,而不是试图弄清楚您冗长的表单。然而,这种变化将是意外的,用户可能会在一开始感到有点不适。只要注意这一点即可。
通过将国家/地区字段放在地址之前,可以缓解部分痛苦。当他们首先填写国家/地区字段时,您就知道如何呈现表单。也许您有一种处理美国单一字段地址的好方法,因此如果他们选择美国,您可以将表单缩减为单个字段,否则显示组件字段。这些都是需要考虑的事情!
现在我们知道为什么很难了,你可以做些什么呢?
USPS通过称为CASS™认证的过程向供应商授权,以向客户提供经过验证的地址。这些供应商可以访问USPS数据库,每月更新。他们的软件必须符合严格的标准才能获得认证,并且通常不需要同意上述限制性条款。
有许多CASS认证的公司可以处理列表或具有API:Melissa Data、Experian QAS和SmartyStreets等等。
(由于因“广告”受到指责,我在此截断了我的回答。你需要自行找到适合你的解决方案。)
事实是:真的,朋友们,我并不在这些公司工作。这不是广告。

问:当您在处理CASS合规性时......仅凭记忆估计,代码中有多少是基于算法的(从正则表达式到机器学习),而不是数据驱动的(州/邮政编码/道路查找)? - Scott Brickey
1
很难说。这里是实施CASS认证软件的技术手册。最终,只要通过测试,他们并不关心你如何做到这一点。这需要大量算法,但也有许多数据查询。 - Matt
@Matt 谢谢...我已经多次查阅了他们的文档...其中很多看起来都是查询顺序(验证邮政编码与城市/州)...夹杂着一些模糊文本匹配(Levenshtein 似乎是常见的方法)...但我一直认为它不能那么直接,肯定有我遗漏的方面。 - Scott Brickey
你的正则表达式与205 N 1105 W Apt 14 Beverly Hills CA 90210-5221不匹配。 - Kamal Hussain
2
@KamalHussain 他并没有暗示正则表达式适用于所有情况。两句话之后,他说:“我不推荐使用这些……” - Michael
1
我在90年代初为一家大型人口统计营销/直邮公司编写汇编语言和C语言程序。我记得当时标准化/完善地址以及通过精确编码达到优惠资格的能力是多么宝贵。 - grantwparks

15

有许多街道地址解析器,它们分为两种基本类型 - 一种具有地名和街道名称的数据库,另一种没有。

使用正则表达式的街道地址解析器可以轻松达到约95%的成功率。然后你就会遇到不寻常的情况。CPAN中的Perl版本“Geo::StreetAddress::US”大约就能做到这么好。还有Python和Javascript版本,全部都是开源的。我在Python中有一个改进版,通过处理更多的案例来略微提高成功率。但是,要最后3%正确,您需要数据库来帮助消除歧义。

拥有3位数字邮政编码和美国州名缩写的数据库非常有帮助。当解析器看到一致的邮政编码和州名时,它可以开始锁定格式。这对美国和英国非常有效。

正确的街道地址解析从结尾开始并向后工作。这就是美国邮政服务系统的工作方式。地址在结尾处最不会产生歧义,因为国家名称、城市名称和邮政编码相对容易识别。街道名称通常可以被隔离出来。在街道上的位置是最复杂的解析部分; 这里会遇到诸如“第五层”和“斯台普斯展馆”之类的事情。这就是数据库帮助的大好时机。


还有一个名为Lingua:EN::AddressParse的CPAN模块。虽然比“Geo::StreetAddress::US”慢,但成功率更高。 - Kim Ryan
您能推荐一个不需要数据库的英国网站吗?谢谢! - rex

10
更新:Geocode.xyz现在支持全球使用。有关示例,请参见https://geocode.xyz

对于美国、墨西哥和加拿大,请查看geocoder.ca

例如:

输入:something going on near the intersection of main and arthur kill rd new york

输出:

<geodata>
  <latt>40.5123510000</latt>
  <longt>-74.2500500000</longt>
  <AreaCode>347,718</AreaCode>
  <TimeZone>America/New_York</TimeZone>
  <standard>
    <street1>main</street1>
    <street2>arthur kill</street2>
    <stnumber/>
    <staddress/>
    <city>STATEN ISLAND</city>
    <prov>NY</prov>
    <postal>11385</postal>
    <confidence>0.9</confidence>
  </standard>
</geodata>
您也可以在Web界面中检查结果,或者获取Json或Jsonp输出。例如:我正在寻找纽约123号主街附近的餐厅

你是如何使用openaddress实现地址解析系统的?你是否采用了暴力策略? - Nithin
1
“brute force”是什么意思?将文本分解为可能的地址字符串的所有可能组合,并将每个组合与地址数据库进行比较,这不切实际,需要比此系统提供答案的时间更长。 Openaddresses是构建算法地址格式“训练集”的数据源之一。 它使用此信息从非结构化文本中解析地址。 - Ervin Ruci
3
另一个类似的系统是 Geo::libpostal(http://perltricks.com/article/announcing-geo--libpostal/)。它们似乎也使用 openstreetmap 和 openaddresses,动态构建地址模板。 - Ervin Ruci
1
我刚刚在数百个实际地址上测试过geocode.xyz的地理解析器(输入文本,返回位置)。与谷歌地图API并排比较,对于全球地址集,geocode.xyz的“scantext”方法大部分时间都失败了。它总是选择“Geneva, US”而不是“Geneva, Switzerland”,并且通常存在美国偏见。 - Marc Maxmeister
这取决于上下文。https://geocode.xyz/?scantext=Geneva,%20Switzerland 将产生:匹配位置 Geneva, Switzerland, CH 置信度得分:0.8,而 https://geocode.xyz/?scantext=Geneva,%20USA 将产生匹配位置 Geneva,US 置信度得分:1.0 此外,您可以按以下方式进行区域偏差:https://geocode.xyz/?scantext=Geneva,%20USA&region=CH - Ervin Ruci

5

没有代码?真可惜!

这里有一个简单的JavaScript地址解析器。正如Matt在他的论文中所说的那样(我几乎完全同意:地址是复杂的类型,人们会犯错;最好外包和自动化处理 - 当你负担得起时),它因为所有原因都很糟糕。

但我不想哭泣,我决定尝试:

这段代码对于解析大多数Esri结果的findAddressCandidate以及其他一些返回用逗号分隔街道/城市/州的单行地址的(反)地理编码器来说,工作得还不错。您可以扩展它或编写特定于国家的解析器。或者只是将其用作这个练习有多具有挑战性或我对JavaScript有多糟糕的案例研究。我承认,我只花了大约30分钟来完成这个(未来的迭代版可以添加缓存、邮政编码验证和状态查找以及用户位置上下文),但它对我的用例有效:最终用户看到将地理编码搜索响应解析为4个文本框的表单。如果地址解析出现错误(除非源数据很差,否则很少见),那么问题不大 - 用户可以验证和修复它!(但对于自动化解决方案,可以将其丢弃/忽略或标记为错误,以便开发人员可以支持新格式或修复源数据。)

/* 
address assumptions:
- US addresses only (probably want separate parser for different countries)
- No country code expected.
- if last token is a number it is probably a postal code
-- 5 digit number means more likely
- if last token is a hyphenated string it might be a postal code
-- if both sides are numeric, and in form #####-#### it is more likely
- if city is supplied, state will also be supplied (city names not unique)
- zip/postal code may be omitted even if has city & state
- state may be two-char code or may be full state name.
- commas: 
-- last comma is usually city/state separator
-- second-to-last comma is possibly street/city separator
-- other commas are building-specific stuff that I don't care about right now.
- token count:
-- because units, street names, and city names may contain spaces token count highly variable.
-- simplest address has at least two tokens: 714 OAK
-- common simple address has at least four tokens: 714 S OAK ST
-- common full (mailing) address has at least 5-7:
--- 714 OAK, RUMTOWN, VA 59201
--- 714 S OAK ST, RUMTOWN, VA 59201
-- complex address may have a dozen or more:
--- MAGICICIAN SUPPLY, LLC, UNIT 213A, MAGIC TOWN MALL, 13 MAGIC CIRCLE DRIVE, LAND OF MAGIC, MA 73122-3412
*/

var rawtext = $("textarea").val();
var rawlist = rawtext.split("\n");

function ParseAddressEsri(singleLineaddressString) {
  var address = {
    street: "",
    city: "",
    state: "",
    postalCode: ""
  };

  // tokenize by space (retain commas in tokens)
  var tokens = singleLineaddressString.split(/[\s]+/);
  var tokenCount = tokens.length;
  var lastToken = tokens.pop();
  if (
    // if numeric assume postal code (ignore length, for now)
    !isNaN(lastToken) ||
    // if hyphenated assume long zip code, ignore whether numeric, for now
    lastToken.split("-").length - 1 === 1) {
    address.postalCode = lastToken;
    lastToken = tokens.pop();
  }

  if (lastToken && isNaN(lastToken)) {
    if (address.postalCode.length && lastToken.length === 2) {
      // assume state/province code ONLY if had postal code
      // otherwise it could be a simple address like "714 S OAK ST"
      // where "ST" for "street" looks like two-letter state code
      // possibly this could be resolved with registry of known state codes, but meh. (and may collide anyway)
      address.state = lastToken;
      lastToken = tokens.pop();
    }
    if (address.state.length === 0) {
      // check for special case: might have State name instead of State Code.
      var stateNameParts = [lastToken.endsWith(",") ? lastToken.substring(0, lastToken.length - 1) : lastToken];

      // check remaining tokens from right-to-left for the first comma
      while (2 + 2 != 5) {
        lastToken = tokens.pop();
        if (!lastToken) break;
        else if (lastToken.endsWith(",")) {
          // found separator, ignore stuff on left side
          tokens.push(lastToken); // put it back
          break;
        } else {
          stateNameParts.unshift(lastToken);
        }
      }
      address.state = stateNameParts.join(' ');
      lastToken = tokens.pop();
    }
  }

  if (lastToken) {
    // here is where it gets trickier:
    if (address.state.length) {
      // if there is a state, then assume there is also a city and street.
      // PROBLEM: city may be multiple words (spaces)
      // but we can pretty safely assume next-from-last token is at least PART of the city name
      // most cities are single-name. It would be very helpful if we knew more context, like
      // the name of the city user is in. But ignore that for now.
      // ideally would have zip code service or lookup to give city name for the zip code.
      var cityNameParts = [lastToken.endsWith(",") ? lastToken.substring(0, lastToken.length - 1) : lastToken];

      // assumption / RULE: street and city must have comma delimiter
      // addresses that do not follow this rule will be wrong only if city has space
      // but don't care because Esri formats put comma before City
      var streetNameParts = [];

      // check remaining tokens from right-to-left for the first comma
      while (2 + 2 != 5) {
        lastToken = tokens.pop();
        if (!lastToken) break;
        else if (lastToken.endsWith(",")) {
          // found end of street address (may include building, etc. - don't care right now)
          // add token back to end, but remove trailing comma (it did its job)
          tokens.push(lastToken.endsWith(",") ? lastToken.substring(0, lastToken.length - 1) : lastToken);
          streetNameParts = tokens;
          break;
        } else {
          cityNameParts.unshift(lastToken);
        }
      }
      address.city = cityNameParts.join(' ');
      address.street = streetNameParts.join(' ');
    } else {
      // if there is NO state, then assume there is NO city also, just street! (easy)
      // reasoning: city names are not very original (Portland, OR and Portland, ME) so if user wants city they need to store state also (but if you are only ever in Portlan, OR, you don't care about city/state)
      // put last token back in list, then rejoin on space
      tokens.push(lastToken);
      address.street = tokens.join(' ');
    }
  }
  // when parsing right-to-left hard to know if street only vs street + city/state
  // hack fix for now is to shift stuff around.
  // assumption/requirement: will always have at least street part; you will never just get "city, state"  
  // could possibly tweak this with options or more intelligent parsing&sniffing
  if (!address.city && address.state) {
    address.city = address.state;
    address.state = '';
  }
  if (!address.street) {
    address.street = address.city;
    address.city = '';
  }

  return address;
}

// get list of objects with discrete address properties
var addresses = rawlist
  .filter(function(o) {
    return o.length > 0
  })
  .map(ParseAddressEsri);
$("#output").text(JSON.stringify(addresses));
console.log(addresses);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<textarea>
27488 Stanford Ave, Bowden, North Dakota
380 New York St, Redlands, CA 92373
13212 E SPRAGUE AVE, FAIR VALLEY, MD 99201
1005 N Gravenstein Highway, Sebastopol CA 95472
A. P. Croll &amp; Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947
11522 Shawnee Road, Greenwood, DE 19950
144 Kings Highway, S.W. Dover, DE 19901
Intergrated Const. Services 2 Penns Way Suite 405, New Castle, DE 19720
Humes Realty 33 Bridle Ridge Court, Lewes, DE 19958
Nichols Excavation 2742 Pulaski Hwy, Newark, DE 19711
2284 Bryn Zion Road, Smyrna, DE 19904
VEI Dover Crossroads, LLC 1500 Serpentine Road, Suite 100 Baltimore MD 21
580 North Dupont Highway, Dover, DE 19901
P.O. Box 778, Dover, DE 19903
714 S OAK ST
714 S OAK ST, RUM TOWN, VA, 99201
3142 E SPRAGUE AVE, WHISKEY VALLEY, WA 99281
27488 Stanford Ave, Bowden, North Dakota
380 New York St, Redlands, CA 92373
</textarea>
<div id="output">
</div>


免责声明:我的客户拥有他们的地址数据并运行自己的Esri服务器。如果您从Google、OSM、ArcGisOnline或其他地方获取数据,请确保可以存储和使用它(许多服务对存储方式和时间有限制)。 - nothingisnecessary
以上第一个答案有力地证明了,如果你正在处理全球地址列表,那么这个问题是无法用正则表达式解决的。200个国家有太多的例外情况。在我的测试中,你可以相当可靠地从字符串中确定国家,然后查找每个国家的特定正则表达式——这可能就是更好的API如何工作的方式。 - Marc Maxmeister

2
对于美国地址解析,我更喜欢使用 usaddress,该包可以在 pip 中获得。
python3 -m pip install usaddress

使用示例:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# address_parser.py
import sys
from usaddress import tag
from json import dumps, loads

if __name__ == '__main__':
    tag_mapping = {
        'Recipient': 'recipient',
        'AddressNumber': 'addressStreet',
        'AddressNumberPrefix': 'addressStreet',
        'AddressNumberSuffix': 'addressStreet',
        'StreetName': 'addressStreet',
        'StreetNamePreDirectional': 'addressStreet',
        'StreetNamePreModifier': 'addressStreet',
        'StreetNamePreType': 'addressStreet',
        'StreetNamePostDirectional': 'addressStreet',
        'StreetNamePostModifier': 'addressStreet',
        'StreetNamePostType': 'addressStreet',
        'CornerOf': 'addressStreet',
        'IntersectionSeparator': 'addressStreet',
        'LandmarkName': 'addressStreet',
        'USPSBoxGroupID': 'addressStreet',
        'USPSBoxGroupType': 'addressStreet',
        'USPSBoxID': 'addressStreet',
        'USPSBoxType': 'addressStreet',
        'BuildingName': 'addressStreet',
        'OccupancyType': 'addressStreet',
        'OccupancyIdentifier': 'addressStreet',
        'SubaddressIdentifier': 'addressStreet',
        'SubaddressType': 'addressStreet',
        'PlaceName': 'addressCity',
        'StateName': 'addressState',
        'ZipCode': 'addressPostalCode',
    }
    try:
        address, _ = tag(' '.join(sys.argv[1:]), tag_mapping=tag_mapping)
    except:
        with open('failed_address.txt', 'a') as fp:
            fp.write(sys.argv[1] + '\n')
        print(dumps({}))
    else:
        print(dumps(dict(address)))

运行 address_parser.py:

python3 address_parser.py 9757 East Arcadia Ave. Saugus MA 01906
{"addressStreet": "9757 East Arcadia Ave.", "addressCity": "Saugus", "addressState": "MA", "addressPostalCode": "01906"}

1

我来晚了,但这是我多年前为澳大利亚编写的Excel VBA脚本。它可以轻松修改以支持其他国家。我在GitHub上创建了C#代码的存储库。我将其托管在我的网站上,您可以在此处下载:http://jeremythompson.net/Rocks/ParseAddress.xlsm

策略

对于任何具有数字邮政编码或可以与正则表达式匹配的国家,我的策略都非常有效:

  1. 首先,我们检测名字和姓氏,它们被假定为在顶部一行。通过取消复选框(称为“名称是顶行”,如下所示),可以轻松跳过名称并从地址开始。

  2. 接下来,可以安全地期望包含街道和门牌号的地址出现在区和St、Pde、Ave、Av、Rd、Cres、loop等分隔符之前。

  3. 检测区与州甚至国家可能会使最复杂的解析器出现问题,因为可能存在冲突。为了克服这个问题,我使用邮政编码查找,基于以下事实:在剥离街道和公寓/单元号码以及PoBox、Ph、Fax、Mobile等后,只有邮政编码数字会保留。这很容易与正则表达式匹配,然后查找区和国家。

    您的国家邮政服务将提供一个包含免费的区域和州的邮政编码列表,您可以将其存储在Excel表格、数据库表、文本/JSON/XML文件等中。

  4. 最后,由于某些邮政编码有多个区,因此我们检查地址中出现的哪个区。


例子

Screenshot of Excel cells

VBA 代码

免责声明,我知道这段代码并不完美,甚至写得不好,但它非常容易转换为任何编程语言,并在任何类型的应用程序中运行。策略是根据您所在的国家和规则而定,将此代码作为示例:

Option Explicit

Private Const TopRow As Integer = 0

Public Sub ParseAddress()
Dim strArr() As String
Dim sigRow() As String
Dim i As Integer
Dim j As Integer
Dim k As Integer
Dim Stat As String
Dim SpaceInName As Integer
Dim Temp As String
Dim PhExt As String

On Error Resume Next

Temp = ActiveSheet.Range("Address")

'Split info into array
strArr = Split(Temp, vbLf)

'Trim the array
For i = 0 To UBound(strArr)
strArr(i) = VBA.Trim(strArr(i))
Next i

'Remove empty items/rows    
ReDim sigRow(LBound(strArr) To UBound(strArr))
For i = LBound(strArr) To UBound(strArr)
    If Trim(strArr(i)) <> "" Then
        sigRow(j) = strArr(i)
        j = j + 1
    End If
Next i
ReDim Preserve sigRow(LBound(strArr) To j)

'Find the name (MUST BE ON THE FIRST ROW UNLESS CHECKBOX UNTICKED)
i = TopRow
If ActiveSheet.Shapes("chkFirst").ControlFormat.Value = 1 Then

SpaceInName = InStr(1, sigRow(i), " ", vbTextCompare) - 1

If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("FirstName") = VBA.Left(sigRow(i), SpaceInName)
Else
 If MsgBox("First Name: " & VBA.Mid$(sigRow(i), 1, SpaceInName), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("FirstName") = VBA.Left(sigRow(i), SpaceInName)
End If

If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("Surname") = VBA.Mid(sigRow(i), SpaceInName + 2)
Else
  If MsgBox("Surame: " & VBA.Mid(sigRow(i), SpaceInName + 2), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Surname") = VBA.Mid(sigRow(i), SpaceInName + 2)
End If
sigRow(i) = ""
End If

'Find the Street by looking for a "St, Pde, Ave, Av, Rd, Cres, loop, etc"
For i = 1 To UBound(sigRow)
If Len(sigRow(i)) > 0 Then
    For j = 0 To 8
    If InStr(1, VBA.UCase(sigRow(i)), Street(j), vbTextCompare) > 0 Then
    
    'Find the position of the street in order to get the suburb
    SpaceInName = InStr(1, VBA.UCase(sigRow(i)), Street(j), vbTextCompare) + Len(Street(j)) - 1
    
    'If its a po box then add 5 chars
    If VBA.Right(Street(j), 3) = "BOX" Then SpaceInName = SpaceInName + 5
    
    If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
    ActiveSheet.Range("Street") = VBA.Mid(sigRow(i), 1, SpaceInName)
    Else
      If MsgBox("Street Address: " & VBA.Mid(sigRow(i), 1, SpaceInName), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Street") = VBA.Mid(sigRow(i), 1, SpaceInName)
    End If
    'Trim the Street, Number leaving the Suburb if its exists on the same line
    sigRow(i) = VBA.Mid(sigRow(i), SpaceInName) + 2
    sigRow(i) = Replace(sigRow(i), VBA.Mid(sigRow(i), 1, SpaceInName), "")
    
    GoTo PastAddress:
    End If
    Next j
End If
Next i
PastAddress:

'Mobile
For i = 1 To UBound(sigRow)
If Len(sigRow(i)) > 0 Then
    For j = 0 To 3
    Temp = Mb(j)
        If VBA.Left(VBA.UCase(sigRow(i)), Len(Temp)) = Temp Then
        If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
        ActiveSheet.Range("Mobile") = VBA.Mid(sigRow(i), Len(Temp) + 2)
        Else
          If MsgBox("Mobile: " & VBA.Mid(sigRow(i), Len(Temp) + 2), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Mobile") = VBA.Mid(sigRow(i), Len(Temp) + 2)
        End If
    sigRow(i) = ""
    GoTo PastMobile:
    End If
    Next j
End If
Next i
PastMobile:

'Phone
For i = 1 To UBound(sigRow)
If Len(sigRow(i)) > 0 Then
    For j = 0 To 1
    Temp = Ph(j)
        If VBA.Left(VBA.UCase(sigRow(i)), Len(Temp)) = Temp Then
            
            'TODO: Detect the intl or national extension here.. or if we can from the postcode.
            If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
            ActiveSheet.Range("Phone") = VBA.Mid(sigRow(i), Len(Temp) + 3)
            Else
              If MsgBox("Phone: " & VBA.Mid(sigRow(i), Len(Temp) + 3), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Phone") = VBA.Mid(sigRow(i), Len(Temp) + 3)
            End If
        
        sigRow(i) = ""
        GoTo PastPhone:
        End If
    Next j
End If
Next i
PastPhone:


'Email
For i = 1 To UBound(sigRow)
    If Len(sigRow(i)) > 0 Then
        'replace with regEx search
        If InStr(1, sigRow(i), "@", vbTextCompare) And InStr(1, VBA.UCase(sigRow(i)), ".CO", vbTextCompare) Then
        Dim email As String
        email = sigRow(i)
        email = Replace(VBA.UCase(email), "EMAIL:", "")
        email = Replace(VBA.UCase(email), "E-MAIL:", "")
        email = Replace(VBA.UCase(email), "E:", "")
        email = Replace(VBA.UCase(Trim(email)), "E ", "")
        email = VBA.LCase(email)
        
            If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
            ActiveSheet.Range("Email") = email
            Else
              If MsgBox("Email: " & email, vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Email") = email
            End If
        sigRow(i) = ""
        Exit For
        End If
    End If
Next i

'Now the only remaining items will be the postcode, suburb, country
'there shouldn't be any numbers (eg. from PoBox,Ph,Fax,Mobile) except for the Post Code

'Join the string and filter out the Post Code
Temp = Join(sigRow, vbCrLf)
Temp = Trim(Temp)

For i = 1 To Len(Temp)

Dim postCode As String
postCode = VBA.Mid(Temp, i, 4)
    
'In Australia PostCodes are 4 digits
If VBA.Mid(Temp, i, 1) <> " " And IsNumeric(postCode) Then

    If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
    ActiveSheet.Range("PostCode") = postCode
    Else
      If MsgBox("Post Code: " & postCode, vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("PostCode") = postCode
    End If

    'Lookup the Suburb and State based on the PostCode, the PostCode sheet has the lookup
    Dim mySuburbArray As Range
    Set mySuburbArray = Sheets("PostCodes").Range("A2:B16670")
    
    Dim suburbs As String
    For j = 1 To mySuburbArray.Columns(1).Cells.Count
    If mySuburbArray.Cells(j, 1) = postCode Then
        'Check if the suburb is listed in the address
        If InStr(1, UCase(Temp), mySuburbArray.Cells(j, 2), vbTextCompare) > 0 Then

        'Set the Suburb and State
        ActiveSheet.Range("Suburb") = mySuburbArray.Cells(j, 2)
        Stat = mySuburbArray.Cells(j, 3)
        ActiveSheet.Range("State") = Stat
                
        'Knowing the State - for Australia we can get the telephone Ext
        PhExt = PhExtension(VBA.UCase(Stat))
        ActiveSheet.Range("PhExt") = PhExt
        
        'remove the phone extension from the number
        Dim prePhone As String
        prePhone = ActiveSheet.Range("Phone")
        prePhone = Replace(prePhone, PhExt & " ", "")
        prePhone = Replace(prePhone, "(" & PhExt & ") ", "")
        prePhone = Replace(prePhone, "(" & PhExt & ")", "")
        ActiveSheet.Range("Phone") = prePhone
        Exit For
        End If
    End If
    Next j
Exit For
End If
Next i

End Sub

  
Private Function PhExtension(ByVal State As String) As String
Select Case State
Case Is = "NSW"
PhExtension = "02"
Case Is = "QLD"
PhExtension = "07"
Case Is = "VIC"
PhExtension = "03"
Case Is = "NT"
PhExtension = "04"
Case Is = "WA"
PhExtension = "05"
Case Is = "SA"
PhExtension = "07"
Case Is = "TAS"
PhExtension = "06"
End Select
End Function

Private Function Ph(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
Ph = "PH"
Case Is = 1
Ph = "PHONE"
'Case Is = 2
'Ph = "P"
End Select
End Function

Private Function Mb(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
Mb = "MB"
Case Is = 1
Mb = "MOB"
Case Is = 2
Mb = "CELL"
Case Is = 3
Mb = "MOBILE"
'Case Is = 4
'Mb = "M"
End Select
End Function

Private Function Fax(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
Fax = "FAX"
Case Is = 1
Fax = "FACSIMILE"
'Case Is = 2
'Fax = "F"
End Select
End Function

Private Function State(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
State = "NSW"
Case Is = 1
State = "QLD"
Case Is = 2
State = "VIC"
Case Is = 3
State = "NT"
Case Is = 4
State = "WA"
Case Is = 5
State = "SA"
Case Is = 6
State = "TAS"
End Select
End Function

Private Function Street(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
Street = " ST"
Case Is = 1
Street = " RD"
Case Is = 2
Street = " AVE"
Case Is = 3
Street = " AV"
Case Is = 4
Street = " CRES"
Case Is = 5
Street = " LOOP"
Case Is = 6
Street = "PO BOX"
Case Is = 7
Street = " STREET"
Case Is = 8
Street = " ROAD"
Case Is = 9
Street = " AVENUE"
Case Is = 10
Street = " CRESENT"
Case Is = 11
Street = " PARADE"
Case Is = 12
Street = " PDE"
Case Is = 13
Street = " LANE"
Case Is = 14
Street = " COURT"
Case Is = 15
Street = " BLVD"
Case Is = 16
Street = "P.O. BOX"
Case Is = 17
Street = "P.O BOX"
Case Is = 18
Street = "PO BOX"
Case Is = 19
Street = "POBOX"
End Select
End Function

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接