Normalizing British and American English for Elasticsearch

9
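Is there a best practice for normalizing British and American English in Elasticsearch?
Using a synonym token filter requires a very long configuration file. There are literally thousands of words that are spelled differently in British and American English, and it is almost impossible to find a truly comprehensive word list. Here is a list of almost 2,000 words, but it is far from complete.
Ideally, I would create an ES analyzer/filter with rules that convert American spellings to British ones. Maybe this is the better way to go, but I don't know where to start: which type of filter do I need? It doesn't have to cover everything; it only needs to normalize the majority of search terms. E.g. "grey" - "gray", "colour" - "color", "center" - "centre", etc.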
2 Answers

5

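After some fiddling around, I settled on the following approach. It is a combination of basic rules, "fixes" and synonyms. First, apply a char_filter to enforce a set of basic spelling rules. It is not 100% correct, but it does the job pretty well: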

"char_filter": {
    "en_char_filter": { "type": "mapping", "mappings": [
        # fixes
        "aerie=>axerie", "aeroplane=>airplane", "aloe=>aloxe", "canoe=>canoxe", "coerce=>coxerce", "poem=>poxem", "prise=>prixse",
        # whole words
        "armour=>armor", "behaviour=>behavior", "centre=>center", "colour=>color", "clamour=>clamor", "draught=>draft", "endeavour=>endeavor", "favour=>favor", "flavour=>flavor", "harbour=>harbor", "honour=>honor",
        "humour=>humor", "labour=>labor", "litre=>liter", "metre=>meter", "mould=>mold", "neighbour=>neighbor", "plough=>plow", "saviour=>savior", "savour=>savor",
        # generic transformations
        "ae=>e", "ction=>xion", "disc=>disk", "gramme=>gram", "isable=>izable", "isation=>ization", "ise=>ize", "ising=>izing", "ll=>l", "oe=>e", "ogue=>og", "sation=>zation", "yse=>yze", "ysing=>yzing"
    ] }
}

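The "fixes" entries prevent other rules from being applied incorrectly. For example, "prise=>prixse" prevents "prise" from being changed into "prize", which has a different meaning. You may need to adapt this to your own needs.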
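To see why the "fixes" are needed, it can help to imitate the char filter outside Elasticsearch. The following is a minimal Python sketch, assuming the mapping char filter matches greedily (the longest key at a given position wins) and does not rescan its own output. The helper `apply_mappings` and the shortened rule list are illustrative only and are not part of the configuration above:

```python
def apply_mappings(text, rules):
    """Apply "key=>value" rules left to right, preferring the longest key."""
    table = dict(rule.split("=>") for rule in rules)
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible key first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No rule matches at this position: copy one character.
            out.append(text[i])
            i += 1
    return "".join(out)


# Shortened, illustrative subset of the rules.
GENERIC = ["colour=>color", "ise=>ize", "ll=>l", "oe=>e"]
FIXES = ["prise=>prixse", "canoe=>canoxe"]

print(apply_mappings("realise", GENERIC))        # realize
print(apply_mappings("canoe", GENERIC))          # cane (unwanted)
print(apply_mappings("canoe", FIXES + GENERIC))  # canoxe (protected)
print(apply_mappings("prise", FIXES + GENERIC))  # prixse, not "prize"
```

Since the same filter would run at both index and query time, the artificial forms such as "canoxe" only need to be consistent, not readable.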
Next, include a synonym filter to catch the most frequently used exceptions:
"en_synonym_filter": { "type": "synonym", "synonyms": EN_SYNONYMS }

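Here is the list of synonyms we came up with, containing the keywords most relevant to our use case. You may want to adapt this list to your own needs: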
EN_SYNONYMS = (
    "accolade, prize => award",
    "accoutrement => accouterment",
    "aching, pain => hurt",
    "acw, anticlockwise, counterclockwise, counter-clockwise => ccw",
    "adaptor => adapter",
    "advocate, attorney, barrister, procurator, solicitor => lawyer",
    "ageing => aging",
    "agendas, agendum => agenda",
    "almanack => almanac",
    "aluminium => aluminum",
    "america, united states, usa",
    "amphitheatre => amphitheater",
    "anti-aliased, anti-aliasing => antialiased",
    "arbour => arbor",
    "ardour => ardor",
    "arse => ass",
    "artefact => artifact",
    "aubergine => eggplant",
    "automobile, motorcar => car",
    "axe => ax",
    "bannister => banister",
    "barbecue => bbq",
    "battleaxe => battleax",
    "baulk => balk",
    "beetroot => beet",
    "biassed => biased",
    "biassing => biasing",
    "biscuit => cookie",
    "black american, african american, afro-american, negro",
    "bobsleigh => bobsled",
    "bonnet => hood",
    "bulb, electric bulb, light bulb, lightbulb",
    "burned => burnt",
    "bussines, bussiness => business",
    "business man, business people, businessman",
    "business woman, business people, businesswoman",
    "bussing => busing",
    "cactus, cactuses => cacti",
    "calibre => caliber",
    "candour => candor",
    "candy floss, candyfloss, cotton candy",
    "car park, parking area, parking ground, parking lot, parking-lot, parking place, parking",
    "carburettor => carburetor",
    "castor => caster",
    "cataloguing => cataloging",
    "catboat, sailboat, sailing boat",
    "champion, gainer, victor, win, winner => victory",
    "chat => talk",
    "chequebook => checkbook",
    "chequer => checker",
    "chequerboard => checkerboard",
    "chequered => checkered",
    "christmas tree ball, christmas tree ball ornament, christmas ball ornament, christmas bauble",
    "christmas, x-mas => xmas",
    "cinema => movies",
    "clangour => clangor",
    "clarinettist => clarinetist",
    "conditioning => conditioner",
    "conference => meeting",
    "coriander => cilantro",
    "corporate => company",
    "cosmos, universe => outer space",
    "cosy, cosiness => cozy",
    "criminal => crime",
    "curriculums => curricula",
    "cypher => cipher",
    "daddy, father, pa, papa => dad",
    "defence => defense",
    "defenceless => defenseless",
    "demeanour => demeanor",
    "departure platform, station platform, train platform, train station",
    "dishrag => dish cloth",
    "dishtowel, dishcloth => dish towel",
    "doughnut => donut",
    "downspout => drainpipe",
    "drugstore => pharmacy",
    "e-mail => email",
    "enamoured => enamored",
    "england => britain",
    "english => british",
    "epaulette => epaulet",
    "exercise, excercise, training, workout => fitness",
    "expressway, motorway, highway => freeway",
    "facebook => facebook, social media",
    "fanny => buttocks",
    "fanny pack => bum bag",
    "farmyard => barnyard",
    "faucet => tap",
    "fervour => fervor",
    "fibre => fiber",
    "fibreglass => fiberglass",
    "flashlight => torch",
    "flautist => flutist",
    "flier => flyer",
    "flower fly, hoverfly, syrphid fly, syrphus fly",
    "foot-walk, sidewalk, sideway => pavement",
    "football, soccer",
    "forums => fora",
    "fourth => 4",
    "freshman => fresher",
    "chips, fries, french fries",
    "gaol => jail",
    "gaolbird => jailbird",
    "gaolbreak => jailbreak",
    "gaoler => jailer",
    "garbage, rubbish => trash",
    "gasoline => petrol",
    "gases, gasses",
    "gauge => gage",
    "gauged => gaged",
    "gauging => gaging",
    "gipsy, gipsies, gypsies => gypsy",
    "glamour => glamor",
    "glueing => gluing",
    "gravesite, sepulchre, sepulture => sepulcher",
    "grey => gray",
    "greyish => grayish",
    "greyness => grayness",
    "groyne => groin",
    "gryphon, griffon => griffin",
    "hand shake, shake hands, shaking hands, handshake",
    "haulier => hauler",
    "hobo, homeless, tramp => bum",
    "new year, new year's eve, hogmanay, silvester, sylvester",
    "holiday => vacation",
    "holidaymaker, holiday-maker, vacationer, vacationist => tourist",
    "homosexual, fag => gay",
    "inbox, letterbox, outbox, postbox => mailbox",
    "independence day, 4th of july, fourth of july, july 4th, july 4, 4th july, july fourth, forth of july, 4 july, fourth july, 4th july",
    "infant, suckling, toddler => baby",
    "infeasible => unfeasible",
    "inquire, inquiry => enquire",
    "insure => ensure",
    "internet, website => www",
    "jelly => jam",
    "jewelery, jewellery => jewelry",
    "jogging => running",
    "journey => travel",
    "judgement => judgment",
    "kerb => curb",
    "kiwifruit => kiwi",
    "laborer => worker",
    "lacklustre => lackluster",
    "ladybeetle, ladybird, ladybug => ladybird beetle",
    "larrikin, scalawag, rascal, scallywag => naughty boy",
    "leaf => leaves",
    "licence, licenced, licencing => license",
    "liquorice => licorice",
    "lorry => truck",
    "loupe, magnifier, magnifying, magnifying glass, magnifying lens, zoom",
    "louvred => louvered",
    "louvres => louver",
    "lustre => luster",
    "mail => post",
    "mailman => postman",
    "marriage, married, marry, marrying, wedding => wed",
    "mayonaise => mayo",
    "meagre => meager",
    "misdemeanour => misdemeanor",
    "mitre => miter",
    "mom, momma, mummy, mother => mum",
    "moonlight => moon light",
    "moult => molt",
    "moustache, moustached => mustache",
    "nappy => diaper",
    "nightlife => night life",
    "normalcy => normality",
    "octopus => kraken",
    "odour => odor",
    "odourless => odorless",
    "offence => offense",
    "omelette => omelet",
    "# fix torres del paine",
    "paine => painee",
    "pajamas => pyjamas",
    "pantyhose => tights",
    "parenthesis, parentheses => bracket",
    "parliament => congress",
    "parlour => parlor",
    "persnickety => pernickety",
    "philtre => filter",
    "phoney => phony",
    "popsicle => iced-lolly",
    "porch => veranda",
    "pretence => pretense",
    "pullover, jumper => sweater",
    "pyjama => pajama",
    "railway => railroad",
    "rancour => rancor",
    "rappel => abseil",
    "row house, serial house, terrace house, terraced house, terraced housing, town house",
    "rigour => rigor",
    "rumour => rumor",
    "sabre => saber",
    "saltpetre => saltpeter",
    "sanitarium => sanatorium",
    "santa, santa claus, st nicholas, st nicholas day",
    "sceptic, sceptical, scepticism, sceptics => skeptic",
    "sceptre => scepter",
    "shaikh, sheikh => sheik",
    "shivaree => charivari",
    "silverware, flatware => cutlery",
    "simultaneous => simultanous",
    "sleigh => sled",
    "smoulder, smouldering => smolder",
    "sombre => somber",
    "speciality => specialty",
    "spectre => specter",
    "splendour => splendor",
    "spoilt => spoiled",
    "street => road",
    "streetcar, tramway, tram => trolley-car",
    "succour => succor",
    "sulphate, sulphide, sulphur, sulphurous, sulfurous => sulfur",
    "super hero, superhero => hero",
    "surname => last name",
    "sweets => candy",
    "syphon => siphon",
    "syphoning => siphoning",
    "tack, thumb-tack, thumbtack => drawing pin",
    "tailpipe => exhaust pipe",
    "taleban => taliban",
    "teenager => teen",
    "television => tv",
    "thank you, thanks",
    "theatre => theater",
    "tickbox => checkbox",
    "ticked => checked",
    "timetable => schedule",
    "tinned => canned",
    "titbit => tidbit",
    "toffee => taffy",
    "tonne => ton",
    "transportation => transport",
    "trapezium => trapezoid",
    "trousers => pants",
    "tumour => tumor",
    "twitter => twitter, social media",
    "tyre => tire",
    "tyres => tires",
    "undershirt => singlet",
    "university => college",
    "upmarket => upscale",
    "valour => valor",
    "vapour => vapor",
    "vigour => vigor",
    "waggon => wagon",
    "windscreen, windshield => front shield",
    "world championship, world cup, worldcup",
    "worshipper, worshipping => worshiping",
    "yoghourt, yoghurt => yogurt",
    "zip, zip code, postal code, postcode",
    "zucchini => courgette"
)
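
The answer does not show how the char filter and the synonym filter are combined into an analyzer. One possible wiring is sketched below; the analyzer name `en_analyzer`, the `standard` tokenizer, the `lowercase` filter and the helper `build_analysis` are assumptions for illustration, and the two short lists stand in for the full ones above:

```python
import json


def build_analysis(char_mappings, synonyms):
    """Return an index settings dict that combines both filters."""
    return {
        "analysis": {
            "char_filter": {
                "en_char_filter": {
                    "type": "mapping",
                    "mappings": list(char_mappings),
                },
            },
            "filter": {
                "en_synonym_filter": {
                    "type": "synonym",
                    "synonyms": list(synonyms),
                },
            },
            "analyzer": {
                # Hypothetical analyzer; tokenizer and lowercase are assumptions.
                "en_analyzer": {
                    "type": "custom",
                    "char_filter": ["en_char_filter"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "en_synonym_filter"],
                },
            },
        }
    }


settings = build_analysis(
    ["colour=>color", "ise=>ize"],
    ["grey => gray", "aluminium => aluminum"],
)
print(json.dumps(settings, indent=2))
```

Because char filters run before the tokenizer and the token filters, a lowercase key such as "colour" would presumably not match "Colour" in this arrangement, so the text may need to be lowercased before it reaches the analyzer.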

3
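I know this answer strays a bit from the OP's original question, but if you only want to normalize American vs. British English spelling variants, you can find a well-maintained list here (~1,700 replacements): http://www.tysto.com/uk-us-spelling-list.html. I'm sure there are other resources out there that you could use to build a comprehensive master list.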
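If such a list is available as plain British/American word pairs, it could be converted into rules for a synonym filter. This is a minimal sketch, assuming one tab-separated pair per line; the actual layout of the linked page may differ, so the parsing step and the helper `pairs_to_synonym_rules` are hypothetical:

```python
def pairs_to_synonym_rules(lines):
    """Turn "uk<TAB>us" lines into "uk => us" synonym rules."""
    rules = []
    seen = set()
    for line in lines:
        parts = line.strip().lower().split("\t")
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        uk, us = (part.strip() for part in parts)
        if not uk or not us or uk == us or uk in seen:
            continue  # skip empty fields, no-op pairs and duplicates
        seen.add(uk)
        rules.append(f"{uk} => {us}")
    return rules


sample = ["colour\tcolor", "grey\tgray", "", "Theatre\ttheater", "colour\tcolor"]
print(pairs_to_synonym_rules(sample))
# ['colour => color', 'grey => gray', 'theatre => theater']
```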
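Beyond spelling variations, you have to be very careful not to carelessly replace words in isolation with their American English counterparts. I would suggest making only the most reliable lexical replacements. E.g., this one is fine: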
"anticlockwise, counterclockwise, counter-clockwise => counter-clockwise"
But this one is not:
"hobo, homeless, tramp => bum"

It would turn "A homeless man" in the index into *"A bum man", which is nonsensical. (Not to mention the clear distinctions between a hobo, a homeless person, and a "tramp": http://knowledgenuts.com/2014/11/26/the-difference-between-hobos-tramps-and-bums/.)
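
The effect can be reproduced with a toy replacement step. This is a rough sketch that assumes whitespace tokenization and a plain lookup table, which is far simpler than what Elasticsearch does, but it produces the same kind of result; `replace_tokens` is a made-up helper:

```python
def replace_tokens(text, rule):
    """Apply one "a, b, c => target" rule to lowercased, whitespace-split text."""
    sources, target = rule.split("=>")
    table = {source.strip(): target.strip() for source in sources.split(",")}
    return " ".join(table.get(token, token) for token in text.lower().split())


print(replace_tokens("A homeless man", "hobo, homeless, tramp => bum"))
# a bum man

print(replace_tokens(
    "Turn it anticlockwise",
    "anticlockwise, counterclockwise, counter-clockwise => counter-clockwise",
))
# turn it counter-clockwise
```

The second rule only merges spellings of one word, so the sentence keeps its meaning; the first swaps an adjective for a noun with a different sense.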

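In short, apart from spelling variations, the dialect differences between American and British English are complex and cannot be reduced to a simple list lookup.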

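P.S. If you really want to do this properly (i.e. taking grammatical context and so on into account), you would probably need a context-sensitive paraphrase model to "translate" British English into American English (or the other way round, depending on your needs) before the text is fed into the ES index. Given enough parallel data, this could be done with an off-the-shelf statistical translation model, or even with custom software that uses natural language parsing, POS tagging, chunking, etc.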

