sqlite3
模块?用例示例:
import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "hello world, guys")')
c.execute('INSERT INTO mytable VALUES (2, "hello there everybody")')
此查询应匹配ID为1的行,但不匹配ID为2的行:
c.execute('SELECT * FROM mytable WHERE dist(description, "He lo wrold gyus") < 6')
如何在Sqlite+Python中实现这一功能?
以下是我目前了解到的注意事项:
The Levenshtein distance, i.e. the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other, can be useful, but I'm not sure if an official implementation exists in Sqlite (I've seen a few custom implementations, like this one)
The Damerau-Levenshtein is the same, except it also allows transposition between 2 adjacent characters; it is also called the Edit distance
I know it's possible to define a function myself, but implementing such a distance will be non-trivial (doing natural language processing comparison super efficiently for databases is really non-trivial), that's why I wanted to see if Python / Sqlite already features such a tool
Sqlite has FTS (Full Text Seach) features: FTS3, FTS4, FTS5
CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT); /* FTS3 table */ CREATE TABLE enrondata2(content TEXT); /* Ordinary table */ SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux'; /* 0.03 seconds */ SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
but I don't find about string comparison with such a "similarity distance", FTS's features
MATCH
orNEAR
don't seem to have similarity measure with letters changes, etc.Moreover this answer shows that:
SQLite's FTS engine is based on tokens - keywords that the search engine tries to match.
A variety of tokenizers are available, but they are relatively simple. The "simple" tokenizer simply splits up each word and lowercases it: for example, in the string "The quick brown fox jumps over the lazy dog", the word "jumps" would match, but not "jump". The "porter" tokenizer is a bit more advanced, stripping the conjugations of words, so that "jumps" and "jumping" would match, but a typo like "jmups" would not.The latter (the fact that "jmups" cannot be found as similar to "jumps") makes it unpractical for my use case, sadly.