Python：复杂字符串算法

Question

Python：复杂字符串算法

3

我有一个列表。

listcdtitles = 

["""    Liszt, Hungarian Rhapsody #6 {'Pesther Carneval'}; 2 Episodes from Lenau's 'Faust'; 'Hunnenschlacht' Symphonic Poem. (NW German Phil./ Kulka)   """,
""" Puccini, Verdi, Gounod, Bizet: Arias & Duets from Butterfly, Tosca, Boheme, Turandot, I Vespri, Faust, Carmen. (Fiamma Izzo d'Amico & Peter Dvorsky w.Berlin Radio Symph./Paternostro)  """,
""" Tchaikovsky, 'The Tempest' Fantasy. Liszt, Symphonic Poem #1. (London Symph./Butt)  """,
""" Duffy, John: 'Heritage: Civilization and the Jews'- Fanfare & Chorale, Symphonic Dances + Orchestral Suite. Bernstein, 'On the Town' Dance Episodes. (Royal Phil./R.Williams)   """,
""" Lilien, Ignace {1897-1963}: Songs, 1920-1935. (Anja van Wijk, mezzo & Frans van Ruth, piano)    """,
""" Hindemith, Trauermusik. Purcell, 'Fairy Queen' Suite. Rossini, String Sonata #6. Petrov, 'Creation of the World' Ballet Suite. Bartok, Romanian Folkdances Sz 56. Tartini, Flute Concerto in G {w.A.Maiorov} (Leningrad Orch.for Ancient & Modern Music/ Serov) """,
""" Bizet, Verdi, Massenet, Puccini: Arias from Carmen, Rigoletto, Werther, Manon Lescaut, Tosca, Turandot + Songs by Lara, Di Capua et al. (Peter Dvorsky, tenor w.Bratislava Orch./Lenard {Also performing 'Carmen' Overt.& 'Thais' Meditation}. Rec.Live, 10/87) """,
""" Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)    """,
""" Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting) """,
""" Gluck, Mozart, Beethoven, Weber, Verdi, Wagner, Ponchielli, Mascagni, Puccini: Arias from Alceste, Don Giovanni, Fidelio, Oberon, Ballo, Tristan, Walkure, Siegfried, Gotterdammerung, Gioconda, Cavalleria, Tosca. (Helene Wildbrunn. Rec.1919-24) """,
""" Stanley, Wesley, Stubley, Boyce, Handel, Heron, Russell, Hook: '18th Century Organ Music on Period Instruments' (Same instruments and artist as above)  """,
""" Reimann, 'Unrevealed' for Baritone & String Quartet to Texts by Lord Byron {R.Salter w.Kreuzberger Quartet}; Variations for Piano (David Levine)    """,
""" Bruckner, Symphony #9. (Berlin Philharmonic/ Jochum. Rec. 'live', 11/28/77) """,
""" Bruckner, Symphony #5. (Haas Edition. BBC Symph./ Horenstein. Rec.9/71) """,
..............................]

我有大约14,000个元素在这个列表中。

我想把那些有相似单词的字符串放在一起。

你有任何关于如何做到这一点的想法吗？我认为没有对错之分。

非常感谢您的任何建议。

- Alex Gordon

4

请进一步清晰地解释“bunch up”的含义。 - fredley

我希望它们被连接起来，不管怎样都可以，如果你想的话也可以获取位置。 - Alex Gordon

如果与Levenshtein / Soundex类似，则可能需要在每个字符串之间创建距离矩阵。如果可以通过排序来实现相似性...那么将其读入列表并使用“sorted（）”方法。 - Wolph

是的，可以。我会首先过滤掉那些我不想要的单词，比如“作为，这个，然后，对于”。 - Alex Gordon

你可能想要了解信息检索方法，例如这个。 - Daniel Beck

显示剩余2条评论

2个回答

1

首先，解析所有内容并将每个标记与频率关联起来。高频标记将被列入黑名单。

然后，您需要比较字符串，迭代它们，并将元组与距离分数相关联。根据此分数，您将连接它们 - 或不连接。

这将是一个简单的方法来完成这个任务。

- Guillaume Lebourgeois

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Zafer · Accepted Answer

我是Python语言的新手，但是我已经编写了一些代码，可以计算列表中条目之间的相似度得分。

代码如下。

import re
import array

listcdtitles = ["""    Liszt, Hungarian Rhapsody #6 {'Pesther Carneval'}; 2 Episodes from Lenau's 'Faust'; 'Hunnenschlacht' Symphonic Poem. (NW German Phil./ Kulka)   """,
""" Puccini, Verdi, Gounod, Bizet: Arias & Duets from Butterfly, Tosca, Boheme, Turandot, I Vespri, Faust, Carmen. (Fiamma Izzo d'Amico & Peter Dvorsky w.Berlin Radio Symph./Paternostro)  """,
""" Tchaikovsky, 'The Tempest' Fantasy. Liszt, Symphonic Poem #1. (London Symph./Butt)  """,
""" Duffy, John: 'Heritage: Civilization and the Jews'- Fanfare & Chorale, Symphonic Dances + Orchestral Suite. Bernstein, 'On the Town' Dance Episodes. (Royal Phil./R.Williams)   """,
""" Lilien, Ignace {1897-1963}: Songs, 1920-1935. (Anja van Wijk, mezzo & Frans van Ruth, piano)    """,
""" Hindemith, Trauermusik. Purcell, 'Fairy Queen' Suite. Rossini, String Sonata #6. Petrov, 'Creation of the World' Ballet Suite. Bartok, Romanian Folkdances Sz 56. Tartini, Flute Concerto in G {w.A.Maiorov} (Leningrad Orch.for Ancient & Modern Music/ Serov) """,
""" Bizet, Verdi, Massenet, Puccini: Arias from Carmen, Rigoletto, Werther, Manon Lescaut, Tosca, Turandot + Songs by Lara, Di Capua et al. (Peter Dvorsky, tenor w.Bratislava Orch./Lenard {Also performing 'Carmen' Overt.& 'Thais' Meditation}. Rec.Live, 10/87) """,
""" Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)    """,
""" Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting) """,
""" Gluck, Mozart, Beethoven, Weber, Verdi, Wagner, Ponchielli, Mascagni, Puccini: Arias from Alceste, Don Giovanni, Fidelio, Oberon, Ballo, Tristan, Walkure, Siegfried, Gotterdammerung, Gioconda, Cavalleria, Tosca. (Helene Wildbrunn. Rec.1919-24) """,
""" Stanley, Wesley, Stubley, Boyce, Handel, Heron, Russell, Hook: '18th Century Organ Music on Period Instruments' (Same instruments and artist as above)  """,
""" Reimann, 'Unrevealed' for Baritone & String Quartet to Texts by Lord Byron {R.Salter w.Kreuzberger Quartet}; Variations for Piano (David Levine)    """,
""" Bruckner, Symphony #9. (Berlin Philharmonic/ Jochum. Rec. 'live', 11/28/77) """,
""" Bruckner, Symphony #5. (Haas Edition. BBC Symph./ Horenstein. Rec.9/71) """]

entryDictionary = {}
i=0
for entry in listcdtitles:
    #remove unnecessary characters from the string
    entry=re.sub(r'[^\w ]', '', entry.lower(), flags=re.IGNORECASE)
    #split the entry into words and store it in the 
    entryDictionary[i]=entry.split(" ")
    i=i+1
# print the dictionary
print("Entries")
print(entryDictionary)

# define a score matrix, compare the words in each entry and if
# a word is same in both entries, that is one point
scoreMatrix = []
for k in range(i):
    scoreMatrix.append([])
    for j in range (i):
        if j>k:
            scoreMatrix[k].append(0)
        else:
            scoreMatrix[k].append("-")
k=0
j=0

for k in range(i-1):
    entry1 = entryDictionary[k]
    for j in range(k+1,i):
        entry2 = entryDictionary[j]
        for kk in range(len(entry1)):
            for jj in range(len(entry2)):
                if entry1[kk] != "" and entry1[kk] == entry2[jj]:
                    scoreMatrix[k][j] = scoreMatrix[k][j] + 1

print "Score Matrix (Higher numbers denote heigher similarity between two entries"

print repr("").rjust(10),
for k in range(i-1):
    print repr("Entry " + str(k)).rjust(10),
print repr("Entry " + str(i-1)).rjust(10)

for k in range(i):
    scoreMatrix.append([])
    print repr("Entry " + str(k)).rjust(10),
    for j in range (i-1):
        print repr(scoreMatrix[k][j]).rjust(10),
    print repr(scoreMatrix[k][i-1]).rjust(10)

结果如下：得分矩阵（较高的数字表示两个条目之间的相似度较高）

        ''  'Entry 0'  'Entry 1'  'Entry 2'  'Entry 3'  'Entry 4'  'Entry 5'  'Entry 6'  'Entry 7'  'Entry 8'  'Entry 9' 'Entry 10' 'Entry 11' 'Entry 12' 'Entry 13'
 'Entry 0'        '-'          2          3          2          0          1          1          0          1          1          0          0          0          0
 'Entry 1'        '-'        '-'          0          0          0          0         11          0          2          5          0          0          0          0
 'Entry 2'        '-'        '-'        '-'          3          0          1          0          1          0          0          0          0          0          0
 'Entry 3'        '-'        '-'        '-'        '-'          0          4          0          2          0          0          2          0          0          0
 'Entry 4'        '-'        '-'        '-'        '-'        '-'          0          1          0          0          0          0          1          0          0
 'Entry 5'        '-'        '-'        '-'        '-'        '-'        '-'          0          3          1          0          1          1          0          0
 'Entry 6'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          0          2          5          0          1          0          0
 'Entry 7'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          0          0          0          0          0          0
 'Entry 8'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          2          0          0          0          0
 'Entry 9'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          0          0          0          0
'Entry 10'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          0          0          0
'Entry 11'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          0          0
'Entry 12'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'          2
'Entry 13'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'        '-'