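I have just started learning Python's NumPy and regular expressions. I am trying to extract a pattern from each row of a pandas text column. My requirements cover many possible cases, so I wrote the different regular expressions below. To iterate over them and search for a given pattern I am using np.where, but I am running into performance problems. Is there any way to improve the performance, or an alternative that achieves the output below?
x_train['Description'] is my pandas column.
There are 54672 rows in my dataset.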
Code:
import re
import time

import numpy as np
pattern1 = re.compile(r'\bAGE[a-z]?\b[\s\w]*\W+\d+.*(?:year[s]|month[s]?)',re.I)
pattern2 = re.compile(r'\bfor\b[\s]*age[s]?\W+\d+\W+(?:month[s]?|year[s]?)',re.I)
pattern3 = re.compile(r'\badult[s]?.[\w\s]\d+',re.I)
pattern4 = re.compile(r'\b\d+\W+(?:month[s]?|year[s]?)\W+of\W+age[a-z]?',re.I)
pattern5 = re.compile(r'[a-z][a-z\s]+(?:month[s]?|year[s]?)[\w\s]+age[s]?',re.I)
pattern6 = re.compile(r'\bage.*?\s\d+[\s]*\+',re.I)
pattern7 = re.compile(r'\bbetween[\s]*age[s]?[\s]*\d+.*(?:month[s]?|year[s]?)',re.I)
pattern8 = re.compile(r'\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)',re.I)
np_time = time.time()
# Nested np.where: the first pattern that matches a row wins; rows with no match get 'NO PATTERN'.
x_train['pattern'] = np.where(x_train['Description'].str.contains(pattern1), x_train['Description'].str.findall(pattern1),
    np.where(x_train['Description'].str.contains(pattern2), x_train['Description'].str.findall(pattern2),
    np.where(x_train['Description'].str.contains(pattern3), x_train['Description'].str.findall(pattern3),
    np.where(x_train['Description'].str.contains(pattern4), x_train['Description'].str.findall(pattern4),
    np.where(x_train['Description'].str.contains(pattern5), x_train['Description'].str.findall(pattern5),
    np.where(x_train['Description'].str.contains(pattern6), x_train['Description'].str.findall(pattern6),
    np.where(x_train['Description'].str.contains(pattern7), x_train['Description'].str.findall(pattern7),
    np.where(x_train['Description'].str.contains(pattern8), x_train['Description'].str.findall(pattern8),
    'NO PATTERN')
    )))))))
print("pattern extraction ran in = ")
print("--- %s seconds ---" % (time.time() - np_time))
pattern extraction ran in =
--- 99.5106501579 seconds ---
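One alternative that may be worth timing is a single pass per row that stops at the first pattern that matches, instead of running `str.contains` and `str.findall` over the whole column for every pattern. The following is only a sketch under assumptions: `first_match`, `PATTERNS`, and `descriptions` are hypothetical names, only pattern1 and pattern8 are included, and it has not been measured on the dataset above.

```python
import re

# Patterns in priority order. Only pattern1 and pattern8 from the question
# are shown here; the remaining six would be listed in between.
PATTERNS = [
    re.compile(r'\bAGE[a-z]?\b[\s\w]*\W+\d+.*(?:year[s]|month[s]?)', re.I),
    re.compile(r'\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)', re.I),
]


def first_match(text, patterns=PATTERNS, default='NO PATTERN'):
    """Return findall() of the first pattern that matches text, else default."""
    if not isinstance(text, str):  # e.g. NaN in a missing cell
        return default
    for pattern in patterns:
        found = pattern.findall(text)
        if found:  # stop at the first pattern that matches
            return found
    return default


# Hypothetical stand-in for the Description column.
descriptions = [
    "AGE RANGE: 6 YEARS AND UP 10' LONG STRING OF BEAUTIFUL LIGHTS",
    "SUITABLE FOR 3 AND UP",
    "NO AGE INFORMATION HERE",
]
results = [first_match(d) for d in descriptions]
print(results)
```

With the DataFrame from the question, this would presumably be applied as `x_train['pattern'] = x_train['Description'].map(first_match)`. Whether it is actually faster depends on how early rows tend to match, so it should be timed against the nested np.where version.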
Below is sample input and output for the code above:
Row 0
Description: **AGE RANGE: 6 YEARS** AND UP 10' LONG STRING OF BEAUTIFUL LIGHTS MULTIPLE LIGHT EFFECTS FADE IN AND OUT
pattern: AGE RANGE: 6 YEARS

Row 1
Description: DIMENSIONS OVERALL HEIGHT - TOP TO BOTTOM: 34.5'' OVERALL WIDTH - SIDE TO SIDE: 20'' OVERALL DEPTH - FRONT TO BACK: 15'' COUNTER TOP HEIGHT - TOP TO BOTTOM: 23'' OVERALL PRODUCT WEIGHT: 38 LBS " **"AGE GROUP: -2 YEARS/3 TO 4 YEARS/5 TO 6 YEARS/7 TO 8 YEARS**.
pattern: AGE GROUP: -2 YEARS/3 TO 4 YEARS/5 TO 6 YEARS/7 TO 8 YEARS.

Row 2
Description: THE FLAME-RETARDANT FOAM ALSO CONTAINS ANTIMICROBIAL PROTECTION, SO IT WON'T GROW MOLD OR BACTERIA IF IT GETS WET. THE BRIGHTLY-COLORED VINYL EXTERIOR IS EASY TO WIPE CLEAN. FOAMMAN IS DESIGNED FOR KIDS **AGED 1-5 YEARS**
pattern: AGED 1-5 YEARS
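Because the nested np.where encodes a priority, it can help to see which patterns match a given row at all. The check below is a small sketch, not part of the original code: `matching_patterns`, `patterns`, and `samples` are hypothetical names, only pattern1 and pattern8 are included, and the sample strings are shortened single-line versions of rows 0 and 2 with the emphasis markers removed.

```python
import re

# pattern1 and pattern8 copied from the question; the other six are omitted.
patterns = {
    'pattern1': re.compile(r'\bAGE[a-z]?\b[\s\w]*\W+\d+.*(?:year[s]|month[s]?)', re.I),
    'pattern8': re.compile(r'\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)', re.I),
}

# Shortened versions of sample rows 0 and 2.
samples = [
    "AGE RANGE: 6 YEARS AND UP 10' LONG STRING OF BEAUTIFUL LIGHTS",
    "FOAMMAN IS DESIGNED FOR KIDS AGED 1-5 YEARS",
]


def matching_patterns(text):
    """Return {pattern name: findall result} for every pattern that matches text."""
    return {name: p.findall(text) for name, p in patterns.items() if p.search(text)}


report = [matching_patterns(s) for s in samples]
for sample, hits in zip(samples, report):
    print(sample[:30], '->', hits)
```

In this sketch row 0 matches both pattern1 and pattern8, so the order of the patterns decides which result ends up in the pattern column.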