For example, a row looks like this:
keyword_text | keyword_norm | keyword_GAD_id| keyword_account | keyword_MCC_id | keyword_campaign | keyword_campaign_GAD_id | keyword_ad_group | keyword_ad_group_GAD_id| keyword_destination_url | keyword_max_cpc | keyword_status | keyword_match_type | keyword_campaign_status | keyword_ad_group_status | db_id | created_at |
________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
"lebanese home delivery jai", "lebanese home delivery jai", 61557127036, "IN [S_02] Cuisine", 7795189055, "IN-JAI[S[Cui_30_EN]: Lebanese", 301573516, "IN-JAI[S[Cui_30_EN|del_02|geo_01]_ex: (Lebanese) Lebanese home delivery Jaipur", 11043049036, http://www.bla.in/restaurants/index/cuisines/lebanese/city/jaipur, 480000, ENABLED, EXACT, PAUSED, PAUSED, 1, "2014-07-18 18:42:43"
The table was created with the following code:
CREATE TABLE adword_keywords
(
keyword_text character varying(1000) NOT NULL,
keyword_norm character varying(1000) NOT NULL,
"keyword_GAD_id" bigint NOT NULL,
keyword_account character varying NOT NULL,
"keyword_MCC_id" bigint NOT NULL,
keyword_campaign character varying NOT NULL,
"keyword_campaign_GAD_id" bigint NOT NULL,
keyword_ad_group character varying NOT NULL,
"keyword_ad_group_GAD_id" bigint NOT NULL,
keyword_destination_url character varying NOT NULL,
keyword_max_cpc double precision,
keyword_status keyword_status,
keyword_match_type match_type,
keyword_campaign_status keyword_c_status,
keyword_ad_group_status keyword_ag_status,
db_id bigserial NOT NULL,
created_at timestamp without time zone,
CONSTRAINT adword_keywords_pkey PRIMARY KEY (db_id)
)
WITH (
OIDS=FALSE
);
CREATE INDEX ix_adword_keywords_keyword_norm
ON adword_keywords
USING btree
(keyword_norm COLLATE pg_catalog."default");
I tried the following query:
SELECT adword_keywords.*
FROM adword_keywords
JOIN (
SELECT adword_keywords.keyword_text AS keyword_text,adword_keywords.keyword_match_type AS keyword_match_type
FROM adword_keywords GROUP BY adword_keywords.keyword_text, adword_keywords.keyword_match_type
HAVING count(adword_keywords.db_id) > 1) AS anon_1
ON adword_keywords.keyword_text = anon_1.keyword_text AND adword_keywords.keyword_match_type = anon_1.keyword_match_type
WHERE adword_keywords.keyword_campaign_status = 'ENABLED' AND adword_keywords.keyword_ad_group_status = 'ENABLED' AND adword_keywords.keyword_status = 'ENABLED'
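Unfortunately, this query returns wrong results: it also returns groups made up of a single element when grouping by ['keyword_text','keyword_match_type']!
Does anyone know what is wrong with this query?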
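The behaviour described above can be reproduced on a tiny table. The sketch below is an illustration only, under several assumptions: it uses Python's sqlite3 module with an in-memory database instead of PostgreSQL, keeps just the columns the query touches, stores the enum columns as TEXT, and uses four invented rows. On this data the subquery counts the 'pizza' pair twice (one ENABLED row plus one PAUSED row), so the pair passes HAVING, and the outer WHERE then leaves only one ENABLED row for it.

```python
import sqlite3

# Toy stand-in for adword_keywords (assumption: SQLite instead of PostgreSQL,
# enum columns stored as TEXT, only the columns used by the query).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE adword_keywords (
    db_id INTEGER PRIMARY KEY,
    keyword_text TEXT NOT NULL,
    keyword_match_type TEXT,
    keyword_status TEXT,
    keyword_campaign_status TEXT,
    keyword_ad_group_status TEXT
);
INSERT INTO adword_keywords VALUES
    (1, 'pizza', 'EXACT', 'ENABLED', 'ENABLED', 'ENABLED'),
    (2, 'pizza', 'EXACT', 'ENABLED', 'PAUSED',  'PAUSED'),
    (3, 'sushi', 'EXACT', 'ENABLED', 'ENABLED', 'ENABLED'),
    (4, 'sushi', 'EXACT', 'ENABLED', 'ENABLED', 'ENABLED');
""")

# Same shape as the query in the question: duplicates are counted over ALL rows,
# and the ENABLED filter is applied only afterwards, in the outer WHERE.
original_query = """
SELECT ak.db_id, ak.keyword_text
FROM adword_keywords AS ak
JOIN (
    SELECT keyword_text, keyword_match_type
    FROM adword_keywords
    GROUP BY keyword_text, keyword_match_type
    HAVING count(db_id) > 1
) AS anon_1
  ON ak.keyword_text = anon_1.keyword_text
 AND ak.keyword_match_type = anon_1.keyword_match_type
WHERE ak.keyword_campaign_status = 'ENABLED'
  AND ak.keyword_ad_group_status = 'ENABLED'
  AND ak.keyword_status = 'ENABLED'
ORDER BY ak.db_id
"""
rows = conn.execute(original_query).fetchall()
print(rows)  # row 1 ('pizza') appears although it has no ENABLED duplicate
```

If this reading is right, the duplicate count and the ENABLED filter are being applied to different sets of rows, which would explain the single-element groups.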
Note that if I pull all the data from the database into a pandas data structure with the following query:
SELECT * FROM adword_keywords
WHERE adword_keywords.keyword_campaign_status = 'ENABLED'
AND adword_keywords.keyword_ad_group_status = 'ENABLED'
AND adword_keywords.keyword_status = 'ENABLED'
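I can then filter the groups I want as follows:
df.groupby(['keyword_text','keyword_match_type']).filter(lambda x: x.shape[0]>1)
This post-processing returns the correct result.
However, for performance and memory reasons (the dataset is large and cannot be loaded into RAM in full), I would like to get the same result with a SQL query.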
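EDIT
Following ypercube's answer, there are three alternative queries that return the correct result. I have collected them here for reference, together with their running times: the first version is the fastest.
Using EXISTS, 1 loop, best of 3: 2.22 s per loop: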
WITH cte AS
( SELECT *
FROM adword_keywords
WHERE keyword_campaign_status = 'ENABLED'
AND keyword_ad_group_status = 'ENABLED'
AND keyword_status = 'ENABLED'
)
SELECT a.*
FROM cte AS a
WHERE EXISTS
( SELECT *
FROM cte AS b
WHERE (b.keyword_text, b.keyword_match_type)
= (a.keyword_text, a.keyword_match_type)
AND b.db_id <> a.db_id
) ;
Using PARTITION, 1 loop, best of 3: 5.7 s per loop:
WITH cte AS
( SELECT *,
COUNT(*) OVER (PARTITION BY keyword_text, keyword_match_type) AS cnt
FROM adword_keywords
WHERE (keyword_campaign_status, keyword_ad_group_status, keyword_status)
= ('ENABLED', 'ENABLED', 'ENABLED')
)
SELECT *
FROM cte
WHERE cnt >= 2 ;
Using GROUP BY, 1 loop, best of 3: 5.11 s per loop:
select ak.*
from
adword_keywords ak
inner join (
select keyword_text, keyword_match_type
from adword_keywords
where
keyword_campaign_status = 'ENABLED' AND
keyword_ad_group_status = 'ENABLED' AND
keyword_status = 'ENABLED'
group by keyword_text, keyword_match_type
having count(db_id) > 1
) an1 using (keyword_text, keyword_match_type)
where
keyword_campaign_status = 'ENABLED' AND
keyword_ad_group_status = 'ENABLED' AND
keyword_status = 'ENABLED'
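As a quick sanity check, the three variants can be run side by side on a toy table. This is a minimal sketch under stated assumptions, not a benchmark: it uses Python's sqlite3 module with an in-memory database rather than PostgreSQL (the window-function variant needs SQLite 3.25 or newer), stores the enum columns as TEXT, uses four invented rows, selects only db_id, and writes the EXISTS comparison with plain AND instead of row values. The timings listed above say nothing about this toy setup.

```python
import sqlite3

# Toy stand-in for adword_keywords (assumption: SQLite, TEXT instead of enums).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE adword_keywords (
    db_id INTEGER PRIMARY KEY,
    keyword_text TEXT NOT NULL,
    keyword_match_type TEXT,
    keyword_status TEXT,
    keyword_campaign_status TEXT,
    keyword_ad_group_status TEXT
);
INSERT INTO adword_keywords VALUES
    (1, 'pizza', 'EXACT', 'ENABLED', 'ENABLED', 'ENABLED'),
    (2, 'pizza', 'EXACT', 'ENABLED', 'PAUSED',  'PAUSED'),
    (3, 'sushi', 'EXACT', 'ENABLED', 'ENABLED', 'ENABLED'),
    (4, 'sushi', 'EXACT', 'ENABLED', 'ENABLED', 'ENABLED');
""")

# Shared filter, so every variant counts duplicates among ENABLED rows only.
ENABLED = """keyword_campaign_status = 'ENABLED'
         AND keyword_ad_group_status = 'ENABLED'
         AND keyword_status = 'ENABLED'"""

queries = {
    "exists": f"""
        WITH cte AS (SELECT * FROM adword_keywords WHERE {ENABLED})
        SELECT a.db_id
        FROM cte AS a
        WHERE EXISTS (
            SELECT 1 FROM cte AS b
            WHERE b.keyword_text = a.keyword_text
              AND b.keyword_match_type = a.keyword_match_type
              AND b.db_id <> a.db_id)
        ORDER BY a.db_id""",
    "partition": f"""
        WITH cte AS (
            SELECT db_id,
                   COUNT(*) OVER (PARTITION BY keyword_text, keyword_match_type) AS cnt
            FROM adword_keywords
            WHERE {ENABLED})
        SELECT db_id FROM cte WHERE cnt >= 2 ORDER BY db_id""",
    "group_by": f"""
        SELECT ak.db_id
        FROM adword_keywords AS ak
        JOIN (
            SELECT keyword_text, keyword_match_type
            FROM adword_keywords
            WHERE {ENABLED}
            GROUP BY keyword_text, keyword_match_type
            HAVING count(db_id) > 1
        ) AS an1 USING (keyword_text, keyword_match_type)
        WHERE {ENABLED}
        ORDER BY ak.db_id""",
}

# Collect the db_ids returned by each variant.
results = {name: [r[0] for r in conn.execute(sql)] for name, sql in queries.items()}
print(results)  # on this toy data, every variant returns db_ids 3 and 4
```

In all three variants the ENABLED filter is applied before duplicates are counted, which is the difference from the query in the question.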