如何提高BigQuery中GeoIP查询的性能?

11

我已经将我的应用程序日志加载到BigQuery中,现在需要根据这些日志中的IP地址计算国家。

我已经在自己的表格和从MaxMind下载的GeoIP映射表之间编写了联接查询。

理想的查询应该是带有范围过滤器的OUTER JOIN,然而BQ仅支持联接条件中的=。因此,查询会执行一个INNER JOIN,处理JOIN每一侧缺失的值。

我已修改原始查询以便其可以在维基百科公共数据集上运行。

请问是否有人能帮我让这个查询运行得更快?

SELECT id, client_ip, client_ip_code, B.Country_Name as Country_Name

FROM
    (SELECT id, contributor_ip as client_ip, INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code, 1 AS One
    FROM [publicdata:samples.wikipedia] Limit 1000) AS A1

JOIN 
    (SELECT From_IP_Code, To_IP_Code, Country_Name, 1 AS One
    FROM

        -- 3 IP sets: 1.valid ranges, 2.Gaps, 3. Gap at the end of the set
        -- all Ranges of valid IPs:
        (SELECT From_IP_Code, To_IP_Code, Country_Name FROM [QA_DATASET.GeoIP])

        -- Missing rages lower from From_IP 
        ,(SELECT
            PriorRangeEndIP + 1 From_IP_Code, 
            From_IP_Code - 1 AS To_IP_Code, 
            'NA' AS Country_Name
        FROM

            -- use of LAG function to find prior valid range
            (SELECT 
                From_IP_Code, 
                To_IP_Code, Country_Name, 
                LAG(To_IP_Code, 1, INTEGER(0)) 
                OVER(ORDER BY From_IP_Code asc) PriorRangeEndIP                 
            FROM [QA_DATASET.GeoIP]) A

            -- If gap from prior valid range is > 1 than its a gap to fill
            WHERE From_IP_Code > PriorRangeEndIP + 1)

        -- Missing rages higher tan Max To_IP
        ,(SELECT MAX(To_IP_Code) + 1 as From_IP_Code, INTEGER(4311810304) as To_IP_Code, 'NA' AS Country_Name
        FROM [QA_DATASET.GeoIP])
    ) AS B
ON A1.ONE = B.ONE    -- fake join condition to overcome allowed use of only = in joins

-- Join condition where valid IP exists on left
WHERE
    A1.client_ip_code >= B.From_IP_Code
    AND A1.client_ip_code <= B.To_IP_Code
    OR (A1.client_ip_code IS NULL 
    AND B.From_IP_Code = 1)    -- where there is no valid IP on left contributor_ip

看起来很酷!我会研究一下的。你能把GeoIP表放到一个新数据集中并公开吗(以便尝试查询)?谢谢! - Felipe Hoffa
我一直在尝试这个。感谢您将表格公开。到目前为止,最困难的部分是处理空值。如果可以跳过空值,使用您编写的查询可以更快地完成。 - Felipe Hoffa
3个回答

16

2019年,大幅优化的答案:

#standardSQL
# replace with your source of IP addresses
# here I'm using the same Wikipedia set from the previous article
WITH source_of_ip_addresses AS (
  SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0')  ip, COUNT(*) c
  FROM `publicdata.samples.wikipedia`
  WHERE contributor_ip IS NOT null  
  GROUP BY 1
)
SELECT country_name, SUM(c) c
FROM (
  SELECT ip, country_name, c
  FROM (
    SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
    FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
  )
  JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`  
  USING (network_bin, mask)
)
GROUP BY 1
ORDER BY 2 DESC

清理后的答案位于:http://googlecloudplatform.blogspot.com/2014/03/geoip-geolocation-with-google-bigquery.html

让我整理一下原始查询:

SELECT
  id,
  client_ip,
  client_ip_code,
  B.Country_Name AS Country_Name
FROM (
  SELECT
    id,
    contributor_ip AS  client_ip,
    INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
    1 AS One
  FROM
    [publicdata:samples.wikipedia]
  WHERE contributor_ip IS NOT NULL
  LIMIT
    1000
    ) AS A1
LEFT JOIN
  (
  SELECT
    From_IP_Code,
    To_IP_Code,
    Country_Name,
    1 AS One
  FROM
    --3 IP sets: 1.valid ranges,  2.Gaps,  3. Gap at the END of the set
    (
    SELECT
      From_IP_Code,
      To_IP_Code,
      Country_Name
    FROM
      [playscape-proj:GeoIP.GeoIP]) -- all Ranges ov valid IPs
    ,
    (
    SELECT
      PriorRangeEndIP+1 From_IP_Code,
      From_IP_Code-1 AS To_IP_Code,
      'NA' AS Country_Name -- Missing rages lower    FROM      From_IP
    from(
      SELECT
        From_IP_Code,
        To_IP_Code,
        Country_Name
        ,
        LAG(To_IP_Code,
          1,
          INTEGER(0)) OVER(
        ORDER BY
          From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
      FROM
        [playscape-proj:GeoIP.GeoIP])A
    WHERE
     From_IP_Code>PriorRangeEndIP+1) -- If gap  FROM  prior valid range IS >1 than its a gap to fill
      ,
    (
    SELECT
      MAX(To_IP_Code)+1 AS From_IP_Code,
      INTEGER (4311810304) AS To_IP_Code,
      'NA' AS Country_Name -- Missing rages higher tan Max To_IP
    FROM
      [playscape-proj:GeoIP.GeoIP])
    ) AS B
  ON A1.ONE=B.ONE --fake JOIN condition to overcome allowed use of = only IN joins
WHERE
  A1.client_ip_code>=B.From_IP_Code
  AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
  OR (A1.client_ip_code IS NULL
    AND B.From_IP_Code=1 ) -- WHERE  there IS no valid IP ON left contributor_ip;

这是一个很长的查询!(也非常有趣)。 运行时间为14秒。我们如何优化它?

我发现了一些技巧:

  • 跳过 NULL。 如果日志中没有 IP 地址,则不要尝试匹配。
  • 减少组合。 不要将每个左侧记录与每个右侧记录 JOIN,而是将左侧的 39.x.x.x 记录仅与右侧的 39.x.x.x 记录 JOIN。 只有几个(3或4个)覆盖多个范围的规则。 可以很容易地在 geolite 表中添加几个规则来填补这些间隙并添加其他规则。

所以我正在更改:

  • 1 AS One 改成 INTEGER(PARSE_IP(contributor_ip)/(256*256*256))AS One(两次)。
  • 添加“WHERE contributor_ip IS NOT NULL”。

现在它只需要3秒运行!5%的 IP 无法定位,可能是由于上述间隙导致的(很容易修复)。

现在,将 LIMIT 1000 更改为 LIMIT 300000。需要多长时间?

37秒!比描述的25分钟好多了。如果你想进一步提高性能,我建议将右侧表格转换为静态表格,因为一旦计算完成,它就不会发生变化,只是基本规则的扩展。 然后你可以使用 JOIN EACH。

SELECT
  id,
  client_ip,
  client_ip_code,
  B.Country_Name AS Country_Name
FROM (
  SELECT
    id,
    contributor_ip AS  client_ip,
    INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
    INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One
  FROM
    [publicdata:samples.wikipedia]
  WHERE contributor_ip IS NOT NULL
  LIMIT
    300000
    ) AS A1
JOIN 
  (
  SELECT
    From_IP_Code,
    To_IP_Code,
    Country_Name,
    INTEGER(From_IP_Code/(256*256*256)) AS One
  FROM
    --3 IP sets: 1.valid ranges,  2.Gaps,  3. Gap at the END of the set
    (
    SELECT
      From_IP_Code,
      To_IP_Code,
      Country_Name
    FROM
      [playscape-proj:GeoIP.GeoIP]) -- all Ranges ov valid IPs
    ,
    (
    SELECT
      PriorRangeEndIP+1 From_IP_Code,
      From_IP_Code-1 AS To_IP_Code,
      'NA' AS Country_Name -- Missing rages lower    FROM      From_IP
    from(
      SELECT
        From_IP_Code,
        To_IP_Code,
        Country_Name
        ,
        LAG(To_IP_Code,
          1,
          INTEGER(0)) OVER(
        ORDER BY
          From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
      FROM
        [playscape-proj:GeoIP.GeoIP])A
    WHERE
     From_IP_Code>PriorRangeEndIP+1) -- If gap  FROM  prior valid range IS >1 than its a gap to fill
      ,
    (
    SELECT
      MAX(To_IP_Code)+1 AS From_IP_Code,
      INTEGER (4311810304) AS To_IP_Code,
      'NA' AS Country_Name -- Missing rages higher tan Max To_IP
    FROM
      [playscape-proj:GeoIP.GeoIP])
    ) AS B
  ON A1.ONE=B.ONE --fake JOIN condition to overcome allowed use of = only IN joins
WHERE
  A1.client_ip_code>=B.From_IP_Code
  AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
  OR (A1.client_ip_code IS NULL
    AND B.From_IP_Code=1 ) -- WHERE  there IS no valid IP ON left contributor_ip;

谢谢您的详细回答!1. 我会将右侧放在一个静态表格中。2. 我担心“From_IP_Code/(256256256) as One”可能不太好用,因为有一些范围分布在2甚至3个值之间。3. 在连接中跳过空值很棘手,因为它可能导致在最终数据集中丢弃记录(但这可以通过明智的连接条件处理)。我看到你们即将添加一些灵活性到连接条件中,希望能简化此类查询。 - N.N.
1
是的,涉及到扩展2或3个值的范围问题很容易解决。幸运的是,这种情况只有大约5个左右。最简单的解决方法是添加这些规则以填补空缺。例如,如果一个范围从4.0.0.0到5.255.255.255,您可以将其复制为4.0.0.0到4.255.255.255和5.0.0.0到5.255.255.255。只需覆盖几个案例,就能在以后获得更快的查询速度,而且投入成本也很低。 - Felipe Hoffa
对于任何使用标准SQL的人来说,都存在Net函数来替换类似PARSE_IP()这样的旧函数。 - Flair

7
作为一个很酷的补充(请查看之前的答案了解细节):哪些国家是维基百科编辑贡献最多的?
Row Country_Name    c    
1   United States   36605405     
2   United Kingdom  10355936     
3   Canada          4988835  
4   Australia       3387582  
5   India           1447756  
6   Germany         1414713 
7   Philippines     765874   
8   Netherlands     668850   
9   Ireland         651370   
10  France          602113   
11  New Zealand     590554   
12  Sweden          556544
....
Query complete (28.5s elapsed, 1.07 GB processed)

查询:

SELECT Country_Name, COUNT(*) c
FROM (
SELECT
 id,
 client_ip,
 client_ip_code,
 B.Country_Name AS Country_Name
FROM (
 SELECT
   id,
   contributor_ip AS client_ip,
   INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
   INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One
 FROM
   [publicdata:samples.wikipedia]
 WHERE contributor_ip IS NOT NULL
 -- NO LIMITS - use ALL the data!
   ) AS A1
JOIN
 (
 SELECT
   From_IP_Code,
   To_IP_Code,
   Country_Name,
   INTEGER(From_IP_Code/(256*256*256)) AS One
 FROM
   --3 IP sets: 1.valid ranges,  2.Gaps,  3. Gap at the END of the set
   (
   SELECT
     From_IP_Code,
     To_IP_Code,
     Country_Name
   FROM
     [playscape-proj:GeoIP.GeoIP]) -- all Ranges ov valid IPs
   ,
   (
   SELECT
     PriorRangeEndIP+1 From_IP_Code,
     From_IP_Code-1 AS To_IP_Code,
     'NA' AS Country_Name -- Missing rages lower    FROM      From_IP
   from(
     SELECT
       From_IP_Code,
       To_IP_Code,
       Country_Name,
       LAG(To_IP_Code,
         1,
         INTEGER(0)) OVER(
       ORDER BY
         From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
     FROM
       [playscape-proj:GeoIP.GeoIP])A
   WHERE
    From_IP_Code>PriorRangeEndIP+1) -- If gap  FROM  prior valid range IS >1 than its a gap to fill
     ,
   (
   SELECT
     MAX(To_IP_Code)+1 AS From_IP_Code,
     INTEGER (4311810304) AS To_IP_Code,
     'NA' AS Country_Name -- Missing rages higher tan Max To_IP
   FROM
     [playscape-proj:GeoIP.GeoIP])
   ) AS B
 ON A1.ONE=B.ONE --fake JOIN condition to overcome allowed use of = only IN joins
WHERE
 A1.client_ip_code>=B.From_IP_Code
 AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
 OR (A1.client_ip_code IS NULL
   AND B.From_IP_Code=1 ) -- WHERE  there IS no valid IP ON left contributor_ip;
)
GROUP BY 1 ORDER BY 2 DESC

这是使用BQ的IP的绝佳示例。我无法在公共数据集中看到表[playscape-proj:GeoIP.GeoIP]。我想运行上述查询。你能指导我吗?谢谢。 - bunny
请看以下链接中有关使用Google BigQuery实现GeoIP地理定位的查询: https://cloudplatform.googleblog.com/2014/03/geoip-geolocation-with-google-bigquery.html - Felipe Hoffa
请问这个 GeoIP 转换到城市、经度和纬度的准确度如何?我们可以依赖它吗,还是基于 IP 的位置只是一个近似值? - bunny
最近的CSV格式数据是否可以执行相同的操作? 引用您的话 - “也有一个更新的数据库可用,但我还没有使用它,因为目前它只以二进制形式提供。” - bunny

1

https://medium.com/@hoffa/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2启发,我提出了这个解决方案,可以处理IPv6地址:

WITH test_data AS (
    SELECT '2a02:2f0c:570c:fe00:1db7:21c4:21fa:f89' AS ip UNION ALL 
    SELECT '79.114.150.111' AS ip
)
-- replace the input_data with your data
, ipv4 AS (
    SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
    FROM test_data 
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
), ipv4d AS (
    SELECT ip, city_name, country_name, latitude, longitude
    FROM (
        SELECT ip, ip_bytes & NET.IP_NET_MASK(4, mask) network_bin, mask
        FROM ipv4, UNNEST(GENERATE_ARRAY(8,32)) mask
    )
    JOIN `demo_bq_dataset.geoip_city_v4`
    USING (network_bin, mask)
), ipv6 AS (
    SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
    FROM test_data 
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 16
), ipv6d AS (
    SELECT ip, city_name, country_name, latitude, longitude
    FROM (
        SELECT  ip, ip_bytes & NET.IP_NET_MASK(16, mask) network_bin, mask
        FROM ipv6, UNNEST(GENERATE_ARRAY(19,64)) mask
    )
    JOIN `demo_bq_dataset.geoip_city_v6`  
    USING (network_bin, mask)
)
SELECT * FROM ipv4d
UNION ALL 
SELECT * FROM ipv6d

为了获取geoip_city_v4geoip_city_v6,您需要从https://maxmind.com/下载geoip数据库。
要获得完整的解决方案,您可以按照此教程更新和准备数据集:https://hodo.dev/posts/post-37-gcp-bigquery-geoip/

正确的文章链接是 https://hodo.dev/posts/post-37-gcp-bigquery-geoip/。 - Gabriel Hodoroaga

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接