从geonames.org数据库导入数据到MySQL数据库

16

你是否尝试导入zip文件?如果没有,那么文件结构是什么? - bensiu
4个回答

25

我通过查看你在链接的zip文件中所提到的名为“主要'GeoName'表具有以下字段:”部分的自述文件找到了以下内容。

首先,在您的MySQL实例上创建数据库和表。每行都给出了上面我引用标题的部分中的字段类型。

CREATE DATABASE DO_test;
CREATE TABLE `DO_test`.`DO_table` (
  `geonameid` INT,
  `name` varchar(200),
  `asciiname` varchar(200),
  `alternatenames` varchar(5000),
  `latitude` DECIMAL(10,7),
  `longitude` DECIMAL(10,7),
  `feature class` char(1),
  `feature code` varchar(10),
  `country code` char(2),
  `cc2` char(60),
  `admin1 code` varchar(20),
  `admin2 code` varchar(80),
  `admin3 code` varchar(20),
  `admin4 code` varchar(20),
  `population` bigint,
  `elevation` INT,
  `gtopo30` INT,
  `timezone` varchar(100),
  `modification date` date
)
CHARACTER SET utf8;

在创建完表格后,您可以从文件中导入数据。字段由制表符分隔,行由换行符表示:

LOAD DATA INFILE '/path/to/your/file/DO.txt' INTO TABLE `DO_test`.`DO_table`;

1
哎呀,您需要对纬度和经度进行不同的处理。我正在修复这个问题。我认为它们应该是“点”而不是“十进制数”。 - wilbbe01
1
@LuisDUrraca:你试过把它放在 /tmp 目录下,然后从那里加载吗? - wilbbe01
@wofgang - 核心可能与导入速度没有太大关系,但15分钟似乎有点长。 - wilbbe01
如果您只需要城市和邮政编码列表,这是表结构:https://pastebin.com/2VTHv76g 顺便说一下,目前Geonames数据只需要DECIMAL(6,4)格式的纬度和DECIMAL(7,4)格式的经度。 我还读到说最好避免在列名中使用空格。 您也可以使用mysqlimport命令导入数据。 由于反斜杠的原因出现了错误,所以我不得不将其删除:tr -d '\\' < geonames_cities.txt > cities.csv - baptx
编辑:如果在my.cnf中的[mysqld]下添加sql-mode =“”,则无需删除反斜杠。 https://dev59.com/RF4b5IYBdhLWcg3w3k6Y#1dbonYgBc1ULPQZFWUjd - baptx
显示剩余7条评论

17

我最近制作了一个shell脚本,从GeoNames网站下载最新数据并将其导入MySQL数据库。它基于GeoNames论坛上的知识,并节省了我很多时间。

它是第一个版本,但完全功能正常。也许它可以帮助你。

你可以在http://codigofuerte.github.com/GeoNames-MySQL-DataImport/访问它。


谢谢。这节省了我们一些麻烦。似乎甚至可以在Cygwin上运行(我在同事的电脑上看到了它)。现在我正在我的Linux机器上运行它。似乎可以工作,但是需要花费很长时间! :) - Walialu
谢谢,这非常有用。 - wolfgang

8
未来的每个人: 2008年在geonames.org论坛中有一个名为“将所有地理名称转存到MySQL”的帖子:http://forum.geonames.org/gforum/posts/list/732.page 同时搜索以下内容:import dump into [postgresql OR SQL server OR MySQL] site:forum.geonames.org,以查找更多2006年以来的答案。 编辑提供摘要: 在geoname官方的read me中:http://download.geonames.org/export/dump/。我们会发现有关转存文件及其内容的良好描述。 可以直接将转存文件导入到MySQL数据表中。例如:
SET character_set_database=utf8;
LOAD DATA INFILE '/home/data/countryInfo.txt' INTO TABLE _geo_countries IGNORE 51 LINES(ISO2,ISO3,ISO_Numeric,FIPSCode,AsciiName,Capital,Area_SqKm,Population,ContinentCode,TLD,CurrencyCode,CurrencyName,PhoneCodes,PostalCodeFormats,PostalCodeRegex,Languages,GeonameID,Neighbours,EquivalentFIPSCodes);
SET character_set_database=default;

请注意字符集,因为如果我们使用2012年旧版phpmyadmin的CSV LOAD DATA准备导入程序,则即使列的排序设置为utf8_general_ci,我们也可能会丢失utf字符。
目前有4个必要的数据表:大陆、国家(countryInfo.txt)、行政区划(admin1)和城市或地点(geonames)。
admin1、2、3、4转储文件是不同级别的国家内部行政区划,如美国的州或其他国家的省份。 admin2更详细,是该州或省的内部划分。以此类推,第3和4层级别也是如此。
国家转储文件中列出的不仅是城市,而且包括该国家的所有位置,甚至包括一个存储中心。还有一个名为“allCountries.txt”的庞大文件,在从zip文件中提取后将超过1GB。 如果我们只想要城市,我们应该选择其中一个转储文件:cities1000.txt,cities5000.txt,cities15000.txt,其中数字表示所列城市的最小人口。我们在geonames datatable中存储城市(您可以称其为地理位置或地理城市)。
在导入*.txt转储文件之前,请在MySQL文档中对LOAD DATA语法进行一些研究。
read me文本文件(也在dump页面的页脚中)提供了足够的描述,例如:
The main 'geoname' table has the following fields :
---------------------------------------------------
geonameid         : integer id of record in geonames database
name              : name of geographical point (utf8) varchar(200)
asciiname         : name of geographical point in plain ascii characters, varchar(200)
alternatenames    : alternatenames, comma separated varchar(5000)
latitude          : latitude in decimal degrees (wgs84)
longitude         : longitude in decimal degrees (wgs84)
feature class     : see http://www.geonames.org/export/codes.html, char(1)
feature code      : see http://www.geonames.org/export/codes.html, varchar(10)
country code      : ISO-3166 2-letter country code, 2 characters
cc2               : alternate country codes, comma separated, ISO-3166 2-letter country code, 60 characters
admin1 code       : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
admin2 code       : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) 
admin3 code       : code for third level administrative division, varchar(20)
admin4 code       : code for fourth level administrative division, varchar(20)
population        : bigint (8 byte int) 
elevation         : in meters, integer
dem               : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
timezone          : the timezone id (see file timeZone.txt) varchar(40)
modification date : date of last modification in yyyy-MM-dd format

关于varchar(5000),我们需要知道MySQL 5.0或更高版本中每行的64kb大小: 在MySQL中,VARCHAR(20000)是否有效?


4
这是我成功导入后的笔记。
在写作时,我正在Windows 7上测试MySQL 5.7.16。按照以下步骤进行导入:
  1. Download desired data file from the official download page. In my case I chose cities1000.zip because it's much smaller in size (21MB) than the all-inclusive allcountries.zip (1.4GB).

  2. Create the following schema and table according to readme.txt on the download page, where the fields are specified below the text "the main 'geoname' table has the following fields".

    CREATE SCHEMA geonames DEFAULT CHARSET utf8 COLLATE utf8_general_ci;
    
    CREATE TABLE geonames.cities1000 (
     id                INT,
     name              VARCHAR(200),
     ascii_name        VARCHAR(200),
     alternate_names   VARCHAR(10000) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
     latitude          DECIMAL(10, 7),
     longitude         DECIMAL(10, 7),
     feature_class     CHAR(1),
     feature_code      VARCHAR(10),
     country_code      CHAR(2),
     cc2               CHAR(60),
     admin1_code       VARCHAR(20),
     admin2_code       VARCHAR(80),
     admin3_code       VARCHAR(20),
     admin4_code       VARCHAR(20),
     population        BIGINT,
     elevation         INT,
     dem               INT,
     timezone          VARCHAR(100),
     modification_date DATE
    )
     CHARACTER SET utf8;
    

    Field names are arbitrary as long as the column size and field types are the same as specified. alternate_names are specially defined with the character set utf8mb4 because the values for this column in the file contain 4-byte unicode characters which are not supported by the character set utf8 of MySQL.

  3. Check the values of these parameters: character_set_client, character_set_results, character_set_connection. 7

    SHOW VARIABLES LIKE '%char%';
    

    If they are not utf8mb4, then change them:

    SET character_set_client = utf8mb4;
    SET character_set_results = utf8mb4;
    SET character_set_connection = utf8mb4;
    
  4. Import data from file using LOAD DATA INFILE ...

    USE geonames;
    
    LOAD DATA INFILE 'C:\\ProgramData\\MySQL\\MySQL Server 5.7\\Uploads\\cities1000.txt' INTO TABLE cities1000
    CHARACTER SET utf8mb4 (id, name, ascii_name, alternate_names, latitude, longitude, feature_class, feature_code,
                           country_code, cc2, admin1_code, admin2_code, admin3_code, admin4_code, population, @val1,
                           @val2, timezone, modification_date)
    SET elevation = if(@val1 = '', NULL, @val1), dem = if(@val2 = '', NULL, @val2);
    

    Explanation for the statement:

    • The file should be placed under a designated location by MySQL for importing data from files. You can check the location with SHOW VARIABLES LIKE 'secure_file_priv';. In my case it's C:\ProgramData\MySQL\MySQL Server 5.7\Uploads. In Windows you need to use double slashes to represent one slash in the path. This error would be shown when the path is not given correctly: [HY000][1290] The MySQL server is running with the --secure-file-priv option so it cannot execute this statement.

    • With CHARACTER SET utf8mb4 you're telling MySQL what encoding to expect from the file. When this is not given explicitly, or the column encoding is not utf8mb4, an error prompt like this will be seen: [HY000][1300] Invalid utf8 character string: 'Gorad Safija,SOF,Serdica,Sofi,Sofia,Sofiae,Sofie,Sofii,Sofij,Sof'. 5 In my case I found it's due to the existence of Gothic letters in the alternate names, such as (id 727011), (id 3464975), and (id 3893894). These letters need to be stored as 4-byte characters (utf8mb4) while my then encoding was utf8 which only supports up to 3-byte characters. 6 You can change column encoding after the table is created:

      ALTER TABLE cities1000 MODIFY alternate_names VARCHAR(10000) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
      

      To check the encoding of a column:

      SELECT character_set_name, COLLATION_NAME FROM information_schema.COLUMNS WHERE table_schema = 'geonames' AND table_name = 'cities1000' AND column_name = 'alternate_names';
      

      To test if the characters can be stored:

      UPDATE cities1000 SET alternate_names = '' WHERE id = 1;
      
    • Values for some columns need to be "improved" before they are inserted, such as elevation and dem. They are of type INT and values for them from the file could be empty strings, which can't be stored by an INT type column. So you need to convert those empty strings to null for those columns. The latter part of the statement just serves this purpose. This error would be shown when the values are not property converted first: [HY000][1366] Incorrect integer value: '' for column 'elevation' at row 1. 3, 4

参考文献

  1. http://www.geonames.org/
  2. http://download.geonames.org/export/dump/
  3. https://dev.mysql.com/doc/refman/8.0/en/load-data.html
  4. https://dba.stackexchange.com/a/111044/94778
  5. https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-conversion.html
  6. https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
  7. https://dev59.com/qGgu5IYBdhLWcg3w8bhs#35156926
  8. https://dev.mysql.com/doc/refman/5.7/en/charset-connection.html

1
谢谢,utf8mb4解决了我遇到的一些问题。我的项目需要来自全世界的地点信息。因此,我将id设置为主键,并将name和ascii-name设置为索引。这大大提高了性能。 - Rockney
太棒了。在出现“参数未定义”错误的情况下,似乎需要在连接字符串中声明“Allow User Variables=True”才能使用用户值,例如val1和val2:https://dev59.com/TWEh5IYBdhLWcg3wRRuh#22604492 - A. Niese
cc2 必须至少为 100 个字符。 - Héctor Prats

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接