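Suppose the Order table has a St_Id column. From the name alone you might infer that it relates to either the State table or the Status table. The St_Id column has 6 distinct values, and 2 of those values cover 90% of the records. The State table has 200 rows and the Status table has 9. You can reasonably infer that St_Id relates to Status: it gives much higher row coverage of the parent table (6 of 9 rows, or 2/3, are "used", whereas only 3% of the rows in State would be).

This is not an easy task in most cases. If you are lucky enough to be analysing an application built on a modern framework (such as Ruby on Rails or CakePHP), and the developers followed the column naming conventions strictly, you have a reasonable chance of finding many, but not all, of the implied relationships.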
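That is the case when your tables use columns such as user_id to reference entries in the users table.
Note: some entity names pluralise irregularly (for example, entity becomes entities, not entitys). These are harder to catch, but it can still be done. A key such as admin_id that joins to user.id on the users table, however, cannot be inferred from the name. Those cases need to be handled manually.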
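The naming convention described above can be sketched in a few lines of Python. This is only an illustration under assumed rules: the function names candidate_parent_tables and resolve_parent are made up for this example, and the pluralisation rules cover a handful of common English patterns rather than every irregular form.

```python
import re

def candidate_parent_tables(column_name):
    """Return plausible parent table names for a <name>_id column.

    Returns an empty list when the column does not follow the convention.
    The plural rules are a rough heuristic, not a complete inflector.
    """
    if len(column_name) <= 3 or not column_name.lower().endswith("_id"):
        return []
    stem = column_name[:-3]                    # strip the trailing "_id"
    candidates = [stem]                        # some schemas use singular table names
    if re.search(r"[^aeiou]y$", stem):
        candidates.append(stem[:-1] + "ies")   # entity -> entities
    elif re.search(r"(s|x|z|ch|sh)$", stem):
        candidates.append(stem + "es")         # bus -> buses
    else:
        candidates.append(stem + "s")          # user -> users
    return candidates

def resolve_parent(column_name, table_names):
    """Return the first existing table that matches a candidate name, else None."""
    tables = {t.lower(): t for t in table_names}
    for candidate in candidate_parent_tables(column_name):
        if candidate.lower() in tables:
            return tables[candidate.lower()]
    return None
```

With a hypothetical table list such as ["users", "entities"], resolve_parent("entity_id", ...) finds entities, while resolve_parent("admin_id", ...) returns None, which is exactly the kind of column that still needs manual review.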
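You didn't specify an RDBMS, but I use MySQL a lot and am working on this same problem at the moment.
The following MySQL script infers most of the relationships implied by column names. It then lists every inferred parent for which it could not find a matching table, so that you at least know which relationships you are missing. The output gives the inferred parent in singular and plural form, the child table, and the implied relationship: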
-- this DB is where MySQL keeps schema information
use information_schema;
-- change this to the DB you want to analyse
set @db_name = "example_DB";
-- infer relationships
-- NB: this won't catch names that pluralise irregularly like category -> categories or bus_id -> buses etc.
select LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ) as inferred_parent_singular
, CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s") as inferred_parent_plural
, C.TABLE_NAME as child_table
, CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME)-3), "s has many ", C.TABLE_NAME) as inferred_relationship
from COLUMNS C
JOIN TABLES T on C.TABLE_NAME = T.TABLE_NAME
and C.TABLE_SCHEMA = T.TABLE_SCHEMA
and T.TABLE_TYPE != "VIEW" -- filter out views; comment this line if you want to include them
where COLUMN_NAME like "%\_id" -- look for columns of the form <name>_id; the underscore is escaped because a bare _ is a single-character wildcard in LIKE
and T.TABLE_SCHEMA = @db_name -- the schemas of C and T are already matched in the join above
-- and C.TABLE_NAME not like "wwp%" -- uncomment and set a pattern to filter out any tables you DON'T want included, e.g. wordpress tables
-- finally make sure to filter out any inferred names that aren't really tables
and CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s") -- this is the inferred_parent_plural, but can't use column aliases in the where clause sadly
in (select TABLE_NAME from TABLES where TABLE_SCHEMA = @db_name)
;
-- Now list any inferred parents that weren't real tables to see why (irregular plurals and columns not named according to convention)
select LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ) as inferred_parent_singular
, CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s") as inferred_parent_plural
, C.TABLE_NAME as child_table
from COLUMNS C
JOIN TABLES T on C.TABLE_NAME = T.TABLE_NAME
and C.TABLE_SCHEMA = T.TABLE_SCHEMA
and T.TABLE_TYPE != "VIEW" -- filter out views, comment this line if you want to include them
where COLUMN_NAME like "%\_id" -- escaped underscore, so that only names really ending in _id match
and T.TABLE_SCHEMA = @db_name -- the schemas of C and T are already matched in the join above
-- and C.TABLE_NAME not like "wwp%" -- uncomment and set a pattern to filter out any tables you DON'T want included, e.g. wordpress tables
-- this time only include inferred names that aren't real tables
and CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s")
not in (select TABLE_NAME from TABLES where TABLE_SCHEMA = @db_name)
;
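One detail worth checking in any filter for <name>_id columns: in a SQL LIKE pattern the underscore is itself a wildcard that matches any single character, so an unescaped pattern also matches names such as paid. The sketch below demonstrates this with Python's built-in sqlite3 module, used here only as a convenient stand-in; the helper like_matches is a made-up name. SQLite needs an explicit ESCAPE clause, whereas MySQL treats the backslash as the LIKE escape character by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def like_matches(value, pattern, escaped=False):
    """Evaluate `value LIKE pattern` in SQLite, optionally with backslash as the escape character."""
    sql = "SELECT ? LIKE ? ESCAPE '\\'" if escaped else "SELECT ? LIKE ?"
    return bool(conn.execute(sql, (value, pattern)).fetchone()[0])

# Unescaped: "_" matches any single character, so "paid" slips through.
unescaped_paid = like_matches("paid", "%_id")
unescaped_user_id = like_matches("user_id", "%_id")

# Escaped: only names that literally end in "_id" match.
escaped_paid = like_matches("paid", "%\\_id", escaped=True)
escaped_user_id = like_matches("user_id", "%\\_id", escaped=True)
```

Here both unescaped checks come back true, while the escaped pattern rejects paid and still accepts user_id.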
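To handle the irregular plurals, a function could take the <name>_id column name as a parameter, strip the _id part, and then apply some heuristics to try to form the plural correctly.

ERwin http://www.ascent.co.za/products/ca_erwin_data_profiler.html
and XCaseForI http://xcasefori.com/discovering/index.html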
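A statistical approach like the one Kirk suggested, one that can provide measures such as range distributions and a similarity ranking of creation times, seems like the right way to go. I need to implement it with SAS EG or any free tool.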
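The row-coverage reasoning from the St_Id example can be sketched without any special tooling. The following Python is a minimal illustration, not a profiler: rank_parent_candidates is a made-up name, the key values are invented to match the row counts quoted in the example (6 distinct St_Id values, 200 State rows, 9 Status rows), and a real run would load the column values from the database first.

```python
def rank_parent_candidates(child_values, parent_keys_by_table):
    """Rank candidate parent tables for one child column.

    For each candidate the result holds (table, contained, coverage):
      contained = share of the child's distinct values found among the parent keys
      coverage  = share of the parent's keys that the child column actually uses
    Best candidates come first.
    """
    distinct = {v for v in child_values if v is not None}
    ranking = []
    for table, keys in parent_keys_by_table.items():
        keys = set(keys)
        if not keys:
            continue  # an empty table cannot be scored
        shared = len(distinct & keys)
        contained = shared / len(distinct) if distinct else 0.0
        coverage = shared / len(keys)
        ranking.append((table, contained, coverage))
    ranking.sort(key=lambda r: (r[1], r[2]), reverse=True)
    return ranking

# Invented data shaped like the example: St_Id takes the values 1..6.
st_id_values = [1, 2, 1, 1, 2, 3, 4, 5, 6, 2]
candidates = {
    "State": range(1, 201),   # 200 rows
    "Status": range(1, 10),   # 9 rows
}
ranking = rank_parent_candidates(st_id_values, candidates)
```

Both tables contain every St_Id value, so containment alone cannot separate them; coverage is what puts Status (6 of 9 keys used) ahead of State (6 of 200).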
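You can compare SELECT COUNT(DISTINCT colname) with SELECT COUNT(*) to find candidate primary keys, and find candidate foreign keys by testing whether a column's set of values is contained in the set of values of a candidate primary key, which can also be done with a single SELECT query. This will produce a lot of false positives. - reinierpost

I don't know of any software that will help you search for what you need, but the following query will get you started. It lists all foreign key relationships in the current database.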
SELECT
    FK_Table = FK.TABLE_NAME,
    FK_Column = CU.COLUMN_NAME,
    PK_Table = PK.TABLE_NAME,
    PK_Column = PT.COLUMN_NAME,
    Constraint_Name = C.CONSTRAINT_NAME
FROM
    INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS C
    -- the table that holds the foreign key
    INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS FK
        ON C.CONSTRAINT_NAME = FK.CONSTRAINT_NAME
    -- the table that holds the referenced key
    INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS PK
        ON C.UNIQUE_CONSTRAINT_NAME = PK.CONSTRAINT_NAME
    -- the foreign key column
    INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE CU
        ON C.CONSTRAINT_NAME = CU.CONSTRAINT_NAME
    -- the primary key column of the referenced table
    INNER JOIN (
        SELECT
            i1.TABLE_NAME,
            i2.COLUMN_NAME
        FROM
            INFORMATION_SCHEMA.TABLE_CONSTRAINTS i1
            INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE i2
                ON i1.CONSTRAINT_NAME = i2.CONSTRAINT_NAME
        WHERE
            i1.CONSTRAINT_TYPE = 'PRIMARY KEY'
    ) PT
        ON PT.TABLE_NAME = PK.TABLE_NAME