Efficiently randomize (shuffle) data in a SQL Server table


I have a table whose data I need to randomize. By randomize, I mean updating a column in one row with that column's value from some other, randomly chosen row. The problem is that the table itself is large (over 2,000,000 rows).

I wrote some code that uses a while loop, but it is quite slow.

Does anyone have suggestions for doing this randomization more efficiently?


Basically, I select all the IDs from my table into a temporary table, then pick an ID from it, look up the value from some random row, and perform the update. After that I delete the ID from my temporary table. I generate the random row with: SELECT TOP 1 MyColumn FROM MyTable WHERE Id >= RAND() * NumberOfRowsInTable - Milhad
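For reference, here is a minimal sketch of the loop-based approach described above, assuming a single column MyColumn to shuffle and an int primary key Id (the variable names and the VARCHAR(100) type are placeholders, not from the original post):

DECLARE @Ids TABLE (Id INT PRIMARY KEY);
INSERT INTO @Ids (Id) SELECT Id FROM MyTable;

DECLARE @CurrentId INT, @RandomValue VARCHAR(100);

WHILE EXISTS (SELECT 1 FROM @Ids)
BEGIN
    -- Take one remaining Id to update.
    SELECT TOP 1 @CurrentId = Id FROM @Ids;

    -- Pick a value from a random row (same idea as the RAND()-based lookup above).
    SELECT TOP 1 @RandomValue = MyColumn FROM MyTable ORDER BY NEWID();

    UPDATE MyTable SET MyColumn = @RandomValue WHERE Id = @CurrentId;

    DELETE FROM @Ids WHERE Id = @CurrentId;
END

One statement per row like this is exactly why the loop is slow on a table with 2,000,000+ rows.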
5 Answers


Updating the rows is what will take significant processing time (CPU + I/O).

Have you measured the relative cost of picking the random rows versus performing the updates?
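If it helps, one way to compare those two costs in SQL Server is to turn on time and I/O statistics and run each step separately (a rough sketch; the sample statements are placeholders for your own):

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Step 1: cost of picking random values only (no write).
SELECT TOP 1 MyColumn FROM MyTable ORDER BY NEWID();

-- Step 2: cost of a representative update; compare CPU time and logical reads with step 1.
-- UPDATE MyTable SET MyColumn = 'x' WHERE Id = 1;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;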

If you only need to pick random rows, here is an efficient way to select a random subset of rows (in this case 1% of the rows):

SELECT * FROM myTable
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), pkID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

pkID is your primary key column.

This article may help:


I have a table that stores private user data (first name, last name, VIN, etc.). I want to obfuscate that data so nobody can recover the real information, and so that I can use the database for testing purposes. Shuffling whole rows doesn't help me; I need a complete update of every row. - Milhad

I combined the answers found above into a single query. The query re-shuffles every column independently, which ultimately produces completely randomized records.
UPDATE MyTable SET
  columnA = columnA.newValue,
  columnB = columnB.newValue,
  -- Optionally, for maintaining a group of values like street, zip, city in an address
  columnC = columnGroup.columnC,
  columnD = columnGroup.columnD,
  columnE = columnGroup.columnE
FROM MyTable
INNER JOIN (
  SELECT ROW_NUMBER() OVER (ORDER BY id) AS rn, id FROM MyTable
) AS PKrows ON MyTable.id = PKrows.id
-- repeat the following JOIN for each column you want to randomize
INNER JOIN (
  SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnA AS newValue FROM MyTable
) AS columnA ON PKrows.rn = columnA.rn
INNER JOIN (
  SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnB AS newValue FROM MyTable
) AS columnB ON PKrows.rn = columnB.rn

-- Optionally, if you want to maintain a group of values spread out over several columns
INNER JOIN (
  SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnC, columnD, columnE FROM MyTable
) AS columnGroup ON PKrows.rn = columnGroup.rn

This query took 8 seconds on a 10K-row, 8-column table, running on a Windows 2008 R2 machine with 16 GB of RAM and four 2.93 GHz XEON cores.



Simpler, original answer:

Following the link in Mitch Wheat's answer to the article on obfuscating data, you can use the following to scramble a whole set of fields, not just an ID:

;WITH Randomize AS 
( 
SELECT ROW_NUMBER() OVER (ORDER BY [UserID]) AS orig_rownum, 
      ROW_NUMBER() OVER (ORDER BY NewId()) AS new_rownum, 
      * 
FROM [UserTable]
) 
UPDATE T1 
   SET [UserID] = T2.[UserID]
      ,[FirstName] = T2.[FirstName]
      ,[LastName] = T2.[LastName]
      ,[AddressLine1] =  T2.[AddressLine1]
      ,[AddressLine2] =  T2.[AddressLine2]
      ,[AddressLine3] =  T2.[AddressLine3]
      ,[City] = T2.[City]
      ,[State] = T2.[State]
      ,[Pincode] = T2.[Pincode]
      ,[PhoneNumber] = T2.[PhoneNumber]
      ,[MobileNumber] = T2.[MobileNumber]
      ,[Email] = T2.[Email]
      ,[Status] = T2.[Status] 
FROM Randomize T1 
      join Randomize T2 on T1.orig_rownum = T2.new_rownum 
;

So you are not limited to what the article demonstrates:
;WITH Randomize AS 
( 
SELECT ROW_NUMBER() OVER (ORDER BY Id) AS orig_rownum, 
      ROW_NUMBER() OVER (ORDER BY NewId()) AS new_rownum, 
      * 
FROM [MyTable]
) 
UPDATE T1 SET Id = T2.Id 
FROM Randomize T1 
      join Randomize T2 on T1.orig_rownum = T2.new_rownum 
;

The danger with this approach is the amount of data you are working with. The common table expression (CTE) pulls all of it into memory, so while I found it quite fast (19 seconds for a 500k-row table), you need to be careful with tables that run into millions of records. You should also consider how much data you actually need, or what would make a good representative sample for test and development.
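For example, instead of shuffling the full table, a smaller random sample could be copied out first and only that copy obfuscated; a minimal sketch using the same CHECKSUM(NEWID(), ...) technique shown earlier (the 10% figure and the UserTable_DevSample name are placeholders):

-- Copy roughly 10% of the rows into a separate dev/test table before obfuscating it.
SELECT *
INTO UserTable_DevSample
FROM [UserTable]
WHERE 0.10 >= CAST(CHECKSUM(NEWID(), [UserID]) & 0x7fffffff AS float) / CAST(0x7fffffff AS int);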

Edited/updated answer:

As I mention in a comment below, I recommend finding a tool for this and buying a license. For example: https://www.red-gate.com/products/oracle-development/data-masker-for-oracle/

Sometimes we are constrained by budget, and in that case this is how I "randomize" data. Here is an example, followed by notes/warnings; it runs on MSSQL 2017:

-- DROP/CREATE FAKEY TEMP TABLE --
IF OBJECT_ID('tempdb..#Employees') IS NOT NULL
    DROP TABLE #Employees

CREATE TABLE #Employees (
     Id INT IDENTITY PRIMARY KEY
    ,EmployeeNumber VARCHAR(20)
    ,UserName VARCHAR(384)
    ,FirstName VARCHAR(128)
    ,MiddleName VARCHAR(128)
    ,LastName VARCHAR(256)
    ,HireDate DATETIME2
    ,BirthDate DATETIME2
    ,Email VARCHAR(384)
)

/*
==== POPULATE FAKEY TEMP TABLE ====
Generated here: https://www.mockaroo.com/
VimFu turns CSV to INSERT statements.
*/
INSERT INTO #Employees VALUES('29-7477088','cstudman0','Moyna','Cobb','Studman','2004-05-03 22:45:55','1991-09-02 18:14:58','cstudman0@amazon.com')
INSERT INTO #Employees VALUES('39-0211165','cgremane1','Cristabel','Carlye','Gremane','2016-06-24 06:31:06','2006-01-09 21:17:28','cgremane1@exblog.jp')
INSERT INTO #Employees VALUES('50-1527461','jandree2','Kaja','Justinian','Andree','2020-04-03 06:48:12','2015-10-21 00:48:05','jandree2@blinklist.com')
INSERT INTO #Employees VALUES('53-4435748','pportman3','Zachery','Prudi','Portman','2018-07-23 20:27:23','2017-10-17 18:38:06','pportman3@squidoo.com')
INSERT INTO #Employees VALUES('51-6508890','lnutbeam4','Bobbee','Lilah','Nutbeam','2017-02-03 00:37:52','2002-08-25 11:16:41','lnutbeam4@behance.net')
INSERT INTO #Employees VALUES('85-8633318','gleach5','Jaquith','Glenda','Leach','2019-03-12 19:32:33','2008-07-17 08:42:00','gleach5@google.fr')
INSERT INTO #Employees VALUES('54-1590858','cwandrach6','Anna-diane','Conrad','Wandrach','1997-12-10 06:09:28','1994-07-22 08:15:27','cwandrach6@phoca.cz')
INSERT INTO #Employees VALUES('79-7072949','mrankling7','Zeke','Molli','Rankling','2018-12-20 02:35:47','2020-03-10 22:28:33','mrankling7@cargocollective.com')
INSERT INTO #Employees VALUES('92-8250045','emebius8','Tabor','Eachelle','Mebius','1994-04-15 09:16:44','1995-02-20 17:17:29','emebius8@bloglovin.com')
INSERT INTO #Employees VALUES('87-4813998','olyes9','Serene','Ophelie','Lyes','2002-10-08 23:38:19','1996-08-08 03:08:25','olyes9@abc.net.au')
INSERT INTO #Employees VALUES('31-3555215','ykornasa','Faydra','Yancy','Kornas','2004-02-23 12:32:51','1995-10-24 12:27:54','ykornasa@intel.com')
INSERT INTO #Employees VALUES('16-6787860','pgoldsbyb','Cristian','Phyllis','Goldsby','2012-02-26 11:55:51','1994-03-04 07:18:46','pgoldsbyb@posterous.com')
INSERT INTO #Employees VALUES('11-2332324','bnavarijoc','Werner','Bartlett','Navarijo','2013-02-18 09:02:08','2013-07-19 19:34:39','bnavarijoc@sciencedirect.com')
INSERT INTO #Employees VALUES('69-9128351','rprobartd','Wolfgang','Ricardo','Probart','1999-01-16 03:33:25','2014-08-02 06:07:44','rprobartd@php.net')
INSERT INTO #Employees VALUES('83-0754916','mmaclachlane','Johannah','Mozelle','MacLachlan','2018-09-28 01:02:17','2004-11-11 06:25:56','mmaclachlane@goodreads.com')
INSERT INTO #Employees VALUES('86-9635344','hortegaf','Giulietta','Hetti','Ortega','2003-12-01 07:03:20','2006-11-27 11:53:03','hortegaf@vistaprint.com')
INSERT INTO #Employees VALUES('87-5426191','akeatingg','Gilbertine','Anabelle','Keating','1991-09-04 07:24:51','2018-09-27 15:30:29','akeatingg@sitemeter.com')
INSERT INTO #Employees VALUES('94-4568960','aalsinah','Rhona','Alysia','Alsina','1994-12-19 09:09:01','2010-02-12 14:23:18','aalsinah@patch.com')
INSERT INTO #Employees VALUES('32-6080729','tgarradi','Bank','Trumann','Garrad','2015-06-07 05:42:30','2002-11-06 08:49:17','tgarradi@google.co.jp')
INSERT INTO #Employees VALUES('88-0899323','fghentj','Daphene','Feodor','Ghent','2003-04-19 14:19:10','2006-08-07 02:19:51','fghentj@hibu.com')
INSERT INTO #Employees VALUES('00-7608833','arosebladek','Ed','Andrea','Roseblade','2019-09-05 00:25:28','2006-10-24 07:12:49','arosebladek@china.com.cn')
INSERT INTO #Employees VALUES('49-8817469','achallinl','Mandel','Andonis','Challin','2002-12-25 16:34:16','2003-08-15 15:52:04','achallinl@unc.edu')
INSERT INTO #Employees VALUES('89-3690501','pgurkom','Jock','Peterus','Gurko','1995-09-18 02:50:40','1992-11-14 05:57:15','pgurkom@ehow.com')
INSERT INTO #Employees VALUES('70-7699938','rklimpn','Florida','Ricky','Klimp','1991-11-10 01:36:31','1997-10-18 06:09:28','rklimpn@google.com.br')
INSERT INTO #Employees VALUES('57-7312244','vrentilllo','Tremaine','Verla','Rentilll','2005-07-06 16:21:15','1993-02-04 00:57:48','vrentilllo@dropbox.com')
INSERT INTO #Employees VALUES('49-5173804','fcarenp','Carly','Fabiano','Caren','2019-10-19 22:59:51','2006-04-28 05:18:18','fcarenp@livejournal.com')
INSERT INTO #Employees VALUES('82-6348827','lwhitewoodq','Gavin','Loren','Whitewood','2019-12-02 22:20:23','1998-07-25 13:15:13','lwhitewoodq@clickbank.net')
INSERT INTO #Employees VALUES('15-1749593','veronier','Candace','Vincenty','Eronie','2001-08-21 12:18:37','2000-11-09 00:43:32','veronier@umich.edu')
INSERT INTO #Employees VALUES('28-1121210','dharuards','Clim','Dionysus','Haruard','1996-02-19 20:43:40','2013-10-26 16:11:15','dharuards@drupal.org')
INSERT INTO #Employees VALUES('89-0288115','dstigglest','Katalin','Danit','Stiggles','2007-03-25 01:04:44','2000-09-24 00:22:34','dstigglest@walmart.com')
INSERT INTO #Employees VALUES('74-3532636','tmelroseu','Gregorio','Thoma','Melrose','2010-10-26 12:55:55','2012-10-03 09:51:15','tmelroseu@mail.ru')
INSERT INTO #Employees VALUES('22-8027830','hbeazev','Peirce','Hettie','Beaze','2001-07-30 16:36:55','1992-07-17 13:21:09','hbeazev@example.com')
INSERT INTO #Employees VALUES('62-2571142','etoonw','Carlita','Emmett','Toon','2015-07-13 02:44:51','2001-09-18 07:06:14','etoonw@typepad.com')
INSERT INTO #Employees VALUES('99-2911468','ayarntonx','Caye','Antonin','Yarnton','2019-02-14 23:26:54','2020-04-08 12:42:46','ayarntonx@cbc.ca')
INSERT INTO #Employees VALUES('33-8734931','tmationy','Coreen','Tomi','Mation','2011-06-18 09:49:40','2014-05-06 01:34:35','tmationy@issuu.com')
INSERT INTO #Employees VALUES('89-1230660','gordeltz','Sammie','Granny','Ordelt','1996-05-10 03:58:30','1991-05-18 11:19:27','gordeltz@wunderground.com')
INSERT INTO #Employees VALUES('80-9988795','ilaverock10','Selina','Isabelle','Laverock','2015-06-29 12:38:49','2002-07-27 00:43:15','ilaverock10@ed.gov')
INSERT INTO #Employees VALUES('61-0387718','sbraithwaite11','Riordan','Sonnnie','Braithwaite','2000-06-08 03:36:23','1998-12-29 14:13:12','sbraithwaite11@engadget.com')
INSERT INTO #Employees VALUES('75-4850210','bolagen12','Lurette','Bradley','O''Lagen','1990-01-05 10:55:13','1995-04-26 17:47:18','bolagen12@yandex.ru')
INSERT INTO #Employees VALUES('40-2385038','riron13','Mignon','Ronny','Iron','2004-06-05 19:46:00','2006-05-02 15:17:31','riron13@techcrunch.com')
INSERT INTO #Employees VALUES('97-5320734','kkibbe14','Manon','Kyrstin','Kibbe','1993-02-25 16:49:38','2006-09-03 04:20:53','kkibbe14@opensource.org')
INSERT INTO #Employees VALUES('54-6524877','sstorms15','Shaughn','Sandra','Storms','2013-09-30 17:52:42','1996-10-28 03:54:59','sstorms15@whitehouse.gov')
INSERT INTO #Employees VALUES('11-8287102','tkamen16','Allyn','Tim','Kamen','1991-06-16 08:46:49','2020-04-11 01:15:58','tkamen16@issuu.com')
INSERT INTO #Employees VALUES('38-6081847','ralpe17','Cathrin','Rubetta','Alpe','2017-03-26 06:50:37','1993-05-30 05:11:09','ralpe17@wordpress.com')
INSERT INTO #Employees VALUES('79-8160581','iparsonson18','Gael','Isabelita','Parsonson','2015-06-28 22:51:35','2001-08-26 14:07:54','iparsonson18@amazon.de')
INSERT INTO #Employees VALUES('56-6871425','movendon19','Nikki','Margalit','Ovendon','2019-05-13 15:34:38','2013-10-02 09:38:17','movendon19@soundcloud.com')
INSERT INTO #Employees VALUES('86-6665602','nfulep1a','Odessa','Nessa','Fulep','2010-04-21 14:38:58','2002-07-24 13:45:01','nfulep1a@pagesperso-orange.fr')
INSERT INTO #Employees VALUES('41-6048604','hstrang1b','Tucky','Hammad','Strang','2011-06-07 02:45:51','2013-10-05 06:21:30','hstrang1b@sohu.com')
INSERT INTO #Employees VALUES('55-0900438','hmaccleay1c','Winny','Harmon','MacCleay','1997-02-12 01:56:29','2014-11-22 08:01:34','hmaccleay1c@zdnet.com')
INSERT INTO #Employees VALUES('12-7876247','apaty1d','Heidi','Andris','Paty','2006-07-31 02:42:55','2001-06-10 05:38:51','apaty1d@shutterfly.com')



-- SELECT EVERYTHING BEFORE WE MANGLE DATA --
SELECT *
FROM #Employees


-- MANGLE THE DATA --
UPDATE
     E
SET
     E.EmployeeNumber = (SELECT TOP 1 EmployeeNumber FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.UserName = (SELECT TOP 1 UserName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.FirstName = (SELECT TOP 1 FirstName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.LastName = (SELECT TOP 1 LastName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.MiddleName = (SELECT TOP 1 MiddleName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.HireDate = (SELECT TOP 1 HireDate FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.BirthDate = (SELECT TOP 1 BirthDate FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.Email = (SELECT TOP 1 Email FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())

FROM #Employees E

-- SELECT MANGLED DATA --
SELECT *
FROM #Employees

The important part of this example is how we do the update and fetch the random values:

UPDATE
     E
SET
     E.EmployeeNumber = (SELECT TOP 1 EmployeeNumber FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.UserName = (SELECT TOP 1 UserName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.FirstName = (SELECT TOP 1 FirstName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.LastName = (SELECT TOP 1 LastName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.MiddleName = (SELECT TOP 1 MiddleName FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.HireDate = (SELECT TOP 1 HireDate FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.BirthDate = (SELECT TOP 1 BirthDate FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())
    ,E.Email = (SELECT TOP 1 Email FROM #Employees WHERE E.Id = E.Id ORDER BY NEWID())

FROM #Employees E

Note the WHERE clause in the subqueries:

WHERE E.Id = E.Id 

That is just a trick to force SQL Server to re-execute the subquery for every row. If the subquery had no WHERE clause, we would get the same row/value back each time, because the subquery would only be evaluated once.
You need to think about unique constraints on any column you are "randomizing"; you are very likely to run into duplicates. In the example above that could happen on a column like UserName (a quick check for this is sketched after these notes).
Also, depending on your DDL, foreign key constraints can become a problem with this approach.
Finally, this is far from a perfect solution; if you have highly sensitive data such as healthcare information, you should use a certified tool/method.
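As a sanity check for the unique-constraint concern above, a query along these lines lists values that ended up duplicated after the shuffle (UserName is just the example column from the table above):

-- Find values that became duplicated and would violate a unique constraint or index.
SELECT UserName, COUNT(*) AS Occurrences
FROM #Employees
GROUP BY UserName
HAVING COUNT(*) > 1;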

Just to ask: this technique only swaps whole rows, right? If you want to shuffle first, middle, and last names independently (so the combinations change), you would have to do it as three separate statements, so that you get three new sets of random IDs? - Mir
@Mir - Correct. This swaps values between two rows, and you can do it for some or all of the columns. I have an updated and upgraded version that uses 2 or 3 randomly selected rows to update the current row; I'll see if I can find it and update the answer. PS: Depending on the use case, you may want to look at a product that just generates random data, for example: https://www.red-gate.com/products/sql-development/sql-data-generator/ PPS: That link is not an endorsement; I haven't used that particular Redgate product, only other similar ones. So, be careful. - virtualadrian


Shuffling 10 columns of data so that each row's 10 values are replaced by values from other rows is going to be an expensive operation.

You have to read the 2 million rows 10 times.

The SELECT would look like this:

SELECT
    FirstName, LastName, VIN, ...
FROM
    (SELECT FirstName FROM MyTable ORDER BY NEWID()) FirstName
    JOIN
    (SELECT LastName FROM MyTable ORDER BY NEWID()) LastName ON 1=1
    JOIN
    (SELECT VIN FROM MyTable ORDER BY NEWID()) VIN ON 1=1
    JOIN
    ...

I would not update; I would create a new table:
SELECT
    FirstName, LastName, VIN, ...
INTO
    StagingTable
FROM
    (SELECT FirstName FROM MyTable ORDER BY NEWID()) FirstName
    JOIN
    (SELECT LastName FROM MyTable ORDER BY NEWID()) LastName ON 1=1
    JOIN
    (SELECT VIN FROM MyTable ORDER BY NEWID()) VIN ON 1=1
    JOIN
    ...

Then add keys etc., drop the old table and rename the new one. Or use a SYNONYM to point at the new table.
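A rough sketch of that swap, using the object names from the example above (the MyTable_old and MyTableSyn names are placeholders):

-- Option 1: rename the tables so the shuffled copy takes the original's place.
EXEC sp_rename 'MyTable', 'MyTable_old';
EXEC sp_rename 'StagingTable', 'MyTable';

-- Option 2: leave both tables in place and repoint a synonym at the new one.
IF OBJECT_ID('MyTableSyn', 'SN') IS NOT NULL DROP SYNONYM MyTableSyn;
CREATE SYNONYM MyTableSyn FOR dbo.StagingTable;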
If you do want to update, I would do it like this. Or split it into 10 separate updates:
UPDATE
   M
SET
   Firstname = FirstName.FirstName,
   LastName = LastName.LastName,
   ...
FROM
    MyTable M
    JOIN 
    (SELECT FirstName FROM MyTable ORDER BY NEWID()) FirstName ON 1=1
    JOIN
    (SELECT LastName FROM MyTable ORDER BY NEWID()) LastName ON 1=1
    JOIN
    (SELECT VIN FROM MyTable ORDER BY NEWID()) VIN ON 1=1
    JOIN
    ...

In SQL Server 2008 (at least), these joins fail with Msg 1033: "The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP or FOR XML is also specified." - T.J. Crowder
@T.J.Crowder That's a different question. In any case, why would you use ORDER BY in a view? - gbn
@gbn: You use it in the joins above (ORDER BY NEWID()), presumably to get a sufficiently random order when shuffling. I was just trying to run exactly the update shown above, with the only change being the table and column names. In the end I did it with a common table expression. Not efficient, but it got the job done. - T.J. Crowder


You just need to use a random ordering in your SELECT statement.

Quick and simple:

In SQL Server:

SELECT * FROM TableName
order by NEWID()

In Oracle:

SELECT * FROM TableName
ORDER BY DBMS_RANDOM.VALUE

In MySQL:

SELECT * FROM TableName
ORDER BY RAND()

In PostgreSQL:

SELECT * FROM TableName
ORDER BY random()

Conclusion

Randomly ordering a SQL query result set is a common requirement, so it is useful to know which SQL function you can use in the ORDER BY clause, depending on the underlying relational database.

