在T-SQL中设置种子生成均匀分布的随机值

10
我想在 T-SQL 中为给定数据表的每一行生成一个从均匀分布中随机抽取的值,其中平均值=0,标准差=1。此外,我希望设置一个种子以确保分析的可重现性。以下是未能实现上述目标的方法:
  1. 使用声明数字的 RAND() 函数无法实现此目的:数据集的每一行都会生成相同的随机值。

  2. 这样的解决方案:

    SELECT ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) AS [RandomNumber]

也无法解决问题,因为它不具备可重复性。

编辑:

由于我的表格有数亿条记录,因此性能很重要。


您可能会发现以下问题的答案有用:在给定区间内生成不同的随机时间,特别是讨论模乘逆元的答案。 - Vladimir Baranov
5个回答

4

Rand()函数可以通过传递一个整数种子值来在开始时进行初始化。如果在生成任何随机数之前只执行一次此操作,则随机数的序列将可重复。单独生成的值将确保Rand()函数以顺序返回数字。以下代码将生成均值为0,标准差为1的n个伪随机数的均匀分布:

    DECLARE @Mean    FLOAT = 0.0; 
    DECLARE @stDev   FLOAT = 1.0; 
    DECLARE @n   INT = 100000;   -- count of random numbers to generate
    DECLARE @U   TABLE(x FLOAT); -- table of random numbers

    DECLARE @SEED    INT = 123456;    -- seed to ensure list is reproducible
    SELECT RAND(@Seed);

    SET NOCOUNT ON;
    BEGIN TRAN
    DECLARE @x INT = 0; -- counter
    WHILE @x < @n
      BEGIN
      INSERT INTO @U (x)
        SELECT @Mean + (2 * SQRT(3) * @stDev) * (RAND() - 0.5)
      SET @x = @x + 1;
      END;
    COMMIT

-- Check the results    
    SELECT * from @U;

    SELECT AVG([@U].x) AS mean,
        STDEV([@U].x) AS stDev
        FROM @U;

不要在while循环中插入临时表,可以使用游标循环遍历现有表中的记录,并对每条记录进行更新。正如评论中提到的,性能可能是一个问题,但这满足了要求"均匀分布,平均值=0,标准偏差=1"和"可重复性"。Rand()函数的工作方式强制执行"逐个更新"。

下面是一种替代方案,它将具有更好的性能(应该在1百万行下运行不到2秒)并替换Rand()函数。这允许在单个UPDATE中更新记录,但依赖于表中的唯一数字ID字段,并更新名为RandomNumber的字段。 Rand()函数被替换为( (ID * @SEED ) % 1000 ) / 1000,这可能会得到改进。

DECLARE @Mean    FLOAT = 0.0; 
DECLARE @stDev   FLOAT = 1.0; 
DECLARE @SEED numeric(18,0)    = 1234567890.0;    -- seed to ensure list is reproducible

SET NOCOUNT ON;
BEGIN TRAN
UPDATE TestTable
   set Randomnumber = @Mean + (2 * SQRT(3) * @stDev) * (( (ID * @SEED ) % 1000 ) / 1000 - 0.5) 
COMMIT
-- Check the results    
SELECT AVG(RandomNumber) AS mean,
    STDEV(RandomNumber ) AS stDev
    FROM TestTable;

尽管如此,他提到了“数亿行”和“性能问题”。即使您将更新封装在较大的事务块中,仍然需要较长的运行时间。使用游标并对每个记录执行更新。 - deroby
我同意,“逐个”更新会影响性能。我添加了一种替代方案,避免了这种情况并进行了单次更新(以减少“随机性”为代价),但仍满足“均匀分布,平均值=0,标准差=1”和“可重复性”的要求。 - chrisuae

4

在我看来,这里的主要问题是您如何看待“可重复性”?或者换句话说:什么“驱动”了随机性?我可以想象一种解决方案,即只要数据不改变,每次运行都将相同的随机数分配给每个记录。然而,如果数据发生变化,您希望会发生什么?

为了好玩,我对一个(不太具有代表性的)包含100万行的测试表格进行了以下测试:

-- seed
SELECT Rand(0)

-- will show the same random number for EVERY record
SELECT Number, blah = Convert(varchar(100), NewID()), random = Rand()
  INTO #test
  FROM master.dbo.fn_int_list(1, 1000000)

CREATE UNIQUE CLUSTERED INDEX uq0_test ON #test (Number)

SET NOCOUNT ON

GO
DECLARE @start_time datetime = CURRENT_TIMESTAMP,
        @c_number int

-- update each record (one by one) and set the random number based on 'the next Rand()' value
-- => the order of the records drives the distribution of the Rand() value !

-- seed
SELECT @c_number = Rand(0) 

-- update 1 by 1 
DECLARE cursor_no_transaction CURSOR LOCAL STATIC
    FOR SELECT Number
          FROM #test
         ORDER BY Number
OPEN cursor_no_transaction 
FETCH NEXT FROM cursor_no_transaction INTO @c_number
WHILE @@FETCH_STATUS = 0
    BEGIN
        UPDATE #test 
           SET random = Rand()
         WHERE Number = @c_number

        FETCH NEXT FROM cursor_no_transaction INTO @c_number
    END
CLOSE cursor_no_transaction 
DEALLOCATE cursor_no_transaction 

PRINT 'Time needed (no transaction) : ' + Convert(nvarchar(100), DateDiff(ms, @start_time, CURRENT_TIMESTAMP)) + ' ms.'

SELECT _avg = AVG(random), _stdev = STDEV(random) FROM #test

GO

DECLARE @start_time datetime = CURRENT_TIMESTAMP,
        @c_number int

BEGIN TRANSACTION

-- update each record (one by one) and set the random number based on 'the next Rand()' value
-- => the order of the records drives the distribution of the Rand() value !

-- seed
SELECT @c_number = Rand(0) 

-- update 1 by 1 but all of it inside 1 single transaction
DECLARE cursor_single_transaction CURSOR LOCAL STATIC
    FOR SELECT Number
          FROM #test
         ORDER BY Number
OPEN cursor_single_transaction 
FETCH NEXT FROM cursor_single_transaction INTO @c_number
WHILE @@FETCH_STATUS = 0
    BEGIN
        UPDATE #test 
           SET random = Rand()
         WHERE Number = @c_number

        FETCH NEXT FROM cursor_single_transaction INTO @c_number
    END
CLOSE cursor_single_transaction 
DEALLOCATE cursor_single_transaction 

COMMIT TRANSACTION

PRINT 'Time needed (single transaction) : ' + Convert(nvarchar(100), DateDiff(ms, @start_time, CURRENT_TIMESTAMP)) + ' ms.'

SELECT _avg = AVG(random), _stdev = STDEV(random) FROM #test

GO

DECLARE @start_time datetime = CURRENT_TIMESTAMP

-- update each record (single operation), use the Number column to reseed the Rand() function for every record
UPDATE #test 
    SET random = Rand(Number)

PRINT 'Time needed Rand(Number) : ' + Convert(nvarchar(100), DateDiff(ms, @start_time, CURRENT_TIMESTAMP)) + ' ms.'

SELECT _avg = AVG(random), _stdev = STDEV(random) FROM #test

GO

DECLARE @start_time datetime = CURRENT_TIMESTAMP

-- update each record (single operation), use 'a bunch of fields' to reseed the Rand() function for every record
UPDATE #test 
    SET random = Rand(BINARY_CHECKSUM(Number, blah))

PRINT 'Time needed Rand(BINARY_CHECKSUM(Number, blah)) : ' + Convert(nvarchar(100), DateDiff(ms, @start_time, CURRENT_TIMESTAMP)) + ' ms.'

SELECT _avg = AVG(random), _stdev = STDEV(random) FROM #test

结果或多或少如预期:
Time needed (no transaction) : 24570 ms.
_avg                   _stdev
---------------------- ----------------------
0.499630943538644      0.288686960086461

Time needed (single transaction) : 14813 ms.
_avg                   _stdev
---------------------- ----------------------
0.499630943538646      0.288686960086461

Time needed Rand(Number) : 1203 ms.
_avg                   _stdev
---------------------- ----------------------
0.499407423620328      0.291093824839539

Time needed Rand(BINARY_CHECKSUM(Number, blah)) : 1250 ms.
_avg                   _stdev
---------------------- ----------------------
0.499715398881586      0.288579510523627

所有这些都是“可重复的”,问题在于“可重复”的含义是否符合您的要求。我坚持使用AVG()和STDEV()来获取分布的粗略想法,我将让您自己判断它们是否真正适合您的需求(如果不适合,应该如何改进=)。
对于100万行而言,1.2秒的时间听起来并不太糟糕。但是,如果您的表包含额外的列,它将占用更多的空间,因此需要更多的时间!
希望这能帮助您入门...

1
DECLARE @userReportId BIGINT
SET @userReportId = FLOOR(RAND()*(10000000000000-1) + 1);

1
重复的随机数很可能需要在测试出现异常时重复情况,以便再现异常的情况。以下建议将在物理表中填充一个位置和随机数(添加索引!)。使用此列表与简单的连接将每行与随机数连接起来。每次调用都会将相同的随机数绑定到给定的行。可以通过重新定位随机数(或截断-重新填充或删除-重新创建表)来更改此内容。这应该非常快...
CREATE TABLE dbo.MyRepeatableRandoms(CurrentPosition BIGINT,RandomNumber BIGINT);
GO
DECLARE @CountOfNumbers INT=5; --set a fitting max count here
WITH Tally AS
(
 SELECT TOP(@CountOfNumbers) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nr 
 FROM master..spt_values 
      CROSS JOIN master..spt_values X 
      CROSS JOIN master..spt_values Y
)
INSERT INTO dbo.MyRepeatableRandoms
SELECT Nr,CAST(CAST(NEWID() AS VARBINARY(8)) AS BIGINT) FROM Tally;

--Use this list with a simple join to bind it to the rows of your table
SELECT * FROM dbo.MyRepeatableRandoms ORDER BY CurrentPosition;

--Re-Position the list
WITH UpdateableCTE AS
(
 SELECT ROW_NUMBER() OVER(ORDER BY A.posOrder) AS NewPos
       ,CurrentPosition
 FROM dbo.MyRepeatableRandoms
 CROSS APPLY(SELECT NEWID() AS posOrder) AS A
)
UPDATE UpdateableCTE SET CurrentPosition=NewPos;

--The same random numbers at new positions
SELECT * FROM MyRepeatableRandoms ORDER BY CurrentPosition;

GO
DROP TABLE dbo.MyRepeatableRandoms

结果

    RandomNumber
1   -1939965404062448822
2   2786711671511266125
3   -3236707863137400753
4   -6029509773149087675
5   7815987559555455297

重新定位后
    RandomNumber
1   7815987559555455297
2   -1939965404062448822
3   2786711671511266125
4   -6029509773149087675
5   -3236707863137400753

0

这里有一个纯粹的、简单的 SQL 近似值:

select iif(rand(rand(id)) < .5, -1, 1) * sqrt(1 - exp(-1.27323954474*rand(id)*rand(id) * 
  (1 + 0.0586276296*rand(id)*rand(id)) / (1 + 0.0886745239*rand(id)*rand(id))))
from mytable

我选择了id列作为种子,但你可以选择任何对你来说最有意义的列。例如,你可以将rand(id)更改为rand(some_other_column)

这个公式基于这个数学近似


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接