SHA hash for training/validation/test set split

Question

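Below is a short excerpt from the full code.

I am trying to understand the logic behind this splitting method.

A SHA1 digest is 40 hexadecimal characters. What kind of probability is this expression computing?
What is the reason for (MAX_NUM_IMAGES_PER_CLASS + 1)? Why add 1?
Does setting MAX_NUM_IMAGES_PER_CLASS to a different value affect the quality of the split?
What quality of split can we expect from this? Is this a recommended way to split a dataset?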

  # We want to ignore anything after '_nohash_' in the file name when
  # deciding which set to put an image in, the data set creator has a way of
  # grouping photos that are close variations of each other. For example
  # this is used in the plant disease data set to group multiple pictures of
  # the same leaf.
  hash_name = re.sub(r'_nohash_.*$', '', file_name)
  # This looks a bit magical, but we need to decide whether this file should
  # go into the training, testing, or validation sets, and we want to keep
  # existing files in the same set even if more files are subsequently
  # added.
  # To do that, we need a stable way of deciding based on just the file name
  # itself, so we do a hash of that and then use that to generate a
  # probability value that we use to assign it.
  hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
  percentage_hash = ((int(hash_name_hashed, 16) %
                      (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                     (100.0 / MAX_NUM_IMAGES_PER_CLASS))
  if percentage_hash < validation_percentage:
    validation_images.append(base_name)
  elif percentage_hash < (testing_percentage + validation_percentage):
    testing_images.append(base_name)
  else:
    training_images.append(base_name)

  result[label_name] = {
      'dir': dir_name,
      'training': training_images,
      'testing': testing_images,
      'validation': validation_images,
      }

- Ujjwal

1 Answer

- Davis Herring · Accepted Answer

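This code simply distributes file names "randomly" (but reproducibly) over a number of bins and then groups the bins into just three categories. The number of bits in the hash is irrelevant, so long as there are "enough", which is probably about 35 for this sort of work.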
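
The reproducible binning described above can be tried out with a minimal sketch. Several things here are assumptions rather than part of the excerpt: the value of MAX_NUM_IMAGES_PER_CLASS (its definition is not shown in the question), the helper name assign_split, the made-up file names, and the use of str.encode in place of compat.as_bytes.

```python
import hashlib
import re
from collections import Counter

# Assumed value; the excerpt does not show how this constant is defined.
MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1


def assign_split(file_name, validation_percentage, testing_percentage):
    """Return 'validation', 'testing', or 'training' for a file name."""
    # Drop everything after '_nohash_' so grouped variants share one hash.
    hash_name = re.sub(r'_nohash_.*$', '', file_name)
    # str.encode stands in for compat.as_bytes from the excerpt.
    digest = hashlib.sha1(hash_name.encode('utf-8')).hexdigest()
    percentage_hash = ((int(digest, 16) % (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                       (100.0 / MAX_NUM_IMAGES_PER_CLASS))
    if percentage_hash < validation_percentage:
        return 'validation'
    if percentage_hash < testing_percentage + validation_percentage:
        return 'testing'
    return 'training'


# Hypothetical file names, used only to look at the resulting distribution.
names = ['img_%05d.jpg' % i for i in range(10000)]
counts = Counter(assign_split(name, 10, 10) for name in names)
print(counts)
```

Running it twice gives identical counts, because the assignment depends only on the file name. With a 10/10 request, the validation and testing counts should each land near 1000, without being exactly 1000.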

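Reducing modulo n+1 produces a value in [0,n], and multiplying that by 100/n obviously produces a value in [0,100], which is interpreted as a percentage. Here n is MAX_NUM_IMAGES_PER_CLASS, chosen to keep the rounding error in that interpretation to no more than "one image".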
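
To make the arithmetic concrete, here is a small sketch with a deliberately tiny n. The helper name percentage_from_hash and the choice n = 4 are illustrative assumptions only.

```python
def percentage_from_hash(hash_value, n):
    """Map an integer hash onto [0, 100] the same way the excerpt does."""
    # hash_value % (n + 1) lies in 0..n inclusive; scaling by 100/n
    # stretches that range to 0..100 inclusive.
    return (hash_value % (n + 1)) * (100.0 / n)


n = 4
values = [percentage_from_hash(h, n) for h in range(n + 1)]
print(values)  # [0.0, 25.0, 50.0, 75.0, 100.0]
```

The possible outputs sit on a grid with spacing 100/n percentage points. When n is at least as large as the number of images in a class, that spacing is no coarser than the share of a single image.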

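This strategy is reasonable, but it looks more sophisticated than it is: rounding still occurs, and the remainder introduces a bias (although with numbers this large it is completely unobservable). You could make it simpler and more accurate by precomputing, for each class, a range over the whole space of 2^160 hashes and just checking the hash against the two boundaries. That still notionally involves rounding, but with 160 bits it is only the rounding inherent in representing a decimal such as 31% in floating point.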
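
One way the boundary-based alternative could look is sketched below. The names HASH_SPACE, make_boundaries, and assign_split are hypothetical, and computing the cut points with fractions.Fraction is a choice made for this sketch, not something the answer prescribes.

```python
import hashlib
from fractions import Fraction

HASH_SPACE = 2 ** 160  # number of distinct SHA-1 values


def make_boundaries(validation_percentage, testing_percentage):
    """Precompute integer cut points over the whole SHA-1 space."""
    validation_end = int(HASH_SPACE * Fraction(validation_percentage) / 100)
    testing_end = int(
        HASH_SPACE * Fraction(validation_percentage + testing_percentage) / 100)
    return validation_end, testing_end


def assign_split(hash_name, boundaries):
    """Compare the full 160-bit hash against the two precomputed cut points."""
    validation_end, testing_end = boundaries
    value = int(hashlib.sha1(hash_name.encode('utf-8')).hexdigest(), 16)
    if value < validation_end:
        return 'validation'
    if value < testing_end:
        return 'testing'
    return 'training'


boundaries = make_boundaries(10, 10)
print(assign_split('leaf_0001', boundaries))
```

The boundaries are computed once, and each file then needs only two integer comparisons, with no modulo step and no per-file floating-point arithmetic.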