SHA hash for training/validation/test set split

10

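The following is a small excerpt from the full code.

I am trying to understand the logic behind this method of splitting.

  • A SHA1 digest is 40 hexadecimal characters. What kind of probability does the expression compute?
  • What is the reason for (MAX_NUM_IMAGES_PER_CLASS + 1)? Why add 1?
  • Does setting MAX_NUM_IMAGES_PER_CLASS to different values affect the quality of the split?
  • What split quality can we expect from this? Is it a recommended way of splitting datasets?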

  # We want to ignore anything after '_nohash_' in the file name when
  # deciding which set to put an image in, the data set creator has a way of
  # grouping photos that are close variations of each other. For example
  # this is used in the plant disease data set to group multiple pictures of
  # the same leaf.
  hash_name = re.sub(r'_nohash_.*$', '', file_name)
  # This looks a bit magical, but we need to decide whether this file should
  # go into the training, testing, or validation sets, and we want to keep
  # existing files in the same set even if more files are subsequently
  # added.
  # To do that, we need a stable way of deciding based on just the file name
  # itself, so we do a hash of that and then use that to generate a
  # probability value that we use to assign it.
  hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
  percentage_hash = ((int(hash_name_hashed, 16) %
                      (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                     (100.0 / MAX_NUM_IMAGES_PER_CLASS))
  if percentage_hash < validation_percentage:
    validation_images.append(base_name)
  elif percentage_hash < (testing_percentage + validation_percentage):
    testing_images.append(base_name)
  else:
    training_images.append(base_name)

  result[label_name] = {
      'dir': dir_name,
      'training': training_images,
      'testing': testing_images,
      'validation': validation_images,
      }
1 Answer

4
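This code simply distributes file names "randomly" (but reproducibly) over a number of bins and then groups the bins into just three categories. The number of bits in the hash is irrelevant (as long as there are enough, which is probably about 35 for this sort of work).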
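Reducing modulo n+1 produces a value in [0, n], and multiplying that by 100/n produces a value in [0, 100], which is interpreted as a percentage. Setting n to MAX_NUM_IMAGES_PER_CLASS is meant to keep the rounding error in that interpretation to no more than "one image".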
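The mapping described above can be sketched in isolation. This is only an illustration, not the original script: the function name `percentage_hash` is hypothetical, the value `2 ** 27 - 1` for MAX_NUM_IMAGES_PER_CLASS is an assumption (the excerpt does not show the constant), and `str.encode` stands in for `compat.as_bytes`.

```python
import hashlib

# Assumed value for illustration; the excerpt does not show the constant.
MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1


def percentage_hash(name, n=MAX_NUM_IMAGES_PER_CLASS):
    """Map a file name to a reproducible value in [0, 100]."""
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % (n + 1)  # integer in [0, n], so n + 1 bins
    return bucket * (100.0 / n)         # scaled so that bin n maps to 100


# The same name always lands on the same value, so it stays in the same set
# no matter how many other files are added later.
value = percentage_hash("leaf_001.jpg")
print(0.0 <= value <= 100.0)
```

With a small n the bins are easy to see: for n = 4 the only possible results are 0, 25, 50, 75 and 100, which is why the modulus is n + 1 rather than n.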
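This strategy is reasonable, but it looks a bit more sophisticated than it is, since there is still rounding going on and the remainder introduces a bias (although with numbers this large it is utterly unobservable). You could make it simpler and more accurate by precomputing, for each class, a range over the whole space of 2^160 hashes and just checking the hash against the two boundaries. That still notionally involves rounding, but with 160 bits it is only the rounding intrinsic to representing decimals such as 31% in floating point.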
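A minimal sketch of that boundary-based alternative follows. It is one possible reading of the suggestion, not code from the original script: the names `HASH_SPACE` and `assign_split` are hypothetical, and the percentages are assumed to be plain numbers such as 10 or 31.

```python
import hashlib

HASH_SPACE = 2 ** 160  # a SHA-1 digest is 160 bits


def assign_split(name, validation_percentage, testing_percentage):
    """Pick a split by comparing the raw hash against two precomputed bounds."""
    h = int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16)
    # Upper bounds of the validation and testing ranges within [0, 2^160).
    validation_end = HASH_SPACE * validation_percentage // 100
    testing_end = HASH_SPACE * (validation_percentage + testing_percentage) // 100
    if h < validation_end:
        return "validation"
    if h < testing_end:
        return "testing"
    return "training"


print(assign_split("leaf_001.jpg", 10, 10))
```

Because the hash is compared directly, there is no modulus and no MAX_NUM_IMAGES_PER_CLASS at all. In a real script the two bounds would be computed once rather than per file.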
