Pythonic way to convert a dictionary to a NumPy array

Question

3

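This is more of a programming-style question. I scrape fields from web pages, such as "Temperature: 51-62", "Height: 1000-1500", and so on, and store the results in a dictionary: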

{"temperature": "51-62", "height":"1000-1500" ...... }

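All keys and values are strings. Each key maps to one of several possible values. Now I want to convert this dictionary into a NumPy array/vector. My requirements are:

- Each key corresponds to one index position in the array.
- Each possible string value is mapped to an integer.
- Some keys are unavailable in some dictionaries. For example, I also have a dictionary with no "temperature" key, because that web page does not contain the field.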

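I would like to know the clearest and most efficient way to write this conversion in Python. I am considering building one dictionary that maps each key to an index in the vector, plus a set of other dictionaries (one per key) that map values to integers.

Another problem is that I am not sure of the full range of values for some keys. I want to track the mapping between string values and integers dynamically. For example, in the future I may discover that key1 can also map to val1_8.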
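The two-level mapping described above could be sketched roughly as follows. This is only an illustration, not code from the thread: the class name DictEncoder, the -1 marker for missing keys, and the value "1500-2000" are all assumptions made for the example.

```python
import numpy as np


class DictEncoder:
    """Hypothetical helper: turns string-valued dicts into integer vectors."""

    def __init__(self, keys):
        # Fixed key order: each key owns one position in the vector.
        self.key_index = {key: i for i, key in enumerate(keys)}
        # One value -> integer table per key, filled in as values are seen.
        self.value_codes = {key: {} for key in keys}

    def encode(self, record, missing=-1):
        # Positions for keys absent from this record keep the `missing` marker.
        vec = np.full(len(self.key_index), missing, dtype=np.int64)
        for key, value in record.items():
            if key not in self.key_index:
                continue  # keys not declared up front are ignored in this sketch
            codes = self.value_codes[key]
            # A value seen for the first time gets the next free integer.
            vec[self.key_index[key]] = codes.setdefault(value, len(codes))
        return vec


encoder = DictEncoder(["temperature", "height"])
a = encoder.encode({"temperature": "51-62", "height": "1000-1500"})
b = encoder.encode({"height": "1500-2000"})  # page without a temperature field
```

Because the value tables grow on demand, a value discovered later (such as val1_8 for key1) simply receives the next unused integer, and the codes already assigned stay stable.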

Thanks.

- fast tooth

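Possibly a duplicate of the question on how to iterate over values in a Python dictionary. - Anycorn

@Anycorn, thanks for the quick comment. My question is different from that post. - fast tooth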

2 Answers

1

>>> from pprint import pprint
>>> import numpy as NP
>>> # a sequence of dictionaries in an iterable called 'data',
>>> # assuming that not all dicts have the same keys
>>> pprint(data)
  [{'x': 7.0, 'y1': 2.773, 'y2': 4.5, 'y3': 2.0},
   {'x': 0.081, 'y1': 1.171, 'y2': 4.44, 'y3': 2.576},
   {'y1': 0.671, 'y3': 3.173},
   {'x': 0.242, 'y2': 3.978, 'y3': 3.791},
   {'x': 0.323, 'y1': 2.088, 'y2': 3.602, 'y3': 4.43}]

>>> # get the unique keys across entire dataset
>>> keys = [list(dx.keys()) for dx in data]

>>> # flatten and coerce to 'set'
>>> keys = {itm for inner_list in keys for itm in inner_list}

>>> # create a look-up table mapping each column index
>>> # of a NumPy array to one of the keys
>>> # (set order is arbitrary, so the order shown may differ per run)

>>> LuT = dict(enumerate(keys))
>>> LuT
  {0: 'y2', 1: 'y3', 2: 'y1', 3: 'x'}

>>> idx = list(LuT.values())

>>> # pre-allocate NumPy array (100 rows is arbitrary)
>>> # number of columns is len(LuT.keys())

>>> D = NP.empty((100, len(LuT.keys())))

>>> keys = list(LuT.keys())
>>> keys
  [0, 1, 2, 3]

>>> # now populate the array from the original data using LuT
>>> for i, row in enumerate(data):
        D[i,:] = [ row.get(LuT[k], 0) for k in keys ]

>>> D[:5,:]
  array([[ 4.5  ,  2.   ,  2.773,  7.   ],
         [ 4.44 ,  2.576,  1.171,  0.081],
         [ 0.   ,  3.173,  0.671,  0.   ],
         [ 3.978,  3.791,  0.   ,  0.242],
         [ 3.602,  4.43 ,  2.088,  0.323]])

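Compare the first five rows of D with data above. For each row (a single dictionary) the column order is preserved even when its key set is incomplete. In other words, the first column of D always holds the values keyed by y2, and so on, even if a given row in data stores no value for that key. For example, the third row of data has only two key/value pairs, and in the third row of D the first and last columns are both 0; those columns correspond to the keys y2 and x, which are exactly the two missing keys. Rows of D beyond the fifth are left uninitialized by NP.empty.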
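One possible variation on the approach above (not part of the original answer) is to sort the keys so that the column order is reproducible, size the array to the data, and mark missing values with NaN instead of 0, so that a real 0 cannot be confused with an absent key. A minimal sketch, reusing two of the rows from data:

```python
import numpy as np

data = [
    {'x': 7.0, 'y1': 2.773, 'y2': 4.5, 'y3': 2.0},
    {'y1': 0.671, 'y3': 3.173},  # 'x' and 'y2' are missing here
]

# Sorting the union of all keys gives a reproducible column order.
columns = sorted(set().union(*data))

# One row per dictionary; NaN marks a key the dictionary does not have.
D = np.full((len(data), len(columns)), np.nan)
for i, row in enumerate(data):
    for j, key in enumerate(columns):
        if key in row:
            D[i, j] = row[key]
```

Whether NaN or 0 is the better marker depends on what the array is used for downstream; NaN forces a float dtype, which is fine for numeric data like this.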

- doug

Thanks for the detailed answer. I found that Pandas is a natural fit for my current problem. - fast tooth


- U2EF1 · Accepted Answer

7

Try a pandas Series; it was built for exactly this.

import pandas as pd
s = pd.Series({'a':1, 'b':2, 'c':3})
s.values # a numpy array

- U2EF1

One issue is that not all of the dictionaries have the same set of keys. Can Pandas handle that? Thanks. - fast tooth

1

Yes. You might also want to look at the pandas DataFrame, for even more fun. - U2EF1
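As a rough illustration of how a DataFrame copes with differing key sets, here is a minimal sketch applied to string-valued dictionaries like those in the question. The sample values are assumptions made for the example, and using pd.factorize for the string-to-integer step is one option among several, not something proposed in the thread.

```python
import pandas as pd

dictlist = [
    {"temperature": "51-62", "height": "1000-1500"},
    {"height": "1500-2000"},  # this page had no temperature field
]

# Keys missing from a dictionary become NaN in the corresponding column.
df = pd.DataFrame(dictlist)

# Map the strings in each column to integer codes.
# pd.factorize assigns -1 to missing values by default.
codes = pd.DataFrame({col: pd.factorize(df[col])[0] for col in df.columns})

arr = codes.to_numpy()  # plain NumPy array, one row per dictionary
```

Here each column is one key and each row is one scraped page, so the column position plays the role of the key-to-index mapping.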

Thanks, I have installed it, and it really is a powerful tool. I ran pd.DataFrame({dd["name"]: pd.Series(dd) for dd in dictlist}), where dictlist is a list of dictionaries. - fast tooth