Let's say I have a dictionary/hash table of string pairs (as keys) and their respective probabilities (as values):
import numpy as np
import random
import uuid
# Creating the N vocabulary and M vocabulary
max_word_len = 20
n_vocab_size = random.randint(8000,10000)
m_vocab_size = random.randint(8000,10000)
def random_word():
    # get_hex() exists only on Python 2; on Python 3 use uuid.uuid4().hex instead.
    return str(uuid.uuid4().get_hex().upper()[0:random.randint(1,max_word_len)])
# Generate some random words.
n_vocab = [random_word() for i in range(n_vocab_size)]
m_vocab = [random_word() for i in range(m_vocab_size)]
# Let's hallucinate probabilities for each word pair.
hashes = {(n, m): random.random() for n in n_vocab for m in m_vocab}
hashes
The hash table will look something like this:
{('585F', 'B4867'): 0.7582038699473549,
('69', 'D98B23C5809A'): 0.7341569569849136,
('4D30CB2BF4134', '82ED5FA3A00E4728AC'): 0.9106077161619021,
('DD8F8AFA5CF', 'CB'): 0.4609114677237601,
...
}
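Imagine that this is the input hash table that I'll read from a CSV file, where the first and second columns are the word pairs (keys) of the hash table and the third column is the probability.
If I want to put the probabilities into some sort of numpy matrix, I have to do this from the hash table:
n_words, m_words = zip(*hashes.keys())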
probs = np.array([[hashes[(n, m)] for n in n_vocab] for m in m_vocab])
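Is there another way to get the `prob` values out of the hash table and into an |N| * |M| matrix without a nested loop over m_vocab and n_vocab?
(Note: I'm creating random words and random probabilities here, but imagine that I have read the hash table from a file and loaded it into this hash table structure.)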
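To make the question concrete, here is a minimal sketch of one loop-free direction, assuming Python 3 and a small stand-in for `hashes`. It is not the asker's method: the names `n_index`, `m_index`, `rows`, `cols` and `vals` are hypothetical. The idea is to map each word to an integer position once and then fill the matrix with a single NumPy fancy-indexing assignment, so the work is one pass over the keys rather than a nested loop over both vocabularies.

```python
import numpy as np

# Toy stand-in for the `hashes` dict in the question (assumed data).
hashes = {("abc", "xyz"): 0.9, ("efg", "xyz"): 0.3,
          ("lmn", "opq"): 0.23, ("abc", "jkl"): 0.5}

# Split the keys into the two word columns, as in the question.
n_words, m_words = zip(*hashes.keys())

# Sorted vocabularies and word -> position lookups (hypothetical names).
n_vocab = sorted(set(n_words))
m_vocab = sorted(set(m_words))
n_index = {w: i for i, w in enumerate(n_vocab)}
m_index = {w: j for j, w in enumerate(m_vocab)}

# One pass over the keys to get integer coordinates for every pair.
count = len(hashes)
rows = np.fromiter((n_index[n] for n in n_words), dtype=np.intp, count=count)
cols = np.fromiter((m_index[m] for m in m_words), dtype=np.intp, count=count)
vals = np.fromiter(hashes.values(), dtype=float, count=count)

# |N| x |M| matrix; pairs missing from the dict stay NaN.
probs = np.full((len(n_vocab), len(m_vocab)), np.nan)
probs[rows, cols] = vals
```

Note that this puts the N words on the rows. The nested list comprehension in the question iterates m_vocab in the outer loop, so it yields an |M| * |N| array; transpose one or the other if the orientation matters.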
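Let's assume two scenarios:
1. The hash table comes from a CSV file (@bunji's answer resolves this one).
2. The hash table comes from a pickled dictionary, or was computed in some other way before reaching the part where it has to be converted into a matrix.
It's important that the final matrix is queryable. The following is not desirable: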
$ echo -e 'abc\txyz\t0.9\nefg\txyz\t0.3\nlmn\topq\t\0.23\nabc\tjkl\t0.5\n' > test.txt
$ cat test.txt
abc xyz 0.9
efg xyz 0.3
lmn opq .23
abc jkl 0.5
$ python
Python 2.7.10 (default, Jul 30 2016, 18:31:42)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pt = pd.read_csv('test.txt', index_col=[0,1], header=None, delimiter='\t').unstack().as_matrix()
>>> pt
array([[ 0.5, nan, 0.9],
[ nan, nan, 0.3],
[ nan, nan, nan]])
>>> pd.read_csv('test.txt', index_col=[0,1], header=None, delimiter='\t').unstack()
2
1 jkl opq xyz
0
abc 0.5 NaN 0.9
efg NaN NaN 0.3
lmn NaN NaN NaN
>>> df = pd.read_csv('test.txt', index_col=[0,1], header=None, delimiter='\t').unstack()
>>> df
2
1 jkl opq xyz
0
abc 0.5 NaN 0.9
efg NaN NaN 0.3
lmn NaN NaN NaN
>>> df['abc', 'jkl']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
return self._getitem_multilevel(key)
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
loc = self.columns.get_loc(key)
File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1617, in get_loc
return self._engine.get_loc(key)
File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13161)
File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13115)
KeyError: ('abc', 'jkl')
>>> df['abc']['jkl']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
return self._getitem_multilevel(key)
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
loc = self.columns.get_loc(key)
File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1597, in get_loc
loc = self._get_level_indexer(key, level=0)
File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1859, in _get_level_indexer
loc = level_index.get_loc(key)
File "/Library/Python/2.7/site-packages/pandas/indexes/base.py", line 2106, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
File "pandas/index.pyx", line 163, in pandas.index.IndexEngine.get_loc (pandas/index.c:4090)
KeyError: 'abc'
>>> df[0][2]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
return self._getitem_multilevel(key)
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
loc = self.columns.get_loc(key)
File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1597, in get_loc
loc = self._get_level_indexer(key, level=0)
File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1859, in _get_level_indexer
loc = level_index.get_loc(key)
File "/Library/Python/2.7/site-packages/pandas/indexes/base.py", line 2106, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
File "pandas/src/hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8141)
File "pandas/src/hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8085)
KeyError: 0
>>> df[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
return self._getitem_multilevel(key)
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
loc = self.columns.get_loc(key)
File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1597, in get_loc
loc = self._get_level_indexer(key, level=0)
File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1859, in _get_level_indexer
loc = level_index.get_loc(key)
File "/Library/Python/2.7/site-packages/pandas/indexes/base.py", line 2106, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
File "pandas/src/hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8141)
File "pandas/src/hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8085)
KeyError: 0
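For reference, the KeyErrors above are consistent with two things: `unstack()` on the whole DataFrame keeps the value column's label `2` as an outer column level, so the columns are pairs like `(2, 'jkl')`, and `df[...]` selects columns rather than rows. A minimal sketch of one possible way around it, assuming Python 3 and the same rows as test.txt (with 0.23 written without the stray backslash): select column `2` first so that `unstack()` returns single-level columns, then use `.loc` with a row label and a column label. The names `text`, `raw` and `table` are hypothetical.

```python
import io
import pandas as pd

# Same rows as test.txt; the 0.23 is written without the stray backslash (assumption).
text = u"abc\txyz\t0.9\nefg\txyz\t0.3\nlmn\topq\t0.23\nabc\tjkl\t0.5\n"

# Columns 0 and 1 become a two-level row index; the value column is labelled 2.
raw = pd.read_csv(io.StringIO(text), sep="\t", header=None, index_col=[0, 1])

# Selecting the Series first means unstack() produces plain, single-level columns.
table = raw[2].unstack()

# Label-based lookup: row word first, column word second.
value = table.loc["abc", "jkl"]

# Plain NumPy array, if the labels are no longer needed.
matrix = table.values
```

With this layout, pairs that never occur in the file are still NaN, which is the same behaviour as in the session above.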
The resulting matrix/dataframe should be queryable, i.e. it should be possible to do something like:
probs[('585F', 'B4867')] = 0.7582038699473549
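To illustrate what "queryable" could mean for a plain NumPy matrix, here is a rough sketch of a thin wrapper that accepts a `(n_word, m_word)` tuple and returns the stored probability. `LabeledMatrix` and its attributes are hypothetical names made up for this example, not part of NumPy or pandas, and the two pairs are taken from the sample output earlier in the question.

```python
import numpy as np

class LabeledMatrix(object):
    """Hypothetical helper: a 2-D array addressable by (n_word, m_word)."""

    def __init__(self, pairs):
        # pairs: dict mapping (n_word, m_word) -> probability
        self.n_vocab = sorted({n for n, _ in pairs})
        self.m_vocab = sorted({m for _, m in pairs})
        self._n_index = {w: i for i, w in enumerate(self.n_vocab)}
        self._m_index = {w: j for j, w in enumerate(self.m_vocab)}
        # Pairs that are absent from the dict stay NaN.
        self.matrix = np.full((len(self.n_vocab), len(self.m_vocab)), np.nan)
        for (n, m), p in pairs.items():
            self.matrix[self._n_index[n], self._m_index[m]] = p

    def __getitem__(self, key):
        # Translate the word pair into integer positions, then index the array.
        n, m = key
        return self.matrix[self._n_index[n], self._m_index[m]]

probs = LabeledMatrix({('585F', 'B4867'): 0.7582038699473549,
                       ('69', 'D98B23C5809A'): 0.7341569569849136})
value = probs[('585F', 'B4867')]
```

The underlying array is still available as `probs.matrix` for vectorised operations; the wrapper only adds the word-to-position translation.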
What about pandas.DataFrame.tonumpy()? Is there such a function? Let me try. - alvas
uuid4().get_hex().upper() might need to be changed to uuid4().hex.upper() in Python 3.5 or later. - Tammo Heeren
In nlp, this matrix is called a co-occurrence matrix. - alvas