Creating an |N| x |M| matrix from a hashtable


Let's say I have a dictionary/hashtable of string pairs (keys) and their respective probabilities (values):

import numpy as np
import random
import uuid

# Creating the N vocabulary and M vocabulary
max_word_len = 20
n_vocab_size = random.randint(8000,10000)
m_vocab_size = random.randint(8000,10000)

def random_word(): 
    return str(uuid.uuid4().get_hex().upper()[0:random.randint(1,max_word_len)])

# Generate some random words.
n_vocab = [random_word() for i in range(n_vocab_size)]
m_vocab = [random_word() for i in range(m_vocab_size)]


# Let's hallucinate probabilities for each word pair.
hashes =  {(n, m): random.random() for n in n_vocab for m in m_vocab}
The hashes hashtable would look something like this:
{('585F', 'B4867'): 0.7582038699473549,
 ('69', 'D98B23C5809A'): 0.7341569569849136,
 ('4D30CB2BF4134', '82ED5FA3A00E4728AC'): 0.9106077161619021,
 ('DD8F8AFA5CF', 'CB'): 0.4609114677237601,
...
}

Let's say this is the input hashtable that I would read in from a CSV file, with the first and second columns being the word pairs (keys) of the hashtable and the third column the probabilities.
If I were to put the probabilities into some sort of numpy matrix, I would have to do this from the hashtable:
n_words, m_words = zip(*hashes.keys())
probs = np.array([[hashes[(n, m)] for n in n_vocab] for m in m_vocab])

Is there another way to get the `probs` data from the hashtable into an |N| x |M| matrix without the nested loop through m_vocab and n_vocab?
(Note: here I'm creating random words and random probabilities, but imagine that I have instead read the hashtable from a file into that dictionary structure.)
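For concreteness, a minimal sketch of what that read step might look like; the filename and the tab delimiter are assumptions:

import csv

hashes = {}
with open('word_pair_probs.tsv') as fin:  # hypothetical filename
    for n_word, m_word, prob in csv.reader(fin, delimiter='\t'):
        hashes[(n_word, m_word)] = float(prob)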
Assume the following two scenarios:

1. The hashtable comes from a CSV file (@bunji's answer addresses this).
2. The hashtable comes from a pickled dictionary, or is computed some other way before reaching the point where it has to be converted into a matrix (see the sketch below).
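For scenario 2, a minimal sketch of the pickled case (the filename is again hypothetical):

import pickle

with open('hashes.pkl', 'rb') as fin:  # hypothetical filename
    hashes = pickle.load(fin)  # restores the {(n, m): prob} dict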
It is important that the final matrix is queryable; the following situation is undesirable:
$ echo -e 'abc\txyz\t0.9\nefg\txyz\t0.3\nlmn\topq\t\0.23\nabc\tjkl\t0.5\n' > test.txt

$ cat test.txt
abc xyz 0.9
efg xyz 0.3
lmn opq .23
abc jkl 0.5


$ python
Python 2.7.10 (default, Jul 30 2016, 18:31:42) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pt = pd.read_csv('test.txt', index_col=[0,1], header=None, delimiter='\t').unstack().as_matrix()
>>> pt
array([[ 0.5,  nan,  0.9],
       [ nan,  nan,  0.3],
       [ nan,  nan,  nan]])
>>> pd.read_csv('test.txt', index_col=[0,1], header=None, delimiter='\t').unstack()
       2         
1    jkl opq  xyz
0                
abc  0.5 NaN  0.9
efg  NaN NaN  0.3
lmn  NaN NaN  NaN

>>> df = pd.read_csv('test.txt', index_col=[0,1], header=None, delimiter='\t').unstack()

>>> df
       2         
1    jkl opq  xyz
0                
abc  0.5 NaN  0.9
efg  NaN NaN  0.3
lmn  NaN NaN  NaN

>>> df['abc', 'jkl']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
    return self._getitem_multilevel(key)
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
    loc = self.columns.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1617, in get_loc
    return self._engine.get_loc(key)
  File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
  File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13161)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13115)
KeyError: ('abc', 'jkl')
>>> df['abc']['jkl']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
    return self._getitem_multilevel(key)
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
    loc = self.columns.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1597, in get_loc
    loc = self._get_level_indexer(key, level=0)
  File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1859, in _get_level_indexer
    loc = level_index.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas/indexes/base.py", line 2106, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
  File "pandas/index.pyx", line 163, in pandas.index.IndexEngine.get_loc (pandas/index.c:4090)
KeyError: 'abc'

>>> df[0][2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
    return self._getitem_multilevel(key)
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
    loc = self.columns.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1597, in get_loc
    loc = self._get_level_indexer(key, level=0)
  File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1859, in _get_level_indexer
    loc = level_index.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas/indexes/base.py", line 2106, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
  File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
  File "pandas/src/hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8141)
  File "pandas/src/hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8085)
KeyError: 0

>>> df[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
    return self._getitem_multilevel(key)
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
    loc = self.columns.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1597, in get_loc
    loc = self._get_level_indexer(key, level=0)
  File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1859, in _get_level_indexer
    loc = level_index.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas/indexes/base.py", line 2106, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
  File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
  File "pandas/src/hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8141)
  File "pandas/src/hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8085)
KeyError: 0

The resulting matrix/dataframe should be queryable, i.e. able to do something like:
probs[('585F', 'B4867')] = 0.7582038699473549

Can you do this with pandas? Create a dataframe from the dictionary, using the two keys as two columns and the hash as another column; after that you could try creating a composite index. Just guessing. - Tammo Heeren
What about pandas.DataFrame.tonumpy()? Is there such a function? Let me try. - alvas
Your uuid4().get_hex().upper() probably needs to be changed to uuid4().hex.upper() on Python 3.5 or above. - Tammo Heeren
Why do you need these in such a tabular form? - Tammo Heeren
I actually have another |M| x 1 vector corresponding to M that has to go through a matrix multiplication, |N| x |M| * |M| x 1 = |N| x 1. The matrix also needs to support other statistical computations; in NLP this kind of matrix is called a co-occurrence matrix. - alvas
I played with this for a bit, but I can't get it done without loops. - Tammo Heeren
5 Answers


I'm not sure there is a way to completely avoid the loops, but I imagine it could be optimized by using itertools:

import itertools
nested_loop_iter = itertools.product(n_vocab,m_vocab)
#note that because it iterates over n_vocab first we will need to transpose it at the end
probs = np.fromiter(map(hashes.get, nested_loop_iter),dtype=float)
probs.resize((len(n_vocab),len(m_vocab)))
probs = probs.T
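As a quick sanity check (the baseline below mirrors the nested-loop construction from the question; it is only here to show the two results agree):

expected = np.array([[hashes[(n, m)] for n in n_vocab] for m in m_vocab])
assert np.allclose(probs, expected)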


If your end goal is to read the data in from a .csv file, it might be easier to just read the file directly using pandas.

import pandas as pd

df = pd.read_csv('coocurence_data.csv', index_col=[0,1], header=None).unstack()
probs = df.as_matrix()

This reads the data in from the csv file, turns the first two columns into a multi-level index corresponding to your two sets of words, and then unstacks that index so that one set of words serves as the column labels and the other as the row labels. This gives your |N| x |M| matrix, which can then be converted into a numpy array with the .as_matrix() function.
This doesn't really address your question about turning your {(n, m): prob} dictionary into a numpy array, but given your stated intentions, it lets you avoid having to create that dictionary at all.
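That said, if the dictionary is already in memory (scenario 2 in the question), a minimal sketch of the same unstacking idea should work, since pd.Series accepts a dict whose tuple keys become a MultiIndex:

import pandas as pd

df = pd.Series(hashes).unstack()  # rows: n words, columns: m words
probs = df.values                 # |N| x |M| numpy array (NaN for missing pairs)
# df = df.reindex(index=n_vocab, columns=m_vocab) if a fixed ordering matters
df.loc['585F', 'B4867']           # queryable by word pair -> 0.7582038699473549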
Also, if you're going to read in the csv file anyway, doing it with pandas is faster than using the built-in csv module: see these benchmarks here.

EDIT: To query a specific value in the DataFrame based on the row and column labels, use df.loc:
df.loc['xyz', 'abc']

where 'xyz' is the word among your row labels and 'abc' is the word among your column labels. Also check out df.ix and df.iloc for other ways of querying a specific cell in a DataFrame.
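As a sketch of why the lookups in the question failed: after unstack, the columns are a MultiIndex whose top level is the leftover csv column number 2, so dropping that level first makes plain label queries work (using the question's test.txt):

df = pd.read_csv('test.txt', index_col=[0, 1], header=None, delimiter='\t').unstack()
df.columns = df.columns.droplevel(0)  # drop the leftover '2' level
df.loc['abc', 'jkl']                  # -> 0.5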


If I read the file into a pandas dataframe, only the diagonal vector would get filled; and it could be inconvenient if the hashtable comes pickled as a `dict`. Still, it's a nice way to read a csv =) - alvas
Ah, this is tricky: the headers in the final probability matrix end up undefined, which makes it unqueryable =( - alvas
@alvas you can use the .loc function to query cells in a DataFrame based on labels. I'll edit the answer above to demonstrate this. - bunji
.values is recommended over .as_matrix() - Paul H
When I tried df.loc on my dataset, it kept throwing errors saying that my words were not in the index, which felt a bit strange. - alvas
It seems some Python limitation is causing the index error problem: http://stackoverflow.com/questions/40714519/indexerror-on-huge-list-in-python - alvas


Most of the solutions look fine to me. It somewhat depends on whether you need speed or convenience.

I agree that you basically have a matrix in coo sparse format. You may want to look at https://docs.scipy.org/doc/scipy-0.18.1/reference/sparse.html

The only problem is that sparse matrices need integer indices. So as long as the hash values are short enough to be quickly expressed as np.int64, this would work. The sparse formats should also allow O(1) access to all elements.

(Sorry for being brief!)

Outline

This could potentially be fast, but it is somewhat hacky:

  1. Get the data into a sparse representation. I think you should pick coo_matrix to just hold your 2D hash map.

    a. Load the CSV using numpy.genfromtxt with e.g. dtype ['>u8', '>u8', np.float32] to treat the hashes as string representations of unsigned 8-byte integers. If that does not work you might load strings and use numpy to convert them. Finally you would have three tables of size N * M, like your hash table, and could use them with the scipy sparse matrix representation of your choice.

    b. If you already have the object in memory, you might be able to use the sparse constructor directly.

  2. To access an element you need to parse your strings again:

    prob = matrix[np.fromstring(key1, dtype='>u8'), np.fromstring(key2, dtype='>u8')]
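A less hacky sketch of the same sparse idea, assuming we simply enumerate the two vocabularies to get integer indices (n_index and m_index below are assumptions, not part of the recipe above):

from scipy.sparse import coo_matrix

# Map each word to an integer row/column index.
n_index = {w: i for i, w in enumerate(n_vocab)}
m_index = {w: j for j, w in enumerate(m_vocab)}

rows, cols, vals = [], [], []
for (n, m), p in hashes.items():  # a single pass over the hashtable
    rows.append(n_index[n])
    cols.append(m_index[m])
    vals.append(p)

probs = coo_matrix((vals, (rows, cols)), shape=(len(n_vocab), len(m_vocab)))
# probs.tocsr() then gives fast row slicing and matrix-vector products,
# e.g. probs.tocsr().dot(v) for the |N| x |M| * |M| x 1 product mentioned in the comments above.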
    

Iterating over the entire n_vocab x m_vocab space, even into a sparse matrix, seems somewhat inefficient! You could loop over the original hashtable instead. Of course, it would help to know a few things first:
1. Do you know the sizes of n_vocab and m_vocab in advance, or are you determining them as you build?
2. Do you know whether there are duplicates in the hashtable, and if so, how you will handle them? It appears that hashes is a dictionary, in which case the keys are of course unique. In practice this probably means every occurrence overwrites the previous one, so the last value wins.
In any case, here is a comparison of the two options:
from collections import defaultdict
import numpy as np

hashes = defaultdict(float,{('585F', 'B4867'): 0.7582038699473549,
 ('69', 'D98B23C5809A'): 0.7341569569849136,
 ('4D30CB2BF4134', '82ED5FA3A00E4728AC'): 0.9106077161619021,
 ('DD8F8AFA5CF', 'CB'): 0.4609114677237601})

#Double loop approach
n_vocab, m_vocab = zip(*hashes.keys())
probs1 = np.array([[hashes[(n, m)] for n in n_vocab] for m in m_vocab])

#Loop through the hash approach
n_hash = dict()  #Create a hash table to find the correct row number
for i,n in enumerate(n_vocab):
    n_hash[n] = i
m_hash = dict()  #Create a hash table to find the correct col number
for i,m in enumerate(m_vocab):
    m_hash[m] = i
probs2 = np.zeros((len(n_vocab),len(m_vocab)))
for (n,m) in hashes: #Loop through the hashes and put the values into the probs table
    probs2[n_hash[n],m_hash[m]] = hashes[(n,m)]

The outputs probs1 and probs2 are, of course, identical:

>>> probs1
array([[ 0.73415696,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.46091147,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.75820387,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.91060772]])
>>> probs2
array([[ 0.73415696,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.46091147,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.75820387,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.91060772]])

Your probs1 code is certainly compact. However, the sizes of the loops differ greatly, and that can have a significant impact on run time.



I reduced the sample size so I could quickly compare the different pieces of code. I wrote a dataframe method, which may still use for loops inside the pandas functions, and compared it with the original code and with the itertools code provided by Tadhg McDonald-Jensen. The fastest is the itertools code.

In [3]: %timeit itertool(hashes,n_vocab,m_vocab)
1000 loops, best of 3: 1.12 ms per loop

In [4]: %timeit baseline(hashes,n_vocab,m_vocab)
100 loops, best of 3: 3.23 ms per loop

In [5]: %timeit dataframeMethod(hashes,n_vocab,m_vocab)
100 loops, best of 3: 5.49 ms per loop

Here is the code I used for the comparison:

import numpy as np
import random
import uuid
import pandas as pd
import itertools

# Creating the N vocabulary and M vocabulary
max_word_len = 20
n_vocab_size = random.randint(80,100)
m_vocab_size = random.randint(80,100)

def random_word(): 
    return str(uuid.uuid4().get_hex().upper()[0:random.randint(1,max_word_len)])

# Generate some random words.
n_vocab = [random_word() for i in range(n_vocab_size)]
m_vocab = [random_word() for i in range(m_vocab_size)]


# Let's hallucinate probabilities for each word pair.
hashes =  {(n, m): random.random() for n in n_vocab for m in m_vocab}

def baseline(hashes,n_vocab,m_vocab):
    n_words, m_words = zip(*hashes.keys())
    probs = np.array([[hashes[(n, m)] for n in n_vocab] for m in m_vocab])
    return probs

def itertool(hashes,n_vocab,m_vocab):
    nested_loop_iter = itertools.product(n_vocab,m_vocab)
    #note that because it iterates over n_vocab first we will need to transpose it at the end
    probs = np.fromiter(map(hashes.get, nested_loop_iter),dtype=float)
    probs.resize((len(n_vocab),len(m_vocab)))
    return probs.T  

def dataframeMethod(hashes,n_vocab,m_vocab):
    # build dataframe from hashes
    id1 = pd.MultiIndex.from_tuples(hashes.keys())
    df=pd.DataFrame(hashes.values(),index=id1)
    # make dataframe with one index and one column
    df2=df.unstack(level=0)
    df2.columns = df2.columns.levels[1]
    return df2.loc[m_vocab,n_vocab].values
