简单的DAWG创建算法？

Question

简单的DAWG创建算法？

5

我需要为我的Scrabble游戏玩家创建一个DAWG (http://en.wikipedia.org/wiki/Directed_acyclic_word_graph) 结构，并给定文件中的单词列表。我使用Java语言。我只需要做一次，然后将其存储在一个或多个文件中。我已经看到了两种方法：1）构建Trie并将其减少到DAWG；2）直接构建DAWG。由于我只需要执行一次，因此我想要实现最简单的算法来完成它。速度和内存要求不重要。

此外，我想知道如何在运行时存储结构以及如何将其保存在文件中？DAWG基本上是一个图，建议使用我编写的一些非常简单的类的一些节点和边/指针，但我看到了一些使用数组和偏移量（在这个数组中）的实现，这似乎很复杂且难以理解。这一次，我关心的是运行时的内存大小（以及保存的文件的大小）和加载DAWG/使用DAWG的速度。

- NPS

2个回答

0

我曾经为一个客户在C语言中实现过这样的结构。最终的结构是从一个描述字符集和dawg的xml文件中加载的，另一个进程从单词列表创建了该xml文件。

步骤1：构建第一个dawg并将其序列化为xml文件

我们使用了：

typedef struct _s_build_node build_node_t;
struct _s_build_node {
  char_t letter;
  build_node_t* first_child;
  build_node_t* right_sibling;

  hash_t hash;
  size_t depth;
  size_t ID;
};
typedef struct _s_build_dawg {
  charset_t charset;
  node_t* all_nodes; // an array of all the created nodes
  node_t* root;
} build_dawg_t;

子串按升序排序，单词结束特殊字符小于任何其他字符。算法非常简单：

// create the build dawg
foreach word in wordlist
  insert(dawg, word)
// compact the dawg
compact(dawg)
// generate the xml file
xml_dump(dawg)

为了压缩dawg，我们为每个节点计算了一个哈希值。具有相同哈希的两个节点可以被分解。这部分可能有些棘手。只保留深度最低的节点，其他节点将被删除，它们的父节点现在指向保留的节点。
一旦压缩完成，我们为每个节点分配一个唯一的ID（通过bfs，ID介于0和N-1之间，N是压缩后的dawg中节点的数量）。xml文件简单地描述了trie：

<dawg>
  <charset ....>
    ...
  </charset>

  <node ID="node_id" letter="letter" fist_child="first_child_ID" next_sibling="next_sibling_id" />
  <node .... />
  <node .... />
  <node .... />
</dawg>

步骤二：最终的DAGW

结构稍微简单一些。

typedef struct {
  char_t letter;
  size_t first_child;
  size_t next_sibling;
} node_t;

typedef struct {
  node_t nodes[];
  ... whatever you need ...
} dawg_t;

这里的根节点是dawg.nodes [0]，而first_child / next_sibling是节点数组中的索引。从xml文件创建这样的结构很容易。主要缺点是任何单词列表修改都会触发生成新的xml文件。

- Kwariz

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Kevin · Accepted Answer

最简单、最高效的DAWG构建算法在this paper中定义，并要求DAWG所代表的单词集合被排序。考虑到您计划从预先存在的单词列表构建DAWG，该列表可能已经排序，或者可以为此目的进行排序。

我已经将算法的伪代码以比论文中给出的更“程序员友好”的格式粗略地转录了一遍（免责声明：我可能会犯一些转录错误；您应该查看原始内容确定是否存在任何错误）：

Given:
        startState is the state from which traversal of the DAWG is to start
        register is a map of representations (hint: hashes) OF graphs which extend 
        from states in the DAWG TO said states

While there is newWord in wordList
    Get newWord from wordList
    Determine longestPrefix of newWord, starting from startState, which already exists in DAWG
    Get longestPrefixEndState, the state which the sequence of transitions defined by longestPrefix leads to
    Get suffix of newWord, the substring of newWord after longestPrefix
    if longestPrefixEndState has children 
        replace_or_register(longestPrefixEndState)
    endIf
    Create the sequence of transitions and states defined by suffix, starting from longestPrefixEndState
endWhile
replace_or_register(startState)


function replace_or_register(argState)
    Get greatestCharEndState of argState, the state which the lexicographically-greatest-char-labelled-transition in the outgoing transition set of argState leads to
    if greatestCharEndState has children
        replace_or_register(greatestCharEndState)
    endIf
    if there exists state in DAWG that is in the register and is equivalent (has an identical graph extending from it) to greatestCharEndState
        Redefine the transition that extends from argState to greatestCharEndState, as one that extends from argState to state
        Delete greatestCharEndState
    endIf
    else 
        add greatestCharEndState to the register
    endElse

考虑到您正在使用Java，您可以利用Serializable接口来处理所有序列化和反序列化需求。

如果您对Java中现有的DAWG实现感兴趣，并且想要实现上述算法，请查看MDAG，它还提供了其他实现不具备的一些巧妙功能（包括即时添加和删除字符串），并由我维护！