How to extract a Map<K, Multiset<V>> from a stream in Java 8

Question

11

I have streams of words (the format is not set by me and cannot be changed). For example:

Stream<String> doc1 = Stream.of("how", "are", "you", "doing", "doing", "doing");
Stream<String> doc2 = Stream.of("what", "what", "you", "upto");
Stream<String> doc3 = Stream.of("how", "are", "what", "how");
Stream<Stream<String>> docs = Stream.of(doc1, doc2, doc3);

I want to turn this into a Map<String, Multiset<Integer>> (or its corresponding stream, since I want to process it further), where the String key is the word itself and the Multiset<Integer> holds the number of times that word occurs in each document (zeros should not be included). Multiset is the Google Guava class, not something from java.util.

For example:

how   -> {1, 2}  // once in doc1, twice in doc3, none in doc2 (so doc2's count is not included)
are   -> {1, 1}  // once in doc1 and once in doc3
you   -> {1, 1}  // once in doc1 and once in doc2
doing -> {3}     // three times in doc1, none in the others
what  -> {2, 1}  // twice in doc2 and once in doc3
upto  -> {1}     // once in doc2

What is a good way to do this in Java 8?

I tried using flatMap, but the inner Stream greatly limits my options.

- Anoop

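3

"MultiSet" seems like a strange choice: not only would it suppress occurrences with the same frequency in different documents, but the order is also undefined, so you won't know which count belongs to which document. - Jorn Vernee

The Map's value could be a list of Multisets, or a Map<String, Integer>. - Leonardo Pina

@JornVernee I am fine with losing the order; that is why I chose it over a List. As far as I know, a Multiset can hold the same frequency from different documents, right? {1, 1, 1} is a perfectly valid Multiset. - Anoop

2

The main use case for Multiset is getting the number of occurrences of an element with .count(Object), which does not seem necessary for the word counts themselves. If you just need values in arbitrary order with duplicates allowed, a List is still the best solution. - Sean Van Gorder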

4 Answers

3

Map<String, Multiset<Integer>> result = docs
        .map(s -> s.collect(Collectors.toCollection(HashMultiset::create)))
        .flatMap(m -> m.entrySet().stream())
        .collect(Collectors.groupingBy(Multiset.Entry::getElement,
                Collectors.mapping(Multiset.Entry::getCount,
                        Collectors.toCollection(HashMultiset::create))));

// {upto=[1], how=[1, 2], doing=[3], what=[1, 2], are=[1 x 2], you=[1 x 2]}

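A Multiset is useful for getting the word counts, but you do not really need one to store the counts. If you are fine with a Map<String, List<Integer>>, just replace the last line with Collectors.toList())));.

Or, since you are already using Guava, why not use a ListMultimap?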

ListMultimap<String, Integer> result = docs
        .map(s -> s.collect(Collectors.toCollection(HashMultiset::create)))
        .flatMap(m -> m.entrySet().stream())
        .collect(ArrayListMultimap::create,
                (r, e) -> r.put(e.getElement(), e.getCount()),
                Multimap::putAll);

// {upto=[1], how=[1, 2], doing=[3], what=[2, 1], are=[1, 1], you=[1, 1]}

- Sean Van Gorder

3

Since you are using Guava, you can take advantage of its stream utilities, as well as its Table structure. Here is the code:

Table<String, Long, Long> result =
    Streams.mapWithIndex(docs, (doc, i) -> doc.map(word -> new SimpleEntry<>(word, i)))
        .flatMap(Function.identity())
        .collect(Tables.toTable(
            Entry::getKey, Entry::getValue, p -> 1L, Long::sum, HashBasedTable::create));

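Here I use the Streams.mapWithIndex method to assign an index to each inner stream. Inside the mapping function, I turn each word into a (word, index) pair so that I know later which document the word belongs to.

Then I flat-map the (word, index) pairs of all documents into a single stream and, finally, collect all pairs into a Guava Table with the Tables.toTable collector. The rows are the words, the columns are the documents (represented by their index), and the values are the per-document word counts (I assign 1L to every occurrence of a (word, index) pair and merge collisions with Long::sum).

You have all the information you need in the result table, but if you still need a map of multisets (here a Map<String, Multiset<Long>>, because the counts are summed as Long), you can get one as follows: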

Map<String, Multiset<Long>> map = Maps.transformValues(
    result.rowMap(),
    m -> HashMultiset.create(m.values()));

Note: you need at least Guava 21 for this to work.
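
The index-and-merge idea in this answer can also be sketched with the JDK alone. This is only an illustration under assumptions: the documents are given as lists rather than one-shot streams so that they can be indexed, the names WordDocTable and countByWordAndDoc are made up for this example, and a nested Map<String, Map<Integer, Long>> stands in for Guava's Table.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class WordDocTable {

    // Hypothetical helper: word -> (document index -> count in that document).
    // Assumes the documents are lists, so each one can be looked up by index.
    static Map<String, Map<Integer, Long>> countByWordAndDoc(List<List<String>> docs) {
        return IntStream.range(0, docs.size())
                .boxed()
                // Tag every word with the index of the document it came from.
                .flatMap(i -> docs.get(i).stream()
                        .map(word -> new SimpleEntry<String, Integer>(word, i)))
                // Row = word, column = document index, value = 1L per occurrence,
                // with repeated (word, index) pairs merged by Long::sum.
                .collect(Collectors.groupingBy(
                        e -> e.getKey(),
                        Collectors.toMap(e -> e.getValue(), e -> 1L, Long::sum)));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("how", "are", "you", "doing", "doing", "doing"),
                Arrays.asList("what", "what", "you", "upto"),
                Arrays.asList("how", "are", "what", "how"));
        countByWordAndDoc(docs)
                .forEach((word, counts) -> System.out.println(word + " -> " + counts));
    }
}
```

Calling values() on each inner map gives the per-document counts without the indices, which is the same step Maps.transformValues performs on rowMap() above.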

- fps

1

You could replace Pair with new SimpleEntry<>(word, i). Also note that the OP does not care which document index an entry came from: he does not want how={0=1, 2=2}, but how={1,2}. - Eugene

1

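@Eugene Thanks for the feedback. I have replaced Pair with SimpleEntry. Regarding the document index, you can call rowMap() on the table and then use Maps.transformValues to lazily turn the table into what the OP needs. - fps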

1

Here is a simple solution using AbacusUtil:

Map<String, List<Integer>> m = Stream.of(doc1, doc2, doc3)
          .flatMap(d -> d.toMultiset().stream()).collect(Collectors.toMap2());

- user_3380739

- Eugene · Accepted Answer

Map<String, List<Long>> map = docs
        .flatMap(inner -> inner
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet()
                .stream())
        .collect(Collectors.groupingBy(
                Entry::getKey,
                Collectors.mapping(Entry::getValue, Collectors.toList())));

System.out.println(map);

// {upto=[1], how=[1, 2], doing=[3], what=[2, 1], are=[1, 1], you=[1, 1]}
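
If the Multiset semantics are still wanted without Guava, one possible stand-in is a map from each count to the number of documents that produced it. The following is a minimal sketch, assuming the documents are available as lists; CountMultisetSketch and countMultisets are hypothetical names chosen for this example, not part of any library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class CountMultisetSketch {

    // Hypothetical helper: word -> (per-document count -> number of documents with that count).
    // The inner map plays the role of a Multiset of counts.
    static Map<String, Map<Long, Long>> countMultisets(List<List<String>> docs) {
        return docs.stream()
                // Count the words inside each document, then stream the (word, count) entries.
                .flatMap(doc -> doc.stream()
                        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                        .entrySet()
                        .stream())
                // Group by word, then tally how often each count value occurs across documents.
                .collect(Collectors.groupingBy(
                        e -> e.getKey(),
                        Collectors.groupingBy(e -> e.getValue(), Collectors.counting())));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("how", "are", "you", "doing", "doing", "doing"),
                Arrays.asList("what", "what", "you", "upto"),
                Arrays.asList("how", "are", "what", "how"));
        countMultisets(docs)
                .forEach((word, counts) -> System.out.println(word + " -> " + counts));
    }
}
```

With the three sample documents, "are" maps to {1=2}, which corresponds to the multiset {1, 1} from the question: the same count from two different documents is kept, not suppressed.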