Java算法：通过多个情况标准将配对列表条目进行配对

Question

Java算法：通过多个情况标准将配对列表条目进行配对

javaalgorithmlist

4

我担心这不是一个简单的问题。我已经思考了很长时间，希望新的一批大脑能够更好地看待这个问题 - 让我们开始吧：

数据：

我们在这里使用的是包含多列的 CSV 文件，与此问题相关的是：

用户 ID（整数，范围从 3 到 8 位数字，存在具有相同 UserID 的多个条目）名单按此排序
查询（字符串）
时代（长，时代时间值）
clickurl（字符串）

我们在这里处理的每个数据条目都具有这些属性的 !null 值。

示例数据：

SID,UID,query,rawdate,timestamp,timegap,epoc,lengthwords,lengthchars,rank,clickurl
5,142,westchester.gov,2006-03-20 03:55:57,Mon Mar 20 03:55:57 CET 2006,0,1142823357504,1,15,1,http://www.westchestergov.com
10,142,207 ad2d 530,2006-04-08 01:31:14,Sat Apr 08 01:31:14 CEST 2006,10000,1144452674507,3,12,1,http://www.courts.state.ny.us
11,142,vera.org,2006-04-08 08:38:42,Sat Apr 08 08:38:42 CEST 2006,11000,1144478322507,1,8,1,http://www.vera.org

注意：存在多个具有相同“Epoc”值的条目，这是由于用于收集数据的工具造成的

注意2：列表的大小约为700000，仅供参考

目标：匹配具有相同查询的条目对

范围：共享相同UserID的条目

由于数据收集过程中提到的异常情况，必须考虑以下内容：

如果两个条目共享“Query”和“Epoc”的相同值，则必须检查以下标准的列表元素，直到下一个条目具有这些属性之一的不同值。共享相同查询和Epoc值的条目组应视为一个条目，因此为了匹配一对，必须找到另一个与“Query”值匹配的条目。由于没有更好的名称，让我们称共享相同查询和Epoc值的组为“链”

现在这个问题解决了，它变得更容易了，我们可以从中获得3种类型的配对组合：

条目&条目
条目&链
链&链

这里的类型1仅表示列表中具有相同“Query”值但不具有相同“Epoc”的两个条目。

因此，这总结了相等查询对

还有一种情况是不同查询对，可以描述为以下内容：

在我们匹配相等查询对之后，可能存在未与其他条目配对的条目，因为它们的查询不匹配-每个未配对的条目都是称为“不同查询”的集合的一部分

此集合的成员必须无需遵循任何标准进行配对，但仍将链视为一对条目。

至于一般的配对，可能没有冗余的配对-单个条目可以成为n多个配对的一部分，但两个独立的条目只能形成一对。

示例：

下面的条目需要配对

UID,Query,Epoc,Clickurl
772,Donuts,1141394053510,https://www.dunkindonuts.com/dunkindonuts/en.html
772,Donuts,1141394053510,https://www.dunkindonuts.com/dunkindonuts/en.html
772,Donuts,1141394053510,https://www.dunkindonuts.com/dunkindonuts/en.html
772,raspberry pi,1141394164710,http://www.raspberrypi.org/
772,stackoverflow,1141394274810,http://en.wikipedia.org/wiki/Buffer_overflow
772,stackoverflow,1141394274850,http://www.stackoverflow.com
772,tall women,1141394275921,http://www.tallwomen.org/
772,raspberry pi,1141394277991,http://www.raspberrypi.org/
772,Donuts,114139427999,http://de.wikipedia.org/wiki/Donut
772,stackoverflow,1141394279999,http://www.stackoverflow.com
772,something,1141399299991,http:/something.else/something/

在这个例子中，donuts是一家连锁店，因此成对出现的行号为（没有标题的行号）：

相等的查询对：（1-3，9）（4，8）（5，6）（5，10）（6，10）
不同的查询对：（7、11）

我对这个问题的失败尝试：

我开发的算法解决这个问题遵循以下步骤：

迭代条目列表，直到UserID的值改变。

然后，应用于一个单独的列表，该列表仅包含共享相同UserID的刚刚迭代的元素：

   for (int i = 0; i < list.size(); i++) {
            Entry tempI = list.get(i);
            Boolean iMatched = false;
            //boolean to save whether or not c1 is set
            Boolean c1done = false;
            Boolean c2done = false;

        //Hashsets holding the clickurl values of the entries that form a pair
        HashSet<String> c1 = null;
        HashSet<String> c2 = null;

        for (int j = i + 1; j < list.size(); j++) {
            Entry tempJ = list.get(j);
            // Queries match
            if (tempI.getQuery().equals(tempJ.getQuery())) {
                // wheter or not Entry at position i has been matched or not
                if (!iMatched) {
                    iMatched = true;
                }
                HashSet<String> e1 = new HashSet<String>();
                HashSet<String> e2 = new HashSet<String>();
                int k = 0;
                // Times match
                HashSet<String> chainset = new HashSet<String>();
                if (tempI.getEpoc() == tempJ.getEpoc()) {
                    chainset.add(tempI.getClickurl());
                    chainset.add(tempJ.getClickurl());
                } else {
                    e1.add(tempI.getClickurl());
                    if (c1 == null) {
                        c1 = e1;
                        c1done = true;
                    } else {
                        if (c2 == null) {
                            c2 = e1;
                            c2done = true;
                        }
                    }
                }
                //check how far the chain goes and get their entries
                if ((j + 1) < list.size()) {
                    Entry tempjj = list.get(j + 1);
                    if (tempjj.getEpoc() == tempJ.getEpoc()) {
                        k = j + 1;
                        //search for the end of the chain
                        while ((k < list.size())
                                && (tempJ.getQuery().equals(list.get(k)
                                        .getQuery()))
                                && (tempJ.getEpoc() == list.get(k).getEpoc())) {

                            chainset.add(tempJ.getClickurl());
                            chainset.add(list.get(k).getClickurl());
                            k++;

                        }
                        j = k + 1; //continue the iteration at the end of the chain
                        if (c1 == null) {
                            c1 = chainset;
                            c1done = true;
                        } else {
                            if (c2 == null) {
                                c2 = chainset;
                                c2done = true;
                            }
                        }

                        // Times don't match
                    }
                } else {
                    e2.add(tempJ.getClickurl());
                    if (c1 == null) {
                        c1 = e2;
                        c1done = true;
                    } else {
                        if (c2 == null) {
                            c2 = e2;
                            c2done = true;
                        }
                    }
                }

                /** Block that compares the clicks in the Hashsets and computes the resulting data
                *  left out for now to not make this any more complicated than it already is
                **/

                // Queries don't match
            } else {
                if (!dq.contains(tempJ)) { //note: dq is an ArrayList holding the entries of the differen query set
                    dq.add(tempJ);
                }
            }

            if (j == al.size() - 1) {
                if (!iMatched) {
                    dq.add(tempI);
                }
            }
        }
        if (dq.size() >= 2) {

            for (int z = 0; z < dq.size() - 1; z++) {
                if (dq.get(z + 1) != null) {
                    /** Filler, iterate dq just like the normal list with two loops
                    *
                    **/
                }
            }
        }

    }

因此，我使用了大量循环来匹配这些对，导致运行时间非常长，直到现在我都没有看到它的结束。

好的，我希望我没有忘记任何关键信息，我稍后会添加进一步需要的信息。如果你已经读到这里，感谢你的阅读 - 希望你有一个能帮助我的想法。

- deemel

3

你能否提供一个带有一些数据的例子？因为我很难理解这个问题。 - nullpotent

2

我在阅读这个问题和理解场景时遇到了困难...你能否用两句话解释一下你想要实现什么？需要进行什么样的比较？ - RonK

为列表数据和键值对添加了示例。@Ronk，比较不是问题，配对才是 - 因此更改了目标，可能会误导。 - deemel

每行的分隔符是什么？ - Kneel-Before-ZOD

一个制表符，但是CSV文件中的数据被读入Java中的ArrayList以进行处理。 - deemel

2个回答

0

首先，从每个链中删除除一个条目以外的所有条目。为此，请按（userid、query、epoch）排序，删除重复项。

然后，扫描已排序的列表。获取（userid、query）对的所有条目。如果只有一个，请将其保存在稍后处理的列表中，否则发出所有对。

对于为给定用户保存以供稍后处理的所有条目（这些是类型2和3），请发出对。

- maniek

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Dylan · Accepted Answer

使用SQL将数据导入数据库，然后执行查询。你的文本文件太大了，难怪需要这么长时间来处理它。 :)