How can I check whether an XML file contains consecutive nodes?

Question

3

I have some XML files that look like this:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="jats-html.xsl"?>
<!--<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD with OASIS Tables v1.0 20120330//EN" "JATS-journalpublishing-oasis-article1.dtd">-->
<article article-type="proceedings" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://www.niso.org/standards/z39-96/ns/oasis-exchange/table">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id" />
<journal-title-group>
<journal-title>Eleventh &#x0026; Tenth International Conference on Correlation Optics</journal-title>
</journal-title-group>
<issn pub-type="epub">0277-786X</issn>
<publisher>
<publisher-name>Springer</publisher-name>
</publisher>
</journal-meta>
<fig-count count="0" />
<table-count count="0" />
<equation-count count="0" />
</front>
<body>
<sec id="s1">
<label>a.</label>
<title>INTRODUCTION</title>
<p>One of approaches of solving<xref ref-type="bibr" rid="ref11">[11]</xref>, <xref ref-type="bibr" rid="ref13">[13]</xref>, <xref ref-type="bibr" rid="ref8">[8]</xref> the problem <xref ref-type="bibr" rid="ref1">[1]</xref>, <xref ref-type="bibr" rid="ref5">[2]</xref>, <xref ref-type="bibr" rid="ref6">[6]</xref> <xref ref-type="bibr" rid="ref7">[6]</xref> of light propagation in scattering media is the method of Monte Carlo statistical simulation<sup><xref ref-type="bibr" rid="c1">1</xref>–<xref ref-type="bibr" rid="c5">5</xref></sup>. It is a set of techniques that allow us to find the necessary solutions by repetitive random sampling. Estimates of the unknown quantities are statistical means.</p>
<p>For the case of radiation transport in scattering <xref ref-type="bibr" rid="ref6">6</xref> <xref ref-type="bibr" rid="ref8">8</xref> <xref ref-type="bibr" rid="ref9">9</xref> <xref ref-type="bibr" rid="ref10">10</xref> medium Monte Carlo method consists in repeated calculation of the trajectory <xref ref-type="bibr" rid="ref7">6</xref> <xref ref-type="bibr" rid="ref7">7</xref> <xref ref-type="bibr" rid="ref8">8</xref> <xref ref-type="bibr" rid="ref9">[9]</xref> of a photon in a medium based on defined environment parameters. Application of Monte Carlo method is based on the use of macroscopic optical properties of the medium which are considered homogeneous within small volumes of tissue. Models that are based on this method can be divided into two types: models that take into account the polarization of the radiation, and models that ignore it.</p>
<p>Simulation that is based on the previous models usually discards the details of the radiation energy distribution within a single scattering particle. This disadvantage can be ruled out (in the case of scattering particles whose size exceeds the wavelength) by using another method - reverse ray tracing. This method is like the one mentioned before on is based on passing a large number of photons through a medium that is simulated. The difference is that now each scattering particle has a certain geometric topology and scattering is now calculated using the Fresnel equations. The disadvantage of this method is that it can give reliable results only if the particle size is much greater than the wavelength (at least an order of magnitude).</p>
</sec>
</body>
</article>

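The file contains link nodes of the form <xref ref-type="bibr" rid="ref...">...</xref>. How can I find runs of three or more consecutive link nodes (separated by a comma and a space, or by a single space) and write them to a txt file?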

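I can run a regex search such as (?:<xref type="bibr" rid="ref\d+">\[\d+\]</xref>\s*,\s*){2,}<xref type="bibr" rid="ref\d+">\[\d+\]</xref>, which finds 3 or more link nodes separated by "comma and space" or "space", but it does not require their ids to be consecutive. How should I do this?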

- Don_B

I have updated the code. It should now pass all of your test cases. - Kent Kostelac

3 Answers

1

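My XPath is a bit rusty, and I'm sure you can write a better expression than the one below. A better XPath would select only nodes that have 3 or more xref children of type bibr whose rid starts with ref. In any case, here is a solution that gets the nodes you need.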

public static void Main(string[] args)
{
    XmlDocument doc = new XmlDocument();
    doc.Load("article.xml");

    XmlNodeList nodes = doc.DocumentElement.SelectNodes("//xref[@ref-type='bibr' and starts-with(@rid,'ref')]/parent::*");

    foreach(XmlNode x in nodes)
    {
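        // Relative path: count only this node's own xref children.
        // (A leading "//" would search the whole document for every node.)
        XmlNodeList temp = x.SelectNodes("xref[@ref-type='bibr' and starts-with(@rid,'ref')]");
        //we only select those that have 3 or more references.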
        if (temp.Count >= 3)
        {
            Console.WriteLine(x.InnerText);
        }
    }

    Console.ReadKey();

}

Edit: I experimented a little. The code below uses an updated XPath that should get everything you want.

public static void Main(string[] args)
{
    XmlDocument doc = new XmlDocument();
    doc.Load("article.xml");

    XmlNodeList nodes = doc.DocumentElement.SelectNodes("//*[count(xref[@ref-type='bibr' and starts-with(@rid,'ref')])>2]");

    foreach(XmlNode x in nodes){
        Console.WriteLine(x.InnerText);
    }

    Console.ReadKey();

}

- Kent Kostelac

I want the output to look like this:

<xref ref-type="bibr" rid="ref2">[2]</xref>, <xref ref-type="bibr" rid="ref3">[3]</xref>, <xref ref-type="bibr" rid="ref4">[4]</xref>

and

<xref ref-type="bibr" rid="rid6">8</xref> <xref ref-type="bibr" rid="rid6">9</xref> <xref ref-type="bibr" rid="rid10">10</xref>

Your code only extracts the content of the <p> nodes in the file. - Don_B

My code gives you the <p> tags that have 3 or more matching child elements. Don't worry, I will modify the code. Give me a moment. - Kent Kostelac

1

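Regular expressions are not well suited to hierarchical syntax. I would write C# code that reads the XML and tracks the number of consecutive xref nodes separated only by ", " or " ".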

  // Requires: using System; using System.Linq; using System.Text;
  static void Main(string[] args)
  {
     using (var xmlStream = System.Reflection.Assembly.GetExecutingAssembly().GetManifestResourceStream("ConsoleApp1.XMLFile1.xml"))
     {
        int state = 0; // 0 = look for the first xref; odd = xref seen, look for a separator; even (> 0) = separator seen, look for the next xref
        string[] simpleSeparators = { " ", ", " };
        string rid = "0";
        StringBuilder nodeText = new StringBuilder();
        string[] consecutiveNodes = new string[3];

        System.Xml.XmlReaderSettings settings = new System.Xml.XmlReaderSettings();
        settings.DtdProcessing = System.Xml.DtdProcessing.Ignore;
        using (var reader = System.Xml.XmlReader.Create(xmlStream, settings))
        {
           while (reader.Read())
           {
              if (reader.IsStartElement("xref"))
              {
                 nodeText.Append("<xref");
                 if (reader.HasAttributes)
                 {
                    while (reader.MoveToNextAttribute())
                       nodeText.AppendFormat(" {0}=\"{1}\"", reader.Name, reader.Value);
                 }
                 nodeText.Append(">");
                 string nextRid = reader.GetAttribute("rid");
                 switch (state)
                 {
                    case 0:
                       break;
                    case 2:
                    case 4:
                       if (Math.Abs(GetIndex(nextRid) - GetIndex(rid)) > 1)
                          state = 0;
                       break;
                 }
                 state++;
                 rid = nextRid;
              }
              else if (reader.NodeType == System.Xml.XmlNodeType.Text)
              {
                 if (state > 0)
                    nodeText.Append(reader.Value);
                 if ((state % 2 == 1) && simpleSeparators.Contains(reader.Value))
                       state++;
              }
              else if ((reader.NodeType == System.Xml.XmlNodeType.EndElement) && (state > 0))
              {
                 nodeText.AppendFormat("</{0}>", reader.Name);
                 consecutiveNodes[state / 2] = nodeText.ToString();
                 nodeText.Clear();
                 if (state > 3)
                 {
                    Console.WriteLine("{0}{1}{2}", consecutiveNodes[0], consecutiveNodes[1], consecutiveNodes[2]);
                    state = 0;
                 }
              }
              else if (reader.IsStartElement())
              {
                 nodeText.Clear();
                 state = 0;
              }
           }
        }
     }
  }

  static int GetIndex(string rid)
  {
     int start = rid.Length;
     while ((start > 0) && Char.IsDigit(rid, --start)) ;

     start++;
     if (start < rid.Length)
        return int.Parse(rid.Substring(start));
     return 0;
  }

Running this code on your sample data outputs:

<xref ref-type="bibr" rid="ref2">[2]</xref>, <xref ref-type="bibr" rid="ref3">[3]</xref>, <xref ref-type="bibr" rid="ref4">[4]</xref>
<xref ref-type="bibr" rid="rid6">6</xref><xref ref-type="bibr" rid="rid6">9</xref><xref ref-type="bibr" rid="rid6">10</xref>

I updated the code to exclude the following:

<xref ref-type="bibr" rid="ref11">[11]</xref>, <xref ref-type="bibr" rid="ref13">[13]</xref>, <xref ref-type="bibr" rid="ref8">[8]</xref>

because ref11, ref13 and ref8 are not consecutive ids, as your question requires.

- BlueMonkMN

I want the output to look like this

<xref ref-type="bibr" rid="ref2">[2]</xref>, <xref ref-type="bibr" rid="ref3">[3]</xref>, <xref ref-type="bibr" rid="ref4">[4]</xref>

,

<xref ref-type="bibr" rid="rid6">8</xref> <xref ref-type="bibr" rid="rid6">9</xref> <xref ref-type="bibr" rid="rid10">10</xref>

- Don_B

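Wouldn't LINQ2XML be a better choice for this problem? - Bumba

@Don_B I believe the updated code already does what you asked. - BlueMonkMN

@Bumba I prefer to optimize for performance, and I think this performs better than a solution that loads the entire XML document into memory, builds a structured document, and runs additional XPath queries against it. - BlueMonkMN

@BlueMonkMN Where do you put the xml file? I have put mine at a location on the D drive, for example D:\Test\test.xml. - Don_B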

- Kent Kostelac · Accepted Answer

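To meet your requirements, here is my solution. I have not thoroughly tested for possible duplicates, that is, cases where some references are merely a subset of a previous result. Ordering, however, should not be a problem.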

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Text.RegularExpressions;


public static void Main(string[] args)
{
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.Load("article.xml");

    //only selects <p>'s that already have 3 or more refs. No need to check paragraphs that don't even have enough refs
    XmlNodeList nodes = doc.DocumentElement.SelectNodes("//*[count(xref[@ref-type='bibr' and starts-with(@rid,'ref')])>2]");

    List<string> results = new List<string>();

    //Foreach <p>
    foreach (XmlNode x in nodes)
    {
        XmlNodeList xrefs = x.SelectNodes(".//xref[@ref-type='bibr' and starts-with(@rid,'ref')]");
        List<StartEnd> startEndOfEachTag = new List<StartEnd>(); // we mark the start and end of each ref.
        string temp = x.OuterXml; //the paragraph we're checking

        //finds the start and end offset of each xref tag within the paragraph markup
        foreach (XmlNode xN in xrefs){
            StartEnd se = new StartEnd(temp.IndexOf(xN.OuterXml), temp.IndexOf(xN.OuterXml) + xN.OuterXml.Length);
            startEndOfEachTag.Add(se);  
        }

        /* This comment shows the regex command used and how we build the regular expression we are checking with.
        string regexTester = Regex.Escape("<xref ref-type=\"bibr\" rid=\"ref2\">2</xref>")+"([ ]|(, ))" + Regex.Escape("<xref ref-type=\"bibr\" rid=\"ref3\">3</xref>");
        Match matchTemp = Regex.Match("<xref ref-type=\"bibr\" rid=\"ref2\">2</xref> <xref ref-type=\"bibr\" rid=\"ref3\">3</xref>", regexTester);
        Console.WriteLine(matchTemp.Value);*/

        //we go through all the xrefs
        for (int i=0; i<xrefs.Count; i++)
        {
            int newIterator = i; //This iterator prevents us from creating duplicates.
            string regCompare = Regex.Escape(xrefs[i].OuterXml); // The start xref

            int count = 1; //we got one xref to start with we need at least 3
            string tempRes = ""; //the string we store the result in

            int consecutive = Int32.Parse(xrefs[i].Attributes["rid"].Value.Substring(3));

            for (int j=i+1; j<xrefs.Count; j++) //we check with the other xrefs to see if they follow immediately after.
            {
                if(consecutive == Int32.Parse(xrefs[j].Attributes["rid"].Value.Substring(3)) - 1)
                {
                    consecutive++;
                }
                else { break; }

                regCompare += "([ ]|(, ))" + Regex.Escape(xrefs[j].OuterXml); //the xref must come directly after a space, or a comma and a space

                Match matchReg;

                try
                {
                    matchReg = Regex.Match(temp.Substring(startEndOfEachTag[i].start, startEndOfEachTag[j].end - startEndOfEachTag[i].start),
                        regCompare); //we get the result
                }
                catch
                {
                    i = j; // we failed and i should start from here now.
                    break;
                }

                if (matchReg.Success){
                    count++; //it was a success so we increment the number of xrefs we matched
                    tempRes = matchReg.Value; // we store it as our temporary result.
                    newIterator = j; //update where i should start from next time.
                }
                else {
                    i = j; // we failed and i should start from here now.
                    break;
                }
            }
            i = newIterator;
            if (count > 2)
            {
                results.Add(tempRes); 
            }
        }
    }
    Console.WriteLine("Results: ");
    foreach(string s in results)
    {
            Console.WriteLine(s+"\n");
    }

    Console.ReadKey();
}

The missing class

class StartEnd
{
    public int start=-1;
    public int end = -1;

    public StartEnd(int start, int end)
    {
        this.start = start;
        this.end = end;
    }
}
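
For comparison, the LINQ to XML route raised in the comments could look roughly like the sketch below. It is not taken from any of the answers. It assumes that a "run" means sibling xref elements with ref-type="bibr" that are separated by exactly " " or ", " and whose rid numbers increase by one. The names XrefRuns, FindRuns, RidNumber, the inline sample paragraph and the output file name consecutive-xrefs.txt are all hypothetical. The document has to be parsed with LoadOptions.PreserveWhitespace, otherwise the single-space separators are dropped.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class XrefRuns
{
    // Numeric suffix of an rid such as "ref12"; -1 when there is none.
    static int RidNumber(XElement xref)
    {
        Match m = Regex.Match((string)xref.Attribute("rid") ?? "", @"\d+$");
        return m.Success ? int.Parse(m.Value) : -1;
    }

    static bool IsBibrXref(XNode node)
    {
        return node is XElement e
            && e.Name.LocalName == "xref"
            && (string)e.Attribute("ref-type") == "bibr";
    }

    // Stores the current run if it is long enough, then empties it.
    static void Flush(List<XNode> run, int count, int minRun, List<string> results)
    {
        if (count >= minRun)
            results.Add(string.Concat(run.Select(n => n.ToString(SaveOptions.DisableFormatting))));
        run.Clear();
    }

    // Returns the markup of every run of minRun or more consecutive sibling xrefs.
    public static List<string> FindRuns(XElement root, int minRun = 3)
    {
        var results = new List<string>();
        foreach (XElement parent in root.DescendantsAndSelf())
        {
            var run = new List<XNode>(); // xrefs and separators of the current run
            int count = 0;               // number of xrefs in the current run
            XElement last = null;        // most recent xref, if a run is open
            XText sep = null;            // separator seen after 'last', if any
            foreach (XNode node in parent.Nodes())
            {
                if (IsBibrXref(node))
                {
                    var xr = (XElement)node;
                    bool extends = last != null && sep != null
                        && RidNumber(last) >= 0
                        && RidNumber(xr) == RidNumber(last) + 1;
                    if (extends) run.Add(sep);
                    else { Flush(run, count, minRun, results); count = 0; }
                    run.Add(xr);
                    count++;
                    last = xr;
                    sep = null;
                }
                else if (node is XText t && last != null && sep == null
                         && (t.Value == " " || t.Value == ", "))
                {
                    sep = t; // only kept if the next node extends the run
                }
                else
                {
                    // Any other text or element breaks the run.
                    Flush(run, count, minRun, results);
                    count = 0; last = null; sep = null;
                }
            }
            Flush(run, count, minRun, results);
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical sample paragraph; replace with XElement.Load(path, LoadOptions.PreserveWhitespace).
        string xml = "<p>a <xref ref-type=\"bibr\" rid=\"ref1\">[1]</xref>, "
                   + "<xref ref-type=\"bibr\" rid=\"ref2\">[2]</xref>, "
                   + "<xref ref-type=\"bibr\" rid=\"ref3\">[3]</xref> b "
                   + "<xref ref-type=\"bibr\" rid=\"ref11\">[11]</xref>, "
                   + "<xref ref-type=\"bibr\" rid=\"ref13\">[13]</xref>, "
                   + "<xref ref-type=\"bibr\" rid=\"ref8\">[8]</xref></p>";
        XElement root = XElement.Parse(xml, LoadOptions.PreserveWhitespace);
        List<string> runs = FindRuns(root);
        File.WriteAllLines("consecutive-xrefs.txt", runs); // one run per line
        foreach (string r in runs) Console.WriteLine(r);
    }
}
```

On the inline sample this should report only the ref1, ref2, ref3 run and skip ref11, ref13, ref8. Like the XmlDocument answers, it loads the whole document into memory, so the XmlReader approach is likely still the better fit for very large files.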