Get the text surrounding a specific element reference

9
Given an HTML snippet like this:

<p>Lorem ipsum <mark>dolor</mark> sit amet. <mark>Lorem</mark> ipsum again
   and <mark>dolor</mark></p>

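I can select the <mark> elements with $("mark"). I would like to get a list of strings, each made up of the marked word plus the 5 characters to its left and the 5 characters to its right, with [...] added as a prefix and suffix.
For this example, it should be: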
[
   "[...] psum dolor sit [...]",
   "[...] met. Lorem ipsu [...]",
   "[...] and dolor [...]",
]

This is what I have so far:

var $highlightMarks = $("mark");
var results = [];

for (var i = 0; i < $highlightMarks.length; ++i) {
  var $c = $highlightMarks.eq(i);
  var text = $c.parent().text().trim().replace(/\n/g, " ");
  var indexStart = new RegExp($c.html(), "gim").exec(text).index;
  text = "[...] " + text.substring(indexStart - 5, $c.html().length + indexStart + 5) + " [...]";
  results.push(text);
}

alert(JSON.stringify(results))
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<p>Lorem ipsum <mark>dolor</mark> sit amet. <mark>Lorem</mark> ipsum again and <mark>dolor</mark>.</p>

However, this approach fails when the same word appears twice in one paragraph (dolor in this example).

The last element of the array should be and dolor, not psum dolor sit.
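The cause can be reproduced without the DOM. This is a minimal sketch, assuming the paragraph has already been flattened to plain text (sampleText, firstLookup and secondLookup are illustrative names, not part of the snippet above):

```javascript
// Plain-text version of the paragraph from the question (tags removed).
var sampleText = 'Lorem ipsum dolor sit amet. Lorem ipsum again and dolor.';

// A freshly constructed RegExp always starts searching at lastIndex 0,
// so every lookup for the same word reports the first occurrence.
var firstLookup = new RegExp('dolor', 'gim').exec(sampleText).index;
var secondLookup = new RegExp('dolor', 'gim').exec(sampleText).index;

console.log(firstLookup, secondLookup); // both are 12
```

Searching by the word alone therefore cannot tell the two dolor marks apart; the lookup has to start from the element itself or from its position in the markup.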

So, given a reference to a <mark> element, what is the correct way to get some of the text to its left and some of the text to its right?


What should the output be? - Praveen Kumar Purushothaman
4 Answers

3
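This is a two-step implementation that uses only regular expressions and is meant to be robust (counterexamples are welcome). It does not depend on the container tag (for example <p>...</p>) and extracts the text around each mark.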

var filter = /<(?![/]?mark)[^><]*>/gi;

var regex  = /((?:(?!<[/]mark\s*>).){0,5})<mark\s*>([^<]*)<[/]mark\s*>(?=((?:(?!<mark\s*>).){0,5}))/ig;
var subst  = "$1 $2 $3";

var tests  = ['<p>Lorem ipsum mark> <MARK  >dolor</MARK > < mark sitamet. <mark>Lorem</mark> ipsum again and <mark>dolor</mark>.</p>','<P style="margin: 0 15px 15px 0;">um <mark>dolor</mark> sit amet. <mark>Lorem</mark> ipsum again and <mark>dolor</mark>.</P>','<p>um <mark>dolor</mark> <span>sit</ span> <test amet. <mark>Lorem</mark> <b>i</b>psum again and <mark>dolor</mark>.</p>','<p style="margin: 0 15px 15px 0;" another_tag="123">Lorem ipsum <MARK  >dolor</MARK > sit <mark>amet.</mark><mark>Lorem</mark> ipsum again and <mark>dolor</mark>.</P>'];


var t, r, pre, marked, post;

while ((t = tests.pop())) {

    document.write('<b>INPUT</b> <xmp>' + t + '</xmp>');

    // Step 1: strip every tag except <mark> and </mark>
    t = t.replace(filter, '');
    document.write('<b>Filtered:</b> <xmp>' + t + '</xmp>');

    // Step 2: collect each marked word with up to 5 chars of context
    while ((r = regex.exec(t)) !== null) {

        pre = r[1]; marked = r[2]; post = r[3];
        document.write('<b>Match:</b> "' + pre + ' <mark>' + marked + '</mark> ' + post + '"<hr/>');
    }
}

How it works



  1. Filter out every tag that is not a <mark> or </mark> tag. Matching is case-insensitive and tolerant of whitespace in the same way Chrome and Firefox are: the regex also accepts the variants <mark > and </mark > as valid tags, but not < mark> or </ mark>:

    /<(?![/]?mark)[^><]*>/gi
    


    NOTE: this filter handles lone '<' and '>' characters correctly (with or without text before or after them).

    This differs from how a browser treats the opening character <: a browser drops everything from <someText up to the next >, which swallows the valid tag that follows. I prefer not to do that and treat an unclosed '<' as an ordinary character.

    e.g.: Some text <notAtag other text <mark>marked</mark>. Chrome or Firefox will output Some text marked (where marked is not actually highlighted, because the <mark> tag has been discarded together with <notAtag other text).


  2. Select the marked text and its context (up to 5 characters on each side)

    /((?:(?!<[/]mark\s*>).){0,5}) #* 0 to 5 chars, none of which starts a
                                  #  closing '</mark\s*>' tag;
                                  #  the round brackets save them in group $1
    <mark\s*>                     #* literal string '<mark' followed by
                                  #  0 or more whitespace chars, then literal '>'
    ([^<]*)                       #* 0 or more chars that are not '<';
                                  #  the round brackets save them in group $2
    <[/]mark\s*>                  #* literal string '</mark' followed by
                                  #  0 or more whitespace chars, then literal '>'
    (?=((?:(?!<mark\s*>).){0,5})) #* 0 to 5 chars, none of which starts an
                                  #  opening '<mark\s*>' tag;
                                  #  lookahead (?=...) used to not consume them;
                                  #  the round brackets save them in $3
    
    /ig                           #* i: case-insensitive, g: global search
    


    NOTE: When two marks are close together, the same characters can serve as context for both (e.g. in </mark>12345<mark>, 12345 is both the post context of the closing tag and the pre context of the opening tag).

    In addition, the context never extends across a <mark> tag, so:

    • where two tags are adjacent (...</mark><mark>...), nothing is selected as post/pre context;
    • in </mark>123<mark>, only 123 is selected as post/pre context.
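The two steps can also be wrapped into one function that returns data instead of writing to the page. This is only a sketch: markContexts is a hypothetical helper name, and it reuses the filter and regex from above unchanged.

```javascript
// Sketch only: markContexts is a hypothetical helper, not part of the answer.
function markContexts(html) {
  // Step 1: strip every tag except <mark> and </mark>.
  var filtered = html.replace(/<(?![/]?mark)[^><]*>/gi, '');

  // Step 2: same pattern as above; created here so lastIndex starts at 0.
  var regex = /((?:(?!<[/]mark\s*>).){0,5})<mark\s*>([^<]*)<[/]mark\s*>(?=((?:(?!<mark\s*>).){0,5}))/ig;

  var results = [];
  var r;
  while ((r = regex.exec(filtered)) !== null) {
    results.push({ pre: r[1], marked: r[2], post: r[3] });
  }
  return results;
}

var contexts = markContexts(
  '<p>Lorem ipsum <mark>dolor</mark> sit amet. <mark>Lorem</mark> ipsum again and <mark>dolor</mark>.</p>'
);
console.log(JSON.stringify(contexts));
```

Each entry keeps pre, marked and post separate, so the caller can decide how to join them, for example '[...] ' + c.pre + c.marked + c.post + ' [...]'.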

1
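You can do this with a simple regular expression. ([\w\s.]{5}) matches the 5 characters before the opening mark tag, ([\w\s]{5})? optionally matches the 5 characters after the closing mark tag, and [^<]+ matches whatever is between the mark tags.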

var myText = $('p').html();
var reg = new RegExp("([\\w\\s.]{5})<mark>([^<]+)</mark>([\\w\\s]{5})?", "g");
var match = null, matches = [];

while ((match = reg.exec(myText)) !== null) {
    var match3 = (typeof match[3] == 'undefined') ? '' : match[3];
    matches.push('[...] ' + match[1] + ' ' + match[2] + ' ' + match3 + '[...]');
}

alert(matches.toString());
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<p>Lorem ipsum <mark>dolor</mark> sit amet. <mark>Lorem</mark> ipsum again
   and <mark>dolor</mark></p>

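The match array has 4 elements.

The first (match[0]) contains the whole match.

The second (match[1]) contains what the first pair of parentheses matched, i.e. the 5 characters before the opening mark tag.

The third (match[2]) contains what the second pair of parentheses matched, i.e. the text between the opening and closing mark tags.

The fourth (match[3]) contains the 5 characters after the closing mark tag.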
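To make the groups concrete, here is a small sketch that runs the same pattern, written as a regex literal, against short strings (contextPattern, full, noSuffix and noPrefix are illustrative names). It also shows two limits of the fixed {5} quantifier that are worth keeping in mind.

```javascript
// Same pattern as above, written as a regex literal (no "g" flag needed here).
var contextPattern = /([\w\s.]{5})<mark>([^<]+)<\/mark>([\w\s]{5})?/;

var full = contextPattern.exec('Lorem ipsum <mark>dolor</mark> sit amet.');
console.log(full[0]); // 'psum <mark>dolor</mark> sit '
console.log(full[1]); // 'psum '
console.log(full[2]); // 'dolor'
console.log(full[3]); // ' sit '

// Fewer than 5 allowed characters after the closing tag: group 3 is undefined.
var noSuffix = contextPattern.exec('Lorem ipsum <mark>dolor</mark>.');
console.log(noSuffix[3]); // undefined

// Fewer than 5 characters before the opening tag: {5} cannot match at all.
var noPrefix = contextPattern.exec('um <mark>dolor</mark> sit amet.');
console.log(noPrefix); // null
```

If marks close to the start of the text need to be found as well, a bounded quantifier such as {0,5} in the first group is one possible adjustment.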


Why the downvote? It gives the desired output... please give some reason :/ - DrGeneral
1
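Unless the downvoter explains, I consider the downvote invalid, so +1 from me. Note that using the RegExp constructor is not a good idea when the pattern is not dynamic. Also, you do not need to escape "/" inside the constructor notation. - Wiktor Stribiżew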

1
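This can be done with jQuery's contents() function, which lets us pick out the text fragments inside an element by index. Have a look at the code below; I worked out the logic and implemented it.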

$(document).ready(function(){
  
  var marks=$('mark')//get all the mark elements
  var j=0;
  for(var i=0;i<marks.length;i++){
  
    var markText=marks[i].textContent  //get text from each mark element
    
    var content1=$("p").contents().eq(j).text()
    //alert("content1"+content1)
    content1=content1.substr(content1.length - 5)
    
    j=j+2
    var content2=$("p").contents().eq(j).text()
    //alert("content2"+content2)
    content2=content2.substr(0,5)
    
    var final="[...] "+content1+markText+content2+" [...] "
    
    //alert(final)//you can push this final result into array or something u want
  $('body').append("<br>"+final)
  }

})
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>

<p>Lorem ipsum <mark>dolor</mark> sit amet. <mark>Lorem</mark> ipsum again
   and <mark>dolor</mark></p>
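The loop above relies on the children of the <p> strictly alternating between text and <mark>. A variation that starts from each <mark> reference and reads its neighbouring text nodes through the standard DOM properties previousSibling and nextSibling might look like the sketch below. contextAround is a hypothetical helper name, and the sketch assumes that only directly adjacent text nodes count as context (text inside neighbouring elements is ignored).

```javascript
// Sketch only: contextAround is a hypothetical helper name.
// `mark` is a DOM element (or any object exposing the same properties).
function contextAround(mark, width) {
  var TEXT_NODE = 3; // value of Node.TEXT_NODE
  var prev = mark.previousSibling;
  var next = mark.nextSibling;

  // Only directly adjacent text nodes count as context.
  var before = prev && prev.nodeType === TEXT_NODE ? prev.nodeValue : '';
  var after = next && next.nodeType === TEXT_NODE ? next.nodeValue : '';

  return {
    pre: before.slice(Math.max(0, before.length - width)),
    marked: mark.textContent,
    post: after.slice(0, width)
  };
}
```

With jQuery this could be called as $('mark').map(function () { return contextAround(this, 5); }).get(), which works per element and so is not confused by repeated words.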


0

I have got this far. You can do it like this:

var results = $("p").clone().find("mark").html("1").after(" ").end().html().trim();
results = results.split(" <mark>1</mark> ");
results = results.map(Function.prototype.call, String.prototype.trim);
var final = [];
for (var i = 0; i < results.length; i ++) {
  if (i != results.length - 1)
    final.push(results[i].split(" ")[results[i].split(" ").length - 1]);
  if (i != 0)
    final.push(results[i].split(" ")[0]);
}
$("pre").text(JSON.stringify(final));
// alert(JSON.stringify(final))
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<p>Lorem ipsum <mark>dolor</mark> sit amet. <mark>Lorem</mark> ipsum again
   and <mark>dolor</mark></p>
<pre></pre>

