Parsing a large string literal into an array of objects in JS using regular expressions


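I'm a novice programmer, but I'm working hard to learn JavaScript. I'm currently working on a project in which I'm trying to parse a large text file (Shakespeare's 154 sonnets, found here) into an array of objects with the following data structure: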

var obj = {
    'property 1': [ 'value 1',
                    'value 2',
                  ],
    'property 2': [ 'value 1',
                    'value 2',
                  ],
};

Here, the Roman numerals represent the object's properties, and each line of a sonnet represents one value in that property's array.

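I have to use regular expressions to parse the text file. So far I've been looking for the right regular expression to split up the text, but I don't know whether I'm on the right track. Eventually I want to create a drop-down menu in which each value in the list is a sonnet.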

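Edit: I'm actually now getting the source text from this URL: http://pizzaboys.biz/xxx/sonnets.php

and then doing the same thing as above, but instead of using $get, I put the text into a variable...

I tried this: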

$(document).ready(function(){
    var data = new SonnetizerArray();
});

function SonnetizerArray(){
    this.data = [];
    var rawText = "text from above link"
    var rx = /^\\n[CDILVX]/$\\n/g;

    var array_of_sonnets = rawText.exec(rx);
    for (var i = 0; i < array_of_sonnets.length; i ++){
        var s = $.split(array_of_sonnets[i]);
        if (s.length > 0) this.data.push(s);
    }
}

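OK, please show what you've found or tried so far. You may be close, or you may not be; we can't tell without seeing your attempt. - Ian
Just added it to the original post. - Jake
1 Answer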


Description

This regular expression parses the text into a Roman numeral and a body. The body can then be split on the newline character \n.

^\s+\b([CDMLXVI]{1,12})\b(?:\r|\n|$).*?(?:^.*?)(^.*?)(?=^\s+\b([CDMLXVI]{1,12})\b(?:\r|\n|$)|\Z)


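Capture Groups

Group 0 gets the entire matched section

  1. Gets the Roman numeral
  2. Gets the body of the section, not including the Roman numeral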
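As a minimal sketch of how groups 1 and 2 could be turned into the object described in the question, one option is to loop over the matches and split each body on newlines. The function name `parseSonnets`, the simplified pattern, and the shortened sample string are assumptions made for illustration and are not part of the original answer. The pattern uses `[\s\S]` and `(?![\s\S])` because JavaScript has no `\Z`.

```javascript
// Hypothetical helper: maps each Roman numeral heading to an array of its lines.
function parseSonnets(text) {
  // Heading line (group 1), then a lazy body (group 2) that stops just before
  // the next heading line or at the end of the input.
  var re = /^[ \t]*([CDMLXVI]+)[ \t]*\r?\n([\s\S]*?)(?=^[ \t]*[CDMLXVI]+[ \t]*\r?$|(?![\s\S]))/gm;
  var sonnets = {};
  var match;
  while ((match = re.exec(text)) !== null) {
    sonnets[match[1]] = match[2]
      .split(/\r?\n/)                                         // one entry per line
      .map(function (line) { return line.trim(); })           // drop the indentation
      .filter(function (line) { return line.length > 0; });   // skip blank lines
  }
  return sonnets;
}

// Shortened sample in the same layout as the source text (assumed for the demo).
var sample = [
  "  VII",
  "",
  "  Lo! in the orient when the gracious light",
  "  Lifts up his burning head, each under eye",
  "",
  "",
  "  VIII",
  "",
  "  Music to hear, why hear'st thou music sadly?",
  "  Sweets with sweets war not, joy delights in joy:"
].join("\n");

var parsed = parseSonnets(sample);
console.log(Object.keys(parsed)); // [ 'VII', 'VIII' ]
console.log(parsed.VIII[0]);      // Music to hear, why hear'st thou music sadly?
```

Keying the object by the numeral keeps the shape asked for in the question. A line that consists only of Roman numeral letters would be treated as a heading, so the sketch relies on the layout shown in the sample text.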

JavaScript Code Example:

Sample text extracted from your link

  VII

  Lo! in the orient when the gracious light
  Lifts up his burning head, each under eye
  Doth homage to his new-appearing sight,


  VIII

  Music to hear, why hear'st thou music sadly?
  Sweets with sweets war not, joy delights in joy:
  Why lov'st thou that which thou receiv'st not gladly,
  Or else receiv'st with pleasure thine annoy?


  IX

  Is it for fear to wet a widow's eye,
  That thou consum'st thy self in single life?
  Ah! if thou issueless shalt hap to die,
  The world will wail thee like a makeless wife;

Sample Code

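<script type="text/javascript">
  // Flags: g finds every match, m lets ^ and $ match at line breaks.
  // [\s\S] replaces "." so the body can span lines, and (?![\s\S]) replaces \Z,
  // which JavaScript does not support.
  var re = /^\s+\b([CDMLXVI]{1,12})\b(?:\r|\n|$)[\s\S]*?(?:^[\s\S]*?)(^[\s\S]*?)(?=^\s+\b([CDMLXVI]{1,12})\b(?:\r|\n|$)|(?![\s\S]))/gm;
  var sourcestring = "source string to match with pattern";
  var results = [];
  var matches;
  // With the g flag, each exec() call resumes where the previous match ended.
  while ((matches = re.exec(sourcestring)) !== null) {
    results.push(matches);
    for (var j = 0; j < matches.length; j++) {
      console.log("results[" + (results.length - 1) + "][" + j + "] = " + matches[j]);
    }
  }
</script>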

Sample Output

$matches Array:
(
    [0] => Array
        (
            [0] =>   VII

  Lo! in the orient when the gracious light
  Lifts up his burning head, each under eye
  Doth homage to his new-appearing sight,

            [1] => 

  VIII

  Music to hear, why hear'st thou music sadly?
  Sweets with sweets war not, joy delights in joy:
  Why lov'st thou that which thou receiv'st not gladly,
  Or else receiv'st with pleasure thine annoy?

            [2] => 

  IX

  Is it for fear to wet a widow's eye,
  That thou consum'st thy self in single life?
  Ah! if thou issueless shalt hap to die,
  The world will wail thee like a makeless wife;
        )

    [1] => Array
        (
            [0] => VII
            [1] => VIII
            [2] => IX
        )

    [2] => Array
        (
            [0] => 
  Lo! in the orient when the gracious light
  Lifts up his burning head, each under eye
  Doth homage to his new-appearing sight,

            [1] => 
  Music to hear, why hear'st thou music sadly?
  Sweets with sweets war not, joy delights in joy:
  Why lov'st thou that which thou receiv'st not gladly,
  Or else receiv'st with pleasure thine annoy?

            [2] => 
  Is it for fear to wet a widow's eye,
  That thou consum'st thy self in single life?
  Ah! if thou issueless shalt hap to die,
  The world will wail thee like a makeless wife;
        )

    [3] => Array
        (
            [0] => VIII
            [1] => IX
            [2] => 
        )

)

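Roman Numeral Validation

The expression above only tests that the Roman numeral string is made up of Roman numeral characters; it does not check that the number itself is valid. If you also need to validate that the Roman numeral is correctly formed, you can use the following expression.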

^\s+\b(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))\b(?:\r|\n|$).*?(?:^.*?)(^.*?)(?=^\s+\b([CDMLXVI]{1,12})\b(?:\r|\n|$)|\Z)
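To see the validating part on its own, here is a minimal sketch that applies just the numeral sub-pattern to a single string. The helper name `isValidRoman` and the leading `(?=[MDCLXVI])` lookahead are assumptions added for illustration; the lookahead is there because every group in the sub-pattern is optional, so it would otherwise accept an empty string.

```javascript
// Hypothetical helper: true only for well-formed Roman numerals (1 to 4999).
function isValidRoman(numeral) {
  // Thousands, then hundreds, tens and units, each limited to its legal forms.
  var re = /^(?=[MDCLXVI])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$/;
  return re.test(numeral);
}

console.log(isValidRoman("CLIV")); // true  (154, the last sonnet)
console.log(isValidRoman("IIII")); // false (four I's in a row)
console.log(isValidRoman("VX"));   // false (V cannot precede X)
```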



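I'm guessing you tried to paste the sample text directly into your program as a multi-line string. If so, you need to include a newline character in the string and end each line with a \, which tells JavaScript to continue on the next line. - Ro Yo Mi
To show that this expression does work, see a working demo at http://regex101.com/r/fK4rR3, or this demo of the Roman numeral validation at http://regex101.com/r/cM6uR9 - Ro Yo Mi
I'm still confused about how to actually create the data structure above from the regular expression, i.e. an object with 154 properties and 154 arrays of 14 values each. - Jake
I would first convert the Roman numerals to decimal, since that makes them easier to sort, using something like http://blog.stevenlevithan.com/archives/javascript-roman-numeral-converter. Creating the multi-dimensional array is then as simple as shown in http://bytes.com/topic/javascript/answers/162679-how-define-array-array. - Ro Yo Mi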
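The conversion and sorting suggested in the last comment could look roughly like the sketch below. It is not the converter from the linked blog post; `romanToInt`, `byNumeral`, and the placeholder lines are hypothetical names and data used only for illustration, and the input is assumed to be an object keyed by well-formed Roman numerals.

```javascript
// Hypothetical helper: converts a well-formed Roman numeral to a number.
function romanToInt(numeral) {
  var values = { I: 1, V: 5, X: 10, L: 50, C: 100, D: 500, M: 1000 };
  var total = 0;
  for (var i = 0; i < numeral.length; i++) {
    var current = values[numeral.charAt(i)];
    var next = values[numeral.charAt(i + 1)] || 0;
    // A smaller value in front of a larger one is subtracted (IV = 4, XC = 90).
    total += current < next ? -current : current;
  }
  return total;
}

// Assumed input shape: numeral keys mapped to arrays of lines (placeholder data).
var byNumeral = { IX: ["line a"], VII: ["line b"], VIII: ["line c"] };

// Array of objects sorted by sonnet number.
var ordered = Object.keys(byNumeral)
  .sort(function (a, b) { return romanToInt(a) - romanToInt(b); })
  .map(function (key) {
    return { number: romanToInt(key), numeral: key, lines: byNumeral[key] };
  });

console.log(romanToInt("CLIV")); // 154
console.log(ordered.map(function (s) { return s.numeral; })); // [ 'VII', 'VIII', 'IX' ]
```

Sorting on the decimal value avoids the alphabetical ordering that the numeral strings would otherwise get, where IX would come before VII.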
