How do I parse an HTML table with rowspans in Python?

24

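The problem

I'm trying to parse an HTML table that contains rowspans; specifically, I'm trying to parse my college timetable.

The problem is that when a cell in the previous row has a rowspan, the next row contains one TD fewer: the missing TD is the position that the spanning cell now occupies.

I have no idea how to account for this, and I'd like to be able to parse this timetable correctly.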
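
To make the symptom concrete, here is a minimal sketch on a made-up two-row table (the `SAMPLE` markup and the `CellCounter` class are illustrative assumptions, not taken from the timetable). It only counts the `<td>` elements in each `<tr>` using the standard library parser, which shows that the row following a `rowspan="2"` cell really does contain one cell fewer in the markup:

```python
from html.parser import HTMLParser

# Hypothetical miniature table: the "Tue" cell spans two rows.
SAMPLE = """
<table>
  <tr><td>1</td><td>Mon</td><td rowspan="2">Tue</td><td>Wed</td></tr>
  <tr><td>2</td><td>Mon</td><td>Wed</td></tr>
</table>
"""


class CellCounter(HTMLParser):
    """Count the <td> start tags that appear inside each <tr>."""

    def __init__(self):
        super().__init__()
        self.counts = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row starts: begin a fresh counter for it.
            self.counts.append(0)
        elif tag == "td" and self.counts:
            self.counts[-1] += 1


counter = CellCounter()
counter.feed(SAMPLE)
print(counter.counts)  # [4, 3]
```

So indexing the second row with `td:nth-of-type(n)` lands one column too far to the left for every cell after the spanned position, which is consistent with the off-by-one `dag` value described here.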

What I've tried

Pretty much everything I could think of.

What I get

[
    {
        'blok_eind': 4,
        'blok_start': 3,
        'dag': 4, # Should be 5
        'leraar': 'DOODF000',
        'lokaal': 'ALK C212',
        'vak': 'PROJ-T',
    },
]

As you can see, the output snippet above contains a vak key with the value PROJ-T. Its dag is 4, while it should be 5 (Friday/Vrijdag), as the timetable shows:

[image: Table]

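What I want

A Python dict() that looks like the one posted above, but with the right values.

Where:

  • day/dag is an integer from 1 to 5 representing Monday to Friday
  • block_start/blok_start is an integer representing when the class starts (the time block, shown on the left side of the table)
  • block_end/blok_eind is an integer representing the block in which the class ends
  • classroom/lokaal is the code of the classroom the class is in
  • teacher/leraar is the teacher's ID
  • course/vak is the course ID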
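
As a rough illustration of the record described by this list, the fields could be written down as a small dataclass. The `Lesson` name and the range checks are assumptions made for this sketch; the question itself only asks for a plain dict with these keys:

```python
from dataclasses import dataclass, asdict


@dataclass
class Lesson:
    """Hypothetical container mirroring the keys listed in the question."""

    dag: int         # day, 1 (Monday) to 5 (Friday)
    blok_start: int  # first time block of the class
    blok_eind: int   # last time block of the class
    lokaal: str      # classroom code
    leraar: str      # teacher ID
    vak: str         # course ID

    def __post_init__(self):
        # Basic sanity checks on the integer fields.
        if not 1 <= self.dag <= 5:
            raise ValueError("dag must be between 1 and 5")
        if self.blok_eind < self.blok_start:
            raise ValueError("blok_eind must not precede blok_start")


# The entry from the question, with the day corrected to Friday.
expected = Lesson(dag=5, blok_start=3, blok_eind=4,
                  lokaal="ALK C212", leraar="DOODF000", vak="PROJ-T")
print(asdict(expected))
```

`asdict` turns the instance back into the dict shape shown under "What I get".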

Basic HTML structure for the data above

<center>
    <table>
        <tr>
            <td>
                <table>
                    <tbody>
                        <tr>
                            <td>
                                <font>
                                    TEACHER-ID
                                </font>
                            </td>
                            <td>
                                <font>
                                    <b>
                                        CLASSROOM ID
                                    </b>
                                </font>
                            </td>
                        </tr>
                        <tr>
                            <td>
                                <font>
                                    COURSE ID
                                </font>
                            </td>
                        </tr>
                    </tbody>
                </table>
            </td>
        </tr>
    </table>
</center>
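
For a single filled-in cell with this structure, the three values can be read in document order. The sketch below assumes BeautifulSoup 4 is installed and uses the standard library "html.parser" backend; `CELL_HTML` is a hand-trimmed copy of one inner table, with the values from the PROJ-T entry filled in:

```python
from bs4 import BeautifulSoup

# Trimmed inner table of one timetable cell (values from the PROJ-T entry).
CELL_HTML = """
<table>
  <tr>
    <td><font>DOODF000</font></td>
    <td><font><b>ALK C212</b></font></td>
  </tr>
  <tr>
    <td><font>PROJ-T</font></td>
  </tr>
</table>
"""

cell = BeautifulSoup(CELL_HTML, "html.parser")

# There is one <font> per inner cell, in the order teacher, classroom, course.
teacher, classroom, course = (
    font.get_text(strip=True) for font in cell.find_all("font")
)
print(teacher, classroom, course)  # DOODF000 ALK C212 PROJ-T
```

`get_text(strip=True)` also reaches into the nested `<b>` tag, so the classroom does not need a separate selector.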

Code

HTML

<CENTER><font size="3" face="Arial" color="#000000">
<BR></font>
  <font size="6" face="Arial" color="#0000FF">
16AO4EIO1B
&nbsp;</font> <font size="4" face="Arial">
IO1B
</font>
  <BR>
  <TABLE border="3" rules="all" cellpadding="1" cellspacing="1">
    <TR>
      <TD align="center">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial" color="#000000">
Maandag 29-08
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
Dinsdag 30-08
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
Woensdag 31-08
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
Donderdag 01-09
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
Vrijdag 02-09
</font> </TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>1</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
8:30
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
9:20
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=4 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD width="50%" nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
            <TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B021</B>
</font> </TD>
          </TR>
          <TR>
            <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
WEBD
</font> </TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>2</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
9:20
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
10:10
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=4 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD width="50%" nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
            <TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B021B</B>
</font> </TD>
          </TR>
          <TR>
            <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
WEBD
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>3</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
10:25
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
11:15
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=4 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD width="50%" nowrap=1><font size="2" face="Arial">
DOODF000
</font> </TD>
            <TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK C212</B>
</font> </TD>
          </TR>
          <TR>
            <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
PROJ-T
</font> </TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>4</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
11:15
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
12:05
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=4 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD width="50%" nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
            <TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B021B</B>
</font> </TD>
          </TR>
          <TR>
            <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
MENT
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>5</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
12:05
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
12:55
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>6</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
12:55
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
13:45
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=4 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD width="50%" nowrap=1><font size="2" face="Arial">
JONGJ003
</font> </TD>
            <TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B008</B>
</font> </TD>
          </TR>
          <TR>
            <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
BURG
</font> </TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>7</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
13:45
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
14:35
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=4 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD width="50%" nowrap=1><font size="2" face="Arial">
FLUIP000
</font> </TD>
            <TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B004</B>
</font> </TD>
          </TR>
          <TR>
            <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
ICT algemeen  Prakti
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>8</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
14:50
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
15:40
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=4 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD width="50%" nowrap=1><font size="2" face="Arial">
KOOLE000
</font> </TD>
            <TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B008</B>
</font> </TD>
          </TR>
          <TR>
            <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
NED
</font> </TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>9</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
15:40
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
16:30
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
    <TR>
      <TD rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>10</B>
</font> </TD>
            <TD align="center" nowrap=1><font size="2" face="Arial">
16:30
</font> </TD>
          </TR>
          <TR>
            <TD align="center" nowrap=1><font size="2" face="Arial">
17:20
</font> </TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
      <TD colspan=12 rowspan=2 align="center" nowrap="1">
        <TABLE>
          <TR>
            <TD></TD>
          </TR>
        </TABLE>
      </TD>
    </TR>
    <TR>
    </TR>
  </TABLE>
  <TABLE cellspacing="1" cellpadding="1">
    <TR>
      <TD valign=bottom> <font size="4" face="Arial" color="#0000FF"></TR></TABLE><font size="3" face="Arial">
Periode1   29-08-2016 (35) - 04-09-2016 (35)   G r u b e r  &amp;  P e t t e r s   S o f t w a r e
</font></CENTER>

Python

from pprint import pprint
from bs4 import BeautifulSoup
import requests

r = requests.get("http://rooster.horizoncollege.nl/rstr/ECO/AMR/400-ECO/Roosters/36"
                 "/c/c00025.htm")
daytable = {
    1: "Maandag",
    2: "Dinsdag",
    3: "Woensdag",
    4: "Donderdag",
    5: "Vrijdag"
}
timetable = {
    1: ("8:30", "9:20"),
    2: ("9:20", "10:10"),
    3: ("10:25", "11:15"),
    4: ("11:15", "12:05"),
    5: ("12:05", "12:55"),
    6: ("12:55", "13:45"),
    7: ("13:45", "14:35"),
    8: ("14:50", "15:40"),
    9: ("15:40", "16:30"),
    10: ("16:30", "17:20"),
}

page = BeautifulSoup(r.content, "lxml")

roster = []
big_rows = 2
last_row_big = False
# There are 10 blocks, each made up out of 2 TR's, run through them
for block_count in range(2, 22, 2):
    # There are 5 days, first column is not data we want
    for day in range(2, 7):
        dayroster = {
            "dag": 0,
            "blok_start": 0,
            "blok_eind": 0,
            "lokaal": "",
            "leraar": "",
            "vak": ""
        }
        # This selector provides the classroom
        table_bold = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ") > table > tr > td > font > b")

        # This selector provides the teacher's code and the course ID
        table = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ") > table > tr > td > font")

        # This gets the rowspan on the current row and column
        rowspan = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ")")

        try:
            if table or table_bold and rowspan[0].attrs.get("rowspan") == "4":
                last_row_big = True
                # Setting end of class
                dayroster["blok_eind"] = (block_count // 2) + 1
            else:
                last_row_big = False
                # Setting end of class
                dayroster["blok_eind"] = (block_count // 2)
        except IndexError:
            pass

        if table_bold:
            x = table_bold[0]
            # Classroom ID
            dayroster["lokaal"] = x.contents[0]

        if table:
            iter = 0
            for x in table:
                content = x.contents[0].lstrip("\r\n").rstrip("\r\n")
                # Cell has data
                if content != "":
                    # Set start of class
                    dayroster["blok_start"] = block_count // 2
                    # Set day of class
                    dayroster["dag"] = day - 1
                    if iter == 0:
                        # Teacher ID
                        dayroster["leraar"] = content
                    elif iter == 1:
                        # Course ID
                        dayroster["vak"] = content
                    iter += 1

        if table or table_bold:
            # Store the data
            roster.append(dayroster)

# Remove duplicates
seen = set()
new_l = []
for d in roster:
    t = tuple(d.items())
    if t not in seen:
        seen.add(t)
        new_l.append(d)
pprint(new_l)

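Please include 1) your Python code, 2) the minimum amount of HTML needed to reproduce the problem, 3) your expected output, and 4) the output you actually get in the question itself, not on an external site. - Ry-
"I'm running into the problem where, if the last row contains a rowspan, the next row is missing a TD where the rowspan is now that TD that is missing." Are you saying the <td> is really missing from the HTML, or that your code thinks it is missing when it is actually there? - John Gordon
The request returns a 404 error page. - Jules G.M.
The site URL shows a 404. - Jules G.M.
The class has changed now: c00019.htm. - iSeeDeadPixels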
2 Answers

15
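You'll have to track the rowspans from previous rows, one per column. You can do this by copying the integer value of each rowspan into a dictionary and having subsequent rows decrement that value until it drops to 1 (or, to simplify the code, store the integer value minus 1 and count down to 0). You can then adjust the column count of each subsequent row based on the rowspans still active from the rows above it.

Your table complicates this a little, because it uses a default span of 2 and increments in steps of 2, but dividing by 2 brings that back to manageable numbers. Rather than using massive CSS selectors, select just the table rows, and we'll iterate over those: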
roster = []
rowspans = {}  # track rowspanning cells
# every second row in the table
rows = page.select('html > body > center > table > tr')[1:21:2]
for block, row in enumerate(rows, 1):
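    # take direct child td cells, but skip the first cell
    # (find_all with recursive=False avoids a leading '>' combinator,
    # which some BeautifulSoup releases reject in select()):
    daycells = row.find_all('td', recursive=False)[1:]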
    rowspan_offset = 0
    for daynum, daycell in enumerate(daycells, 1):
        # rowspan handling; if there is a rowspan here, adjust to find correct position
        daynum += rowspan_offset
        while rowspans.get(daynum, 0):
            rowspan_offset += 1
            rowspans[daynum] -= 1
            daynum += 1
        # now we have a correct day number for this cell, adjusted for
        # rowspanning cells.
        # update the rowspan accounting for this cell
        rowspan = (int(daycell.get('rowspan', 2)) // 2) - 1
        if rowspan:
            rowspans[daynum] = rowspan

        texts = daycell.select("table > tr > td > font")
        if texts:
            # class info found
            teacher, classroom, course = (c.get_text(strip=True) for c in texts)
            roster.append({
                'blok_start': block,
                'blok_eind': block + rowspan,
                'dag': daynum,
                'leraar': teacher,
                'lokaal': classroom,
                'vak': course
            })

    # days that were skipped at the end due to a rowspan
    while daynum < 5:
        daynum += 1
        if rowspans.get(daynum, 0):
            rowspans[daynum] -= 1

This produces the correct output:

[{'blok_eind': 2,
  'blok_start': 1,
  'dag': 5,
  'leraar': u'BLEEJ002',
  'lokaal': u'ALK B021',
  'vak': u'WEBD'},
 {'blok_eind': 3,
  'blok_start': 2,
  'dag': 3,
  'leraar': u'BLEEJ002',
  'lokaal': u'ALK B021B',
  'vak': u'WEBD'},
 {'blok_eind': 4,
  'blok_start': 3,
  'dag': 5,
  'leraar': u'DOODF000',
  'lokaal': u'ALK C212',
  'vak': u'PROJ-T'},
 {'blok_eind': 5,
  'blok_start': 4,
  'dag': 3,
  'leraar': u'BLEEJ002',
  'lokaal': u'ALK B021B',
  'vak': u'MENT'},
 {'blok_eind': 7,
  'blok_start': 6,
  'dag': 5,
  'leraar': u'JONGJ003',
  'lokaal': u'ALK B008',
  'vak': u'BURG'},
 {'blok_eind': 8,
  'blok_start': 7,
  'dag': 3,
  'leraar': u'FLUIP000',
  'lokaal': u'ALK B004',
  'vak': u'ICT algemeen  Prakti'},
 {'blok_eind': 9,
  'blok_start': 8,
  'dag': 5,
  'leraar': u'KOOLE000',
  'lokaal': u'ALK B008',
  'vak': u'NED'}]

Moreover, this code keeps working even if a course spans more than 2 blocks, or only one block; any rowspan size is supported.
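
The same bookkeeping can be generalised to arbitrary rowspan and colspan values by filling an explicit grid. This is a sketch of a different, more generic technique than the per-column counter in this answer, and it is deliberately independent of BeautifulSoup: `expand_spans` is a hypothetical helper that assumes each row has already been reduced to `(value, rowspan, colspan)` tuples.

```python
def expand_spans(rows, n_cols):
    """Place cells on a grid, honouring rowspan and colspan.

    rows   -- list of rows; each row is a list of (value, rowspan, colspan)
              tuples in document order
    n_cols -- number of columns in the finished grid

    Returns a list of lists with n_cols entries per row. Slots that no
    cell covers stay None.
    """
    grid = [[None] * n_cols for _ in rows]
    taken = [[False] * n_cols for _ in rows]

    for r, row in enumerate(rows):
        c = 0
        for value, rowspan, colspan in row:
            # Skip slots claimed by cells spanning down from earlier rows.
            while c < n_cols and taken[r][c]:
                c += 1
            if c >= n_cols:
                break  # more cells than columns: ignore the overflow
            # Claim every slot this cell covers, clipped to the grid.
            for dr in range(rowspan):
                for dc in range(colspan):
                    if r + dr < len(rows) and c + dc < n_cols:
                        grid[r + dr][c + dc] = value
                        taken[r + dr][c + dc] = True
            c += colspan
    return grid


# Cell "B" spans two rows, so "E" belongs in the third column of row two.
rows = [
    [("A", 1, 1), ("B", 2, 1), ("C", 1, 1)],
    [("D", 1, 1), ("E", 1, 1)],
]
grid = expand_spans(rows, 3)
print(grid)  # [['A', 'B', 'C'], ['D', 'B', 'E']]
```

A separate `taken` grid is used instead of testing for `None`, so that genuinely empty cells still occupy their slot.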


2
It may be better to use bs4's built-in functions, such as findAll, to parse your table.
You could use the following code:
from pprint import pprint
from bs4 import BeautifulSoup
import requests

r = requests.get("http://rooster.horizoncollege.nl/rstr/ECO/AMR/400-ECO/Roosters/36"
                 "/c/c00025.htm")

content=r.content
page = BeautifulSoup(content, "html")
table=page.find('table')
trs=table.findAll("tr", {},recursive=False)
tr_count=0
trs.pop(0)
final_table={}

for tr in trs:
    tds=tr.findAll("td", {},recursive=False)
    if tds:
        td_count=0
        tds.pop(0)
        for td in tds:
            if td.has_attr('rowspan'):                              
                final_table[str(tr_count)+"-"+str(td_count)]=td.text.strip()
                if int(td.attrs['rowspan'])==4:
                    final_table[str(tr_count+1)+"-"+str(td_count)]=td.text.strip()
                if str(tr_count)+"-"+str(td_count+1) in final_table:
                    td_count=td_count+1         
            td_count=td_count+1
        tr_count=tr_count+1

roster=[]
for i in range(0,10): #iterate over time
    for j in range(0,5): #iterate over day
        item=final_table[str(i)+"-"+str(j)]
        if len(item)!=0:    
            block_eind=i+1          

            try:
                if final_table[str(i+1)+"-"+str(j)]==final_table[str(i)+"-"+str(j)]:
                        block_eind=i+2
            except:
                pass

            try:
                lokaal=item.split('\r\n \n\n')[0]
                leraar=item.split('\r\n \n\n')[1].split('\n \n\r\n')[0]
                vak=item.split('\n \n\r\n')[1]
            except:
                lokaal=leraar=vak="---"

            dayroster = {
                "dag": j+1,
                "blok_start": i+1,
                "blok_eind": block_eind,
                "lokaal": lokaal,
                "leraar": leraar,
                "vak": vak
            }


            dayroster_double = {
                "dag": j+1,
                "blok_start": i,
                "blok_eind": block_eind,
                "lokaal": lokaal,
                "leraar": leraar,
                "vak": vak
            }

            #use to prevent double dict for same event
            if dayroster_double not in roster:
                roster.append(dayroster)

print (roster)

3
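You can use the find_all method; findAll exists only to support BeautifulSoup 3 code and is deprecated, so use the PEP 8-compliant method names instead. - Martijn Pieters
You're right, thanks. I've changed it in my code. Your version is clearly much better than mine. Cheers! - A. STEFANI
Next tip: == True is never needed in an if statement; let if itself test whether the expression produces a true result. "if td.has_attr('rowspan'):" works just as well as "if td.has_attr('rowspan')==True:", but is far more readable. - Martijn Pieters
Martijn, I know that, and I agree with you. I've changed the code accordingly. - A. STEFANI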
