Scanning website content (fast)

Question

4

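I have a database containing thousands of websites, and I want to search all of them for a specific string. What do you think is the fastest way? I figure I should fetch each website's content first, and this is what I have in mind: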

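import re
import urllib2

string = "search string"
source = urllib2.urlopen("http://website1.com").read()
# re.escape() keeps the search literal even if the string contains regex metacharacters
if re.search(re.escape(string), source):
    print "My search string: " + string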

I am doing the string search in Python, but it is very slow. Is there any way to speed it up?

- Michael

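I suspect your program is slow because everything runs serially (that is, the program waits for each page to be downloaded and searched before requesting the next one). I suggest using a thread pool or a process pool to run multiple requests in parallel. - Blckknght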
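One way to read the thread-pool suggestion above is the following minimal sketch. It is written for Python 3 (the question's code uses Python 2's `urllib2`), and the names `fetch`, `find_matches`, `fetcher`, and `max_workers` are hypothetical, chosen only for illustration. It assumes pages can be decoded as UTF-8 and that a site which fails to download simply counts as "no match".

```python
# Python 3 sketch; the question's code uses Python 2's urllib2.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def find_matches(urls, needle, fetcher=fetch, max_workers=8):
    """Return the URLs whose content contains `needle`, fetching in parallel."""
    urls = list(urls)

    def check(url):
        try:
            return needle in fetcher(url)
        except Exception:
            # Assumption: unreachable or broken sites count as "no match".
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so the flags line up with urls.
        flags = list(pool.map(check, urls))
    return [url for url, hit in zip(urls, flags) if hit]
```

Calling `find_matches(["http://website1.com"], "search string")` would return the subset of URLs that contain the string. The `fetcher` parameter exists only so the download step can be swapped out, for example when trying the function without network access.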

2 Answers

2

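Try using multiprocessing to run multiple searches at the same time. Multithreading works too, but shared memory can become a problem if it is not managed properly. Take a look at this discussion to help you decide which approach suits you.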
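As a rough illustration of the multiprocessing idea, here is a minimal Python 3 sketch. To stay self-contained it covers only the search step, assuming the page contents have already been downloaded into a `{url: text}` mapping. The names `page_contains`, `search_pages`, and `processes` are hypothetical.

```python
# Python 3 sketch; searches pages that are already held in memory.
import multiprocessing


def page_contains(task):
    """Return the URL if its text contains the needle, else None."""
    url, text, needle = task
    return url if needle in text else None


def search_pages(pages, needle, processes=2):
    """Search a {url: text} mapping across several worker processes."""
    tasks = [(url, text, needle) for url, text in pages.items()]
    with multiprocessing.Pool(processes=processes) as pool:
        results = pool.map(page_contains, tasks)
    return [url for url in results if url is not None]
```

In a script, `search_pages` should be called from under an `if __name__ == "__main__":` guard, because some platforms start worker processes by re-importing the main module. Each process has its own memory, so the shared-state concern mentioned for threads does not arise here, at the cost of copying every page's text to a worker.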

- JonathanV

Content provided by Stack Overflow.

- RocketDonkey · Accepted Answer

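I think your problem is not the program itself but the fact that you are making HTTP requests to thousands of websites. You can look into different solutions involving some form of parallel processing, but no matter how efficient you make the parsing code, the requests will remain the bottleneck in your current implementation.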

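Here is a basic example that uses the Queue and threading modules. I recommend reading up on the pros and cons of multiprocessing versus multithreading (such as the post @JonathanV mentioned), but this will hopefully help in understanding what is happening: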

import Queue
import threading
import time
import urllib2

my_sites = [
    'http://news.ycombinator.com',
    'http://news.google.com',
    'http://news.yahoo.com',
    'http://www.cnn.com'
    ]

# Create a queue for our processing
queue = Queue.Queue()


class MyThread(threading.Thread):
  """Create a thread to make the url call."""

  def __init__(self, queue):
    super(MyThread, self).__init__()
    self.queue = queue

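  def run(self):
    while True:
      # Grab a url from our queue and make the call.
      my_site = self.queue.get()
      try:
        url = urllib2.urlopen(my_site)

        # Grab a little data to make sure it is working
        print url.read(1024)
      except Exception as e:
        # A failed request should not kill the worker thread
        print 'Failed to fetch {0}: {1}'.format(my_site, e)
      finally:
        # Always signal completion, otherwise queue.join() would block forever
        self.queue.task_done()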


def main():

  # This will create a 'pool' of threads to use in our calls
  for _ in range(4):
    t = MyThread(queue)

    # A daemon thread runs but does not block our main function from exiting
    t.setDaemon(True)

    # Start the thread
    t.start()

  # Now go through our site list and add each url to the queue
  for site in my_sites:
    queue.put(site)

  # join() ensures that we wait until our queue is empty before exiting
  queue.join()

if __name__ == '__main__':
  start = time.time()
  main()
  print 'Total Time: {0}'.format(time.time() - start)
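The example above only prints the first kilobyte of each page. As a hedged sketch of how the same queue-and-worker pattern might be adapted to the original question, the following Python 3 version searches each page and collects the matching URLs. The names `fetch`, `worker`, `search_sites`, and `fetcher` are hypothetical, and the sketch assumes pages decode as UTF-8 and that failed downloads are silently skipped. It uses one `None` sentinel per worker instead of daemon threads so that every thread exits cleanly.

```python
# Python 3 sketch of the same worker pattern, extended to search each page.
import queue
import threading
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def worker(tasks, matches, lock, needle, fetcher):
    """Pull URLs off the queue until a None sentinel arrives."""
    while True:
        url = tasks.get()
        try:
            if url is None:
                return
            if needle in fetcher(url):
                with lock:
                    matches.append(url)
        except Exception:
            pass  # Assumption: sites that fail to download are skipped.
        finally:
            # Runs for sentinels too, so tasks.join() can complete.
            tasks.task_done()


def search_sites(urls, needle, fetcher=fetch, num_threads=4):
    """Return the URLs whose content contains `needle` (order not guaranteed)."""
    tasks = queue.Queue()
    matches, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(tasks, matches, lock, needle, fetcher))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # One sentinel per worker so each one exits.
    tasks.join()
    for t in threads:
        t.join()
    return matches
```

With the site list from the example above, `search_sites(my_sites, "search string")` would return whichever of those URLs contain the string. The lock around `matches.append` is there to make the shared list access explicit, which is the kind of shared-state management the other answer warns about.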

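For good resources on threading, see Doug Hellmann's post here (link), an IBM article (link) (this has become my general threading setup, as shown above), and the official documentation (link).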