Downloading *.mp4 files with Python

6
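I am trying to download and save lecture videos from a website. The files download successfully, but they will not play in my media player. Here is the code I am using: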
from bs4 import BeautifulSoup
import re
import urllib2

snippet = open('Python/SNA Page Source Revised.txt', 'r')
soup = BeautifulSoup(snippet)

links = [link.get('href') for link in soup.find_all('a')]

videos = []

for link in links:
  match = re.search('.*mp4.*', link)
  if match:
    videos.append(link)

vidNum = 1

for video in videos:
  f = urllib2.urlopen(video)
  with open('Data Analysis/Social Network Analysis/Video '+str(vidNum)+'.mp4', 'wb') as code:
    code.write(f.read())
  vidNum += 1

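Everything appears to work, but when I try to play one of the videos I get the following error: "Python (v2.7) requires to install plugins to play media files of the following type: text/html decoder". Also, when I download the video manually from the website the file is about 22.8 MB, but the file my script produces is only 7.8 kB.
Am I downloading the files incorrectly? Any help is greatly appreciated.
Also: I am using Python v2.7 on Ubuntu 12.04 LTS.
Edit: based on the responses I received, this is the code I am now using: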
import requests

r = requests.get('https://class.coursera.org/sna-003/lecture/download.mp4?lecture_id=2', auth=('myUsername', 'myPassword'))

with open('Data Analysis/TestFile.mp4', 'wb') as fd:
  fd.write(r.content)

Here is the output of r.content:
<!DOCTYPE html>
<html itemtype="http://schema.org" xmlns:fb="http://ogp.me/ns/fb#"><head><meta content="IE=Edge,chrome=IE7" http-equiv="X-UA-Compatible"/><meta content="!" name="fragment"/><meta content="NOODP" name="robots"/><meta charset="utf-8"/><meta content="Coursera" property="og:title"/><meta content="website" property="og:type"/><meta content="http://s3.amazonaws.com/coursera/media/Coursera_Computer_Narrow.png" property="og:image"/><meta content="https://www.coursera.org/" property="og:url"/><meta content="Coursera" property="og:site_name"/><meta content="en_US" property="og:locale"/><meta content="Take free online classes from 80+ top universities and organizations. Coursera is a social entrepreneurship company partnering with Stanford University, Yale University, Princeton University and others around the world to offer courses online for anyone to take, for free. We believe in connecting people to a great education so that anyone around the world can learn without limits." property="og:description"/><meta content="727836538,4807654" property="fb:admins"/><meta content="274998519252278" property="fb:app_id"/><meta content="Take free online classes from 80+ top universities and organizations. Coursera is a social entrepreneurship company partnering with Stanford University, Yale University, Princeton University and others around the world to offer courses online for anyone to take, for free. We believe in connecting people to a great education so that anyone around the world can learn without limits." name="description"/><meta content="http://s3.amazonaws.com/coursera/media/Coursera_Computer_Narrow.png" name="image"/><meta content="app-id=736535961" name="apple-itunes-app"/><script>window.onerror = function(message, url, lineNum) {

  // First check the URL and line number of the error
  url = url || window.location.href;
  // 99% of the time, errors without line numbers arent due to our code,
  // they are due to third party plugins and browser extensions
  if (lineNum === undefined || lineNum == null) return;

  // Now figure out the actual error message
  // If it's an event, as triggered in several browsers
  if (message.target &amp;&amp; message.type) {
    message = message.type;
  }
  if (!message.indexOf) {
    message = 'Non-string, non-event error: ' + (typeof message);
  }

  var errorDescrip = {
    message: message,
    script: url,
    line: lineNum,
    url: document.URL
  }

  var err = {
    key: 'page.error.javascript', 
    value: errorDescrip
  }

  window._204 = window._204 || [];
  window._204.push(err);

  window._gaq = window._gaq || [];
  window._gaq.push(err);
}</script><title>Coursera.org</title><link href="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/css/home.css" rel="stylesheet" type="text/css"/><link href="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/pages/auth/css/auth.css" rel="stylesheet" type="text/css"/><script data-baseurl="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/" id="_mobile">(function(el) {
  // Override certian behaviour if the page is for our mobile app.
  // TODO(priya) Remove this conditional behaviour once I want to push this behaviour
  // for regular authentication pages on mobile/smaller screens as well.
  // Currently I'm keeping existing behaviour same and only adding mobile specific
  // layouts ot /mobilesignup page (which is what isMobileApp = true signifies).
  if ("false" == "true") {
    var head = document.getElementsByTagName('head')[0];
    // Add viewport meta tag
    var viewport = document.querySelector('meta[name=viewport]');
    var viewportContent = 'width=device-width, initial-scale=1.0, user-scalable=no';
    if (!viewport) {
        viewport = document.createElement('meta');
        viewport.setAttribute('name', 'viewport');
        head.appendChild(viewport);
    }
    viewport.setAttribute('content', viewportContent);

    // Add responsive css
    var link  = document.createElement('link');
    link.rel  = 'stylesheet';
    link.type = 'text/css';
    link.href = el.getAttribute("data-baseurl") + "pages/auth/css/auth_responsive.css";
    head.appendChild(link);
  }
})(document.getElementById("_mobile"));
</script></head><body><div id="fb-root"></div><div id="origami"><div style="position:absolute;top:0px;left:0px;width:100%;height:100%;background:#f5f5f5;padding-top:5%;"><div id="coursera-loading-nojs" style="text-align:center; margin-bottom:10px;display:none;">Please use a <a href="/browsers">modern browser </a> with JavaScript enabled to use Coursera.</div><div><span id="coursera-loading-js" style="display: none; padding-left:45%">loading   <img src="https://d2wvvaown1ul17.cloudfront.net/site-static/images/icons/loading.gif"/></span></div><noscript><div style="text-align:center; margin-bottom:10px;">Please use a <a href="/browsers">modern browser </a> with JavaScript enabled to use Coursera.</div></noscript></div></div><!--[if gte IE 8]&gt;&lt;script&gt;document.getElementById("coursera-loading-js").style.display = 'block';&lt;/script&gt;&lt;![endif]-->
<!--[if lte IE 7]&gt;&lt;script&gt;document.getElementById("coursera-loading-nojs").style.display = 'block';
window._204 = window._204 || [];
window._gaq = window._gaq || [];

window._gaq.push(
    ['_setAccount', 'UA-28377374-1'],
    ['_setDomainName', window.location.hostname],
    ['_setAllowLinker', true],
    ['_trackPageview', window.location.pathname]);

window._204.push(
  ['client', 'home'],
  {key:"pageview", value:window.location.pathname});
  &lt;/script&gt;&lt;script src="https://eventing.coursera.org/204.min.js"&gt;&lt;/script&gt;&lt;script src="https://ssl.google-analytics.com/ga.js"&gt;&lt;/script&gt;&lt;![endif]-->
<!--[if !IE]&gt; --><script>document.getElementById("coursera-loading-js").style.display = 'block';</script><!-- &lt;![endif]--><script src="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/js/core/require.js" type="text/javascript"></script><script data-baseurl="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/" data-debug="0" data-locale="" data-timestamp="1386838999742" data-version="e47434615f57601f9b9ccaf255a589e8550d328d" id="_require" type="text/javascript">if(document.getElementById("coursera-loading-js").style.display == 'block') {
  (function(el) {
     // prevent throw
     require.onError = function(err) {
       window._204 = window._204 || [];
       window._204.push({key: 'requireErr', value: err});
     };

     define("pages/auth/authConfig",
         function() {
             return {"coursera_url": "https://www.coursera.org/",
                     "environment": "production"};
     }
     );

     require.config({
       enforceDefine: false,
       waitSeconds: 14,
       baseUrl: el.getAttribute("data-baseurl"),
       urlArgs: el.getAttribute("data-debug") == "1" ? "v=" + el.getAttribute("data-timestamp") : "",
       shim: {
          "underscore": {
             exports: '_'
          },
          "backbone": {
             deps: ['underscore', 'jquery'],
             exports: 'Backbone'
          }
       },
       paths: {
          "jquery":       "js/core/jquery",
          "underscore":   "js/core/underscore",
          "backbone":     "js/core/backbone",
          "i18n":         "js/core/i18n._t"
       },
       callback: function() {
         require(["pages/auth/routes"]); // bootup coursera
       },
       config: {
         i18n: {
           locale: (window.localStorage ? localStorage.getItem("locale") : '') || el.getAttribute("data-locale")
         }
       }
     });
  })(document.getElementById("_require"));
}</script><script type="text/javascript">define("pages/home/models/user.json", [], function(){
  return null;
});
</script></body></html>

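I find this strange, because it looks like just the page source of the website, yet when I check r.url I get an actual URL that loads in my browser and prompts me to save or view the video. Even when I pass in the new URL obtained from there (which I believe contains my cookie information), I still get the same content. I do not understand what I am doing wrong.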


3
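You are probably downloading the page's HTML rather than the file itself. Have you tried entering the URL in your browser? - Kevin
Yes, I have. Here is a sample URL: https://class.coursera.org/sna-003/lecture/download.mp4?lecture_id=2 When I look at the output of urllib2.urlopen(video).read(), it is XML data and contains a data-baseurl, but that URL does not load in a browser. Here is an example: https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d - tblznbits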
1
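Apparently you need to set some kind of cookie for the download to succeed. - plaes
Do you mean for the first link I posted? The site requires you to log in to coursera.org. Or do I need to set some other kind of cookie? - tblznbits
The cookies that are sent to you when you log in need to be sent along with the request for the download URL. - Kevin
If you turn off JavaScript in your web browser, can you still download it manually? You could use a network sniffer (such as Wireshark) to compare what the browser sends with what your code sends. - jfs
4 Answers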

8

First, download and install the requests package.

Then use the following code:

import requests

def downloadfile(name, url):
    name = name + ".mp4"
    # Pass the url variable (not the literal string 'url');
    # stream=True avoids loading the whole video into memory.
    r = requests.get(url, stream=True)
    print("****Connected****")
    with open(name, 'wb') as f:
        print("Downloading.....")
        for chunk in r.iter_content(chunk_size=255):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    print("Done")
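
As a hedged sketch (not part of the original answer), the write loop can be separated into a small helper that accepts any iterable of byte chunks, such as the one returned by `r.iter_content(...)`. The helper name `save_chunks` is hypothetical and Python 3 is assumed. Returning the byte count makes a suspiciously small result (a few kB instead of roughly 22.8 MB) easy to spot.

```python
def save_chunks(chunks, path):
    """Write an iterable of byte chunks to `path` and return bytes written."""
    written = 0
    with open(path, "wb") as out:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                out.write(chunk)
                written += len(chunk)
    return written
```

With Requests this could be called as `save_chunks(r.iter_content(chunk_size=8192), "lecture.mp4")`, where the file name is a placeholder.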

1
You need a valid cookie so that you do not download the login page instead.
Here is how to set a cookie with urllib2:
import urllib2
opener = urllib2.build_opener()
opener.addheaders.append(('Cookie', 'cookiename=cookievalue'))
f = opener.open("http://example.com/")

You can also use cookielib for more browser-like behavior: complete the login process and obtain the correct cookies for downloading the movie.
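
A minimal sketch of the cookielib approach, written for Python 3, where `cookielib` is named `http.cookiejar` and `urllib2` is `urllib.request`. The function name `build_cookie_opener` and the URLs in the comments are hypothetical; the real login URL and form fields depend on the site and are not given here.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


# Hypothetical usage (placeholder URLs and form data, not run here):
#   opener, jar = build_cookie_opener()
#   opener.open("https://example.com/login", data=b"user=...&pass=...")
#   video = opener.open("https://example.com/video.mp4").read()
```

Because the same opener is reused, any cookies set by the login response are stored in the jar and sent with the later download request.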

Another option is Requests, which is similar to urllib2 but simpler, and can be used to automate the login process.


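So, using Requests, my code would look something like this: for video in videos: r = requests.get(video, auth=('user', 'pass')); vidFile = open('Data Analysis/Video '+vidNum+'.mp4', 'wb'); vidFile.write(r.content); vidNum += 1 (semicolons are used to show line breaks). I have never used requests before and am reading the documentation now. Just wondering whether I am on the right track. - tblznbits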

1
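I would first save the file as .html rather than .mp4 to make sure it is not a login page, an error page, or some other miscellaneous page. Some sites require cookies, a specific user agent (to block bots, scrapers, and automated vulnerability scanners), a referrer, and so on.
Personally, I use Tamper Data or Live HTTP Headers to verify that my program behaves correctly while debugging.
If you receive a CloudFront response, you are probably not handling cookies, the user agent, the referrer, etc. correctly.
I just checked the link and found that there is also a CSRF cookie {csrf_token=toNQOP7stgOREzrDcbPc}, which you must use to view anything behind the login page.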
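
The same check can be done in code before writing anything to disk. This is a rough sketch, assuming Python 3; the function name `looks_like_html` is hypothetical, and the test is a heuristic based on the Content-Type header and the first bytes of the body.

```python
def looks_like_html(content_type, body):
    """Return True if a response appears to be an HTML page rather than media.

    content_type: value of the Content-Type header, or None if absent.
    body: the first bytes of the response body.
    """
    if content_type and content_type.split(";")[0].strip().lower() == "text/html":
        return True
    head = body[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")
```

With Requests, the inputs would be `r.headers.get("Content-Type")` and `r.content`. A response that begins with `<!DOCTYPE html>`, like the output shown in the question, would be flagged.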
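
As an illustration only, the headers mentioned above (cookies, user agent, referrer) can be attached to a request by hand. The sketch assumes Python 3 and `urllib.request`; the function name `build_request` and the cookie names and values used with it are hypothetical placeholders, and real values would have to come from a logged-in browser session.

```python
import urllib.request


def build_request(url, cookies, user_agent, referer=None):
    """Build a Request carrying Cookie, User-Agent and optional Referer headers."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    headers = {"Cookie": cookie_header, "User-Agent": user_agent}
    if referer:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


# Hypothetical usage (placeholder values, not run here):
#   req = build_request("https://example.com/video.mp4",
#                       {"csrf_token": "PLACEHOLDER"}, "Mozilla/5.0")
#   data = urllib.request.urlopen(req).read()
```

Whether these headers are sufficient depends on the site; comparing them against what the browser actually sends is the reliable way to find out.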

-2

If you have the link, you can also download the MP4 video with curl, which is easier.

import os

url = "YOUR_URL_LINK"  # replace with the video URL
os.system(f"curl {url} --output c:/Users/Desktop/YOUR_FILE_NAME.mp4")


Calling other programs from Python makes error handling and debugging harder. - D-S
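
If curl is used anyway, one way to keep some error handling is `subprocess.run` with an argument list and `check=True`, which raises `CalledProcessError` when curl exits with a non-zero status. This is a sketch, assuming Python 3 and that `curl` is on the PATH; the function names are hypothetical.

```python
import subprocess


def curl_command(url, output_path):
    """Build the curl argument list; --fail makes HTTP errors exit non-zero."""
    return ["curl", "--fail", "--location", "--output", output_path, url]


def download_with_curl(url, output_path):
    """Run curl and raise subprocess.CalledProcessError if it fails."""
    subprocess.run(curl_command(url, output_path), check=True)
```

Passing a list instead of a shell string avoids quoting problems with URLs that contain `?` or `&`. A login page returned with status 200 would still be saved without an error, so this does not replace sending the right cookies.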

Content provided by Stack Overflow.