Crawling the danmaku (bullet comments) of Bingbing❤'s Bilibili video

Crawling Bilibili video danmaku

Reference: https://www.cnblogs.com/becks/p/14540355.html

0. Before crawling

Install BeautifulSoup4 (the script below also uses requests).

Bilibili's danmaku endpoint is shown below 👇. The xxx in the middle is an id of the video, namely the cid that will come up later:

http://comment.bilibili.com/xxx.xml

1. Getting the cid

The danmaku we want comes from Bingbing's first video. Open that video, click play, then open the browser's developer tools and follow the steps below to find the cid.

(Screenshot: locating the cid in the browser developer tools)
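If you would rather not dig through the developer tools, the cid can usually also be fetched programmatically. The sketch below is not part of the original post: it assumes Bilibili's pagelist endpoint (api.bilibili.com/x/player/pagelist) and its usual JSON shape, and the bvid value is a placeholder you would copy from the video's URL.

import requests

# Assumption: the pagelist endpoint returns the cid of each part of a video.
# The bvid below is a placeholder -- copy the real BV id from the video's URL.
bvid = 'BV1xxxxxxxxx'
resp = requests.get('https://api.bilibili.com/x/player/pagelist',
                    params={'bvid': bvid},
                    headers={'User-Agent': 'Mozilla/5.0'})
data = resp.json()
if data.get('code') == 0:
    for page in data['data']:
        # each cid plugs into http://comment.bilibili.com/<cid>.xml
        print(page['page'], page['cid'])
else:
    print('request failed:', data)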

2. Opening the danmaku endpoint

We now know the danmaku URL for Bingbing's first video:

http://comment.bilibili.com/283851334.xml

Open this URL and inspect its structure. As the figure shows, every danmaku is stored in a d tag, which gives us the filter condition for the crawler:

comment = bf.find_all('d')

(Screenshot: XML structure of the danmaku file, with one d tag per comment)
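As a quick sanity check of that filter, here is a minimal, self-contained sketch that parses a hand-written fragment in the same shape as the real file (the fragment and its p attribute values are illustrative, not taken from the actual response):

from bs4 import BeautifulSoup

# Illustrative fragment shaped like the danmaku XML: one d tag per comment.
xml_doc = """
<i>
  <d p="12.3,1,25,16777215,1617977000,0,abcd1234,0">first danmaku</d>
  <d p="15.8,1,25,16777215,1617977003,0,efgh5678,0">second danmaku</d>
</i>
"""

soup = BeautifulSoup(xml_doc, 'html.parser')
for d in soup.find_all('d'):   # the filter condition from above
    print(d.text)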

3. The code

"""
@encoding : utf-8
@Author : Tang
@E-mail : 1009592703@qq.com
@File : t5_1beautifulsoup.py
@Description: use BeautifulSoup to crawl web pages
@CreateTime : 2021/4/10 14:00
"""
import requests
from bs4 import BeautifulSoup

global comments
comments = ''

def get_reviews(html):
html_doc = str(html, 'utf-8')
bf = BeautifulSoup(html_doc, 'html.parser')
comment = bf.find_all('d')
i = 0
for short in comment:
global comments
comments += str(i + 1) + ' : ' + short.text + '\n'
i = i + 1
return comments


url = "http://comment.bilibili.com/283851334.xml"
herders = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1;WOW64) AppleWebKit/537.36 (KHTML,like GeCKO) Chrome/45.0.2454.85 Safari/537.36 115Broswer/6.0.3',
'Referer': 'https://movie.douban.com/',
'Connection': 'keep-alive'}
response = requests.get(url, headers=herders)
print('返回状态码:%s' % response.status_code)
get_reviews(response.content)
print(comments)

(Screenshot: crawler output, the numbered list of danmaku)
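To feed the word-cloud and sentiment steps below, the scraped text can simply be dumped to a file. This small addition is not in the original script, and the danmaku.txt filename is an arbitrary choice:

# Persist the scraped danmaku so the later steps can reuse it.
with open('danmaku.txt', 'w', encoding='utf-8') as f:
    f.write(comments)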

4. Word cloud

(Screenshots: word clouds generated from the danmaku)
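The original post only shows screenshots here, so below is a minimal sketch of how such a word cloud can be built from the scraped text. It assumes the jieba and wordcloud packages, the danmaku.txt file saved above, and a font path that you would point at a Chinese font available on your machine:

import jieba
from wordcloud import WordCloud

# Read the danmaku saved in step 3 (danmaku.txt is the filename chosen above).
with open('danmaku.txt', encoding='utf-8') as f:
    text = f.read()

# Cut the Chinese text into words so WordCloud can count frequencies.
words = ' '.join(jieba.cut(text))

# font_path is an assumption: point it to any Chinese font on your system,
# otherwise Chinese characters render as empty boxes.
wc = WordCloud(font_path='simhei.ttf',
               width=800, height=600,
               background_color='white').generate(words)
wc.to_file('danmaku_wordcloud.png')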

5. Simple sentiment analysis

(Screenshots: sentiment analysis results)
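Again, only screenshots survive in the original, so here is a hedged sketch of a simple sentiment pass using SnowNLP, a common choice for Chinese text (not necessarily what the post used). Each danmaku gets a score between 0 (negative) and 1 (positive):

from snownlp import SnowNLP

# Strip the "N : " prefix written by the crawler and drop empty lines.
with open('danmaku.txt', encoding='utf-8') as f:
    lines = [line.split(' : ', 1)[-1].strip() for line in f if line.strip()]

# SnowNLP(...).sentiments returns a value in [0, 1]; closer to 1 is more positive.
scores = [SnowNLP(line).sentiments for line in lines]
positive = sum(1 for s in scores if s > 0.6)
negative = sum(1 for s in scores if s < 0.4)
print('total: %d, positive: %d, negative: %d, average score: %.3f'
      % (len(scores), positive, negative, sum(scores) / len(scores)))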
