Crawling the danmaku (bullet comments) of Bingbing❤'s Bilibili video

Crawling Bilibili video danmaku

Reference: https://www.cnblogs.com/becks/p/14540355.html

0. Before crawling

Install BeautifulSoup4 (the script below also uses requests).

Bilibili's danmaku endpoint is shown below 👇. The xxx in the middle is an id of the video, namely the cid that will come up later:

http://comment.bilibili.com/xxx.xml

1. Getting the cid

The danmaku we want comes from Bingbing's first video. Open that video, click play, then open the browser's developer tools and follow the steps below to find the cid.

(Screenshot: locating the cid in the browser developer tools)
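If you would rather not dig through the developer tools, the cid can usually also be fetched programmatically. The sketch below is not part of the original post: it assumes Bilibili's pagelist endpoint (api.bilibili.com/x/player/pagelist) and its usual JSON shape, and the bvid value is a placeholder you would copy from the video's URL.

import requests

# Assumption: the pagelist endpoint returns the cid of each part of a video.
# The bvid below is a placeholder -- copy the real BV id from the video's URL.
bvid = 'BV1xxxxxxxxx'
resp = requests.get('https://api.bilibili.com/x/player/pagelist',
                    params={'bvid': bvid},
                    headers={'User-Agent': 'Mozilla/5.0'})
data = resp.json()
if data.get('code') == 0:
    for page in data['data']:
        # each cid plugs into http://comment.bilibili.com/<cid>.xml
        print(page['page'], page['cid'])
else:
    print('request failed:', data)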

2. Opening the danmaku endpoint

We now know the danmaku URL for Bingbing's first video:

http://comment.bilibili.com/283851334.xml

Open this URL and inspect its structure. As the figure shows, every danmaku is stored in a d tag, which gives us the filter condition for the crawler:

comment = bf.find_all('d')

(Screenshot: XML structure of the danmaku file, with one d tag per comment)
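As a quick sanity check of that filter, here is a minimal, self-contained sketch that parses a hand-written fragment in the same shape as the real file (the fragment and its p attribute values are illustrative, not taken from the actual response):

from bs4 import BeautifulSoup

# Illustrative fragment shaped like the danmaku XML: one d tag per comment.
xml_doc = """
<i>
  <d p="12.3,1,25,16777215,1617977000,0,abcd1234,0">first danmaku</d>
  <d p="15.8,1,25,16777215,1617977003,0,efgh5678,0">second danmaku</d>
</i>
"""

soup = BeautifulSoup(xml_doc, 'html.parser')
for d in soup.find_all('d'):   # the filter condition from above
    print(d.text)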

3. The code

"""
@encoding : utf-8
@Author : Tang
@E-mail : 1009592703@qq.com
@File : t5_1beautifulsoup.py
@Description: use BeautifulSoup to crawl web pages
@CreateTime : 2021/4/10 14:00
"""
import requests
from bs4 import BeautifulSoup

global comments
comments = ''

def get_reviews(html):
html_doc = str(html, 'utf-8')
bf = BeautifulSoup(html_doc, 'html.parser')
comment = bf.find_all('d')
i = 0
for short in comment:
global comments
comments += str(i + 1) + ' : ' + short.text + '\n'
i = i + 1
return comments


url = "http://comment.bilibili.com/283851334.xml"
herders = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1;WOW64) AppleWebKit/537.36 (KHTML,like GeCKO) Chrome/45.0.2454.85 Safari/537.36 115Broswer/6.0.3',
'Referer': 'https://movie.douban.com/',
'Connection': 'keep-alive'}
response = requests.get(url, headers=herders)
print('返回状态码:%s' % response.status_code)
get_reviews(response.content)
print(comments)

(Screenshot: crawler output, the numbered list of danmaku)
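To feed the word-cloud and sentiment steps below, the scraped text can simply be dumped to a file. This small addition is not in the original script, and the danmaku.txt filename is an arbitrary choice:

# Persist the scraped danmaku so the later steps can reuse it.
with open('danmaku.txt', 'w', encoding='utf-8') as f:
    f.write(comments)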

4. Word cloud

(Screenshots: word clouds generated from the danmaku)
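The original post only shows screenshots here, so below is a minimal sketch of how such a word cloud can be built from the scraped text. It assumes the jieba and wordcloud packages, the danmaku.txt file saved above, and a font path that you would point at a Chinese font available on your machine:

import jieba
from wordcloud import WordCloud

# Read the danmaku saved in step 3 (danmaku.txt is the filename chosen above).
with open('danmaku.txt', encoding='utf-8') as f:
    text = f.read()

# Cut the Chinese text into words so WordCloud can count frequencies.
words = ' '.join(jieba.cut(text))

# font_path is an assumption: point it to any Chinese font on your system,
# otherwise Chinese characters render as empty boxes.
wc = WordCloud(font_path='simhei.ttf',
               width=800, height=600,
               background_color='white').generate(words)
wc.to_file('danmaku_wordcloud.png')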

5. Simple sentiment analysis

(Screenshots: sentiment analysis results)
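Again, only screenshots survive in the original, so here is a hedged sketch of a simple sentiment pass using SnowNLP, a common choice for Chinese text (not necessarily what the post used). Each danmaku gets a score between 0 (negative) and 1 (positive):

from snownlp import SnowNLP

# Strip the "N : " prefix written by the crawler and drop empty lines.
with open('danmaku.txt', encoding='utf-8') as f:
    lines = [line.split(' : ', 1)[-1].strip() for line in f if line.strip()]

# SnowNLP(...).sentiments returns a value in [0, 1]; closer to 1 is more positive.
scores = [SnowNLP(line).sentiments for line in lines]
positive = sum(1 for s in scores if s > 0.6)
negative = sum(1 for s in scores if s < 0.4)
print('total: %d, positive: %d, negative: %d, average score: %.3f'
      % (len(scores), positive, negative, sum(scores) / len(scores)))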
