子夜的星 | Personal Blog

Writing a Bilibili Scraper with AI


This post records an experiment in which I did not write a single line of code myself: all of the code was written by AI.

AI tool used: Claude

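First, I inspected the structure of the Bilibili page to locate the element that holds the video information. Right-click and choose Inspect (or press F12), then click the element-picker button at the far left of the developer tools toolbar. Hover over the title of the first video, which is highlighted with a light background, and click it.

Next, once the matching element is selected, copy its HTML.

Then, using the copied HTML, write the prompt for the AI: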

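You are a professional Python web-scraping programmer. Please write a Python scraper for me that collects Bilibili's popular-list data. The requirements are:

1. Site to scrape: https://www.bilibili.com/v/popular/all
2. Fields to scrape: title, author, video link, view count, cover image link
3. Save the scraped content as a CSV file

The front-end element for a single video on the site is: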
<div data-v-507d2d1a="" data-v-78bc95a6="" class="video-card"><div data-v-507d2d1a="" class="video-card__content"><a data-v-507d2d1a="" href="//www.bilibili.com/video/BV114421U75X" target="_blank"><img data-v-507d2d1a="" class="lazy-image cover-picture__image" data-src="//i2.hdslb.com/bfs/archive/65cbaf02664e32d7d022886250ed178ae2f29024.jpg@412w_232h_1c_!web-popular.avif" src="//i2.hdslb.com/bfs/archive/65cbaf02664e32d7d022886250ed178ae2f29024.jpg@412w_232h_1c_!web-popular.avif" lazy="loaded"></a> <div data-v-507d2d1a="" class="watch-later van-watchlater black"><span class="wl-tips" style="left: -21px; display: none;"></span></div></div> <div data-v-507d2d1a="" class="video-card__info"><p data-v-507d2d1a="" title="纯黑《黑神话：悟空》全程无伤攻略解说 第一期" class="video-name">纯黑《黑神话：悟空》全程无伤攻略解说 第一期</p> <div data-v-507d2d1a=""><span data-v-507d2d1a="" class="rcmd-tag strong-tag">百万播放</span> <span data-v-507d2d1a="" class="up-name"><svg data-v-507d2d1a="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" viewBox="0 0 18 18" width="18" height="18" style="width: 18px; height: 18px; margin-right: 4px;">
****************************
*** Part of the element omitted (too long) ***
****************************
 <!----> <!----></div></div></div>
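
One detail in the fragment above is easy to miss: the `href` and `data-src` values start with `//`, meaning they are protocol-relative rather than full URLs. The following is a minimal sketch of one way to turn them into absolute links using the standard library. The helper name `absolutize` and the `BASE_URL` constant are illustrative assumptions, not part of the original prompt or script, and the cover URL is shortened by dropping its `@412w...` resize suffix.

```python
from urllib.parse import urljoin

# Assumption: the site root, used here only to supply the "https" scheme.
BASE_URL = "https://www.bilibili.com"


def absolutize(link: str, base: str = BASE_URL) -> str:
    """Resolve a protocol-relative ("//host/path") or relative link against base."""
    return urljoin(base, link)


# Values copied from the sample element (cover URL trimmed for brevity).
video_url = absolutize("//www.bilibili.com/video/BV114421U75X")
cover_url = absolutize("//i2.hdslb.com/bfs/archive/65cbaf02664e32d7d022886250ed178ae2f29024.jpg")

print(video_url)
print(cover_url)
```

Simply concatenating `"https://www.bilibili.com"` with such an `href` would repeat the host name, so resolving the link (or prefixing only `"https:"`) is the safer choice.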

After I sent the prompt above to the AI, it wrote the following code:

import requests
from bs4 import BeautifulSoup
import csv

# Target URL
url = "https://www.bilibili.com/v/popular/all"

# Send the HTTP request and get the page content
response = requests.get(url)
html_content = response.content

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

# Find all video card elements
video_cards = soup.find_all("div", class_="video-card")

# Create the CSV file and write the header row
with open("bilibili_hotlist.csv", "w", newline="", encoding="utf-8") as csvfile:
    fieldnames = ["title", "author", "video_link", "views", "image_link"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    # Loop over each video card, extract the fields, and write them to the CSV
    for video_card in video_cards:
        title = video_card.find("p", class_="video-name").text.strip()
        author = video_card.find("span", class_="up-name__text").text.strip()
        # The href is protocol-relative ("//www.bilibili.com/video/..."), so only the scheme is added
        video_link = "https:" + video_card.find("a")["href"]
        views = video_card.find("span", class_="play-text").text.strip()
        image_link = video_card.find("img", class_="cover-picture__image")["data-src"]

        writer.writerow({
            "title": title,
            "author": author,
            "video_link": video_link,
            "views": views,
            "image_link": image_link
        })

print("Data saved to bilibili_hotlist.csv.")

Running it did produce a CSV file, but the file contained no data rows. So I continued the conversation with the AI and asked it to fix the bug.
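
A quick way to narrow down an empty result like this is to check whether the HTML returned by a plain HTTP request contains the card markup at all. The sketch below is only a crude diagnostic under stated assumptions: the function name `has_video_cards` is hypothetical, the substring test is not a real HTML parse, and both sample strings are made up for illustration (the second borrows attribute names from the element copied earlier).

```python
def has_video_cards(html: str, marker: str = 'class="video-card"') -> bool:
    """Return True if the raw HTML text contains the video-card marker."""
    return marker in html


# Hypothetical static response: an empty shell that JavaScript fills in later.
static_shell = '<html><body><div id="app"></div></body></html>'

# Hypothetical rendered markup, as seen in the browser's element inspector.
rendered = '<div data-v-507d2d1a="" class="video-card"><p class="video-name">...</p></div>'

print(has_video_cards(static_shell))
print(has_video_cards(rendered))
```

If the check fails on the text returned by `requests.get(url)` but the cards are visible in the browser, the list is most likely rendered client-side, and a tool that executes JavaScript is needed.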

import requests
from bs4 import BeautifulSoup
import csv
import time

# Target URL
url = "https://www.bilibili.com/v/popular/all"

# Send the HTTP request and get the page content
response = requests.get(url)
html_content = response.content

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

# Check whether the data is present in the initially loaded HTML
if not soup.find_all("div", class_="video-card"):
    print("The data appears to be loaded dynamically by JavaScript; further handling is needed.")

    # Use Selenium to load the page in a real browser
    from selenium import webdriver
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(5)  # Wait 5 seconds for the page to finish loading
    html_content = driver.page_source
    driver.quit()  # Close the browser once the rendered HTML has been captured
    soup = BeautifulSoup(html_content, "html.parser")

# Find all video card elements
video_cards = soup.find_all("div", class_="video-card")

# Create the CSV file and write the header row
with open("bilibili_hotlist.csv", "w", newline="", encoding="utf-8") as csvfile:
    fieldnames = ["title", "author", "video_link", "views", "image_link"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    # Loop over each video card, extract the fields, and write them to the CSV
    for video_card in video_cards:
        title = video_card.find("p", class_="video-name").text.strip()
        author = video_card.find("span", class_="up-name__text").text.strip()
        # The href is protocol-relative ("//www.bilibili.com/video/..."), so only the scheme is added
        video_link = "https:" + video_card.find("a")["href"]
        views = video_card.find("span", class_="play-text").text.strip()
        image_link = video_card.find("img", class_="cover-picture__image")["data-src"]

        writer.writerow({
            "title": title,
            "author": author,
            "video_link": video_link,
            "views": views,
            "image_link": image_link
        })

print("Data saved to bilibili_hotlist.csv.")

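This time the scraper returned data, but only 21 entries. Looking at the page myself, the cause was clear: Bilibili had not loaded the full list yet, so the scraper captured only the first batch of videos. I continued the conversation with the AI and asked it to fix this bug as well.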

from bs4 import BeautifulSoup
import csv
import time
from selenium import webdriver

# Target URL
url = "https://www.bilibili.com/v/popular/all"

# Start the Selenium driver
driver = webdriver.Chrome()
driver.get(url)

# Keep scrolling down until the page stops growing
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # Wait 5 seconds for the next batch to load

    # Measure the new page height; stop once it no longer changes
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Get the full page source
html_content = driver.page_source

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

# Find all video card elements
video_cards = soup.find_all("div", class_="video-card")

# Create the CSV file and write the header row
with open("bilibili_hotlist.csv", "w", newline="", encoding="utf-8") as csvfile:
    fieldnames = ["title", "author", "video_link", "views", "image_link"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    # Loop over each video card, extract the fields, and write them to the CSV
    for video_card in video_cards:
        title = video_card.find("p", class_="video-name").text.strip()
        author = video_card.find("span", class_="up-name__text").text.strip()
        # The href is protocol-relative ("//www.bilibili.com/video/..."), so only the scheme is added
        video_link = "https:" + video_card.find("a")["href"]
        views = video_card.find("span", class_="play-text").text.strip()
        image_link = video_card.find("img", class_="cover-picture__image")["data-src"]

        writer.writerow({
            "title": title,
            "author": author,
            "video_link": video_link,
            "views": views,
            "image_link": image_link
        })

driver.quit()
print("Data saved to bilibili_hotlist.csv.")

This time the script successfully scraped the entire Bilibili popular list.
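
The scroll loop in the final script runs until the page height stops changing, which could in principle go on for a long time on a page that never stops growing. As a hedged variation rather than what the AI produced, the same idea can be written as a small function with an upper bound on the number of rounds. The name `scroll_until_stable`, its parameters, and the `FakePage` stand-in are all assumptions made for illustration; the fake page exists only so the logic can be exercised without a browser.

```python
import time
from typing import Callable


def scroll_until_stable(
    get_height: Callable[[], int],
    scroll_to_bottom: Callable[[], None],
    pause: float = 5.0,
    max_rounds: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Scroll until the height stops changing or max_rounds is reached.

    Returns the number of scroll rounds performed.
    """
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_to_bottom()
        sleep(pause)  # give the page time to load the next batch
        new_height = get_height()
        if new_height == last_height:
            return rounds  # nothing new was loaded, so stop
        last_height = new_height
    return max_rounds  # gave up: the page was still growing


class FakePage:
    """Hypothetical stand-in for a browser page that grows a fixed number of times."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._index = 0

    def height(self) -> int:
        return self._heights[self._index]

    def scroll(self) -> None:
        self._index = min(self._index + 1, len(self._heights) - 1)


page = FakePage([1000, 2000, 3000])
rounds = scroll_until_stable(page.height, page.scroll, pause=0, sleep=lambda _: None)
print(rounds)
```

With Selenium, the two callables would wrap the same `driver.execute_script(...)` calls used in the script above: one returning `document.body.scrollHeight`, the other calling `window.scrollTo(0, document.body.scrollHeight)`.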

2024-08-21. Tagged by 子夜: Web Scraping. Category: Web Scraping.