一、处理Cookie

在有些需要处理Cookies是实现身份验证、会话保持和访问受限制内容的关键步骤。

Cookies是服务器发送给客户端的小数据片段，客户端会在后续请求中将它们发送回服务器。

1、手动设置Cookie

import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/",
    "Cookie": "GUID=47bdf4fc-128e-4525-b812-b52697ff3f89; _openId=ow-yN5kyWuF_VCeTF42PXixLnvHM; c_channel=0; "
              "c_csc=web; accessToken=avatarUrl%3Dhttps%253A%252F%252Fcdn.static.17k.com%252Fuser%252Favatar%252F17"
              "%252F77%252F30%252F103553077.jpg-88x88%253Fv%253D1717857827000%26id%3D103553077%26nickname%3Dzydx123"
              "%26e%3D1733410783%26s%3Daf252d208f18d213"
}
resp = requests.get("https://user.17k.com/ck/author2/shelf?page=1&appKey=2406394919",headers=headers)

print(resp.json())

2、使用Session管理

使用Python的requests库来模拟登录并获取Cookies，后续使用session发起请求。

import requests

# 创建会话对象
session = requests.Session()

# 登录数据
data = {
    "loginName": "zhanghao",
    "password": "mima"
}

# 登录URL
url = "https://passport.17k.com/ck/user/login"

# 发送POST请求进行登录
response = session.post(url, data=data)

# 检查登录响应
if response.status_code == 200:
    print("登录成功")
else:
    print("登录失败")

# 打印会话的Cookies
print("Cookies after login:")
print(session.cookies)

# 使用session发送后续请求
response = session.get("https://user.17k.com/ck/author2/shelf?page=1&appKey=2406394919")
print(response.text)

二、防盗链处理

1、防盗链现象

通过检查可以看见这个视频的URL链接，但是HTML源码中并没有这个链接。

查看网络请求，可以得到下面的内容：

根据这个请求头的地址可以获得一些json信息，这些json信息中就包含了视频地址信息。

但是，如果直接访问这个请求地址：https://www.pearvideo.com/videoStatus.jsp?contId=1794701&mrd=0.47983386088167057

由于请求头中包含防盗链技术（referer），无法获取到正常情况下应该获取的JSON信息。

所以，要通过修改请求头中的referer来绕过防盗链技术，从而成功获取所需的JSON信息。

修改请求头：

url = "https://www.pearvideo.com/video_1794701"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 "
                  "Safari/537.36 Edg/125.0.0.0",
    # 防盗链：溯源，找到本次请求的上一级
    "Referer": url
}

2、抓取梨视频

# 1.拿到contID
# 2.拿到videoStatus返回的json -> srcURL
# 3.srcURL里面的内容进行解密
# 4.下载视频
import requests

# 拉取视频地址
url = "https://www.pearvideo.com/video_1794643"
contID = url.split("_")[1]

videoStatusUrl = f"https://www.pearvideo.com/videoStatus.jsp?contId={contID}&mrd=0.04365037843226749"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 "
                  "Safari/537.36 Edg/125.0.0.0",
    # 防盗链：溯源，找到本次请求的上一级
    "Referer": url
}

resp = requests.get(videoStatusUrl, headers=headers)
dic = resp.json()
srcUrl = dic['videoInfo']['videos']['srcUrl']
systemTime = dic['systemTime']
srcUrl = srcUrl.replace(systemTime, f"cont-{contID}")

# 下载视频
with open(f"{contID}.mp4", "wb") as f:
    f.write(requests.get(srcUrl).content)

三、代理

代理的原理就是通过第三方的一个机器去发送请求，使用代理服务可以隐藏客户端的真实IP地址，绕过访问限制。

免费代理站：免费代理IP站点

import requests
# 代理服务器地址
proxy = "http:110.86.183.16:59345"

# 配置代理
proxies = {
    "http": proxy,
    "https": proxy,
}
url = "https://www.baidu.com"

response = requests.get(url, proxies=proxies)

print(response.text)

网上找的比较靠谱的代理API:

import requests

def get_proxy():
    return requests.get("https://api.linux-do-proxy.com/get/").json()

# your spider code

def getHtml():
    # ....
    retry_count = 5
    proxy = get_proxy().get("proxy")
    while retry_count > 0:
        try:
            html = requests.get('http://www.example.com', proxies={"http": "http://{}".format(proxy)})
            # 使用代理访问
            return html
        except Exception:
            retry_count -= 1
    return None

2024-05-04 该篇文章被子夜打上标签: 网络爬虫归为分类: 网络爬虫