
A Hands-On Guide to Web Scraping with BeautifulSoup
Lately people keep asking me: why does my IP always get blocked when I scrape websites with Python? Let's talk about that today, starting with a real case: last month a student of mine wanted to scrape product prices from a site and got blacklisted after grabbing only about 200 records. The answer is the proxy IP approach, especially with ipipgo's service, which in my own tests has held up under heavy collection.
Why do I need a proxy IP?
Here's an analogy: a website is like a neighborhood gatekeeper. If you walk in and out wearing the same clothes every day, you'll be recognized within three days. A proxy IP is your disguise: every request can go out in a different "outfit". But be careful, don't use those free proxies, nine out of ten are traps. A professional provider like ipipgo has a large, stable IP pool and is far less likely to let you down.
import requests
from bs4 import BeautifulSoup

# the proxy credentials and endpoint come from your ipipgo account
proxies = {
    'http': 'http://username:password@proxy.ipipgo.com:9020',
    'https': 'https://username:password@proxy.ipipgo.com:9020'
}

response = requests.get('https://target-site.example.com', proxies=proxies)
soup = BeautifulSoup(response.text, 'html.parser')
# your parsing code goes here...
Don't cut corners on environment setup
Installing the libraries can occasionally be a hassle; using the Tsinghua mirror is recommended:
pip install beautifulsoup4 requests -i https://pypi.tuna.tsinghua.edu.cn/simple
Watch out for version compatibility: Python 3.8 or above is recommended. If you hit SSL errors, remember to update your certificates:
pip install --upgrade certifi
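If you want to confirm the install worked, a quick check like this will do (not required, just a convenience):

import bs4
import requests

# print the installed versions so you can confirm compatibility at a glance
print(bs4.__version__, requests.__version__)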
Four Steps in Practice
1. First, put together disguised request headers so the site doesn't peg you as a crawler:
headers = {
    # a full, realistic browser User-Agent string, so you look like a proper browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.9'
}
2. Handle proxy settings carefully; using a Session object to keep the connection alive is recommended:
session = requests.Session()
session.proxies.update(proxies)
3. CSS selectors are the most reliable way to parse, for example, to grab a product's price:
price_tags = soup.select('div.price-wrapper > span.current-price')
4. Don't get lazy with exception handling, especially for network hiccups (a combined sketch of all four steps follows below):
try:
    response.raise_for_status()
except requests.exceptions.HTTPError as err:
    print(f"Request failed: {err}")
This is where ipipgo's automatic IP switching kicks in.
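Putting the four steps together, here's a minimal end-to-end sketch. The proxy endpoint is the ipipgo example from above; the target URL and CSS selector are placeholders, so swap in your own values:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.9'
}
proxies = {
    'http': 'http://username:password@proxy.ipipgo.com:9020',
    'https': 'https://username:password@proxy.ipipgo.com:9020'
}

# one session reuses connections and carries the headers/proxies on every request
session = requests.Session()
session.headers.update(headers)
session.proxies.update(proxies)

try:
    response = session.get('https://target-site.example.com/products', timeout=10)
    response.raise_for_status()
except requests.exceptions.RequestException as err:
    print(f"Request failed: {err}")
else:
    soup = BeautifulSoup(response.text, 'html.parser')
    # placeholder selector; adjust it to the actual page structure
    for tag in soup.select('div.price-wrapper > span.current-price'):
        print(tag.get_text(strip=True))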
Common pitfalls and how to defuse them
| Symptom | Fix |
|---|---|
| Returns a 403 error | Check your User-Agent and Cookies |
| Connection timeout | Increase the timeout parameter (see the snippet below) |
| Data doesn't match | Check whether the page structure has changed |
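For the timeout row, you can pass an explicit timeout so a slow site can't hang the crawler indefinitely; the URL is a placeholder and the numbers are just reasonable starting points:

import requests

# explicit (connect, read) timeouts in seconds; tune the values to your network
response = requests.get('https://target-site.example.com', timeout=(5, 15))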
Q&A time
Q: Do I need to change my IP often with ipipgo?
A: Their IP pool is big enough and rotates automatically by default. Unless you're collecting at especially high frequency, you generally don't need to switch manually.
Q: How do I control the crawl speed?
A: Add a random delay between requests:
import time
import random
time.sleep(random.uniform(1, 3))  # sleep randomly for 1-3 seconds
Q: What should I do if I encounter a CAPTCHA?
A: In that case: 1. lower your request frequency; 2. switch to ipipgo's high-anonymity proxies; 3. hook up a CAPTCHA-solving service (but costs go up).
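On point 1, here's a small sketch of slowing down automatically when the site pushes back; the status codes, function name, and pause lengths are my own assumptions, not anything built into ipipgo:

import random
import time

import requests

def polite_get(session, url, max_retries=3):
    # retry with a progressively longer random pause when the site signals "too fast"
    response = None
    for attempt in range(max_retries):
        response = session.get(url, timeout=10)
        if response.status_code not in (403, 429):  # assumed rate-limit signals
            break
        time.sleep(random.uniform(2, 5) * (attempt + 1))
    return response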
A final word.
Proxy IPs aren't a cure-all; the key is to make your traffic look as real as possible. ipipgo's dynamic residential proxies are especially well suited to long-running collection jobs; in my own tests a crawler ran for a week straight without getting blocked. And remember, scraping calls for some restraint: don't knock over other people's servers.

