
Crawler keeps getting its IP blocked? Try this combo!
You've probably run into this: you write a crawler script in Python, it runs for two minutes, and the target site starts returning 403 errors. Don't smash the keyboard just yet. Today we'll use the BeautifulSoup + proxy IP golden pair to break the deadlock.
A real case: last month a developer building an e-commerce price-comparison tool scraped a shopping platform with a plain script, and his IP was blacklisted within half an hour. After switching to ipipgo's rotating proxy plan, combined with the parsing techniques covered below, he now collects tens of thousands of product records a day without interruption.
Hands-On: Building an Anti-Blocking Environment
First, install the two essential libraries (preferably inside a virtual environment):
pip install beautifulsoup4 requests
Here's the key part! Scraping without a proxy is like going online naked; a proxy IP is body armor for your crawler. Using ipipgo's service as an example, here is how to configure it:
proxies = {
    'http': 'http://username:password@gateway.ipipgo.com:9020',
    'https': 'http://username:password@gateway.ipipgo.com:9020'
}
Remember to replace the authentication details with your own account. ipipgo's dedicated proxies use a separate port for each channel, so don't mix them up.
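To keep credentials in one place, the proxies dict can be built by a small helper. This is just a convenience sketch: `build_proxies` is a name made up for illustration, and the gateway/port defaults are the ones from the example above.

```python
def build_proxies(user, password, gateway="gateway.ipipgo.com", port=9020):
    """Assemble a requests-style proxies dict from credentials.

    Defaults match the example above; swap in the port assigned
    to your own channel.
    """
    url = f"http://{user}:{password}@{gateway}:{port}"
    # requests routes both schemes through the same HTTP proxy endpoint
    return {"http": url, "https": url}

proxies = build_proxies("username", "password")
```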
Four Steps to Web Parsing
A real-world parse of a news site (details anonymized):
import requests
from bs4 import BeautifulSoup

# Step 1: fake a browser User-Agent
headers = {'User-Agent': 'Mozilla/5.0'}

# Step 2: fetch the page through the proxy
response = requests.get('https://example.com/news',
                        proxies=proxies,
                        headers=headers)

# Step 3: parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Step 4: grab titles with a specific class
titles = soup.find_all('h3', class_='news-title')
for title in titles:
    print(title.get_text().strip())
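To sanity-check the selector logic without hitting the network, here is a self-contained sketch against made-up HTML (the markup is invented for illustration; a real site's structure will differ):

```python
from bs4 import BeautifulSoup

# Minimal stand-in page; the class name matches the example above.
html = """
<html><body>
  <h3 class="news-title"> Headline A </h3>
  <h3 class="news-title">Headline B</h3>
  <h3 class="sidebar">not a headline</h3>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# CSS-selector equivalent of the find_all() call above
titles = [t.get_text().strip() for t in soup.select("h3.news-title")]
print(titles)  # ['Headline A', 'Headline B']
```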
Pitfall guide: the three easiest places to trip up here are: 1) missing request headers, so you get flagged as a bot; 2) low-quality proxy IPs causing request failures; 3) page-structure changes breaking your selectors. The first two can be solved with ipipgo's quality proxies plus a standard request-header template.
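A "standard request-header template" can be as simple as rotating a few realistic browser headers. The User-Agent strings below are examples I picked for illustration, not an official list:

```python
import random

# Example desktop User-Agent strings (assumed values for illustration).
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
]

def make_headers():
    """Return a header dict that looks like a normal browser request."""
    return {
        "User-Agent": random.choice(UA_POOL),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Connection": "keep-alive",
    }
```

Pass the result as `headers=make_headers()` in each `requests.get()` call.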
How do you handle dynamic content?
When it comes to JavaScript-rendered pages, BeautifulSoup alone can't see the content. Don't panic; here are the go-to solutions:
| Scenario | Solution | ipipgo configuration suggestion |
|---|---|---|
| Simple dynamic loading | requests-html library | Use long-lived static IPs |
| Complex interactive pages | Selenium automation | Pair with browser-fingerprint protection |
Focusing on the Selenium solution, add the proxy in the driver configuration. Note that Chrome's --proxy-server flag does not accept embedded username:password credentials, so use IP-whitelist authentication (if your provider supports it) or a proxy extension:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=http://gateway.ipipgo.com:9020')
driver = webdriver.Chrome(options=options)
Frequently Asked Questions First Aid Kit
Q: Why is it still blocked even though I'm obviously using a proxy?
A: Check three things: 1) whether the proxy is actually in effect; 2) whether your request frequency is too high; 3) whether you've triggered the site's anti-scraping rules. Consider ipipgo's pay-per-volume plan, which rotates high-anonymity IPs automatically.
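For points 2 and 3, two small helpers cover most cases: an exponential-backoff schedule to keep request frequency down, and a rotator that cycles through a pool of proxy URLs. Both are sketches; the URLs and ports below are placeholders.

```python
import itertools

def backoff_delays(retries=4, base=1.5):
    """Seconds to sleep before each retry: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(retries)]

def proxy_rotator(proxy_urls):
    """Yield a fresh requests-style proxies dict on every request."""
    for url in itertools.cycle(proxy_urls):
        yield {"http": url, "https": url}

rotator = proxy_rotator([
    "http://user:pass@gateway.ipipgo.com:9020",  # placeholder credentials
    "http://user:pass@gateway.ipipgo.com:9021",
])
```

Call `next(rotator)` before each request and sleep per `backoff_delays()` after each failure.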
Q: What should I do about garbled text in the response?
A: Specify the encoding when initializing BeautifulSoup:
soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8')
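If you'd rather detect the encoding than hard-code it, requests exposes `response.apparent_encoding`; a dependency-free fallback is to try the likely encodings in order. The candidate list below is an assumption (GBK is common on Chinese sites), and `decode_best` is a name made up for this sketch:

```python
def decode_best(raw: bytes, candidates=("utf-8", "gbk")):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement characters
    return raw.decode("utf-8", errors="replace"), "utf-8"
```

The detected name can then be passed straight to `from_encoding`.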
Q: How do I choose an ipipgo proxy plan?
A: Beginners can start with the trial plan ($5/day) and move to the enterprise custom plan once the business stabilizes. Special reminder: for large-scale collection, be sure to choose a dedicated IP pool; shared IPs easily interfere with each other.
Final note: the heart of web parsing is stable page acquisition plus accurate data extraction. Using ipipgo's proxy service is like bolting a turbocharger onto your crawler: it keeps your IP from being blocked and boosts collection efficiency. If you run into specific problems, you're welcome to contact technical support on the ipipgo official site; their customer-service response is genuinely fast.

