
Hands-On: Using Proxy IPs to Bypass Anti-Crawling Mechanisms
Anyone who works with web crawlers knows the biggest headache is the target site's anti-crawling system. Last week I was scraping data from an e-commerce platform, and my IP got blocked after only half an hour. That's when proxy IPs save the day: the principle is like wearing a mask to a masquerade, so the site sees a different face every time.
I recommend ipipgo's dynamic residential proxies: the IP pool is large enough that I collected for 6 hours straight without triggering a block. Here's how to configure the proxy in Requests:
```python
import requests

proxies = {
    'http': 'http://username:password@gateway.ipipgo.com:9020',
    'https': 'http://username:password@gateway.ipipgo.com:9020'
}
response = requests.get('https://target-site.com', proxies=proxies, timeout=10)
```
Note the use of username-and-password authentication, which is more flexible than whitelisting your own IP address. In the ipipgo dashboard you can generate API extraction links yourself, and it is recommended to pick a different exit IP at random for each request.
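A minimal sketch of that rotation idea follows. The extraction URL and the response format here are placeholders for illustration (the real API link is generated in your ipipgo dashboard), so adjust the parsing to whatever your link actually returns:

```python
import random
import requests

# Placeholder: paste the API extraction link generated in your ipipgo dashboard here
EXTRACT_API = 'https://your-extraction-link.example/api?num=10&format=json'

def get_proxy():
    """Pull a batch of exit IPs and return one at random as a requests-style proxies dict."""
    ip_list = requests.get(EXTRACT_API, timeout=5).json()  # assumed to return ["ip:port", ...]
    endpoint = random.choice(ip_list)
    return {
        'http': f'http://{endpoint}',
        'https': f'http://{endpoint}',
    }

# Each request now goes out through a different exit IP
response = requests.get('https://target-site.com', proxies=get_proxy(), timeout=10)
```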
A Hands-On Guide to Avoiding Proxy IP Pitfalls
Three pitfalls newcomers commonly hit: ① not handling SSL certificate validation, ② unreasonable timeout settings, ③ switching IPs at an inappropriate frequency. Here is the configuration I use:
```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
adapter = HTTPAdapter(max_retries=3, pool_connections=100)
session.mount('http://', adapter)
session.mount('https://', adapter)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.9'
}
```
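To actually cover pitfalls ① and ② above, I send requests through that session roughly like this (reusing the session, headers, and proxies from the snippets above; the timeout values are just my own habit, not anything ipipgo requires):

```python
try:
    response = session.get(
        'https://target-site.com',
        headers=headers,
        proxies=proxies,
        timeout=(5, 15),  # (connect timeout, read timeout) so a dead proxy can't hang the crawler
        verify=True,      # keep SSL certificate validation on instead of silently disabling it
    )
except requests.exceptions.SSLError:
    print("SSL validation failed through this proxy; inspect it before retrying")
except requests.exceptions.Timeout:
    print("Request timed out; consider rotating to a faster exit IP")
```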
With ipipgo's pay-per-traffic billing package, remember to add response status checks to your code. When a 403 status code comes back, switch proxies automatically, like this:
```python
if response.status_code == 403:
    print("Anti-crawl triggered! Changing IP...")
    # Call ipipgo's API to fetch a fresh exit IP
    reset_proxy()
```
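The snippet leaves reset_proxy() undefined, so here is one hedged way to fill it in, reusing the get_proxy() helper sketched earlier; treat it as a sketch rather than ipipgo's official client code:

```python
def reset_proxy():
    """Swap the shared proxies dict for a fresh exit IP (illustrative sketch)."""
    global proxies
    proxies = get_proxy()  # re-pull a random exit IP from the extraction link
    print(f"Switched to new proxy: {proxies['http']}")
```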
Tips for Doubling Your Collection Efficiency
A single-threaded crawler wastes proxy IP resources; go multi-threaded to make full use of the bandwidth. Just make sure the thread count doesn't exceed your ipipgo package's maximum concurrency, or you'll be throttled.
Here's a parameter comparison table:
| Package Type | Recommended Number of Threads | Requests per Second |
|---|---|---|
| Trial | 5 | 3 |
| Enterprise | 50 | 20 |
| Custom | 200+ | Negotiable |
It is recommended to use the concurrent.futures module for the thread pool, and remember to assign an independent proxy to each thread:
```python
from concurrent.futures import ThreadPoolExecutor

import requests

def worker(url):
    proxy = get_proxy()  # get a fresh exit IP from ipipgo for this request
    return requests.get(url, proxies=proxy, timeout=10)

with ThreadPoolExecutor(max_workers=20) as executor:
    results = list(executor.map(worker, url_list))  # url_list is your list of target URLs
```
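To stay under the requests-per-second ceiling in the table above, one option (my own sketch, not an ipipgo requirement) is a small shared rate limiter that every worker passes through:

```python
import threading
import time

class RateLimiter:
    """Allow at most `rate` calls per second across all threads (illustrative sketch)."""
    def __init__(self, rate):
        self.interval = 1.0 / rate
        self.lock = threading.Lock()
        self.next_time = time.monotonic()

    def wait(self):
        with self.lock:
            now = time.monotonic()
            if now < self.next_time:
                time.sleep(self.next_time - now)
            self.next_time = max(now, self.next_time) + self.interval

limiter = RateLimiter(rate=20)  # e.g. the Enterprise package's 20 requests per second

def throttled_worker(url):
    limiter.wait()
    return worker(url)  # reuses the worker() defined above
```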
Frequently Asked Questions: First Aid Kit
Q: What should I do if the proxy IP suddenly fails to connect?
A: First check whether your account quota has been used up, then test your local network. The ipipgo dashboard shows real-time usage statistics; it's recommended to enable the low-balance alert.
Q: How do I get past Cloudflare protection?
A: Switch to ipipgo's high-anonymity residential proxies, and pair them with a randomized User-Agent and simulated mouse movement trajectories.
Q: Is it normal for the collection speed to fluctuate?
A: Proxy nodes in different regions differ in speed. It's recommended to record each IP's response time in your code and prioritize the fast nodes.
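One minimal way to do that bookkeeping is sketched below; the ranking by average latency is just my own choice, not an ipipgo feature:

```python
import time
from collections import defaultdict

import requests

# Rolling record of observed response times per proxy endpoint
latency_log = defaultdict(list)

def timed_get(url, proxy):
    """Fetch url through proxy and record how long the request took."""
    start = time.monotonic()
    response = requests.get(url, proxies=proxy, timeout=10)
    latency_log[proxy['http']].append(time.monotonic() - start)
    return response

def fastest_proxies(top_n=5):
    """Return the top_n proxy endpoints with the lowest average response time so far."""
    ranked = sorted(latency_log.items(), key=lambda kv: sum(kv[1]) / len(kv[1]))
    return [endpoint for endpoint, _ in ranked[:top_n]]
```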
One last reminder: proxy IP usage should respect each website's robots.txt protocol. ipipgo provides a compliance user guide, and new registrations come with 1 GB of trial traffic, which is enough for small-scale data collection. When I ran into technical problems their customer service responded quickly; the last time I submitted a ticket at two in the morning, I got a solution within ten minutes.

