
Hands-on with Python: crawl data without getting your IP blocked
Recently a lot of friends have asked me how to scrape website data with Python, only to have their home-grown crawler get its IP banned after two days of running. I stumbled over the same thing three years ago, and later found a lifesaver: proxy IPs. Today I'll use the ipipgo service I run at home as an example to show you how this routine works.
Why doesn't your crawler live more than three days?
Websites aren't fools. Their anti-crawler systems mainly watch three indicators: visit frequency, request characteristics, and IP footprint. The IP is the hardest one to hide. An ordinary crawler firing requests from a single fixed IP is like the same person checking out at the supermarket 50 times a minute; if the security guard doesn't grab you, who would they grab?
A typical code example
import requests

for page in range(1, 100):
    url = f'https://xxx.com/list?page={page}'
    r = requests.get(url)   # hammering the site with the same IP every time
The right way to use proxy IPs
Here I recommend ipipgo's dynamic residential proxies. Their IP pool is ridiculously large (reportedly 90 million+), and every request goes out from a different real user's IP, so the site can't tell whether it's a person or a machine.
What a reliable crawler should look like
import requests
from random import choice

proxies_pool = [
    '112.85.130.93:3328',
    '120.33.240.211:1188',
    # ... fill in with the proxies provided by ipipgo
]

url = 'https://target-site.com'          # replace with your target site
headers = {'User-Agent': 'Mozilla/5.0'}

for _ in range(10):
    proxy = {'http': choice(proxies_pool)}   # pick a random IP for each request
    response = requests.get(url, headers=headers, proxies=proxy)
    print(response.text[:200])               # print the first 200 characters to confirm success
Five anti-blocking tricks
1. IP rotation rhythm: Don't mindlessly change IPs on every single request; switch at random intervals like a real person would. For example, rotate after every 3-8 visits, with a random 1-3 second wait in between (see the sketch after this list).
2. Realistic request headers: Always send a common browser User-Agent; never leave the default requests header.
3. Retry on failure: When you hit a 403/429 status code, back off for a moment and retry with a different IP.
4. Spread the traffic: Don't hammer one page to death; interleave visits across multiple pages.
5. Protocol choice: Some sites are more likely to trigger verification over HTTPS than over HTTP.
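A minimal sketch putting tips 1-3 together, assuming you already have a proxies_pool list like the one above (the crawl function name and the User-Agent string are just illustrative):

import random
import time
import requests

def crawl(urls, proxies_pool):
    proxy_addr = random.choice(proxies_pool)
    visits_on_ip, rotate_after = 0, random.randint(3, 8)    # tip 1: change IP every 3-8 visits
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}   # tip 2: browser UA
    results = {}
    for url in urls:
        if visits_on_ip >= rotate_after:
            proxy_addr = random.choice(proxies_pool)
            visits_on_ip, rotate_after = 0, random.randint(3, 8)
        for _ in range(3):                                   # tip 3: retry with a fresh IP on 403/429
            proxy = {'http': f'http://{proxy_addr}', 'https': f'http://{proxy_addr}'}
            response = requests.get(url, headers=headers, proxies=proxy, timeout=8)
            if response.status_code not in (403, 429):
                break
            time.sleep(random.uniform(3, 6))                 # take a break before switching IP
            proxy_addr = random.choice(proxies_pool)
            visits_on_ip = 0
        results[url] = response.text
        visits_on_ip += 1
        time.sleep(random.uniform(1, 3))                     # tip 1: random 1-3 second pause
    return results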
Practical example: scraping e-commerce price data
For example, say you want to monitor the price fluctuations of an item on a certain e-commerce site:
1. Open a pay-as-you-go package in the ipipgo dashboard
2. Pull the latest proxy list through their API
3. Crawl the page about every half hour, and be careful not to hit it exactly on the hour
4. When you run into a CAPTCHA, automatically switch IP and retry (a rough sketch follows below)
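A rough sketch of that monitoring loop might look like this. The API_URL and the response shape in get_proxy_list are placeholders I made up, not the real ipipgo endpoint; check their dashboard or docs for the actual API details.

import random
import time
import requests

API_URL = 'https://api.example-proxy-provider.com/proxies'   # placeholder endpoint

def get_proxy_list():
    # Hypothetical helper: pull the latest proxy list from the provider's API (step 2)
    resp = requests.get(API_URL, timeout=10)
    return resp.json()['proxies']                             # assumed response shape

def monitor_price(product_url):
    while True:
        addr = random.choice(get_proxy_list())
        proxy = {'http': f'http://{addr}', 'https': f'http://{addr}'}
        response = requests.get(product_url, proxies=proxy, timeout=8)
        if 'CAPTCHA' in response.text:
            time.sleep(random.uniform(1, 3))                  # step 4: cool off, then retry on a new IP
            continue
        print(response.text[:200])                            # parse the price out of the page here
        # step 3: roughly every half hour, jittered so it never lands exactly on the dot
        time.sleep(30 * 60 + random.uniform(-120, 120))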
Advanced version with exception handling
import requests
import time

def smart_crawler(url):
    max_retry = 3
    for attempt in range(max_retry):
        try:
            proxy = get_ipipgo_proxy()   # call the ipipgo API here to get a fresh IP
            response = requests.get(url, proxies=proxy, timeout=8)
            if 'CAPTCHA' in response.text:
                raise Exception('Authentication triggered')
            return response.text
        except Exception as e:
            print(f'Error: {e}, switching IP')
            time.sleep(2 ** attempt)     # exponential backoff wait
    return None
Frequently Asked Questions (Q&A)
Q: What should I do if my proxy IP is slow?
A: Pick the right type of proxy! ipipgo's static residential proxies, for example, can get latency down to within 200ms, more than twice as fast as an ordinary datacenter proxy.
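If you want to check the latency yourself, a quick timing test like this works (the proxy address below is just a placeholder; swap in one of your own):

import requests

proxy_addr = '112.85.130.93:3328'        # placeholder, use one of your own proxies
proxy = {'http': f'http://{proxy_addr}', 'https': f'http://{proxy_addr}'}
response = requests.get('http://httpbin.org/ip', proxies=proxy, timeout=8)
print(f'latency: {response.elapsed.total_seconds() * 1000:.0f} ms')   # time until response headers arrive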
Q: How do I test whether a proxy is working?
A: Test with a small batch of IPs first; this detection endpoint is recommended:
Detection code:
resp = requests.get('http://httpbin.org/ip', proxies=proxy)   # 'proxy' is your proxy dict
print(resp.json())   # shows the exit IP currently in use
Q: What should I do if the website upgrades its anti-crawling measures?
A: Switch the proxy protocol type promptly, for example from HTTP to SOCKS5. The ipipgo dashboard lets you filter proxies by protocol type directly, which is particularly convenient.
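With requests, switching to SOCKS5 only means changing the scheme in the proxies dict; note that SOCKS support needs the PySocks extra (pip install "requests[socks]"). The address here is again a placeholder:

import requests   # SOCKS5 needs: pip install "requests[socks]"

proxy_addr = '112.85.130.93:1080'         # placeholder SOCKS5 proxy
proxies = {
    'http': f'socks5://{proxy_addr}',     # use socks5h:// to also resolve DNS through the proxy
    'https': f'socks5://{proxy_addr}',
}
response = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=8)
print(response.json())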
Money-saving recommendations
If you can't be bothered to tinker with all this yourself, go straight for ipipgo's Smart Proxy package. Their rotation strategy is developed in-house and is said to adapt automatically to the target site's protection level; newbies reportedly reach success rates of up to 90% with it. With the recent Double Eleven promotion and 50% off your first order, it's much more cost-effective than building your own proxy pool.

