
I. Why does your crawler keep getting kicked out? Try changing your vest
If you're just starting to scrape data with Python, you're likely to run into this mess: you crawl just two pages of a website, a CAPTCHA pops up, and a little while later your IP gets blocked outright. It's like being caught by the cafeteria auntie sneaking extra food and getting your meal card blacklisted on the spot.
This is when the proxy IP "vest" trick comes in. It's like swapping meal cards every time you go to the cafeteria, so the aunties never recognize you as the same person. We recommend ipipgo's proxy service, which specializes in providing this kind of "vest": their IP pool is large and rotates quickly.
II. Putting on the vest, step by step
Install these two packages first:
pip install requests
pip install beautifulsoup4
Go to the ipipgo official website to get some free trial IPs. Their API looks like this:
import requests
proxy_api = "https://api.ipipgo.com/get?token=your_token"  # replace your_token with your own token
resp = requests.get(proxy_api)
proxy = resp.json()['proxy']  # get a fresh IP
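The lookup above trusts that the response always carries a `proxy` field. Here is a slightly more defensive sketch, assuming the JSON shape from the example (`{"proxy": "host:port"}`) is what the API really returns. The helper name `extract_proxy` is made up for illustration, and the actual response format should be checked against ipipgo's documentation:

```python
import json


def extract_proxy(body: str) -> str:
    """Return the 'host:port' string from the API's JSON body.

    Assumes the body looks like {"proxy": "host:port"}, as in the
    article's example; anything else is treated as an error.
    """
    data = json.loads(body)  # invalid JSON raises a ValueError subclass
    proxy = data.get("proxy") if isinstance(data, dict) else None
    if not isinstance(proxy, str) or ":" not in proxy:
        raise ValueError(f"unexpected API response: {body!r}")
    return proxy.strip()
```

With the request above it would be called as `proxy = extract_proxy(resp.text)`, so a bad token or an error payload fails loudly instead of surfacing later as a confusing connection error.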
III. Crawling data with the vest on
The basic way to wear the vest:
proxies = {
'http': 'http://'+proxy,
'https': 'https://'+proxy
}
resp = requests.get('destination URL', proxies=proxies, timeout=10)
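One detail worth checking with your provider: in the `proxies` mapping, the keys say which target URLs go through the proxy, while the scheme in each value says how to reach the proxy itself. Many plain HTTP proxies are addressed with `http://` for both keys. The sketch below makes the scheme a parameter so either setup can be tried; `build_proxies` is a hypothetical helper, not part of requests or ipipgo:

```python
def build_proxies(proxy: str, scheme: str = "http") -> dict:
    """Build the proxies mapping that requests expects from 'host:port'."""
    if "://" in proxy:
        raise ValueError("pass a bare 'host:port', without a scheme")
    proxy_url = f"{scheme}://{proxy}"
    # Same proxy for both plain-HTTP and HTTPS targets, so HTTPS
    # requests do not silently bypass the proxy.
    return {"http": proxy_url, "https": proxy_url}
```

Usage would be `requests.get('destination URL', proxies=build_proxies(proxy), timeout=10)`. For a SOCKS5 proxy, pass `scheme="socks5"`; requests needs its optional SOCKS extra installed (`pip install "requests[socks]"`) for that to work.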
Advanced players can switch vests automatically:
import requests
from itertools import cycle

# Get a bunch of IPs from ipipgo (the addresses below are placeholders)
proxy_list = ['111.222.333.444:8888', '555.666.777.888:9999']
proxy_pool = cycle(proxy_list)
url = 'destination URL'

for page in range(1, 6):
    current_proxy = next(proxy_pool)  # take the next vest in the cycle
    proxy_url = 'http://' + current_proxy
    try:
        # Route both http and https traffic through the current proxy
        resp = requests.get(url, proxies={'http': proxy_url, 'https': proxy_url}, timeout=10)
        # Process the data...
    except requests.RequestException:
        print(f"{current_proxy} this vest is leaking, switch to the next one")
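The loop above moves on to the next page when a proxy fails, so that page's data is lost. A variation is to retry the same page through the next proxy until one works. This is a minimal sketch of that idea, not ipipgo's own mechanism; `fetch_with_rotation` and `requests_fetch` are hypothetical names, and the download function is passed in so the rotation logic can be tried without a network:

```python
from itertools import cycle
from typing import Callable, Sequence


def fetch_with_rotation(
    url: str,
    proxy_list: Sequence[str],
    fetch: Callable[[str, str], str],
    max_attempts: int = 3,
) -> str:
    """Try `url` through successive proxies until one succeeds."""
    if not proxy_list:
        raise ValueError("proxy_list is empty")
    pool = cycle(proxy_list)
    last_error = None
    for _ in range(max_attempts):
        current_proxy = next(pool)
        try:
            return fetch(url, current_proxy)
        except Exception as exc:  # a real crawler would catch narrower errors
            last_error = exc
            print(f"{current_proxy} failed ({exc}), switching to the next one")
    raise RuntimeError(f"all {max_attempts} attempts failed for {url}") from last_error


def requests_fetch(url: str, proxy: str) -> str:
    """One possible `fetch`: a GET via requests through an HTTP proxy."""
    import requests  # imported here so the rotation logic stays dependency-free

    proxy_url = "http://" + proxy
    resp = requests.get(url, proxies={"http": proxy_url, "https": proxy_url}, timeout=10)
    resp.raise_for_status()
    return resp.text
```

A call would look like `html = fetch_with_rotation(url, proxy_list, requests_fetch)`. Capping the attempts keeps a fully dead pool from looping forever.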
IV. What to watch out for when wearing a vest
1. Don't overdo it: even with a vest, don't hammer the site to death; keep your request rate under control.
2. Make the disguise complete: remember to set a proper User-Agent in the headers, and don't use Python's default!
| Bad practice | Right approach |
|---|---|
| No headers | Headers disguised as Chrome |
| 10 requests per second | Random intervals of 1-3 seconds |
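Both rows of the table can be sketched in a few lines. The User-Agent string below is only an illustrative Chrome-style value, and `polite_pause` is a hypothetical helper; the 1-3 second range comes from the table:

```python
import random
import time

# A Chrome-style User-Agent string; the exact version is illustrative.
CHROME_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}


def polite_pause(low: float = 1.0, high: float = 3.0, sleep=time.sleep) -> float:
    """Sleep for a random interval between `low` and `high` seconds."""
    delay = random.uniform(low, high)
    sleep(delay)  # `sleep` is a parameter so the pause can be stubbed out
    return delay
```

In a crawl loop this would be `requests.get(url, headers=CHROME_HEADERS, proxies=proxies, timeout=10)` followed by `polite_pause()` before the next page.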
V. Q&A for common mishaps
Q: What should I do if my vest suddenly stops working?
A: Eight times out of ten the IP has expired. Use ipipgo's automatic replacement API; their IPs stay alive longer than other providers'.
Q: Why is it even slower when I use a proxy?
A: That's just how free proxies are. We recommend an ipipgo paid package, which comes with dedicated high-speed channels.
Q: Could this get me into trouble with the law?
A: Don't crawl sensitive data, follow the website's robots.txt rules, and read the terms of use when using ipipgo!
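Checking robots.txt doesn't have to be done by eye: Python's standard library ships `urllib.robotparser`. Here is a minimal sketch that checks a URL against rules you have already downloaded. The helper name `allowed_by_robots` and the example rules are invented for illustration:

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check `url` against the rules in an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules, invented for illustration.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""
```

For a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` fetches the file directly instead of calling `parse` on text you already have.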
VI. Vest purchase guide
There are plenty of proxy providers on the market, but many of them are traps:
- They claim to have millions of IPs, but few of them actually work.
- Anonymity is too weak, so your real IP is exposed within minutes.
- Customer service is like a robot, and no one cares when something goes wrong.
ipipgo does a more reliable job here:
1. An exclusive IP pool, so you don't fight over "clothes" with other users.
2. Support for multiple protocols, including HTTPS and SOCKS5.
3. A professional technical team keeps watch, and the IP survival rate can reach 95% or more.
4. A 3-day trial for new users, so you needn't worry about getting burned.
Finally, crawling is great, but don't get greedy. Using a legitimate provider like ipipgo both protects you and avoids putting extra load on the site, which is the sustainable approach. If you're just starting out, begin with their free package, then move on to the advanced features once you've found your footing.

