
What happens when an HTML parser meets a proxy IP?
Recently people keep asking me why they get blocked the moment they use Python to crawl a page. It's like free samples at the supermarket: if you keep hitting the same counter, of course security keeps an eye on you. A proxy IP lets you show up disguised as a different customer each time, so the website can't tell whether you're a "third party" or a "fourth party". With ipipgo's rotating IPs, every request wears a different "vest", and the site has no idea whether you're Zhang San or Li Si.
```python
import requests
from bs4 import BeautifulSoup

# Route both HTTP and HTTPS traffic through the rotating gateway
proxies = {
    'http': 'http://ipipgo-rotating:password@gateway.ipipgo.com:9020',
    'https': 'https://ipipgo-rotating:password@gateway.ipipgo.com:9020'
}

response = requests.get('https://target.com', proxies=proxies, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')
# From here on you can parse the page structure in peace
```
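If you want to see the rotation with your own eyes, point two consecutive requests at an IP-echo service and compare the exit addresses. A quick sketch, reusing the placeholder gateway from above; httpbin.org/ip simply reports the address your request arrived from:

```python
# Sketch: confirm the rotation by asking an echo service for our exit IP.
# Gateway address and credentials are the article's placeholders.
import requests

proxies = {
    'http': 'http://ipipgo-rotating:password@gateway.ipipgo.com:9020',
    'https': 'https://ipipgo-rotating:password@gateway.ipipgo.com:9020'
}

for i in range(2):
    # httpbin.org/ip reports the address the request arrived from
    r = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
    print(f'Request {i + 1} exit IP: {r.json()["origin"]}')
```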
Three Iron Rules for Choosing a Proxy IP
Proxy services on the market are a mixed bag, so remember these three life-saving rules:
1. The IP pool has to be big enough: a pool of 10 million IPs like ipipgo's guarantees a fresh face for every request.
2. Response speed matters: don't let the proxy itself crawl along like a tortoise, or the data will be cold before you finish parsing it (see the latency-check sketch after this list).
3. Protocol support should be complete: both SOCKS5 and HTTPS need to be available so you can switch between different scenarios.
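Before trusting a proxy for a big crawl, it's worth timing one round trip through it. A minimal sketch, assuming the placeholder gateway from the example above and httpbin.org as a neutral echo endpoint:

```python
# Sketch: time one round trip through the proxy before a big crawl.
# Gateway address and credentials are the article's placeholders.
import time
import requests

proxies = {
    'http': 'http://ipipgo-rotating:password@gateway.ipipgo.com:9020',
    'https': 'https://ipipgo-rotating:password@gateway.ipipgo.com:9020'
}

start = time.monotonic()
requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
print(f'Round trip through proxy: {time.monotonic() - start:.2f}s')
```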
| Feature | Generic proxy | ipipgo proxy |
|---|---|---|
| Concurrent requests | Capped at 5 threads | Unlimited |
| IP lifetime | 3 minutes | Customizable |
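The concurrency claim is easy to exercise with a thread pool. A sketch, assuming the same placeholder gateway; the worker count and URLs are illustrative:

```python
# Sketch: fan out requests through the gateway with a thread pool.
# Worker count and URLs are illustrative; the gateway is a placeholder.
from concurrent.futures import ThreadPoolExecutor
import requests

proxies = {
    'http': 'http://ipipgo-rotating:password@gateway.ipipgo.com:9020',
    'https': 'https://ipipgo-rotating:password@gateway.ipipgo.com:9020'
}

def fetch(url):
    # each request leaves through the gateway with a fresh exit IP
    return requests.get(url, proxies=proxies, timeout=10).status_code

urls = ['https://httpbin.org/ip'] * 10
with ThreadPoolExecutor(max_workers=10) as pool:
    print(list(pool.map(fetch, urls)))
```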
A practical guide to avoiding the pitfalls
Three common mistakes newbies make:
① Stubbornly sticking to a single IP address until the website blacklists it
② Not handling SSL certificates, so data parsing fails
③ Forgetting to set the timeout parameter, leaving the program hanging
The right way to pair up with a proxy looks like this:
```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Retry transient connection failures up to three times
session.mount('http://', HTTPAdapter(max_retries=3))
session.mount('https://', HTTPAdapter(max_retries=3))

url = 'https://target.com'
try:
    # (connect timeout, read timeout): the program can never hang forever;
    # proxies is the dict defined in the first example
    response = session.get(url, proxies=proxies, timeout=(3.05, 27))
except requests.exceptions.ProxyError:
    # Automatically switch to an ipipgo backup node (sketch below)
    proxies = switch_to_backup_node()
```
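The snippet leaves switch_to_backup_node() undefined, so here is one possible sketch: cycle through a list of fallback gateways whenever the current one errors out. The hostnames are invented for illustration, not real ipipgo endpoints:

```python
# Hypothetical helper: rotate through fallback gateways on ProxyError.
# The hostnames below are invented placeholders, not real ipipgo endpoints.
from itertools import cycle

BACKUP_NODES = cycle([
    'http://user:password@backup1.ipipgo.com:9020',
    'http://user:password@backup2.ipipgo.com:9020',
])

def switch_to_backup_node():
    """Return a proxies dict pointing at the next backup gateway."""
    node = next(BACKUP_NODES)
    return {'http': node, 'https': node}
```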
Q&A session
Q: What should I do if the proxy IP frequently fails to connect?
A: Eighty percent of the time it's a junk proxy. Switch to ipipgo's enterprise-grade lines; their self-developed intelligent routing system automatically steers around congested nodes!
Q: What if I need to parse several websites at the same time?
A: Open multiple Session objects and assign each one an ipipgo node in a different region. For example:
```python
site1_proxy = {'https': 'http://fr-node.ipipgo.com:443'}
site2_proxy = {'https': 'http://us-node.ipipgo.com:443'}
```
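Fleshing that out, a sketch with one Session per site, each pinned to its own regional node; the scheme, credentials, and target URLs are assumptions added for completeness:

```python
# Sketch: one Session per site, each pinned to its own regional node.
# Scheme and target URLs are assumptions added for completeness.
import requests

site1 = requests.Session()
site1.proxies = {'https': 'http://fr-node.ipipgo.com:443'}

site2 = requests.Session()
site2.proxies = {'https': 'http://us-node.ipipgo.com:443'}

# Each session keeps its own proxy, cookies, and connection pool
r1 = site1.get('https://site-one.example', timeout=10)
r2 = site2.get('https://site-two.example', timeout=10)
```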
Q: Why does my scraper suddenly stall halfway through parsing?
A: Eighty percent of the time you've tripped the site's verification mechanism. That's where ipipgo's browser-fingerprint camouflage feature comes in; combined with a proxy IP it works even better!
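Fingerprint camouflage is an ipipgo product feature, so there is nothing to show for it here, but the cheapest complement you can add yourself is sending browser-like headers. A minimal sketch; the header values are ordinary illustrative choices, not anything ipipgo-specific:

```python
# Not ipipgo's fingerprint feature, just the cheapest DIY complement:
# browser-like headers so the simplest bot checks don't trip at once.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/120.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}
proxies = {'https': 'https://ipipgo-rotating:password@gateway.ipipgo.com:9020'}
response = requests.get('https://target.com', headers=headers,
                        proxies=proxies, timeout=10)
```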
A few words from the heart
Web parsing is like playing hide-and-seek, and a proxy IP is your invisibility cloak. But don't be cheap and grab free proxies; those things are like torn trousers, exposing exactly what shouldn't be exposed. ipipgo recently shipped a dynamic port-mapping feature; paired with their API it can switch IPs in milliseconds. Try it and you'll see.
Finally, a reminder for everyone: control your request frequency when parsing. Even the best proxy can't save you if you hammer a site hundreds of times per second; that's like pouring two bottles of baijiu down the web server's throat, and it would be strange if it didn't get drunk! Use the tools sensibly and the data keeps flowing, right?
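A minimal throttle sketch to close with, assuming a fixed delay suits your target; the one-second pause and the page URLs are illustrative:

```python
# Sketch: fixed-delay throttle between requests; tune to the target site.
# The one-second pause and page URLs are illustrative values.
import time
import requests

proxies = {
    'http': 'http://ipipgo-rotating:password@gateway.ipipgo.com:9020',
    'https': 'https://ipipgo-rotating:password@gateway.ipipgo.com:9020'
}

for page in range(1, 4):
    response = requests.get(f'https://target.com/page/{page}',
                            proxies=proxies, timeout=(3.05, 27))
    # ... parse the response here ...
    time.sleep(1.0)  # breathing room so the server isn't hammered
```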

