
Teaching You to Use Proxy IPs to Scrape Web Page Data
Recently, a lot of friends have asked Lao Zhang: when scraping web pages with Python, I keep hitting 403 errors, what should I do? It's just like going to the market to buy groceries: if you visit the same stall every day, the stall owner is bound to recognize you. Web servers work the same way; if they notice you visiting too frequently, they block you outright. This is when our proxy IP prodigy comes to the rescue.
Why do we need to put a disguise on the crawler?
Here's a real case: Xiao Wang was scraping data from a weather website and got his IP blocked after only 200 pages. He then switched to ipipgo's dynamic residential proxies, which gave every request an IP address from a different region. The server couldn't tell real visitors from the crawler, and the data came in smoothly.
import requests
from bs4 import BeautifulSoup

# Route every request through the ipipgo gateway.
proxies = {
    'http': 'http://user:pass@gateway.ipipgo.com:9020',
    'https': 'http://user:pass@gateway.ipipgo.com:9020'
}
# '目标网站.com' is a placeholder for your target site.
response = requests.get('https://目标网站.com', proxies=proxies)
soup = BeautifulSoup(response.text, 'html.parser')
# ... your parsing code goes here ...
What are the tricks to choosing a proxy IP?
Proxy service providers on the market are a mixed bag. Lao Zhang recommends ipipgo mainly for three reasons:
1. Real residential IPs: unlike data-center IPs, which are easily recognized
2. Automatic rotation: the IP changes automatically on every request, nothing to manage by hand
3. Protocol support: HTTP, HTTPS, and SOCKS5 all work (see the SOCKS5 sketch below)
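Point 3 is handy because switching protocols only means changing the URL scheme. Here is a minimal sketch of SOCKS5 usage with requests, reusing the gateway host from the example above; the port 9030 and the credentials are made-up placeholders, and you need the PySocks extra installed:

import requests  # SOCKS support requires: pip install requests[socks]

# Same shape as the HTTP example above, only the URL scheme changes.
# Port 9030 is a hypothetical placeholder; check your ipipgo dashboard.
proxies = {
    'http': 'socks5://user:pass@gateway.ipipgo.com:9030',
    'https': 'socks5://user:pass@gateway.ipipgo.com:9030'
}
response = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=5)
print(response.json())  # prints the exit IP the target server sees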
A practical guide to avoiding pitfalls
A common mistake newbies make is misconfiguring the proxy. Here is a universal template:
import requests
from itertools import cycle
# Proxy pool from ipipgo
proxy_list = [
    "gateway.ipipgo.com:8001",
    "gateway.ipipgo.com:8002",
    "gateway.ipipgo.com:8003"
]
proxy_pool = cycle(proxy_list)

for page in range(1, 100):
    # Rotate to the next proxy on every page.
    current_proxy = next(proxy_pool)
    try:
        response = requests.get(
            url=f"https://目标网站.com/page/{page}",
            proxies={
                "http": f"http://{current_proxy}",
                "https": f"http://{current_proxy}"
            },
            timeout=5
        )
        # ... parsing code goes here ...
    except requests.exceptions.RequestException:
        print(f"{current_proxy} failed, automatically switching to the next one.")
Frequently Asked Questions (QA)
Q: What should I do if I use a proxy and still get blocked?
A: Check two things: 1. whether you set a User-Agent request header; 2. whether your access frequency is too high. Adding time.sleep(2) to your code is recommended.
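Here is a minimal sketch of both fixes combined: a browser-style User-Agent plus a 2-second pause. The UA string is just a common desktop Chrome example, and the gateway address reuses the placeholder from earlier:

import time
import requests

headers = {
    # Pretend to be a normal desktop browser.
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/120.0.0.0 Safari/537.36')
}
proxies = {
    'http': 'http://user:pass@gateway.ipipgo.com:9020',
    'https': 'http://user:pass@gateway.ipipgo.com:9020'
}
for page in range(1, 10):
    response = requests.get(f'https://目标网站.com/page/{page}',
                            headers=headers, proxies=proxies, timeout=5)
    time.sleep(2)  # slow down so the frequency looks human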
Q: Which ipipgo plan is the most cost-effective?
A: For crawlers, choose the dynamic residential IP package; new users get a 3-day trial. Enterprise users, remember to choose a dedicated IP pool so you don't collide with other users' traffic!
Q: Why can't I scrape data from HTTPS websites?
A: Configure both the http and https proxy addresses in your requests call; many people set only one.
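To make the difference concrete, here is the wrong shape next to the right one (same placeholder credentials as above):

# Wrong: only 'http' is set, so https:// URLs bypass the proxy entirely.
proxies = {'http': 'http://user:pass@gateway.ipipgo.com:9020'}

# Right: both schemes go through the proxy.
proxies = {
    'http': 'http://user:pass@gateway.ipipgo.com:9020',
    'https': 'http://user:pass@gateway.ipipgo.com:9020'
}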
Advanced tips
When you run into websites with strong anti-scraping measures, you can pair the proxy with Selenium:
from selenium import webdriver

options = webdriver.ChromeOptions()
# Route the whole browser through the ipipgo gateway.
options.add_argument('--proxy-server=http://gateway.ipipgo.com:9020')
driver = webdriver.Chrome(options=options)
driver.get("https://目标网站.com")
# Then parse driver.page_source with BeautifulSoup.
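And to finish that last comment off, a minimal sketch of the parsing step; the title tag is just a stand-in for whatever fields you actually want:

from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title.text)  # stand-in: extract the fields you need
driver.quit()  # release the browser when done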
One last nagging word: choosing a proxy IP is like looking for a partner, you want a reliable one. Lao Zhang has used ipipgo for half a year with stability above 90%. Their intelligent routing feature in particular automatically matches the fastest node, far less hassle than switching manually. And remember, never use free proxies: at best your data leaks, at worst your accounts get stolen, and the loss isn't worth it!

