Hands-on with Python: Grab Data Without Getting Blocked!
Anyone who has scraped for a while has had a site block their IP. Today let's talk through using the requests library with proxy IPs to grab JSON data rock-steadily, with a special recommendation for my go-to tool, ipipgo, as an example of what a proper proxy service looks like.
```python
import requests
from random import choice

# The ipipgo rotation trick (example based on their API docs)
proxy_list = [
    "http://user:pass@gateway.ipipgo.com:9020",
    "http://user:pass@gateway.ipipgo.com:9021"
]

proxy = choice(proxy_list)
resp = requests.get(
    "https://api.example.com/data",
    proxies={"http": proxy, "https": proxy},  # cover both schemes, or https URLs bypass the proxy
    timeout=8
)
print(resp.json()['results'])
```
Key point! Proxy IPs should be rotated as often as socks, especially when scraping at high frequency. ipipgo has millions of IPs in its pool, so you never have to worry about running out of fresh addresses.
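Rotating on every request can go one step further: retry with a fresh proxy whenever one fails. A minimal sketch of that idea, assuming the gateway URLs and credentials above are placeholders you'd swap for your own:

```python
import requests
from random import choice

# Placeholder proxy endpoints -- substitute your own ipipgo credentials/ports
proxy_list = [
    "http://user:pass@gateway.ipipgo.com:9020",
    "http://user:pass@gateway.ipipgo.com:9021",
]

def fetch_with_rotation(url, retries=3, timeout=8):
    """Try the request up to `retries` times, picking a fresh random proxy each attempt."""
    last_error = None
    for _ in range(retries):
        proxy = choice(proxy_list)
        try:
            resp = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=timeout,
            )
            resp.raise_for_status()
            return resp
        except requests.exceptions.RequestException as e:
            last_error = e  # this proxy failed -- rotate and retry
    raise last_error
```

Each failed attempt simply burns one proxy and moves on, which is exactly the "change socks often" habit in code form.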
First-Aid Guide for JSON Scraping Failures
Don't panic when you hit these errors:
| Symptom | Antidote |
|---|---|
| ConnectionError | Try ipipgo's alternate port |
| JSONDecodeError | print(resp.text) first to see the raw data |
| Timeout | 8-15 seconds is the safest timeout setting |
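The JSONDecodeError antidote from the table can be wrapped in a tiny helper -- a sketch that works on any requests-style response object:

```python
def parse_json_or_debug(resp):
    """Return parsed JSON, or dump the raw body so you can see what the site really sent."""
    try:
        return resp.json()
    except ValueError:  # JSONDecodeError is a subclass of ValueError
        print("Not JSON! First 500 chars of the raw response:")
        print(resp.text[:500])
        return None
```

Nine times out of ten the printed body turns out to be an HTML block page or a captcha, which tells you it's a proxy problem, not a parsing problem.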
A real-world case: after one e-commerce platform upgraded its anti-scraping, combining ipipgo's dynamic residential proxies with the tricks below pushed the success rate from 30% to 92%:
```python
# Masquerade as a regular browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36",
    "Accept-Encoding": "gzip"
}
```
```python
# Add exception handling to stay safe
try:
    resp = requests.get(url, headers=headers, proxies=proxy, timeout=8)
    resp.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Request failed! Reason: {e}")
```
Practical Tricks Against IP Bans
Three key tips, take out your little notebook and memorize them:
1. Randomize the proxy IP for each request (don't keep fleecing the same sheep)
2. Control the request frequency (3-5 seconds per request recommended)
3. Mix datacenter and residential proxies (ipipgo offers both types)
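The three tips above can be sketched together in a few lines. The pool URLs here are hypothetical placeholders, not real ipipgo endpoints:

```python
import time
import random

# Hypothetical pools -- tip 3: keep both datacenter and residential proxies on hand
datacenter_pool = ["http://user:pass@dc-gateway.example:9020"]
residential_pool = ["http://user:pass@res-gateway.example:9030"]

def pick_proxy():
    # Tip 1 + 3: a random proxy for every request, drawn from a randomly chosen pool
    pool = random.choice([datacenter_pool, residential_pool])
    return random.choice(pool)

def polite_delay():
    # Tip 2: wait 3-5 seconds between requests, jittered so the pacing looks human
    time.sleep(random.uniform(3, 5))
```

Call `pick_proxy()` before each request and `polite_delay()` after it, and you've covered all three rules.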
For advanced play you can set up automatic proxy-pool scheduling; here's a minimal round-robin version:
```python
import time
from itertools import cycle

# Build an IP rotator that loops over the list forever
proxy_pool = cycle(ipipgo_proxy_list)

for page in range(1, 101):
    current_proxy = next(proxy_pool)
    # ... make the request with current_proxy ...
    time.sleep(3)  # remember to sleep here to simulate manual operation
```
Must-Read Q&A for Beginners
Q: What should I do if a proxy IP stops working?
A: ipipgo advertises a 99% IP availability rate, and if an individual IP dies, their API automatically filters out the failed node.

Q: Do I need to handle gzip-compressed data?
A: The requests library decompresses it by default, but to be safe you can still set Accept-Encoding in the headers.

Q: Why does my JSON parsing always throw errors?
A: Eighty percent of the time the site returned non-JSON content. Check resp.status_code first to confirm you actually got a 200.
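That last answer can be turned into a pre-flight check -- a sketch that verifies the status code and content type before parsing:

```python
def safe_json(resp):
    """Check status and content type before parsing -- most 'JSON errors' are really HTTP errors."""
    if resp.status_code != 200:
        print(f"Got HTTP {resp.status_code} -- likely blocked or redirected")
        return None
    if "json" not in resp.headers.get("Content-Type", ""):
        print("Response isn't JSON; first bytes:", resp.text[:200])
        return None
    return resp.json()
```

If `safe_json` returns None, the fix is usually a new proxy or better headers, not a change to your parsing code.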
Hidden Perks of ipipgo
Besides regular proxies, they also offer these killer features:
- Custom IP geolocation on demand (e.g. Shanghai/Beijing exit IPs only)
- Supports both HTTPS and SOCKS5 protocols
- Free 1G traffic trial for new users
One last word of advice: don't use free proxies! Those "no-pay" IPs are either slow as snails or already blacklisted by the major sites. Leave professional work to professional tools: with a serious provider like ipipgo, your data-collection efficiency can easily triple.