
Dude, is your proxy IP reliable or not?
Lao Zhang, a crawler developer, has had a splitting headache lately: the thousands of proxy IPs in his hands are like opening blind boxes. The script ran fine yesterday, then today the whole batch went on strike, and he was so angry he pounded the table. I know this feeling all too well. Batch-verifying proxy IP survival is exactly what every data collection team needs.
Manual testing? Stop it!
At first I foolishly tested by hand too, opening a browser and entering proxies one at a time. Then I realized this is no job for a human: after testing 200 IPs, my eyes were completely glazed over. Worse, some IPs look connectable but in practice either time out or drop packets like crazy.
| Test method | Time spent | Accuracy |
|---|---|---|
| Manual testing | ~3 hours per 100 IPs | around 60% |
| Batch script | ~5 minutes per 1,000 IPs | 95% and above |
Write your own detector
Here's a real-world Python example that uses the requests library plus multithreading to get the whole job done. Pay attention to the comments — every one is a pothole I've stepped in!
```python
import concurrent.futures
import requests

# For the target site, it's best to test against your own business domain
TEST_URL = "http://www.baidu.com"
TIMEOUT = 5

def check_proxy(proxy):
    try:
        resp = requests.get(
            TEST_URL,
            proxies={'http': f'http://{proxy}', 'https': f'http://{proxy}'},
            timeout=TIMEOUT,
        )
        return proxy if resp.status_code == 200 else None
    except requests.RequestException:
        return None

# Read the IP list from a file
with open('proxy_list.txt') as f:
    proxies = f.read().splitlines()

# Spin up a pool of 20 threads
with concurrent.futures.ThreadPoolExecutor(20) as executor:
    results = executor.map(check_proxy, proxies)

# Sift out the valid IPs
valid_ips = [ip for ip in results if ip]
print(f"Surviving IPs: {len(valid_ips)}")
```
Note the hidden pitfall here: don't rely only on a third-party site for testing — some sites will block high-frequency requests. Use your own business-related domains instead; for example, if you do e-commerce, test against Jingdong or Taobao.
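If you do point the test at your own domain, a minimal sketch looks like this — the domain, health-check path, and User-Agent string are all placeholders, and the per-thread pause is just a crude way to avoid looking like a request flood:

```python
import time
import requests

# Placeholder: swap in your own business domain
TEST_URL = "https://www.example-shop.com/health"
HEADERS = {"User-Agent": "Mozilla/5.0 (availability check)"}

def gentle_check(proxy):
    """Probe one proxy against our own domain without hammering it."""
    try:
        resp = requests.get(
            TEST_URL,
            headers=HEADERS,
            proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
            timeout=5,
        )
        return resp.status_code == 200
    except requests.RequestException:
        return False
    finally:
        time.sleep(0.2)  # crude per-thread pacing so the target isn't flooded
```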
For a worry-free solution, look to professional services
As fun as it is to roll your own scripts, there are a few situations that will leave you scratching your head:
- Your IP library holds 100,000 addresses and the server can't keep up
- You need to measure advanced metrics such as latency and geolocation (for a rough DIY latency number, see the sketch after this list)
- You need 24-hour continuous monitoring
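If all you want is a ballpark latency figure from the script side, here's a minimal sketch using time.perf_counter — this measures the full round trip through the proxy, not the fine-grained numbers a professional service reports:

```python
import time
import requests

def measure_latency(proxy, url="http://www.baidu.com", timeout=5):
    """Return round-trip latency in milliseconds, or None if the proxy fails."""
    start = time.perf_counter()
    try:
        resp = requests.get(
            url,
            proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
            timeout=timeout,
        )
        if resp.status_code == 200:
            return round((time.perf_counter() - start) * 1000)
    except requests.RequestException:
        pass
    return None

# e.g. measure_latency("123.60.88.99:8080") -> 356
```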
When you hit these cases, going straight to ipipgo's API detection service is the real deal. Their interface returns this key data:
```json
{
  "ip": "123.60.88.99",
  "port": 8080,
  "speed": "356ms",
  "expire_time": "2024-06-30"
}
```
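How you actually call the interface depends on their docs; purely as a sketch, consuming a response shaped like the sample above might look like this — the endpoint URL and key parameter are hypothetical stand-ins:

```python
import requests

# Hypothetical endpoint and auth parameter -- check the provider's real docs
API_URL = "https://api.example.com/proxy/check"

def fetch_report(ip, port, api_key):
    resp = requests.get(
        API_URL,
        params={"ip": ip, "port": port, "key": api_key},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    # Pull out the fields shown in the sample response above
    return {
        "proxy": f'{data["ip"]}:{data["port"]}',
        "speed": data["speed"],
        "expires": data["expire_time"],
    }
```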
Q&A time (questions veterans ask all the time)
Q: My detection script runs too slowly. What can I do?
A: Don't be greedy with the thread count! Keep it under 50, or you can easily choke your local network. If you really need to chew through big batches, use ipipgo's asynchronous detection interface: 100,000 IPs in half an hour.
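If you'd rather stay DIY a while longer, switching from threads to asyncio gets you much higher concurrency before the network chokes. A minimal sketch with aiohttp (note that aiohttp only talks to HTTP proxies; the semaphore caps concurrent connections):

```python
import asyncio
import aiohttp

TEST_URL = "http://www.baidu.com"

async def check(session, sem, proxy):
    async with sem:  # cap the number of in-flight requests
        try:
            async with session.get(
                TEST_URL,
                proxy=f"http://{proxy}",
                timeout=aiohttp.ClientTimeout(total=5),
            ) as resp:
                return proxy if resp.status == 200 else None
        except Exception:
            return None

async def main(proxies):
    sem = asyncio.Semaphore(200)
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(check(session, sem, p) for p in proxies))
    return [p for p in results if p]

# valid = asyncio.run(main(open('proxy_list.txt').read().splitlines()))
```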
Q: Where can I get reliable proxy IPs?
A: You've got to go with my trusted ipipgo. Their IP pool refreshes 20% of its addresses daily, and they offer a dedicated detection IP package that's especially suited to scenarios requiring high-frequency verification.
Q: Why does my HTTPS proxy detection always fail?
A: 80% of the time it's a certificate verification issue. You can add the verify=False parameter to the requests call, but that's not safe. Better to use ipipgo's ready-made detection interface directly and save yourself the trouble.
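For reference, the unsafe workaround looks like this (warning suppressed so the log stays readable) — acceptable for a quick liveness probe, never for real traffic; the proxy address here is the sample one from above:

```python
import requests
import urllib3

# Skip certificate verification -- liveness probing only, never for real traffic
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

resp = requests.get(
    "https://www.baidu.com",
    proxies={"https": "http://123.60.88.99:8080"},
    verify=False,
    timeout=5,
)
print(resp.status_code)
```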
One final heartfelt word: don't waste your time on junk proxies. Instead of pouring all that energy into scripts, why not get a batch of quality IPs? Providers like ipipgo that offer real-time availability reporting are the true productivity tools.

