
First, why do crawlers need a proxy IP?
Anyone who does data collection knows that websites now watch for crawlers like hawks. Last week I tried a crawl from the company intranet without a proxy; within 5 minutes the IP was blocked, the whole department lost connectivity for half an hour, and I was nearly called in for a talk with management.
This is where a proxy IP becomes your invisibility cloak. Think of sampling food at a supermarket: if you always use the same plate (a fixed IP), the staff will stop you; if you switch plates every time (proxy IPs), nobody recognizes you. The ipipgo proxy pool is large, with 5 million+ dynamic residential IPs nationwide, so you can rotate IPs more often than you change socks.
Second, the correct way to use a proxy with Jsoup
A lot of tutorials teach people to set the proxy with System.setProperty. That's amateur hour! The reliable approach is to attach the proxy directly on the Connection object. Look at this code:
// Note: this assumes the ipipgo SDK is on your classpath
import com.ipipgo.proxy.*;
...
Document doc = Jsoup.connect("destination URL")
    .proxy(ipipgo.getProxy()) // The key is in this line! Fetch a proxy dynamically
    .timeout(30000)
    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
    .get();
Important enough to say three times: Don't use free proxies! Don't use free proxies! Don't use free proxies! I once used a cheap, shady proxy and the "data" it crawled was all advertisements; the client nearly sued me. ipipgo's dedicated proxy lines have dedicated maintenance and keep response times under 200ms.
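If you're not using a vendor SDK, note that Jsoup's Connection also accepts a plain java.net.Proxy via its proxy(java.net.Proxy) overload. Here's a minimal sketch of building one; the host and port are made-up placeholders, not real ipipgo endpoints:

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

public class ProxyDemo {
    // Build an HTTP proxy object; createUnresolved skips the DNS lookup
    // at construction time, which suits proxies fetched at runtime.
    public static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) {
        Proxy p = httpProxy("proxy.example.com", 8080);
        System.out.println("type=" + p.type() + " address=" + p.address());
        // With Jsoup this plugs straight in:
        //   Document doc = Jsoup.connect(url).proxy(p).get();
    }
}
```

Either way works; passing the object explicitly keeps the proxy scoped to one connection instead of leaking into the whole JVM the way System.setProperty does.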
Third, tricks from real-world crawling
When you hit a site that's hard to crawl, here's a trick: rotate IP + UA + Cookie as a three-piece set. Here's a real case:
| Tactic | Result | Recommended ipipgo configuration |
|---|---|---|
| Continuous access from a single IP | Banned within 10 minutes | Enable automatic IP switching |
| IP + browser fingerprint rotation | Survives about 2 hours | Bind static residential IPs |
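The three-piece rotation above can be sketched as a simple round-robin over identity pools. The pool contents below are made-up placeholders; in practice the proxy list would come from the ipipgo API rather than being hard-coded:

```java
import java.util.List;

public class RotationDemo {
    // Placeholder pools; real entries would be fetched at runtime.
    static final List<String> PROXIES = List.of("1.2.3.4:8000", "5.6.7.8:8000", "9.9.9.9:8000");
    static final List<String> AGENTS = List.of(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)");

    // Pick a pool entry round-robin by request number, so consecutive
    // requests never present the same identity.
    public static String pick(List<String> pool, int requestNo) {
        return pool.get(requestNo % pool.size());
    }

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) {
            System.out.println("req " + i + ": proxy=" + pick(PROXIES, i)
                    + " ua=" + pick(AGENTS, i));
        }
    }
}
```

Cookies rotate the same way: keep one cookie jar per proxy/UA pair so the site sees consistent-looking, independent visitors.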
The last time I crawled price data from an e-commerce site, I used ipipgo's Intelligent Routing feature, which automatically matches the IP to the target server's region, and collection speed roughly doubled. One pitfall to note: don't hard-code the proxy address in your code. Fetch it dynamically through their API so a failed IP gets replaced automatically.
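The "don't hard-code the proxy" advice can be sketched as a small local queue that you refill from the (hypothetical) vendor API, rotating a failed proxy to the back so the next request automatically gets a fresh one:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class FailoverPool {
    private final Deque<String> pool = new ArrayDeque<>();

    // In practice these entries would come from an API call, not varargs.
    public FailoverPool(String... proxies) {
        for (String p : proxies) pool.addLast(p);
    }

    public String current() {
        return pool.peekFirst();
    }

    // Called when a request through the current proxy fails:
    // demote it to the back of the queue and return the next one.
    public String markFailedAndNext() {
        pool.addLast(pool.pollFirst());
        return pool.peekFirst();
    }

    public static void main(String[] args) {
        FailoverPool fp = new FailoverPool("ip-1:8000", "ip-2:8000");
        System.out.println("using " + fp.current());
        System.out.println("after failure, using " + fp.markFailedAndNext());
    }
}
```

A real version would also evict proxies that fail repeatedly and top the queue up from the API, but the rotate-on-failure shape stays the same.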
Fourth, Q&A on common failure scenarios
Q: What should I do if the proxy suddenly fails to connect?
A: First hit ipipgo's ping interface to check the line; if it returns code 502, switch to a backup line immediately. Their console has real-time monitoring, which is more reliable than hand-rolling your own retry mechanism.
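The check-then-fall-back logic is trivial but easy to get wrong under pressure, so here is a minimal sketch. The line names are placeholders and ipipgo's real ping interface is not modeled here, only the decision it feeds:

```java
public class LinePicker {
    // Given the status code from a health probe, choose which line to use.
    public static String pickLine(int pingStatus, String primary, String backup) {
        return pingStatus == 502 ? backup : primary;
    }

    public static void main(String[] args) {
        System.out.println(pickLine(200, "line-A", "line-B")); // healthy: stay on line-A
        System.out.println(pickLine(502, "line-A", "line-B")); // 502: fall back to line-B
    }
}
```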
Q: What if I keep getting hit with CAPTCHAs?
A: Don't try to brute-force through! Drop the request rate to 1 request per 5 seconds and turn on ipipgo's high-anonymity mode. Tested and proven: last week I crawled 100,000 records this way without triggering a single CAPTCHA.
Q: How can I tell if the proxy is really in effect?
A: Add a log output to the code:
System.out.println("Currently using proxy: " + ipipgo.getCurrentProxy());
Fifth, some honest words
I've used seven or eight proxy services, and I've stuck with ipipgo for three reasons: first, their support is genuinely responsive; second, the IP pool is big enough that nationwide data collection can be targeted down to the district and county level; third, billing is flexible: our small team's metered package runs only about a hundred dollars a month.
A final reminder for newcomers: don't pinch pennies on proxies. The cost of cleaning dirty data far outweighs the proxy fees. I recently watched someone crawl with a free proxy, and only after loading the results into the database did he discover that 30% of it was garbled text. By then it was too late for tears.

