
Don't let IP blocking ruin your crawler work!
Lately, a lot of friends who do data scraping have been complaining to me: the crawler they worked hard to write gets its IP blocked after less than two days of running. I know this pain all too well. Last year I was doing e-commerce price monitoring and got blacklisted by the target site three days in a row, which made me so angry I nearly smashed my keyboard. Later I found that proxy IPs are the real lifesaver, so today I'll share a few tips from my C# development experience.
The essential two-piece kit for C# crawlers
Before scraping web pages, pick your tools first. I recommend two old reliables:
// Use this to handle HTTP requests
using System.Net.Http;
// Use this to parse HTML
using HtmlAgilityPack;
These two make a very efficient pair; HtmlAgilityPack's XPath parsing in particular is ten times less hassle than regular expressions. But having the tools is not enough: you also have to learn camouflage tactics.
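As a rough illustration of the XPath parsing mentioned above, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is installed; the class name `PriceParser`, the sample HTML, and the `price` class attribute are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class PriceParser
{
    // Returns the trimmed text of every node matched by the XPath expression.
    public static List<string> ExtractTexts(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null) return results;

        foreach (var node in nodes)
        {
            // Decode HTML entities such as &amp; before trimming whitespace.
            results.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return results;
    }
}
```

Calling `PriceParser.ExtractTexts(html, "//li[@class='price']")` on a page whose prices sit in `<li class="price">` elements would return one string per price. The same job with regular expressions tends to break as soon as the markup changes slightly.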
Three Life-Saving Scenarios for Proxy IPs
| Scenario | Symptom | Remedy |
|---|---|---|
| High-frequency visits | Triggers the site's risk control | Rotate IPs to spread out requests |
| Geographic restrictions | Returns a 403 error | Switch to a node in another region |
| Account association | Login anomaly detection | Bind each account to a fixed IP |
Last week I helped a friend scrape a job site using ipipgo's dynamic residential proxies, which change IP automatically every hour. We pushed scraping throughput to three times the previous rate and still didn't get blocked.
Hands-on: Putting an invisibility cloak on HttpClient
Straight to the code. Here's how to load the ipipgo proxy into a crawler:
using System;
using System.Net;
using System.Net.Http;
using System.Threading;

// Route all requests through the proxy gateway
var handler = new HttpClientHandler
{
    Proxy = new WebProxy("gateway.ipipgo.com:8000"),
    UseProxy = true
};
var client = new HttpClient(handler);
client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0)");
// It's safer to set a timeout: cancel the request after 15 seconds
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(15));
// "目标网站" means "target website"; replace this placeholder with the real URL
var response = await client.GetAsync("https://目标网站.com", cts.Token);
Remember to put the account and password you applied for in the ipipgo console into the WebProxy. It's also recommended to use their API to fetch the proxy address dynamically, so the IP pool updates automatically.
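One way to attach the account and password to the proxy is through `WebProxy.Credentials`. The sketch below assumes the gateway accepts standard username/password proxy authentication; the class name `ProxyClientFactory` is made up for illustration, and where the credentials come from is up to you.

```csharp
using System;
using System.Net;
using System.Net.Http;

public static class ProxyClientFactory
{
    // Builds a handler that sends every request through an authenticated proxy.
    public static HttpClientHandler CreateHandler(string proxyAddress, string user, string password)
    {
        var proxy = new WebProxy(proxyAddress)
        {
            // Sent to the proxy when it asks for authentication.
            Credentials = new NetworkCredential(user, password)
        };
        return new HttpClientHandler { Proxy = proxy, UseProxy = true };
    }
}
```

Usage would look like `new HttpClient(ProxyClientFactory.CreateHandler("http://gateway.ipipgo.com:8000", user, password))`. Keeping the credentials out of source code, for example in environment variables, is usually the safer choice.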
Real Case: E-commerce Price Monitoring System
A price comparison system built for a supermarket chain last year ran into three problems:
- Every request was identified as coming from a crawler
- Servers had to be swapped manually whenever an IP was blocked
- Prices differed from region to region
The final solution:
1. Use ipipgo's high-anonymity residential proxies
2. Switch IPs automatically every 50 requests
3. Collect through nodes in different cities
As a result, average daily crawl volume jumped from 50,000 to 800,000, and the ops guy no longer had to get up in the middle of the night to change servers.
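The "switch every 50 requests" step can be sketched as a small counter around a list of proxy addresses. This is only an illustration of the idea, not the system's actual code; `ProxyRotator` and its members are hypothetical names, and the proxy list is assumed to come from your provider.

```csharp
using System;
using System.Collections.Generic;

public sealed class ProxyRotator
{
    private readonly IReadOnlyList<string> _proxies;
    private readonly int _requestsPerProxy;
    private readonly object _gate = new object();
    private int _index; // position of the proxy currently in use
    private int _used;  // requests already served by that proxy

    public ProxyRotator(IReadOnlyList<string> proxies, int requestsPerProxy = 50)
    {
        if (proxies == null || proxies.Count == 0)
            throw new ArgumentException("At least one proxy is required.", nameof(proxies));
        if (requestsPerProxy <= 0)
            throw new ArgumentOutOfRangeException(nameof(requestsPerProxy));
        _proxies = proxies;
        _requestsPerProxy = requestsPerProxy;
    }

    // Call once per outgoing request; returns the proxy that request should use.
    public string Next()
    {
        lock (_gate)
        {
            // Once the quota is used up, move to the next proxy (wrapping around).
            if (_used == _requestsPerProxy)
            {
                _index = (_index + 1) % _proxies.Count;
                _used = 0;
            }
            _used++;
            return _proxies[_index];
        }
    }
}
```

Note that `HttpClientHandler.Proxy` cannot be changed after the handler has sent its first request, so in practice you would keep one `HttpClient` per proxy address and pick the client that matches the value returned by `Next()`.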
A Minesweeping Guide to Common Problems
Q: What can I do if the proxy IP is too slow?
A: Go with ipipgo's exclusive bandwidth packages. Download speed reaches up to 3 MB/s, which is faster than shared proxies.
Q: How do I change the proxy IP automatically?
A: Add a timer in your code and call ipipgo's API to get a new address. Their response format is dead simple; just parse the JSON directly.
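A rough sketch of the timer-plus-JSON idea follows. The payload shape `{"ip": "...", "port": ...}` is purely an assumption for illustration, since the real response format is not shown here, and `ProxyRefresher`, `apiUrl`, and `onNewProxy` are hypothetical names. Check the provider's API documentation for the actual fields.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

public static class ProxyRefresher
{
    // Assumed (hypothetical) payload: {"ip":"203.0.113.10","port":8000}
    public static WebProxy Parse(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var root = doc.RootElement;
        string ip = root.GetProperty("ip").GetString()
            ?? throw new FormatException("The ip field is null.");
        int port = root.GetProperty("port").GetInt32();
        return new WebProxy(ip, port);
    }

    // Polls the API at a fixed interval and hands each new proxy to the caller.
    // Stops by throwing OperationCanceledException when the token is cancelled.
    public static async Task RefreshLoopAsync(
        HttpClient api, string apiUrl, Action<WebProxy> onNewProxy,
        TimeSpan interval, CancellationToken ct)
    {
        while (true)
        {
            string json = await api.GetStringAsync(apiUrl, ct);
            onNewProxy(Parse(json));
            await Task.Delay(interval, ct);
        }
    }
}
```

The callback is where you would swap in a fresh `HttpClient` built around the new proxy, since an existing handler's proxy cannot be replaced once it has been used.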
Q: What should I do if I encounter an SSL certificate error?
A: Add this to the HttpClientHandler:
ServerCertificateCustomValidationCallback = (msg, cert, chain, errors) => true
However, this turns off certificate validation entirely, so be aware of the security risk. It's best to use it together with ipipgo's HTTPS proxy.
Five Principles for Avoiding Blocks
- Don't send requests at regular intervals (sleep a random 0.5-3 seconds between them)
- Keep several User-Agent strings in rotation
- For important projects, use ipipgo's long-lived static proxies
- Handle the site's anti-crawling cookies promptly
- Reduce collection frequency at night
One last piece of advice: don't skimp on proxies for your crawler. I used to go cheap with free proxies; eight out of ten didn't work, and I kept losing data. Since switching to ipipgo's enterprise edition, a million requests a day runs rock steady. Totally worth it!
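The first two principles (random pauses and User-Agent rotation) can be sketched roughly as below. The class name `Politeness` and the User-Agent strings are illustrative placeholders; in a real crawler you would maintain your own list.

```csharp
using System;
using System.Net.Http;

public static class Politeness
{
    // Illustrative values only; use realistic, up-to-date strings in practice.
    private static readonly string[] UserAgents =
    {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        "Mozilla/5.0 (X11; Linux x86_64)"
    };

    // Random pause between 0.5 and 3 seconds (upper bound of Next is exclusive).
    public static TimeSpan NextDelay(Random rng) =>
        TimeSpan.FromMilliseconds(rng.Next(500, 3001));

    // Picks one of the configured User-Agent strings at random.
    public static string NextUserAgent(Random rng) =>
        UserAgents[rng.Next(UserAgents.Length)];

    // Builds a GET request that carries a randomly chosen User-Agent.
    public static HttpRequestMessage BuildRequest(string url, Random rng)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.TryAddWithoutValidation("User-Agent", NextUserAgent(rng));
        return request;
    }
}
```

In the crawl loop this becomes `await Task.Delay(Politeness.NextDelay(rng));` followed by `await client.SendAsync(Politeness.BuildRequest(url, rng));`. `Random` is not thread-safe, so give each worker its own instance if you crawl in parallel.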

