
What you only realize after a website blocks you
When I first learned web scraping, I assumed everything was fine as long as the code ran. Then one day I started receiving a stream of 403 errors and stared at a "Your visits are too frequent" message, only to realize the site's anti-scraping defenses were far more sensitive than I had imagined. At that point, simply swapping the User-Agent no longer works; you need a more professional approach.
Timeout settings are more subtle than they look
Many newcomers ignore the timeout parameter entirely, and their programs end up hanging indefinitely. With the requests library, the safest pattern looks like this:
response = requests.get(url, timeout=(3.05, 27))
Here, 3.05 seconds is the connect timeout and 27 seconds is the read timeout. The slightly-over-3 connect value is deliberate: TCP's default packet retransmission window is 3 seconds, so the requests documentation recommends a connect timeout just above a multiple of 3. If no response arrives within the limit, disconnect and move on to the next task instead of hanging on a single URL.
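The connect/read split also lets you react to each failure mode separately. A minimal sketch, assuming the `fetch` helper and its return convention (both are illustrative, not from the original):

```python
import requests
from requests.exceptions import ConnectTimeout, ReadTimeout

def fetch(url, connect_timeout=3.05, read_timeout=27):
    """Fetch a URL with separate connect and read timeouts.

    Returns the body text, or None on timeout so the caller can
    simply move on to the next task.
    """
    try:
        resp = requests.get(url, timeout=(connect_timeout, read_timeout))
        resp.raise_for_status()
        return resp.text
    except ConnectTimeout:
        return None  # server unreachable within the connect window
    except ReadTimeout:
        return None  # connected, but the body arrived too slowly
```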
The right way to use proxy IPs
Sending every request from a single IP is like opening a lock with the same key over and over; sooner or later the locksmith notices. That is when a dynamic proxy service like ipipgo comes in, letting each request wear a different "jacket". Their IP pool is refreshed frequently; in my own tests it could automatically rotate through 200+ working nodes per hour.
proxies = {
'http': 'http://user:pass@gateway.ipipgo.com:9020',
'https': 'http://user:pass@gateway.ipipgo.com:9020'
}
response = requests.get(url, proxies=proxies, timeout=10)
Three tricks for performance tuning
| Tactic | Recommended setting | Effect |
|---|---|---|
| Concurrency control | ≤ 50 threads | Avoids triggering risk control |
| Tiered timeouts | 3-10-30 seconds | Handles exceptions by severity |
| IP rotation | 5 requests per IP | Extends proxy lifespan |
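The IP-rotation and concurrency rows of the table can be combined into a simple scheduling sketch. The gateway addresses and the `assign_proxies` helper are my own illustrative assumptions:

```python
import itertools

# Hypothetical proxy gateways; in practice these would come from the
# provider's API rather than being hard-coded.
PROXIES = ['http://gw1.example:9020', 'http://gw2.example:9020']
REQUESTS_PER_IP = 5   # rotate after every 5 requests, per the table
MAX_WORKERS = 50      # keep thread count at or below 50

def assign_proxies(urls):
    """Pair each URL with a proxy, switching proxy every 5 URLs.

    The resulting plan can be fed to a ThreadPoolExecutor capped at
    MAX_WORKERS so concurrency stays within the recommended limit.
    """
    pool = itertools.cycle(PROXIES)
    proxy = next(pool)
    plan = []
    for i, url in enumerate(urls):
        if i and i % REQUESTS_PER_IP == 0:
            proxy = next(pool)  # time to change "jackets"
        plan.append((url, proxy))
    return plan
```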
Notes from real-world pitfalls
Once, while crawling public government data, I set a 3-second timeout. Pages with many fields kept timing out, and I eventually traced the cause to the SSL handshake taking too long. Raising the connect timeout to 5 seconds while keeping the read timeout at 15 seconds solved the problem. Details like this never appear in the official docs; they are all lessons paid for in blood and tears.
Q&A first aid kit
Q: Why is it still blocked after using a proxy?
A: Check how heavily each IP is being used; keep a single IP under 50 requests per hour. ipipgo's dashboard lets you configure the automatic switching frequency.
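That per-IP budget can be enforced locally before a request ever goes out. A minimal sketch with a sliding one-hour window (the `IpRateGuard` class is my own assumption, not part of any provider SDK):

```python
import time
from collections import defaultdict, deque

class IpRateGuard:
    """Track requests per proxy IP and refuse any IP that has
    already spent its hourly budget (50/hour, per the answer above)."""

    def __init__(self, max_per_hour=50, window=3600):
        self.max_per_hour = max_per_hour
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent hits

    def allow(self, ip, now=None):
        """Return True and record the hit if this IP is under budget."""
        now = time.time() if now is None else now
        q = self.hits[ip]
        while q and now - q[0] > self.window:
            q.popleft()  # drop hits older than the window
        if len(q) >= self.max_per_hour:
            return False  # over budget: rotate to another IP instead
        q.append(now)
        return True
```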
Q: What is the appropriate timeout setting?
A: Start from the site's average response time: use a 10-second baseline while testing, then shorten it to about 70% of that for production runs.
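One way to turn that rule of thumb into code (the `tuned_timeout` helper and the 1.5x headroom floor are my assumptions, added so a single slow-but-valid page does not get cut off):

```python
def tuned_timeout(sample_seconds, baseline=10.0, ratio=0.7):
    """Derive a production read timeout from response times measured
    during testing: shrink the 10 s test baseline to 70% of it, but
    never go below the slowest observed response plus 50% headroom.
    """
    target = baseline * ratio          # 10 s baseline -> 7 s in production
    slowest = max(sample_seconds)
    return max(target, slowest * 1.5)
```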
Q: What should I do if my proxy IP suddenly fails?
A: Add a retry mechanism to the exception handling module, like this:
from requests.exceptions import Timeout, ProxyError

try:
    response = requests.get(url, proxies=proxies, timeout=10)
except (Timeout, ProxyError):
    ipipgo.refresh_ip()  # call the provider API to switch IP
    logger.warning("Circuit breaker triggered")
Straight talk
Scraping is fundamentally a battle of wits with the site's operations team. Last time I used ipipgo's geotargeting feature to request IPs specifically from their Shanghai data center while crawling a local forum, and the success rate doubled. Their engineers also shared a trick: bind the timeout settings to the proxy-switching strategy so that slow nodes are automatically downgraded. With that combination in place, my collection throughput more than tripled.

