When crawlers meet the TLS fingerprinting hurdle
If you do data crawling, you have probably noticed that quite a few sites have recently started using TLS fingerprint recognition. Simply put, the server inspects the characteristics of the client's TLS handshake, such as the cipher suites and other parameters that differ between browser versions and HTTP libraries. A request sent with plain curl or the requests library is recognized as a bot, and the IP gets blocked outright.
If you only switch to a proxy IP, it's like putting a wig on a robot: it treats the symptom, not the cause. The IP changes but the handshake characteristics stay the same, so the site still recognizes you as the same "person". You have to tackle both at once: change the IP address and disguise the TLS fingerprint.
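To see why swapping the IP alone changes nothing, it helps to look at how a handshake fingerprint can be derived. The sketch below follows the widely known JA3 scheme, which is an assumption here since a given site may use a different method: five ClientHello fields are joined into one string and hashed with MD5. The field values are made-up illustrative numbers, not captures from any real client.

```python
import hashlib

def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input: five comma-separated fields, lists joined with dashes."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return ",".join(fields)

def ja3_hash(*fields):
    """Return the MD5 hex digest of the JA3 string."""
    return hashlib.md5(ja3_string(*fields).encode("ascii")).hexdigest()

# Illustrative values only (not captured from real clients).
script_hello = (771, [4866, 4867, 49196], [0, 11, 10, 35], [29, 23, 24], [0])
browser_hello = (771, [4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])

print(ja3_string(*script_hello))
# Different handshake parameters give different hashes.
print(ja3_hash(*script_hello) == ja3_hash(*browser_hello))
```

Note that the client's IP address is not one of the inputs, so rotating proxies leaves the hash untouched; only changing the handshake itself does.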
How to use curl_cffi
This brings us to the main character: the curl_cffi library. It is based on curl but has been heavily modified to emulate the TLS fingerprints of different browsers. A comparison table makes the difference clear:
Tool | Supported Protocols | Fingerprint Simulation | Concurrency |
---|---|---|---|
requests | HTTP/1.1 | × | Medium |
Plain curl | HTTP/2 | × | High |
curl_cffi | HTTP/3 | √ | Very high |
Installation is simple: run pip install curl_cffi and you're done. The key is to specify the browser fingerprint when making a request, like this:
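    from curl_cffi import requests

    # Replace the URL and the proxy address with your own values.
    resp = requests.get(
        "https://target-site",  # placeholder for the target website
        impersonate="chrome110",
        proxies={"https": "http://user:pass@ipipgo-proxy-address:port"},
    )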
The impersonate parameter is the key. It accepts Chrome targets ranging from chrome99 to chrome120 (the exact list depends on the installed curl_cffi version). Prefer a mainstream version released within the last three months; versions that are too new or too old tend to give you away.
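One way to apply the "not too new, not too old" advice is to filter the available targets by major version before choosing one. This is a minimal sketch: the helper names `chrome_major` and `targets_in_window` are hypothetical, and the window bounds are assumptions you would update as Chrome releases move on.

```python
import re

def chrome_major(target):
    """Extract the major version from a target such as 'chrome110'; None if not Chrome."""
    match = re.fullmatch(r"chrome(\d+)", target)
    return int(match.group(1)) if match else None

def targets_in_window(targets, oldest, newest):
    """Keep Chrome targets whose major version lies in [oldest, newest]."""
    kept = []
    for target in targets:
        major = chrome_major(target)
        if major is not None and oldest <= major <= newest:
            kept.append(target)
    return kept

# Example list; check which targets your curl_cffi version actually accepts.
available = ["chrome99", "chrome110", "chrome119", "chrome120"]
print(targets_in_window(available, oldest=116, newest=120))
# prints ['chrome119', 'chrome120']
```

The filtered list can then be passed to random.choice to pick the value for impersonate.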
How to choose a reliable proxy IP
Since we're changing IPs, we need to talk about our ipipgo proxy service. Many proxy providers on the market only care about supplying IPs and pay no attention to the use case. For getting past anti-crawling measures, three hard requirements matter:
- The IP type must be residential; data center IPs have long been blacklisted.
- Keep each IP in use for only 5-15 minutes; don't reuse it long-term.
- The exit location has to match the geolocation implied by the browser fingerprint you are emulating.
For example, if you're emulating a US Chrome browser, the proxy IP has to be a US residential address as well. ipipgo's dynamic residential proxy pool is built for exactly these needs: it automatically assigns a fresh IP for each request and supports geolocation selection.
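If you manage IP lifetimes yourself rather than relying on per-request assignment, the 5-15 minute rule can be sketched as a small lease object. Everything here is illustrative: `ProxyLease` is a hypothetical helper, and `acquire` stands for whatever function returns a fresh proxy URL from your provider.

```python
import random
import time

class ProxyLease:
    """Tracks how long the current exit IP has been in use.

    `acquire` is a caller-supplied function returning a fresh proxy URL
    (hypothetical; how you obtain one depends on your provider).
    """

    def __init__(self, acquire, min_seconds=300, max_seconds=900, clock=time.monotonic):
        self._acquire = acquire
        self._min = min_seconds
        self._max = max_seconds
        self._clock = clock
        self._proxy = None
        self._expires_at = 0.0

    def current(self):
        """Return the active proxy URL, replacing it once its lifetime is over."""
        now = self._clock()
        if self._proxy is None or now >= self._expires_at:
            self._proxy = self._acquire()
            # Each lease gets a random lifetime between the two bounds
            # (5 to 15 minutes by default).
            self._expires_at = now + random.uniform(self._min, self._max)
        return self._proxy
```

Calling `lease.current()` before every request yields the same proxy until its lifetime runs out, then transparently switches to a new one. The randomized lifetime avoids rotating on a fixed schedule.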
A practical guide to avoiding pitfalls
Drawing on our experience solving problems for customers, here are a few common pitfalls:
- Don't cut corners with free proxies; those IPs have already been flagged by major websites.
- Sleep for a random 0.5-3 seconds before each request so the traffic doesn't follow a mechanical rhythm.
- Update curl_cffi regularly; its browser fingerprint library is updated every month.
Here's a configuration template to refer to:
    import random
    import time

    from curl_cffi import requests

    # Placeholder: fill in the auto-assigned proxy address from ipipgo.
    PROXY_URL = "http://ipipgo-auto-assigned-proxy"

    def safe_request(url):
        # Pause for a random interval so requests don't follow a fixed rhythm.
        time.sleep(random.uniform(1, 3))
        return requests.get(
            url,
            # Alternate between recent Chrome fingerprints.
            impersonate=random.choice(["chrome119", "chrome120"]),
            proxies={"https": PROXY_URL},
        )
Frequently Asked Questions
Q: I've already used a proxy IP, why is it still blocked?
A: Changing the IP without changing the TLS fingerprint is like changing clothes without changing your face; you will still be recognized. Pair the proxy with curl_cffi, which disguises the fingerprint.
Q: How do I connect to the ipipgo proxy from my code?
A: Get the API access address from your account dashboard. Username + password authentication is recommended, as it is more convenient than binding to an IP whitelist.
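With username + password authentication, the credentials go into the proxy URL itself. A minimal sketch, assuming a standard HTTP proxy URL format; the host, port, and credentials below are placeholders, and `build_proxy_url` is a hypothetical helper. Percent-encoding matters when the password contains characters such as @ or :.

```python
from urllib.parse import quote

def build_proxy_url(username, password, host, port, scheme="http"):
    """Assemble a proxy URL with credentials, percent-encoding special characters."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"

# Placeholder values; the real host and port come from your account page.
proxy_url = build_proxy_url("user", "p@ss:word", "proxy.example.com", 8000)
proxies = {"http": proxy_url, "https": proxy_url}
print(proxy_url)
# prints http://user:p%40ss%3Aword@proxy.example.com:8000
```

The resulting `proxies` dict can be passed as the proxies argument of a request.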
Q: How do I debug TLS detection?
A: First run openssl s_client -connect target-site:443 to inspect the handshake with the target site, then compare it with the handshake of a normal browser and adjust your program's parameters accordingly.
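For a quick look from the Python side, the standard library can show which cipher suites a default client offers and what a server ends up negotiating. Note the assumption: this inspects Python's built-in ssl stack, not curl_cffi, so it is only a rough complement to the openssl command. Both function names are hypothetical helpers.

```python
import socket
import ssl

def offered_cipher_names(context=None):
    """List the cipher suites this Python client is configured to offer."""
    context = context or ssl.create_default_context()
    return [cipher["name"] for cipher in context.get_ciphers()]

def negotiated_parameters(host, port=443, timeout=10):
    """Connect to host and return the negotiated (TLS version, cipher name)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version(), tls.cipher()[0]

if __name__ == "__main__":
    # Show the first few suites the local client would offer in its handshake.
    print(offered_cipher_names()[:5])
```

A cipher list that differs from what a real browser sends is one of the signals fingerprinting relies on, which is why a plain script stands out.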
As a final reminder, technical tools only work well when paired with a reliable proxy service. ipipgo provides 24-hour technical support; if you run into a specific problem, you can contact an engineer directly for one-on-one debugging, which is more reliable than following online tutorials.