First, why do Twitter crawlers keep getting blocked? You may be missing this tool
Recently a lot of people doing data analysis have been asking: when you crawl Twitter data with a Python script, how do you get around IP blocking? It's like sampling food at the supermarket: if you keep going back to the same counter for freebies, who else are the security guards going to watch?
Twitter's anti-scraping mechanism is sharp: frequent requests from the same IP trigger an alarm right away. This is where a proxy IP comes in. It is a disguise that gives every visit a new "vest", like playing a game on an alt account: if one gets banned, you switch to a new one and keep playing.
Second, a hands-on guide to collecting Twitter data through a proxy IP
Taking Python's requests library as an example, adding a proxy IP is like giving the crawler a cloak of invisibility:
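import requests

# Route both HTTP and HTTPS traffic through the authenticated proxy
proxies = {
    'http': 'http://username:password@proxy.ipipgo.io:8888',
    'https': 'http://username:password@proxy.ipipgo.io:8888',
}
# A timeout keeps a dead proxy from hanging the crawler
response = requests.get('https://twitter.com/api/data', proxies=proxies, timeout=10)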
Note that you have to replace the username and password with those of your registered ipipgo account. According to ipipgo, their proxy channels are encrypted, which is much safer than running around naked.
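One detail that tends to bite here: if the password contains characters such as `@` or `:`, the proxy URL breaks. A minimal sketch of building the `proxies` dict safely is shown below; the helper name `build_proxies` and the sample password are made up for illustration, while the host and port are the ones from the example above.

```python
from urllib.parse import quote


def build_proxies(username: str, password: str, host: str, port: int) -> dict:
    """Return a proxies mapping in the shape the requests library accepts."""
    # Percent-encode the credentials so characters such as '@' or ':'
    # inside a password do not corrupt the proxy URL.
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    proxy_url = f"http://{user}:{pwd}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# Hypothetical credentials, for illustration only.
proxies = build_proxies("username", "p@ss:word", "proxy.ipipgo.io", 8888)
print(proxies["https"])
```

The resulting dict can be passed straight to `requests.get(..., proxies=proxies)`.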
Third, what hard metrics should you look at when choosing a proxy IP?
Proxy services on the market vary widely in quality, so check these parameters strictly:
Metric | Passing threshold | ipipgo data |
---|---|---|
Response time | <500ms | 230ms average |
Availability | >95% | 99.2% |
IP Pool Size | >500,000 | 8 million + |
Special mention goes to ipipgo's dynamic residential IPs: they come from real users' actual network environments, so Twitter has a hard time telling whether a machine or a real person is behind the request.
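If you would rather measure a proxy yourself than trust a vendor's numbers, the thresholds in the table can be turned into a small check. This is only a sketch: the function name, the convention that `None` marks a failed probe, and the sample data are assumptions, and the latencies would come from timing your own test requests through the proxy.

```python
from statistics import mean
from typing import Optional, Sequence

MAX_AVG_LATENCY_MS = 500  # "response time" passing threshold from the table
MIN_AVAILABILITY = 0.95   # "availability" passing threshold from the table


def evaluate_proxy(latencies_ms: Sequence[Optional[float]]) -> dict:
    """Summarise probe results; a None entry marks a failed request."""
    if not latencies_ms:
        raise ValueError("need at least one probe result")
    successes = [ms for ms in latencies_ms if ms is not None]
    availability = len(successes) / len(latencies_ms)
    avg_latency = mean(successes) if successes else None
    passed = (
        avg_latency is not None
        and avg_latency < MAX_AVG_LATENCY_MS
        and availability > MIN_AVAILABILITY
    )
    return {
        "availability": availability,
        "avg_latency_ms": avg_latency,
        "passed": passed,
    }


# Hypothetical probe results: 24 successful requests and 1 failure.
sample = [200.0, 260.0] * 12 + [None]
report = evaluate_proxy(sample)
print(report)
```

IP pool size cannot be measured this way; for that you still have to rely on what the provider publishes.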
Fourth, avoid these pitfalls and triple your crawler's lifespan
Hard-won lessons from veterans:
1. Don't use free proxies! Twitter blacklisted those IPs long ago, so using them is asking to get blocked.
2. Keep the request frequency human-like; a random delay of 2-5 seconds works best.
3. Remember to rotate the User-Agent regularly; don't always present the same browser fingerprint.
4. Don't fight CAPTCHAs; use ipipgo's automatic switching feature to change the IP and try again.
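Points 2 and 3 above can be sketched in a few lines. Everything named here (`polite_get`, `random_delay`, `next_headers`, and the User-Agent strings) is illustrative rather than part of any library; `fetch` is whatever function actually sends the request, for example `requests.get`.

```python
import random
import time

# Example User-Agent strings; swap in current, realistic values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_delay(min_s: float = 2.0, max_s: float = 5.0) -> float:
    """Pick a pause length in seconds within the 2-5 second range."""
    return random.uniform(min_s, max_s)


def next_headers() -> dict:
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def polite_get(fetch, url, sleep=time.sleep):
    """Wait a random interval, then call fetch(url, headers=...)."""
    sleep(random_delay())
    return fetch(url, headers=next_headers())
```

With requests installed, `polite_get(requests.get, url)` works as is; to keep the proxy settings, wrap the call first, e.g. `functools.partial(requests.get, proxies=proxies, timeout=10)`.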
Fifth, practical Q&A (a must-read for beginners)
Q: Do I have to change the proxy manually every time?
A: No. ipipgo supports automatic extraction via API; write a scheduled task and IP replacement becomes automatic. Code example:
import time
from ipipgo_client import IPPool  # ipipgo official SDK
pool = IPPool(api_key="your_key")
def get_fresh_ip():
    # Ask the pool for a new SOCKS5 proxy
    return pool.get_proxy(types=['SOCKS5'])
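The "scheduled task" part can be approximated without a real scheduler by caching the proxy and refreshing it once it gets too old. The sketch below is SDK-agnostic: `RotatingProxy` is a made-up helper, and it assumes the function you pass in (for instance the `get_fresh_ip` above) returns a proxy URL string, which the original does not confirm.

```python
import time
from typing import Callable, Optional


class RotatingProxy:
    """Cache a proxy URL and fetch a fresh one after max_age_s seconds."""

    def __init__(
        self,
        fetch_proxy: Callable[[], str],
        max_age_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_proxy = fetch_proxy
        self._max_age_s = max_age_s
        self._clock = clock
        self._current: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        """Return the cached proxy, refreshing it first if it has expired."""
        now = self._clock()
        if self._current is None or now - self._fetched_at >= self._max_age_s:
            self._current = self._fetch_proxy()
            self._fetched_at = now
        return self._current

    def as_requests_proxies(self) -> dict:
        """Return the current proxy in the dict shape requests expects."""
        url = self.get()
        return {"http": url, "https": url}
```

Usage would look like `rotator = RotatingProxy(get_fresh_ip, max_age_s=120)` followed by `requests.get(url, proxies=rotator.as_requests_proxies())`. Note that SOCKS proxies in requests need the optional `requests[socks]` extra.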
Q: Why do I still get blocked after using a proxy?
A: Check three things: ① whether the IP is high-anonymity, ② whether the request headers leak information, ③ whether you are triggering behavior detection. It is recommended to use ipipgo's deep detection mode, which automatically filters out blacklisted IPs.
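Check ② can be done by hand: send a request through the proxy to an endpoint that echoes back the headers it received (one you control, or a public echo service such as httpbin.org/headers, which is a suggestion not taken from the original) and look for headers that give the proxy away. A rough sketch, with a header list that is a common-sense selection rather than an exhaustive one:

```python
# Headers that commonly reveal a proxy hop or the client's original IP.
PROXY_REVEALING_HEADERS = {
    "via",
    "forwarded",
    "x-forwarded-for",
    "x-real-ip",
    "proxy-connection",
    "client-ip",
}


def find_leaky_headers(received_headers: dict) -> list:
    """Return the names of received headers that hint at proxy use."""
    return sorted(
        name for name in received_headers
        if name.lower() in PROXY_REVEALING_HEADERS
    )


# Hypothetical headers as echoed back by the test endpoint.
echoed = {
    "User-Agent": "Mozilla/5.0",
    "Via": "1.1 some-proxy",
    "X-Forwarded-For": "203.0.113.7",
}
print(find_leaky_headers(echoed))
```

An empty result is what you would expect from a high-anonymity proxy; anything in the list means the target can see that a proxy is involved.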
Q: What should I do if crawling suddenly slows down?
A: Most likely the current IP is being rate-limited. In the ipipgo dashboard, set the speed threshold to 200ms, and it will automatically switch to a new IP whenever the response time exceeds that limit.
Sixth, these handy tricks let you get more done with less effort
1. Pair the proxy with a browser fingerprint modification tool; undetected-chromedriver is recommended.
2. For key data, use ipipgo's exclusive IPs, whose stability is comparable to your own broadband.
3. Set up a failure retry mechanism: add a while loop in the code to retry automatically.
4. Collect data between 3 and 6 a.m.; during this window the anti-scraping strategy is relatively loose.
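The while-loop retry from point 3 might look like the sketch below. The function name, the attempt limit, and the linearly growing pause are all assumptions; `fetch` stands for whatever call sends the request (such as a wrapped `requests.get`).

```python
import time


def fetch_with_retry(fetch, url, max_attempts=3, backoff_s=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or max_attempts is reached."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return fetch(url)
        except Exception:
            # In real code, narrow this to the errors you expect,
            # e.g. requests.RequestException.
            if attempt >= max_attempts:
                raise
            # Pause a little longer after each failure before retrying.
            sleep(backoff_s * attempt)
```

This is also a natural place to switch to a fresh proxy IP before the next attempt.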
Lastly, crawling is not about brute force; it takes strategy. The right tools (such as ipipgo) plus sensible configuration are what let you collect data over the long run. Don't mind the upfront effort: the more carefully you configure things early on, the less maintenance trouble you will have later. If anything is unclear, go to the ipipgo official website and contact customer service; their technical staff are online 24 hours a day, which is faster than searching the documentation.