I. A crawler primer even complete beginners can follow
Want to pull data from web pages but afraid of getting blocked? Start by memorizing this **golden triangle**: the requests library to send requests, XPath to locate elements, and regular expressions to extract the details. Don't be intimidated by the terminology. Take e-commerce price monitoring as an example: suppose you want to grab phone prices; a call to requests.get() fetches the page source.
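A minimal sketch of that first step (the URL below is a made-up example, not a real product page):

```python
import requests

# Hypothetical product page URL -- swap in the real target
url = "https://example.com/phone/12345"

headers = {"User-Agent": "Mozilla/5.0"}  # minimal browser-like header
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()              # fail fast on 4xx/5xx
html = response.text                     # page source, ready for XPath/regex
print(html[:200])
```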
This is where a **proxy IP pool from ipipgo** comes in. Why? If the same IP hammers a site with requests, of course it gets banned. Add a few lines of proxy settings to your code and rotate through the IP addresses ipipgo provides, like a face-changing act, so the site believes each visit comes from a different person.
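Here's what rotation might look like with a hand-maintained pool; the proxy endpoints below are placeholders, not real ipipgo addresses:

```python
import random
import requests

# Placeholder endpoints -- substitute the proxies your ipipgo account provides
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

def fetch(url):
    proxy = random.choice(PROXY_POOL)          # a new "face" on every request
    proxies = {"http": proxy, "https": proxy}
    return requests.get(url, proxies=proxies, timeout=10)
```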
II. Finding data with XPath is easier than opening a drawer
Picture the structure of a web page as a closet: XPath is the navigation language that tells the program "the second piece of clothing in the third drawer on the left." Right-click an element in Chrome Developer Tools (F12) and choose Copy XPath to get the location path directly. For example, a phone's price might sit at `//div[@class='price-box']/span[1]`. Common patterns are listed in the table below, followed by a parsing sketch.
| Common positioning scenario | XPath pattern |
| --- | --- |
| By class name | `//div[@class='product']` |
| By text content | `//a[contains(text(),'Buy Now')]` |
| Multi-level nesting | `//ul[@id='list']/li[3]/div` |
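Putting the table to work, a small parsing sketch with lxml against a made-up snippet of product HTML:

```python
from lxml import html  # pip install lxml

# A tiny inline page standing in for a real product listing
tree = html.fromstring("""
<div>
  <div class="price-box"><span>¥3,299</span><span>¥3,599</span></div>
  <a href="/buy">Buy Now</a>
</div>
""")

price = tree.xpath("//div[@class='price-box']/span[1]/text()")[0]
buy_href = tree.xpath("//a[contains(text(),'Buy Now')]/@href")[0]
print(price, buy_href)  # ¥3,299 /buy
```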
III. Regular Expressions: The Swiss Army Knife of Data Cleansing
When scraped web data is a mess, regex is your filter. For example, if the price you grab reads "from ¥3,299", the pattern `\d+,\d+` matches "3,299"; strip the comma and you have 3299. Memorize the three workhorses: `.` (any character), `\d+` (digits), `\w+` (word characters).
Practical case: extracting a phone number from noisy text
Original text: Customer service hotline 400-1234-5678 (weekdays)
Regex: `\d{3}-\d{4}-\d{4}`
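Both extractions in Python's re module, using the example strings above:

```python
import re

# Price with a thousands separator, e.g. scraped as "from ¥3,299"
price_text = "from ¥3,299"
m = re.search(r"\d+,\d+", price_text)
if m:
    price = int(m.group().replace(",", ""))  # "3,299" -> 3299
    print(price)

# Phone number buried in surrounding noise
contact = "Customer service hotline 400-1234-5678 (weekdays)"
m = re.search(r"\d{3}-\d{4}-\d{4}", contact)
if m:
    print(m.group())  # 400-1234-5678
```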
IV. The right way to use proxy IPs
I've seen people staring at a **ConnectionError** at 4 a.m. That's what happens when you skimp on proxies. Adding ipipgo's proxies to your code is like giving your crawler an invisibility cloak:
```python
proxies = {
    'http': 'http://username:password@ipipgo-proxy-server:port',
    'https': 'https://username:password@ipipgo-proxy-server:port',
}
response = requests.get(url, proxies=proxies)
```
The key points: **pick a random IP for every request**, **switch automatically on failures**, and **check proxy availability on a schedule**. ipipgo's API returns a list of available proxies directly, which beats the time sink of maintaining a pool yourself.
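A rough sketch of those three habits; the API endpoint and its one-proxy-per-line response format are assumptions, so check ipipgo's docs for the real interface:

```python
import random
import requests

API_URL = "https://api.example-ipipgo.com/get_proxies"  # hypothetical endpoint

def load_proxy_pool():
    # Assumed response format: one "host:port" proxy per line
    resp = requests.get(API_URL, timeout=10)
    resp.raise_for_status()
    return resp.text.split()

def fetch_with_retry(url, pool, max_retries=3):
    for _ in range(max_retries):
        proxy = f"http://{random.choice(pool)}"   # random IP per request
        try:
            return requests.get(url, proxies={"http": proxy, "https": proxy},
                                timeout=10)
        except requests.RequestException:
            continue  # automatic switch: next loop picks a different proxy
    raise RuntimeError("all proxy attempts failed")
```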
V. Pitfall guide: five common rookie mistakes
1. Forgetting to set request headers and getting intercepted as a bot (see the sketch after this list)
2. Hammering a site from a single IP and collecting a ban within 10 minutes
3. Ignoring asynchronously loaded content and scraping an empty page
4. Writing parsing rules so rigid they break the moment the layout changes
5. Skipping exception handling, so the program crashes in the middle of the night
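A sketch covering mistakes 1 and 5: a session with browser-like headers, a timeout, and exception handling so one bad response doesn't kill the whole run:

```python
import requests

session = requests.Session()
session.headers.update({
    # Browser-like headers so the request isn't instantly flagged as a bot
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
})

def safe_get(url):
    try:
        resp = session.get(url, timeout=10)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException as exc:
        # Log and move on instead of crashing at 3 a.m.
        print(f"request failed for {url}: {exc}")
        return None
```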
VI. Q&A time: the questions you're itching to ask
Q: What if my XPath stops working after a site redesign?
A: Prefer relative paths and fuzzy matching. For example, `//*[contains(@class,'price')]` survives redesigns better than a fixed class path.
Q: How are ipipgo's proxies billed?
A: They charge by **actual usage**, unlike platforms that force you into a package. New users get a $5 credit, enough to test a few thousand requests.
Q: How do I get past a CAPTCHA?
A: The three-piece kit: lower your request frequency, rotate your User-Agent, and use ipipgo's high-anonymity proxies (a UA-rotation sketch follows). For genuinely hardcore CAPTCHAs, consider a CAPTCHA-solving service.
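The UA-rotation piece might look like this (the strings are illustrative; keep a current list):

```python
import random
import requests

USER_AGENTS = [
    # Illustrative strings -- swap in up-to-date browser UAs
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def get_with_random_ua(url):
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)
```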
VII. Final advice: don't be reckless
Crawling is a war of attrition: whoever survives longer wins. Get these three things right:
1. Sleep a random 1-3 seconds between requests (see the sketch below)
2. Prepare three fallback parsing schemes for important projects
3. Use ipipgo's **dedicated IP pool** as your safety net
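A sketch of points 1 and 2 together: a polite random pause, plus layered parsing fallbacks (the XPath expressions reuse the hypothetical ones from earlier):

```python
import random
import re
import time

from lxml import html

def polite_sleep():
    time.sleep(random.uniform(1, 3))  # random 1-3 s pause between requests

def parse_price(page_source):
    tree = html.fromstring(page_source)
    # Scheme 1: exact path; Scheme 2: fuzzy class match; Scheme 3: raw regex
    for xpath in ("//div[@class='price-box']/span[1]/text()",
                  "//*[contains(@class,'price')]/text()"):
        result = tree.xpath(xpath)
        if result:
            return result[0].strip()
    m = re.search(r"¥[\d,]+", page_source)
    return m.group() if m else None
```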
Remember, sustainable crawling is the way to win; don't take a big loss just to save a little on proxy fees.