
When Proxy IPs Meet JSON Data: How to Handle It Without Crashing
Anyone who writes crawlers knows that nine times out of ten, the data you pull back from the Internet is JSON. It looks clean, but handling it can be more tedious than taking apart a Russian nesting doll. When you collect data through proxy IPs in particular, you often run into inconsistent data types, encoding errors, and overly deep nesting. Last week I hit a case like this: I was using ipipgo dynamic residential proxies to scrape data from an e-commerce site, and the price field in the returned JSON was sometimes the string "199" and sometimes the number 199, which nearly brought the database down.

    import json
    from requests import Session

    # ipipgo proxy configuration (this is the key part)
    proxy_config = {
        "http": "http://user:pass@gateway.ipipgo.com:9020",
        "https": "http://user:pass@gateway.ipipgo.com:9020",
    }

    session = Session()
    response = session.get('https://api.example.com/products', proxies=proxy_config)

    # There's a minefield buried here: the body is parsed without any checks,
    # so a malformed response raises and field types are not guaranteed
    raw_data = json.loads(response.text)

Four Tips to Tame Wild JSON
First move: data type sweep. Use this trick when you run into fields with mixed types:

    def clean_data(item):
        # Convert the price field to float uniformly
        if 'price' in item:
            try:
                item['price'] = float(item['price'])
            except (TypeError, ValueError):
                # Unparseable price (None, empty string, text): fall back to 0.0
                item['price'] = 0.0
        # Flatten the nested 'specs' dictionary into the top level
        if 'specs' in item:
            item.update(item.pop('specs'))
        return item

    safe_data = [clean_data(x) for x in raw_data['results']]

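To make the type sweep more concrete, here is a minimal sketch of a slightly more defensive price converter run against a few made-up records. The function name `normalize_price`, the sample records, and the decision to strip thousands separators are assumptions for illustration, not part of the original case.

```python
def normalize_price(value, default=0.0):
    """Coerce a price that may arrive as str, int, or float into a float."""
    if isinstance(value, bool):  # bool is a subclass of int; treat it as invalid
        return default
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        # Strip whitespace and thousands separators before converting
        cleaned = value.strip().replace(",", "")
        try:
            return float(cleaned)
        except ValueError:
            return default
    return default


# Made-up records imitating the mixed-type responses described above
records = [
    {"sku": "A1", "price": "199"},
    {"sku": "A2", "price": 199},
    {"sku": "A3", "price": "1,299.50"},
    {"sku": "A4", "price": None},
]

prices = [normalize_price(r.get("price")) for r in records]
print(prices)
```

Stripping commas only makes sense for sources that use the comma as a thousands separator; for sites that use it as the decimal mark, that line would need to change.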
Second move: the proxy exception-catching trifecta. Pay special attention to network fluctuations when using ipipgo's proxies:
| Error type | Response strategy |
|---|---|
| ConnectionError | Switch proxy nodes automatically |
| Timeout | Wait 3-5 seconds before retrying |
| JSONDecodeError | Log the raw response content |
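The three strategies in the table can be sketched roughly as follows. Everything here is a hypothetical placeholder: `fetch_json`, the `fetch` callable, the proxy names, and the parameter names are invented for illustration, and the demo uses a fake fetcher so it runs without a network. With `requests`, you would pass `requests.exceptions.ConnectionError` and `requests.exceptions.Timeout` as `switch_on` and `retry_on` instead of the built-in exceptions used as defaults.

```python
import json
import random
import time


def fetch_json(fetch, proxies, switch_on=(ConnectionError,), retry_on=(TimeoutError,),
               max_attempts=4, sleep=time.sleep, log=print):
    """Call fetch(proxy) -> str and parse the result as JSON, following the table above."""
    index = 0
    for attempt in range(1, max_attempts + 1):
        proxy = proxies[index % len(proxies)]
        try:
            text = fetch(proxy)
        except switch_on:
            # Connection failure: move on to the next proxy node
            index += 1
            continue
        except retry_on:
            # Timeout: wait 3-5 seconds, then retry on the same node
            sleep(random.uniform(3, 5))
            continue
        try:
            return json.loads(text)
        except json.JSONDecodeError as exc:
            # Bad payload: record the raw response, then try again (an assumption;
            # the table only says to record it)
            log(f"attempt {attempt}: invalid JSON via {proxy}: {exc}; raw={text[:200]!r}")
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts")


# Demo with a fake fetcher so the sketch runs without any network access
calls = []


def fake_fetch(proxy):
    calls.append(proxy)
    if len(calls) == 1:
        raise ConnectionError("node down")
    if len(calls) == 2:
        return '{"results": [tr'  # truncated payload
    return '{"results": []}'


data = fetch_json(fake_fetch, ["proxy-a", "proxy-b"], sleep=lambda seconds: None)
print(data, calls)
```

Passing `sleep` and `log` as parameters is only there to keep the sketch easy to test; in a real crawler you would probably hook `log` up to the `logging` module.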
Real-World Pitfalls and Lifesavers
Once, while capturing data through ipipgo's short-lived proxies, I ran into a strange JSON payload with emoji in the key names! When that happens, don't try to force it through the standard library; use this approach instead:

    import demjson
    from charset_normalizer import detect

    # Detect the true encoding (fall back to UTF-8 if detection returns None)
    encoding = detect(response.content)['encoding'] or 'utf-8'
    dirty_json = response.content.decode(encoding, errors='replace')

    # Parse with a third-party library that tolerates non-standard JSON
    data = demjson.decode(dirty_json)

Remember to add "Accept-Encoding": "identity" to the request headers. Otherwise some websites return compressed data, which can get garbled when it is relayed through the proxy.
Q&A Time (A Must-Read for Newbies)
Q: What should I do if I keep receiving truncated JSON through a proxy IP?
A: About 80% of the time the transmission was cut off midway. We suggest: 1) check whether the traffic package in the ipipgo dashboard has been used up; 2) add "Connection": "keep-alive" to the request headers; 3) increase the timeout to 10 seconds or more.
Q: What's the trick to dealing with deeply nested JSON?
A: Pulling the values out in one go with jsonpath is much nicer than writing several layers of for loops:

    # Filter expressions such as [?(...)] need the extended parser
    from jsonpath_ng.ext import parse

    expr = parse('$..products[?(@.price > 100)].sku')
    matches = [match.value for match in expr.find(data)]

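If you would rather not add a dependency, the same kind of lookup can be approximated with a small recursive generator from the standard library. This is a plain-Python alternative rather than jsonpath itself; the helper name `iter_key` and the sample payload are made up for illustration.

```python
def iter_key(node, key):
    """Yield every value stored under `key`, at any depth of a JSON-like structure."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from iter_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_key(item, key)


# Made-up payload for illustration
data = {
    "shop": {
        "products": [
            {"sku": "A1", "price": 99},
            {"sku": "A2", "price": 150},
        ],
        "featured": {"products": [{"sku": "B7", "price": 320}]},
    }
}

# Collect the SKU of every product priced above 100, wherever 'products' appears
skus = [
    product["sku"]
    for products in iter_key(data, "products")
    for product in products
    if product.get("price", 0) > 100
]
print(skus)
```

The walker visits the whole tree, so on very large documents a targeted path expression may still be the better choice.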
ipipgo's Hidden Tricks
Their pay-as-you-go proxies are especially suitable for handling sudden bursts of data. For example, if you suddenly need to parse a 10 GB JSON log file, you can do it like this:

    import pandas as pd
    from concurrent.futures import ThreadPoolExecutor

    def parse_chunk(chunk):
        # NOTE: `ipipgo` is not imported anywhere in the original snippet;
        # create_session is kept exactly as the author wrote it
        with ipipgo.create_session(duration='15min') as session:
            # read_json yields DataFrames, so convert to records before normalizing
            return pd.json_normalize(chunk.to_dict(orient='records'))

    # Process the large file in chunks
    results = []
    with ThreadPoolExecutor(max_workers=5) as executor:
        for chunk in pd.read_json('bigfile.json', lines=True, chunksize=1000):
            results.append(executor.submit(parse_chunk, chunk))

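The chunking idea also works with the standard library alone. Below is a minimal sketch that assumes the file is in JSON Lines format (one object per line); the helper names `read_chunks` and `parse_lines` are hypothetical, and a small temporary file stands in for the real 'bigfile.json' so the example is runnable.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def read_chunks(path, chunksize):
    """Yield lists of up to `chunksize` non-empty lines from a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        lines = (line for line in handle if line.strip())
        while True:
            chunk = list(islice(lines, chunksize))
            if not chunk:
                return
            yield chunk


def parse_lines(chunk):
    """Parse one chunk of lines into a list of dicts."""
    return [json.loads(line) for line in chunk]


# Write a tiny stand-in for 'bigfile.json' so the sketch is runnable
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    for i in range(25):
        tmp.write(json.dumps({"id": i, "price": str(i)}) + "\n")
    path = tmp.name

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(parse_lines, chunk) for chunk in read_chunks(path, 10)]
    rows = [row for future in futures for row in future.result()]

os.remove(path)
print(len(rows), rows[0], rows[-1])
```

In CPython, threads mainly help when the work is I/O-bound; if parsing itself is the bottleneck, a process pool may be worth trying instead.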
One final word of advice: when dealing with JSON, be sure to run a schema check before you process the data. If you are using ipipgo's proxies, you can send a HEAD request first to inspect the response headers (a HEAD response carries no body, so this tells you things like Content-Type and Content-Length rather than the fields themselves) and avoid wasting traffic. When you hit a hard problem, remember to dig through their documentation for the "non-standard JSON processing guide"; it's a lifesaver.
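A schema check does not have to be elaborate. Here is a minimal hand-rolled sketch, assuming product records with a `sku` and a `price` field; the function name `check_schema` and the schema itself are hypothetical, and a dedicated library such as `jsonschema` would be the fuller alternative.

```python
def check_schema(item, schema):
    """Return a list of problems found when comparing `item` against `schema`.

    `schema` maps field names to an expected type or a tuple of types.
    """
    problems = []
    for field, expected in schema.items():
        if field not in item:
            problems.append(f"missing field: {field}")
        elif not isinstance(item[field], expected):
            actual = type(item[field]).__name__
            problems.append(f"wrong type for {field}: {actual}")
    return problems


# Hypothetical schema for the product records discussed earlier
PRODUCT_SCHEMA = {"sku": str, "price": (int, float)}

good = {"sku": "A1", "price": 199}
bad = {"sku": "A2", "price": "199"}

print(check_schema(good, PRODUCT_SCHEMA))
print(check_schema(bad, PRODUCT_SCHEMA))
```

Returning a list of problems instead of raising on the first one makes it easier to log everything that is wrong with a record in a single pass.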

