When working with JSON, you often need to extract fields from different nesting levels and merge them into a table, as shown in the figure below. This post summarizes how to extract the content in such cases. I came across a Zhihu article, 《骚操作!嵌套 JSON 秒变 Dataframe》, that seemed to cover a similar situation, but my actual case turned out to be different.
First, the raw data to be processed:
```json
{
  "code": 0,
  "error": "",
  "message": "",
  "data": {
    "page": 1,
    "page_size": 3,
    "total": 1026,
    "list": [
      [
        { "_object_id": "App", "instanceId": "5b655ab36b9c3", "name": "nginx" },
        { "_object_id": "VIRTUAL_MACHINE", "instanceId": "5b793e7f8f932", "name": "CSSCU_VDU-18" },
        { "_object_id": "PHYSICAL_MACHINE", "instanceId": "5b471103effff", "name": "IDC-0D12-S-SRV-41" },
        { "_object_id": "SWITCH", "instanceId": "5b4714cb11685", "name": "IDC-0D11-M-TOR-01" },
        { "_object_id": "SWITCH", "instanceId": "5b4714ca14f5d", "name": "IDC-0G06-M-EOR-02" }
      ],
      [
        { "_object_id": "App", "instanceId": "5b655ab36b9c3", "name": "nginx" },
        { "_object_id": "VIRTUAL_MACHINE", "instanceId": "5b793e7f8f932", "name": "CSSCU_VDU-18" },
        { "_object_id": "PHYSICAL_MACHINE", "instanceId": "5b471103effff", "name": "IDC-0D12-S-SRV-41" },
        { "_object_id": "SWITCH", "instanceId": "5b4714cb11685", "name": "IDC-0D11-M-TOR-01" },
        { "_object_id": "SWITCH", "instanceId": "5b4714c98cbd1", "name": "IDC-0F06-M-EOR-01" }
      ],
      [
        { "_object_id": "App", "instanceId": "5b655ab36b9c3", "name": "nginx" },
        { "_object_id": "VIRTUAL_MACHINE", "instanceId": "5b793e7f8f932", "name": "CSSCU_VDU-18" },
        { "_object_id": "PHYSICAL_MACHINE", "instanceId": "5b471103effff", "name": "IDC-0D12-S-SRV-41" },
        { "_object_id": "SWITCH", "instanceId": "5b4714ca966ae", "name": "IDC-0D12-M-TOR-02" },
        { "_object_id": "SWITCH", "instanceId": "5b4714ca14f5d", "name": "IDC-0G06-M-EOR-02" }
      ]
    ]
  }
}
```
The above is three records returned for one CMDB path query; each record follows the path App (nginx) -> its virtual machine -> the physical machine hosting the VM -> TOR switch -> EOR switch. Reading and slicing it directly with pandas gives the following:
```python
import json

import pandas as pd

# Load the raw JSON and build a DataFrame from data["list"],
# one column per hop on the CMDB path.
with open("vhost.json") as f:
    results = json.load(f)

print(results["data"]["list"])
df = pd.DataFrame(results["data"]["list"],
                  columns=["APP", "VM", "PY", "TOR", "EOR"])
```
After this step, the five expected columns are returned, but each cell still contains a JSON object (a Python dict) rather than a scalar value.
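To see the problem concretely, here is a minimal sketch using a small inline sample that mirrors the structure of `data["list"]` (the values are hypothetical; the real file has five objects per path):

```python
import pandas as pd

# Inline sample mirroring data["list"]: a list of paths,
# each path a list of dicts, one dict per hop.
paths = [
    [{"_object_id": "App", "instanceId": "a1", "name": "nginx"},
     {"_object_id": "VIRTUAL_MACHINE", "instanceId": "v1", "name": "VDU-18"}],
    [{"_object_id": "App", "instanceId": "a1", "name": "nginx"},
     {"_object_id": "VIRTUAL_MACHINE", "instanceId": "v2", "name": "VDU-19"}],
]

df = pd.DataFrame(paths, columns=["APP", "VM"])

# Each cell is still a dict, not a scalar:
print(type(df.loc[0, "APP"]))       # <class 'dict'>
print(df.loc[0, "APP"]["name"])     # nginx
```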
I tried the JSON-handling techniques described on w3resource, but after several attempts none of them worked here, so I fell back to the straightforward approach: a for loop. The code is as follows:
```python
# Replace each dict cell with its "name" field, column by column.
for col in ["APP", "VM", "PY", "TOR", "EOR"]:
    df[col] = df.apply(lambda row: row[col]["name"], axis=1)
```
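The same extraction can also be written per column with `Series.map`, which visits each cell directly instead of going through a row-wise `apply`. A minimal sketch with hypothetical inline data:

```python
import pandas as pd

# Hypothetical two-column sample; each cell is a dict, as in the article.
df = pd.DataFrame(
    [[{"name": "nginx"}, {"name": "VDU-18"}],
     [{"name": "nginx"}, {"name": "VDU-19"}]],
    columns=["APP", "VM"],
)

# Map each dict cell to its "name" field, one column at a time.
for col in ["APP", "VM"]:
    df[col] = df[col].map(lambda d: d["name"])

print(df["APP"].tolist())  # ['nginx', 'nginx']
```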
Once processed, each cell holds the `name` field from its nested JSON object. Since the results contain duplicates (for example, every value in the APP column is nginx), a deduplicated summary is also needed: `df.nunique()` returns the number of distinct values per column, and if you need the actual deduplicated values, `DataFrame.drop_duplicates` does the job.
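A small sketch of both calls, on a hypothetical table shaped like the extracted result:

```python
import pandas as pd

# Hypothetical table after the "name" extraction step.
df = pd.DataFrame({
    "APP": ["nginx", "nginx", "nginx"],
    "TOR": ["TOR-01", "TOR-01", "TOR-02"],
})

# Number of distinct values per column (APP: 1, TOR: 2).
print(df.nunique())

# The deduplicated rows themselves; the repeated (nginx, TOR-01) row is dropped.
print(df.drop_duplicates())

# Distinct values of a single column.
print(df["TOR"].drop_duplicates())
```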