When processing JSON, you often need to extract fields from different nesting levels and merge them into a table, as illustrated below. This post summarizes how to extract the content in that situation. I came across a Zhihu article, 《骚操作!嵌套 JSON 秒变 Dataframe》 ("Slick trick: nested JSON to a DataFrame in seconds"), that seemed to cover a similar case, but my actual data turned out to be different.
First, the raw data I need to process:
{
  "code": 0,
  "error": "",
  "message": "",
  "data": {
    "page": 1,
    "page_size": 3,
    "total": 1026,
    "list": [
      [
        {
          "_object_id": "App",
          "instanceId": "5b655ab36b9c3",
          "name": "nginx"
        },
        {
          "_object_id": "VIRTUAL_MACHINE",
          "instanceId": "5b793e7f8f932",
          "name": "CSSCU_VDU-18"
        },
        {
          "_object_id": "PHYSICAL_MACHINE",
          "instanceId": "5b471103effff",
          "name": "IDC-0D12-S-SRV-41"
        },
        {
          "_object_id": "SWITCH",
          "instanceId": "5b4714cb11685",
          "name": "IDC-0D11-M-TOR-01"
        },
        {
          "_object_id": "SWITCH",
          "instanceId": "5b4714ca14f5d",
          "name": "IDC-0G06-M-EOR-02"
        }
      ],
      [
        {
          "_object_id": "App",
          "instanceId": "5b655ab36b9c3",
          "name": "nginx"
        },
        {
          "_object_id": "VIRTUAL_MACHINE",
          "instanceId": "5b793e7f8f932",
          "name": "CSSCU_VDU-18"
        },
        {
          "_object_id": "PHYSICAL_MACHINE",
          "instanceId": "5b471103effff",
          "name": "IDC-0D12-S-SRV-41"
        },
        {
          "_object_id": "SWITCH",
          "instanceId": "5b4714cb11685",
          "name": "IDC-0D11-M-TOR-01"
        },
        {
          "_object_id": "SWITCH",
          "instanceId": "5b4714c98cbd1",
          "name": "IDC-0F06-M-EOR-01"
        }
      ],
      [
        {
          "_object_id": "App",
          "instanceId": "5b655ab36b9c3",
          "name": "nginx"
        },
        {
          "_object_id": "VIRTUAL_MACHINE",
          "instanceId": "5b793e7f8f932",
          "name": "CSSCU_VDU-18"
        },
        {
          "_object_id": "PHYSICAL_MACHINE",
          "instanceId": "5b471103effff",
          "name": "IDC-0D12-S-SRV-41"
        },
        {
          "_object_id": "SWITCH",
          "instanceId": "5b4714ca966ae",
          "name": "IDC-0D12-M-TOR-02"
        },
        {
          "_object_id": "SWITCH",
          "instanceId": "5b4714ca14f5d",
          "name": "IDC-0G06-M-EOR-02"
        }
      ]
    ]
  }
}
The above is three records returned by a CMDB path query; each record follows the path APP1 (nginx) — its virtual machine — the physical machine hosting that VM — TOR switch — EOR switch. Reading and slicing it directly with pandas gives the following:
import json
import pandas as pd

# Load the raw API response and pull out the nested record list
with open("vhost.json") as f:
    results = json.load(f)
print(results["data"]["list"])

# Each inner list becomes one row; each of its five elements becomes one cell
df = pd.DataFrame(results["data"]["list"], columns=["APP", "VM", "PY", "TOR", "EOR"])
After this step, the DataFrame does come back with the expected five columns, but every cell still contains a JSON object (a dict). I tried the JSON-handling recipes on w3resource, but after several attempts none of them fit this case, so I fell back to the blunt approach: a for loop. The code:
for col in ["APP", "VM", "PY", "TOR", "EOR"]:
    # For each column, pull the "name" field out of the dict in every cell
    df[col] = df.apply(lambda row: row[col]["name"], axis=1)
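As an aside, the same per-cell extraction can be done without the column loop via `DataFrame.applymap`, which applies a function to every cell (in pandas ≥ 2.1 the method was renamed `DataFrame.map`). A minimal sketch, using a small inline sample in place of the real vhost.json data:

```python
import pandas as pd

# Stand-in for the nested CMDB records (the real data comes from vhost.json)
rows = [
    [{"_object_id": "App", "name": "nginx"}, {"_object_id": "SWITCH", "name": "IDC-0D11-M-TOR-01"}],
    [{"_object_id": "App", "name": "nginx"}, {"_object_id": "SWITCH", "name": "IDC-0D12-M-TOR-02"}],
]
df = pd.DataFrame(rows, columns=["APP", "TOR"])

# applymap runs the function on every cell, so no per-column loop is needed
df = df.applymap(lambda cell: cell["name"])
print(df)
```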
After this, every cell correctly holds the name field from its original JSON object. Because the results contain duplicates (for example, the APP column is nginx in every row), you will often also want deduplicated counts; df.nunique() returns the number of distinct values per column. If you need the distinct values themselves, DataFrame.drop_duplicates does the job.
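The deduplication step above can be sketched like this (again with a small inline sample standing in for the extracted columns):

```python
import pandas as pd

# Sample of the extracted name columns; APP repeats, EOR has two distinct values
df = pd.DataFrame({
    "APP": ["nginx", "nginx", "nginx"],
    "EOR": ["IDC-0G06-M-EOR-02", "IDC-0F06-M-EOR-01", "IDC-0G06-M-EOR-02"],
})

# Number of distinct values per column
print(df.nunique())

# The distinct rows themselves; for a single column, use df["EOR"].drop_duplicates()
unique_rows = df.drop_duplicates()
print(unique_rows)
```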