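Overview
This article is a follow-up to a previous article. It presents example Python tools that act as simple JSON filter commands, written with Elasticsearch bulk-load JSON Lines files in mind.
The background is described in that previous article. This is my own modest work, but after writing the previous article I found it unexpectedly handy, so I have added a few more variations.
The input file is assumed to be in the format described at the linked page. The command examples that follow assume it is saved under the file name json.jsonlines.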
Tool examples
The tool examples follow.
All of them share the same interface: given JSON such as
{"a":{"b":{"c":100}}}
the field c is specified in dot notation as
"a.b.c"
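As a minimal sketch of this interface, the snippet below resolves a dotted path against a JSON string. The helper name `lookup` is hypothetical and used only for illustration; the tools in this article use their own `getobj` function. Returning an empty string for a missing key is an assumption made here to mirror that function.

```python
import json


def lookup(jsonstr, path):
    """Follow a dotted path such as "a.b.c" through nested JSON objects."""
    node = json.loads(jsonstr)
    for key in path.split("."):
        if isinstance(node, dict) and key in node:
            node = node[key]
        else:
            return ""  # assumption: a missing key yields an empty string
    return node


doc = '{"a":{"b":{"c":100}}}'
print(lookup(doc, "a.b.c"))        # 100
print(lookup(doc, "a.b"))          # {'c': 100}
print(repr(lookup(doc, "a.x.c")))  # ''
```

A path that stops at an intermediate level returns the nested object itself, which is why `a.b` yields a dictionary.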
1. Name: jsonlineselect (selects the records whose field values match the given patterns)
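import json


def getobj(jsonstr, keystr, force_str=True):
    """Return the value at the dotted path keystr in the JSON text jsonstr."""
    wrk = json.loads(jsonstr)
    for i in keystr.split('.'):
        # Test for key presence so that falsy values such as 0 are kept,
        # and so that a non-object value does not raise an error.
        if isinstance(wrk, dict) and i in wrk:
            wrk = wrk[i]
        else:
            # A missing key yields ''. This may be a limitation of this tool,
            # but the behaviour is kept as it is for now.
            wrk = ''
            break
    return str(wrk) if force_str else wrk


if __name__ == "__main__":
    import re
    import sys

    # Usage:
    #   cat json.jsonlines | python3 jsonlineselect.py either index._id '^1' first '^jo.*'
    mode = sys.argv[1]  # both/either/each
    ii = sys.argv[2]    # dotted field path in the action line
    i_re = re.compile(sys.argv[3])
    jj = sys.argv[4]    # dotted field path in the document line
    j_re = re.compile(sys.argv[5])

    jsons = [l.rstrip('\n') for l in sys.stdin]

    # Lines are processed in pairs: action line (i), then document line (j).
    for i, j in zip(jsons[0::2], jsons[1::2]):
        cond1 = i_re.search(getobj(i, ii))
        cond2 = j_re.search(getobj(j, jj))
        if (mode == 'both' and (cond1 and cond2)) or \
           (mode == 'either' and (cond1 or cond2)):
            print(i)
            print(j)
        elif mode == 'each':
            if cond1:
                print(i)
            if cond2:
                print(j)
        else:
            # Pairs that were not selected go to stderr with a prefix.
            PREFIX = 'STDERR\t'
            print(PREFIX + i, file=sys.stderr)
            print(PREFIX + j, file=sys.stderr)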
Sample run
$ cat json.jsonlines | python3 jsonlineselect.py both index._id '^1' first '^jo.*'
{"index":{"_id":1}}
{"first":"johnny","last":"日本語gaudreau","goals":[9,27,1],"assists":[17,46,0],"gp":[26,82,1],"born":"1993/08/13"}
STDERR {"index":{"_id":2}}
STDERR {"first":"sean","last":"monohan","goals":[7,54,26],"assists":[11,26,13],"gp":[26,82,82],"born":"1994/10/12"}
STDERR {"index":{"_id":3}}
STDERR {"first":"jiri","last":"hudler","goals":[5,34,36],"assists":[11,62,42],"gp":[24,80,79],"born":"1984/01/04"}
STDERR {"index":{"_id":4}}
STDERR {"first":"micheal","last":"frolik","goals":[4,6,15],"assists":[8,23,15],"gp":[26,82,82],"born":"1988/02/17"}
STDERR {"index":{"_id":5}}
STDERR {"first":"sam","last":"bennett","goals":[5,0,0],"assists":[8,1,0],"gp":[26,1,0],"born":"1996/06/20"}
STDERR {"index":{"_id":6}}
STDERR {"first":"dennis","last":"wideman","goals":[0,26,15],"assists":[11,30,24],"gp":[26,81,82],"born":"1983/03/20"}
STDERR {"index":{"_id":7}}
STDERR {"first":"david","last":"jones","goals":[7,19,5],"assists":[3,17,4],"gp":[26,45,34],"born":"1984/08/10"}
STDERR {"index":{"_id":8}}
STDERR {"first":"tj","last":"brodie","goals":[2,14,7],"assists":[8,42,30],"gp":[26,82,82],"born":"1990/06/07"}
STDERR {"index":{"_id":39}}
STDERR {"first":"mark","last":"giordano","goals":[6,30,15],"assists":[3,30,24],"gp":[26,60,63],"born":"1983/10/03"}
STDERR {"index":{"_id":10}}
STDERR {"first":"mikael","last":"backlund","goals":[3,15,13],"assists":[6,24,18],"gp":[26,82,82],"born":"1989/03/17"}
{"index":{"_id":11}}
{"first":"joe","last":"colborne","goals":[3,18,13],"assists":[6,20,24],"gp":[26,67,82],"born":"1990/01/30"}
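The first argument decides how the two regex results are combined. The sketch below restates that decision as a pure function so the three modes are easy to compare. The name `route` and its return convention are hypothetical, introduced only for illustration; the logic is meant to follow the script above.

```python
def route(mode, action_hit, doc_hit):
    """Return the lines of a pair that go to stdout, or None for stderr.

    Hypothetical helper: None means the whole pair is written to stderr
    with the STDERR prefix instead of being printed.
    """
    if (mode == "both" and action_hit and doc_hit) or \
       (mode == "either" and (action_hit or doc_hit)):
        return ["action", "doc"]
    if mode == "each":
        # Each line is judged on its own; nothing is sent to stderr.
        hits = (("action", action_hit), ("doc", doc_hit))
        return [name for name, hit in hits if hit]
    return None


print(route("both", True, False))    # None
print(route("either", True, False))  # ['action', 'doc']
print(route("each", True, False))    # ['action']
print(route("each", False, False))   # []
```

Note that in `each` mode a pair with no match produces no output at all, while in `both` and `either` modes a rejected pair is reported on stderr.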
2. Name: jsonlineselect_sort (reorders the bulk-load file by the specified sort fields)
Each JSON Lines action line (action) and its paired document line (doc) are kept together and sorted as a single unit.
import sys

import jsonlineselect

# Usage:
#   cat json.jsonlines | python3 jsonlineselect_sort.py a_d index._id first

# Maps the sort type to a function that builds the sort key from the
# action-line value (a) and the document-line value (d).
FUNCMAP = {
    'action': lambda a, d: a,
    'doc': lambda a, d: d,
    'a_d': lambda a, d: a + '___' + d,
    'd_a': lambda a, d: d + '___' + a,
}

sorttype = sys.argv[1]  # action/doc/a_d/d_a
ii = sys.argv[2]        # dotted field path in the action line
jj = sys.argv[3]        # dotted field path in the document line

jsons = [l.rstrip('\n') for l in sys.stdin]

records = []
for i, j in zip(jsons[0::2], jsons[1::2]):
    a = jsonlineselect.getobj(i, ii)
    d = jsonlineselect.getobj(j, jj)
    records.append({'sortkeyval': FUNCMAP[sorttype](a, d), 'i': i, 'j': j})

# Keys are compared as strings, so numeric values sort lexicographically.
for r in sorted(records, key=lambda r: r['sortkeyval']):
    print(r['i'])
    print(r['j'])
Sample run
$ cat json.jsonlines | python3 jsonlineselect_sort.py doc index._id first
{"index":{"_id":7}}
{"first":"david","last":"jones","goals":[7,19,5],"assists":[3,17,4],"gp":[26,45,34],"born":"1984/08/10"}
{"index":{"_id":6}}
{"first":"dennis","last":"wideman","goals":[0,26,15],"assists":[11,30,24],"gp":[26,81,82],"born":"1983/03/20"}
{"index":{"_id":3}}
{"first":"jiri","last":"hudler","goals":[5,34,36],"assists":[11,62,42],"gp":[24,80,79],"born":"1984/01/04"}
{"index":{"_id":11}}
{"first":"joe","last":"colborne","goals":[3,18,13],"assists":[6,20,24],"gp":[26,67,82],"born":"1990/01/30"}
{"index":{"_id":1}}
{"first":"johnny","last":"日本語gaudreau","goals":[9,27,1],"assists":[17,46,0],"gp":[26,82,1],"born":"1993/08/13"}
{"index":{"_id":39}}
{"first":"mark","last":"giordano","goals":[6,30,15],"assists":[3,30,24],"gp":[26,60,63],"born":"1983/10/03"}
{"index":{"_id":4}}
{"first":"micheal","last":"frolik","goals":[4,6,15],"assists":[8,23,15],"gp":[26,82,82],"born":"1988/02/17"}
{"index":{"_id":10}}
{"first":"mikael","last":"backlund","goals":[3,15,13],"assists":[6,24,18],"gp":[26,82,82],"born":"1989/03/17"}
{"index":{"_id":5}}
{"first":"sam","last":"bennett","goals":[5,0,0],"assists":[8,1,0],"gp":[26,1,0],"born":"1996/06/20"}
{"index":{"_id":2}}
{"first":"sean","last":"monohan","goals":[7,54,26],"assists":[11,26,13],"gp":[26,82,82],"born":"1994/10/12"}
{"index":{"_id":8}}
{"first":"tj","last":"brodie","goals":[2,14,7],"assists":[8,42,30],"gp":[26,82,82],"born":"1990/06/07"}
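The sort key is built from strings, so a numeric field such as `index._id` is ordered lexicographically (for example `10` comes before `2`). If numeric order is wanted, one possible approach is to convert digit-only keys to integers before comparing. The following is only a sketch under that assumption; `numeric_aware_key` is a hypothetical helper and is not part of the scripts in this article.

```python
def numeric_aware_key(value):
    """Sort digit-only strings numerically and everything else as text.

    The leading 0/1 keeps numbers and text in separate groups, so Python
    never has to compare an int with a str.
    """
    if value.isdecimal():
        return (0, int(value), "")
    return (1, 0, value)


ids = ["1", "2", "39", "10", "11"]
print(sorted(ids))                         # ['1', '10', '11', '2', '39']
print(sorted(ids, key=numeric_aware_key))  # ['1', '2', '10', '11', '39']
```

Such a function could be applied to the key when sorting a single field; the combined `a_d` and `d_a` keys would need a tuple of two such keys rather than a joined string.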
3. Name: jsonlineselect_fieldselect (extracts only the values of the specified fields from each line)
Unlike the previous two, this one pays little attention to the pairing of lines.
import json
import os

import jsonlineselect


def select_fields(jsonstr, fields):
    """Return a JSON string holding only the dotted-path fields found in jsonstr."""
    wrk = {}
    for f in fields:
        val = jsonlineselect.getobj(jsonstr, f, force_str=False)
        if val == '':
            # getobj returns '' when the field is absent from this line.
            continue
        # Rebuild the nesting for this path and merge it into the result,
        # so that fields sharing a prefix (e.g. a.b and a.c) do not collide.
        keys = f.split('.')
        node = wrk
        for k in keys[:-1]:
            node = node.setdefault(k, {})
        node[keys[-1]] = val
    # Setting the environment variable JSON_SORT_KEYS=True sorts the output keys.
    sort_keys = os.environ.get('JSON_SORT_KEYS') == 'True'
    return json.dumps(wrk, ensure_ascii=False, sort_keys=sort_keys)


if __name__ == "__main__":
    # Usage:
    #   cat json.jsonlines | python3 jsonlineselect_fieldselect.py index._id first goals
    import sys

    fields = sys.argv[1:]
    for l in sys.stdin:
        print(select_fields(l.rstrip('\n'), fields))
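Sample run (There is no error checking, so the script fails immediately if anything goes wrong. Care is needed in how the arguments are specified, but multiple fields can be given. Setting the environment variable JSON_SORT_KEYS to True sorts the keys of each output object.)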
$ cat json.jsonlines | python3 jsonlineselect_fieldselect.py index._id first goals
{"index": {"_id": 1}}
{"first": "johnny", "goals": [9, 27, 1]}
{"index": {"_id": 2}}
{"first": "sean", "goals": [7, 54, 26]}
{"index": {"_id": 3}}
{"first": "jiri", "goals": [5, 34, 36]}
{"index": {"_id": 4}}
{"first": "micheal", "goals": [4, 6, 15]}
{"index": {"_id": 5}}
{"first": "sam", "goals": [5, 0, 0]}
{"index": {"_id": 6}}
{"first": "dennis", "goals": [0, 26, 15]}
{"index": {"_id": 7}}
{"first": "david", "goals": [7, 19, 5]}
{"index": {"_id": 8}}
{"first": "tj", "goals": [2, 14, 7]}
{"index": {"_id": 39}}
{"first": "mark", "goals": [6, 30, 15]}
{"index": {"_id": 10}}
{"first": "mikael", "goals": [3, 15, 13]}
{"index": {"_id": 11}}
{"first": "joe", "goals": [3, 18, 13]}
Each tool reads from standard input and writes to standard output, so several of them can be chained together with pipes.