I have a file of JSON lines that needs to be validated based on whether each record's "alert.status" value flaps (alternates) across that record's successive occurrences.
A sample of valid JSON lines:
{"id":123,"code":"foo","severity":"Critical","severityCode":1, "property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"On"}}
{"id":456,"code":"bar","severity":"High","severityCode":2,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"On"}}
{"id":123,"code":"foo","severity":"Critical","severityCode":1,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"Off"}}
{"id":456,"code":"bar","severity":"High","severityCode":2,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"Off"}}
{"id":123,"code":"foo","severity":"Critical","severityCode":1,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"On"}}
{"id":456,"code":"bar","severity":"High","severityCode":2,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"On"}}
The above file is valid since the duplicate records (lines 1, 3, 5 and lines 2, 4, 6) have their status flapping "On", "Off", "On", and so on.
A sample of invalid JSON lines:
{"id":123,"code":"foo","severity":"Critical","severityCode":1, "property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"On"}}
{"id":456,"code":"bar","severity":"High","severityCode":2,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"On"}}
{"id":123,"code":"foo","severity":"Critical","severityCode":1,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"On"}}
{"id":456,"code":"bar","severity":"High","severityCode":2,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"Off"}}
{"id":123,"code":"foo","severity":"Critical","severityCode":1,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"Off"}}
{"id":456,"code":"bar","severity":"High","severityCode":2,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"Off"}}
The above is invalid since the records on lines 1 and 3 are duplicates whose "status" value stays the same instead of flapping between "On" and "Off".
I tried to use jq to read the JSON lines into a JSON array:
jq --slurp '.' jsonfile > jsonarray
But since the sequence of the lines is important, I don't think I can use group_by to look for duplicates (group_by sorts its result).
I'm thinking about inserting a new key with an incrementing number into each JSON object, so that after group_by we can sort each group by this new key to get back the original sequence.
Is there a way in jq to group by all keys except two (in this case "status" and the new incrementing key)?
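Roughly, the approach I have in mind would look something like this (an untested sketch; "seq" is just a placeholder name for the incrementing key):
jq --null-input '
  [inputs]
  | to_entries | map(.value + {seq: .key})   # tag each object with its input position ("seq" is a placeholder)
  | group_by(del(.alert.status, .seq))       # group by everything except "status" and the new key
  | map(sort_by(.seq))                       # restore the original order within each group
' jsonfile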
Is there a better approach to this problem?
Thanks so much for your help!
I don't think I can use group_by to look for duplicates (group_by sorts its result).
That's right, but it's very easy to define a non-sorting "group_by" which, as we'll see, can also easily be used to group by all keys except specifically designated ones.
First, here is a simple filter which retains the original order of items within each group:
# The filter, f, must produce a string for each item in `stream`
def GROUPS_BY(stream; f):
  reduce stream as $x ({}; .[$x|f] += [$x] ) | .[] ;
The "S" in the name emphasizes that the function is stream-oriented: its first argument is a stream, and it produces a stream of groups. The name is upper-cased to highlight the differences from the built-in group_by.
To illustrate how this can be used to group by all but a specific key, consider this example (taken from another SO question):
def data:
  [{"foo":1,"bar":"a","baz":"whatever"},
   {"foo":1,"bar":"a","baz":"hello"},
   {"foo":1,"bar":"b","baz":"world"}] ;
GROUPS_BY(data[]; del(.baz) | tostring)

produces:

[{"foo":1,"bar":"a","baz":"whatever"},{"foo":1,"bar":"a","baz":"hello"}]
[{"foo":1,"bar":"b","baz":"world"}]
One might object that requiring f to be string-valued introduces several potential difficulties, so here is an efficient but more versatile definition:
# Emit a stream of the groups defined by f, without using sort.
# f need not be string-valued.
def GROUPS_BY(stream; f):
  reduce stream as $x ({};
    ($x|f) as $s
    | ($s|type) as $t
    | (if $t == "string" then $s else ($s|tojson) end) as $y
    | .[$t][$y] += [$x] )
  | .[][]
;
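As a quick illustration that f need not be string-valued (a made-up example, not part of the question), grouping a stream of numbers by parity

GROUPS_BY(1, 2, 3, 4, 5; . % 2)

produces:

[1,3,5]
[2,4]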
Now we can simply write:
GROUPS_BY(data[]; del(.baz))
The simplest way to use GROUPS_BY with a JSON Lines file is via inputs; e.g., assuming the more versatile definition is used, you'd write:
GROUPS_BY(inputs; del(.alert))
Don't forget to invoke jq with the -n option when using inputs.
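For instance, with the versatile definition of GROUPS_BY and the line above saved together in a file named, say, groups.jq (just a placeholder name), the grouping step could be run as:

jq -n -c -f groups.jq jsonfile

The -c option prints each emitted group on a single line.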
According to my understanding of the problem, the following filter can be used to determine the validity of a group:
def changing(f):
  def c:
    if length <= 1 then true
    elif (.[0] | f) == (.[1] | f) then false
    else .[1:] | c
    end;
  c ;
(The inner function, c, is used here for efficient recursion. Of course, if computing f redundantly is a concern, then a variant definition should be used.)
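One such variant, sketched here without having been put through its paces, precomputes the f-values so that f is applied only once per item:

# A variant of changing/1 that applies f just once per element.
def changing(f):
  [.[] | f] as $v
  | all(range(1; $v|length); $v[.] != $v[. - 1]);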
Putting it all together using the more versatile definition of GROUPS_BY, and assuming we wish to identify the invalid groups, the solution seems to be a two-liner:
GROUPS_BY(inputs; del(.alert))
| select( changing(.alert.status) | not )
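For completeness, here is a sketch of the whole thing gathered into one self-contained program (the file names validate.jq and jsonfile are just placeholders). Against the valid sample it emits nothing; against the invalid sample it emits the offending group(s), one per line:

# validate.jq (placeholder name) -- emit every group whose alert.status fails to flap
def GROUPS_BY(stream; f):
  reduce stream as $x ({};
    ($x|f) as $s
    | ($s|type) as $t
    | (if $t == "string" then $s else ($s|tojson) end) as $y
    | .[$t][$y] += [$x] )
  | .[][]
;

def changing(f):
  def c:
    if length <= 1 then true
    elif (.[0] | f) == (.[1] | f) then false
    else .[1:] | c
    end;
  c ;

GROUPS_BY(inputs; del(.alert))
| select( changing(.alert.status) | not )

It would be invoked as:

jq -n -c -f validate.jq jsonfile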