Parsing JSON lines with JQ for flapping key values in sequence

Tags: json, jq

I have a file of JSON lines whose validity needs to be verified, based on how each JSON object's "alert.status" value flaps across the sequence of lines.

A sample of valid json lines:

{"id":123,"code":"foo","severity":"Critical","severityCode":1, "property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"On"}}
{"id":456,"code":"bar","severity":"High","severityCode":2,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"On"}}
{"id":123,"code":"foo","severity":"Critical","severityCode":1,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"Off"}}
{"id":456,"code":"bar","severity":"High","severityCode":2,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"Off"}}
{"id":123,"code":"foo","severity":"Critical","severityCode":1,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"On"}}
{"id":456,"code":"bar","severity":"High","severityCode":2,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"On"}}

The above file is valid because the duplicate JSON objects (lines 1, 3, 5 and lines 2, 4, 6) have a status that flaps from "On" to "Off" to "On", and so on.

A sample of invalid json lines:

{"id":123,"code":"foo","severity":"Critical","severityCode":1, "property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"On"}}
{"id":456,"code":"bar","severity":"High","severityCode":2,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"On"}}
{"id":123,"code":"foo","severity":"Critical","severityCode":1,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"On"}}
{"id":456,"code":"bar","severity":"High","severityCode":2,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"Off"}}
{"id":123,"code":"foo","severity":"Critical","severityCode":1,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"Off"}}
{"id":456,"code":"bar","severity":"High","severityCode":2,"property":{ "priority":"top", "owner":"dev"}, "alert":{"mgmt":"yes", "status":"Off"}}

The above is invalid because the JSON objects in lines 1 and 3 are duplicates whose "status" value stays "On" instead of flapping between "On" and "Off" (the same holds for lines 4 and 6, which both stay "Off").

I tried to use jq to read the JSON lines into a JSON array:

jq --slurp 'map(select(. >= 2))' jsonfile > jsonarray

But since the sequence in each line is important, I don't think I can use group_by to look for duplicates (the group_by's result is sorted).

I'm thinking about inserting a new key with an incremental number into each JSON object so that, after using group_by, the result can be sorted on this new key to restore the original sequence.

Is there a way in jq to group by all keys except two (in this case "status" and the new key with the incremental number)?

Is there a better approach to solving this problem?

Thanks so much for your help!

asked Dec 02 '25 20:12 by M.Ridha


1 Answer

"I don't think I can use group_by to look for duplicates (the group_by's result is sorted)."

That's right, but it's very easy to define a non-sorting "group_by", which, as we'll see, can also easily be used to group by all keys except for specifically designated ones.

GROUPS_BY

First, here is a simple filter which retains the original order of items within each group:

# The filter, f, must produce a string for each item in `stream`
def GROUPS_BY(stream; f):
  reduce stream as $x ({}; .[$x|f] += [$x] ) | .[] ;

The "S" in the name emphasizes that the function is stream-oriented in two ways: its first argument is a stream, and it produces a stream of groups. The name is upper-cased to distinguish it from the built-in group_by.

Example

To illustrate how this can be used to group by all but a specific key, consider this example (taken from another SO question):

def data:
  [{"foo":1,"bar":"a","baz":"whatever"},
   {"foo":1,"bar":"a","baz":"hello"},
   {"foo":1,"bar":"b","baz":"world"}] ;

GROUPS_BY(data[]; del(.baz) | tostring)

Output

[{"foo":1,"bar":"a","baz":"whatever"},{"foo":1,"bar":"a","baz":"hello"}]
[{"foo":1,"bar":"b","baz":"world"}]

Refinement

It may be objected that requiring that f always be string-valued introduces several potential difficulties, so here is an efficient but more versatile definition:

# Emit a stream of the groups defined by f, without using sort.
# f need not be string-valued.
def GROUPS_BY(stream; f): 
   reduce stream as $x ({};
     ($x|f) as $s
     | ($s|type) as $t
     | (if $t == "string" then $s else ($s|tojson) end) as $y
     | .[$t][$y] += [$x] )
   | .[][]
   ;

Now we can simply write:

GROUPS_BY(data[]; del(.baz))
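
As a rough illustration of why the refined definition keeps a separate bucket per type, the sketch below groups a made-up stream (not from the question) in which the number 1 and the string "1" would otherwise collide on the same key text. It assumes jq is installed and on the PATH.

```shell
grouped=$(jq -n -c '
  def GROUPS_BY(stream; f):
    reduce stream as $x ({};
      ($x|f) as $s
      | ($s|type) as $t
      | (if $t == "string" then $s else ($s|tojson) end) as $y
      | .[$t][$y] += [$x] )
    | .[][] ;

  # The number 1 and the string "1" share the key text "1",
  # but they land in different type buckets.
  GROUPS_BY(1, "1", 1, "1"; .)
')
echo "$grouped"
# [1,1]
# ["1","1"]
```

Within each group the items keep the order in which they appeared in the stream.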

Usage with the JSON-Lines file

The simplest way to use GROUPS_BY with a JSON-Lines file is with inputs. For example, assuming the more versatile definition is used, you'd write:

GROUPS_BY(inputs; del(.alert))

Don't forget to invoke jq with the -n option when using inputs.
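
To see what goes wrong without -n, here is a small sketch using a throwaway stream of numbers rather than the alert data. It assumes jq is installed and on the PATH.

```shell
# Without -n, jq consumes the first value as `.` before the program runs,
# so `inputs` only sees the remaining values.
without_n=$(printf '1\n2\n3\n' | jq -c '[inputs]')

# With -n, nothing is consumed up front and `inputs` sees every value.
with_n=$(printf '1\n2\n3\n' | jq -n -c '[inputs]')

echo "without -n: $without_n"   # without -n: [2,3]
echo "with -n:    $with_n"      # with -n:    [1,2,3]
```

Applied to the JSON-Lines file, omitting -n would silently drop the first line from the grouping.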

A filter to determine validity

According to my understanding of the problem, the following filter can be used to determine validity of a group:

def changing(f):
  def c:
    if length <= 1 then true
    elif (.[0] | f) == (.[1] | f) then false
    else .[1:] | c
    end;
  c ;

(The inner function, c, is used here for efficient recursion. Of course, if computing f redundantly is a concern, then a variant definition should be used.)
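
A quick way to check changing in isolation is to run it on two hand-written arrays, each standing in for one group. The arrays and the key name s are made up for this sketch, which assumes jq is installed and on the PATH.

```shell
result=$(jq -n -c '
  def changing(f):
    def c:
      if length <= 1 then true
      elif (.[0] | f) == (.[1] | f) then false
      else .[1:] | c
      end;
    c ;

  # First array: On, Off, On alternates.
  # Second array: On, On, Off repeats a value.
  [ {"s":"On"}, {"s":"Off"}, {"s":"On"} ],
  [ {"s":"On"}, {"s":"On"},  {"s":"Off"} ]
  | changing(.s)
')
echo "$result"
# true
# false
```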

Solution

Putting it all together using the more versatile definition of GROUPS_BY, and assuming we wish to identify the invalid groups, the solution seems to be a two-liner:

GROUPS_BY(inputs; del(.alert))
| select( changing(.alert.status) | not )
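
For reference, an end-to-end run might look like the sketch below. It uses a trimmed copy of the question's invalid sample (some fields dropped for brevity) and adds a final summarising step that is not part of the two-liner, so that each offending group prints on one short line. It assumes jq is installed and on the PATH.

```shell
# Trimmed copy of the invalid sample from the question.
data='{"id":123,"code":"foo","alert":{"mgmt":"yes","status":"On"}}
{"id":456,"code":"bar","alert":{"mgmt":"yes","status":"On"}}
{"id":123,"code":"foo","alert":{"mgmt":"yes","status":"On"}}
{"id":456,"code":"bar","alert":{"mgmt":"yes","status":"Off"}}
{"id":123,"code":"foo","alert":{"mgmt":"yes","status":"Off"}}
{"id":456,"code":"bar","alert":{"mgmt":"yes","status":"Off"}}'

invalid=$(printf '%s\n' "$data" | jq -n -c '
  def GROUPS_BY(stream; f):
    reduce stream as $x ({};
      ($x|f) as $s
      | ($s|type) as $t
      | (if $t == "string" then $s else ($s|tojson) end) as $y
      | .[$t][$y] += [$x] )
    | .[][] ;

  def changing(f):
    def c:
      if length <= 1 then true
      elif (.[0] | f) == (.[1] | f) then false
      else .[1:] | c
      end;
    c ;

  GROUPS_BY(inputs; del(.alert))
  | select( changing(.alert.status) | not )
  # Extra step for this sketch: summarise each invalid group.
  | {id: .[0].id, statuses: map(.alert.status)}
')
echo "$invalid"
# {"id":123,"statuses":["On","On","Off"]}
# {"id":456,"statuses":["On","Off","Off"]}
```

Empty output would mean that every group alternates, i.e. the file is valid.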

answered Dec 05 '25 21:12 by peak