I'm looking to build a regex with multiple optional capture groups to parse a large JSON where the tags can change in each line.
The JSON looks something like this (this is a simplified version):
{"company": 123, "irrelevant": "nonsense", "address": {"address1": "123 London Road", "country": "GB"}}
{"company": 456, "irrelevant": "nonsense", "country": "GB", "Name": "Mary"}
{"company": 789, "irrelevant": "nonsense", "address": {"address1": "123 Paris Road", "country": "FR", "Name": "Joe"}}
{"company": 444}
I want to extract all the data available for each tag. The expected output is below.
groups from row 1: ["123", "123 London Road", "GB", ""]
groups from row 2: ["456", "", "GB", "Mary"]
groups from row 3: ["789", "123 Paris Road", "FR", "Joe"]
groups from row 4: ["444", "", "", ""]
I've tried the following regex, but it only works for records where all tags are present.
(?<=company": )(\d+).+("address1": ".+?").+("country": ".+?").+("name": ".+?")
As soon as I make one of the capture groups optional, the greedy any character token captures the whole string, and not the individual groups.
(?<=company": )(\d+).+("address1": ".+?")?.+("country": ".+?")?.+("name": ".+?")?
How can I change this so that additional capture groups can be added and the regex copes with the inconsistent JSON structure, while still capturing every tag that is present?
Added alternative Out of Order solution at the bottom.
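You have to wrap each segment in its own cluster (non-capturing) group and make that group optional.
Each segment is self-contained: it carries its own lazy .*? skip, its key, and the capture group for its value.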
"company"\s*:\s*(\d+)(?:.*?"address1"\s*:\s*"(.*?)")?(?:.*?"country"\s*:\s*"(.*?)")?(?:.*?"Name"\s*:\s*"(.*?)")?
https://regex101.com/r/Ahueyi/1
"company" \s* : \s*
( \d+ )                       # (1), Company req'd
(?:
    .*?
    "address1" \s* : \s* "
    ( .*? )                   # (2), Addr optional
    "
)?
(?:
    .*?
    "country" \s* : \s* "
    ( .*? )                   # (3), country optional
    "
)?
(?:
    .*?
    "Name" \s* : \s* "
    ( .*? )                   # (4), Name optional
    "
)?
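To check the pattern against the four sample rows from the question, here is a minimal Python sketch. The function name extract_in_order and the ROWS list are only illustrative, and groups that did not participate in the match are turned into empty strings to mirror the expected output.

```python
import re

# In-order pattern from the answer: one required group, three optional clusters.
PATTERN = re.compile(
    r'"company"\s*:\s*(\d+)'
    r'(?:.*?"address1"\s*:\s*"(.*?)")?'
    r'(?:.*?"country"\s*:\s*"(.*?)")?'
    r'(?:.*?"Name"\s*:\s*"(.*?)")?'
)

# Sample rows copied from the question.
ROWS = [
    '{"company": 123, "irrelevant": "nonsense", "address": {"address1": "123 London Road", "country": "GB"}}',
    '{"company": 456, "irrelevant": "nonsense", "country": "GB", "Name": "Mary"}',
    '{"company": 789, "irrelevant": "nonsense", "address": {"address1": "123 Paris Road", "country": "FR", "Name": "Joe"}}',
    '{"company": 444}',
]


def extract_in_order(line):
    """Return [company, address1, country, name]; missing values become ''."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Unmatched optional groups come back as None, so map them to ''.
    return [group or "" for group in match.groups()]


for row in ROWS:
    print(extract_in_order(row))
```

This only works while the keys keep the order address1, country, Name. Once a cluster has been skipped, the keys behind it are still found, but a key that appears earlier than expected is missed.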
Out of Order regex solution.
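A commenter pointed out that the items could appear out of order.
The solution then is to change all (?: clusters to (?= lookahead assertions, so that each key is searched for independently, starting from the position just after the company number.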
"company"\s*:\s*(\d+)(?=.*?"address1"\s*:\s*"(.*?)")?(?=.*?"country"\s*:\s*"(.*?)")?(?=.*?"Name"\s*:\s*"(.*?)")?
https://regex101.com/r/gM6pdJ/1
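A quantifier placed directly on a lookahead is treated differently across regex engines, and some reject it. The Python sketch below therefore uses a slightly different spelling of the same idea, not the exact pattern above: the optional cluster is moved inside an always-succeeding lookahead. The shuffled row and the function name extract_any_order are made up for illustration.

```python
import re

# Variant of the out-of-order idea: each lookahead always succeeds, and the
# optional cluster inside it captures the value when the key exists.
OUT_OF_ORDER = re.compile(
    r'"company"\s*:\s*(\d+)'
    r'(?=(?:.*?"address1"\s*:\s*"(.*?)")?)'
    r'(?=(?:.*?"country"\s*:\s*"(.*?)")?)'
    r'(?=(?:.*?"Name"\s*:\s*"(.*?)")?)'
)


def extract_any_order(line):
    """Return [company, address1, country, name]; missing values become ''."""
    match = OUT_OF_ORDER.search(line)
    if match is None:
        return None
    return [group or "" for group in match.groups()]


# Hypothetical row with the keys in a different order than the question's data.
shuffled = '{"company": 555, "Name": "Ann", "country": "DE", "address": {"address1": "1 Berlin Strasse"}}'
print(extract_any_order(shuffled))
```

Because every lookahead scans forward from the end of the company number, a key that appears before "company" in the line is still not captured.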
It is generally a bad idea to try to parse JSON (or other well-structured data) with a regex. There are dedicated tools for that, such as JSON deserializers, which make the code much cleaner and easier to follow.
Deserializing also lets you apply more parsing logic to that data.
See the PowerShell script below:
# Define JSON content as a single string (here-string)
$jsonData = @'
{"company": 123, "irrelevant": "nonsense", "address": {"address1": "123 London Road", "country": "GB"}}
{"company": 456, "irrelevant": "nonsense", "country": "GB", "Name": "Mary"}
{"company": 789, "irrelevant": "nonsense", "address": {"address1": "123 Paris Road", "country": "FR", "Name": "Joe"}}
{"company": 444}
'@

# Split the content into individual JSON lines
$lines = $jsonData -split "`n"

# Loop through each JSON line
foreach ($line in $lines) {
    $line = $line.Trim()
    try {
        $obj = $line | ConvertFrom-Json

        # Extract values safely; missing keys become empty strings
        $company = $obj.company
        $address1 = if ($obj.address -and $obj.address.address1) { $obj.address.address1 } else { "" }

        # country and Name may sit inside "address" or at the top level
        $country = if ($obj.address -and $obj.address.country) {
            $obj.address.country
        } elseif ($obj.country) {
            $obj.country
        } else {
            ""
        }
        $name = if ($obj.address -and $obj.address.Name) {
            $obj.address.Name
        } elseif ($obj.Name) {
            $obj.Name
        } else {
            ""
        }

        # Print tab-separated output
        Write-Output "$company`t$address1`t$country`t$name"
    } catch {
        Write-Warning "Failed to parse line: $line"
    }
}
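For readers not using PowerShell, the same deserialization idea can be sketched in Python with the standard json module. The helper names find_key and extract_fields are hypothetical, and the sketch assumes each line holds one complete JSON object, with the wanted keys either at the top level or inside "address".

```python
import json

# Sample rows copied from the question.
ROWS = [
    '{"company": 123, "irrelevant": "nonsense", "address": {"address1": "123 London Road", "country": "GB"}}',
    '{"company": 456, "irrelevant": "nonsense", "country": "GB", "Name": "Mary"}',
    '{"company": 789, "irrelevant": "nonsense", "address": {"address1": "123 Paris Road", "country": "FR", "Name": "Joe"}}',
    '{"company": 444}',
]


def find_key(obj, key):
    """Look for key inside "address" first, then at the top level; '' if absent."""
    address = obj.get("address")
    if isinstance(address, dict) and key in address:
        return address[key]
    return obj.get(key, "")


def extract_fields(line):
    """Parse one JSON line and return [company, address1, country, name]."""
    obj = json.loads(line)
    return [
        str(obj.get("company", "")),
        find_key(obj, "address1"),
        find_key(obj, "country"),
        find_key(obj, "Name"),
    ]


for row in ROWS:
    try:
        print("\t".join(extract_fields(row)))
    except json.JSONDecodeError:
        print(f"Failed to parse line: {row}")
```

Key order inside a line does not matter here, which is the main practical advantage over the regex versions.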