I get the following input that I want to split into four parts:
-
KPDX 021453Z 16004KT 10SM FEW007 SCT060 BKN200 11/09 A3002 RMK
     AO2 SLP166 T01060094 55008
TAF AMD KPDX 021453Z 0215/0312 10005KT P6SM FEW006 SCT060 BKN150
     FM021800 11005KT P6SM SCT050 OVC100
     FM022200 11007KT P6SM -RA OVC050
     FM030500 12005KT P6SM -RA OVC035
KSEA 021453Z 15003KT 10SM FEW035 BKN180 11/09 A3001 RMK AO2
     SLP168 60000 T01110089 58010
TAF AMD KSEA 021501Z 0215/0318 14004KT P6SM SCT020 BKN150
     FM021800 16005KT P6SM SCT025 OVC090
     FM030100 19005KT P6SM OVC070
     FM030200 15005KT P6SM -RA OVC045
     FM030600 16007KT P6SM -RA BKN025 OVC045
It's a METAR, then a TAF, then a METAR, then a TAF.
Input rules:
I want to grab each report by itself, so I'm using the regex ^(\\w+.*?)(?:^\\b|\\Z) in the following code:
ArrayList<String> reports = new ArrayList<String>();
Pattern pattern = Pattern.compile( "^(\\w+.*?)(?:^\\b|\\Z)", Pattern.DOTALL|Pattern.MULTILINE );
Matcher matcher = pattern.matcher( input );
while( matcher.find() )
    reports.add( matcher.group( 1 ).trim() );
It works great, I get 4 results:
1:
KPDX 021453Z 16004KT 10SM FEW007 SCT060 BKN200 11/09 A3002 RMK
     AO2 SLP166 T01060094 55008
2:
TAF AMD KPDX 021453Z 0215/0312 10005KT P6SM FEW006 SCT060 BKN150
     FM021800 11005KT P6SM SCT050 OVC100
     FM022200 11007KT P6SM -RA OVC050
     FM030500 12005KT P6SM -RA OVC035
3:
KSEA 021453Z 15003KT 10SM FEW035 BKN180 11/09 A3001 RMK AO2
     SLP168 60000 T01110089 58010
4:
TAF AMD KSEA 021501Z 0215/0318 14004KT P6SM SCT020 BKN150
     FM021800 16005KT P6SM SCT025 OVC090
     FM030100 19005KT P6SM OVC070
     FM030200 15005KT P6SM -RA OVC045
     FM030600 16007KT P6SM -RA BKN025 OVC045
I have encountered a case where my regex fails. Occasionally, a TAF line runs too long and gets wrapped (I have no control over this), so the input might look like this (notice the "BKN150" on its own line right below "TAF AMD KPDX"):
-
KPDX 021453Z 16004KT 10SM FEW007 SCT060 BKN200 11/09 A3002 RMK
     AO2 SLP166 T01060094 55008
TAF AMD KPDX 021453Z 0215/0312 10005KT P6SM FEW006 SCT060
BKN150
     FM021800 11005KT P6SM SCT050 OVC100
     FM022200 11007KT P6SM -RA OVC050
     FM030500 12005KT P6SM -RA OVC035
KSEA 021453Z 15003KT 10SM FEW035 BKN180 11/09 A3001 RMK AO2
     SLP168 60000 T01110089 58010
TAF AMD KSEA 021501Z 0215/0318 14004KT P6SM SCT020 BKN150
     FM021800 16005KT P6SM SCT025 OVC090
     FM030100 19005KT P6SM OVC070
     FM030200 15005KT P6SM -RA OVC045
     FM030600 16007KT P6SM -RA BKN025 OVC045
When this happens, I get 5 results:
1:
KPDX 021453Z 16004KT 10SM FEW007 SCT060 BKN200 11/09 A3002 RMK
     AO2 SLP166 T01060094 55008
2:
TAF AMD KPDX 021453Z 0215/0312 10005KT P6SM FEW006 SCT060
3:
BKN150
     FM021800 11005KT P6SM SCT050 OVC100
     FM022200 11007KT P6SM -RA OVC050
     FM030500 12005KT P6SM -RA OVC035
4:
KSEA 021453Z 15003KT 10SM FEW035 BKN180 11/09 A3001 RMK AO2
     SLP168 60000 T01110089 58010
5:
TAF AMD KSEA 021501Z 0215/0318 14004KT P6SM SCT020 BKN150
     FM021800 16005KT P6SM SCT025 OVC090
     FM030100 19005KT P6SM OVC070
     FM030200 15005KT P6SM -RA OVC045
     FM030600 16007KT P6SM -RA BKN025 OVC045
Can anyone figure out a regex that will correctly split this odd case? Alternatively I could try to remove the problem line break in the input string before running the regex on it, but I can't figure out how to detect it.
You could start with a line that begins with a word character, then require at least one line that starts with five spaces (you could easily loosen that condition to a single whitespace character), and then continue until the next line that starts with a word character.
"^(\\w+.*?^[ ]{5}.*?)(?:^\\b|\\Z)"
The square brackets around the space are not necessary, but I like to include them for readability. If you only want to assert that there is a line that starts with any whitespace, replace [ ]{5} with \\s.
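As a rough check, the sketch below runs both the original regex and the five-space variant over the wrapped sample from the question. The class name WrappedTafDemo and the helper split are made up for illustration; only the two patterns and the sample text come from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
final class WrappedTafDemo {

    // Sample from the question, including the wrapped "BKN150" line.
    static final String INPUT = String.join("\n",
        "KPDX 021453Z 16004KT 10SM FEW007 SCT060 BKN200 11/09 A3002 RMK",
        "     AO2 SLP166 T01060094 55008",
        "TAF AMD KPDX 021453Z 0215/0312 10005KT P6SM FEW006 SCT060",
        "BKN150",
        "     FM021800 11005KT P6SM SCT050 OVC100",
        "     FM022200 11007KT P6SM -RA OVC050",
        "     FM030500 12005KT P6SM -RA OVC035",
        "KSEA 021453Z 15003KT 10SM FEW035 BKN180 11/09 A3001 RMK AO2",
        "     SLP168 60000 T01110089 58010",
        "TAF AMD KSEA 021501Z 0215/0318 14004KT P6SM SCT020 BKN150",
        "     FM021800 16005KT P6SM SCT025 OVC090",
        "     FM030100 19005KT P6SM OVC070",
        "     FM030200 15005KT P6SM -RA OVC045",
        "     FM030600 16007KT P6SM -RA BKN025 OVC045") + "\n";

    // Collects the trimmed contents of group 1 for every match of the regex.
    static List<String> split(String regex, String input) {
        List<String> reports = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.DOTALL | Pattern.MULTILINE)
                           .matcher(input);
        while (m.find()) {
            reports.add(m.group(1).trim());
        }
        return reports;
    }

    public static void main(String[] args) {
        String original = "^(\\w+.*?)(?:^\\b|\\Z)";
        String revised  = "^(\\w+.*?^[ ]{5}.*?)(?:^\\b|\\Z)";
        System.out.println(split(original, INPUT).size()); // 5
        System.out.println(split(revised, INPUT).size());  // 4
    }
}
```

With the revised pattern, the wrapped "BKN150" line stays inside the KPDX TAF because a report may only end after it has seen at least one indented line.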
Note that you do not have to use the capturing group. A lookahead will make sure that you end at a position that is followed by either a new report or the end of the file:
"^\\w+.*?^[ ]{5}.*?(?=^\\b|\\Z)"
This is slightly more efficient and cleans up the following code a bit (because you can use the full match instead of retrieving the group).
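A minimal sketch of the lookahead form, assuming the same DOTALL and MULTILINE flags as in the question. The class name LookaheadSplitDemo and the method split are hypothetical; the input is the KPDX part of the wrapped sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
final class LookaheadSplitDemo {

    // Same pattern as above, but the end condition is a zero-width lookahead.
    static final Pattern REPORT = Pattern.compile(
        "^\\w+.*?^[ ]{5}.*?(?=^\\b|\\Z)",
        Pattern.DOTALL | Pattern.MULTILINE);

    static List<String> split(String input) {
        List<String> reports = new ArrayList<>();
        Matcher m = REPORT.matcher(input);
        while (m.find()) {
            // No capturing group needed: the whole match is the report.
            reports.add(m.group().trim());
        }
        return reports;
    }

    public static void main(String[] args) {
        String input = String.join("\n",
            "KPDX 021453Z 16004KT 10SM FEW007 SCT060 BKN200 11/09 A3002 RMK",
            "     AO2 SLP166 T01060094 55008",
            "TAF AMD KPDX 021453Z 0215/0312 10005KT P6SM FEW006 SCT060",
            "BKN150",
            "     FM021800 11005KT P6SM SCT050 OVC100",
            "     FM022200 11007KT P6SM -RA OVC050",
            "     FM030500 12005KT P6SM -RA OVC035");
        for (String report : split(input)) {
            System.out.println(report);
            System.out.println("--");
        }
    }
}
```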
Update:
To accommodate the possibility of single-line reports (and in general), it is even easier to change the ending condition ^\\b so that it matches the beginning of a new report. According to the format description given in the comment, you could use:
"^\\w+.*?(?=^(?:SPECI\\s|TAF\\sAMD\\s)?[A-Z]{3,4}\\s\\d+Z|\\Z)"
This requires a new report to start with either "optional SPECI"-"3 or 4 letters"-"timestamp" or "optional TAF AMD"-"3 or 4 letters"-"timestamp".
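To illustrate the single-line case, here is a small sketch using that pattern. The class name HeaderSplitDemo and the method split are hypothetical, and the one-line KPDX METAR is an assumed input made by cutting the sample report short; it does not appear in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
final class HeaderSplitDemo {

    // A report ends where the next report header (or the end of input) begins.
    static final Pattern REPORT = Pattern.compile(
        "^\\w+.*?(?=^(?:SPECI\\s|TAF\\sAMD\\s)?[A-Z]{3,4}\\s\\d+Z|\\Z)",
        Pattern.DOTALL | Pattern.MULTILINE);

    static List<String> split(String input) {
        List<String> reports = new ArrayList<>();
        Matcher m = REPORT.matcher(input);
        while (m.find()) {
            reports.add(m.group().trim());
        }
        return reports;
    }

    public static void main(String[] args) {
        String input = String.join("\n",
            // Assumed single-line METAR (shortened from the sample).
            "KPDX 021453Z 16004KT 10SM FEW007 SCT060 BKN200 11/09 A3002",
            "TAF AMD KPDX 021453Z 0215/0312 10005KT P6SM FEW006 SCT060",
            "BKN150",
            "     FM021800 11005KT P6SM SCT050 OVC100",
            "KSEA 021453Z 15003KT 10SM FEW035 BKN180 11/09 A3001 RMK AO2",
            "     SLP168 60000 T01110089 58010");
        System.out.println(split(input).size()); // 3
    }
}
```

The wrapped "BKN150" line is not mistaken for a header because "BKN" is followed by a digit rather than by whitespace and a timestamp ending in Z.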