Disclaimer: This is my first post. Feel free to give me feedback on how I should or shouldn't have formatted this question. Thanks!
I'm looking to pull out data from text blocks by capturing anything that matches a pattern of a date format followed by a colon. I have successfully used regular expressions to capture information including an observation date, a colon, and any text that follows up to the period before the next date.
For example:
1999-01-01: 10 birds observed.
The problem is that some of my data contains site names, each followed by a colon, inside the observation text that follows the observation date and its colon. This 'sitename: data' sub-pattern can occur zero or more times within the block following an observation date.
For example:
1999-01-01: BS-001: 5 birds observed. All in good health. BS-002: 5 birds observed, some in poor health.
What pattern should I use to capture all text after the date format and colon, including the potential site names, their colons, and related data up to the period before the next observation date?
I currently extract the simple observation data (without multiple sites within them) by date and observation using the following pattern:
pattern = re.compile(r'(\d\d\d\d\-*\s*\&*\d+\-*\d*:[A-Za-z0-9\s\,\(\)\;\"\-]*\.*)')
The code above lets me pull out observation dates that could be in a variety of forms. Using periods as part of the pattern is tricky since observation data could be one or many sentences.
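To make the failure concrete, here is a small sketch that runs the pattern above on the two example lines from this question. The variable names are only for illustration. The character class after the first colon does not include ':', so the match appears to stop at the colon that follows the first site name:

```python
import re

# Pattern copied from the question. The character class after the
# first colon contains no ':', so a site name's colon ends the match.
pattern = re.compile(r'(\d\d\d\d\-*\s*\&*\d+\-*\d*:[A-Za-z0-9\s\,\(\)\;\"\-]*\.*)')

simple = '1999-01-01: 10 birds observed.'
multi_site = '1999-01-01: BS-001: 5 birds observed. All in good health.'

simple_matches = pattern.findall(simple)          # whole observation is captured
multi_site_matches = pattern.findall(multi_site)  # cut off at the colon after the site name

print(simple_matches)
print(multi_site_matches)
```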
Here is an example of the text I am trying to search and split out. Each new match should begin with an observation date, so in the data below there should be 3 matches returned (2013-04-13: data, 2017-01-01: data, and 2018-07-04: data):
2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat. Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing in the masses were AMJE-like). BS-443: 3 egg masses observed in vernal pool habitat. A few egg masses may have been missed due to poor light conditions. Smith-019: 250 egg masses observed in vernal pool habitat. Observer searched only portions abutting the road (SW margin of pool). Many AMJE masses observed attached to herbaceous vegetation and difficult to differentiate from one another. AMJE egg-mass count is a rough estimate within area searched. 2017-01-01: 23 individuals observed. Egg masses were not present. 2018-07-04: BS-440: All individuals took a break from breeding for the long holiday weekend.
Ideally the output would look like this:
2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat. Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing in the masses were AMJE-like). BS-443: 3 egg masses observed in vernal pool habitat. A few egg masses may have been missed due to poor light conditions. Smith-019: 250 egg masses observed in vernal pool habitat. Observer searched only portions abutting the road (SW margin of pool). Many AMJE masses observed attached to herbaceous vegetation and difficult to differentiate from one another. AMJE egg-mass count is a rough estimate within area searched.
2017-01-01: 23 individuals observed. Egg masses were not present.
2018-07-04: BS-440: All individuals took a break from breeding for the long holiday weekend.
Basically, it sounds like you want to separate your text into fields that start with a date and end just before a date or the end of the text. Here's one possibility:
\d{4}-\d\d-\d\d: # date with colon
.*? # the minimal amount of any characters required to match
(?= # positive lookahead (match text but don't consume it)
\d{4}-\d\d-\d\d: # date with colon
| # or
$ # end of text
) # end lookahead
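Use it in conjunction with re.findall() (add the re.DOTALL flag if your text contains line breaks, since . does not match a newline by default):
re.findall(r'\d{4}-\d\d-\d\d:.*?(?=\d{4}-\d\d-\d\d:|$)', mytext)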
Run against your sample text above:
['2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat.
Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk
old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing
in the masses were AMJE-like). BS-443: 3 egg masses observed in
vernal pool habitat. A few egg masses may have been missed due to
poor light conditions. Smith-019: 250 egg masses observed in
vernal pool habitat. Observer searched only portions abutting the
road (SW margin of pool). Many AMJE masses observed attached
to herbaceous vegetation and difficult to differentiate from
one another. AMJE egg-mass count is a rough estimate within
area searched. ',
'2017-01-01: 23 individuals observed. Egg masses were not present. ',
'2018-07-04: BS-440: All individuals took a break from breeding for
the long holiday weekend.']
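For reference, the same idea can be written as a compiled verbose pattern. This is only a sketch under a few assumptions: dates are always in YYYY-MM-DD form, the names DATE_BLOCK and split_observations are made up for illustration, and the lookahead is extended with \s* so the trailing space seen in the output above is left out of each match. re.DOTALL is added so records may span line breaks.

```python
import re

# Hypothetical helper names; assumes dates always look like YYYY-MM-DD.
DATE_BLOCK = re.compile(
    r"""
    \d{4}-\d\d-\d\d:          # date with colon
    .*?                       # minimal run of any characters
    (?=                       # lookahead: stop just before ...
        \s*\d{4}-\d\d-\d\d:   #   optional whitespace and the next date
      | \s*\Z                 #   or trailing whitespace and end of text
    )
    """,
    re.VERBOSE | re.DOTALL,
)


def split_observations(text):
    """Return one string per observation date found in text."""
    return DATE_BLOCK.findall(text)


sample = (
    "1999-01-01: BS-001: 5 birds observed. All in good health. "
    "BS-002: 5 birds observed, some in poor health. "
    "2017-01-01: 23 individuals observed. Egg masses were not present."
)
records = split_observations(sample)
for record in records:
    print(record)
```

Because the site names are not matched explicitly, any number of 'sitename: data' pieces can appear inside a record; only the next date (or the end of the text) ends it.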
You can try replacing every run of whitespace that is followed by a date with two newline characters:
s = re.sub(r'\s+(?=\d{4}-*\s*&*\d+-*\d*:)', "\n\n", s)
This way you don't match the first date at the beginning of the string.
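If you are not sure that every date is preceded by whitespace, you can make the whitespace optional while still skipping the start of the string. Note that since Python 3.7, re.sub() also replaces an empty match that directly follows a non-empty one, so a plain \s* here would insert the separator twice before each date; requiring either some whitespace or a preceding non-whitespace character avoids that:
s = re.sub(r'(?:\s+|(?<=\S))(?=\d{4}-*\s*&*\d+-*\d*:)', "\n\n", s)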
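As a rough sketch of how the substitution approach can produce a list of records, the whitespace-based variant can be followed by a split on the inserted separator. The names NEXT_DATE and split_on_dates are hypothetical, and this assumes the text does not already contain blank lines, since those would be split as well.

```python
import re

# Lookahead taken from the whitespace-based substitution above;
# it accepts loosely formatted dates followed by a colon.
NEXT_DATE = r"(?=\d{4}-*\s*&*\d+-*\d*:)"


def split_on_dates(text):
    """Put a blank line before each date, then split on blank lines."""
    marked = re.sub(r"\s+" + NEXT_DATE, "\n\n", text)
    return marked.split("\n\n")


sample = (
    "1999-01-01: BS-001: 5 birds observed. All in good health. "
    "BS-002: 5 birds observed, some in poor health. "
    "2017-01-01: 23 individuals observed. Egg masses were not present."
)
parts = split_on_dates(sample)
for part in parts:
    print(part)
```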