This would be relatively easy if I only cared about a single min and max for each group, the problem is my requirement is to find the various boundaries. An example data set is as follows:
BoundaryColumn GroupIdentifier 1 A 3 A 4 A 7 A 8 B 9 B 11 B 13 A 14 A 15 A 16 A
What I need from the sql is a result set as follows:
min max groupid
1 7 A
8 11 B
13 16 A
Essentially finding the boundaries for each cluster of the groups.
The data would be stored in either oracle11g or mysql so syntax can be provided for either platform.
A disclaimer: It's a lot easier to query partial results and process something like this with a front-end language. That said...
The following query works for Oracle (which supports analytic queries) but not for MySQL (which does not). There's a SQL Fiddle here.
WITH BoundX AS (
SELECT * FROM (
SELECT
BoundaryColumn,
GroupIdentifier,
LAG(GroupIdentifier) OVER (ORDER BY BoundaryColumn) AS GIDLag,
LEAD(GroupIdentifier) OVER (ORDER BY BoundaryColumn) AS GIDLead
FROM MyTable
ORDER BY BoundaryColumn
)
WHERE GIDLag IS NULL OR GroupIdentifier <> GIDLag
OR GIDLead IS NULL OR GroupIdentifier <> GIDLead
)
SELECT MIN, MAX, GROUPID
FROM (
SELECT
BoundaryColumn AS MIN,
LEAD(BoundaryColumn) OVER (ORDER BY BoundaryColumn) AS MAX,
GroupIdentifier AS GROUPID,
GIDLag,
GIDLead
FROM BoundX
)
WHERE GROUPID = GIDLead
Here's the logic, step by step. You may be able to improve on this, because I get the feeling there's one subquery too many here...
This query pulls the prior and following GroupIdentifier values into each row:
SELECT
BoundaryColumn,
GroupIdentifier,
LAG(GroupIdentifier) OVER (ORDER BY BoundaryColumn) AS GIDLag,
LEAD(GroupIdentifier) OVER (ORDER BY BoundaryColumn) AS GIDLead
FROM MyTable
ORDER BY BoundaryColumn
The result looks like this:
BoundaryColumn GroupIdentifier GIDLag GIDLead
1 A A
3 A A A
4 A A A
7 A A B
8 B A B
9 B B B
11 B B A
13 A B A
14 A A A
15 A A A
16 A A
If you add logic to get rid of all the rows where GIDLag = GIDLead = GroupIdentifier, you'll end up with the boundaries:
WITH BoundX AS (
SELECT * FROM (
SELECT
BoundaryColumn,
GroupIdentifier,
LAG(GroupIdentifier) OVER (ORDER BY BoundaryColumn) AS GIDLag,
LEAD(GroupIdentifier) OVER (ORDER BY BoundaryColumn) AS GIDLead
FROM MyTable
ORDER BY BoundaryColumn
)
WHERE GIDLag IS NULL OR GroupIdentifier <> GIDLag
OR GIDLead IS NULL OR GroupIdentifier <> GIDLead
)
SELECT
BoundaryColumn AS MIN,
LEAD(BoundaryColumn) OVER (ORDER BY BoundaryColumn) AS MAX,
GroupIdentifier AS GROUPID,
GIDLag,
GIDLead
FROM BoundX
With this addition the results are:
MIN MAX GROUPID GIDLAG GIDLEAD
--- --- ------- ------ -------
1 7 A A
7 8 A A B
8 11 B A B
11 13 B B A
13 16 A B A
16 A A
Finally, include only those rows where GroupID = GIDLead. That's the query at the top of this answer. The results are:
MIN MAX GROUPID
--- --- -------
1 7 A
8 11 B
13 16 A
Take a look at this site regarding "runs" of data: http://www.sqlteam.com/article/detecting-runs-or-streaks-in-your-data
Armed with the knowledge provided in that link, you could write a query like this:
SELECT BoundaryColumn,
GroupIdentifier,
(
SELECT COUNT(*)
FROM Table T
WHERE T.GroupIdentifier <> TR.GroupIdentifier
AND T.BoundaryColumn <= TR.BoundaryColumn
) as RunGroup
FROM Table TR
Using this information, you could then group by "RunGroup", and select the GroupIdentifier and min/max BoundaryColumn.
EDIT: I've felt the peer pressure, here's an SQLFiddle with my version of the answer: http://www.sqlfiddle.com/#!8/9a24c/4/0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With