I have data where I am trying to identify patterns from. However the data in each table isn't complete (there are missing rows). I would like to separate the table into chunks of complete data and then identify the patterns from each. I have a column where I can use to identify if the data is complete or not called sequence
.
Data will look like:
Sequence Position
1 open
2 closed
3 open
4 open
5 closed
8 closed
9 open
11 open
13 closed
14 open
15 open
18 closed
19 open
20 closed
First I'd like to split the data into complete sections:
Sequence Position
1 open
2 closed
3 open
4 open
5 closed
---------------------------
8 closed
9 open
---------------------------
11 open
---------------------------
13 closed
14 open
15 open
---------------------------
18 closed
19 open
20 closed
Then I'd like to identify the pattern closed open, ..., open, closed
such that we go from closed to open for n rows (where n is at least 1) and then back to closed
From the sample data this would leave:
Sequence Position
2 closed
3 open
4 open
5 closed
---------------------------
18 closed
19 open
20 closed
This leaves my final table where I can perform analysis on as I know there are no broken sequences. I also have another column where the position
is binary if that is easier to work with.
The tables are large, so although I think I can write loops to figure out my result, I don't think that method would be efficient enough. Alternatively I was going to pull the whole table into R
, then find the result table but this requires pulling everything into R
first so I'm wondering if this is feasible in SQL
EDIT: Different sample data that's more representative:
Sequence Position
1 open
2 closed
3 open
4 open
5 closed
8 closed
9 open
11 open
13 closed
14 open
15 open
18 closed
19 open
20 closed
21 closed
22 closed
23 closed
24 open
25 open
26 closed
27 open
Note this should have the same results but also with
23 closed
24 open
25 open
26 closed
21
, 22
and 27
are not as they dont fit the closed
, open
...,open
, closed
pattern
BUT if we had 28 closed
we would want 27
and 28
as there is no time gap and the pattern would fit. If instead of 28
it was 29 closed
we wouldnt want 27
or 29
(because although the pattern is right the sequence breaks).
To add some context, think of a machine that goes from stop, to running, to stopped. We record the data, but have gaps in the recording which here are represented by the breaking of the sequences. As well as missing data in the middle of the stop running stop cycle; the data also sometimes starts recording when the machine is already running or stops recording before the machine stops. I don't want that data as it is not a complete cycle of stop, running, stop. I only want those complete cycles, and where the sequence was continuous. This means I can transform my original data set into one with only complete cycles one after the other.
You can use it.
DECLARE @MyTable TABLE (Sequence INT, Position VARCHAR(10))
INSERT INTO @MyTable
VALUES
(1,'open'),
(2,'closed') ,
(3,'open'),
(4,'open'),
(5,'closed'),
(8,'closed'),
(9,'open'),
(11,'open'),
(13,'closed'),
(14,'open') ,
(15,'open'),
(18,'closed'),
(19,'open'),
(20,'closed'),
(21,'closed'),
(22,'closed'),
(23,'closed'),
(24,'open'),
(25,'open'),
(26,'closed'),
(27,'open')
;WITH CTE AS(
SELECT * ,
CASE WHEN Position ='closed' AND LAG(Position) OVER(ORDER BY [Sequence]) ='closed' THEN 1 ELSE 0 END CloseMark
FROM @MyTable
)
,CTE_2 AS
(
SELECT
[New_Sequence] = [Sequence] + (SUM(CloseMark) OVER(ORDER BY [Sequence] ROWS UNBOUNDED PRECEDING ))
, [Sequence]
, Position
FROM CTE
)
,CTE_3 AS (
SELECT *,
RN = ROW_NUMBER() OVER(ORDER BY [New_Sequence])
FROM CTE_2
)
,CTE_4 AS
(
SELECT ([New_Sequence] - RN) G
, MIN(CASE WHEN Position = 'closed' THEN [Sequence] END) MinCloseSq
, MAX(CASE WHEN Position = 'closed' THEN [Sequence] END) MaxCloseSq
FROM CTE_3
GROUP BY ([New_Sequence] - RN)
)
SELECT
CTE.Sequence, CTE.Position
FROM CTE_4
INNER JOIN CTE ON (CTE.Sequence BETWEEN CTE_4.MinCloseSq AND CTE_4.MaxCloseSq)
WHERE
CTE_4.MaxCloseSq > CTE_4.MinCloseSq
AND (CTE_4.MaxCloseSq IS NOT NULL AND CTE_4.MinCloseSq IS NOT NULL)
Result:
Sequence Position
----------- ----------
2 closed
3 open
4 open
5 closed
--- ---
18 closed
19 open
20 closed
--- ---
23 closed
24 open
25 open
26 closed
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With