When formatting data for time-series analysis, a common need is to impute missing values by filling-forward values over time (also called Last-Observation-Carried-Forward / LOCF ).
While data-analysis environments typically provide that functionality (e.g. Pandas fillna() ), for larger data-sets it can be more efficient to compute it in SQL (e.g. utilizing a data-parallel data-warehouse appliance).
For example, consider:
    | UNIT | TIME | VALUE |
    |------|------|-------|
    | 1    | 1    | NULL  |
    | 1    | 2    | .5    |
    | 1    | 3    | NULL  |
    | 1    | 4    | NULL  |
    | 1    | 5    | .2    |
    | 1    | 6    | NULL  |
    | 2    | 1    | .6    |
    | 2    | 2    | NULL  |
which, after filling-forward the VALUE column over TIME (independently for each UNIT) yields:
    | UNIT | TIME | VALUE |
    |------|------|-------|
    | 1    | 1    | NULL  |
    | 1    | 2    | .5    |
    | 1    | 3    | .5    |
    | 1    | 4    | .5    |
    | 1    | 5    | .2    |
    | 1    | 6    | .2    |
    | 2    | 1    | .6    |
    | 2    | 2    | .6    |
(Notice the initial NULL for UNIT 1 can't be imputed as there is no prior value)
Time could also be a timestamp or datetime type column.
With some databases, e.g., Postgres, you can define your own aggregation functions. LOCF is just a running COALESCE.
CREATE OR REPLACE FUNCTION locf_state( FLOAT, FLOAT )
RETURNS FLOAT
LANGUAGE SQL
AS $f$
  SELECT COALESCE($2,$1)
$f$;
CREATE AGGREGATE locf(FLOAT) (
  SFUNC = locf_state,
  STYPE = FLOAT
);
The query is then more readable:
SELECT unit, time, 
       locf(value) OVER( PARTITION BY unit ORDER BY time )
FROM   mytable;
SQLFiddle: http://sqlfiddle.com/#!15/2c73b/1/0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With