sort and extract certain number of rows from a file containing dates

Question

I have a txt file containing dates in the format:

yyyymmdd

raw data are like:

There are more than 100k rows. I am trying to keep the 10k "newest" lines in one file and the 10k "oldest" lines in a separate file.

I guess this must be a two-step process:

sort the lines,
then extract the 10k rows at the top (the "newest", i.e. the most recent dates) and the 10k rows at the end of the file (the "oldest", i.e. the earliest dates).

How could I achieve this using awk?

I also tried with Perl, with no luck, so a Perl one-liner would be welcome as well.

Edit: I would prefer a clean, clever solution that I can learn from, rather than an optimization of my attempts.

Example with Perl:

@dates = ('20170401', '20170721', '20200911');
@ordered = sort { &compare } @dates;
sub compare {
    $a =~ /(\d{4})(\d{2})(\d{2})/;
    $c = $3 . $2 . $1;
    $b =~ /(\d{4})(\d{2})(\d{2})/;
    $c = $3 . $2 . $1;
    $c <=> $d;
}
print "@ordered\n";

Ted Lyngmo · Accepted Answer

This is an answer using perl. If you want the oldest on top, you can use the standard sort order:

@dates = sort @dates;

Reverse sort order, with the newest on top:

@dates = sort { $b <=> $a } @dates;
#                  ^^^
#                   |
# numerical three-way comparison returning -1, 0 or +1

You can then extract 10000 of the entries from the top:

my $keep = 10000;
my @top = splice @dates, 0, $keep;

And 10000 from the bottom:

$keep = @dates unless(@dates >= $keep);
my @bottom = splice @dates, -$keep;

@dates will now contain the dates between the 10000 at the top and the 10000 at the bottom that you extracted.

You can then save the two arrays to files if you want:

sub save {
    my $filename=shift;
    open my $fh, '>', $filename or die "$filename: $!";
    print $fh join("\n", @_) . "\n" if(@_);
    close $fh;
}

save('top', @top);
save('bottom', @bottom);

zdim · Answer

A command-line script ("one"-liner) with Perl

perl -MPath::Tiny=path -we'
    $f = shift; $n = shift//2;              # filename; number of lines or default
    @d = sort +(path($f)->lines);           # sort lexicographically, ascending
    $n = int @d/2 if 2*$n > @d;             # top/bottom lines, up to half of file
    path("bottom.txt")->spew(@d[0..$n-1]);  # write files, top/bottom $n lines
    path("top.txt")   ->spew(@d[$#d-$n+1..$#d])
' dates.txt 4

Comments

It needs a filename and optionally takes the number of lines to take from the top and the bottom; this example passes 4 (the default is 2), for easy tests with small files. There is no need to check for the filename since the library used to read it, Path::Tiny, does that.
For the library (-MPath::Tiny) I specify the method name (=path) only for documentation; this isn't necessary since the library is a class, so =path may simply be removed.
Sorting is alphabetical, but that is fine with dates in this format; the oldest dates come first, but that doesn't matter since we split off what we need. To enforce numerical sorting and, while at it, sort in descending order, use sort { $b <=> $a } @d; instead. See sort.
We check whether there are enough lines in the file for the desired number of lines to shave off from the (sorted) top and bottom ($n). If there aren't, $n is set to half the number of lines in the file.
The syntax $#ary gives the last index of the array @ary; it is used to count off $n items from the back of the array of lines, @d.

This is written as a command-line program ("one-liner") merely because that was asked for. But that much code would be far more comfortable in a script.
