
Sort and extract a certain number of rows from a file containing dates

Tags: awk, perl

I have a text file containing dates in the format:

yyyymmdd

The raw data look like:

20171115
20171115
20180903
...
20201231

There are more than 100k rows. I am trying to keep the 10k "newest" lines in one file and the 10k "oldest" lines in a separate file.

I guess this must be a two-step process:

  1. sort the lines,

  2. then extract the 10k rows at the top (the "newest", i.e. most recent dates) and the 10k rows towards the end of the file (the "oldest", i.e. most ancient dates).

How could I achieve this using awk?

I also tried with Perl, without luck, so a Perl one-liner would be very welcome as well.

Edit: I would prefer a clean, clever solution that I can learn from, rather than an optimization of my attempts.

Example with Perl:

@dates = ('20170401', '20170721', '20200911');
@ordered = sort { &compare } @dates;
sub compare {
    $a =~ /(\d{4})(\d{2})(\d{2})/;
    $c = $3 . $2 . $1;
    $b =~ /(\d{4})(\d{2})(\d{2})/;
    $c = $3 . $2 . $1;
    $c <=> $d;
}
print "@ordered\n";
pesaw asked Sep 17 '25 19:09



2 Answers

This is an answer using Perl. If you want the oldest on top, you can use the standard sort order:

@dates = sort @dates;

Reverse sort order, with the newest on top:

@dates = sort { $b <=> $a } @dates;
#                  ^^^
#                   |
# numerical three-way comparison returning -1, 0 or +1

You can then extract 10000 of the entries from the top:

my $keep = 10000;
my @top = splice @dates, 0, $keep;

And 10000 from the bottom:

$keep = @dates unless(@dates >= $keep);
my @bottom = splice @dates, -$keep;

@dates will now contain the dates between the 10000 at the top and the 10000 at the bottom that you extracted.

You can then save the two arrays to files if you want:

sub save {
    my $filename=shift;
    open my $fh, '>', $filename or die "$filename: $!";
    print $fh join("\n", @_) . "\n" if(@_);
    close $fh;
}

save('top', @top);
save('bottom', @bottom);
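
Since the question also asked about awk, here is a minimal sketch of the same sort-then-split steps using sort and awk. It is not part of the Perl above, only an illustration under some assumptions: the input file is called dates.txt, the output files reuse the names top and bottom, and keep is set to 2 so the effect is visible on a small sample (the question would use 10000).

```shell
# Small sample standing in for the real 100k-line file (file name is an assumption).
printf '%s\n' 20171115 20201231 20180903 20171115 20190101 20200911 > dates.txt

# Sort newest first, then let awk route lines by their position in the stream.
sort -r dates.txt | awk -v keep=2 '
    NR <= keep { print > "top"; next }   # the first keep lines are the newest
    { rest[++m] = $0 }                   # remember every remaining line
    END {
        start = m - keep + 1             # index of the first of the last keep lines
        if (start < 1) start = 1         # fewer than keep lines left: take them all
        for (i = start; i <= m; i++) print rest[i] > "bottom"
    }'
```

As with the two splice calls, a line written to top is never written to bottom. The sketch holds all remaining lines in memory, which should be unproblematic for a file of this size, and it does not create bottom at all when nothing is left over after top is filled.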
Ted Lyngmo answered Sep 20 '25 23:09



A command-line script ("one"-liner) with Perl

perl -MPath::Tiny=path -we'
    $f = shift; $n = shift//2;              # filename; number of lines or default
    @d = sort +(path($f)->lines);           # sort lexicographically, ascending
    $n = int @d/2 if 2*$n > @d;             # top/bottom lines, up to half of file
    path("bottom.txt")->spew(@d[0..$n-1]);  # write files, top/bottom $n lines
    path("top.txt")   ->spew(@d[$#d-$n+1..$#d])
' dates.txt 4

Comments

  • Needs a filename, and can optionally take the number of lines to take from the top and bottom; in this example 4 is passed (the default is 2), for easy testing with small files. There is no need to check the filename, since the library used to read it, Path::Tiny, does that.

  • For the library (-MPath::Tiny) I specify the method name (=path) only for documentation; this isn't necessary since the library is a class, so =path may simply be removed.

  • Sorting is alphabetical, but that is fine with dates in this format; the oldest dates come first, but that doesn't matter since we split off what we need. To enforce numerical sorting, and while at it to sort in descending order, use sort { $b <=> $a } @d;. See the documentation for sort.

  • We check whether there are enough lines in the file for the desired number of lines ($n) to shave off from the (sorted) top and bottom. If there aren't, $n is set to half the file.

  • The syntax $#ary is the last index of the array @ary; it is used to count off $n items from the back of the array of lines, @d.
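
The point that alphabetical sorting is fine for this format can be checked quickly from the command line. The sketch below is only an illustration with made-up file names (sample.txt, lexical.txt, numeric.txt): it sorts a few fixed-width yyyymmdd values both as strings and as numbers and compares the results.

```shell
# A few fixed-width yyyymmdd values (sample data for illustration only).
printf '%s\n' 20200911 20170401 20170721 20171115 > sample.txt

sort    sample.txt > lexical.txt   # plain string comparison
sort -n sample.txt > numeric.txt   # numeric comparison

# cmp -s exits with status 0 only when both files are byte-for-byte identical.
cmp -s lexical.txt numeric.txt && echo "same order"
```

The two orders agree because every value has the same number of digits and the most significant field (the year) comes first. The same would not hold for a format such as ddmmyyyy.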

The Perl above is written as a command-line program ("one-liner") merely because that was asked for; that much code would be far more comfortable in a script.
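
As a rough idea of what a script form could look like, here is a sketch that follows the same logic (ascending sort, cap the count at half the file, oldest lines to bottom.txt, newest to top.txt) but uses sort, head and tail instead of Perl. The function name split_dates is hypothetical, and the sketch assumes mktemp is available, which is common but not guaranteed on every system.

```shell
#!/bin/sh
# split_dates FILE [N]: write the N oldest lines to bottom.txt and the
# N newest lines to top.txt (hypothetical helper; output names follow the one-liner).
split_dates() {
    file=$1
    n=${2:-2}                                 # default of 2, as in the one-liner
    sorted=$(mktemp) || return 1
    sort "$file" > "$sorted"                  # ascending, so oldest dates come first
    total=$(( $(wc -l < "$sorted") ))         # arithmetic expansion drops wc padding
    if [ $((2 * n)) -gt "$total" ]; then
        n=$((total / 2))                      # not enough lines: take half each
    fi
    if [ "$n" -eq 0 ]; then
        : > bottom.txt; : > top.txt           # nothing to split: leave both empty
    else
        head -n "$n" "$sorted" > bottom.txt   # first n lines: oldest
        tail -n "$n" "$sorted" > top.txt      # last n lines: newest
    fi
    rm -f "$sorted"
}

# Small sample standing in for the real file.
printf '%s\n' 20171115 20201231 20180903 20171115 20190101 > dates.txt
split_dates dates.txt 4
```

With five input lines and a request for 4, the count is capped at 2, so each output file receives two lines. Both files are in ascending order, as in the one-liner.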

zdim answered Sep 21 '25 00:09
