huge text file (6 GB) search and replace

I have a huge file (6 GB) with 74,000 articles in this format:

<text id="1">
bla bla bla bla.........
</text>
<text id="2">
bla bla bla bla.........
</text>
<text id="3">
bla bla bla bla.........
</text>
<text id="............ and so on until 74,000

Then I have another file with the title corresponding to each of the ids, like this:

1       title1
2       title2
3       title3
...
74000   title74000

I have to add the corresponding title to each of the ids in the first file, so I transformed the second file into this script (single quotes, so the double quotes inside the pattern are passed through intact):

sed -i 's/<text id="1">/<text id="1" title="title1">/' file1
sed -i 's/<text id="2">/<text id="2" title="title2">/' file1
sed -i 's/<text id="3">/<text id="3" title="title3">/' file1
...
sed -i 's/<text id="74000">/<text id="74000" title="title74000">/' file1

Notice I didn't put the g at the end of the sed command because it is not a global search: at the first match it changes the string and moves on to the next search. The script works, but due to the huge size of the file each change takes 12 minutes, which would take about two years to complete, and I need the result ASAP. So my question is whether somebody knows how to perform these changes faster, maybe with some other utility, Python, Perl, or anything else.

Andrés Chandía asked Jan 21 '26 09:01
1 Answer

In GNU Awk version 4, you could try:

gawk4 -f a.awk file2 RS="^$" file1

Setting RS="^$" on the command line after file2 makes gawk read all of file1 as a single record.

where a.awk is:

# First pass (file2): map each opening tag <text id="N"> to its title
NR==FNR {
   b["<text id=\""$1"\">"]=$2
   next
}

# Second pass (file1, read as one record because RS="^$"):
# split the record on the <text id="..."> tags; gawk 4's fourth
# split() argument collects the matched tags themselves in s[]
{
    n=split($0,a,/<text id=[^>]*>/,s)
    printf "%s%s",s[0],a[1]
    for (i=1; i<n; i++) {
        # rebuild each tag with its title attribute inserted before the ">"
        ind=index(s[i],">")
        printf "%s%s", substr(s[i],1,ind-1) " title=\""b[s[i]]"\">", a[i+1]
    }
    printf "%s",s[n]
}

Output:

<text id="1" title="title1">
  bla bla bla bla.........
</text>
<text id="2" title="title2">
  bla bla bla bla.........
</text>
<text id="3" title="title3">
  bla bla bla bla.........
</text>
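Since the question also asks about Python: the same single-pass idea can be sketched there as well. This is a minimal sketch, assuming the ids are numeric and the title file is tab-separated as shown in the question; the function and file names are placeholders, not anything from the original scripts.

```python
# One-pass merge: load the id -> title map once, then stream the big
# file a line at a time, rewriting each <text id="..."> opening tag.
import re

def load_titles(lines):
    """Parse 'id<TAB>title' lines into a dict keyed by the id string."""
    titles = {}
    for line in lines:
        ident, _, title = line.rstrip("\n").partition("\t")
        if ident:
            titles[ident] = title
    return titles

TAG = re.compile(r'<text id="(\d+)">')

def add_titles(article_lines, titles):
    """Yield each line, inserting title="..." into matching <text> tags."""
    def repl(m):
        title = titles.get(m.group(1))
        if title is None:
            return m.group(0)  # no title known: leave the tag untouched
        return '<text id="%s" title="%s">' % (m.group(1), title)
    for line in article_lines:
        yield TAG.sub(repl, line)
```

Wired up to real files it would be something like `with open("file2") as f: titles = load_titles(f)`, then stream file1 through `add_titles` into a new output file. The point is the same as the awk solution: one pass over the 6 GB file instead of 74,000 passes.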

Update

Just for fun, I tested some of the solutions here on a 3.9 MB XML file (80,000 titles) and a 1.3 MB info file (also 80,000 titles):

  • @HåkonHægland : 0.629s
  • @tangent : 0.645s
  • @Borodin : 0.718s
  • @glennjackman : 1.098s

(Scripts for generating the input files can be found here: http://pastebin.com/PpTPt0gk )

Update 2

To get more reliable timing results I took an average over 20 runs:

  • @EdMorton : 0.485s (Gnu Awk version 4.1)
  • @EdMorton : 0.528s (Gnu Awk version 3.1.8)
  • @HåkonHægland : 0.589s
  • @Borodin : 0.599s
  • @tangent : 0.626s
  • @glennjackman : 1.074s
Håkon Hægland answered Jan 23 '26 21:01

