huge text file (6 GB) search and replace

I have a huge file (6 GB) with 74,000 articles in this format:

<text id="1">
bla bla bla bla.........
</text>
<text id="2">
bla bla bla bla.........
</text>
<text id="3">
bla bla bla bla.........
</text>
<text id="............ and so on until 74,000

Then I have another file with the title corresponding to each of the ids, like this:

1       title1
2       title2
3       title3
...
74000   title74000

I have to add the corresponding title to each of the ids in the first file, so I transformed the second file into this script (single quotes, so the double quotes inside the pattern are passed through intact):

sed -i 's/<text id="1">/<text id="1" title="title1">/' file1
sed -i 's/<text id="2">/<text id="2" title="title2">/' file1
sed -i 's/<text id="3">/<text id="3" title="title3">/' file1
...
sed -i 's/<text id="74000">/<text id="74000" title="title74000">/' file1

Notice I didn't put the g at the end of the sed command because it is not a global search: at the first match it changes the string and moves on to the next search. The script works, but due to the huge size of the file each change takes 12 minutes, which would take about two years to complete, and I need the result ASAP. So my question is whether somebody knows how to perform these changes faster, maybe with some other utility, Python, Perl, or anything else.

Andrés Chandía asked Jan 21 '26 09:01
1 Answer

In GNU Awk version 4, you could try:

gawk4 -f a.awk file2 RS="^$" file1

Setting RS="^$" on the command line after file2 makes gawk read all of file1 as a single record.

where a.awk is:

# First pass (file2): map each opening tag <text id="N"> to its title
NR==FNR {
   b["<text id=\""$1"\">"]=$2
   next
}

# Second pass (file1, read as one record because RS="^$"):
# split the record on the <text id="..."> tags; gawk 4's fourth
# split() argument collects the matched tags themselves in s[]
{
    n=split($0,a,/<text id=[^>]*>/,s)
    printf "%s%s",s[0],a[1]
    for (i=1; i<n; i++) {
        # rebuild each tag with its title attribute inserted before the ">"
        ind=index(s[i],">")
        printf "%s%s", substr(s[i],1,ind-1) " title=\""b[s[i]]"\">", a[i+1]
    }
    printf "%s",s[n]
}

Output:

<text id="1" title="title1">
  bla bla bla bla.........
</text>
<text id="2" title="title2">
  bla bla bla bla.........
</text>
<text id="3" title="title3">
  bla bla bla bla.........
</text>
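Since the question also asks about Python: the same single-pass idea can be sketched there as well. This is a minimal sketch, assuming the ids are numeric and the title file is tab-separated as shown in the question; the function and file names are placeholders, not anything from the original scripts.

```python
# One-pass merge: load the id -> title map once, then stream the big
# file a line at a time, rewriting each <text id="..."> opening tag.
import re

def load_titles(lines):
    """Parse 'id<TAB>title' lines into a dict keyed by the id string."""
    titles = {}
    for line in lines:
        ident, _, title = line.rstrip("\n").partition("\t")
        if ident:
            titles[ident] = title
    return titles

TAG = re.compile(r'<text id="(\d+)">')

def add_titles(article_lines, titles):
    """Yield each line, inserting title="..." into matching <text> tags."""
    def repl(m):
        title = titles.get(m.group(1))
        if title is None:
            return m.group(0)  # no title known: leave the tag untouched
        return '<text id="%s" title="%s">' % (m.group(1), title)
    for line in article_lines:
        yield TAG.sub(repl, line)
```

Wired up to real files it would be something like `with open("file2") as f: titles = load_titles(f)`, then stream file1 through `add_titles` into a new output file. The point is the same as the awk solution: one pass over the 6 GB file instead of 74,000 passes.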

Update

Just for fun, I tested some of the solutions here on a 3.9 MB XML file (80,000 titles) and a 1.3 MB info file (also 80,000 titles):

  • @HåkonHægland : 0.629s
  • @tangent : 0.645s
  • @Borodin : 0.718s
  • @glennjackman : 1.098s

(Scripts for generating the input files can be found here: http://pastebin.com/PpTPt0gk )

Update 2

To get more reliable timing results I took an average over 20 runs:

  • @EdMorton : 0.485s (Gnu Awk version 4.1)
  • @EdMorton : 0.528s (Gnu Awk version 3.1.8)
  • @HåkonHægland : 0.589s
  • @Borodin : 0.599s
  • @tangent : 0.626s
  • @glennjackman : 1.074s
Håkon Hægland answered Jan 23 '26 21:01

