Python removing duplicate names

I have a plain text file with words on each line:

3210    <DOCID>GH950102-000003<DOCID>/O
  3243  Australia/LOCATION
  3360  England/LOCATION
  3414  India/LOCATION
  3474  Melbourne/LOCATION
  3497  England/LOCATION
  3521  >India<TOPONYM>/O
  3526  >Zimbabwe<TOPONYM>/O
  3531  >England<TOPONYM>/O
  3536  >Melbourne<TOPONYM>/O
  3541  >England<TOPONYM>/O
  3546  >England<TOPONYM>/O
  3551  >Glasgow<TOPONYM>/O
  3556  >England<TOPONYM>/O
  3561  >England<TOPONYM>/O
  3566  >Australia<TOPONYM>/O
3568    <DOCID>GH950102-000004<DOCID>/O
  3739  Hampden/LOCATION
  3821  Hampden/LOCATION
  3838  Ibrox/LOCATION
  3861  Neerday/LOCATION
  4161  Fir Park/LOCATION
  4229  Park<TOPONYM>/O
  4234  >Hampden<TOPONYM>/O
  4239  >Hampden<TOPONYM>/O
  4244  >Midfield<TOPONYM>/O
  4249  >Glasgow<TOPONYM>/O
  4251  <DOCID>GH950102-000005<DOCID>/O
  4535  Edinburgh/LOCATION
  4840  Road<TOPONYM>/O
  4845  >Edinburgh<TOPONYM>/O
  4850  >Glasgow<TOPONYM>/O

I want to remove repeated location names from this list, so that it looks like this:

3210    <DOCID>GH950102-000003<DOCID>/O
  3243  Australia/LOCATION
  3360  England/LOCATION
  3414  India/LOCATION
  3474  Melbourne/LOCATION
  3497  England/LOCATION
  3526  >Zimbabwe<TOPONYM>/O
  3551  >Glasgow<TOPONYM>/O
3568    <DOCID>GH950102-000004<DOCID>/O
  3739  Hampden/LOCATION
  3838  Ibrox/LOCATION
  3861  Neerday/LOCATION
  4161  Fir Park/LOCATION
  4229  Park<TOPONYM>/O
  4244  >Midfield<TOPONYM>/O
  4249  >Glasgow<TOPONYM>/O
  4251  <DOCID>GH950102-000005<DOCID>/O
  4535  Edinburgh/LOCATION
  4840  Road<TOPONYM>/O
  4850  >Glasgow<TOPONYM>/O

I want to remove the duplicate location names, while the docid lines should remain in the file. I know there is a way to do this on Linux using uniq, but if I run that it will also remove locations that appear under different docids. Is there any way to split the file at every docid and, within each docid, remove the location names that are repeated?

Moizzy asked May 21 '26 05:05


2 Answers

I am writing from mobile, so this will not be a complete solution, but here are the key points:

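import re
import sys

# A line that starts a new document, e.g. "3210    <DOCID>GH950102-000003<DOCID>/O"
docid = re.compile(r"^\s*\d+\s+<DOCID>")
# Offset, an optional ">", then the name up to the first "<" or "/"
location = re.compile(r"^\s*\d+\s+>?([^</]+)")

seen = set()
for line in sys.stdin:  # the input file is read from standard input
    if docid.match(line):
        seen = set()  # new docid: forget the names read so far
        print(line, end="")
        continue
    found = location.match(line)
    if found is None:
        print(line, end="")  # not a location line: keep it unchanged
    elif found.group(1) not in seen:
        seen.add(found.group(1))
        print(line, end="")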

Basically, it checks whether each line starts a new docid. If it does not, it reads the location name from the line and checks whether that name has already been read. If not, it prints the line and adds the name to the list of locations read.

If there is a new docid, it resets the list of locations read.
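
The same steps can be packaged as a small generator, which makes them easier to test. This is only a minimal sketch: the function name `dedupe_per_docid` is hypothetical, and the two patterns assume the line shapes shown in the question, where a name may be preceded by `>` and ends at the first `<` or `/`.

```python
import re

# Assumed line shapes, taken from the sample in the question:
#   "3210    <DOCID>GH950102-000003<DOCID>/O"  -> starts a new document
#   "  3360  England/LOCATION"                 -> name "England"
#   "  3531  >England<TOPONYM>/O"              -> name "England"
DOCID_RE = re.compile(r"^\s*\d+\s+<DOCID>")
NAME_RE = re.compile(r"^\s*\d+\s+>?([^</]+)")


def dedupe_per_docid(lines):
    """Yield lines, dropping names already seen under the current DOCID."""
    seen = set()
    for line in lines:
        if DOCID_RE.match(line):
            seen = set()  # new document: forget the previous names
            yield line
            continue
        match = NAME_RE.match(line)
        if match is None:
            yield line  # unrecognised line: pass it through unchanged
            continue
        name = match.group(1).strip()
        if name not in seen:
            seen.add(name)
            yield line


# A few lines copied from the question's input.
sample = [
    "3210    <DOCID>GH950102-000003<DOCID>/O\n",
    "  3360  England/LOCATION\n",
    "  3497  England/LOCATION\n",
    "  3531  >England<TOPONYM>/O\n",
    "  3551  >Glasgow<TOPONYM>/O\n",
    "3568    <DOCID>GH950102-000004<DOCID>/O\n",
    "  4249  >Glasgow<TOPONYM>/O\n",
]
result = list(dedupe_per_docid(sample))
```

Here `England` is kept once under the first docid, while `Glasgow` is kept under both docids because the set of seen names is reset at each DOCID line.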

Gnudiff answered May 24 '26 03:05


Here is a way to do it.

import string

filename = 'testfile'
with open(filename, 'r') as f:
    lines = f.readlines()

exclude = set(string.punctuation)  # characters to replace with spaces
final_list = []
unique_list = []  # names seen under the current docid
for line in lines:
    if 'DOCID' in line:
        unique_list = []  # a new docid starts: forget the earlier names
        final_list.append(line)
    else:
        # Replace punctuation with spaces so the name becomes its own token
        currentline = ''.join(ch if ch not in exclude else ' ' for ch in line)
        city = currentline.split()[1]  # first word after the offset is used as the name
        if city not in unique_list:
            unique_list.append(city)
            final_list.append(line)

for line in final_list:
    print(line)

output:

3210    <DOCID>GH950102-000003<DOCID>/O

  3243  Australia/LOCATION

  3360  England/LOCATION

  3414  India/LOCATION

  3474  Melbourne/LOCATION

  3526  >Zimbabwe<TOPONYM>/O

  3551  >Glasgow<TOPONYM>/O

3568    <DOCID>GH950102-000004<DOCID>/O

  3739  Hampden/LOCATION

  3838  Ibrox/LOCATION

  3861  Neerday/LOCATION

  4161  Fir Park/LOCATION

  4229  Park<TOPONYM>/O

  4244  >Midfield<TOPONYM>/O

  4249  >Glasgow<TOPONYM>/O

  4251  <DOCID>GH950102-000005<DOCID>/O

  4535  Edinburgh/LOCATION

  4840  Road<TOPONYM>/O

  4850  >Glasgow<TOPONYM>/O

Note: testfile is a text file containing your input text. You can optimize the code if necessary.
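
One possible optimization, sketched under assumptions rather than presented as the definitive version: keep the seen names in a set, so each membership test takes constant time on average, and compare the whole name instead of only its first word, so that a multi-word name such as `Fir Park` is not confused with another name starting with `Fir`. The helper names `location_key` and `filter_lines` are hypothetical, and the parsing assumes the line shapes shown in the question.

```python
def location_key(line):
    """Return the location name on a line, or None if there is none."""
    parts = line.split(None, 1)  # the offset, then the rest of the line
    if len(parts) < 2:
        return None
    rest = parts[1].lstrip(">")
    # The name ends at the first "<" (TOPONYM lines) or "/" (LOCATION lines).
    name = rest.split("<", 1)[0].split("/", 1)[0].strip()
    return name or None


def filter_lines(lines):
    """Keep DOCID lines and the first occurrence of each name per DOCID."""
    kept = []
    seen = set()
    for line in lines:
        if "DOCID" in line:
            seen.clear()  # a new docid starts: forget the earlier names
            kept.append(line)
            continue
        key = location_key(line)
        if key is None:
            kept.append(line)  # no name found: keep the line unchanged
        elif key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# A few lines copied from the question's input.
sample = [
    "3568    <DOCID>GH950102-000004<DOCID>/O\n",
    "  3739  Hampden/LOCATION\n",
    "  3821  Hampden/LOCATION\n",
    "  4161  Fir Park/LOCATION\n",
    "  4229  Park<TOPONYM>/O\n",
    "  4234  >Hampden<TOPONYM>/O\n",
    "  4251  <DOCID>GH950102-000005<DOCID>/O\n",
    "  4840  Road<TOPONYM>/O\n",
]
kept = filter_lines(sample)
```

To run it on a real file, pass the open file object to `filter_lines` and write each kept line with `print(line, end="")`, so that the newline already at the end of each line is not doubled.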

utengr answered May 24 '26 02:05