I have an XML file like the one below.
<?xml version="1.0" encoding="UTF-8"?>
<searching>
<query>query01</query>
<document id="0">
<title>lord of the rings.</title>
<snippet>
this is a snippet of a document.
</snippet>
<url>http://www.google.com/</url>
</document>
<document id="1">
<title>harry potter.</title>
<snippet>
this is a snippet of a document.
</snippet>
<url>http://www.google.com/</url>
</document>
........ #and other documents .....
<group id="0" size="298" score="145">
<title>
<phrase>GROUP A</phrase>
</title>
<document refid="0"/>
<document refid="1"/>
<document refid="84"/>
</group>
<group id="0" size="298" score="55">
<title>
<phrase>GROUP B</phrase>
</title>
<document refid="2"/>
<document refid="13"/>
<document refid="3"/>
</group>
</searching>
I want to get the group names above and the document ids (and their titles) in each group. My idea is to store each document id and document title in a dictionary, like this:
import codecs

documentID = {}
group = {}
myfile = codecs.open("file.xml", mode='r', encoding="utf8")
for line in myfile:
    line = line.strip()
    # get id from tags
    # get title from tag
    # store in documentID
    # get group name and document reference
I have also tried BeautifulSoup, but I am very new to it and don't know how to proceed. This is the code I have so far:
import codecs
from bs4 import BeautifulSoup

def outputCluster(rFile):
    documentInReadFile = {}  # dictionary to store all documents in rFile
    myfile = codecs.open(rFile, mode='r', encoding="utf8")
    soup = BeautifulSoup(myfile)
    # print all text in rFile:
    # print soup.prettify()
    # print soup.find_all('title')

outputCluster("file.xml")
Please kindly leave me some suggestions. Thank you.
The previous posters have the right of it. The etree documentation can be found here:
https://docs.python.org/2/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
It should help you out. Here's a code sample that might do the trick (partially taken from the above link):
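import xml.etree.ElementTree as ET

tree = ET.parse('your_file.xml')
root = tree.getroot()
# findall() only matches direct children, so this visits each top-level <group>
for group in root.findall('group'):
    title = group.find('title')
    # the group name lives in <title><phrase>...</phrase></title>
    titlephrase = title.find('phrase').text
    # each <document refid="..."/> inside the group points at a document id
    for doc in group.findall('document'):
        refid = doc.get('refid')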
Or if you want the ID stored in the group tag, you'd use id = group.get('id') instead of searching for all the refids.
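To tie this back to the dictionary idea in the question, here is a minimal sketch that first maps each document id to its title and then resolves every group's refid values against that map. The function name build_group_index and the inline SAMPLE_XML string are my own illustrations, not from the original post, and the sample assumes the file is well-formed (with a proper </searching> closing tag). A refid with no matching document, like 84 in the trimmed sample, is kept with a title of None.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample in the same shape as the question's file.
SAMPLE_XML = """<searching>
  <query>query01</query>
  <document id="0"><title>lord of the rings.</title></document>
  <document id="1"><title>harry potter.</title></document>
  <group id="0" size="298" score="145">
    <title><phrase>GROUP A</phrase></title>
    <document refid="0"/>
    <document refid="1"/>
    <document refid="84"/>
  </group>
</searching>"""


def build_group_index(root):
    """Return (titles, groups) built from a <searching> root element."""
    # Map document id -> title. findall() matches only direct children of
    # root, so the <document refid=...> elements inside groups are skipped.
    titles = {}
    for doc in root.findall('document'):
        titles[doc.get('id')] = doc.findtext('title', default='').strip()

    # Map group name -> list of (refid, title) pairs.
    groups = {}
    for group in root.findall('group'):
        name = group.findtext('title/phrase', default='').strip()
        members = []
        for ref in group.findall('document'):
            refid = ref.get('refid')
            # titles.get() yields None when the referenced document is absent
            members.append((refid, titles.get(refid)))
        groups[name] = members
    return titles, groups


titles, groups = build_group_index(ET.fromstring(SAMPLE_XML))
for name, members in groups.items():
    print(name)
    for refid, title in members:
        print("  %s: %s" % (refid, title))
```

For a file on disk you would pass ET.parse('your_file.xml').getroot() instead of ET.fromstring(SAMPLE_XML). Note that attribute values come back as strings, so the ids are '0' and '1' rather than integers.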