
How to get value from XML Tag in Python?

Tags: python, xml

I have an XML file like the one below.

<?xml version="1.0" encoding="UTF-8"?><searching>
   <query>query01</query>
   <document id="0">
      <title>lord of the rings.</title>
    <snippet>
      this is a snippet of a document.
    </snippet>
      <url>http://www.google.com/</url>
   </document>
   <document id="1">
      <title>harry potter.</title>
    <snippet>
            this is a snippet of a document.
    </snippet>
      <url>http://www.google.com/</url>
   </document>
   ........ #and other documents .....

  <group id="0" size="298" score="145">
      <title>
         <phrase>GROUP A</phrase>
      </title>
      <document refid="0"/>
      <document refid="1"/>
      <document refid="84"/>
   </group>
  <group id="0" size="298" score="55">
      <title>
         <phrase>GROUP B</phrase>
      </title>
      <document refid="2"/>
      <document refid="13"/>
      <document refid="3"/>
   </group>
</searching>

I want to get the name of each group above, along with the IDs (and titles) of the documents in each group. My idea is to store the document IDs and titles in a dictionary, like this:

import codecs
documentID = {}    
group = {}

myfile = codecs.open("file.xml", mode = 'r', encoding = "utf8")
for line in myfile:
    line = line.strip()
    #get id from tags
    #get title from tag
    #store in documentID 


    #get group name and document reference

I have also tried BeautifulSoup, but I am very new to it and don't know how to proceed. This is the code I have so far:

def outputCluster(rFile):
    documentInReadFile = {}         #dictionary to store all document in readFile

    myfile = codecs.open(rFile, mode='r', encoding="utf8")
    soup = BeautifulSoup(myfile)
    # print all text in readFile:
    # print soup.prettify()

    # print soup.find_all('title')

outputCluster("file.xml")

Any suggestions would be appreciated. Thank you.

theteddyboy asked Sep 03 '25 03:09


1 Answer

The previous posters have the right of it. The etree documentation can be found here:

https://docs.python.org/2/library/xml.etree.elementtree.html#module-xml.etree.ElementTree

It should help you out. Here's a code sample that might do the trick (partially taken from the above link):

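import xml.etree.ElementTree as ET

tree = ET.parse('your_file.xml')
root = tree.getroot()

# Each <group> keeps its name in <title><phrase> and lists its
# members as <document refid="..."/> children.
for group in root.findall('group'):
    title = group.find('title')
    titlephrase = title.find('phrase').text
    for doc in group.findall('document'):
        refid = doc.get('refid')
        print('%s: %s' % (titlephrase, refid))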

Or, if you want the ID stored on the group tag itself, use id = group.get('id') instead of looping over the refids.
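
To also get the document titles the question asks for, one option is to build an id-to-title dictionary from the top-level document elements first, then look up each refid while walking the groups. The following is only a sketch under some assumptions: SAMPLE_XML is a trimmed-down stand-in for the real file, group_documents is a name made up for this example, and a refid with no matching document (like 84 here) is mapped to None.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the file in the question.
SAMPLE_XML = """<searching>
   <query>query01</query>
   <document id="0"><title>lord of the rings.</title></document>
   <document id="1"><title>harry potter.</title></document>
   <group id="0" size="298" score="145">
      <title><phrase>GROUP A</phrase></title>
      <document refid="0"/>
      <document refid="1"/>
      <document refid="84"/>
   </group>
</searching>
"""


def group_documents(root):
    """Return (titles, groups) built from a <searching> root element."""
    # Top-level <document id="..."> elements: id -> title text.
    # findall('document') only matches direct children of root,
    # so the <document refid="..."/> entries inside groups are skipped.
    titles = {}
    for doc in root.findall('document'):
        titles[doc.get('id')] = doc.findtext('title', default='').strip()

    # Group name -> list of (refid, title) pairs.
    groups = {}
    for group in root.findall('group'):
        name = group.findtext('title/phrase', default='').strip()
        members = groups.setdefault(name, [])
        for ref in group.findall('document'):
            refid = ref.get('refid')
            # titles.get() yields None when the refid has no document.
            members.append((refid, titles.get(refid)))
    return titles, groups


root = ET.fromstring(SAMPLE_XML)
titles, groups = group_documents(root)
for name, members in groups.items():
    print('%s -> %s' % (name, members))
```

For the real file, replace ET.fromstring(SAMPLE_XML) with ET.parse('your_file.xml').getroot(). Note that attribute values come back as strings, so the dictionary keys are '0' and '1' rather than integers.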

TheSoundDefense answered Sep 05 '25 01:09