Take the 2-minute tour ×
Unix & Linux Stack Exchange is a question and answer site for users of Linux, FreeBSD and other Un*x-like operating systems.. It's 100% free, no registration required.

I have to delete particular tags from an xml file. Sample xml below.

       <data>
          <tag:action/>
        </data>

I want to delete all contents between data and /data. The XML tags are not displayed in the question after posting.

I am able to do this by using remove() method in Python ElementTree xml parser. I am writing the modified contents to a new after the deletion of the element.

tree.write('new.xml');

The problem is that all the tag names in the original xml file are renamed to ns0, ns1 and so on in new.xml.

Is there any way to modify the XML file keeping all other contents in tact?

share|improve this question
    
That looks like an incomplete XML file to me. How would lxml know what namespace to associate with tag? –  Anthon May 8 at 5:25
add comment

1 Answer

You can use beautiful soup to do the job :

#!/usr/bin/python
# -*- coding: utf-8 -*-

import bs4

content = '''
<people>

  <person born="1975">
    <name>
      <first_name>John</first_name>
      <last_name>Doe</last_name>
    </name>
    <profession>computer scientist</profession>
    <homepage href="http://www.example.com/johndoe"/>
  </person>

  <person born="1977">
    <name>
      <first_name>Jane</first_name>
      <last_name>Doe</last_name>
    </name>
    <profession>computer scientist</profession>
    <homepage href="http://www.example.com/janedoe"/>
  </person>

</people>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(content)

for s in soup('name'):
    s.extract()

print(soup)

It produces the following result :

<html><body><people>
<person born="1975">

<profession>computer scientist</profession>
<homepage href="http://www.example.com/johndoe"></homepage>
</person>
<person born="1977">

<profession>computer scientist</profession>
<homepage href="http://www.example.com/janedoe"></homepage>
</person>
</people>
</body></html>

With namespaces :

#!/usr/bin/python
# -*- coding: utf-8 -*-

import bs4

content = '''<people xmlns:h="http://www.example.com/to/">

  <h:person born="1975">
    <h:name>
      <h:first_name>John</h:first_name>
      <h:last_name>Doe</h:last_name>
    </h:name>
    <h:profession>computer scientist</h:profession>
    <h:homepage href="http://www.example.com/johndoe"/>
  </h:person>

  <h:person born="1977">
    <h:name>
      <h:first_name>Jane</h:first_name>
      <h:last_name>Doe</h:last_name>
    </h:name>
    <h:profession>computer scientist</h:profession>
    <h:homepage href="http://www.example.com/janedoe"/>
  </h:person>

</people>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(content).people

for s in soup('h:name'):
    s.extract()

print(soup)

I added .people to prevent <html><body> </body></html> in the result.

<people xmlns:h="http://www.example.com/to/">
<h:person born="1975">

<h:profession>computer scientist</h:profession>
<h:homepage href="http://www.example.com/johndoe"></h:homepage>
</h:person>
<h:person born="1977">

<h:profession>computer scientist</h:profession>
<h:homepage href="http://www.example.com/janedoe"></h:homepage>
</h:person>
</people>
share|improve this answer
    
Thank You for the answer. I got it working with beautifulsoup. –  Akhitha May 8 at 11:30
    
Thank You for the answer. I got it working with beautifulsoup. But, there are namespaces in XML tags. How can I search for a particular tag if namespace is present. I used find and find_all, but its not returning the values. –  Akhitha May 8 at 13:23
add comment

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.