
How to remove or replace specific characters between two XML tags [linux, python, lxml, sed, awk,...]?

I'm using the lxml library in Python for XML parsing.

In an XML file, I have some bad characters that lead to the following error in Python:

lxml.etree.XMLSyntaxError: CharRef

Before opening the XML file and fetching its content in Python, I must remove the bad characters from two tags:

1: <essid cloaked="true">....</essid> or <essid cloaked="false">....</essid>.

2: <client-manuf>....</client-manuf>

The XML file is big, so I want to do it with sed, awk, or a similar tool.

    <crypt>0</crypt>
        <total>20    50</total>
        <fragments>0</fragments>
        <retries>0</retries>
    </packets>
    <datasize>0</datasize>
    <wireless-client number="1" type="established" first-time="Thu Feb 15 16:45:43 2018" last-time="Thu Feb 15 16:45:43 2018">
        <client-mac>08:EA:40:D0:55:43</client-mac>
        <client-manuf>SHENZHEN BILIAN ELECTRONIC CO.&#x  ef;&#x  bc;&#x  8c;LTD</client-manuf>
        <essid cloaked="true">&#x   0;&#x   0;&#x   0;&#x   0;&#x   0;</essid>
        <channel>8</channel>
        <maxseenrate>1.000000</maxseenrate>
        <carrier>IEEE 802.11b+</carrier>
        <encoding>CCK</encoding>
        <packets>
            <LLC>0</LLC>
            <data>0</data>
            <crypt>0</crypt>

I want to remove the bad chars from these tags (client-manuf and essid).

From: <client-manuf>SHENZHEN BILIAN ELECTRONIC CO.&#x ef;&#x bc;&#x 8c;LTD</client-manuf>

To (or this): <client-manuf>SHENZHEN BILIAN ELECTRONIC CO. LTD</client-manuf>

To (or this): <client-manuf>SHENZHEN BILIAN ELECTRONIC CO</client-manuf>

-----------------------------------------------

From: <essid cloaked="true">&#x 0;&#x 0;&#x 0;&#x 0;&#x 0;</essid>

From: <essid cloaked="false">&#x 0;&#x WiFi 0;&#x MTN 0;&#x 0;&#x 0;</essid>

To (or this): <essid cloaked="true"></essid>

To (or this): <essid cloaked="true">N/A SSID</essid>

To (or this): <essid cloaked="false">WiFi MTN</essid>

For example, two of the bad character sequences:

1: 0;

2: &#x

This is my attempt, but it doesn't work well for my needs:

sed -e '/<essid cloaked="\(true\|false"\)>*.*<\/essid>/ s/\(&#x\|0;\)//g' a.txt

ali reza asked Dec 08 '25 21:12



1 Answer

Your sed command wasn't far off; it just left a lot of whitespace behind.

Since the * quantifier in sed is greedy, " *" matches any run of spaces (including none), so you can swallow the blanks around each token:

cat bad.xml | sed '/<essid cloaked="\(true\|false\)">.*<\/essid>/ s/ *\(&#x\|0;\) *//g'
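Since the file is parsed from Python anyway, the same step could also be done there before handing the text to lxml. The following is only a rough sketch using the standard re module, not part of the sed solution: the name strip_tokens and the sample line are made up for illustration, and the line filter is written in a strict form with the closing quote outside the alternation.

```python
import re

# Lines to touch: an <essid> element with cloaked="true" or cloaked="false".
ESSID_LINE = re.compile(r'<essid cloaked="(?:true|false)">.*</essid>')
# The two tokens from the question, each with any surrounding spaces.
TOKEN = re.compile(r' *(?:&#x|0;) *')


def strip_tokens(line):
    """Delete the tokens (and the spaces around them) on essid lines only."""
    if ESSID_LINE.search(line):
        return TOKEN.sub('', line)
    return line  # every other line is returned unchanged


# Hypothetical sample line, modelled on the excerpt in the question.
sample = '<essid cloaked="true">&#x   0;&#x   0;&#x   0;</essid>'
print(strip_tokens(sample))
```

Like the sed version, this only looks at essid lines; a client-manuf line would pass through unchanged.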

On the other hand, if the tag contains valid text, you may not want the remaining words glued together, so you could insert one space for each removed pattern:

cat bad.xml | sed '/<essid cloaked="\(true\|false\)">.*<\/essid>/ s/ *\(&#x\|0;\) */ /g'

Finally, you might collapse runs of spaces into a single one:

cat bad.xml | sed '/<essid cloaked="\(true\|false\)">.*<\/essid>/{s/ *\(&#x\|0;\) */ /g;s/  */ /g}'

Note that the construct {foo;bar} groups the two commands into a block that runs only on the lines matched by the preceding address. Without the braces, the second substitution would be applied to every line of the file.
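The effect of the block can be sketched in Python as well. This is an illustration under assumptions, not a translation of sed internals: clean_line is a made-up name, and the two sample lines are taken from the question. Both substitutions sit behind the same line test, so a line such as <total>20    50</total> keeps its spacing.

```python
import re

ESSID_LINE = re.compile(r'<essid cloaked="(?:true|false)">.*</essid>')
TOKEN = re.compile(r' *(?:&#x|0;) *')
SPACES = re.compile(r' +')  # one or more spaces, like "  *" in sed


def clean_line(line):
    """Apply both substitutions, but only to lines that match ESSID_LINE."""
    if not ESSID_LINE.search(line):
        return line               # other lines pass through untouched
    line = TOKEN.sub(' ', line)   # first command: each token becomes a space
    return SPACES.sub(' ', line)  # second command: squeeze runs of spaces


lines = [
    '<total>20    50</total>',
    '<essid cloaked="false">&#x 0;&#x WiFi 0;&#x MTN 0;&#x 0;&#x 0;</essid>',
]
for line in lines:
    print(clean_line(line))
```

With these inputs the essid line comes out as <essid cloaked="false"> WiFi MTN </essid>, with one blank left on each side of the text. If the space squeeze were applied outside the if test, the first line would lose its run of spaces too.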

With another escaped pair of parentheses and an escaped plus, you can substitute a repeated occurrence of a pattern with just one thing:

cat bad.xml | sed '/<essid cloaked="\(true\|false\)">.*<\/essid>/{s/\( *\(&#x\|0;\) *\)\+/ missing essid /g;s/  */ /g}'

      s/\( *\(&#x\|0;\) *\)\+/ missing essid /;
      ^  (   (pattern1)   )+ / replacement   / (g is no longer needed for a single run)
         (pattern .......2)

The inner pattern (pattern1) is an alternation: either &#x or 0;. The outer pattern (pattern2) is the inner pattern, optionally surrounded by blanks, so it matches strings like

     "0;"
     "0; "
     " 0; "
     " 0;"
     "    0;  "
     "    &#x"

and so on.

You want the outer pattern, let's call it X, to be repeated one or more times, hence the \+. Without the surrounding parentheses, \+ would apply only to the last element before it, not to the whole pattern.
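The difference the grouping makes can be tried out in Python's re module. Keep in mind that the syntax differs from sed's basic regular expressions: in Python, parentheses and + are special without a backslash, and (?:...) is a group that does not capture. The sample text and the "-" replacement are made up for illustration.

```python
import re

text = '&#x   0;&#x   0;&#x   0;'

# One token at a time: each of the six tokens is replaced separately.
single = re.sub(r' *(?:&#x|0;) *', '-', text)

# Grouped and repeated: the whole run of tokens is a single match.
run = re.sub(r'(?: *(?:&#x|0;) *)+', '-', text)

print(single)  # ------
print(run)     # -

# The same binding rule on a smaller pattern:
# without a group, + applies only to the "b" ...
print(re.fullmatch(r'ab+', 'abab'))      # None
# ... with a group, + repeats the whole "ab".
print(bool(re.fullmatch(r'(?:ab)+', 'abab')))  # True
```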

You have to learn this regex language. Find a tutorial. You can't ask about every possible variation you will need in your life.

A good, basic understanding pays off very quickly. You don't need to know everything by heart, but you should know the basics and have a good sense of what is possible and what is not. Then keep a reference at hand to look up the rarely used things, and ask only about the hard or complicated stuff.

user unknown answered Dec 11 '25 11:12
