
I have the following file:

<?xml version="1.0" encoding="utf-8"?>
<!--Generated by crowdin.net-->
  <string name="test" >- test</string>
  <string name="test" >test-test</string>
  <string name="test" >test - test</string>

and I would like to replace the dash with the en dash character reference (&#8211;), but not every dash, only the ones inside the string tags.

I ran several sed commands with different regexes, but I couldn't figure it out. One of them was:

sed -i.bak "s/-[^-\<\>0-9]/\&#8211\;/g" strings.xml

the output was:

<?xml version="1.0" encoding="utf-8"?>
<!-&#8211;enerated by-->
  <string name="test" >&#8211;test</string>
  <string name="test2" >test&#8211;est</string>
  <string name="test3" >test &#8211;test</string>

My problem is that it also replaces the space and the first character of the second word. I don't have much experience with regex and sed. Could you please explain what I am doing wrong?

Note: I'm using OSX.

    
    
Take a look at www.rubular.com and plug in your regex and XML. You will still need to tweak your regex, but that website is useful –  ryekayo yesterday

3 Answers

Accepted answer (score: 3)

With a recent (for \K and s///r) perl and assuming your <string> tags don't nest:

perl -0777 -pi.bak -e's{<string.*?>\K.*?(?=</string>)}{$&=~s/-/&#8211;/rg}ges' file.xml
  • -0777: slurp mode: handle the whole file at once (to allow <string> tags to span several lines).
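  • -p: sed mode: loop over the input and print each record after running the code on it (with -0777, the single record is the whole file)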
  • -i.bak: in-place editing with .bak extension (BTW, that's where some sed implementations got that idea from)
  • s{...}{...}ges: substitute globally (g), where . matches newline characters as well (s), and treat the replacement as perl code to execute (e).
  • <string.*?>\K.*?(?=</string>): match from <string...> to </string> but don't include the tags themselves in the part that is matched (\K defines where the matched portion starts, and (?=...) is a look-ahead operator that only checks that </string> is there, without including it in the match).
  • $&=~s/.../.../rg: do the substitution on the matched part ($&). The r flag returns the substituted string instead of modifying $& in place.
    
Nice! Why do you say this won't work on nested tags? It does on my system. –  terdon yesterday
    
thank you very much –  blackbelt yesterday
    
@terdon, it won't replace the - in <string><string></string>-</string> –  Stéphane Chazelas yesterday
    
Ah, yes. I saw it would work on <string>-<string>-</string></string> and assumed. –  terdon yesterday

Phew, after some time I got it. This is a naive solution. terdon's answer is more correct, so you should use his :).

sed -Ei.bak "s/(.*<string[^>]*\")(.*)-(.*)/\1\2\&#8211;\3/g" strings.xml

I am using backreferences (\1, \2, etc.) to refer back to previously matched groups.

In this case sed should match the following groups:

  • (.*<string[^>]*\") - any characters, followed by the opening of a string tag, up to a quote ". Group 1
  • (.*) - anything after the " (which currently includes the >) up to the dash. Group 2
  • - the matching dash
  • (.*) - anything after the matching dash. Group 3

Then I replace the match with the captured groups and the HTML entity &#8211; in place of the dash, using \n to refer to group n.

Problems:

I am still trying to fix some problems, so please bear with me:

  1. Group 1 also matches arbitrary text before the tag, e.g. dsfjpasj<string
  2. Group 1 should include the string tag ending character >
  3. As terdon points out: "this won't work for cases where you have >1 - or nested tags or tags spanning multiple lines"
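
One way around the "more than one -" limitation while staying with sed is to repeat a single-dash substitution until nothing is left to replace. The following is only a sketch and not part of the answer above: it assumes each <string> element sits on one line, is not nested, and contains no other tags, and replace_dashes is a hypothetical helper name.

```shell
# Hypothetical helper: replace every "-" in the text content of one-line
# <string> elements, leaving attributes and all other lines untouched.
replace_dashes() {
  # ":a" defines a label; "ta" jumps back to it whenever the s command
  # changed something, so the substitution repeats until no dash remains.
  sed -e ':a' \
      -e 's/\(<string[^>]*>[^<]*\)-\([^<]*<\/string>\)/\1\&#8211;\2/' \
      -e 'ta'
}

printf '%s\n' '  <string name="a-b" >x - y - z</string>' | replace_dashes
```

The loop stops on its own because the replacement text contains no dash. The [^>]* and [^<]* classes keep the match inside one tag, so a dash in an attribute value or in a comment line is not touched.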

Read more:

http://toytoygogie.blogspot.de/2010/02/using-sed-with-backreference-as.html

Note that the format for -i is different for non-GNU sed (the OP is on OSX). Also, this won't work for cases where you have >1 - or nested tags or tags spanning multiple lines. –  terdon yesterday
@terdon, that -Ei.bak syntax will work with FreeBSD/OSX sed. The difference is that the backup extension is optional in GNU sed (and has to be -i.bak, not -i .back) while it's required in FreeBSD/OSX (where both -i.back and -i .back are allowed). But here, since it's provided and is not in the form -i .back, it will work with both GNU and FreeBSD/OSX sed. –  Stéphane Chazelas yesterday
    
Thank you very much, I appreciate it –  blackbelt yesterday
    
Two more questions: would it take much effort to make it work for the case with more than one "-"? And if I wanted to replace an em dash, could I simply change (.*)-(.*) to (.*)--(.*)? –  blackbelt yesterday

If I understand correctly, you want to replace every - (three in your example) that appears within <string></string> tags, and only those. If so, these approaches should work, assuming your XML is sane:

  1. Use a regular expression and a simple tool like sed

    sed 's/\(<string[^>]*>[^-]*\)-\([^-]*<\/string\)/\1\&#8211;\2/' file.xml 
    
  2. If your file is always like the example above and you can be sure that your tags will always be <string name="test" ></string>, you can use lookbehinds:

    perl -pe 's/(?<=<string name="test" >)([^<]*?)-([^<]*)/$1&#8211;$2/g' file.xml
    
  3. None of the above will work if you have more than a single - within the tags. To deal with such cases, you can write a simple little script that checks whether we're within <string></string> tags. This should also deal with nested tags.

    perl -F'<' -lane 'for($i=0;$i<=$#F;$i++){
        $a++ if $F[$i]=~/^string/; 
        $F[$i]=~s/-/&#8211;/g if $a>0; 
        $a-- if $F[$i]=~/^\/string/
    } print join "<",@F' file.xml
    
    
Hi, thanks for your answer. I tried the lookbehind solution, but it looks like it is not supported in the OSX sed –  blackbelt yesterday
    
@blackbelt the lookbehind example is using perl, not sed. As far as I know, sed does not support lookarounds. –  terdon yesterday
    
I see, thanks for your reply –  blackbelt yesterday
    
@blackbelt you're welcome :). I suggest you accept Stephane's answer below, it is simpler than mine yet still works for multiline, (some) nested tags and multiple - cases. –  terdon yesterday
