The title might seem a bit recursive, and indeed it is.

I am working on a JavaScript script that highlights/colors JavaScript code displayed in HTML. In the browser, comments will be turned green, keywords (for, if, while, etc.) will be turned dark blue and italic, numbers will be red, and so on for other elements. However, the coloring itself is not all that important.

I am trying to figure out two different regular expressions which have started to cause a minor headache.

1. Finding a regular expression using a regular expression

I want to find regular expressions within the script tags of HTML using JavaScript, such as:

    match(/findthis/i);

Here the regex part is, of course, "/findthis/i".

The rules are as follows:

  1. Finding multiple occurrences (/g) is not important.
  2. It must be on the same line (not /m).
  3. Case-insensitive (/i).
  4. If a backslash (the escape character) is followed directly by a forward slash, "/", the forward slash is part of the expression, not the terminating slash. E.g.: /itdoesntstop\/untilnow:/
  5. Two forward slashes right next to each other (//) mean: (A) At the beginning: not a regex; it's a comment. (B) Later on: the first slash ends the regex and the second slash is nothing but a character.
  6. The regex continues until a line break or the end of input (\n|$), or until the terminating slash (a second forward slash that is not escaped as described in rule 4) is encountered. However, any alphabetic characters directly following the second forward slash are also considered part of the regex. E.g.: /aregex/allthisispartoftheregex

So far what I've got is this:

    '\\/(?:[^\\/\\\\]|\\/\\*)*\\/([a-zA-Z]*)?'

However, it isn't consistent. Any suggestions?

2. Find digits (alphanumeric, floating) using a regular expression

Finding digits on their own is simple. However, finding floating numbers (with multiple periods) and letters including underscore is more of a challenge.

All of the below are considered numbers (a new number starts after each space):

3 3.1 3.1.4 3a 3.A 3.a1 3_.1

The rules:

  1. Finding multiple occurrences (/g) is not important.
  2. It must be on the same line (not /m).
  3. Case-insensitive (/i).
  4. A number must begin with a digit. However, the number can be preceded or followed by a non-word (\W) character. E.g.: "=9.9;" where "9.9" is the actual number. "a9" is not a number. A period before the number, ".9", is not considered part of the number and thus the actual number is "9".
  5. Allowed characters: [a-zA-Z0-9_.]

What I've got:

    '(^|\\W)\\d([a-zA-Z0-9_.]*?)(?=([^a-zA-Z0-9_.]|$))'

It doesn't work quite the way I want it to.

1 Answer

For the first part, I think you are quite close. Here is what I would use (as a regex literal, to avoid all the double escapes):

    /\/(?:[^\/\\\n\r]|\\.)+\/([a-z]*)/i

I don't know what you intended with your second alternative after the character class. But here the second alternative is used to consume backslashes and anything that follows them. The last part is important, so that you can recognize the regex ending in something like this: /backslash\\/. And the ? at the end of your regex was redundant. Otherwise this should be fine.
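As a rough sketch of how this pattern behaves on the kinds of lines described in the question (the helper name `findRegexLiteral` and the sample lines are made up for illustration, they are not from the question):

```javascript
// The regex literal from above: opening slash, a body made of
// "anything but slash/backslash/line break" or "backslash + any character",
// a closing slash, then optional letters (the flags) in group 1.
const regexLiteral = /\/(?:[^\/\\\n\r]|\\.)+\/([a-z]*)/i;

// Hypothetical helper: returns the first regex-like literal in a line, or null.
function findRegexLiteral(line) {
  const m = regexLiteral.exec(line);
  return m ? { literal: m[0], flags: m[1] } : null;
}

// Note: backslashes are doubled here only because these are string literals.
const samples = [
  "match(/findthis/i);",          // literal: /findthis/i, flags: i
  "/itdoesntstop\\/untilnow:/",   // escaped slash stays inside the regex
  "/backslash\\\\/",              // escaped backslash, then the closing slash
  "// just a comment",            // no match: the body needs at least one character
];

for (const line of samples) {
  console.log(line, "=>", findRegexLiteral(line));
}
```

One limitation worth keeping in mind: the pattern only looks at characters, not at context, so an expression with two divisions on one line such as `a / b / c` would also be reported as a regex (`/ b /`).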

Your second regex is just fine for your specification. There are a few redundant elements though. The main thing you might want to do is capture everything but the possible first character:

    /(?:^|\W)(\d[\w.]*)/i

Now the actual number (without the first character) will be in capturing group 1. Note that I removed the ungreediness and the lookahead, because greediness alone does exactly the same.
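A minimal sketch of reading capturing group 1, using the sample inputs from the question. The helper names `firstNumber` and `allNumbers` are hypothetical, and the global (`g`) copy of the pattern is an assumption added here only to list every number on a line; the question itself does not require it.

```javascript
// The pattern from above; group 1 holds the number without the preceding character.
const numberPattern = /(?:^|\W)(\d[\w.]*)/i;

// Hypothetical helper: first number in the text, or null if there is none.
function firstNumber(text) {
  const m = numberPattern.exec(text);
  return m ? m[1] : null;
}

// Hypothetical helper: all numbers in the text, via a global copy of the pattern.
function allNumbers(text) {
  const globalPattern = new RegExp(numberPattern.source, "gi");
  return Array.from(text.matchAll(globalPattern), (m) => m[1]);
}

console.log(firstNumber("=9.9;")); // "9.9"
console.log(firstNumber("a9"));    // null ("a9" is not a number)
console.log(firstNumber(".9"));    // "9" (the leading period is not part of it)
console.log(allNumbers("3 3.1 3.1.4 3a 3.A 3.a1 3_.1"));
// [ "3", "3.1", "3.1.4", "3a", "3.A", "3.a1", "3_.1" ]
```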

Hello m.buettner. Thanks for your answer! Regex: As you can see in this example the regex magically thinks it is okay to suddenly search on new lines (at "match(/r/r/r);"). An issue I keep encountering. –  Kafoso Dec 6 '12 at 10:09
    
Digits: Works beautifully. Indeed your simplified version is better. Thanks. –  Kafoso Dec 6 '12 at 10:18
    
@Kafoso ah right, the line breaks. See my edit. Simply include the line break characters in the negated character class. –  Martin Büttner Dec 6 '12 at 12:05
    
It works. Thank you! :) - Though, I'm still a bit thrown off by the fact that it searches multiple lines without /m. –  Kafoso Dec 6 '12 at 12:21
    
@Kafoso m has nothing to do with matching line breaks or not. All m does is make ^ and $ match at the beginning and ending of lines, respectively. Nothing more nothing less. . will never match a line break (in JavaScript), regardless of whether you use m or not. But a negated character class means "any character except...", and that includes line breaks. –  Martin Büttner Dec 6 '12 at 14:43
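To illustrate the point about m and negated character classes, here is a small sketch. The sample strings and variable names are made up for illustration, and the regexes are written without the s flag.

```javascript
const text = "first\nsecond";

// m only changes where ^ and $ can match.
const caretWithoutM = /^second/.test(text);  // false: ^ matches only at the start of input
const caretWithM = /^second/m.test(text);    // true: ^ also matches after a line break

// The dot does not match a line break here, with or without m.
const dotWithM = /first.second/m.test(text); // false

// A negated character class matches anything not listed, line breaks included.
const negatedClass = /first[^x]second/.test(text); // true

// The same effect on the regex-finding pattern: without \n\r in the class,
// a "regex" can stretch across two lines.
const source = "a / b\nc / d";
const withoutBreaks = /\/(?:[^\/\\]|\\.)+\//.exec(source);          // matches "/ b\nc /"
const withBreaksExcluded = /\/(?:[^\/\\\n\r]|\\.)+\//.exec(source); // null

console.log(caretWithoutM, caretWithM, dotWithM, negatedClass);
console.log(withoutBreaks && withoutBreaks[0], withBreaksExcluded);
```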
