I worked out this regex and it is close to working, but one problem remains: it matches any word followed by more than one period (.).

For example, stuf... gets matched.

How do I limit the regex to "only allow 1 period per set of brackets"?

'#((\w+://)?(\w+\.)([a-z0-9\-/.?=_&%])+)#i'
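To reproduce the problem, here is a minimal sketch that runs the pattern above against a few inputs. Only "stuf..." comes from the question; the other samples and the variable names are illustrative assumptions.

```php
<?php
// Pattern copied verbatim from the question.
$pattern = '#((\w+://)?(\w+\.)([a-z0-9\-/.?=_&%])+)#i';

// "stuf..." is the sample from the question; the rest are hypothetical inputs.
$samples = ['stuf...', 'www.example.com', 'plain'];

$matches = [];
foreach ($samples as $sample) {
    // Store the matched text, or null when the pattern does not match at all.
    $matches[$sample] = preg_match($pattern, $sample, $m) === 1 ? $m[0] : null;
}
print_r($matches);
```

The period inside the final character class is what lets the trailing ".." through: once `(\w+\.)` has consumed "stuf.", the repeated class happily accepts more periods.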
  • \.{1} normally works, but I don't know PHP :-) Commented Jun 28, 2011 at 13:24
  • It's a preg_match to replace links in text, so perhaps a filter wouldn't help? I did try {1}, but when I put it in I get an error: '#((\w+://)?(\w+\.)([a-z0-9\-/.{1}?=_&%])+)#i' Commented Jun 28, 2011 at 13:29
  • Pakito, to which RFC are you referring when asking the question about URLs? Would be good to know the protocol as well. Is this specifically for the http and https protocols? Commented Jun 28, 2011 at 13:50
  • Both, actually. Something that matches as many common URLs as possible, even ftp, hence the \w at the front. Commented Jun 28, 2011 at 13:52

3 Answers

Try this:

'#((\w+://)?(\w+)(\.[a-z0-9\-/?=_&%]+)+)#i'

This requires at least one non-period character after each period.

But I would recommend something like this:

'#((\w+://)?\w+(\.[a-z0-9\-]+)*\.[a-z\-]{2,}(/[\w\-./?=&%]*)?)#i'
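As a rough check of both patterns, the sketch below runs them against a few inputs. The patterns are copied verbatim from this answer; the array keys `simple` and `recommended` and every sample except "stuf..." are assumptions made for illustration.

```php
<?php
// Both patterns are copied verbatim from the answer.
$patterns = [
    'simple'      => '#((\w+://)?(\w+)(\.[a-z0-9\-/?=_&%]+)+)#i',
    'recommended' => '#((\w+://)?\w+(\.[a-z0-9\-]+)*\.[a-z\-]{2,}(/[\w\-./?=&%]*)?)#i',
];

// Hypothetical inputs, except "stuf..." which comes from the question.
$samples = ['stuf...', 'www.example.com', 'http://example.com/page?id=1'];

$results = [];
foreach ($patterns as $name => $pattern) {
    foreach ($samples as $sample) {
        // Keep the matched text, or null when there is no match.
        $results[$name][$sample] = preg_match($pattern, $sample, $m) === 1 ? $m[0] : null;
    }
}
print_r($results);
```

Neither pattern is anchored, so on longer text each one finds the first URL-like substring rather than checking the whole string.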

6 Comments

Unfortunately this accepts URIs like http://.-sample.-com, which are invalid (a dash cannot be the first character).
lol, I tested http://.-sample.-com on Facebook and it was accepted as a valid URL as well.
@pakito: do you want to match a URL in a text or just check its correctness?
Yes, I want to match URLs in text and turn them into hyperlinks. Sorry if I was unclear. Floern: I am currently using your recommended regex and it seems to work well for most URLs. The only things I needed to add at the end are the underscore and the period: #((\w+://)?(\w+)(\.[a-z0-9\-]+)*\.[a-z\-]{2,}(/[\w\-/?=_.&%]*)?)#i
My bad, I added the period, but the underscore is already contained in \w.

This should work:

[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}

The issue in the regex you're using is that the period sits inside a character class that is repeated with "+", so any run of consecutive periods is accepted. The regex I posted here requires the match to end in a single period followed by a two-to-four-letter TLD.

This pattern will successfully match google.com, www.google.com, and any arbitrary number of subdomains.

NOTE: ICANN recently announced that soon they will allow for any top-level domain (e.g. instead of just .com, .org, etc. they will soon allow .whatever), so you may need to adjust the last part of the regex, "{2,4}", since TLDs will soon be of arbitrary length.
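A small sketch of how this pattern behaves, under two assumptions that are not part of the answer: the pattern is wrapped in `#` delimiters for preg_match, and it is anchored with `^` and `$` so each sample has to match in full. All sample strings except "stuf..." are made up for illustration.

```php
<?php
// The answer's pattern, with delimiters and anchors added as an assumption
// so that each sample must match from start to end.
$pattern = '#^[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}$#';

$samples = ['google.com', 'www.google.com', 'stuf...', 'example.museum', 'a..b.com'];

$accepted = [];
foreach ($samples as $sample) {
    $accepted[$sample] = preg_match($pattern, $sample) === 1;
}
var_dump($accepted);
```

With the anchors in place, "example.museum" is rejected because of the `{2,4}` limit on the TLD, while "a..b.com" still passes because the leading character class also contains a period.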

2 Comments

Yes thanks normalo, that is (tld) something I am trying to avoid as well.
Ah, didn't know .museum was already in use. Well, that makes it a bit more challenging, because now you're essentially looking for any.combination.of.valid.characters.delimeted.by.periods.without.spaces.and.not.ending.in.a.punctuation.mark :S The challenge comes from the increase of possible false positive matches. I guess you'll just have to try it and see what happens.

Well, if you want to validate URLs, why not use parse_url()? I think it's tricky to create a general regex that covers so many varied URL forms.
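A minimal sketch of that idea, assuming a hypothetical helper named `looksLikeUrl` (not a PHP built-in). Note that parse_url() only splits a string into components, so this is a loose structural check rather than real validation, and it only covers a single candidate string, not URLs embedded in longer text.

```php
<?php
// Hypothetical helper: treats a string as a URL only if parse_url() finds
// both a scheme and a host. A loose structural check, not validation.
function looksLikeUrl(string $candidate): bool
{
    $parts = parse_url($candidate);
    return is_array($parts) && isset($parts['scheme'], $parts['host']);
}

var_dump(looksLikeUrl('http://www.example.com/path?x=1'));
var_dump(looksLikeUrl('stuf...'));
```

Scheme-less input such as www.example.com would not pass this check, because parse_url() reports it as a path rather than a host.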
