Tell me more ×
Stack Overflow is a question and answer site for professional and enthusiast programmers. It's 100% free, no registration required.

I'm currently doing a site search like : site:somedomain.com into BING using Python and Mechanize.

It is submitting fine to bing and returning an output - looks like Json? I can't seem to figure out a good way to further parse the results. Is it JSON?

I'm getting an output like:

Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=478', text='SomeSite -  Professor Rating of Louis Scerbo', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=478'), ('h', 'ID=SERP,5105.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=527', text='SomeSite -  Professor Rating of Jahan \xe2\x80\xa6', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=527'), ('h', 'ID=SERP,5118.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=645', text='SomeSite -  Professor Rating of David Kutzik', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=645'), ('h', 'ID=SERP,5131.1')])

I want to get all the urls like:

http://www.somesite.com/prof.php?pID=478
http://www.somesite.com/prof.php?pID=527
http://www.somesite.com/prof.php?pID=645

and so on, so the url attribute within

How can I further do this with mechanize within my code? Keep in mind, some urls in the future might look like:

http://www.anothersite.com/dir/dir/dir/send.php?pID=100

Thank you !

share|improve this question
If you're using Microsoft's Azure API to get Bing results, you may append "format=JSON" as GET param to your request URL. Then you'll receive a JSON string as response object. – Simon Steinberger Aug 8 at 13:00
Can you show a brief example of just the url? Does it matter when format=JSON is appended? – CodeTalk Aug 8 at 13:02
Do you have any idea what kind of output that is in my above question? I cannot find it for the life of me – CodeTalk Aug 8 at 13:05
Using Python requests (no indenting possible as comment): req = requests.get(u'api.datamarket.azure.com/Data.ashx/Bing/SearchWeb/…; % urllib.quote(q.encode('utf8'), ''), auth=('', u'YOU_API_KEY')) results = req.json()['d']['results'] – Simon Steinberger Aug 8 at 13:06
That looks horrible as a comment ... I'll create a proper answer, even if it's not 100% accurate to the question ... – Simon Steinberger Aug 8 at 13:07
show 2 more commentsadd comment (requires an account with 50 reputation)

2 Answers

up vote 1 down vote accepted

Well mechanize is more a browser like package for Python, for parsing HTML/XML I would recommend Lxml, you can feed that data to lxml and look for urls. Another option is to use regular expressions to look for urls, this approach would be more flexible.

import re 
url_regex = re.compile('http:[^\']+')
urls = re.findall(url_regex, html_text)

Edit:

Well instead of printing output, just pass output instead of html_text in re.findall() and then print urls

share|improve this answer
Can you show this incorporated with my code? – CodeTalk Aug 8 at 1:21
@CodeTalk edited, its really simple, you gotta play with it, findall() receives a regular expression and a string to look into. Edited the answer. – PepperoniPizza Aug 8 at 2:46
Thank you! I will have to play around with this a bit I guess – CodeTalk Aug 8 at 12:49
One question - is that output I provided above html though? – CodeTalk Aug 8 at 12:50
add comment (requires an account with 50 reputation)

Using Microsoft's Azure Datamarket API with Python requests, you can request JSON strings directly:

import requests, urllib
q = u'Hello World'
q = urllib.quote(q.encode('utf8'), '')
req = requests.get(
    u'https://api.datamarket.azure.com/Data.ashx/Bing/SearchWeb/Web?$format=JSON&Query=%%27%s%%27' % q,
    auth=('', u'YOU_API_KEY')
)
# print req.json()
results = req.json()['d']['results']
list_of_urls = [ r['Url'] for r in results]        

Depending on your input data you may or may not need the .encode('utf8') part of "q". The "site:xy.com" query should work, too, but I didn't test this. Additionally, we occasionally had some strange encodings returned from Bing ... so we had to re-encode returned URLs like so:

url = r['Url'].encode('latin1')

But those were really special cases ...

You need to register for the Azure API (free of charge) and up to 5000 Bing search requests per month are free: http://datamarket.azure.com/dataset/bing/search

There are several params to fine tune your results: http://datamarket.azure.com/dataset/bing/search#schema

share|improve this answer
Are you paid by MS? lol – CodeTalk Aug 8 at 13:34
Admitted, I kind of sound like that :-) But no, we originally used Google's search API, but that has turned into "paid only". Bing is (to my knowledge) the only viable search engine that still provides a free API - even if limited in requests. – Simon Steinberger 2 days ago
It isn't free anymore. – CodeTalk 2 days ago
Read the links I've posted: it is free up to 5,000 requests per month and we're currently using it in our projects. – Simon Steinberger 2 days ago
Does a request = 1 extract from one search result? e.g. the first page for a given query and each additional page is another request ? – CodeTalk 2 days ago
show 1 more commentadd comment (requires an account with 50 reputation)

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.