I'm building a CLI for Google searches for my own use. I've been using Node.js for a little while on a chatbot, so I'd like to get this working in Node.js too. I can pull the data just fine, and I end up with a string that contains all the HTML from the page. It's even easy to see in the HTML which results I want:

<div class="jd"><a class="p" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://leagueoflegends.com/&amp;ved=0CBAQFjAA&amp;usg=AFQjCNEEnWGHwxNnuwKenqm4ajKfTM6Xxw" ><b>League of Legends</b> - Free Online Game | <b>LoL</b> - <b>League of Legends</b></a> </div> <div class="kd">3 days ago&nbsp;… Official website for <b>League of Legends</b>. Join millions of players in an award   winning Multiplayer Online Battle Arena. </div> <div class="qdlmxn"><a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://leagueoflegends.com/board&amp;ved=0CBEQ0gIoADAA&amp;usg=AFQjCNHpmmAdFFbTgm8C_gJvsjVhMzVKUQ" >Community</a> - <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://signup.leagueoflegends.com/en/signup/redownload&amp;ved=0CBIQ0gIoATAA&amp;usg=AFQjCNFHGUtn4ItgQIzODgIZRv_237Mq0A" >PVP.NET</a> - <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://na.leagueoflegends.com/board/forumdisplay.php?f%3D2&amp;ved=0CBMQ0gIoAjAA&amp;usg=AFQjCNHpycJ8WGh7xvWw1qNu8NjjU1EA0Q" >General Discussion</a> - <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://na.leagueoflegends.com/champions&amp;ved=0CBQQ0gIoAzAA&amp;usg=AFQjCNEpeBzNefwag5xmkFcFhCW27FoAew" >Champions</a> </div><span class="c">leagueoflegends.com/</span> -  <div class="txnles" onclick="_popup('web_result_popup_10836585','inline');"> <div class="wx4xyp" id="web_result_popup_10836585"> <div class="vfc7iu"><a class="s" href="/search?q=cache:GCRD1wy5e3QJ:leagueoflegends.com/" >Cached</a> <br/><a class="s" href="/m/?q=related:leagueoflegends.com/&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;ved=0CBYQHzAA" >Similar</a> <br/><a class="s" href="/gwt/x?q=lol&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;hl=en&amp;source=m&amp;u=http://leagueoflegends.com/" >Mobile formatted</a> </div> </div><a class="s" href="javascript:void(0)" >Options</a> <div class="m6u8fq"> </div> </div> </div> </div> <div> <div class="r ld"> <div class="jd"><a class="p" 
href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://en.wikipedia.org/wiki/LOL&amp;ved=0CBgQFjAB&amp;usg=AFQjCNFOhgg5Y2E5SFuS5I-8830OJ9VR9Q" ><b>LOL</b> - Wikipedia, the free encyclopedia</a> </div> <div class="kd"><b>LOL</b>, an abbreviation for <b>laughing out loud</b>, or <b>laugh out loud</b>, is a common   element of Internet slang. It was used historically on Usenet&nbsp;… </div><span class="c">en.wikipedia.org/wiki/LOL</span> -  <div class="txnles" onclick="_popup('web_result_popup_30597472','inline');"> <div class="wx4xyp" id="web_result_popup_30597472"> <div class="vfc7iu"><a class="s" href="/search?q=cache:mhIpOeXQp38J:en.wikipedia.org/wiki/LOL" >Cached</a> <br/><a class="s" href="/m/?q=related:en.wikipedia.org/wiki/LOL&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;ved=0CBkQHzAB" >Similar</a> <br/><a class="s" href="/gwt/x?q=lol&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;hl=en&amp;source=m&amp;u=http://en.wikipedia.org/wiki/LOL" >Mobile formatted</a> </div> </div><a class="s" href="javascript:void(0)" >Options</a> <div class="m6u8fq"> </div> </div> </div> </div> <div> <div class="r ld">

Anything that is .jd is a result, so I first need to separate those out, then work on getting the URLs and the descriptions separated out, too. I've never done string manipulation this extensive, so I have no idea where to start.

Here's the HTML in a more readable format. Keep in mind, though, that I'm dealing with just one long string.

<div>
  <div class="r ld">
    <div class="jd">
      <a class="p" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://leagueoflegends.com/&amp;ved=0CBAQFjAA&amp;usg=AFQjCNEEnWGHwxNnuwKenqm4ajKfTM6Xxw" >
        <b>League of Legends</b> - Free Online Game | <b>LoL</b> - <b>League of Legends</b>
      </a>
    </div>
    <div class="kd">
      3 days ago&nbsp;… Official website for <b>League of Legends</b>. Join millions of players in an award   winning Multiplayer Online Battle Arena. 
    </div>
    <div class="qdlmxn">
      <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://leagueoflegends.com/board&amp;ved=0CBEQ0gIoADAA&amp;usg=AFQjCNHpmmAdFFbTgm8C_gJvsjVhMzVKUQ" >Community</a> - 
      <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://signup.leagueoflegends.com/en/signup/redownload&amp;ved=0CBIQ0gIoATAA&amp;usg=AFQjCNFHGUtn4ItgQIzODgIZRv_237Mq0A" >PVP.NET</a> - 
      <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://na.leagueoflegends.com/board/forumdisplay.php?f%3D2&amp;ved=0CBMQ0gIoAjAA&amp;usg=AFQjCNHpycJ8WGh7xvWw1qNu8NjjU1EA0Q" >General Discussion</a> - 
      <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://na.leagueoflegends.com/champions&amp;ved=0CBQQ0gIoAzAA&amp;usg=AFQjCNEpeBzNefwag5xmkFcFhCW27FoAew" >Champions</a> 
    </div>
    <span class="c">leagueoflegends.com/</span> -  
    <div class="txnles" onclick="_popup('web_result_popup_10836585','inline');">
      <div class="wx4xyp" id="web_result_popup_10836585"> <div class="vfc7iu">
        <a class="s" href="/search?q=cache:GCRD1wy5e3QJ:leagueoflegends.com/" >Cached</a> 
        <br/>
        <a class="s" href="/m/?q=related:leagueoflegends.com/&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;ved=0CBYQHzAA" >Similar</a> 
        <br/>
        <a class="s" href="/gwt/x?q=lol&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;hl=en&amp;source=m&amp;u=http://leagueoflegends.com/" >Mobile formatted</a> 
      </div> 
    </div>
    <a class="s" href="javascript:void(0)" >Options</a> 
    <div class="m6u8fq"> </div> 
    </div> 
  </div> 
</div> 
<div> 
  <div class="r ld"> 
    <div class="jd">
      <a class="p" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://en.wikipedia.org/wiki/LOL&amp;ved=0CBgQFjAB&amp;usg=AFQjCNFOhgg5Y2E5SFuS5I-8830OJ9VR9Q" >
        <b>LOL</b> - Wikipedia, the free encyclopedia
      </a>
    </div>
    <div class="kd">
      <b>LOL</b>, an abbreviation for <b>laughing out loud</b>, or <b>laugh out loud</b>, is a common   element of Internet slang. It was used historically on Usenet&nbsp;… 
    </div>
    <span class="c">en.wikipedia.org/wiki/LOL</span> -  
    <div class="txnles" onclick="_popup('web_result_popup_30597472','inline');"> 
      <div class="wx4xyp" id="web_result_popup_30597472"> 
        <div class="vfc7iu">
          <a class="s" href="/search?q=cache:mhIpOeXQp38J:en.wikipedia.org/wiki/LOL" >Cached</a> 
          <br/>
          <a class="s" href="/m/?q=related:en.wikipedia.org/wiki/LOL&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;ved=0CBkQHzAB" >Similar</a> 
          <br/>
          <a class="s" href="/gwt/x?q=lol&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;hl=en&amp;source=m&amp;u=http://en.wikipedia.org/wiki/LOL" >Mobile formatted</a> 
        </div> 
      </div>
      <a class="s" href="javascript:void(0)" >Options</a> 
      <div class="m6u8fq"> </div> 
    </div> 
  </div> 
</div>
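
As a rough illustration of the extraction described above, here is a minimal sketch that works directly on the one-line string using only regular expressions and built-ins. It assumes the class names from the sample (`jd`, `p`, `kd`) and that each title div is immediately followed by its description div; the function names are made up for this example, and a change in Google's markup would break it. A real HTML parser (such as the jsdom suggestion in the comments) would be more robust.

```javascript
// Decode the handful of HTML entities that appear in the sample.
function decodeEntities(text) {
  return text
    .replace(/&nbsp;/g, ' ')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&'); // last, so "&amp;lt;" is not decoded twice
}

// Drop tags, decode entities, collapse runs of whitespace.
function toPlainText(html) {
  return decodeEntities(html.replace(/<[^>]*>/g, ''))
    .replace(/\s+/g, ' ')
    .trim();
}

// The href is a redirect such as /m/url?ei=...&q=http://example.com/&ved=...
// The real target is the value of its "q" parameter.
function targetUrl(href) {
  const decoded = decodeEntities(href);
  const start = decoded.indexOf('?');
  const query = start === -1 ? '' : decoded.slice(start + 1);
  return new URLSearchParams(query).get('q');
}

// Find every title div ("jd") followed by its description div ("kd").
function parseResults(html) {
  const pattern = /<div class="jd">\s*<a class="p"\s+href="([^"]*)"[^>]*>([\s\S]*?)<\/a>\s*<\/div>\s*<div class="kd">([\s\S]*?)<\/div>/g;
  const results = [];
  let match;
  while ((match = pattern.exec(html)) !== null) {
    results.push({
      title: toPlainText(match[2]),
      url: targetUrl(match[1]),
      description: toPlainText(match[3]),
    });
  }
  return results;
}
```

Calling `parseResults(pageHtml)` on the string above should give one object per result, each with `title`, `url` and `description` fields ready to print in the terminal.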
Why have you pasted a single, monstrous line of HTML? Do you think we can read that easily or something? Please format it into a usable state. – Bojangles Mar 20 '12 at 1:07
That's exactly what I get, that's what I'm working with. Why would I post something that isn't relevant to the question? – Rob Mar 20 '12 at 1:58
why not use jsDom and manipulate the DOM nodes? – papirtiger Mar 20 '12 at 2:00
I'm planning on having this as a commandline tool, so I can just pop over to my terminal and say "google 'terms'" and have it return me results, and eventually move it to the chatbot. – Rob Mar 20 '12 at 2:02
@JamWaffles I don't see how being able to read it more easily will help, but there it is. – Rob Mar 20 '12 at 2:11

1 Answer

up vote 0 down vote accepted

Luxun led me back to the Google API, and I found out how to search full web results instead of just included sites here: http://support.google.com/customsearch/bin/answer.py?hl=en&answer=1210656

To create a search engine that searches the entire web:

1. From the Google Custom Search homepage, click Create a Custom Search Engine.
2. Type a name and description for your search engine.
3. Under Define your search engine, in the Sites to Search box, enter at least one valid URL (e.g. www.google.com).
4. Select the CSE edition you want and accept the Terms of Service, then click Next. Select the layout option you want, and then click Next.
5. Click any of the links under the Next steps section to navigate to your Control panel.
6. In the left-hand menu, under Control Panel, click Basics.
7. In the Search Preferences section, select Search the entire web but emphasize included sites.
8. Click Save Changes.
9. In the left-hand menu, under Control Panel, click Sites.
10. Delete the site you entered during the initial setup process.
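
Once the engine exists, it can be queried from Node.js without scraping any HTML. The sketch below assumes the Custom Search JSON API (the answer above does not name a specific endpoint), an API key, and the engine's `cx` ID; the function names and the environment variable names in the usage comment are made up for illustration. Error responses from the API are not handled beyond returning an empty list.

```javascript
const https = require('https');

// Build the request URL. apiKey and engineId (the "cx" value of the engine
// created in the steps above) are values you supply yourself.
function buildSearchUrl(apiKey, engineId, terms) {
  const params = new URLSearchParams({ key: apiKey, cx: engineId, q: terms });
  return 'https://www.googleapis.com/customsearch/v1?' + params.toString();
}

// Turn a parsed response into printable lines. Each entry in "items" carries
// "title", "link" and "snippet"; "items" is missing when nothing matched.
function formatResults(response) {
  return (response.items || []).map(
    (item, i) => `${i + 1}. ${item.title}\n   ${item.link}\n   ${item.snippet}`
  );
}

// Fetch and format results; callback receives (error, lines).
function search(apiKey, engineId, terms, callback) {
  https.get(buildSearchUrl(apiKey, engineId, terms), (res) => {
    let body = '';
    res.setEncoding('utf8');
    res.on('data', (chunk) => { body += chunk; });
    res.on('end', () => {
      let parsed;
      try {
        parsed = JSON.parse(body);
      } catch (err) {
        return callback(err);
      }
      callback(null, formatResults(parsed));
    });
  }).on('error', callback);
}

// Hypothetical command-line usage (environment variable names are assumptions):
// search(process.env.GOOGLE_API_KEY, process.env.GOOGLE_CSE_ID,
//   process.argv.slice(2).join(' '),
//   (err, lines) => console.log(err ? err.message : lines.join('\n\n')));
```

Keeping the URL building and the formatting separate from the network call means both can be checked without an API key, and the same `formatResults` output can later be reused by the chatbot.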