There are several similar questions here on Stack, but I can't get any of the answers working for me. I'm completely new to Node and the idea of asynchronous programming, so please bear with me.
I'm building a scraper that currently has a 4-step process:

- I give it a collection of links.
- It goes to each of these links and finds all the relevant `img src` attributes on the page.
- It finds the "next page" link, gets its `href`, retrieves the DOM from said `href`, and repeats step #2.
- All of these `img src` values are put into an array and returned.
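To make the intent concrete, here's a stripped-down, synchronous sketch of the flow I'm after, using a made-up in-memory `pages` object in place of real HTTP requests and cheerio:

```javascript
// Hypothetical in-memory "site": each page has its image sources and
// an optional link to the next page. Stands in for HTTP + cheerio.
var pages = {
    "/p1": { imgs: ["a.jpg", "b.jpg"], next: "/p2" },
    "/p2": { imgs: ["c.jpg"], next: null }
};

// Synchronous version of the 4-step process: follow "next" links,
// collecting every img src into one array.
function collectImages(startUrl) {
    var urls = [];
    var url = startUrl;
    while (url) {
        var page = pages[url];
        page.imgs.forEach(function(src) { urls.push(src); });
        url = page.next; // null on the last page ends the loop
    }
    return urls;
}
```

This is trivial when everything is synchronous; the trouble starts when each "fetch" happens in a callback.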
Here's the code. `scrape` can be called asynchronously, but the `while` loop in `getLinks` currently cannot:
```javascript
function scrape(url, oncomplete) {
    console.log("Scrape Function: " + url);
    request(url, function(err, resp, body) {
        if (err) {
            console.log("UHOH");
            throw err;
        }
        var html = cheerio.load(body);
        oncomplete(html);
    });
}
```
```javascript
function getLinks(url, prodURL, baseURL, next_select) {
    var urls = [];
    while (url) {
        console.log("GetLinks Indexing: " + url);
        scrape(url, function(data) {
            $ = data;
            $(prodURL).each(function() {
                var theHref = $(this).attr('href');
                urls.push(baseURL + theHref);
            });
            next = $(next_select).first().attr('href');
            url = next ? baseURL + next : null;
        });
    }
    console.log(urls);
    return urls;
}
```
At present this goes into an infinite loop without scraping anything. If I put `url = next ? baseURL + next : null;` outside of the callback, I get a `next is not defined` error.
Any ideas on how I can rework this to make it Node-friendly? It seems like, by this problem's very nature, it needs to be blocking, no?
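For what it's worth, here's the recursive, callback-based shape I've been experimenting with instead of the `while` loop. The `fakeScrape` function and its page data are stand-ins I made up so I can test the control flow without network access — I'm not sure this is the idiomatic way:

```javascript
// Fake async scrape for testing: calls back with hypothetical page
// data on the next tick instead of doing a real HTTP request.
var fakePages = {
    "/p1": { hrefs: ["/img1.jpg"], next: "/p2" },
    "/p2": { hrefs: ["/img2.jpg"], next: null }
};

function fakeScrape(url, oncomplete) {
    setImmediate(function() { oncomplete(fakePages[url]); });
}

// Recursive replacement for the while loop: each call scrapes one
// page, then either recurses into the next page or hands the
// accumulated urls to the final callback.
function getLinks(url, baseURL, urls, done) {
    if (!url) return done(urls);
    fakeScrape(url, function(page) {
        page.hrefs.forEach(function(href) {
            urls.push(baseURL + href);
        });
        getLinks(page.next, baseURL, urls, done);
    });
}

// Usage: kick off the chain with an empty accumulator; the final
// callback fires only after the last page reports no "next" link.
getLinks("/p1", "http://example.com", [], function(all) {
    console.log(all);
});
```

The idea is that "continue the loop" becomes "call yourself from inside the callback," so nothing ever blocks.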