I'm using Anemone to spider a website, and then applying a set of rules specific to that site to extract certain attributes.
It feels like it should be simple, but every attempt I make to save the products into arrays ends up looking very messy.
The rules are different for each site (for now the script simply grabs site 1 from the DB), and rule.name should become the column name.
Any ideas for a good way to store this data (not in a DB)? My array pushing seemed horrible.
My way of storing goes like this: I have two hashes (entity and product) and an array (array). I loop through the rules, building up entity, which I merge into product after each successful iteration. I then push product onto array before moving on to the next page.
As I said, it seems and feels crappy. I would like to add a Product model with a method that sets variable keys on a hash, but I'm not certain about it.
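Something along these lines is what I had in mind (a hypothetical plain-Ruby Product class, name and methods made up by me, not from any library):

```ruby
# Hypothetical Product model: collects rule-driven attributes under
# dynamic keys, so rule names become "columns" without juggling two hashes.
class Product
  attr_reader :attributes

  def initialize(url)
    # Every product at least knows the page it came from.
    @attributes = { url: url }
  end

  # Set a column named after the rule at runtime.
  def set_attribute(name, value)
    @attributes[name.to_sym] = value
  end
end

product = Product.new("http://example.com/item-1")
product.set_attribute("price", "9.99")
product.attributes # => { url: "http://example.com/item-1", price: "9.99" }
```

But I don't know if wrapping a hash like this is actually any cleaner than what I have now.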
desc "Crawl client site"
task :crawl => :environment do
  require 'anemone'

  @client = Client.find(1)
  @rules  = @client.rules
  array = []

  Anemone.crawl(@client.url) do |anemone|
    anemone.on_every_page do |page|
      product = { url: page.url }
      entity = {}

      @client.rules.each do |rule|
        # page.doc.xpath always returns a NodeSet (never nil), so test
        # empty? instead; at_css/at_xpath would be the nil-returning variants.
        if !page.doc.xpath(rule.rule).empty? || !rule.required?
          entity[rule.product_attribute.name] = page.doc.xpath(rule.rule).remove
          product.merge!(entity)
        else
          # Not a product page: discard this product and move on to the next page.
          product = nil
          break
        end
      end

      array.push(product) if product
    end
  end

  puts array
end