Code Review Stack Exchange is a question and answer site for peer programmer code reviews. Join them; it only takes a minute:

Sign up
Here's how it works:
  1. Anybody can ask a question
  2. Anybody can answer
  3. The best answers are voted up and rise to the top

I wrote this to scrape album review data from AOTY into a spreadsheet. Check it out and let me know what I could've done better.

Option Explicit

Const classAlbum As String = "listLargeTitle"
Const classScore As String = "listScoreContainer"
Const classRating As String = "listScoreValueContainer"
Const classReview As String = "listScoreText"
Const classNext As String = "pageSelect"

Sub ScrapeAOTY()

    Dim IE As MSXML2.XMLHTTP60
    Dim HTMLDoc As MSHTML.HTMLDocument
    Dim HTMLBody As MSHTML.HTMLBody
    Dim Albums As Object
    Dim Ratings As Object
    Dim Reviews As Object
    Dim MetaData As Object
    Dim InnerMeta As Object
    Dim nextPage As Object
    Dim selection As Object
    Dim sheet As Worksheet
    Dim aotyYear As String
    Dim url As String
    Dim bookmark As String
    Dim album As String
    Dim review As String
    Dim releaseDate As String
    Dim genre As String
    Dim test As String
    Dim rating As Double
    Dim index As Long
    Dim row As Long
    Dim page As Long
    Dim pageLoading As Boolean
    Dim exitFlag As Boolean

    Application.ScreenUpdating = False
    Application.Calculation = xlCalculationManual

    aotyYear = InputBox("Input the year you're scraping")
    If aotyYear < 1900 Then Exit Sub
    url = "http://www.albumoftheyear.org/ratings/6-highest-rated/" & aotyYear & "/1"
    Set sheet = ThisWorkbook.Worksheets("ScrapeSheet")
    row = 2
    exitFlag = False
    page = 1

    Do
        Set IE = New MSXML2.XMLHTTP60
        url = "http://www.albumoftheyear.org/ratings/6-highest-rated/" & aotyYear & "/" & page
        IE.Open "GET", url, False
        IE.send

        Do While IE.ReadyState <> 4
            DoEvents
        Loop

        Set HTMLDoc = New MSHTML.HTMLDocument
        Set HTMLBody = HTMLDoc.body
        HTMLBody.innerHTML = IE.responseText
        Set Albums = HTMLDoc.getElementsByClassName(classAlbum)
        Set Ratings = HTMLDoc.getElementsByClassName(classRating)
        Set Reviews = HTMLDoc.getElementsByClassName(classReview)
        Set MetaData = HTMLDoc.getElementsByClassName(classScore)

        For index = 0 To Albums.Length - 1
            Set InnerMeta = MetaData(index).parentElement.parentElement
            Set InnerMeta = InnerMeta.getElementsByTagName("div")

            album = Albums(index).innerText
            rating = CDbl(Ratings(index).getAttribute("title"))
            review = Reviews(index).innerText
            releaseDate = InnerMeta(0).innerText

            If InnerMeta.Length < 10 Then
                genre = ""
            Else
                genre = InnerMeta(1).innerText
            End If

            sheet.Cells(row, 1) = Mid(album, InStr(1, album, " ") + 1, InStr(1, album, " - ") - InStr(1, album, ".") - 2)
            sheet.Cells(row, 2) = Right(album, Len(album) - InStr(1, album, " - ") - 2)
            sheet.Cells(row, 3) = rating
            sheet.Cells(row, 4) = Left(review, Len(review) - 8)
            sheet.Cells(row, 5) = releaseDate
            sheet.Cells(row, 6) = genre

            row = row + 1
        Next index

        Set nextPage = HTMLDoc.getElementsByClassName(classNext)

        If nextPage.Length = 2 _
        Or nextPage(0).innerText = "Next >" Then
            page = page + 1
        Else
            exitFlag = True
        End If

        Set IE = Nothing
    Loop Until exitFlag = True

    MsgBox ("Done!")

    Application.ScreenUpdating = True
    Application.Calculation = xlCalculationAutomatic

End Sub
share|improve this question
3  
I wouldn't name a variable "sheet" or "selection" – Raystafarian Dec 2 '15 at 16:49

I can only point out cosmetic issues (I don't know sqat about VBA), I hope another reviewer will pass by and cover other aspects.

Unused variables

These variables seem to be unused, so why not remove them:

Dim selection As Object
Dim bookmark As String
Dim test As String
Dim pageLoading As Boolean

Magic string

This string appears twice:

"http://www.albumoftheyear.org/ratings/6-highest-rated/"

It would be better to define it one place, so you can change it one place if needed.

Readability

This block:

aotyYear = InputBox("Input the year you're scraping")
If aotyYear < 1900 Then Exit Sub
url = "http://www.albumoftheyear.org/ratings/6-highest-rated/" & aotyYear & "/1"
Set sheet = ThisWorkbook.Worksheets("ScrapeSheet")
row = 2
exitFlag = False
page = 1

Would be more readable if you break a line after the early return:

aotyYear = InputBox("Input the year you're scraping")
If aotyYear < 1900 Then Exit Sub

url = "http://www.albumoftheyear.org/ratings/6-highest-rated/" & aotyYear & "/1"
Set sheet = ThisWorkbook.Worksheets("ScrapeSheet")
row = 2
exitFlag = False
page = 1

It might be a matter of taste, but I think this would be more readable if the condition was not split, on a single line.

    If nextPage.Length = 2 _
    Or nextPage(0).innerText = "Next >" Then
        page = page + 1
    Else
        exitFlag = True
    End If
share|improve this answer
2  
Good catch with the unused/unassigned variables - the only thing Rubberduck picks up that you didn't (apart from a false positive with sheet being reported as not used), is that the ScrapeAOTY member is implicitly public. Note that the selection local variable hides the Selection object defined by the Excel object model. Nice answer! – Mat's Mug Dec 2 '15 at 19:39

First off, I just want to say that's a pretty solid program there. Compared to most reviews I do, most of these points are fairly minor. That said, let's begin.


Naming

Don't ever use protected names for your variables. Protected names are names that are already used by the VBA object model, such as selection or sheet. They can and will cause chaos and confusion down the road, not to mention the mistakes the compiler might make if it gets confused.

Your naming is not consistent. Some variables have leading capitals, some do not.

The general convention for VBA is:

camelCase for procedure-Level Variables
PascalCase for Module and Global-Level variables
SHOUTY_SNAKE_CASE for constants

you don't have to use these specifically, so long as you do pick a convention and apply it consistently.

Other than that, there are a couple of variable names that could be even clearer, but your naming is pretty solid. The variables I think could be clearer: pageLoading --> pageIsLoading, page --> pageNum, IE to something much clearer. I know, it's obvious here what it is, but variable names should always be able to stand on their own without additional context.


Other bits and pieces

I personally prefer to put my loop conditions at the start of my loops rather than the end. The end can be much easier to miss and this way, it means you know what to look out for while you're reading through it.

You may want to dim similar variables on the same line to make it less wall-of-text-like. Like so:

dim albums as object, ratings as object, reviews as object
share|improve this answer
2  
Good points, but determining where to put the condition on a loop very much comes down to how you want to use it. A loop that you know you want to run 1+ times, is naturally checked at the end. For loops that run 0+ times a check initially is natural. Since the OPs code actually determines whether or not to continue based upon data obtained within the loop a check at the end makes more sense. – Daniel Dec 4 '15 at 14:34

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.