2f8741c27326dcd6d1809db3874313b6

perfection is reached not when there is nothing left to add, but when there is nothing left to take away...

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
require 'rubygems'
require 'net/http'
require 'uri'

def getbody(letter, pagenumber)
  url = URI.parse('http://www.somesite.com/')
  res = Net::HTTP.start(url.host, url.port) { |http| 
    http.get("/browse.php?character=#{letter}&page=#{pagenumber}")
  }
  gbody = res.body
  slice1 = gbody.index("columns")
  slice2 = gbody.rindex("<center>")
  gbody = gbody.slice(slice1, (slice2-slice1+150))
  return gbody
end

def getpages(htmlbody)
  pagebody = htmlbody.slice(htmlbody.index("<center>"),50)
  slice1 = pagebody.index("<b>")
  slice2 = pagebody.index("</b>")
  pages = pagebody.slice(slice1+3, (slice2-slice1-3))
  pages = Integer(pages)
  return pages
end

alphabet = %w(a b c d e f g h i j k l m n o p q r s t u v w x y z *)

alphacounter = 0;
allwords = Array.new
out = File.new('wordlist.html', 'w')
out.puts "<html>"

while alphacounter < alphabet.length
  counter = 1
  cycles = 1
  while counter <= cycles
    body = getbody(alphabet[alphacounter],(counter))
    if counter == 1
      if alphabet[alphacounter] == "s"
        cycles = 1010
      else
        cycles = getpages(body)
      end
      puts "cycles: #{cycles}"
    end
    counter += 1
    wordbody = body.slice(0,body.index("</table>"))
    peices = Array.new
    innercounter = 0;
    while (wordbody.include? "<a href")
      slice1 = wordbody.index("<a href")
      slice2 = wordbody.index("</a>")
      peices[innercounter] = wordbody.slice(slice1,(slice2-slice1+4))
      innercounter += 1
      wordbody = wordbody.slice((slice2+4),wordbody.length)
    end
    peices.compact!
    i = 0
    while i < peices.length
      peices[i].insert((peices[i].index("=")+2),"http://www.somesite.com")
      i += 1
    end
    peices.each do |item|
      out.puts "#{item} <br>"
    end
    puts "Letter: #{alphabet[alphacounter]}     page: #{counter}     total pages: #{cycles}"
  end
  alphacounter += 1  
end

out.puts "</html>"


Refactorings

No refactoring yet !

880cbab435f00197613c9cc2065b4f5a

danielharan

May 31, 2008, May 31, 2008 17:27, permalink

No rating. Login to rate!

You're mixing too many concerns. In particular, separating the scraping from the processing is usually a huge win for readability; I save fetched pages in /tmp/cache. That way if I mess up the retrieval of data from the HTML (which happens, oh, always), I don't have to hit their servers again.

If there are two levels of pages to retrieve - search listings and actual data pages, you can write two different scrapers.

1
2
3
4
# why does the alphabet include '*' anyways?
(('a'..'z').to_a << '*').each do |letter|
  `curl|request somesite.com/search?q=#{letter} to /tmp/cache/searches/#{letter}`
end
F22c48808ddcb56564889bb39f755717

elliottcable

June 9, 2008, June 09, 2008 16:21, permalink

No rating. Login to rate!

You might also want to look at Tempfile: http://www.ruby-doc.org/stdlib/libdoc/tempfile/rdoc/index.html

Using built-in Ruby tools > throwing it out to a system call d-:

1
2
3
Tempfile.new do |f|
  # Grab page, and save it here - then pass the tempfile on to the next section
end

Your refactoring





Format Copy from initial code

or Cancel