1 2 3 4 5 6 7 8
data = data.gsub("\xA0", " ") data = data.gsub("\x99", "(TM)") data = data.gsub("\x97", "-") data = data.gsub(/\xE9/, "e") data = data.gsub(/\xBD/, "1/2") data = data.gsub(/\xBE/, "3/4") data = data.gsub(/\xBC/, "1/4") # Continue with another 15 search/replace pairs
Refactorings
No refactoring yet !
chrismo
August 27, 2008, August 27, 2008 03:40, permalink
I took a stab at 'optimizing' this, by only going through the data string once - but this new version performs terribly - even just going over it once takes much longer than your version 200 times in a row -- probably because the gsub method would call into C code, whereas my new version is a bunch of Ruby calls.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
def process_orig(data) data = data.gsub("\xA0", " ") data = data.gsub("\x99", "(TM)") data = data.gsub("\x97", "-") data = data.gsub(/\xE9/, "e") data = data.gsub(/\xBD/, "1/2") data = data.gsub(/\xBE/, "3/4") data = data.gsub(/\xBC/, "1/4") return data end def process_one_pass(data) subs = { "\xA0" => " ", "\x99" => "(TM)", "\x97" => "-", "\xE9" => "e", "\xBD" => "1/2", "\xBE" => "3/4", "\xBC" => "1/4", } result = "" data.each_byte { |ch| append = nil subs.each_pair { |find, repl| append = repl if (ch.chr == find) } result << ((append.nil?) ? ch : append) } return result end if __FILE__ == $0 require 'profile' require 'test/unit' class TestProcess < Test::Unit::TestCase def test_process_orig filler = "*" * 10000 (1..200).each do assert_equal(" (TM)-e1/23/41/4" + filler, process_orig("\xA0\x99\x97\xE9\xBD\xBE\xBC" + filler)) end end def test_process_one_pass filler = "*" * 10000 (1..1).each do assert_equal(" (TM)-e1/23/41/4" + filler, process_one_pass("\xA0\x99\x97\xE9\xBD\xBE\xBC" + filler)) end end end end
chrismo
August 27, 2008, August 27, 2008 03:41, permalink
Probably then just a version that, as you said, puts the search/replace values into a hash and calls gsub in a loop would be a bit cleaner.
Adam
August 27, 2008, August 27, 2008 04:07, permalink
Why not use ActiveSupport::Multibyte::Chars instead?
1
data.chars.normalize!
John Sietsma
August 27, 2008, August 27, 2008 05:29, permalink
Use the speed of the dictionary hash lookup rather then an incremental search.
Your code:
Finished in 32.422 seconds.
With these changes:
Finished in 6.25 seconds.
1 2 3 4 5
result = "" data.each_byte { |ch| result << ( subs[ch.chr] || ch ) } return result
kenr
August 27, 2008, August 27, 2008 11:16, permalink
Thanks Guys!
The multibyte characters is a good idea, but I was trying to avoid that because the data is being indexed for text searching. I think we will try processing each character/byte with the hash.
Adam
August 27, 2008, August 27, 2008 13:42, permalink
Perhaps I wasn't clear with my previous post. The normalize method does the same thing you are doing with gsub.
1 2
"\xBD + \xBC".chars.normalize.to_s => "1/2 + 1/4"
kenr
August 27, 2008, August 27, 2008 16:01, permalink
Thanks Adam. I should have read your post more carefully and checked the API.
chrismo
August 27, 2008, August 27, 2008 21:21, permalink
Thanks, John - I knew I was being stupid when I got to that portion. However, this version still performs much worse than just looping over gsubs.
John Sietsma
August 28, 2008, August 28, 2008 01:45, permalink
It depends what you're after kenr. Do you want to:
- Transform the escape string into an ascii representation. Your gsub method does this. "\xBD" becomes "1/2".
- Transform the escape string into utf-8. normalize does this. "\xBD" becomes the byte 0xBD.
btw, the normalise function takes 60.997 seconds to finish the test!
kenr
August 28, 2008, August 28, 2008 16:45, permalink
The normalizing is really what we want, but there is a limited set of input characters that need to be fixed. Here is what my current code looks like:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
@@lookup = {} @@lookup[0x92] = "'" @@lookup[0x91] = "'" @@lookup[0x96] = "-" @@lookup[0xAE] = "(R)" @@lookup[0x99] = "(TM)" @@lookup[0xBD] = "1/2" @@lookup[0xBE] = "3/4" @@lookup[0xBC] = "1/4" @@lookup[0xB0] = " degree " @@lookup[0xBA] = " degrees " # etc def fix_text data updated = "" (0..(data.length)).each { |i| c = data[i] || " "; updated << (@@lookup[c] || c) } end
Adam
August 28, 2008, August 28, 2008 18:50, permalink
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
class Normalizer < Struct.new(:string) REPLACEMENTS = { 0x92 => "'", 0x91 => "'", 0x96 => "-", 0xAE => "(R)", 0x99 => "(TM)", 0xBD => "1/2", 0xBE => "3/4", 0xBC => "1/4", 0xB0 => " degree ", 0xBA => " degrees " } def normalize string.enum_for(:each_byte).map { |char| REPLACEMENTS[char] || char.chr }.join end def normalize! string.replace(normalize) end end # Normalizer.new(data).normalize!
I could put the search/replace values into a hash and call gsub in a loop. But is there a more efficient way?