60997be4fbb4f5719d6eb68a1eadee9a

I could put the search/replace values into a hash and call gsub in a loop. But is there a more efficient way?

1
2
3
4
5
6
7
8
data = data.gsub("\xA0", " ")
data = data.gsub("\x99", "(TM)")
data = data.gsub("\x97", "-")
data = data.gsub(/\xE9/, "e")
data = data.gsub(/\xBD/, "1/2")
data = data.gsub(/\xBE/, "3/4")
data = data.gsub(/\xBC/, "1/4")
# Continue with another 15 search/replace pairs

Refactorings

No refactoring yet !

8fb47582197b2d50cd7d297841c8cc87

chrismo

August 27, 2008, August 27, 2008 03:40, permalink

No rating. Login to rate!

I took a stab at 'optimizing' this, by only going through the data string once - but this new version performs terribly - even just going over it once takes much longer than your version 200 times in a row -- probably because the gsub method would call into C code, whereas my new version is a bunch of Ruby calls.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
def process_orig(data)
  data = data.gsub("\xA0", " ")
  data = data.gsub("\x99", "(TM)")
  data = data.gsub("\x97", "-")
  data = data.gsub(/\xE9/, "e")
  data = data.gsub(/\xBD/, "1/2")
  data = data.gsub(/\xBE/, "3/4")
  data = data.gsub(/\xBC/, "1/4")
  return data
end

def process_one_pass(data)
  subs = {
    "\xA0" => " ",
    "\x99" => "(TM)",
    "\x97" => "-",
    "\xE9" => "e",
    "\xBD" => "1/2",
    "\xBE" => "3/4",
    "\xBC" => "1/4",
  }
  
  result = ""
  data.each_byte { |ch|
    append = nil
    subs.each_pair { |find, repl|
      append = repl if (ch.chr == find)
    }
    result << ((append.nil?) ? ch : append)
  }
  return result
end

if __FILE__ == $0
  require 'profile'
  require 'test/unit'
  
  class TestProcess < Test::Unit::TestCase
    def test_process_orig
      filler = "*" * 10000
      (1..200).each do 
        assert_equal(" (TM)-e1/23/41/4" + filler, process_orig("\xA0\x99\x97\xE9\xBD\xBE\xBC" + filler))
      end
    end

    def test_process_one_pass
      filler = "*" * 10000
      (1..1).each do 
        assert_equal(" (TM)-e1/23/41/4" + filler, process_one_pass("\xA0\x99\x97\xE9\xBD\xBE\xBC" + filler))
      end
    end
  end
end
8fb47582197b2d50cd7d297841c8cc87

chrismo

August 27, 2008, August 27, 2008 03:41, permalink

No rating. Login to rate!

Probably then just a version that, as you said, puts the search/replace values into a hash and calls gsub in a loop would be a bit cleaner.

A8d3f35baafdaea851914b17dae9e1fc

Adam

August 27, 2008, August 27, 2008 04:07, permalink

1 rating. Login to rate!

Why not use ActiveSupport::Multibyte::Chars instead?

1
data.chars.normalize!
37ce1520d319fa8095d05aa475951616

John Sietsma

August 27, 2008, August 27, 2008 05:29, permalink

1 rating. Login to rate!

Use the speed of the dictionary hash lookup rather then an incremental search.

Your code:
Finished in 32.422 seconds.

With these changes:
Finished in 6.25 seconds.

1
2
3
4
5
  result = ""
  data.each_byte { |ch|
    result << ( subs[ch.chr] || ch )
  }
  return result
60997be4fbb4f5719d6eb68a1eadee9a

kenr

August 27, 2008, August 27, 2008 11:16, permalink

No rating. Login to rate!

Thanks Guys!

The multibyte characters is a good idea, but I was trying to avoid that because the data is being indexed for text searching. I think we will try processing each character/byte with the hash.

A8d3f35baafdaea851914b17dae9e1fc

Adam

August 27, 2008, August 27, 2008 13:42, permalink

1 rating. Login to rate!

Perhaps I wasn't clear with my previous post. The normalize method does the same thing you are doing with gsub.

1
2
"\xBD + \xBC".chars.normalize.to_s
=> "1/2 + 1/4"
60997be4fbb4f5719d6eb68a1eadee9a

kenr

August 27, 2008, August 27, 2008 16:01, permalink

No rating. Login to rate!

Thanks Adam. I should have read your post more carefully and checked the API.

8fb47582197b2d50cd7d297841c8cc87

chrismo

August 27, 2008, August 27, 2008 21:21, permalink

No rating. Login to rate!

Thanks, John - I knew I was being stupid when I got to that portion. However, this version still performs much worse than just looping over gsubs.

37ce1520d319fa8095d05aa475951616

John Sietsma

August 28, 2008, August 28, 2008 01:45, permalink

No rating. Login to rate!

It depends what you're after kenr. Do you want to:
- Transform the escape string into an ascii representation. Your gsub method does this. "\xBD" becomes "1/2".
- Transform the escape string into utf-8. normalize does this. "\xBD" becomes the byte 0xBD.

btw, the normalise function takes 60.997 seconds to finish the test!

60997be4fbb4f5719d6eb68a1eadee9a

kenr

August 28, 2008, August 28, 2008 16:45, permalink

No rating. Login to rate!

The normalizing is really what we want, but there is a limited set of input characters that need to be fixed. Here is what my current code looks like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
@@lookup = {}

@@lookup[0x92] = "&#39;"
@@lookup[0x91] = "&#39;"
@@lookup[0x96] = "-"
@@lookup[0xAE] = "(R)"
@@lookup[0x99] = "(TM)"
@@lookup[0xBD] = "1/2"
@@lookup[0xBE] = "3/4"
@@lookup[0xBC] = "1/4"
@@lookup[0xB0] = " degree "
@@lookup[0xBA] = " degrees "
# etc

def fix_text data
  updated = ""
  (0..(data.length)).each { |i| c = data[i] || " "; updated << (@@lookup[c] || c) }
end
A8d3f35baafdaea851914b17dae9e1fc

Adam

August 28, 2008, August 28, 2008 18:50, permalink

No rating. Login to rate!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
class Normalizer < Struct.new(:string)
  REPLACEMENTS = {
    0x92 => "&#39;",
    0x91 => "&#39;",
    0x96 => "-",
    0xAE => "(R)",
    0x99 => "(TM)",
    0xBD => "1/2",
    0xBE => "3/4",
    0xBC => "1/4",
    0xB0 => " degree ",
    0xBA => " degrees "
  }
  
  def normalize
    string.enum_for(:each_byte).map { |char| REPLACEMENTS[char] || char.chr }.join
  end
  
  def normalize!
    string.replace(normalize)
  end
end

# Normalizer.new(data).normalize!

Your refactoring





Format Copy from initial code

or Cancel