How do I make my Ruby regex compile with the same encoding as the variable it is applied to?

3
Note that the answer to "Ruby Regex Error: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)" does not apply to me, because I am already using Ruby > 1.9.
I'm using Rails 4.2.7 and Ruby 2.3. I have this expression:
phrase = phrase.gsub(/\A\p{Space}+|\p{Space}+\z/, '') 

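Unfortunately, if the variable "phrase" has the encoding "ASCII-8BIT", I get the error below. Is there a way to write the expression above so that it is compiled in an encoding that matches the encoding of phrase? As far as I can tell, the regex is automatically compiled as UTF-8, even though my variable may not be UTF-8.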

Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)
    from /Users/davea/Documents/workspace/demoapp/app/services/text_table_to_my_object_time_converter_service.rb:449:in `gsub'
    from /Users/davea/Documents/workspace/demoapp/app/services/text_table_to_my_object_time_converter_service.rb:449:in `find_header'
    from /Users/davea/Documents/workspace/demoapp/app/services/text_table_to_my_object_time_converter_service.rb:157:in `block in get_headers_by_line'
    from /Users/davea/.rvm/gems/ruby-2.3.0/gems/activesupport-4.2.7.1/lib/active_support/core_ext/range/each.rb:7:in `each'
    from /Users/davea/.rvm/gems/ruby-2.3.0/gems/activesupport-4.2.7.1/lib/active_support/core_ext/range/each.rb:7:in `each_with_time_with_zone'
    from /Users/davea/Documents/workspace/demoapp/app/services/text_table_to_my_object_time_converter_service.rb:156:in `get_headers_by_line'
    from /Users/davea/Documents/workspace/demoapp/app/services/text_table_to_my_object_time_converter_service.rb:99:in `get_headers'
    from /Users/davea/Documents/workspace/demoapp/app/services/text_table_to_my_object_time_converter_service.rb:243:in `block in get_data_hash'
    from /Users/davea/Documents/workspace/demoapp/app/services/text_table_to_my_object_time_converter_service.rb:242:in `each_line'
    from /Users/davea/Documents/workspace/demoapp/app/services/text_table_to_my_object_time_converter_service.rb:242:in `get_data_hash'
    from /Users/davea/Documents/workspace/demoapp/app/services/text_table_to_my_object_time_converter_service.rb:21:in `get_my_object_times'
    from /Users/davea/Documents/workspace/demoapp/app/services/text_processor_service.rb:33:in `process_page_data'
    from /Users/davea/Documents/workspace/demoapp/app/services/abstract_import_service.rb:82:in `process_my_object_data'
    from (irb):8
    from /Users/davea/.rvm/gems/ruby-2.3.0/gems/railties-4.2.7.1/lib/rails/commands/console.rb:110:in `start'
    from /Users/davea/.rvm/gems/ruby-2.3.0/gems/railties-4.2.7.1/lib/rails/commands/console.rb:9:in `start'
    from /Users/davea/.rvm/gems/ruby-2.3.0/gems/railties-4.2.7.1/lib/rails/commands/commands_tasks.rb:68:in `console'
    from /Users/davea/.rvm/gems/ruby-2.3.0/gems/railties-4.2.7.1/lib/rails/commands/commands_tasks.rb:39:in `run_command!'
    from /Users/davea/.rvm/gems/ruby-2.3.0/gems/railties-4.2.7.1/lib/rails/commands.rb:17:in `<top (required)>'
    from bin/rails:4:in `require'

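Can you provide a specific value of phrase that triggers this behaviour? Also, ASCII-8BIT is not meant for holding text but raw byte-level data; what encoding is the content of phrase actually in? The solution should be to force the encoding of phrase to its actual encoding, then encode to UTF-8, then apply the regex. (I believe a regex containing \p{Space} will be UTF-8. /foo/.encoding is not UTF-8 on my machine.) - Amadan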
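The three steps suggested in that comment could look roughly like the sketch below. The helper name strip_edge_space, the EDGE_SPACE constant, and the choice of ISO-8859-1 as the "actual" encoding are assumptions made for illustration; the thread never establishes what encoding the bytes in phrase really use.

```ruby
# Hypothetical names throughout; not code from the original thread.
EDGE_SPACE = /\A\p{Space}+|\p{Space}+\z/

# source_encoding is an assumption the caller has to supply: the encoding
# the raw bytes were really written in.
def strip_edge_space(phrase, source_encoding)
  text = phrase.dup.force_encoding(source_encoding) # relabel the bytes, no conversion
  text = text.encode(Encoding::UTF_8)               # now actually convert to UTF-8
  text.gsub(EDGE_SPACE, '')                         # regex and string are both UTF-8
end

# ISO-8859-1 bytes for " café ", tagged as ASCII-8BIT (binary).
raw = " caf\xE9 ".b

begin
  raw.gsub(EDGE_SPACE, '') # high byte + binary tag: the regex refuses to match
rescue Encoding::CompatibilityError => e
  puts "failed as in the question: #{e.class}"
end

cleaned = strip_edge_space(raw, Encoding::ISO_8859_1)
puts cleaned.encoding
puts cleaned
```

If the bytes are already UTF-8 and only the label is wrong, passing Encoding::UTF_8 as source_encoding makes the encode step a no-op and only the relabelling matters.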
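When I run 'puts "#{phrase.encoding}"', the output is "ASCII-8BIT", so although you say "ASCII-8BIT" is not suitable for holding text, in my case that is the reality. - Dave
I've tried to reproduce this with random ASCII-8BIT data in both 2.3.0 and 2.3.1, without success. Knowing the contents of "phrase" would be very helpful. - Sculper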
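There is a reason I said "not meant for holding text". It certainly can hold text, but when it does, it is almost always a programmer error. ASCII-8BIT is effectively the "null encoding"; it says "I got these bytes, but I don't know what they mean". For example, the ASCII-8BIT string "\xc3\xa4" is "ä" when forced to UTF-8, but "Ã¤" in ISO-8859-1, and "テ、" in SJIS... As long as you stick to the lower half of ASCII you probably won't hit any errors, since most encodings agree there, but once you touch the 8th bit, the regex needs to know the exact encoding in order to know what a "space" is. - Amadan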
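The point about one byte sequence meaning different things can be checked with a small sketch like the one below. The variable names are illustrative; each interpretation is converted to UTF-8 only so the results can be printed side by side.

```ruby
# The same two bytes, tagged as binary (ASCII-8BIT).
bytes = "\xC3\xA4".b

# Interpreted as UTF-8: one character, U+00E4.
as_utf8 = bytes.dup.force_encoding(Encoding::UTF_8)

# Interpreted as ISO-8859-1: two characters, U+00C3 and U+00A4.
as_latin1 = bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

# Interpreted as Shift_JIS: two single-byte characters from the half-width range.
as_sjis = bytes.dup.force_encoding(Encoding::Shift_JIS).encode(Encoding::UTF_8)

[as_utf8, as_latin1, as_sjis].each do |s|
  puts "#{s.inspect} (#{s.length} character(s))"
end
```

None of the three readings is wrong on its own terms, which is why a regex cannot classify a high byte until the string carries a real encoding.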
1 Answer

0

I think we had a similar problem; in short:

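# force_encoding is added here only so the snippet reproduces the situation;
# in my case the string already arrived tagged as Windows-1252
str = "Some text\xC2\xA0followed by more text.".force_encoding(Encoding::Windows_1252)
enc = str.encoding

# returns ==> #<Encoding:Windows-1252>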

# here is my initial attempt, which failed:
matches = str.match(/\xC2\xA0/)
# returns ==> Encoding::CompatibilityError 
# (incompatible encoding regexp match (UTF-8 regexp with Windows-1252 string))

# here is my second attempt:
newstr = str.encode('UTF_8', 'Windows_1252', invalid: :replace, undef: :replace, replace: ' ')

# returns ==> Encoding::ConverterNotFoundError (code converter not found (Windows_1252 to UTF_8))

# here is my third attempt:
matches = str.match('\xC2\xA0'.encode(Encoding::Windows_1252))

# returns ==> #<MatchData "\xC2\xA0">

# here is my 4th attempt: 
matches = str.match('(.*)\xC2\xA0(.*)'.encode(Encoding::Windows_1252))
newstr = matches[1] + ' ' + matches[2]

# returns ==> "Some text followed by more text."

Hope this helps. Only two years late...
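Two side notes on the attempts above, as a hedged sketch rather than a definitive fix. First, Ruby's encoding names use hyphens ('UTF-8', 'Windows-1252'); the underscore spellings only exist as constants such as Encoding::Windows_1252, which would explain the ConverterNotFoundError. Second, the bytes \xC2\xA0 happen to be the UTF-8 form of a no-break space, so this sketch assumes the string was really UTF-8 with a wrong label; that assumption is not confirmed by the answer.

```ruby
# Recreate the mislabelled string from the answer.
str = "Some text\xC2\xA0followed by more text.".force_encoding(Encoding::Windows_1252)

# Transcoding with the hyphenated names runs, but treats 0xC2 and 0xA0 as
# two separate Windows-1252 characters (U+00C2 followed by U+00A0).
transcoded = str.encode('UTF-8', 'Windows-1252')

# Relabelling instead keeps the bytes and reads them as one UTF-8 character.
retagged = str.dup.force_encoding(Encoding::UTF_8)

# Only go on if the bytes really are well-formed UTF-8.
cleaned = retagged.gsub("\u00A0", ' ') if retagged.valid_encoding?

puts transcoded.inspect
puts cleaned.inspect
```

Once the string is valid UTF-8, a UTF-8 regex such as the one in the question can be applied to it without a CompatibilityError.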

