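When sending input and capturing output through stdin and stdout, is it possible to set the encoding to UTF-8 so that special characters (e.g. ™, à) are preserved?
Here is my code (I'm on Windows, and I believe the output comes back in this encoding: IBM866):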
require 'open3'
require 'pragmatic_segmenter' # just a gem that segments paragraphs to sentences
Open3.popen3("tagger") do |stdin, stdout, stderr, wait_thread|
  tokenized_group = Proc.new do |sentences|
    sentences_array = PragmaticSegmenter::Segmenter.new(text: sentences).segment
    sentences_array.map do |sentence|
      stdin.puts "#{sentence}"
      stdout.gets.gsub(/\n$/,"").encode("utf-8") #=> is it possible to get this utf-8, right now its IBM866?
    end
  end
  puts tokenized_group.call "Some random sentence with ™. Another random sentence with à."
  #output => Some/DT random/JJ sentence/NN with/IN тДв/NN ./. Another/DT random/JJ sentence/NN with/IN ├а/NN ./.
  stdin.close
end
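As you can see, the special characters are not preserved in the output because the encodings differ. So how can I get these characters back from stdout?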
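What do stdout.internal_encoding and .external_encoding return? What is the encoding of the items in sentences_array? What are the actual byte values of the characters in question in the returned string? - Jordan Running
incompatible encoding regexp match (UTF-8 regexp with IBM866 string) (Encoding::CompatibilityError). internal_encoding returns nil, external_encoding returns IBM866. For ™ (returned as тДв), the bytes are [209, 130, 208, 148, 208, 178], and for à (returned as ├а), the bytes are [226, 148, 156, 208, 176]. - B A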
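The byte values reported above are consistent with one specific explanation: the tagger may already be writing UTF-8, Ruby labels those bytes as IBM866 (the pipe's external encoding), and the .encode("utf-8") call then transcodes each byte as if it were a Cyrillic or box-drawing character. The sketch below reproduces that effect on plain strings and reverses it. The helper names garble and repair are hypothetical and exist only for illustration; String#force_encoding, String#encode and String#bytes are standard Ruby. It assumes the script itself is saved as UTF-8.

```ruby
# Hypothetical helpers for illustration; they are not part of the original code.

# Simulates the suspected problem: UTF-8 bytes are labeled as IBM866,
# then transcoded from IBM866 to UTF-8.
def garble(utf8_text)
  utf8_text.dup.force_encoding(Encoding::IBM866).encode(Encoding::UTF_8)
end

# Reverses it: transcode back to IBM866 to recover the original bytes,
# then relabel those bytes as UTF-8 without changing them.
def repair(garbled_text)
  garbled_text.encode(Encoding::IBM866).force_encoding(Encoding::UTF_8)
end

garbled = garble("™")
puts garbled          # prints тДв
p garbled.bytes       # prints [209, 130, 208, 148, 208, 178]
puts repair(garbled)  # prints ™
```

If that diagnosis holds for the real tagger, the transcoding step can probably be avoided altogether: either call stdout.set_encoding("UTF-8") before reading, or replace .encode("utf-8") with .force_encoding("utf-8") on each line read. Both relabel the bytes instead of converting them, so they only help if the tagger's output really is UTF-8.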