Chris Clark wrote:
> Philip Jenvey wrote:
>
>> #1066 is the main bug for this issue -- we just currently lack support for the Asian codecs like shiftjis. The ImportError in sample #2 is a symptom of that. The same ImportError happens when you attempt to use the codec but it's masked as a LookupError.
>>
>> Supporting these via the JVM's nio codecs is definitely doable but nobody's gotten around to it yet.
>>
>>
>
> Is http://java.sun.com/j2se/1.4.2/docs/guide/nio... the package you are
> referring to? I'm not a big Java guy but I may start hacking on a Python
> layer on top of this as an experiment/proof-of-concept. Presumably
> http://java.sun.com/j2se/1.4.2/docs/api/java/...
> is what needs wrapping?
>
I had some time this afternoon whilst waiting for some builds to
complete, so I started experimenting with using nio from Python, along
with a quick attempt at a shift_jis codec.
I'm seeking feedback on the very INCOMPLETE demo that is attached. Sample
session:
C:\users\clach04\python\jython_character_encoding>c:\jython2.5.1\jython.bat
Jython 2.5.1 (Release_2_5_1:6813, Sep 26 2009, 13:47:54)
[Java HotSpot(TM) Client VM (Sun Microsystems Inc.)] on java1.6.0_02
Type "help", "copyright", "credits" or "license" for more information.
>>> x=''
>>> x.decode('shift_jis') # at this point there is a shift_jis.py in curdir
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding 'shift_jis'
>>> import shift_jis # register the local module/encoding
>>> x.decode('shift_jis')
u''
>>>
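The import is what makes the second lookup succeed: importing the module
calls codecs.register() with a search function. A minimal sketch of that
mechanism, written for CPython 3 rather than Jython 2.5 so it can be run
without a JVM. The codec name demo_sjis is hypothetical, and CPython's
built-in shift_jis codec stands in for the nio-backed functions:

```python
import codecs

# Hypothetical codec name, chosen so it cannot clash with a real encoding.
DEMO_NAME = "demo_sjis"


def demo_search_function(name):
    """Return codec info for DEMO_NAME, or None so other search functions run."""
    if name != DEMO_NAME:
        return None
    # Stand-in for the nio-backed encode/decode: reuse CPython's own codec.
    real = codecs.lookup("shift_jis")
    return codecs.CodecInfo(name=DEMO_NAME, encode=real.encode, decode=real.decode)


def is_known(name):
    """True if some registered search function recognises the codec name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


before = is_known(DEMO_NAME)           # nothing registered yet
codecs.register(demo_search_function)  # what "import shift_jis" does as a side effect
after = is_known(DEMO_NAME)
decoded = b"".decode(DEMO_NAME)        # same empty-string check as the session
```

Under Jython 2.5 the search function returns a plain 4-tuple instead of
codecs.CodecInfo, as the attached script does.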
There is no support for the errors argument (or any less strict
conversion option), there are imports in the middle of the script, and
you have to import the encoding you need before using it (right now
there is only one, but it would be easy to generate more from a
template). I'm beginning to wonder if it would simply be cleaner to use
the CPython gencodec.py script and generate its input from the CPython
encodings. I've done this for some Windows (single byte) encodings that
are not supported by Python, by auto-generating tables from Windows
codepages like cp708. The tables would be pretty big though :-)
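For comparison, the table-driven route that gencodec.py output takes
looks roughly like this for a single-byte code page. This is a minimal
sketch in CPython 3 syntax. The table is a toy (ASCII plus one made-up
mapping for byte 0x80), not a real code page, and a double-byte encoding
such as Shift_JIS would need a far larger mapping:

```python
import codecs

# Toy decoding table: index = byte value, entry = Unicode character.
# U+FFFE marks a byte as undefined, the convention gencodec.py output uses.
decoding_table = (
    "".join(chr(i) for i in range(0x80))  # 0x00-0x7F: plain ASCII
    + "\u20ac"                            # 0x80: made-up mapping to EURO SIGN
    + "\ufffe" * 0x7F                     # 0x81-0xFF: undefined
)
encoding_table = codecs.charmap_build(decoding_table)


def decode(data, errors="strict"):
    """Decode bytes through the table; returns (text, bytes consumed)."""
    return codecs.charmap_decode(data, errors, decoding_table)


def encode(text, errors="strict"):
    """Encode text through the reverse table; returns (bytes, chars consumed)."""
    return codecs.charmap_encode(text, errors, encoding_table)


round_trip = decode(encode("A\u20ac")[0])[0]
```

The errors argument comes for free here because the charmap helpers
implement it, which is one point in favour of the generated-table route.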
I'm really looking for "yes, the nio from Python approach is worth
pursuing" or "this is stupid, you should stop now" comments. I'm pretty
sure that performance-wise this approach is not a good idea, but it is
infinitely faster than "doesn't work at all" :-)
Here is a slightly more real example:
C:\users\clach04\python\jython_character_encoding>c:\jython2.5.1\jython.bat
Jython 2.5.1 (Release_2_5_1:6813, Sep 26 2009, 13:47:54)
[Java HotSpot(TM) Client VM (Sun Microsystems Inc.)] on java1.6.0_02
Type "help", "copyright", "credits" or "license" for more information.
>>> import shift_jis # register the local module/encoding
>>> x = u"\u3042" # '3042 HIRAGANA LETTER A'
>>> x.encode('shift_jis')
'\x82\xa0'
>>> # hey! Looks like it matches
http://demo.icu-project.org/icu-bin/convexp?c...
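Another way to cross-check the bytes is to generate reference values from
CPython, which already ships a shift_jis codec, and compare them with
what the nio-based module returns under Jython. A minimal sketch in
CPython 3 syntax; the helper name reference_vectors is hypothetical:

```python
def reference_vectors(text, encoding="shift_jis"):
    """Return (code point, hex bytes) pairs from CPython's built-in codec."""
    return [("U+%04X" % ord(ch), ch.encode(encoding).hex()) for ch in text]


# HIRAGANA LETTER A from the session above, plus one ASCII character.
vectors = reference_vectors("\u3042A")
```

The U+3042 entry should agree with the '\x82\xa0' shown in the session.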
Finally, does anyone know how IronPython handles CJK (or do they simply
make use of .NET strings)?
Chris
# java imports
import java.nio.charset
import java.nio.CharBuffer
import java.nio.ByteBuffer
# python imports
import array
class MyBaseException(Exception):
    pass
def nio_unicode_to_bytes(nio_charset_name, unicode_string_data):
    """Take Python Unicode string and return python str type (byte)
    encoded in nio_charset_name
    nio_charset_name is a java.nio.charset name
    """
    assert isinstance(unicode_string_data, unicode)
    nio_charset = java.nio.charset.Charset.forName(nio_charset_name)
    # TODO lookup could fail
    nio_charset_encoder = nio_charset.newEncoder()
    try:
        bbuf = nio_charset_encoder.encode(java.nio.CharBuffer.wrap(unicode_string_data))
    except java.nio.charset.CharacterCodingException:
        # not possible to represent one or more Unicode character(s) in this
        # encoding (covers unmappable characters and malformed input)
        raise MyBaseException('nio encoding failure - not implemented support yet')
    # the backing array can be longer than the encoded data (e.g. for ASCII
    # input), so only keep the bytes up to the buffer's limit
    tmp_byte_array = array.array('b', bbuf.array()[:bbuf.limit()])
    return tmp_byte_array.tostring()
def nio_bytes_to_unicode(nio_charset_name, byte_string_data):
    """Take Python str (byte) string and return python Unicode string type
    decoded using nio_charset_name
    nio_charset_name is a java.nio.charset name
    """
    assert isinstance(byte_string_data, str)
    nio_charset = java.nio.charset.Charset.forName(nio_charset_name)
    # TODO lookup could fail
    nio_charset_decoder = nio_charset.newDecoder()
    tmp_byte_buffer = java.nio.ByteBuffer.wrap(byte_string_data)
    try:
        cbuf = nio_charset_decoder.decode(tmp_byte_buffer)
    except java.nio.charset.CharacterCodingException:
        # covers both malformed input and unmappable byte sequences
        raise MyBaseException('nio decoding failure - not implemented support yet')
    tmp_unicode_str = cbuf.toString()
    return tmp_unicode_str
## could probably use a decorator here....
def decode_Shift_JIS(input):
    return nio_bytes_to_unicode('Shift_JIS', input)
def encode_Shift_JIS(input):
    return nio_unicode_to_bytes('Shift_JIS', input)
#### Pretty much boiler plate codec
import codecs
def decode(input, errors='strict'):
    # errors is accepted for API compatibility but ignored for now
    return decode_Shift_JIS(input), len(input)
def encode(input, errors='strict'):
    # errors is accepted for API compatibility but ignored for now
    return encode_Shift_JIS(input), len(input)
class Codec(codecs.Codec):
    def decode(self, input, errors='strict'):
        return decode(input, errors)
    def encode(self, input, errors='strict'):
        return encode(input, errors)
# inherit from the local Codec (not codecs.Codec, whose encode/decode
# raise NotImplementedError) so the stream classes actually convert data
class StreamReader(Codec, codecs.StreamReader):
    pass
class StreamWriter(Codec, codecs.StreamWriter):
    pass
# entry point
def getregentry():
    return (encode, decode, StreamReader, StreamWriter)
##### not so boiler plate.....
def shift_jis_search_function(name):
    if name == 'shift_jis':
        import shift_jis
        codec = shift_jis.Codec()
        return (codec.encode, codec.decode, shift_jis.StreamReader, shift_jis.StreamWriter)
    else:
        return None
## works for 2.4 and 2.5
codecs.register(shift_jis_search_function)
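The decode() and encode() functions in the script accept an errors
argument but ignore it. As a reference for what a finished codec would
need to reproduce, here is how CPython's built-in shift_jis codec treats
the standard error modes. This is a sketch in CPython 3 syntax, not
Jython code. On the Java side, java.nio.charset.CodingErrorAction
(REPORT, REPLACE, IGNORE) looks like the natural counterpart, but wiring
it up is untested here:

```python
text = "\u3042\u20ac"  # HIRAGANA LETTER A followed by EURO SIGN (not in Shift_JIS)

# errors='strict' (the default) raises on the unmappable character.
try:
    text.encode("shift_jis")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'replace' substitutes a replacement byte, 'ignore' drops the character.
replaced = text.encode("shift_jis", "replace")
ignored = text.encode("shift_jis", "ignore")

# Decoding: a lead byte with no trail byte is malformed input.
truncated = b"\x82"
try:
    truncated.decode("shift_jis")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
decoded_replaced = truncated.decode("shift_jis", "replace")
```

Whatever the nio wrapper ends up doing, raising UnicodeEncodeError and
UnicodeDecodeError (rather than a custom exception) in strict mode would
keep it compatible with callers written against the CPython codecs.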