How can I get a random Unicode string?

Question

I am testing a REST-based service, and one of its inputs is a text string, so I am sending it random Unicode strings from my Python code. So far the strings I sent were all in the ASCII range, so everything worked.

Now I am attempting to send characters beyond the ASCII range and I am getting an encoding error. Here is my code. I have been through this link and am still unable to wrap my head around it.

# coding=utf-8

import os, random, string
import json

junk_len = 512
junk =  (("%%0%dX" % junk_len) % random.getrandbits(junk_len * 8))

for i in xrange(1,5):
    if(len(junk) % 8 == 0):
        print u'decoding to hex'
        message = junk.decode("hex")

    print 'Hex chars %s' %message
    print u' '.join(message.encode("utf-8").strip())

The first line prints without any issues, but I can't send it to the REST service without encoding it. Hence the second line where I am attempting to encode it to utf-8. This is the line of code that fails with the following message.

UnicodeDecodeError: 'ascii' codec can't decode byte 0x81 in position 7: ordinal not in range(128)

Alastair McCormack · Accepted Answer

As others have said, it's very difficult to make valid UTF-8 out of random bytes, because every multi-byte sequence has to follow the encoding's lead-byte and continuation-byte rules.
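As a rough illustration of that point, here is a minimal Python 3 sketch (the question itself uses Python 2.7, and the helper name `is_valid_utf8` is hypothetical). It checks whether a byte string decodes as UTF-8. A lone 0x81, the byte named in the question's error message, is a continuation byte with no lead byte before it, so it is rejected.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 0x81 is a continuation byte; on its own it is not a valid sequence.
print(is_valid_utf8(b"\x81"))      # False
# A lead byte followed by a matching continuation byte is valid (U+00E9).
print(is_valid_utf8(b"\xc3\xa9"))  # True
```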

Unicode assigns every character a number (a code point) between 0x0000 and 0x10FFFF, so all one needs to do is randomly generate a number in that range to get a code point. The exception is the surrogate block 0xD800-0xDFFF, which is reserved for UTF-16 and does not form valid UTF-8. Passing the random number to unichr (or chr on Py3) returns a Unicode string containing the character at that code point.

Then all you need to do is ask Python to encode to UTF-8 to create a valid UTF-8 sequence.
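The two steps above can be sketched as follows in Python 3. This is only an illustration: the function name `random_code_point` is an assumption, not part of the answer, and skipping the surrogate block is added here because Python 3 refuses to encode surrogates as UTF-8.

```python
import random

def random_code_point():
    """Pick a random code point in 0x0000-0x10FFFF, skipping surrogates."""
    while True:
        cp = random.randint(0x0000, 0x10FFFF)
        # U+D800..U+DFFF are surrogates: reserved for UTF-16, and Python 3
        # raises UnicodeEncodeError when asked to encode them as UTF-8.
        if not 0xD800 <= cp <= 0xDFFF:
            return cp

cp = random_code_point()
char = chr(cp)                  # unichr(cp) on Python 2
encoded = char.encode("utf-8")  # always a valid UTF-8 byte sequence

# Decoding the bytes gives back the same character.
assert encoded.decode("utf-8") == char
```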

Because there are many gaps and unprintable characters (due to font limitations) in the full Unicode range, using the range 0000-D7FF will return characters in the Basic Multilingual Plane, which are more likely to be printable on your system. When encoded to UTF-8, each character in this range takes up to 3 bytes.
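The "up to 3 bytes" figure can be checked with a short Python 3 sketch. The helper name `utf8_len` and the sample code points are chosen here purely for illustration.

```python
def utf8_len(code_point):
    """Number of bytes the given code point occupies in UTF-8."""
    return len(chr(code_point).encode("utf-8"))

# One sample from each UTF-8 length class below the surrogate block.
for cp in (0x0041, 0x00E9, 0x20AC, 0xD7FF):
    print("U+%04X -> %d byte(s)" % (cp, utf8_len(cp)))
# U+0041 -> 1 byte(s)
# U+00E9 -> 2 byte(s)
# U+20AC -> 3 byte(s)
# U+D7FF -> 3 byte(s)
```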

Plain Random

import random

def random_unicode(length):
    # Create a list of unicode characters within the range 0000-D7FF.
    # randrange() excludes its upper bound, so pass 0xD800 to include U+D7FF.
    random_unicodes = [unichr(random.randrange(0xD800)) for _ in xrange(length)]
    return u"".join(random_unicodes)

my_random_unicode_str = random_unicode(length=512)
my_random_utf_8_str = my_random_unicode_str.encode('utf-8')

Unique random

import random

def unique_random_unicode(length):
    # Create a list of unique random code points in the range 0000-D7FF.
    # xrange() excludes its upper bound, so pass 0xD800 to include U+D7FF.
    random_ints = random.sample(xrange(0xD800), length)

    # Convert each random int into a Unicode character
    random_unicodes = [unichr(x) for x in random_ints]
    # Join the list into a single Unicode string
    return u"".join(random_unicodes)

my_random_unicode_str = unique_random_unicode(length=512)
my_random_utf_8_str = my_random_unicode_str.encode('utf-8')
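For readers on Python 3, a possible port of both functions is sketched below. It assumes the same 0000-D7FF range as the answer; `range` and `randrange` exclude their upper bound, so 0xD800 is passed to include U+D7FF. The variable names `text` and `payload` are illustrative only.

```python
import random

def random_unicode(length):
    """Random characters from U+0000..U+D7FF; repeats are possible."""
    return "".join(chr(random.randrange(0xD800)) for _ in range(length))

def unique_random_unicode(length):
    """Like random_unicode, but every character is distinct."""
    # random.sample raises ValueError if length exceeds the 0xD800 candidates.
    return "".join(chr(cp) for cp in random.sample(range(0xD800), length))

text = unique_random_unicode(512)
payload = text.encode("utf-8")  # bytes object, ready to send over the wire
```

In Python 3 `str` is already Unicode, so there is no separate `unichr` or `u""` literal, and `encode` always returns `bytes`.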

Tags: encoding, python-unicode, utf-8, python-2.7

abhi
