In Python 3.12, why does 'Öl' take less memory than 'Ö'?

Question

I just read PEP 393 and learned that Python's str type uses different internal representations, depending on the content. So, I experimented a little bit and was a bit surprised by the results:

>>> import sys
>>> sys.getsizeof('')
41
>>> sys.getsizeof('H')
42
>>> sys.getsizeof('Hi')
43
>>> sys.getsizeof('Ö')
61
>>> sys.getsizeof('Öl')
59

I understand that in the first three cases, the strings don't contain any non-ASCII characters, so an encoding with 1 byte per char can be used. Putting a non-ASCII character like Ö in a string forces the interpreter to use a different encoding. Therefore, I'm not surprised that 'Ö' takes more space than 'H'.

However, why does 'Öl' take less space than 'Ö'? I assumed that whatever internal representation is used for 'Öl' allows for an even shorter representation of 'Ö'.
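For reference, the per-character storage width that PEP 393 selects can be observed without the fixed header getting in the way. The following is a minimal sketch, not part of the original experiment; bytes_per_char is a hypothetical helper name, and the result relies on CPython's behaviour of sys.getsizeof for str.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Return how many bytes one extra copy of `ch` adds to a longer string."""
    # The strings are built at runtime, and comparing two lengths cancels out
    # the fixed header, so only the per-character storage width remains.
    return sys.getsizeof(ch * 11) - sys.getsizeof(ch * 10)


for ch in ("a", "Ö", "€", "😀"):
    print(f"{ch!r}: {bytes_per_char(ch)} byte(s) per character")
```

On CPython this should report 1, 1, 2 and 4 bytes respectively: 'Ö' still fits in one byte per character, so the width alone does not explain why 'Ö' is larger than 'Öl'.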

I'm using Python 3.12; apparently this is not reproducible in earlier versions.

AKX · Accepted Answer

This test code (the structure layouts follow the 3.12.4 source, and even so I didn't thoroughly double-check them)

import ctypes
import sys


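class PyUnicodeObject(ctypes.Structure):
    # Mirrors only the leading fields of a CPython str object (64-bit build assumed).
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("length", ctypes.c_ssize_t),
        ("hash", ctypes.c_ssize_t),
        ("state", ctypes.c_uint64),  # 32-bit flag field plus padding, decoded by StateBitField
    ]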


class StateBitField(ctypes.LittleEndianStructure):
    _fields_ = [
        ("interned", ctypes.c_uint, 2),
        ("kind", ctypes.c_uint, 3),
        ("compact", ctypes.c_uint, 1),
        ("ascii", ctypes.c_uint, 1),
        ("statically_allocated", ctypes.c_uint, 1),
        ("_padding", ctypes.c_uint, 24),
    ]

    def __repr__(self):
        return ", ".join(f"{k}: {getattr(self, k)}" for k, *_ in self._fields_ if not k.startswith("_"))


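def dump_s(s: str):
    # Reinterpret the live str object's memory; in CPython, id() is the object's address.
    o = PyUnicodeObject.from_address(id(s))
    state_int = o.state
    # Copy the raw state bits into a fresh buffer and decode them as bit fields.
    state = StateBitField.from_buffer(ctypes.c_uint64(state_int))
    print(f"{s!r}".ljust(8), f"{o.length=}, {sys.getsizeof(s)=}, {state}")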


dump_s('5')
dump_s('a')
dump_s('ä')
dump_s('vvv')
dump_s('ÖÖÖ')
dump_s(str(chr(214)))  # built at runtime, so it is not a constant interned from the module source
dump_s(str(chr(214) + chr(108)))  # built at runtime, so it is not a constant interned from the module source

prints out

'5'      o.length=1, sys.getsizeof(s)=42, interned: 3, kind: 1, compact: 1, ascii: 1, statically_allocated: 1
'a'      o.length=1, sys.getsizeof(s)=42, interned: 3, kind: 1, compact: 1, ascii: 1, statically_allocated: 1
'ä'      o.length=1, sys.getsizeof(s)=61, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 1
'vvv'    o.length=3, sys.getsizeof(s)=44, interned: 2, kind: 1, compact: 1, ascii: 1, statically_allocated: 0
'ÖÖÖ'    o.length=3, sys.getsizeof(s)=60, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 0
'Ö'      o.length=1, sys.getsizeof(s)=61, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 1
'Öl'     o.length=2, sys.getsizeof(s)=59, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 0
'Ö'      o.length=1, sys.getsizeof(s)=61, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 1

The smoking gun seems to be statically_allocated, which is set for 'Ö' (and the other single-character strings) but not for 'Öl'.

I think that stems from this line in pycore_runtime_init_generated, where it looks like the runtime statically allocates objects for all single-character Latin-1 strings (among others). As discussed in the comments, this CPython PR added UTF-8 representations of all of these statically allocated strings, so Ö is statically stored as both Latin-1 (1 byte) and UTF-8 (2 bytes).
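The two claims above (one shared object per single-character Latin-1 string, and a 1-byte versus 2-byte encoding for Ö) can be checked from Python. This is a rough sketch rather than anything from the CPython source; the object-identity result is a CPython implementation detail and is an assumption for other interpreters.

```python
ch = chr(0xD6)  # 'Ö', built at runtime rather than taken from a source literal

# CPython implementation detail (an assumption elsewhere): one-character
# Latin-1 strings are shared objects, so a second chr() call returns the same one.
same_object = chr(0xD6) is ch

# Byte lengths of the two encodings mentioned above.
latin1_bytes = len(ch.encode("latin-1"))
utf8_bytes = len(ch.encode("utf-8"))

print(f"{same_object=}, {latin1_bytes=}, {utf8_bytes=}")
```

Because every 'Ö' in the process is that one shared object, any extra data attached to it shows up whenever its size is queried.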

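Also, I should note that for strings getsizeof() actually forwards to unicode_sizeof_impl: it computes the size from the object's fields, counting a cached UTF-8 copy if there is one, rather than measuring allocated memory.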
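The numbers in the question can be reproduced with a little arithmetic. The sketch below approximates the size calculation for compact strings; estimate_sizeof is a hypothetical helper, and the two header sizes are assumptions for a 64-bit CPython 3.12 build (inferred from the sizes reported above), not values read from the interpreter.

```python
# Assumed header sizes for a 64-bit CPython 3.12 build (not read from the interpreter).
ASCII_HEADER = 40    # assumed sizeof(PyASCIIObject)
COMPACT_HEADER = 56  # assumed sizeof(PyCompactUnicodeObject)


def estimate_sizeof(length, kind, is_ascii, cached_utf8_len=None):
    """Hypothetical approximation of str.__sizeof__ for compact strings.

    kind is the number of bytes per character (1, 2 or 4). cached_utf8_len is
    the byte length of a separately stored UTF-8 copy, or None if there is none
    (ASCII strings never need one, since their data already is UTF-8).
    """
    if is_ascii:
        size = ASCII_HEADER + length + 1            # characters + NUL terminator
    else:
        size = COMPACT_HEADER + (length + 1) * kind  # characters + terminator
    if cached_utf8_len is not None:
        size += cached_utf8_len + 1                  # UTF-8 copy + NUL terminator
    return size


print(estimate_sizeof(2, 1, False))                     # 'Öl': 56 + 3 = 59
print(estimate_sizeof(1, 1, False, cached_utf8_len=2))  # 'Ö': 56 + 2 + 3 = 61
```

Under these assumptions, the statically allocated 'Ö' is reported 3 bytes larger than it would be without its UTF-8 copy (58), which is what pushes it past the 59 bytes of 'Öl'.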

Tags: python, string, python-internals, python-3.12

Aemyl
