Understanding CPython by Patching – part 4

In the previous post, we continued exploring CPython, looking for a way to turn the default base of integer literals in Python source code from decimal to hexadecimal.
We added a patch to make the tokenizer identify a hex integer literal without any special prefix or suffix as a NUMBER token.

Somewhat expectedly, the patch caused faulty behavior for hex integer literals:

>>> 2f3
ValueError: could not convert string to float: 2f3

This is expected, because the tokenizer now classifies a token that used to be invalid as a NUMBER token, while all other parts of the interpreter are oblivious to the change (in part 2 we went over the steps the CPython interpreter takes to run a piece of Python source code).

Actually, this error is probably raised by the code that tries to convert a NUMBER token into a Python numeric object (i.e. an instance of any of Python’s numeric types: ‘int’, ‘float’ and ‘complex’).

Let’s find that code.

We search for ‘could not convert string to float’, and find it in Modules\_pickle.c, Objects\floatobject.c and Python\pystrtod.c. The ‘pickle’ module is certainly irrelevant, so we start by examining the one in Objects\floatobject.c:

PyObject *
PyFloat_FromString(PyObject *v)
{
    ...
    if (end != last) {
        PyErr_Format(PyExc_ValueError,
                     "could not convert string to float: "
                     "%R", v);
        ...
    }
    ...
}

Sounds like this function receives a ‘str’ object. To be sure, we search for ‘PyFloat_FromString’ and find its declaration in Include\floatobject.h:

/* Return Python float from string PyObject. */
PyAPI_FUNC(PyObject *) PyFloat_FromString(PyObject*);

As we thought.

Now, it sounds very unlikely that the interpreter first converts numeric literals in the source into ‘str’ objects, and only then converts them into numeric objects. So PyFloat_FromString is probably not the one that raised that ‘could not convert string to float’ error.
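
As a quick aside, the builtin ‘float’ presumably goes through PyFloat_FromString when it is given a ‘str’ (a guess on our part, we haven’t traced it). Note how its message carries the quoted repr produced by %R, unlike the unquoted message our patched interpreter printed:

>>> float('2f3')
ValueError: could not convert string to float: '2f3'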

We turn to look at Python\pystrtod.c:

/* PyOS_string_to_double converts a null-terminated byte string s (interpreted
   as a string of ASCII characters) to a float.  The string should not have
   leading or trailing whitespace.  The conversion is independent of the
   current locale.
   ...
*/

double
PyOS_string_to_double(const char *s,
                      char **endptr,
                      PyObject *overflow_exception)
{
    ...
    else if (!endptr && (fail_pos == s || *fail_pos != '\0'))
        PyErr_Format(PyExc_ValueError,
                      "could not convert string to float: "
                      "%.200s", s);
    else if (fail_pos == s)
        PyErr_Format(PyExc_ValueError,
                      "could not convert string to float: "
                      "%.200s", s);
    ...
}

Hmmm…
Converting a string of ASCII chars into a float sounds more like something related to the handling of a numeric literal in Python source code.

We search for ‘PyOS_string_to_double’, and find it in 8 different files. However, one file stands out among all others: There is a call to PyOS_string_to_double in Python\ast.c, from the static function parsenumber.

Awesome!
In part 2 we found that PyAST_FromNodeObject in Python\ast.c transforms the CST into an AST. Optimistic as ever, we guess that parsenumber is part of the code that does this transformation.

But a guess is not enough, so we examine Python\ast.c more closely:

...
static int validate_stmts(asdl_seq *);
static int validate_exprs(asdl_seq *, expr_context_ty, int);
static int validate_nonempty_seq(asdl_seq *, const char *, const char *);
static int validate_stmt(stmt_ty);
static int validate_expr(expr_ty, expr_context_ty);
...
int
PyAST_Validate(mod_ty mod)
{
    ...
}
...
static PyObject *parsenumber(struct compiling *, const char *);
...

/* Transform the CST rooted at node * to the appropriate AST
*/

mod_ty
PyAST_FromNodeObject(const node *n, PyCompilerFlags *flags,
                     PyObject *filename, PyArena *arena)
{
    ...
}

mod_ty
PyAST_FromNode(const node *n, PyCompilerFlags *flags, const char *filename_str,
               PyArena *arena)
{
    mod_ty mod;
    PyObject *filename;
    filename = PyUnicode_DecodeFSDefault(filename_str);
    ...
    mod = PyAST_FromNodeObject(n, flags, filename, arena);
    ...
    return mod;

}
...
static PyObject *
parsenumber(struct compiling *c, const char *s)
{
    ...
}
...

It seems like Python\ast.c consists of two quite separate parts (there aren’t any calls from one part to functions in the other part):

  1. A part that contains functions related to validation checks, whose only non-static function is PyAST_Validate.
  2. A part that contains functions related to transforming the CST into an AST. The only non-static functions in this part are PyAST_FromNodeObject and PyAST_FromNode, while PyAST_FromNode just decodes some filename, and calls PyAST_FromNodeObject.

The second part is obviously the one we care about, so we can simply ignore PyAST_Validate.
This means, in short, that the only way for parsenumber to ever be called is through PyAST_FromNodeObject.

Very well.
This is our current hypothesis about the way in which the CPython interpreter handles numeric literals:

  1. The tokenizer classifies a numeric literal as a NUMBER token.
  2. The parser makes a CST node for the NUMBER token, in which it stores the token in some format, while also storing the literal’s source as a utf-8 encoded string.
  3. The parser adds the CST node to the appropriate place in the CST.
  4. While the CST is converted into an AST, that CST node is parsed and entered into the AST in some form. Among other things, the literal’s source as a utf-8 encoded string is passed to parsenumber, which converts it into a Python numeric object (which is then stored in the AST node).
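
We can illustrate the two ends of this hypothesis from Python itself, using the pure-Python tokenize module (a re-implementation of the tokenizer, separate from the C code we are reading) and the ast module. This is just an illustration, not proof of what the C code does:

>>> from tokenize import tokenize, tok_name
>>> from io import BytesIO
>>> [(tok_name[t.type], t.string) for t in tokenize(BytesIO(b'0x10\n').readline)]
[('ENCODING', 'utf-8'), ('NUMBER', '0x10'), ('NEWLINE', '\n'), ('ENDMARKER', '')]
>>> import ast
>>> ast.dump(ast.parse('0x10', mode='eval'))
'Expression(body=Num(n=16))'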

In the previous post, we saw that parsetok (in Parser\parsetok.c) calls PyTokenizer_Get to get the next token, and then calls PyParser_AddToken to add the token to the CST. parsetok passes the following information (which it received from PyTokenizer_Get) to PyParser_AddToken:

  1. the token’s type
  2. the token’s source as a utf-8 encoded string
  3. the token’s location in the source
  4. the address of some error code variable

In part 2, we saw that (at least in the implementation of the builtin ‘eval’) the CST is constructed and passed to PyAST_FromNodeObject right away.
Thus, to prove our hypothesis, we need only to:

  1. find how a token’s source as a utf-8 encoded string is stored in a CST node
  2. verify that the string passed to parsenumber is retrieved from where the token’s source was stored

Let us start by looking at PyParser_AddToken’s implementation.
We search for ‘PyParser_AddToken’, and find it in Parser\parser.c.
Oh boy.
It seems like PyParser_AddToken is full of compiler stuff, which I don’t understand yet. Luckily, the ‘source as a utf-8 encoded string’ parameter is used only 3 times in the function, so we will just follow the references to it:

int
PyParser_AddToken(parser_state *ps, int type, char *str,
                  int lineno, int col_offset, int *expected_ret)
{
    int ilabel;
    ...
    D(printf("Token %s/'%s' ... ", _PyParser_TokenNames[type], str));
    ...
    /* Find out which label this token is */
    ilabel = classify(ps, type, str);
    ...
    /* Loop until the token is shifted or an error occurred */
    for (;;) {
        ...
        /* Check accelerator */
        if (s->s_lower <= ilabel && ilabel < s->s_upper) {
            ...
            if (x != -1) {
                ...
                /* Shift the token */
                if ((err = shift(&ps->p_stack, type, str,
                                x, lineno, col_offset)) > 0) {
                    ...
                }
                ...
            }
        }
        ...
    }
}

Let’s go over the references one by one.

Ah? A function named ‘D’? Weird.
We look for ‘D’ in Parser\parser.c and find out it is actually a macro:

#ifdef Py_DEBUG
extern int Py_DebugFlag;
#define D(x) if (!Py_DebugFlag); else x
#else
#define D(x)
#endif

In other words, anything inside D would run only if we are in some debug mode.

Back to PyParser_AddToken.

We continue to the second reference to the source string.

Maybe classify stores the source string somewhere…
We search for ‘classify’, and find it in Parser\parser.c:

static int
classify(parser_state *ps, int type, const char *str)
{
    ...
    if (type == NAME) {
        const char *s = str;
        ...
        for (i = n; i > 0; i--, l++) {
            if (l->lb_type != NAME || l->lb_str == NULL ||
                l->lb_str[0] != s[0] ||
                strcmp(l->lb_str, s) != 0)
                continue;
                ...
            ...
            D(printf("It's a keyword\n"));
            return n - i;
        }
    }
    ...
}

In case classify receives a NAME token, it copies the address of the source string, and compares it to some strings.
In the previous post, we saw that the tokenizer classifies most keywords (all of them, actually, except for ‘async’ and ‘await’) as NAME tokens. Looks like classify is the one that receives (among other tokens) all NAME tokens, and distinguishes between real names and Python keywords.
(Hmmm… str is copied into s for no apparent reason. I have opened an issue about that in CPython’s bug tracker.)
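
For what it’s worth, the keyword list that classify ends up recognizing is also exposed at the Python level by the keyword module (which, as far as we can tell, is generated from the same grammar):

>>> import keyword
>>> keyword.kwlist[:5]
['False', 'None', 'True', 'and', 'as']
>>> keyword.iskeyword('spam')
False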

Back to PyParser_AddToken.

Inside some infinite loop, it seems the token is shifted by calling shift. We don’t really know what that means, but we just follow references to the source string, so we search for ‘shift’ and find it also in Parser\parser.c:

static int
shift(stack *s, int type, char *str, int newstate, int lineno, int col_offset)
{
    ...
    err = PyNode_AddChild(s->s_top->s_parent, type, str, lineno, col_offset);
    ...
    return 0;
}

We search for ‘PyNode_AddChild’, and find it in Parser\node.c:

int
PyNode_AddChild(node *n1, int type, char *str, int lineno, int col_offset)
{
    ...
    node *n;
    ...
    n = &n1->n_child[n1->n_nchildren++];
    n->n_type = type;
    n->n_str = str;
    n->n_lineno = lineno;
    n->n_col_offset = col_offset;
    n->n_nchildren = 0;
    n->n_child = NULL;
    return 0;
}

The next empty child node of n1 is filled with the token’s attributes, just as they are, no conversions.
Most importantly for our purpose, we now know that the token’s source as a utf-8 encoded string is stored in the n_str field of every CST node.
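
As an aside, the CST is reachable from Python through the parser module (deprecated and eventually removed in later CPython versions), and the stored token strings are visible at its leaves. A rough sketch, with a small hypothetical helper to collect the leaves (the exact nesting depends on the grammar version):

>>> import parser
>>> def leaves(n):
...     # a terminal node looks like [token_type, 'token source string']
...     return [n[1]] if isinstance(n[1], str) else sum((leaves(c) for c in n[1:]), [])
...
>>> leaves(parser.expr('0x10 + 1').tolist())
['0x10', '+', '1', '', '']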

Ok then, so to prove our hypothesis, we only have to verify that the string passed to parsenumber is retrieved from the n_str field of a CST node.

We search for ‘parsenumber’ and find (in Python\ast.c) a single call to it:

static expr_ty
ast_for_atom(struct compiling *c, const node *n)
{
    /* atom: '(' [yield_expr|testlist_comp] ')' | '[' [testlist_comp] ']'
       | '{' [dictmaker|testlist_comp] '}' | NAME | NUMBER | STRING+
       | '...' | 'None' | 'True' | 'False'
    */
    node *ch = CHILD(n, 0);
    int bytesmode = 0;

    switch (TYPE(ch)) {
    case NAME: {
        PyObject *name;
        const char *s = STR(ch);
        size_t len = strlen(s);
        if (len >= 4 && len <= 5) {
            if (!strcmp(s, "None"))
                return NameConstant(Py_None, LINENO(n), n->n_col_offset, c->c_arena);
            if (!strcmp(s, "True"))
                return NameConstant(Py_True, LINENO(n), n->n_col_offset, c->c_arena);
            if (!strcmp(s, "False"))
                return NameConstant(Py_False, LINENO(n), n->n_col_offset, c->c_arena);
        }
        name = new_identifier(s, c);
        ...
        /* All names start in Load context, but may later be changed. */
        return Name(name, Load, LINENO(n), n->n_col_offset, c->c_arena);
    }
    case STRING: {
        PyObject *str = parsestrplus(c, n, &bytesmode);
        ...
        if (bytesmode)
            return Bytes(str, LINENO(n), n->n_col_offset, c->c_arena);
        else
            return Str(str, LINENO(n), n->n_col_offset, c->c_arena);
    }
    case NUMBER: {
        PyObject *pynum = parsenumber(c, STR(ch));
        ...
        return Num(pynum, LINENO(n), n->n_col_offset, c->c_arena);
    }
    case ELLIPSIS: /* Ellipsis */
        ...
    case LPAR: /* some parenthesized expressions */
        ...
    case LSQB: /* list (or list comprehension) */
        ...
    case LBRACE: {
        ...
    }
    ...
}

We cannot resist the temptation to glance at the parts of ast_for_atom that are irrelevant for our purpose.

We start with the comment at the top of the function.

Hmmm… What is this atom thing, anyway? Maybe it is another word for a literal? We google ‘python atom literal’, and find 6. Expressions — Python 3.5.1 documentation.

So identifiers, literals and enclosures are all atoms. And ast_for_atom probably handles all of these.

Looks like ast_for_atom starts by retrieving a child node of the received node, using CHILD. We search for ‘CHILD’, and find it (along with some more macros we have seen in ast_for_atom) in Include\node.h:

/* Node access functions */
#define NCH(n)      ((n)->n_nchildren)

#define CHILD(n, i) (&(n)->n_child[i])
#define RCHILD(n, i)    (CHILD(n, NCH(n) + i))
#define TYPE(n)     ((n)->n_type)
#define STR(n)      ((n)->n_str)
#define LINENO(n)       ((n)->n_lineno)

Great! Macros to access the fields of a CST node!

Back to ast_for_atom.

Indeed, the address of the first child of the received CST node is stored in ch.

After that, there is a switch case on the type of the first child, which is probably the type given to it by the tokenizer:

  1. A NAME CST node is checked to determine whether its source string is ‘None’, ‘True’ or ‘False’ (the check is done by using STR, which returns the source string of a CST node):
    1. If it is, a NameConstant AST node that contains the right constant is returned.
    2. Otherwise, it looks like new_identifier is called to convert the source string into a ‘str’ object. Then, a Name AST node which contains the ‘str’ object is returned.
  2. A STRING CST node is converted into either a ‘str’ or ‘bytes’ object by parsestrplus. Later, either a Str or Bytes AST node (containing the converted object) is returned.
  3. A NUMBER CST node is converted into a Python numeric object (we guess) by parsenumber. Thankfully, we note that STR is used to pass the n_str field of the CST node to parsenumber. Subsequently, a Num AST node which contains the converted object is returned.
  4. ELLIPSIS, LPAR (left parenthesis), LSQB (left square bracket) and LBRACE (left curly brace) CST nodes are also handled, but we will leave those for another time.
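
For instance, using the ast module (as we did in part 2), the difference between the first two NAME cases above is easy to see:

>>> import ast
>>> ast.dump(ast.parse('True', mode='eval'))
'Expression(body=NameConstant(value=True))'
>>> ast.dump(ast.parse('spam', mode='eval'))
"Expression(body=Name(id='spam', ctx=Load()))"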

Excellent.

It seems our hypothesis was right.
Which means we only have to understand how parsenumber works, patch it, and we are done!

Finally, we turn to examine parsenumber:

static PyObject *
parsenumber(struct compiling *c, const char *s)
{
    const char *end;
    long x;
    double dx;
    Py_complex compl;
    int imflag;
    ...
    errno = 0;
    end = s + strlen(s) - 1;
    imflag = *end == 'j' || *end == 'J';
    if (s[0] == '0') {
        x = (long) PyOS_strtoul(s, (char **)&end, 0);
        if (x < 0 && errno == 0) {
            return PyLong_FromString(s, (char **)0, 0);
        }
    }
    else
        x = PyOS_strtol(s, (char **)&end, 0);
    if (*end == '\0') {
        if (errno != 0)
            return PyLong_FromString(s, (char **)0, 0);
        return PyLong_FromLong(x);
    }
    /* XXX Huge floats may silently fail */
    if (imflag) {
        ...
        return PyComplex_FromCComplex(compl);
    }
    else
    {
        dx = PyOS_string_to_double(s, NULL, NULL);
        ...
        return PyFloat_FromDouble(dx);
    }
}

First things first, we should take a quick look at some of the functions called here (i.e. the ones that look relevant for our purpose).
We search for ‘PyLong_FromLong’ and ‘PyLong_FromString’, and find both in Objects\longobject.c:

/* Create a new int object from a C long int */

PyObject *
PyLong_FromLong(long ival)
{
    ...
}
...
/* Parses an int from a bytestring. Leading and trailing whitespace will be
 * ignored.
 *
 * If successful, a PyLong object will be returned and 'pend' will be pointing
 * to the first unused byte unless it's NULL.
 *
 * If unsuccessful, NULL will be returned.
 */
PyObject *
PyLong_FromString(const char *str, char **pend, int base)
{
    ...
}

Quite straightforward… Oh, and this base parameter looks promising.
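
The Python-level counterpart of that base parameter is the second argument of the builtin ‘int’ (which, we assume, ends up in the same string-to-int machinery). A base of 0 means ‘infer the base from a 0x/0o/0b prefix, defaulting to decimal’:

>>> int('ff', 16)
255
>>> int('0x2f', 0)
47
>>> int('100', 0)
100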

Next, we search for ‘PyOS_strtoul’ and ‘PyOS_strtol’, and find their definitions next to each other in Python\mystrtoul.c:

/*
**      strtoul
**              This is a general purpose routine for converting
**              an ascii string to an integer in an arbitrary base.
**              Leading white space is ignored.  If 'base' is zero
**              it looks for a leading 0b, 0o or 0x to tell which
**              base.  If these are absent it defaults to 10.
**              Base must be 0 or between 2 and 36 (inclusive).
**              If 'ptr' is non-NULL it will contain a pointer to
**              the end of the scan.
**              Errors due to bad pointers will probably result in
**              exceptions - we don't check for them.
*/
unsigned long
PyOS_strtoul(const char *str, char **ptr, int base)
{
    ...
    /* set pointer to point to the last character scanned */
    if (ptr)
        *ptr = (char *)str;

    return result;

overflowed:
    if (ptr) {
        /* spool through remaining digit characters */
        while (_PyLong_DigitValue[Py_CHARMASK(*str)] < base)
            ++str;
        *ptr = (char *)str;
    }
    errno = ERANGE;
    return (unsigned long)-1;
}
...
long
PyOS_strtol(const char *str, char **ptr, int base)
{
    long result;
    unsigned long uresult;
    char sign;

    while (*str && Py_ISSPACE(Py_CHARMASK(*str)))
        str++;

    sign = *str;
    if (sign == '+' || sign == '-')
        str++;

    uresult = PyOS_strtoul(str, ptr, base);

    if (uresult <= (unsigned long)LONG_MAX) {
        result = (long)uresult;
        if (sign == '-')
            result = -result;
    }
    ...
    else {
        errno = ERANGE;
        result = LONG_MAX;
    }
    return result;
}

The comment of PyOS_strtoul is really informative. Also, we note that PyOS_strtoul fails if it receives a number too big to fit in a C unsigned long, but it still sets the received pointer to the last character scanned (i.e. one char after the last digit char).

PyOS_strtol doesn’t have a useful comment, but it is short, so we will go over it quickly.

It starts with a while loop to skip all leading spaces in the number’s string. The first non-space char of the number is stored in sign, and is skipped in case it really is a sign symbol.

Now that it has the number stripped of the sign symbol, it just passes it to PyOS_strtoul, which does the job of converting it into a C unsigned long.

At last, the converted number (as a C unsigned long) is converted yet again (if possible), this time into a C long, which is returned.
If it isn’t possible (or if the number couldn’t be converted into a C unsigned long in the first place), the global errno is set to ERANGE (which probably means range error), and an error value is returned.

Back to parsenumber. This time for real.

First, the address of the number’s last char is calculated and stored in end. Whether this is an imaginary number literal (i.e. whether the last char is the letter ‘j’) is stored in imflag.

Then, if the number’s first char is zero, it is obviously not a negative number. Therefore, PyOS_strtoul is called to try to convert it into a C unsigned long, and the value it returns is cast to a C long.

What happens after the call to PyOS_strtoul is up to the number:

  1. If the number is too big to fit into a C unsigned long, errno is ERANGE, and x is -1.
  2. If the number fits into a C unsigned long, but not into a C long, errno is zero, and x is negative. In this case, the following if condition is met, so PyLong_FromString is called, and its return value is returned.
  3. If the number fits into a C unsigned long as well as into a C long, errno is zero, and x is non-negative.

If the number’s first char is not zero, it might be negative, so PyOS_strtol is called. Luckily, as we saw earlier, PyOS_strtol keeps it simple. So there are only two options after calling it:

  1. The number fits into a C long, errno is zero, and x is the number.
  2. The number does not fit into a C long, errno is ERANGE, and x is some error value.

At this point, PyOS_strtoul must have been called, either explicitly, or implicitly through PyOS_strtol. Anyway, PyOS_strtoul has updated end to point to the char after the last digit it parsed.

Whatever comes next is determined by where end points to:

  1. PyOS_strtoul has parsed the whole number, and thus end points to the null-terminator. Whatever led us here, if errno is ERANGE, the number didn’t fit into a C long, so PyLong_FromString is called, and its return value is returned. Else, the number did fit into x (a C long), and so PyLong_FromLong is called, and its return value is returned.
  2. PyOS_strtoul has not finished parsing the number because it ends with the letter ‘j’. In that case, imflag is set. Ultimately, PyComplex_FromCComplex is called, and its return value is returned.
  3. PyOS_strtoul has not finished parsing the number because it is a number with an exponent part, which begins with the letter ‘e’. This time, imflag is not set, and as a result, PyOS_string_to_double is called to convert the number into a C double. Finally, PyFloat_FromDouble is called, and its return value is returned.
  4. PyOS_strtoul has not finished parsing the number because it is a fraction, which has a dot somewhere, and the flow is the same as the one in the previous scenario.
  5. There shouldn’t be any more scenarios… unless someone actually patched CPython in some weird way, just as we did 🙂
    In our patched CPython, it might be that the number PyOS_strtoul received is a hex integer literal without a prefix, but PyOS_strtoul treats it like a decimal integer literal, and stops parsing it on any of its alphabetical hex digits. In that case, imflag is not set, so just as in the two previous scenarios, PyOS_string_to_double is called. It fails, of course, as we have seen at the end of the last post, and the beginning of this one.
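
One visible consequence of the first scenario above (on an unpatched CPython) is that an integer literal too big for a C long still works, because parsenumber falls back to PyLong_FromString, which builds an arbitrary-precision ‘int’:

>>> 123456789123456789123456789
123456789123456789123456789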

Hmmm…

It seems we are finally ready for the second patch.
We don’t care about the part of parsenumber that handles numbers starting with zero, nor do we care about the part that handles imaginary numbers or fractions.

We simply want integer literals without any prefix to be treated as hexadecimal numbers.

Again, our patch turns out to be quite simple:

static PyObject *
parsenumber(struct compiling *c, const char *s)
{
    const char *end;
    long x;
    double dx;
    Py_complex compl;
    int imflag;
    ...
    errno = 0;
    end = s + strlen(s) - 1;
    imflag = *end == 'j' || *end == 'J';
    if (s[0] == '0') {
        x = (long) PyOS_strtoul(s, (char **)&end, 0);
        if (x < 0 && errno == 0) {
            return PyLong_FromString(s, (char **)0, 0);
        }
    }
    else
        // origLine: x = PyOS_strtol(s, (char **)&end, 0);
        x = PyOS_strtol(s, (char **)&end, 0x10);   // orenmnLine
    if (*end == '\0') {
        if (errno != 0)
            // origLine: return PyLong_FromString(s, (char **)0, 0);
            return PyLong_FromString(s, (char **)0, 0x10);     // orenmnLine
        return PyLong_FromLong(x);
    }
    /* XXX Huge floats may silently fail */
    if (imflag) {
        ...
        return PyComplex_FromCComplex(compl);
    }
    else
    {
        dx = PyOS_string_to_double(s, NULL, NULL);
        ...
        return PyFloat_FromDouble(dx);
    }
}

We build our patched CPython, and somewhat surprisingly, the default base of integer literals indeed seems to be hexadecimal.

Understanding CPython by Patching – part 3

In the previous post, we started exploring CPython in order to find a way to turn the default base of integer literals in Python source code from decimal to hexadecimal. (The last post ended with a short recap. Feel free to check it out for a quick refresher.)
Without further ado, we will continue right where we stopped last time.

So, we found out that parsetok in Parser\parsetok.c does the tokenizing and the parsing. For this purpose, it receives a pointer to a tok_state struct (i.e. a tokenizer struct) that contains (among other things) the Python source code string. parsetok is a little big, but we are not intimidated:

/* Parse input coming from the given tokenizer structure.
   Return error code. */

static node *
parsetok(struct tok_state *tok, grammar *g, int start, perrdetail *err_ret,
         int *flags)
{
    parser_state *ps;
    node *n;
    ...
    if ((ps = PyParser_New(g, start)) == NULL) {
        ...
    }
    ...
    for (;;) {
        char *a, *b;
        int type;
        size_t len;
        char *str;
        ...
        type = PyTokenizer_Get(tok, &a, &b);
        if (type == ERRORTOKEN) {
            err_ret->error = tok->done;
            break;
        }
        ...
        len = b - a; /* XXX this may compute NULL - NULL */
        ...
        if (len > 0)
            strncpy(str, a, len);
        str[len] = '\0';
        ...
        if ((err_ret->error =
             PyParser_AddToken(ps, (int)type, str,
                               tok->lineno, col_offset,
                               &(err_ret->expected))) != E_OK) {
            ...
        }
    }

    if (err_ret->error == E_DONE) {
        n = ps->p_tree;
        ps->p_tree = NULL;
        ...
    }
    else
        n = NULL;
    ...
    PyTokenizer_Free(tok);

    return n;
}

Cool.
First, PyParser_New is called to create a parser_state struct (i.e. a parser struct), which also contains an empty CST. Then, in a loop, PyTokenizer_Get is called to get the next token’s string and type (in my humble opinion, ‘token_str_start_ptr’ and ‘token_str_end_ptr’ would have been more suitable names for the variables ‘a’ and ‘b’). If the token is valid (type != ERRORTOKEN), PyParser_AddToken is called to add the token to our CST. When there are no more tokens left, the tokenizing and parsing are completed. Subsequently, the tokenizer is freed, and the CST is returned.

We search for ‘PyTokenizer_Get’, and find it in Parser\tokenizer.c:

int
PyTokenizer_Get(struct tok_state *tok, char **p_start, char **p_end)
{
    int result = tok_get(tok, p_start, p_end);
    ...
    return result;
}

Ok, we go straight to tok_get (which is also in Parser\tokenizer.c), and…

Oh my.
tok_get is almost 500 lines of code. This is it. The tokenizing function. This is going to be a hell of a dive…
Well, actually we don’t feel like drowning today, so we will split it into some smaller dives:

/* Get next token, after space stripping etc. */

static int
tok_get(struct tok_state *tok, char **p_start, char **p_end)
{
    int c;
    ...
    /* Get indentation level */
    if (tok->atbol) {
        ...
        tok->atbol = 0;
        for (;;) {
            c = tok_nextc(tok);
            if (c == ' ')
                ...
            else if (c == '\t') {
                ...
            }
            else if (c == '\014') /* Control-L (formfeed) */
                ...
            else
                break;
        }
        tok_backup(tok, c);
        ...
    }
    
    tok->start = tok->cur;

    /* Return pending indents/dedents */
    if (tok->pendin != 0) {
        if (tok->pendin < 0) {
            tok->pendin++;
            return DEDENT;
        }
        else {
            tok->pendin--;
            return INDENT;
        }
    }
    ...
    /* Skip spaces */
    do {
        c = tok_nextc(tok);
    } while (c == ' ' || c == '\t' || c == '\014');

    /* Set start of current token */
    tok->start = tok->cur - 1;

    /* Skip comment */
    if (c == '#')
        while (c != EOF && c != '\n')
            c = tok_nextc(tok);

    /* Check for EOF and errors now */
    if (c == EOF) {
        return tok->done == E_EOF ? ENDMARKER : ERRORTOKEN;
    }

First, if the tokenizer’s atbol (which stands for ‘at begin of line’) flag is set, spaces and tabs are counted. This is done by calling tok_nextc repeatedly to get the next char from the tokenizer, until a char other than a space or a tab is encountered, and then calling tok_backup to restore the extra char consumed by tok_nextc.
If any erroneous indentation is spotted, ERRORTOKEN is returned (I have removed those checks).
Otherwise, if the indentation of this line is bigger or smaller than the last one, either INDENT or DEDENT is returned respectively.
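
The pure-Python tokenize module (a re-implementation that is separate from the C tokenizer we are reading) emits the same kinds of tokens, which gives a feel for the INDENT/DEDENT bookkeeping:

>>> from tokenize import tokenize, tok_name
>>> from io import BytesIO
>>> [tok_name[t.type] for t in tokenize(BytesIO(b'if x:\n    y\n').readline)]
['ENCODING', 'NAME', 'NAME', 'OP', 'NEWLINE', 'INDENT', 'NAME', 'NEWLINE', 'DEDENT', 'ENDMARKER']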

After that, tok_nextc is again called repeatedly in order to skip spaces and tabs. There are several scenarios in which we might or might not reach this space-skipping code:

  1. This token is at the beginning of a line:
    1. This line’s indentation is invalid, and so ERRORTOKEN is returned before we reach here.
    2. This line’s indentation is valid but different from the previous line’s, and so either INDENT or DEDENT is returned before we reach here.
    3. This line’s indentation is the same as the previous line, so we reach here after consuming all indentation spaces, and there aren’t any more spaces to skip.
  2. This token is in the middle or at the end of a line.

Indeed, if we encounter any spaces here, we must be in the middle or at the end of a line, where spaces are meaningless, and thus they are just skipped.

Later, everything from a ‘#’ char until a new line or until the end of the file is skipped, as it is simply a comment.
Finally (for this brief dive), if EOF is reached, tok_get returns.

Just to make sure it does what we think it does, we search for ‘tok_nextc’, and find its definition and tok_backup’s definition next to each other, also in Parser\tokenizer.c. tok_nextc is quite long, but its comment is good enough for us:

/* Get next char, updating state; error code goes into tok->done */

static int
tok_nextc(struct tok_state *tok)
{
    ...
}


/* Back-up one character */

static void
tok_backup(struct tok_state *tok, int c)
{
    if (c != EOF) {
        if (--tok->cur < tok->buf)
            Py_FatalError("tok_backup: beginning of buffer");
        if (*tok->cur != c)
            *tok->cur = c;
    }
}

tok_backup is kind of straightforward. tok->cur is decremented, but if it was already pointing to the beginning of the buffer, something, obviously, is terribly wrong, so a fatal error is raised. Now, in case the previous char is not already the char we wanted to restore, it is overwritten. We are having trouble figuring out why that should ever happen, but whatever.

Back to tok_get, it seems like we are finally starting to deal with chars that aren’t white-spaces:

    ...
    /* Identifier (most frequent token!) */
    ...
    if (is_potential_identifier_start(c)) {
        /* Process b"", r"", u"", br"" and rb"" */
        ...
        while (1) {
            if (!(saw_b || saw_u) && (c == 'b' || c == 'B'))
                ...
            else if (!(saw_b || saw_u || saw_r) && (c == 'u' || c == 'U'))
                ...
            else if (!(saw_r || saw_u) && (c == 'r' || c == 'R'))
                ...
            else
                break;
            c = tok_nextc(tok);
            if (c == '"' || c == '\'')
                goto letter_quote;
        }
        while (is_potential_identifier_char(c)) {
            ...
            c = tok_nextc(tok);
        }
        tok_backup(tok, c);
        ...
        *p_start = tok->start;
        *p_end = tok->cur;
        ...
        return NAME;
    }

We take a quick look at is_potential_identifier_start and is_potential_identifier_char, which turn out to be two simple macros (also defined in Parser\tokenizer.c) that do exactly what their names claim.

#define is_potential_identifier_start(c) (\
              (c >= 'a' && c <= 'z')\
               || (c >= 'A' && c <= 'Z')\
               || c == '_'\
               || (c >= 128))

#define is_potential_identifier_char(c) (\
              (c >= 'a' && c <= 'z')\
               || (c >= 'A' && c <= 'Z')\
               || (c >= '0' && c <= '9')\
               || c == '_'\
               || (c >= 128))

Back to tok_get, if is_potential_identifier_start returns true, a clever while loop checks whether it is actually some combination of a string or bytes literal prefix followed by an apostrophe or a quotation mark. If it is, it could only be a string or bytes literal, so we jump to letter_quote, which treats the token as a potential string or bytes literal.

Now that we know this token must be an identifier or a keyword, we consume chars until we reach the end of the token. This is done by calling tok_nextc and is_potential_identifier_char repeatedly, until is_potential_identifier_char returns false. Subsequently, tok_backup is called to restore the extra char that was consumed.

At this point, we have the whole token, so we can determine whether this is a valid ‘async’ or ‘await’ keyword (which is done by some checks I have removed). Otherwise, it must be an identifier or another keyword, so NAME is returned.

We wonder why ‘async’ and ‘await’ receive such special treatment, as it seems any other keyword (e.g. ‘if’, ‘else’) would be classified as a NAME token. We could probably find some smart answer in a PEP, but we will leave that for another time.
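
Indeed, on the 3.5-era CPython we are reading, ‘async’ and ‘await’ are not even listed as regular keywords at the Python level, while ordinary keywords are (they only became ordinary keywords later, in 3.7):

>>> import keyword
>>> keyword.iskeyword('else'), keyword.iskeyword('await')
(True, False)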

Anyway, we continue exploring tok_get:

    /* Newline */
    if (c == '\n') {
        ...
        return NEWLINE;
    }

    /* Period or number starting with period? */
    if (c == '.') {
        ...
        return DOT;
    }

    /* Number */
    if (isdigit(c)) {
        if (c == '0') {
            /* Hex, octal or binary -- maybe. */
            c = tok_nextc(tok);
            if (c == '.')
                goto fraction;
            if (c == 'j' || c == 'J')
                goto imaginary;
            if (c == 'x' || c == 'X') {

                /* Hex */
                c = tok_nextc(tok);
                if (!isxdigit(c)) {
                    tok->done = E_TOKEN;
                    tok_backup(tok, c);
                    return ERRORTOKEN;
                }
                do {
                    c = tok_nextc(tok);
                } while (isxdigit(c));
            }
            else if (c == 'o' || c == 'O') {
                /* Octal */
                c = tok_nextc(tok);
                if (c < '0' || c >= '8') {
                    tok->done = E_TOKEN;
                    tok_backup(tok, c);
                    return ERRORTOKEN;
                }
                do {
                    c = tok_nextc(tok);
                } while ('0' <= c && c < '8');
            }
            else if (c == 'b' || c == 'B') {
                /* Binary */
                c = tok_nextc(tok);
                if (c != '0' && c != '1') {
                    tok->done = E_TOKEN;
                    tok_backup(tok, c);
                    return ERRORTOKEN;
                }
                do {
                    c = tok_nextc(tok);
                } while (c == '0' || c == '1');
            }
            else {
                int nonzero = 0;
                /* maybe old-style octal; c is first char of it */
                /* in any case, allow '0' as a literal */
                while (c == '0')
                    c = tok_nextc(tok);
                while (isdigit(c)) {
                    nonzero = 1;
                    c = tok_nextc(tok);
                }
                if (c == '.')
                    goto fraction;
                else if (c == 'e' || c == 'E')
                    goto exponent;
                else if (c == 'j' || c == 'J')
                    goto imaginary;
                else if (nonzero) {
                    tok->done = E_TOKEN;
                    tok_backup(tok, c);
                    return ERRORTOKEN;
                }
            }
        }

Next, if the token is a new line or a period, NEWLINE or DOT is returned, respectively.

And then…
Unbelievable.
We actually got to where tok_get identifies a NUMBER token.
isdigit is called to check whether the first char of the token is a digit. If it is, then it could only be a number. First things first, if this char is a zero, we check for some special cases of number literals that start with a zero.

We call tok_nextc to get the next char, and check whether it is a dot. If it is, it could only be a fraction, so we jump to the code that handles fraction literals.
Then, we check whether the char following the leading zero is the letter ‘j’. If it is, it is the imaginary number zero, so we jump to the code that handles imaginary number literals (which probably does kind of nothing, as the letter ‘j’ must be the last char in an imaginary number literal).

Later, we check whether our number token starts with any of the three prefixes: Hex, octal or binary. If indeed it starts with any of those prefixes, tok_nextc is called again, and the next char of the token is checked. If the char is invalid in that number base, tok_backup is called to restore the invalid char (it is not a part of this token, even though it is invalid), and ERRORTOKEN is returned. Otherwise, it must be a valid NUMBER token, so tok_nextc is called repeatedly to consume all following digits (in that number base), and reach the end of the token.

Now we have the required knowledge to understand the following behavior:

>>> 0x123g
  File "<stdin>", line 1
    0x123g
         ^
SyntaxError: invalid syntax
>>> 0xg
  File "<stdin>", line 1
    0xg
     ^
SyntaxError: invalid token 

In the first one, the tokenizer identified the NUMBER token ‘0x123’ and the NAME token ‘g’. Then CPython tried to make sense of the syntax, but failed, and so raised an error saying ‘invalid syntax’.
In the second one, the tokenizer identified a token starting with a hex prefix (‘0x’), and concluded it must be a NUMBER token, but then realized the hex prefix is followed by a char which is not a hex digit. Therefore, it raised an error saying ‘invalid token’.

Back to tok_get.
If the starting zero is not part of a prefix, it is a leading zero in a NUMBER token, which is exactly the same as multiple leading zeros in a NUMBER token, so we might as well call tok_nextc repeatedly to consume all leading zeros.

After that, tok_nextc and isdigit are called repeatedly to consume all decimal digits, until a dot (which means it is a fraction), the letter ‘e’ (which means it is a number with an exponent part) or the letter ‘j’ (which means it is an imaginary number). If the decimal digits are followed by any of these 3, we jump to the appropriate code.

Wait a moment… We have already checked for a dot and the letter ‘j’ earlier! Looks like the first time was completely redundant. (I have opened an issue about that in CPython’s bug tracker.)

At last, if the token is a non-zero number that starts with leading zeros, and it is not any of those 3 special cases, tok_backup is called to restore the extra char that was consumed, and ERRORTOKEN is returned.
This sounds a little weird, so we try it out in our interpreter, and realize that indeed everything works exactly like that:

>>> 00000004
  File "<stdin>", line 1
    00000004
           ^
SyntaxError: invalid token
>>> 00000004e3
4000.0
>>> 00000004j
4j
>>> 00000004.
4.0
>>> 00000004.3
4.3
>>> 0000000
0
>>> 0000000.0
0.0

Maybe the ‘maybe old-style octal’ comment is related to that odd behavior. We google ‘python PEP octal’, and the first result is PEP 3127, which explains that in the ancient Python 2 (the wording is mine, of course), leading zeros in a number literal were the same as adding the ‘0o’ octal prefix. The old and wise core developers decided this behavior was confusing, and deprecated it.
It seems a little weird that numbers with an exponent part, fractions and imaginary numbers are still allowed to start with leading zeros, but whatever.

All right, so we are done with numbers that start with a zero. Let’s go back to tok_get, and examine the way other numbers are treated:

        else {
            /* Decimal */
            do {
                c = tok_nextc(tok);
            } while (isdigit(c));
            {
                /* Accept floating point numbers. */
                if (c == '.') {
        fraction:
                    /* Fraction */
                    do {
                        c = tok_nextc(tok);
                    } while (isdigit(c));
                }
                if (c == 'e' || c == 'E') {
                    int e;
                  exponent:
                    e = c;
                    /* Exponent part */
                    c = tok_nextc(tok);
                    if (c == '+' || c == '-') {
                        c = tok_nextc(tok);
                        if (!isdigit(c)) {
                            tok->done = E_TOKEN;
                            tok_backup(tok, c);
                            return ERRORTOKEN;
                        }
                    } else if (!isdigit(c)) {
                        tok_backup(tok, c);
                        tok_backup(tok, e);
                        *p_start = tok->start;
                        *p_end = tok->cur;
                        return NUMBER;
                    }
                    do {
                        c = tok_nextc(tok);
                    } while (isdigit(c));
                }
                if (c == 'j' || c == 'J')
                    /* Imaginary part */
        imaginary:
                    c = tok_nextc(tok);
            }
        }
        tok_backup(tok, c);
        *p_start = tok->start;
        *p_end = tok->cur;
        return NUMBER;
    }
    ...
}

If this else block is reached, the token starts with a decimal digit other than zero, which means it could only be a decimal number. So tok_nextc and isdigit are called repeatedly to consume all following decimal digits.
If the next char is a dot, it must be a fraction, and so tok_nextc and isdigit are again called repeatedly to consume all decimal digits of the fractional part.

Then, if the next char is the letter ‘e’, it might be a NUMBER token with an exponent part. Now, there are some options:

  1. The letter ‘e’ is followed by a plus or a minus, which means it must be a number with an exponent part:
    1. The plus or minus is followed by a decimal digit, i.e. this token is definitely a NUMBER token with a valid exponent part.
    2. The plus or minus is followed by a char which is not a decimal digit. This is considered illegal, so that char is restored (as it is not a part of the invalid token), and ERRORTOKEN is returned.
  2. The letter ‘e’ is followed by a char which is neither a sign symbol nor a decimal digit. This means the NUMBER token didn’t have an exponent part after all. Thus, tok_backup is called twice, to restore both that char and the letter ‘e’, and NUMBER is returned.
  3. The letter ‘e’ is followed by a decimal digit, which means it is indeed a NUMBER token with a valid exponent part.

If we reach the do-while loop after the else-if block, it is already known to be a NUMBER token with a valid exponent part, so tok_nextc and isdigit are called repeatedly to consume all of the decimal digits of the exponent part.

Now we have the necessary knowledge to unravel the difference between the following errors:

>>> 123expelliarmus
  File "<stdin>", line 1
    123expelliarmus
                  ^
SyntaxError: invalid syntax
>>> 123e+xpelliarmus
  File "<stdin>", line 1
    123e+xpelliarmus
        ^
SyntaxError: invalid token

In the first one, the tokenizer determines it is the NUMBER token ‘123’ followed by the NAME token ‘expelliarmus’ (it is only later that CPython realizes this is a syntax error).
In the second one, the tokenizer identifies the potential NUMBER token ‘123e+’, and then determines it is an ERRORTOKEN, because the plus is not followed by a decimal digit.

At last, if the NUMBER token (whatever kind of a NUMBER token it is) ends with the letter ‘j’, it is an imaginary number. After confirming the NUMBER token really is an imaginary number, tok_nextc is called to consume another char. This is done because all other flows reach the shared return code with an extra char consumed, so in order to make the call to tok_backup also a part of the shared code, the imaginary number flow must align with all other flows, and consume an extra char.
And then, finally, NUMBER is returned.

Phew.
Tokenizing is not an easy task, and that was only a NUMBER token.

Hmmm… after all that exploration, we realize CPython happily accepts some strange number literals, so to make sure we didn’t get it all wrong, we try them out in the interpreter:

>>> 243.j
243j
>>> 123.e2
12300.0

Whatever…

Anyway, it looks like we are ready for our first patch.
We want the tokenizer to identify a hex integer literal without any prefix as a NUMBER token. Also, we don’t want to mix hex integer literals with fractions or imaginary numbers (we don’t have to worry about mixing with numbers that have an exponent part, as the letter ‘e’ would be treated as a hex digit anyway).

Therefore, if our patched tokenizer identifies a hex integer literal without a prefix, it shouldn’t accept a dot or the letter ‘j’ as part of the token. However, if it identifies a decimal integer literal without a prefix, it should treat it as a decimal integer literal (i.e. accept a fraction and/or an imaginary part).

Somehow, this turned out to be quite a small patch:

        else {
            /* origComment: Decimal */
            /* orenmnComment: Hex or Decimal */
            int orenmn_is_hex_int_literal = 0;
            do {
                c = tok_nextc(tok);
                if (isxdigit(c) && !isdigit(c))
                    orenmn_is_hex_int_literal = 1;
            // origLine: } while (isdigit(c));
            } while (isxdigit(c));  // orenmnLine
            // origLine: {
            if (!orenmn_is_hex_int_literal) {
                /* Accept floating point numbers. */
                if (c == '.') {
        fraction:
                    /* Fraction */
                    do {
                        c = tok_nextc(tok);
                    } while (isdigit(c));
                }
                if (c == 'e' || c == 'E') {
                    int e;
                  exponent:
                    e = c;
                    /* Exponent part */
                    c = tok_nextc(tok);
                    if (c == '+' || c == '-') {
                        c = tok_nextc(tok);
                        if (!isdigit(c)) {
                            tok->done = E_TOKEN;
                            tok_backup(tok, c);
                            return ERRORTOKEN;
                        }
                    } else if (!isdigit(c)) {
                        tok_backup(tok, c);
                        tok_backup(tok, e);
                        *p_start = tok->start;
                        *p_end = tok->cur;
                        return NUMBER;
                    }
                    do {
                        c = tok_nextc(tok);
                    } while (isdigit(c));
                }
                if (c == 'j' || c == 'J')
                    /* Imaginary part */
        imaginary:
                    c = tok_nextc(tok);
            }
        }

We build our patched CPython, and get the following behavior:

>>> 2f3
ValueError: could not convert string to float: 2f3
>>> 2f3j
  File "<stdin>", line 1
    2f3j
       ^
SyntaxError: invalid syntax
>>> 243j
243j
>>> 2f3.
  File "<stdin>", line 1
    2f3.
       ^
SyntaxError: invalid syntax
>>> 243.
243.0
>>> 2f3.j
ValueError: could not convert string to float: 2f3
>>> 243.j
243j
>>> 2f3.123e2j
  File "<stdin>", line 1
    2f3.123e2j
             ^
SyntaxError: invalid syntax
>>> 243.123e2j
24312.3j
>>> 3e8
300000000.0
>>> 3e8a
ValueError: could not convert string to float: 3e8a

Well, at least we got some of it right (looks like hex integer literals actually don’t mix with fractions and imaginary numbers).
But why did CPython try to convert ‘2f3’ and ‘3e8a’ into floats?
This probably happened because the functions that do the parsing (or those that transform the CST into an AST) received a supposedly valid NUMBER token, which is not really that valid. Yet.

And thus, again, we must end this post abruptly, as it too became longer than it had any right to be. As usual, we will continue our journey in the next post.

part 4

Understanding CPython by Patching – part 2

In the previous post, we patched CPython to change the base of the representation of an ‘int’ object from decimal to hexadecimal.

In this post we will explore CPython further, in order to add some more patches to our patched CPython version. This time, our goal is to turn the default base of integer literals in Python source code from decimal to hexadecimal, i.e. we want the following behavior:

>>> hex(2f)
0x2f
>>> 100 - 1
ff

We watched From Source to Code: How CPython’s Compiler Works by Brett Cannon (Now that’s a cool name) some time ago, so we kind of know the steps taken by the CPython interpreter to run a piece of Python source code:

  1. decoding
  2. tokenizing
  3. parsing
  4. transforming the CST into an AST (in addition to Brett’s talk, you might refer to Eli’s post for more info)
  5. compiling
  6. executing the bytecode
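
These steps can be replayed, roughly, from Python itself. A simplified sketch (the real interpreter does all of this in C):

>>> import ast
>>> tree = ast.parse('0x10 + 1', mode='eval')   # decoding, tokenizing, parsing, CST -> AST
>>> code = compile(tree, '<demo>', 'eval')      # compiling the AST into bytecode
>>> eval(code)                                  # executing the bytecode
17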

Somewhere along these steps, there are two things we must change.

The first is the way the interpreter determines whether some characters form a number (or anything else that is valid according to the syntax) or a syntax error. The current behavior:

>>> 2f
  File "<stdin>", line 1
    2f
     ^
SyntaxError: invalid syntax

The second is the way the interpreter converts an integer literal in the source into a Python ‘int’ object. The current behavior:

>>> 100
64
>>> 0x100
100

First things first, we will go over the interpreter’s steps, and find out which of them are relevant to us.

Decoding is about converting the Python source code bytes from any encoding into the right format for the tokenizer and parser. But what is the right format for them? We google ‘python source encoding’, and find PEP 3120, which says that utf-8 is Python’s default source encoding. Is it because CPython internally stores the characters of a ‘str’ object as a utf-8 encoded string? We google ‘python internal string representation’, and find PEP 0393, which says that for each ‘str’, CPython checks the maximum number of bytes needed to encode any single character in the string. If that maximum is 2 bytes, the internal representation of the characters of the ‘str’ is an array of Py_UCS2 (a 16-bit char type). Similarly, if that maximum is 1 or 4 bytes, it is an array of Py_UCS1 (an 8-bit char type) or Py_UCS4 (a 32-bit char type), respectively. By the way, reading PEP 0393 gives us another important hint: CPython’s equivalent of Python’s ‘str’ is PyUnicode_Type.
Anyway, this internal representation is definitely not utf-8. With our newly acquired knowledge, we might guess that decoding is about converting the Python source code bytes (which are assumed to be a utf-8 encoded string unless explicitly specified otherwise) into that cool representation described in PEP 0393.
Back to business: while decoding, the interpreter doesn’t care about syntax yet, and definitely has nothing to do with converting anything into an ‘int’ object, so we won’t deliberately explore the code that does the decoding (but if we are lucky, we might accidentally find out whether our guess was right).
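
Before moving on, we can see a hint of that flexible representation from Python itself. The exact sizes depend on the build, but strings of 1-byte, 2-byte and 4-byte characters of the same length take increasingly more memory:

>>> import sys
>>> a = 'a' * 1000            # fits in Py_UCS1
>>> b = '\u0394' * 1000       # needs Py_UCS2
>>> c = '\U0001F40D' * 1000   # needs Py_UCS4
>>> sys.getsizeof(a) < sys.getsizeof(b) < sys.getsizeof(c)
True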

Tokenizing is about splitting the decoded source into tokens. According to Brett’s aforementioned talk, it seems tokenizing isn’t just splitting ‘x=3+2’ into [‘x’, ‘=’, ‘3’, ‘+’, ‘2’], but also identifying ‘x’ as a NAME token, ‘3’ and ‘2’ as NUMBER tokens, etc. Could it be that as part of the tokenizing, a literal such as 2f is identified as a syntax error, because it doesn’t look like any valid token?

Parsing is about constructing a CST (AKA a parse tree) out of the tokens the tokenizer has produced. Could it be that at this point an integer NUMBER token is converted and stored as an ‘int’ object in a node in the CST?

Transforming the CST into an AST is quite self explanatory. Could it be that at this point an integer NUMBER node in the CST is converted and stored as an ‘int’ object in a node in the AST?
We can’t believe an integer NUMBER token would be stored in the AST in any form other than an ‘int’ object (Certainly NUMBER tokens don’t suffer from the ‘Pikachu Syndrome’ and refuse to evolve into an ‘int’ object, right?). But could we be certain about it? We google ‘python ast example’, and find yet another useful post by Eli, this time about the ast module.
Now that we know how, we use the ast module to simulate a very simple AST:

>>> import ast
>>> ast.dump(ast.parse('0x10 + 0b111', mode='eval'))
'Expression(body=BinOp(left=Num(n=16), op=Add(), right=Num(n=7)))'

Great! As we thought, integers are not stored in the AST as the strings they once were in the source. Well, we still don’t really know those integers are stored as ‘int’ objects, but we know they are parsed and stored as numbers (instead of strings), and this is enough for our purpose.

Now, how can we find the pieces of code in CPython that do the tokenizing, the parsing, and the transforming of the CST into an AST?
In the previous post we found the C implementation of the builtin ‘hex’ function in Python\bltinmodule.c. This time, we search in bltinmodule.c for ‘eval’, and find builtin_eval_impl. This function must eventually take all of the aforementioned interpreter’s steps, maybe except for decoding (if our guess about the decoding was right). All right, it seems we just have to carefully follow the execution of builtin_eval_impl, and we will certainly find the functions that do the tokenizing, the parsing, and the transforming of the CST into an AST.
We start looking at builtin_eval_impl. Most of it deals with preparing the locals and globals, but the end of the function looks interesting:

static PyObject *
builtin_eval_impl(PyModuleDef *module, PyObject *source, PyObject *globals,
                  PyObject *locals)
/*[clinic end generated code: output=7284501fb7b4d666 input=11ee718a8640e527]*/
{
    PyObject *result, *source_copy;
    const char *str;
    ...
    if (PyCode_Check(source)) {
        ...
        return PyEval_EvalCode(source, globals, locals);
    }
    ...
    str = source_as_string(source, "eval", "string, bytes or code", &cf, &source_copy);
    ...
    result = PyRun_StringFlags(str, Py_eval_input, globals, locals, &cf);
    ...
    return result;
}

The first time the source parameter is being accessed is when it is passed to PyCode_Check. We search for ‘PyCode_Check’, and find out it is a simple macro defined in Include\code.h:

#define PyCode_Check(op) (Py_TYPE(op) == &PyCode_Type)

The macro just tells us if a CPython object’s type is PyCode_Type. This is probably the type of the ‘code’ object, but we search for ‘PyCode_Type’ just in case. Indeed, we find it in Objects\codeobject.c, and it is as we thought:

PyTypeObject PyCode_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "code",
    ...
};

Back to builtin_eval_impl, if ‘eval’ received a ‘code’ object as the source parameter, it just calls PyEval_EvalCode. But we are interested in Python source code that contains integer literals, so we assume the source parameter is a ‘str’ object.
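
That branch is easy to see from Python: handing ‘eval’ a ready-made ‘code’ object skips everything up to PyEval_EvalCode (or so we assume at this point):

>>> code = compile('100 - 1', '<demo>', 'eval')
>>> eval(code)
99
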
The source ‘str’ object is converted into a C string by source_as_string. Wait a moment… Convert a ‘str’ object into a C string? That means either taking the bytes of its Py_UCS1/Py_UCS2/Py_UCS4 array as is, or converting the ‘str’ into some encoding. We search for ‘source_as_string’, and find it also in bltinmodule.c:

static const char *
source_as_string(PyObject *cmd, const char *funcname, const char *what, PyCompilerFlags *cf, PyObject **cmd_copy)
{
    const char *str;
    ...
    if (PyUnicode_Check(cmd)) {
        ...
        str = PyUnicode_AsUTF8AndSize(cmd, &size);
        ...
    }
    ...
    return str;
}

First, the cmd parameter (the source) is passed to PyUnicode_Check. Earlier, we realized CPython’s equivalent of Python’s ‘str’ is PyUnicode_Type. We google ‘PyUnicode_Type’, and find Unicode Objects and Codecs – Python 3.5.1 documentation. We search there for PyUnicode_Check, and find:

int PyUnicode_Check(PyObject *o)

Return true if the object o is a Unicode object or an instance of a Unicode subtype.

So if the received source is a ‘str’ object (or an instance of a ‘str’ subclass), it is converted into utf-8, and the utf-8 encoded string is returned. Hmmm… Looks like our guess was wrong. It seems like the right format for the tokenizer and parser is actually a utf-8 encoded string.

Back to builtin_eval_impl, the ‘str’ source is converted into a utf-8 encoded string and passed to PyRun_StringFlags, along with the globals and locals. This means PyRun_StringFlags still has to do all of the work that follows the decoding. We search for ‘PyRun_StringFlags’, and find it in Python\pythonrun.c:

PyObject *
PyRun_StringFlags(const char *str, int start, PyObject *globals,
                  PyObject *locals, PyCompilerFlags *flags)
{
    ...
    mod = PyParser_ASTFromStringObject(str, filename, start, flags, arena);
    if (mod != NULL)
        ret = run_mod(mod, filename, globals, locals, flags, arena);
    ...
    return ret;
}
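
As an aside, PyRun_StringFlags (through its simpler wrapper PyRun_String, which passes NULL flags) is also a public entry point for embedders. Here is a minimal sketch of our own that drives it directly, using the same Py_eval_input start value that builtin_eval_impl passes:

#include <Python.h>
#include <stdio.h>

int main(void)
{
    Py_Initialize();

    PyObject *globals = PyDict_New();
    /* Make the builtins available, like eval does for us at the Python level. */
    PyDict_SetItemString(globals, "__builtins__", PyEval_GetBuiltins());

    PyObject *result = PyRun_String("2 * 3", Py_eval_input, globals, globals);
    if (result != NULL) {
        PyObject_Print(result, stdout, 0);   /* prints 6 */
        printf("\n");
        Py_DECREF(result);
    }

    Py_DECREF(globals);
    Py_Finalize();
    return 0;
}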

If we had to guess, we would say PyParser_ASTFromStringObject does everything from decoded source to AST, and run_mod receives the AST and does all the rest.
We search for ‘PyParser_ASTFromStringObject’ and find it too in Python\pythonrun.c:

/* Preferred access to parser is through AST. */
mod_ty
PyParser_ASTFromStringObject(const char *s, PyObject *filename, int start,
                             PyCompilerFlags *flags, PyArena *arena)
{
    ...
    node *n = PyParser_ParseStringObject(s, filename,
                                         &_PyParser_Grammar, start, &err,
                                         &iflags);
    ...
    if (n) {
        ...
        mod = PyAST_FromNodeObject(n, flags, filename, arena);
        PyNode_Free(n);
    }
    ...
    return mod;
}

We guess that PyParser_ParseStringObject does everything from decoded source to CST, and that PyAST_FromNodeObject transforms the CST into an AST. We search for ‘PyAST_FromNodeObject’, find it in Python\ast.c, and give it a quick look that confirms our guess:

/* Transform the CST rooted at node * to the appropriate AST
*/

mod_ty
PyAST_FromNodeObject(const node *n, PyCompilerFlags *flags,
                     PyObject *filename, PyArena *arena)
{
    ...
}

Awesome! Even though diving right into PyAST_FromNodeObject is quite tempting, it would probably be better for us to explore the interpreter’s steps in order. Thus, we will first complete our quest of finding the tokenizer and the parser, and only then go back to PyAST_FromNodeObject.

We search for ‘PyParser_ParseStringObject’, and find it in Parser\parsetok.c:

node *
PyParser_ParseStringObject(const char *s, PyObject *filename,
                           grammar *g, int start,
                           perrdetail *err_ret, int *flags)
{
    struct tok_state *tok;
    ...
    if (*flags & PyPARSE_IGNORE_COOKIE)
        tok = PyTokenizer_FromUTF8(s, exec_input);
    else
        tok = PyTokenizer_FromString(s, exec_input);
    ...
    return parsetok(tok, g, start, err_ret, flags);
}

Seems like we have found the functions that do the tokenizing, right? We search for ‘PyTokenizer_FromUTF8’, and find its definition right next to the definition of PyTokenizer_FromString, in Parser\tokenizer.c:

/* Set up tokenizer for string */

struct tok_state *
PyTokenizer_FromString(const char *str, int exec_input)
{
    struct tok_state *tok = tok_new();
    ...
    str = decode_str(str, exec_input, tok);
    ...
    tok->buf = tok->cur = tok->end = tok->inp = (char*)str;
    return tok;
}

struct tok_state *
PyTokenizer_FromUTF8(const char *str, int exec_input)
{
    struct tok_state *tok = tok_new();
    ...
    tok->input = str = translate_newlines(str, exec_input, tok);
    ...
    tok->str = str;
    ...
    strcpy(tok->encoding, "utf-8");
    ...
    tok->buf = tok->cur = tok->end = tok->inp = (char*)str;
    return tok;
}

We are disappointed to find out that each of these two just sets up a tok_state (i.e. a tokenizer struct), which will only be used later to do the actual tokenizing.
Also, it seems our second guess about the decoding (i.e. that the right format for the tokenizer and parser is a utf-8 encoded string) was right: in PyTokenizer_FromString the source string isn’t necessarily a utf-8 encoded string yet, so decode_str is called, while in PyTokenizer_FromUTF8 it already is, so no decoding is needed.
To be on the safe side, we take a quick look at decode_str (which we also find in Parser\tokenizer.c):

/* Decode a byte string STR for use as the buffer of TOK.
   Look for encoding declarations inside STR, and record them
   inside TOK.  */

static const char *
decode_str(const char *input, int single, struct tok_state *tok)
{
    PyObject* utf8 = NULL;
    const char *str;
    ...
    tok->input = str = translate_newlines(input, single, tok);
    ...
    if (tok->enc != NULL) {
        utf8 = translate_into_utf8(str, tok->enc);
        ...
        str = PyBytes_AsString(utf8);
    }
    ...
    tok->decoding_buffer = utf8; /* CAUTION */
    return str;
}

Indeed, decode_str converts a C string into a utf-8 encoded string. This ‘CAUTION’ comment is kind of scary, but we assume it doesn’t really matter to us.
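
To get a feel for what that re-encoding step amounts to, here is a rough sketch of our own (using only the public C API, not the actual tokenizer helpers): when the source carries an encoding declaration such as ‘# -*- coding: latin-1 -*-’, the raw bytes are decoded with the declared codec and the result is re-encoded as utf-8.

#include <Python.h>

/* A hypothetical helper (ours) that mirrors the idea behind decode_str and
   translate_into_utf8: declared encoding -> 'str' object -> utf-8 bytes. */
static PyObject *
reencode_to_utf8(const char *raw, Py_ssize_t len, const char *declared_enc)
{
    PyObject *unicode = PyUnicode_Decode(raw, len, declared_enc, "strict");
    if (unicode == NULL)
        return NULL;

    PyObject *utf8 = PyUnicode_AsUTF8String(unicode);
    Py_DECREF(unicode);
    /* PyBytes_AsString(utf8) would give the utf-8 encoded C string. */
    return utf8;
}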

Back to PyParser_ParseStringObject, we must conclude that parsetok (also defined in parsetok.c) does the tokenizing and the parsing on its own.

This post has become much longer than I expected, so we will end it with a short recap and continue our journey in the next post.

We found some things that would help us reach our goal:

  1. Sometime before the construction of the AST is finished, integer literals are parsed and stored as numbers (specifically, as Python ‘int’ objects, we guess).
  2. PyAST_FromNodeObject in Python\ast.c transforms the CST into an AST.
  3. parsetok in Parser\parsetok.c does the tokenizing and the parsing.

We also learned some other interesting things:

  1. Python’s default source encoding is utf-8, and decoding is actually converting source code encoded in any other encoding into a utf-8 encoded string.
  2. The characters of a ‘str’ object are internally stored as a Py_UCS1/Py_UCS2/Py_UCS4 array, according to the maximal number of bytes needed to encode any of the chars in the string (as described best in PEP 0393); a small sketch of this appears right below.
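
To make the second point concrete, here is a tiny standalone sketch of our own: PyUnicode_KIND reports whether a ‘str’ object stores its characters as Py_UCS1, Py_UCS2 or Py_UCS4 units.

#include <Python.h>
#include <stdio.h>

int main(void)
{
    Py_Initialize();

    PyObject *narrow = PyUnicode_FromString("abc");              /* ASCII only */
    PyObject *wide = PyUnicode_FromString("ab\xF0\x9F\x90\x8D"); /* utf-8 bytes of U+1F40D */

    /* 1 means a Py_UCS1 array, 2 means Py_UCS2, 4 means Py_UCS4. */
    printf("%d %d\n", (int)PyUnicode_KIND(narrow), (int)PyUnicode_KIND(wide));   /* 1 4 */

    Py_DECREF(narrow);
    Py_DECREF(wide);
    Py_Finalize();
    return 0;
}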

part 3

Understanding CPython by Patching – part 1

During the last year or so, I have made some attempts to become familiar with CPython (henceforth, I will refer to the CPython implementation of the awesome Python 3 simply as CPython). I have tried different tactics:

Along the way, I noticed some smart-looking people saying that the best way to understand CPython is to make some changes to it and see what happens. Of course, I felt I was capable enough to figure it all out by simply reading stuff. Thus it turned out that only a full year after starting my quest of understanding CPython did I decide to make some patches to it.

I found the process of searching through and messing with the CPython sources while having a well-defined purpose to be a great experience, both fun and quite educational. I decided to write this walkthrough in the hope of inspiring you to try making your own patches (which would probably be much better than mine).

It would probably be easier to follow the walkthrough while looking at my patched CPython version, but I will try to make things clear even without it.

Let us start with that walkthrough right away.

Our first goal is to change the base of the representation of an ‘int’ object from decimal to hexadecimal, i.e.:

>>> 31
1f
>>> 32
20

We know the ‘int’ type is implemented in C, so we have to find the C equivalents of ‘__repr__’ and ‘__str__’ in its type declaration. There must be a file in the ‘Objects’ directory for the ‘int’ type. Ah? No such file? We can easily spot floatobject.c, dictobject.c and listobject.c, but where is intobject.c?? Let’s try another approach: search every *.c and *.h file in the code base for ‘"int"’. Two results look quite interesting:

In Python\bltinmodule.c:

SETBUILTIN("int",           &PyLong_Type);

and in Objects\longobject.c:

PyTypeObject PyLong_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "int",            /* tp_name */

Seems like the ‘int’ type is referred to as PyLong_Type in CPython. To make sure, we google ‘int long in python 3’. One of the first results is Integer Objects – Python 3.5.1 documentation, which says very clearly:

PyTypeObject PyLong_Type

This instance of PyTypeObject represents the Python integer type. This is the same object as int in the Python layer.

This site looks really useful. We had better remember it.
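
Just to nail down the mapping, here is a tiny standalone sketch of our own (not part of the patch): an ‘int’ created from C really does have PyLong_Type as its type.

#include <Python.h>
#include <assert.h>

int main(void)
{
    Py_Initialize();

    PyObject *num = PyLong_FromLong(31);
    assert(Py_TYPE(num) == &PyLong_Type);   /* the Python-level 'int' type */
    assert(PyLong_Check(num));              /* the usual way to spell the check */

    Py_DECREF(num);
    Py_Finalize();
    return 0;
}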

Anyway, seems like we found the initialization of the ‘int’ PyTypeObject in Objects\longobject.c:

PyTypeObject PyLong_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "int",                                      /* tp_name */
    ...
    long_to_decimal_string,                     /* tp_repr */
    ...
    long_to_decimal_string,                     /* tp_str */
    ...
};

Great! Seems like replacing tp_repr and tp_str with our own function would do the job. We search for ‘long_to_decimal_string’ to figure out how our function should look. We find it in Objects\longobject.c:

static PyObject *
long_to_decimal_string(PyObject *aa)
{
    PyObject *v;
    if (long_to_decimal_string_internal(aa, &v, NULL) == -1)
        return NULL;
    return v;
}

long_to_decimal_string receives an ‘int’ and returns a ‘str’, as expected. So we have to find a function similar to the builtin ‘hex’ function, but one that won’t add the ‘0x’ prefix. We search for ‘"hex"’, and find this in Python\clinic\bltinmodule.c.h:

#define BUILTIN_HEX_METHODDEF \
    {"hex", (PyCFunction)builtin_hex, METH_O, builtin_hex__doc__},

All right. We search for ‘builtin_hex’, and find this in Python\bltinmodule.c:

static PyObject *
builtin_hex(PyModuleDef *module, PyObject *number)
/*[clinic end generated code: output=618489ce3cbc5858 input=e645aff5fc7d540e]*/
{
    return PyNumber_ToBase(number, 16);
}

Ah? What’s that ‘clinic’ thing? We google ‘python clinic’ and find PEP-0436 right away. This is some code generator to make the CPython developer’s life easier. Nothing we should worry about. Anyway, builtin_hex is just a wrapper of PyNumber_ToBase, which we search for and find in Objects\abstract.c:

PyObject *
PyNumber_ToBase(PyObject *n, int base)
{
    PyObject *res = NULL;
    PyObject *index = PyNumber_Index(n);

    if (!index)
        return NULL;
    if (PyLong_Check(index))
        res = _PyLong_Format(index, base);
    else
        /* It should not be possible to get here, as
        PyNumber_Index already has a check for the same
        condition */
        PyErr_SetString(PyExc_ValueError, "PyNumber_ToBase: index not int");
    Py_DECREF(index);
    return res;
}
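
Before we dig deeper, here is a quick standalone sketch of our own of what builtin_hex ends up computing through PyNumber_ToBase:

#include <Python.h>
#include <stdio.h>

int main(void)
{
    Py_Initialize();

    PyObject *num = PyLong_FromLong(31);
    PyObject *hex_repr = PyNumber_ToBase(num, 16);   /* the 'str' object "0x1f" */

    printf("%s\n", PyUnicode_AsUTF8(hex_repr));      /* prints 0x1f */

    Py_DECREF(hex_repr);
    Py_DECREF(num);
    Py_Finalize();
    return 0;
}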

Some more work is done in PyNumber_ToBase, but it still seems like most of the work happens in _PyLong_Format, which we find in Objects\longobject.c:

PyObject *
_PyLong_Format(PyObject *obj, int base)
{
    PyObject *str;
    int err;
    if (base == 10)
        err = long_to_decimal_string_internal(obj, &str, NULL);
    else
        err = long_format_binary(obj, base, 1, &str, NULL);
    if (err == -1)
        return NULL;
    return str;
}

In our case, base is 16, so long_format_binary is the relevant function, and we find it also in Objects\longobject.c:


/* Convert an int object to a string, using a given conversion base,
   which should be one of 2, 8 or 16.  Return a string object.
   If base is 2, 8 or 16, add the proper prefix '0b', '0o' or '0x'
   if alternate is nonzero. */

static int
long_format_binary(PyObject *aa, int base, int alternate,
                   PyObject **p_output, _PyUnicodeWriter *writer)
{
...

‘add the proper prefix … if alternate is nonzero’?!?

Perfect!

Looks like we know enough to write our own tp_repr and tp_str for ‘int’:

static PyObject *
orenmn_long_to_hex_string(PyObject *longObjPtr)
{
    PyObject *hexStrReprPtr;

    if (-1 == long_format_binary(
        longObjPtr, 0x10,
        0,  // alternate = 0 to exclude the '0x' prefix
        &hexStrReprPtr, NULL))
    {
        return NULL;
    }
    return hexStrReprPtr;
}

We put it in Objects\longobject.c, just like long_to_decimal_string. Now we just replace tp_repr and tp_str:

PyTypeObject PyLong_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "int",                                      /* tp_name */
...
    // origLine: long_to_decimal_string,        /* tp_repr */
    orenmn_long_to_hex_string,                  /* tp_repr */
...
    // origLine: long_to_decimal_string,        /* tp_str */
    orenmn_long_to_hex_string,                  /* tp_str */

And that’s it! We build our patched CPython, and now the representation of ‘int’ is really in hex 🙂
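
If we want to double-check from C as well, here is a small sketch of our own (run against the patched build): PyObject_Repr dispatches to the tp_repr slot we have just replaced.

#include <Python.h>
#include <stdio.h>

int main(void)
{
    Py_Initialize();

    PyObject *num = PyLong_FromLong(31);
    PyObject *repr = PyObject_Repr(num);     /* calls PyLong_Type.tp_repr */

    /* "1f" with the patch applied, "31" with a vanilla CPython. */
    printf("%s\n", PyUnicode_AsUTF8(repr));

    Py_DECREF(repr);
    Py_DECREF(num);
    Py_Finalize();
    return 0;
}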

part 2