
Alpha renaming in many languages

I have what I imagine will be a fairly involved technical challenge: I want to be able to reliably alpha-rename identifiers in multiple languages (as many as possible). This will require special consideration for each language, and I'm asking for advice for how to minimize the amount of work I need to do by sharing code. Something like a unified parsing or abstract syntax framework that already has support for many languages would be great.

For example, here is some python code:

def foo(x):
    def bar(y):
        return x+y
    return bar

An alpha renaming of x to y changes the x to a y and preserves semantics. So it would become:

def foo(y):
    def bar(y1):
        return y+y1
    return bar

See how we needed to rename y to y1 in order to keep from breaking the code? That is why this is a hard problem. It seems like the program would have to have a pretty good knowledge of what constitutes a scope, rather than just doing, say, a string search and replace.

I would also like to preserve as much of the formatting as possible: comments, spacing, indentation. But that is not 100% necessary, it would just be nice.

Any tips?

luqui asked Jan 28 '26 16:01



1 Answer

To do this safely, you need to be able to:

  • find all the identifiers in your code (and tell them apart from things that are not, e.g., text in the middle of a comment)
  • determine the scope of validity of each identifier
  • substitute a new identifier for an old one in the text
  • determine whether renaming an identifier causes another name to be shadowed

To determine identifiers accurately, you need at least a language-accurate lexer. Identifiers in PHP look different than they do in COBOL.
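As a rough illustration for Python only, the standard library's tokenize module can play the role of the language-accurate lexer. The helper name identifier_tokens and the sample text are made up for this sketch; other languages would need their own lexers.

```python
import io
import keyword
import tokenize


def identifier_tokens(source):
    """Yield (name, (row, col)) for each identifier token in Python source."""
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        # NAME tokens include keywords, so those are filtered out here.
        # Comments and string literals arrive as COMMENT and STRING tokens,
        # so an "x" inside them is never reported.
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            yield tok.string, tok.start


SAMPLE = 'x = 1  # x is not an identifier here\ns = "x"\nprint(x)\n'
found = list(identifier_tokens(SAMPLE))

for name, (row, col) in found:
    print(f"{name} at line {row}, column {col}")
```

Only the two real uses of x are reported; the x in the comment and the x in the string literal are skipped, which a plain string search would not manage.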

To determine scopes of validity, you have to be able to determine program structure, since in practice most "scopes" are defined by such structure. This means you need a language-accurate parser; scopes in PHP are different than scopes in COBOL.

To determine which names are valid in which scopes, you need to know the language scoping rules. Your language may insist that the identifier X will refer to different Xes depending on the context in which X is found (consider object constructors named X with different arguments). Now you need to be able to traverse the scope structures according to the naming rules. Single inheritance, multiple inheritance, overloading, and default types all pretty much require you to build a model of the scopes for the program, insert the identifiers and corresponding types into each scope, and then climb from the point of encounter of an identifier in the program text through the various scopes according to the language semantics. You will need symbol tables, inheritance linkages, ASTs, and the ability to navigate all of these. These structures differ between PHP and COBOL, but they share lots of common ideas, so you likely need a library that supports the common concepts.
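To make the idea of a scope model concrete, here is a small sketch using Python's standard symtable module on the example from the question. It covers Python only; scope_report is a hypothetical helper written for this illustration, and a multi-language tool would need an equivalent model per language.

```python
import symtable

SOURCE = """\
def foo(x):
    def bar(y):
        return x + y
    return bar
"""


def scope_report(table, path=()):
    """Return {scope path: {name: kind}} for a symbol table and its children."""
    here = path + (table.get_name(),)
    kinds = {}
    for sym in table.get_symbols():
        if sym.is_parameter():
            kind = "parameter"
        elif sym.is_free():
            kind = "free"      # bound in an enclosing function scope
        elif sym.is_local():
            kind = "local"
        else:
            kind = "global"
        kinds[sym.get_name()] = kind
    report = {here: kinds}
    for child in table.get_children():
        report.update(scope_report(child, here))
    return report


# The module-level table is named "top" by the symtable module.
report = scope_report(symtable.symtable(SOURCE, "<example>", "exec"))

for scope_path, names in report.items():
    print(".".join(scope_path), names)
```

The report shows that x inside bar is a free variable, i.e. it refers to the parameter of foo, which is exactly the information a renamer needs before touching either x or y.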

To rename an identifier, you have to modify the text. In a million lines of code, you need to point carefully. Modifying an AST node is one way to point carefully. Actually, you need to modify all the identifiers that correspond to the one being renamed; you have to climb over the tree to find them all, or record in the AST where all the references exist so they can be found easily. After modifying the AST, you have to regenerate the source text from it. That's a lot of machinery; see my SO answer on how to prettyprint ASTs preserving all of the stuff you reasonably suggest should be preserved. (Your other choice is to keep track in the AST of where the text for each identifier is, and then read/patch/write the file.)
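The read/patch/write alternative can be sketched in a few lines. This assumes some name resolver has already produced the exact positions of every reference to the identifier being renamed; here they are hard-coded, and patch_identifiers is a hypothetical helper, not part of any library.

```python
def patch_identifiers(source, positions, old, new):
    """Replace `old` with `new` at each (row, col) position.

    Rows are 1-based and columns 0-based, the convention used by Python's
    tokenize module. Everything else in the text is left untouched.
    """
    lines = source.splitlines(keepends=True)
    # Patch from the last position to the first so that earlier columns
    # on the same line are still valid when we reach them.
    for row, col in sorted(positions, reverse=True):
        line = lines[row - 1]
        if line[col:col + len(old)] != old:
            raise ValueError(f"expected {old!r} at {(row, col)}")
        lines[row - 1] = line[:col] + new + line[col + len(old):]
    return "".join(lines)


ORIGINAL = "def foo(x):  # takes x\n    return x + 1\n"
# Assumed output of a resolver: the parameter x and its one use.
X_POSITIONS = [(1, 8), (2, 11)]

patched = patch_identifiers(ORIGINAL, X_POSITIONS, "x", "value")
print(patched)
```

Because only the recorded spans are rewritten, comments, spacing, and indentation survive as they were; the comment still says "takes x" since it was never an identifier reference.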

Before you update the file, you need to check that you haven't shadowed something. Consider this code:

 {  local x;
    x=1;
    {  local y;
       y=2;
       {  local z;
          z=y;
          print(x);
       }
    }
 }

We agree this code prints "1". Now we decide to rename y to x. We've broken the scoping, and now the print statement, which referred conceptually to the outer x, refers to an x captured by the renamed y. The code now prints "2", so our rename broke it. This means that one must check all the other identifiers in scopes in which the renamed variable might be found, to see if the new name "captures" some name we weren't expecting. (The rename would be legal if the print statement printed z instead of x.)
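One way to phrase that check, sketched here on a toy model of nested block scopes: resolve every use before the rename, apply the rename tentatively, resolve again, and reject the rename if any use now binds to a different declaration. The Scope class and rename_preserves_bindings are invented for this illustration and ignore types, overloading, and inheritance.

```python
class Scope:
    """A block scope: declared names, names used directly in it, children."""

    def __init__(self, declared, uses, parent=None):
        self.declared = set(declared)
        self.uses = list(uses)
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def resolve(self, name):
        """Return the nearest enclosing scope that declares `name`, or None."""
        scope = self
        while scope is not None:
            if name in scope.declared:
                return scope
            scope = scope.parent
        return None


def all_uses(scope):
    """Yield (scope, index) for every use in `scope` and its descendants."""
    for index in range(len(scope.uses)):
        yield scope, index
    for child in scope.children:
        yield from all_uses(child)


def bindings(root):
    """List the declaring scope of every use, in a fixed traversal order."""
    return [s.resolve(s.uses[i]) for s, i in all_uses(root)]


def rename_preserves_bindings(root, decl_scope, old, new):
    """True if renaming `old` (declared in decl_scope) to `new` is safe."""
    if new in decl_scope.declared:
        return False  # direct clash within the same scope
    before = bindings(root)
    targets = [(s, i) for s, i in all_uses(root)
               if s.uses[i] == old and s.resolve(old) is decl_scope]
    # Apply the rename tentatively.
    decl_scope.declared.remove(old)
    decl_scope.declared.add(new)
    for s, i in targets:
        s.uses[i] = new
    after = bindings(root)
    # Undo, so this function only answers the question.
    for s, i in targets:
        s.uses[i] = old
    decl_scope.declared.remove(new)
    decl_scope.declared.add(old)
    return before == after


# The three nested blocks from the example above.
outer = Scope({"x"}, ["x"])
middle = Scope({"y"}, ["y"], parent=outer)
inner = Scope({"z"}, ["z", "y", "x"], parent=middle)

print(rename_preserves_bindings(outer, middle, "y", "x"))  # capture detected
print(rename_preserves_bindings(outer, middle, "y", "w"))  # harmless rename
```

The same comparison also catches the opposite direction of capture, as in the question's Python example, where an inner declaration would grab the uses of the newly renamed outer variable.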

This is a lot of machinery.

Yes, there is a framework that has almost all of this as well as a number of robust language front ends. See our DMS Software Reengineering Toolkit. It has parsers producing ASTs, prettyprinters to produce text back from ASTs, generic symbol table management machinery (including support for multiple inheritance), and AST visiting/modification machinery. It has front ends for C, C++, COBOL and Java that implement name and type resolution (e.g. instantiating symbol table scopes and identifier-to-symbol-table-entry mappings); it has front ends for many other languages that don't have scoping implemented yet.

We've just finished an exercise in implementing "rename" for Java. (All the above issues of course appeared.) We are about to start one for C++.

Ira Baxter answered Jan 30 '26 05:01



