php software - What kinds of patterns could I enforce on the code to make it easier to translate to another programming language?
There are a couple answers telling you not to bother. Well, how helpful is that? You want to learn? You can learn. This is compilation. It just so happens that your target language isn't machine code, but another high-level language. This is done all the time.
There's a relatively easy way to get started. First, go get http://sourceforge.net/projects/lime-php/ (if you want to work in PHP) or some such and go through the example code. Next, you can write a lexical analyzer using a sequence of regular expressions and feed tokens to the parser you generate. Your semantic actions can either output code directly in another language or build up some data structure (think objects, man) that you can massage and traverse to generate output code.
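To make the "sequence of regular expressions" idea concrete, here is a minimal sketch of such a lexer. The token names and the tiny PHP-like subset are assumptions chosen for illustration, not something lime-php prescribes; a real lexer needs many more rules (comments, double-quoted strings, multi-character operators, and so on).

```python
import re

# Hypothetical token set for a tiny PHP-like subset; order matters because
# the first alternative that matches wins.
TOKEN_SPEC = [
    ("NUMBER",   r"\d+(?:\.\d+)?"),
    ("VARIABLE", r"\$[A-Za-z_]\w*"),
    ("NAME",     r"[A-Za-z_]\w*"),
    ("STRING",   r"'(?:[^'\\]|\\.)*'"),
    ("OP",       r"[-+*/=.]"),
    ("PUNCT",    r"[();,]"),
    ("SKIP",     r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Yield (kind, text) pairs; whitespace is dropped."""
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        pos = match.end()
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()


tokens = list(tokenize("$total = $price * 2; print('ok');"))
```

The resulting `(kind, text)` pairs are what you would feed to the generated parser, one token at a time.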
You're lucky with PHP and Python because in many respects they are the same language as each other, but with different syntax. The hard part is getting over the semantic differences between the grammar forms and data structures. For example, Python has lists and dictionaries, while PHP only has assoc arrays.
The "learner" approach is to build something that works OK for a restricted subset of the language (such as only print statements, simple math, and variable assignment), and then progressively remove limitations. That's basically what the "big" guys in the field all did.
Oh, and since you don't have static types in Python, it might be best to write and rely on PHP functions like "python_add" which adds numbers, strings, or objects according to the way Python does it.
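As a rough sketch of that idea, a translator can emit a call to the runtime helper whenever it sees `+`, because it cannot know the operand types at translation time. Only the name `python_add` comes from the paragraph above; the `PhpExpr` class and `to_php` function are made up for this example, and the sketch leans on Python's own `ast` module as the front end and covers only names, numbers, strings and `+`.

```python
import ast


class PhpExpr(ast.NodeVisitor):
    """Emit PHP source text for a tiny subset of Python expressions."""

    def visit_Expression(self, node):
        return self.visit(node.body)

    def visit_BinOp(self, node):
        left, right = self.visit(node.left), self.visit(node.right)
        if isinstance(node.op, ast.Add):
            # Operand types are unknown here, so defer to the runtime helper.
            return f"python_add({left}, {right})"
        raise NotImplementedError(type(node.op).__name__)

    def visit_Name(self, node):
        return "$" + node.id

    def visit_Constant(self, node):
        if isinstance(node.value, str):
            escaped = node.value.replace("\\", "\\\\").replace("'", "\\'")
            return f"'{escaped}'"
        if type(node.value) in (int, float):
            return repr(node.value)
        raise NotImplementedError(repr(node.value))

    def generic_visit(self, node):
        # Anything outside the supported subset is reported, not guessed at.
        raise NotImplementedError(type(node).__name__)


def to_php(expr_source):
    return PhpExpr().visit(ast.parse(expr_source, mode="eval"))


php = to_php("a + b + 1")
```

The PHP side would then need a `python_add` function that checks its argument types at runtime and concatenates, adds, or dispatches to an object method accordingly.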
Obviously, this can get much bigger if you let it.
I am setting out to do a side project that has the goal of translating code from one programming language to another. The languages I am starting with are PHP and Python (Python to PHP should be easier to start with), but ideally I would be able to add other languages with (relative) ease. The plan is:
This is geared towards web development. The original and target code will be sitting on top of frameworks (which I will also have to write). These frameworks will embrace an MVC design pattern and follow strict coding conventions. This should make translation somewhat easier.
I am also looking at IOC and dependency injection, as they might make the translation process easier and less error prone.
From then on I can build the AST, symbol tables and control flow.
Then I believe I can start outputting code. I don't need a perfect translation. I'll still have to review the generated code and fix problems. Ideally the translator should flag problematic translations.
Before you ask "What the hell is the point of this?" The answer is... It'll be an interesting learning experience. If you have any insights on how to make this less daunting, please let me know.
I am more interested in knowing what kinds of patterns I could enforce on the code to make it easier to translate (ie: IoC, SOA ?) the code than how to do the translation.
Writing a translator isn't impossible, especially considering that Joel's Intern did it over a summer.
If you want to do one language, it's easy. If you want to do more, it's a little more difficult, but not too much. The hardest part is that, while any Turing-complete language can do what another Turing-complete language does, built-in data types can change what a language does phenomenally. For instance:
    word = 'This is not a word'
    print word[::-2]
takes a lot of C++ code to duplicate (ok, well you can do it fairly short with some looping constructs, but still).
That's a bit of an aside, I guess.
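For what it's worth, here is roughly what a translator would have to generate for that slice in a target language without extended slicing. It is only a sketch; the helper name `reverse_with_step` is invented for the example, and it is written in Python 3 so it can be checked against the built-in slice.

```python
def reverse_with_step(text, step):
    """Emulate text[::-step] for step > 0 using an explicit index loop."""
    out = []
    index = len(text) - 1
    while index >= 0:
        out.append(text[index])
        index -= step
    return "".join(out)


word = 'This is not a word'
looped = reverse_with_step(word, 2)
```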
Have you ever written a tokenizer/parser based on a language grammar? You'll probably want to learn how to do that if you haven't, because that's the main part of this project. What I would do is come up with a basic Turing complete syntax - something fairly similar to Python bytecode. Then you create a lexer/parser that takes a language grammar (perhaps using BNF), and based on the grammar, compiles the language into your intermediate language. Then what you'll want to do is do the reverse - create a parser from your language into target languages based on the grammar.
The most obvious problem I see is that at first you'll probably create horribly inefficient code, especially in more powerful* languages like Python.
But if you do it this way then you'll probably be able to figure out ways to optimize the output as you go along. To summarize:
- read provided grammar
- compile program into intermediate (but also Turing complete) syntax
- compile intermediate program into final language (based on provided grammar)
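Here is a minimal sketch of that pipeline for arithmetic expressions only. It cheats on the first step: instead of reading a grammar, it assumes Python's `ast` module as the front end. The instruction names (`LOAD`, `CONST`, `ADD`, ...) and both function names are made up for illustration.

```python
import ast

OPS = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL"}
SYMBOLS = {"ADD": "+", "SUB": "-", "MUL": "*"}


def to_ir(node):
    """Front end: flatten an expression tree into stack-machine instructions."""
    if isinstance(node, ast.Expression):
        return to_ir(node.body)
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return to_ir(node.left) + to_ir(node.right) + [(OPS[type(node.op)],)]
    if isinstance(node, ast.Name):
        return [("LOAD", node.id)]
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return [("CONST", node.value)]
    raise NotImplementedError(type(node).__name__)


def ir_to_php(ir):
    """Back end: replay the instructions on a stack of PHP source fragments."""
    stack = []
    for instr in ir:
        if instr[0] == "LOAD":
            stack.append("$" + instr[1])
        elif instr[0] == "CONST":
            stack.append(repr(instr[1]))
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(f"({left} {SYMBOLS[instr[0]]} {right})")
    (result,) = stack  # a well-formed program leaves exactly one value
    return result


ir = to_ir(ast.parse("price * qty + 2", mode="eval"))
php = ir_to_php(ir)
```

The point of the split is that adding another target language only means writing another back end over the same instruction list.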
*by powerful I mean that this takes 4 lines:
    myinput = raw_input("Enter something: ")
    print myinput.replace('a', 'A')
    print sum(ord(c) for c in myinput)
    print myinput[::-1]
Show me another language that can do something like that in 4 lines, and I'll show you a language that's as powerful as Python.
Here is everything I tried, from the beginning to the best solution. Even if it looks like Pythonium marketing, it really isn't (don't hesitate to tell me if something doesn't seem to fit the netiquette):
Still, I think it's possible to do API->API (or framework->framework) translation, and that's basically what I do in Pythonium but at a lower level. Probably Pyjamas uses the same algorithm as Pythonium...
- function with full parameters semantic both in definition and calling. This is the part I am most proud of.
- "var" declarations are automatically handled by the translator (a very nice finding from Brett, a PythonJS contributor).
- global keyword
- list comprehensions
- imports are supported via requirejs
- single class inheritance + mixin via classyjs
The generated JS is perfect, i.e. there is no overhead: it cannot be improved in terms of performance by further editing it. If you can improve the generated code, you can do it from the Python source file too. Also, the compiler did not rely on any JS tricks that you can find in .js written by http://superherojs.com/, so it's very readable.
The direct descendant of this part of PythonJS is the Pythonium Veloce mode. The full implementation can be found @ https://bitbucket.org/amirouche/pythonium/src/33898da731ee2d768ced392f1c369afd746c25d7/pythonium/veloce/veloce.py?at=master 793 SLOC + around 100 SLOC of shared code with the other translator.
An adapted version of pystones.py can be translated in Veloce mode cf. https://bitbucket.org/amirouche/pythonium/src/33898da731ee2d768ced392f1c369afd746c25d7/pystone/?at=master
spam.egg in Python is always translated to getattribute(spam, "egg"). I did not profile this in particular, but I think that is where it loses a lot of time, and I'm not sure I can improve upon it with asm.js or anything else.
- method resolution order: even with the algorithm written in Python, translating it to Python Veloce compatible code was a big endeavour.
- getattribute: the actual getattribute resolution algorithm is kind of tricky and it still doesn't support data descriptors
- metaclass class based: I know where to plug the code, but still...
- last but not least: some_callable(...) is always translated to "call(some_callable)". AFAIK the translator doesn't use inference at all, so every time you do a call you need to check which kind of object it is in order to call it the way it's meant to be called.
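To illustrate the kind of rewriting described in this list, here is a small sketch that does it as a Python AST to Python AST pass. This is not Pythonium's actual code. The `CompliantRewriter` class is invented for the example, passing the call arguments after the callable is my assumption about how `call` would receive them, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


class CompliantRewriter(ast.NodeTransformer):
    """Route attribute reads and calls through runtime helpers."""

    def visit_Attribute(self, node):
        self.generic_visit(node)
        if not isinstance(node.ctx, ast.Load):
            return node  # leave assignments and deletions alone in this sketch
        return ast.Call(
            func=ast.Name(id="getattribute", ctx=ast.Load()),
            args=[node.value, ast.Constant(value=node.attr)],
            keywords=[],
        )

    def visit_Call(self, node):
        self.generic_visit(node)
        return ast.Call(
            func=ast.Name(id="call", ctx=ast.Load()),
            args=[node.func] + node.args,
            keywords=node.keywords,
        )


def rewrite(source):
    tree = CompliantRewriter().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


rewritten = rewrite("spam.egg(1)")
```

Every attribute read and every call now pays for a helper invocation at runtime, which is the overhead the list above is talking about.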
This part is factored in https://bitbucket.org/amirouche/pythonium/src/33898da731ee2d768ced392f1c369afd746c25d7/pythonium/compliant/runtime.py?at=master It's written in Python compatible with Python Veloce.
Doing Python AST to Python AST would, in my case at least, maybe be a performance improvement, since I sometimes inspect the content of a block before generating the code associated with it, for instance:
- var/global: to be able to var something I must know what I need to var and what not. Instead of generating a block, tracking which variables are created in it, and inserting the declarations on top of the generated function block, I just look for relevant variable assignments when I enter the block, before actually visiting the child nodes to generate the associated code.
- yield, generators have, as of yet, a special syntax in JS, so I need to know which Python function is a generator when I want to write the "var my_generator = function"
So I don't really visit each node once for each phase of the translation.
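The look-ahead described above can be sketched with the standard `ast` module. This is an illustration of the idea, not the Pythonium implementation. The `scan_function` name is made up, and the sketch simply skips nested functions, classes, lambdas and comprehensions because they introduce their own scopes.

```python
import ast

# Node types that open a new scope; their contents belong to someone else.
NEW_SCOPE = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef, ast.Lambda,
             ast.ListComp, ast.SetComp, ast.DictComp, ast.GeneratorExp)


def scan_function(func):
    """Return (names needing a `var`, is_generator) before generating code."""
    declared_global, assigned, is_generator = set(), set(), False
    pending = list(func.body)
    while pending:
        node = pending.pop()
        if isinstance(node, NEW_SCOPE):
            continue
        if isinstance(node, ast.Global):
            declared_global.update(node.names)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            assigned.add(node.id)
        elif isinstance(node, (ast.Yield, ast.YieldFrom)):
            is_generator = True
        pending.extend(ast.iter_child_nodes(node))
    return sorted(assigned - declared_global), is_generator


source = (
    "def f(n):\n"
    "    global total\n"
    "    total = 0\n"
    "    for i in range(n):\n"
    "        x = i * 2\n"
    "        yield x\n"
)
local_vars, is_generator = scan_function(ast.parse(source).body[0])
```

With that information in hand, the code generator can choose the right function header and emit the `var` line before it visits the body.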
The overall process can be described as:
Python builtins are written in Python code (!). IIRC there are a few restrictions related to bootstrapping types, but you have access to everything that Pythonium can translate in compliant mode. Have a look at https://bitbucket.org/amirouche/pythonium/src/33898da731ee2d768ced392f1c369afd746c25d7/pythonium/compliant/builtins/?at=master
JS code generated by Pythonium in compliant mode can be read and understood, but source maps will greatly help.
The advice I can give you in the light of this experience is the kind old farts give:
- extensively review the subject both in literature and existing projects closed source or free. When I reviewed the different existing projects I should have given it way more time and motivation.
- failure is experience
- a small step is a step
- start small
- dream big
- do demos
About Ira Baxter's response:
The estimates are not helpful at all. It took me more or less 6 months of free time for both PythonJS and Pythonium, so I can expect more from 6 months full time. I think we all know what 100 man-years in an enterprise context can mean and not mean at all...
When someone says something is hard, or more often impossible, I answer that "it only takes time to find a solution for a problem that is impossible". In other words, nothing is impossible unless it is proven impossible, in this case by a math proof...
If it's not proven impossible then it leaves room for imagination:
- finding a proof proving it's impossible
- If it is impossible there may be an "inferior" problem that can have a solution.
- if it's not impossible, finding a solution
Most people who say that a thing is "hard" or "impossible" don't provide the reasons. C++ is hard to parse? I know that, yet there are (free) C++ parsers. Evil is in the detail? I know that. Saying it's impossible alone is not helpful. It's even worse than "not helpful": it's discouraging, and some people mean to discourage others. I heard about this question via https://.com/questions/22621164/how-to-automatically-generate-a-parser-code-to-code-translator-from-a-corpus.
What would be perfection for you? That's how you define next goal and maybe reach the overall goal.
I am more interested in knowing what kinds of patterns I could enforce on the code to make it easier to translate (ie: IoC, SOA ?) the code than how to do the translation.
I see no patterns that cannot be translated from one language to another, at least in a less than perfect way. Since language-to-language translation is possible, you'd better aim for this first. I think, according to http://en.wikipedia.org/wiki/Graph_isomorphism_problem, that translation between two computer languages is a tree or DAG isomorphism. Even if we already know that they are both Turing complete, so...
Framework->Framework translation, which I find easier to picture as API->API translation, is still something you might keep in mind as a way to improve the generated code. E.g.: Prolog has a very specific syntax, but you can still do Prolog-like computation by describing the same graph in Python... If I were to implement a Prolog to Python translator, I wouldn't implement unification in Python but in a C library, and I would come up with a "Python syntax" that is very readable for a Pythonist. In the end, syntax is only "painting" to which we give a meaning (that's why I started Scheme). Evil is in the detail of the language, and I'm not talking about the syntax. Concepts used in the language, like the getattribute hook, you can live without, but required VM features like tail-recursion optimisation can be difficult to deal with. You don't care if the initial program doesn't use tail recursion, and even if there is no tail recursion in the target language you can emulate it using greenlets or an event loop.
For target and source languages, look for:
- Big and specific ideas
- Tiny and common shared ideas
From this will emerge:
- Things that are easy to translate
- Things that are difficult to translate
You will also probably be able to know what will be translated to fast and slow code.
There is also the question of the stdlib or any library, but there is no clear answer; it depends on your goals.
Idiomatic code or readable generated code also have solutions...
Targeting a platform like PHP is much easier than targeting browsers, since you can provide a C implementation of slow and/or critical paths.
You can also have a look at those libraries:
Also, you might be interested in this blog post (and comments): https://www.rfk.id.au/blog/entry/pypy-js-poc-jit/
- This Google Tech Talk from Ira Baxter is interesting https://www.youtube.com/watch?v=C-_dw9iEzhA
My answer will address the specific task of parsing Python in order to translate it to another language, and not the higher-level aspects which Ira addressed well in his answer.
In short: do not use the parser module, there's an easier way.
The ast module, available since Python 2.6, is much more suitable for your needs, since it gives you a ready-made AST to work with. I've written an article on this last year, but in short, use the parse method of ast to parse Python source code into an AST. The parser module will give you a parse tree, not an AST. Be wary of the difference.
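As a quick illustration of what `ast.parse` hands you, here is a minimal sketch (written for Python 3; the sample statement is arbitrary). Note that the old `parser` module no longer exists in recent Python versions, so `ast` is the way to go in any case.

```python
import ast

tree = ast.parse("total = price * 2")
assign = tree.body[0]  # the module body is a list of statement nodes

# ast.dump gives a readable view of the node and its children.
print(ast.dump(assign))

# Nodes are plain objects, so a translator can inspect them directly.
target_name = assign.targets[0].id
operator_kind = type(assign.value.op).__name__
```

From here, an `ast.NodeVisitor` subclass with one `visit_*` method per node type is the usual way to walk the tree and emit target code.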
Now, since Python's ASTs are quite detailed, given an AST the front-end job isn't terribly hard. I suppose you can have a simple prototype for some parts of the functionality ready quite quickly. However, getting to a complete solution will take more time, mainly because the semantics of the languages are different. A simple subset of the language (functions, basic types and so on) can be readily translated, but once you get into the more complex layers, you'll need heavy machinery to emulate one language's core in another. For example consider Python's generators and list comprehensions which don't exist in PHP (to my best knowledge, which is admittedly poor when PHP is involved).
To give you one final tip, consider the 2to3 tool created by the Python devs to translate Python 2 code to Python 3 code. Front-end-wise, it has most of the elements you need to translate Python to something. However, since the cores of Python 2 and 3 are similar, no emulation machinery is required there.
I've been building tools (DMS Software Reengineering Toolkit) to do general purpose program manipulation (with language translation being a special case) since 1995, supported by a strong team of computer scientists. DMS provides generic parsing, AST building, symbol tables, control and data flow analysis, application of translation rules, regeneration of source text with comments, etc., all parameterized by explicit definitions of computer languages.
The amount of machinery you need to do this well is vast (especially if you want to be able to do this for multiple languages in a general way), and then you need reliable parsers for languages with unreliable definitions (PHP is perfect example of this).
There's nothing wrong with you thinking about building a language-to-language translator or attempting it, but I think you'll find this a much bigger task for real languages than you expect. We have some 100 man-years invested in just DMS, and another 6-12 months in each "reliable" language definition (including the one we painfully built for PHP), much more for nasty languages such as C++. It will be a "hell of a learning experience"; it has been for us. (You might find the technical Papers section at the above website interesting to jump start that learning).
People often attempt to build some kind of generalized machinery by starting with some piece of technology with which they are familiar, that does a part of the job. (Python ASTs are great example). The good news, is that part of the job is done. The bad news is that machinery has a zillion assumptions built into it, most of which you won't discover until you try to wrestle it into doing something else. At that point you find out the machinery is wired to do what it originally does, and will really, really resist your attempt to make it do something else. (I suspect trying to get the Python AST to model PHP is going to be a lot of fun).
The reason I started to build DMS originally was to build foundations that had very few such assumptions built in. It has some that give us headaches. So far, no black holes. (The hardest part of my job over the last 15 years is to try to prevent such assumptions from creeping in).
Lots of folks also make the mistake of assuming that if they can parse (and perhaps get an AST), they are well on the way to doing something complicated. One of the hard lessons is that you need symbol tables and flow analysis to do good program analysis or transformation. ASTs are necessary but not sufficient. This is the reason that Aho&Ullman's compiler book doesn't stop at chapter 2. (The OP has this right in that he is planning to build additional machinery beyond the AST). For more on this topic, see Life After Parsing.
The remark about "I don't need a perfect translation" is troublesome. What weak translators do is convert the "easy" 80% of the code, leaving the hard 20% to do by hand. If the application you intend to convert is pretty small, and you only intend to convert it once, well, then that 20% is OK. If you want to convert many applications (or even the same one with minor changes over time), this is not nice. If you attempt to convert 100K SLOC then 20% is 20,000 original lines of code that are hard to translate, understand and modify in the context of another 80,000 lines of translated program you already don't understand. That takes a huge amount of effort. At the million line level, this is simply impossible in practice. (Amazingly there are people that distrust automated tools and insist on translating million line systems by hand; that's even harder and they normally find out painfully with long time delays, high costs and often outright failure.)
What you have to shoot for to translate large-scale systems is high nineties percentage conversion rates, or it is likely that you can't complete the manual part of the translation activity.
Another key consideration is size of code to be translated. It takes a lot of energy to build a working, robust translator, even with good tools. While it seems sexy and cool to build a translator instead of simply doing a manual conversion, for small code bases (e.g., up to about 100K SLOC in our experience) the economics simply don't justify it. Nobody likes this answer, but if you really have to translate just 10K SLOC of code, you are probably better off just biting the bullet and doing it. And yes, that's painful.
I consider our tools to be extremely good (but then, I'm pretty biased). And it is still very hard to build a good translator; it takes us about 1.5-2 man-years and we know how to use our tools. The difference is that with this much machinery, we succeed considerably more often than we fail.