Friday, November 30, 2007

04. "Postfix" instead of "Infix"

So, now in this chapter we are going to transform the output from the infix notation to the postfix one. Further we will replace the sign characters '+' and '-' with the strings "pos" and "neg" (to avoid confusion with the arithmetic operators).

In the postfix notation parenthesis are not needed, so we have to disable them globally by applying the following changes in "Printer.H":
#define _L_PAREN "" // instead of '('
#define _R_PAREN "" // instead of ')'
Those two lines will get rid of most of the undesired parenthesis. Now, replace the PrintAbsyn::visitTerm(..) and PrintAbsyn::visitFact(..) functions in "Printer.C" with the following code:
void PrintAbsyn::visitTerm(Term* p) {

int oldi = _i_;
if (oldi > 0) render(_L_PAREN);

_i_ = 0; p->exp_1->accept(this); //
_i_ = 1; p->exp_2->accept(this); // order changed from in2po
_i_ = 0; p->termop_->accept(this); //

if (oldi > 0) render(_R_PAREN);
_i_ = oldi;
}

void PrintAbsyn::visitFact(Fact* p) {

int oldi = _i_;
if (oldi > 1) render(_L_PAREN);

_i_ = 1; p->exp_1->accept(this); //
_i_ = 2; p->exp_2->accept(this); // order changed from in2po
_i_ = 0; p->factop_->accept(this); //

if (oldi > 1) render(_R_PAREN);
_i_ = oldi;
}
As you can see, we have just changed the order from the infix notation (operand-operator-operand) to the postfix notation (operand-operand-operator). Compile the changes and make a test, to see how they are going to affect to output:

$ make && ./testformula formula.inp

» Parse Succesful!
»
» [Abstract Syntax]
» (Program [(Formula [(Fact (Term (Var "a") [AddOp] (Var "b")) [MulOp] (Func "sin" [(NVar "c")] ))])])
»
» [Linearized Tree]
» a b + sin (- c) *

Wow! That was a quick start; but what still needs to be done is to render functions also in the postfix order (and without parenthesis). To do that, please make in PrintAbsyn::visitFunc(..) the following changes:
void PrintAbsyn::visitFunc(Func* p) {

int oldi = _i_;
if (oldi > 2) render(_L_PAREN);

//render('('); // do not render parameter list parenthesis
if(p->listexp_) {_i_ = 0; p->listexp_->accept(this);}
//render(')'); // do not render parameter list parenthesis
visitIdent(p->ident_); // order changed from in2po

if (oldi > 2) render(_R_PAREN);
_i_ = oldi;
}
Now since we have also positively and negatively signed functions, we have to apply similar changes to PrintAbsyn::visitPFunc(..) and PrintAbsyn::visitNFunc(..):
void PrintAbsyn::visitPFunc(PFunc* p) {

int oldi = _i_;
if (oldi > 2) render(_L_PAREN);

//render('('); // do not render parameter list parenthesis
if(p->listexp_) {_i_ = 0; p->listexp_->accept(this);}
//render(')'); // do not render parameter list parenthesis
visitIdent(p->ident_); // order changed from in2po

render("pos"); // in2po & '+' replaced with "pos"

if (oldi > 2) render(_R_PAREN);
_i_ = oldi;
}

void PrintAbsyn::visitNFunc(NFunc* p) {

int oldi = _i_;
//if (oldi > 2) render(_L_PAREN);

//render('('); // do not render parameter list parenthesis
if(p->listexp_) {_i_ = 0; p->listexp_->accept(this);}
//render(')'); // do not render parameter parenthesis
visitIdent(p->ident_); // order changed from in2po

render("neg"); // in2po & '-' replaced with "neg"

//if (oldi > 2) render(_R_PAREN);
_i_ = oldi;
}
As you have noticed, we replaced also the sign characters '-' and '+' with the strings "neg" and "pos". You could argue, that it is not strictly necessary to put the "pos" string, since this is mathematically superfluous. Yes, you are right! But I have kept it, for the sake of symmetry; if you prefer you can out-comment the rendering of the positive signs.

Well, now again compile and test with our example input:

$ make && ./testformula formula.inp

» Parse Succesful!
»
» [Abstract Syntax]
» (Program [(Formula [(Fact (Term (Var "a") [AddOp] (Var "b")) [MulOp] (Func "sin" [(NVar "c")] ))])])
»
» [Linearized Tree]
» a b + - c sin *

Cool! Now, we have even transformed the functions from the infix to the postfix notation; but there is still a slight problem: Where do we know from, how many parameters the function e.g. "sin" is going to take? Since we do not render the parenthesis anymore we cannot tell from the postfix notation the number of parameters for an arbitrary function!

So we have to cheat; we just assume that we know it, period. Well in the case of the sine it is easy, since it is a standard mathematical function which we know from high school of having just one parameter. But in other cases like for the function "f(..)" we have to assume away the problem.

I could offer you a simple solution by attaching immediately after the function`s name the number of parameters it takes, e.g. "f(a,b)" would translate to "a b f#2". Since this solution is really simple, you are free to implement this yourself ;).

Because our input example is insufficient, we have missed another detail; change it to "(a+b) * sin(-c,+d)" (in formula.inp) and run a test:

$ ./testformula formula.inp

» Parse Succesful!
»
» [Abstract Syntax]
» (Program [(Formula [(Fact (Term (Var "a") [AddOp] (Var "b")) [MulOp] (Func "sin" [(NVar "c"), (PVar "d")] ))])])
»
» [Linearized Tree]
» a b + - c, + d sin *

Well ... the comma after the 'c' should not really be there! To suppress it apply the following changes to PrintAbsyn::visitListExp(..) (it renders the parameter list of functions):
void PrintAbsyn::visitListExp(ListExp *listexp) {
while(listexp!= 0) {
if (listexp->listexp_ == 0) {
listexp->exp_->accept(this);
listexp = 0;
} else {
listexp->exp_->accept(this);
//render(','); // do not render the comma
listexp = listexp->listexp_;
}
}
}
Compile and make the test and you will discover that the comma is gone! Here is the output:

$ ./testformula formula.inp

» Parse Succesful!
»
» [Abstract Syntax]
» (Program [(Formula [(Fact (Term (Var "a") [AddOp] (Var "b")) [MulOp] (Func
» "sin" [(NVar "c"), (PVar "d")] ))])])
»
» [Linearized Tree]
» a b + - c + d sin *

Mmh, but we have still work to do, since what we really want is "a b + c neg d pos sin *". So again we need to replace the '-'/'+' signs with "neg"/"pos" strings; here are the changes for PrintAbsyn::visitPVar(..) and PrintAbsyn::visitNVar(..):
void PrintAbsyn::visitPVar(PVar* p) {

int oldi = _i_;
if (oldi > 2) render(_L_PAREN);

visitIdent(p->ident_);
render("pos");

if (oldi > 2) render(_R_PAREN);
_i_ = oldi;
}

void PrintAbsyn::visitNVar(NVar* p) {

int oldi = _i_;
if (oldi > 2) render(_L_PAREN);

visitIdent(p->ident_);
render("neg");

if (oldi > 2) render(_R_PAREN);
_i_ = oldi;
}
It is always the same game: Disable any eventual parenthesis rendering (if not already disable globally) and replace the signs by strings! That`s it! So compile and make the test:

$ make && ./testformula formula.inp

» Parse Succesful!
»
» [Abstract Syntax]
» (Program [(Formula [(Fact (Term (Var "a") [AddOp] (Var "b")) [MulOp] (Func "sin" [(NVar "c"), (PVar "d")] ))])])
»
» [Linearized Tree]
» a b + c neg d pos sin *

Now that looks fine! Well remember, we had positively and negatively signed version of the natural and real numbers, variables, functions and POEs. Together, we have applied the changes for the variables and functions, but you have still to change the implementation for the numbers and POEs.

Since we have already gone twice through such similar changes, I`m not showing you all necessary details anymore, but I will tell you where you have to apply them; you need to adapt the following functions:
  • PrintAbsyn::visitPNat(..)
  • PrintAbsyn::visitNNat(..)

  • PrintAbsyn::visitPReal(..)
  • PrintAbsyn::visitNReal(..)

  • PrintAbsyn::visitPOE(..)
  • PrintAbsyn::visitPos(..)
  • PrintAbsyn::visitNeg(..)
To avoid numbers to get rendered without a space in between we have also to apply the following changes:
void PrintAbsyn::visitInteger(Integer i) {

char tmp[16];
sprintf(tmp, "%d ", i); // space after %d appended
bufAppend(tmp);
}

void PrintAbsyn::visitDouble(Double d) {

char tmp[16];
sprintf(tmp, "%g ", d); // space after %g appended
bufAppend(tmp);
}
This should do it! Test with a few input examples your adaptions to find out if you have made everything correct! ;D In the next chapter we will talk about how the preprocessor has been implemented (to get rid of multiple signs).

03. The "BNF Converter" Tool

To use the words of M. Forsberg and A. Ranta from [2]:
We will demonstrate BNFC (the BNF Converter), a multi-lingual compiler tool. BNFC takes as its input a grammar written in LBNF (Labelled BNF) notation, and generates a compiler front-end (an abstract syntax, a lexer, and a parser). Furthermore, it generates a case skeleton usable as the starting point of back-end construction, a pretty printer, a test bench, and a Latex document usable as language specification.
Now, the real fun starts: Prepare an empty directory, where you should copy the "formula.bnf" grammar into. Further ensure that the BNF Converter tool is installed on your Unix-like system; see [5] for further details.

Well, at first you have to rename the file "formula.bnf" to "formula.cf" since the BNF Converter seems to work only with *.cf files. I prefer it to be suffixed as *.bnf since it is more descriptive.

$ mv formula.bnf formula.cf

Now with the following command, you will create the lexer, parser, printer and skeleton in the C++ language for our arithmetic expressions:

$ bnfc -m -cpp formula.cf

The "-m" options ensures that a make file is created; after the command the following files should be in your directory:

$ ls

» Absyn.C formula.cf formula.tex Makefile Printer.C Skeleton.C Test.C Absyn.H formula.l formula.y Parser.H Printer.H Skeleton.H

In the "Absyn.*" files the AST classes are given; "formula.l" is the required by the lexer generator; similarly, "formula.y" is required by the parser generator. The "Printer.*" files are responsible to traverse the AST, whereas "Test.C" encapsulates a main function and handles file input. At the moment, we do not need the "Skeleton.*" files, but leave them in place.

Now, please compile the sources with the "make" utility:

$ make

To make this step work flawlessly please ensure, that the following tools are installed correctly: g++, flex, bison, latex and dvips. The latter two are not strictly necessary, but it is imperative to have the "g++" compiler, the "flex" lexical analyzer generator and the "bison" parser generator tools to have installed correctly!

Now, your directory should look like this:

$ ls

» Absyn.C formula.cf formula.ps Lexer.o Parser.o Skeleton.C Test.o Absyn.H formula.dvi formula.tex Makefile Printer.C Skeleton.H Absyn.o formula.l formula.y Parser.C Printer.H Test.C formula.aux formula.log Lexer.C Parser.H Printer.o testformula

The executable "testformula" expects an input file with our arithmetic expressions in it; please prepare such a file, e.g.:

$ echo "(a+b) * sin(-c)" > formula.inp

Now, feed to the executable the input file containing our test formula; if everything goes well, you should get the following output:

$ ./testformula

» Parse Succesful!
»
» [Abstract Syntax]
» (Program [(Formula [(Fact (Term (Var "a") [AddOp] (Var "b")) [MulOp] (Func "sin" [(NVar "c")] ))])])
»
» [Linearized Tree]
» (a + b) * sin (- c)

Bravo! You have created a parser which can interpret arithmetic expressions correctly and which produces the appropriate abstract syntax tree; further this tree is traversed such that a "linearized" version of it is printed, which produces the original input.

Now, it is up to us (you!) to make the necessary modifications in "Printer.C" such that instead of the "linearized" AST a "postfixed" AST is produced. By doing this, we will achieve our primary goal, of creating a first rough version of our infix to postfix converter.

02. The "labelled BNF" Formalism

Since there is plenty of information related to the "labelled BNF" formalism, I would like to refer to [1]: It explains what it is about and how you can use it; since it is only 13 pages long it does not take much time to work through. But if you prefer not to, it is also fine - provided you know what BNF is about! Basically, "labelled BNF" grammars are the same as "BNF" grammars with the exception that every rule has got a label attached. I cite the following short information from [1]:
... the grammar formalism Labelled BNF (LBNF), which is used in the compiler construction tool BNF Converter. Given a grammar written in LBNF, the BNF Converter produces a complete compiler front end (up to, but excluding, type checking), i.e. a lexer, a parser, and an abstract syntax definition. Moreover, it produces a pretty-printer ...
So, it is about what we already know from the introduction; below you see the grammar for the arithmetic expressions:
--Begin LBNF file

Program. Program ::= [Formula] ;
Formula. Formula ::= Exp ;

Term. Exp ::= Exp TermOp Exp1 ;
Fact. Exp1 ::= Exp1 FactOp Exp2 ;

Nat. Exp2 ::= Integer ;
PNat. Exp2 ::= "+" Integer ;
NNat. Exp2 ::= "-" Integer ;

Real. Exp2 ::= Double ;
PReal. Exp2 ::= "+" Double ;
NReal. Exp2 ::= "-" Double ;

Var. Exp2 ::= Ident ;
PVar. Exp2 ::= "+" Ident ;
NVar. Exp2 ::= "-" Ident ;

Func. Exp2 ::= Ident "(" [Exp] ")" ;
PFunc. Exp2 ::= "+" Ident "(" [Exp] ")" ;
NFunc. Exp2 ::= "-" Ident "(" [Exp] ")" ;

POE. Exp2 ::= PreOp Exp ;
Pos. PreOp ::= "+" ;
Neg. PreOp ::= "-" ;

AddOp. TermOp ::= "+" ;
SubOp. TermOp ::= "-" ;

MulOp. FactOp ::= "*" ;
DivOp. FactOp ::= "/" ;
ModOp. FactOp ::= "%" ;

-- extra stuff

separator Formula ";" ;
separator Exp "," ;
coercions Exp 2 ;
comment "#" ;

entrypoints Program ;

--LBNF file ends here
As you see, the input is treated as a 'program' of formulas, where every formula is made of expressions. There are terms and factors, natural and real numbers, variables and functions. Every number, variable and function can have a positive or a negative sign pre-attached.

Further a "POE" is an expression with a sign in front; it was necessary to split the handling of signs between the primitive expressions (numbers, variables and functions) and the others, since otherwise the BNF Converter had difficulties to produce the desired parser for arithmetic expressions we look for. Please do not dismiss this rule, since we will have to deal with it later (for the 'unifier') quite a lot!

Then there are the operators and some extra stuff which I do not wish to discuss here. But if you are interested to learn more about those rules please have a look at [1] or [2].

I`m not sure if this is the most simple grammar to capture arithmetic expressions, but it did the job for me. But I think it might be possible to avoid the positive and negative versions of the primitive expressions: This would reduce the amount of code produced for the parser, the number of AST classes, and finally we would need to apply less changes to it, to get an infix to postfix converter.

01. Introduction to the "in2po" library's implementation

Hello! So you would like to learn, how I was able to write this small library which takes an arithmetic formula in the infix notation and transforms it into the postfix notation.

To give you a very rough overview I performed the following steps, to get what I got:
  1. Define the language of arithmetic expressions within the "labelled BNF" syntax. This is a general tool to define programming languages; it may look like an overkill to deal with arithmetic expressions by using concepts from the compiler construction theory, but it is not! That`s because the language of arithmetic expressions itself is not that easy as it looks like; we just get this impression since we`re so used to it. So to tackle such expressions we have to borrow from the concepts of compiler construction.
  2. Now, apply the "BNF Converter" tool to this syntax; it will produce a lexer and a parser for the language we seek for. Along with the parser the classes to model the "abstract syntax tree" (AST) are also produced.
  3. Since we have got the parser and the AST classes, we are now able to traverse the AST. This is done in the 'printer', which has also been produced by the "BNF Converter" tool. The standard printer produces a so called 'linearized view' of the AST: i.e. if we input an arithmetic expression in the infix notation we get the same expression again in the infix notation back. To change the output to the postfix notation, we just need to modify a few lines in the printer`s implementation (just change the visit order of the AST). Voila!
You may still ask yourself why it is necessary to go through such an hassle, just to produce an arithmetic expression in the postfix order; cannot we just use one of the stack based algorithms (e.g. the "Shunting yard algorithm" from Dijkstra) to solve this 'simple' problem?

Well, we could ... but we have to fulfill other constraints as well, which are probably harder to implement around a stack based algorithm: While transforming from the one notation to the other, we have to ensure things like:
  • The stack size of the postfix expressions should stay rather minimal; e.g. we do not wish expressions like "a b c + *" but we prefer "b c + a *".
  • Double, triple negations and the like should be handled with; well I cheated a little bit to solve this requirement. Instead of providing a solution which would work on the AST, I just run a preprocessor on the input which eliminates such nasty things.
  • Constant sub-expression should be evaluated, since we wish our output to be as simple as possible. Well, the actual implementation does not do everything humanly possible and there is still room to improve this part, but more on that later in this how to.
  • Function expression should be dealt with; this means that things like the expression "f(sin(a) + cos(b))" should be translated into "a sin b cos + f".
Well, you could still argue that a stack based implementation could solve those requirements; probably. But, with the actual approach by treating arithmetic expressions as a tiny programming language we gain generality; if you wish you could easily include further additions like a special power symbol "^" to deal with e.g. exponentiation. And as you will discover, it is really easy to adapt the labelled BNF syntax and to work through the mentioned steps one to three.

So, this was the introductory part; in the following, we will look together on the syntax of our arithmetic expressions; then I will explain you how to use the BNF Converter on this syntax and finally show you which changes you have to do to the printer to get outputs in the postfix notation.

This will be the minimal part; later on we will look at how the mentioned constraints have been implemented. To understand those parts we need to have a look at the "preprocessor", "unifier", "optimizer" ("simplifier") and last but not least the "evaluator'.