Compiler Designing W05L0902

Scope

  • Scope of an identifier is the span of the program in which the identifier is accessible.
  • In this way same identifier name can actually be different identifiers depending on their scope.
  • Different scope of an identifier can not overlap.
  • Static scoping will restrict the scope of a variable to its block while in dynamic scoping it depends on the most recent declaration.
  • Static Scoping languages: C/C++, COOL; Dynamic Scoping languages: LISP, SNOBOL
  • Lets take an example from stackoverflow on dynamic and static scoping.
int x;

     int main()
     {
           x = 14;
           f(); 
           g();
  }
     void f()
     {
      int x = 13;
             h();
  }
  void g()
  {
      int x = 12;
             h();
  }
  void h() 
    {
              printf("%d\n",x);  
  }

If Static scope is applied the 14 would be printed twice while in the case of dynamic scoping 13 and 12 would be printed

Compiler Designing W0207

Finite Automata I

  • Finite automata is an implementation model fr regex which is a specification language for Lexical Analysis.
  • A finite automata consists of:
∑: set of input alphabets

S: set of states

n: start state

F ⊆ S: accepting state(s)

transition function: state1—>(input) state2

  • L(FA) = set of accepted strings
  • In FA, string is matched when at the end if the input, pointer for the states is on the one of the accepting state(s), otherwise like in case if pointer is stuck or at the end the pointer is not on any accepting state(s) then matching fails.
  • Notations are as follows:
  • STATE -> CIRCLE
  • START STATE -> CIRCLE having edge with NO SOURCE
  • ACCEPTING STATE -> TWO CONCENTRIC CIRCLES
  • TRANSITION -> EDGE CONNECTING TWO STATES

Compiler Designing W0206

Lexical Specifications II

Some equivalences of regex:

  • At least one A: AA* ≡ A+
  • Union: A ∪ B ≡ A | B
  • Optional A: A + ∈ ≡ A?
  • Range: ‘a’ + ‘b’ + … ‘z’ ≡ [a-z] and similarly for other ranges
  • Excluded Range: complement of [a-z] ≡ [^a-z]

How Lexical Specification works:

  1. Write down the regexp for each syntactic unit like for keywords, open paren, identifiers etc.
  2. Construct Language R = keywords + identifiers + number …

REPEAT

inputString = x1 … xn, for some i ∈ [1,n] and x1 … xi ∈ R

then we know x1… xi ∈ Rj>

delete x1 … xi

UNTIL input is EMPTY

Questions:-

  • How big should be the input in one go?

Analyzer reads ‘c’ and stops there but the input was `char`. To overcome this, one of the tokenizer’s rules is called “maximal munch”, which says that the tokenizer should keep reading characters from the source file until adding one more character causes the current token to stop making sense.

  • Which token should be used if input matches to more than one token classes?

if is in keyword class as well as in identifier class (depending upon the regex). For this, priority ordering is done in the file storing the tokens. In this case keyword is listed before identifier.

  • If input doesn’t match?

All such kind of strings would go to Error String category listed in last of priority-wise ordered file of tokens.

Hence for no ambiguity, maximal munch priority ordering are used.(some tricky situations) and in the case of no match error is handled as written in the text.

Compiler Designing W0205

Lexical Specification I

Here lets see how to use regex for different token classes.

  • Keywords:- for example if, else, the regular expressions are 'i''f' and 'e''l''s''e'respectively, in shorthand, 'if'and 'else'.
  • Digits or Integers:- singleDigit=’0′ + ‘1’ + ‘2’ … + ‘9’ and multipleDigits = (singleDigit)((singleDigit)*, in shorthand multipleDigits = singleDigit+ (AA*=A+)
  • Identifier:- Starts with a letter followed by letter(s) or digits(s). in regex, letter=’a’ + ‘b’ + ‘c’ … + ‘z’ + ‘A’ + ‘B’ … + ‘Z’, shorthand for letter is [a-zA-Z]. Hence and identifier = (letter)(letter + digit)*
  • Whitespace:- space, tab and newline, whiteSpace = (‘ ‘ + ‘\n’ + ‘\t’)

An example from PASCAL for floating point number.

floatingPointNumber = digits + optionalFraction + optionalExponent
here, digits = digit+
optionalFraction = ( ‘.’ digits ) + ∈, in shorthand it is equal to ( ‘.’ digits)?
optionalExponent = ( (‘E + e’ )( ‘+’ + ‘-‘ + ‘∈’ ) digits ) + ∈, in shorthand it is equal to
( ( ‘E + e’ ) ( ‘+’ + ‘-‘ )? digits)?

  • regex can define many languages like email, phone numbers etc. and also defines language specifications but not implementations.

regex cheatsheet.

Compiler Designing W0204

Formal Language

  • Suppose there is a set of alphabets ∑, the formal language is a subset of ∑* defined by a formal grammar. In the simple words, take some alphabets define a rule to make a word from them, these set of words are called formal language.
  • There is a function called meaning, defined on alphabets to words such that L(∑)=Meaning. For example, L(ASCII)=C Language means that C-language is made of words which in turn made up of ASCII characters.
  • Regular language is a formal language that can be expressed in regular expression(pattern).
  • Why use meaning function:
    • It separates syntax and semantic as L(sytnax)=semantic
    • Various convertible notations can be used. For example, using Arabic and Roman numbers, they can be converted into each other, because they have the same Meaning. Hence, one can be converted into other depending upon the comfort for an operation.
    • Meaning function is Many-to-One, means different expressions can have same meaning. This is very important for optimization.

Compiler Designing W02L02 and 03

/*This is an unprepared blog, prepared till tomorrow’s night*/

Lecture 02

It contained theory of look-ahead in LA. I will complete this section till tomorrow’s night as the content in the video is not sufficient to write down here. (Why I am not reading it now? It’s 2AM!).

Regular Language

To be explored more...
  • Regular languages are formal languages those can be represented by REGEX(patterns).
  • There are two basic expressions, namely single character and epsilon, and three compound expressions, namely union; concatenation; and iteration.
    • Single character:- ‘c’={“c” | c ∈ ∑}
    • Epsilon is empty string language.
    • Union of two languages A and B is A ∪ B such that A+B={a+b | a∈A ∧ b∈B}
    • Concatenation of two languages A and B is AB={ab| a∈A ∧ b∈B}
    • Iteration of language is such that A*=∪(i ≥ 0) power(A, i)
  • Regular expressions are closely related to NFA and DFA. As both of these recognizes regular languages.
  • regex is fed into the machine in the form of NFAand then converted to DFA. The conversion examples can seen here and here.