Sunday, October 26, 2008

Experimenting with automatons in Ruby, part 1

A finite automaton is an abstract computing device, which more often than not is used to recognize formal languages. One can look at an automaton as a tape reader which consists of a initial state, a set of final states and a set of transitions between states. The automaton accepts its input if it has reached the end of the input tape and is in a final state.

For the purpose of language processing we will implement different kinds of automatons in Ruby, starting with the simplest one we can conceive. We'll start out with a deterministic finite-state automaton (DFSA). Such an automaton is called deterministic, since it in any given state only has one possible transition for a given input symbol.

First we need an example language to work with, so I've created the "Ruby-language". It can be described with the following regular expression: /Ru+by/ (yes, DFSA's correspond to the languages that can be described by (pure) regular expressions). I admit this is a quite boring language, but we have to start somewhere, right? The alphabet for this language is the set {R,u,b,y}.

To create an automaton for this language we need four states (visualize this as a directed graph):

  • q0: 'R' -> q1
  • q1: 'u' -> q2
  • q2: 'u' -> q2, 'b' -> q3
  • q3: 'y' -> q4
The initial state is q0 and the set of final states is {q4}. As you can see there is a cycle at q2 to allow for an (theoretically) infinite amount of 'u' characters. Also note that there is no ambiguity of which state is the next one at any given point for any given input. Hence, the automaton is deterministic.

On to the implementation considerations. One of the most important design decisions is how we should represent the state transitions. For a DFSA almost anything would work, but we want to choose something flexible. Let's try with a nested hash:


{
:q0 => {"R" => :q1},
:q1 => {"u" => :q2},
:q2 => {"u" => :q2, "b" => :q3},
:q3 => {"y" => :q4}
}


I think this will work quite well. If we look up a given state in the hash, we will get another hash of all the possible transitions. Matching this up with a input tape should be quite easy. I've sketched up a spec for how this should work in practice:


describe "the Ruby-language" do
before(:each) do
@r = DFSA.new({
:q0 => {"R" => :q1},
:q1 => {"u" => :q2},
:q2 => {"u" => :q2, "b" => :q3},
:q3 => {"y" => :q4}
}, :q0, [:q4])
end

it "should recognize valid sentences" do
@r.should recognize %w(R u b y)
@r.should recognize %w(R u u u u b y)
end

it "should reject invalid sentences" do
@r.should_not recognize %w(r u b y)
@r.should_not recognize %w(R u b)
end
end


Note that we give the input to the automaton as a list (I've picked the %w() syntax since it feels kinda Lispy). This is called tokenization, and for now we'll do it by hand. The second argument to the DFSA constructor is the symbol for the initial state, and the third one is the set of final states. We won't worry about the actual implementation yet, as we want to let this sink in for a bit.

The next article in this series will be to create a DSL for setting up the automaton, then I'll write up on non-deterministic automatons and pushdown automatons. I'll also show more actual code (I'll create a GitHub repos for the automaton library). Possibly I'll get to transducers as well. If you have any suggestions, please leave a comment or ship me an e-mail.

No comments: