High-Level Structure of DSLs: Three Patterns
Over the last couple of
years, we have identified three typical ways of building DSLs. In this post I
introduce all three, and elaborate a bit on their respective trade-offs.
Formalising an Existing Notation
In the first case, the domain
for which you intend to build a language already has an established notation.
Your job is essentially the construction of an IDE. Often the notations
are used with Word, Excel or Powerpoint, and there is no real language
definition underlying the notation. Thus, no consistency checking is
performed by the tool, and it is easy to make mistakes.
As you attempt to formalise the language, you will likely find that the
existing notation wasn’t all that well defined up to now. Again, this is
not surprising: without a language implementation and a way to play with
it, it takes a lot of discipline to use a notation consistently. You
will clear things up along the way.
Likely, as you do this, you
will recognise things that could be improved. Thus, once the language is
formalised and you have developed an IDE, you will probably start evolving it.
Either by actually changing the language definition (and migrating models as
you go), or by providing modular language extensions that add more and/or
better abstractions.
In some sense, this one is
the simplest way of building a DSL — you are essentially reduced to a tool
builder. While this sounds uninspiring, the undeniable benefit is that you know
very well what language to build, at least initially. Here is an example of an
insurance system where we formalised a highly structured Word notation into an
MPS language. Note the (intended) strong similarity between the two.
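To make the idea concrete, here is a minimal sketch (not the actual MPS implementation; all names are invented) of what formalising such a notation amounts to: giving the formerly Word-based structure a schema, plus the consistency checks the office tool never performed.

```python
from dataclasses import dataclass

# Hypothetical schema for a formerly Word-based tariff table.
@dataclass
class RateRow:
    age_from: int
    age_to: int
    rate: float

def check_consistency(rows):
    """Checks Word never did: age bands must be contiguous, rates positive."""
    errors = []
    rows = sorted(rows, key=lambda r: r.age_from)
    for prev, cur in zip(rows, rows[1:]):
        if cur.age_from != prev.age_to + 1:
            errors.append(f"gap or overlap between {prev.age_to} and {cur.age_from}")
    errors += [f"non-positive rate in band {r.age_from}-{r.age_to}"
               for r in rows if r.rate <= 0]
    return errors
```

In a language workbench these checks become live error annotations in the editor; the point is that the model finally has a definition against which mistakes can be detected.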
Extending a Base Language
In this case you take an
existing language, usually a general-purpose programming language such as Java
or C, and incrementally add domain-specific abstractions.
Also in this case you usually
know very well what to build, because the typical way of finding out which
extensions to add centers around analysing existing base language code, looking
for idioms, patterns or random duplication, and trying to come up with good
abstractions for those. You do have to find a set of representative samples in
order to abstract things in the right way. Semantics are usually defined by
desugaring to the base language, which is a fancy way of saying that you
generate the same base language code from which you created the abstracted
extensions in the first place.
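As a minimal illustration of desugaring (a made-up example, not how mbeddr or MPS do it internally), consider an invented `unless` extension that is reduced to the plain base-language idiom it abstracts:

```python
# Minimal desugaring sketch: a made-up "unless" extension node is
# reduced to the base language (here: plain Python source text).
def desugar_unless(condition: str, body: str) -> str:
    # The extension adds no new semantics; it just generates the
    # base-language code users would otherwise write by hand.
    return f"if not ({condition}):\n    {body}"

print(desugar_unless("balance > 0", "send_reminder()"))
```

Real systems transform ASTs rather than strings, but the principle is the same: the extension’s meaning is defined entirely by the base-language code it expands into.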
One big plus is that users of
the language and its extensions always also have the base language available.
So in contrast to the previous pattern, it is not necessary that you understand
and capture the whole domain — users can always fall back to coding things in
the base language. You can literally “grow” the extensions bottom up, as you
see the need. Check out Guy Steele’s wonderful keynote Growing
a Language. Same idea, though you probably won’t be doing it in Lisp :-)
The approach has a few of its own
challenges though. The mix of DSL code and base language code may make things
hard to analyse; the usually Turing-complete base language can be used (more or
less) everywhere, which makes programs complex. Also, you typically won’t want
to add the extensions by modifying the base language invasively, because this
would result in a Frankenstein monster of a language. A modular set of
(composable) extensions is more elegant, more maintainable and more usable.
But this requires your language workbench to support language extension and
extension composition, which is still relatively rare.
The example for incremental,
modular language extension is mbeddr, a set of 38 extensions of C, plus a whole bunch of
additional languages. mbeddr supports state machines, interfaces + components,
physical units and many other C extensions. Each of them is modular, but they
still can be composed (i.e., used together) in a single mbeddr program.
mbeddr is built on top of MPS. One of MPS’ strong
features is modular language extension, which mbeddr exploits to the max.
You’ll read much more about mbeddr and MPS on this channel.
A Completely New DSL
This is probably what pops
into your mind initially when you think about “developing a DSL”. However, as
you might have guessed from reading about this one last, it’s not necessarily
the most typical case; and it’s for sure not the simplest one.
To build a DSL “from scratch”, you first have to analyse the domain. You
can’t just exploit an existing notation, or mine a code base for
repetitive or idiomatic code. You actually have to identify the experts
in the domain as well as artifacts that represent the domain, and then
design suitable abstractions. This requires a lot of experience. I will
talk about domain analysis in a future post.
Even though a DSL built this
way is — almost by definition — unique to a particular domain, we have
identified a typical structure:
At the core are expressions.
Essentially all languages need number literals, basic arithmetic operations and
comparison operators. No need to reinvent the wheel here. Many languages also
need things like functions or even higher-order functions. Again, exploit
what’s there in functional programming. Note that the core consists of
expressions, not statements; at this level, side effects are considered evil.
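A side-effect-free expression core is small enough to sketch in a few lines. The following is an illustrative toy interpreter (nested tuples standing in for an AST), not the reusable module a language workbench would give you:

```python
import operator

# Sketch of a side-effect-free expression core: number literals,
# basic arithmetic and comparison, represented as nested tuples.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "/": operator.truediv, "<": operator.lt, "==": operator.eq}

def eval_expr(node):
    if isinstance(node, (int, float)):   # number literal
        return node
    op, left, right = node               # ("op", lhs, rhs)
    return OPS[op](eval_expr(left), eval_expr(right))

# (3 + 4) * 2 < 15  ->  True
print(eval_expr(("<", ("*", ("+", 3, 4), 2), 15)))
```

Note there is no assignment and no statement node anywhere: evaluating an expression never changes anything, which is exactly the property you want at this level.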
Making our way from the
inside out, we have the domain-specific behaviors layer. This layer represents
the coarse-grained behaviors (in contrast to the fine-grained stuff represented
by the expressions). Here you typically want to rely on something that has
well-defined semantics, i.e., grab an existing programming
paradigm and adapt it to your domain. Particularly common candidates
are imperative/procedural, object-oriented, state-based, dataflow-oriented and
of course, you can stick with functions here as well. You will adapt things.
For example, instead of calling something a state, you might want to
call it a step, a process activity, a work item, or whatever fits the
domain. The important thing is that you stick to the established
semantics of the selected paradigm. That’s much easier than coming up
with your own, and making sure it is sound!
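A sketch of what such an adaptation looks like (the names `Workflow`, `advance` and the claim example are invented for illustration): the vocabulary is domain-specific, but underneath it is an ordinary state machine with unchanged semantics.

```python
# Sketch: the state-machine paradigm with domain vocabulary swapped in.
# "step" instead of "state", "advance" instead of "fire transition".
class Workflow:
    def __init__(self, steps, transitions, initial):
        self.steps = steps
        self.transitions = transitions   # (step, event) -> next step
        self.current = initial

    def advance(self, event):
        # Established state-machine semantics, unchanged underneath.
        self.current = self.transitions[(self.current, event)]
        return self.current

claim = Workflow(
    steps={"filed", "reviewed", "settled"},
    transitions={("filed", "review"): "reviewed",
                 ("reviewed", "approve"): "settled"},
    initial="filed",
)
claim.advance("review")   # -> "reviewed"
```

Because the semantics are the well-known ones, questions such as determinism or reachability already have answers; only the surface vocabulary is new.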
Finally, the outermost layer
consists of domain-specific structures. These are usually really specific to a
particular domain. The set of structures is one of the main outcomes of the
domain analysis. Get those right, and your users will accept the language as
representing the domain. Get those wrong, and your language is rather useless.
Of course, these structures might also be borrowed from programming. For example,
in a project 10 years ago we modelled insurance products essentially like
classes: the class was the product, product characteristics were like fields,
calculations were like methods, we used OO’s inheritance to express variants,
and we even had something that resembled the Template
Method pattern. Again, reuse of an existing paradigm
(object-orientation), this time for structures.
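The OO mapping described above can be sketched as follows. All names and numbers here are invented for illustration; the actual project used a dedicated DSL, not Python classes:

```python
# Sketch of the mapping: product = class, characteristics = fields,
# calculations = methods, variants = subclasses, with a template-method
# style premium calculation.
class Product:
    base_rate = 0.0                       # characteristic (per-mille rate)

    def risk_factor(self, age):           # "hook" overridden by variants
        return 1.0

    def premium(self, age, insured_sum):  # template method: fixed skeleton
        return insured_sum * self.base_rate / 1000 * self.risk_factor(age)

class TermLife(Product):                  # variant via inheritance
    base_rate = 2.0

    def risk_factor(self, age):
        return 1.0 if age < 50 else 2.0   # variant-specific calculation

print(TermLife().premium(age=55, insured_sum=100_000))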
Now, because the expression level overlaps almost completely among DSLs,
it makes sense to actually reuse a language module here (if your language
workbench supports modular language composition). What you want is a
modular, embeddable, extensible and restrictable functional language that
can be flexibly integrated with DSLs. Stay tuned; I will get back to this.
The language extension pattern
described before also has an impact here. When building a new DSL, I tend to
initially start with a relatively generic language (such as the expression
core). This is a good idea for several reasons:
· Initially, when you start developing a
language, you might not yet know exactly which abstractions are relevant and
justify their own first-class language concepts. So you use less abstract ones,
which, however, can be “composed” to create the required behavior. As you
better understand the domain, you create additional, more abstract concepts.
· The beauty of building your language bottom
up is that if users encounter some kind of corner case for which no first-class
abstraction exists in the language, they can “drop one level down”. They can
always express things; they are never blocked.
· Another reason for building a language
bottom up is that in many organisations, even if the original plan is to build one DSL,
it turns out that there are multiple subdomains that can benefit from their own
specific language concepts. When you use the bottom-up approach, you can
provide several language extensions that specifically address each of the
subdomains (did I mention that modular language extension is a very, very
useful capability? :-))
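This bottom-up growth can be sketched in a few lines (all names here are invented): at first a calculation rule is just a generic callable; once a recurring pattern is understood, a first-class concept is layered on top of the generic core, while users can still drop down a level for corner cases.

```python
# Sketch of "growing" a DSL bottom-up (all names are invented).
def generic_rule(fn):
    # The generic level users can always "drop down" to.
    return fn

def tiered_discount(threshold, rate):
    # First-class abstraction, introduced later, built on the generic core.
    return generic_rule(lambda amount: amount * rate if amount >= threshold else 0.0)

rule = tiered_discount(threshold=100, rate=0.25)

# Corner case with no first-class concept: fall back one level.
corner_case = generic_rule(lambda amount: min(amount * 0.25, 40.0))

print(rule(200), corner_case(900))  # 50.0 40.0
```

The first-class concept is pure composition of the generic level, so adding it later neither breaks existing rules nor blocks anyone in the meantime.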
Wrap Up
Not every DSL is the same,
but a couple of patterns can be identified, based on our work. The three above
are the most common ones. Do you know of others? Let me know.