High-Level Structure of DSLs: Three Patterns
Over the last couple of
years, we have identified three typical ways of building DSLs. In this post I
introduce all three, and elaborate a bit on their respective trade-offs.
Formalising an Existing Notation
In the first case, the domain
for which you intend to build a language already has an established notation.
Your job is essentially the construction of an IDE. Often the notations
are used with Word, Excel or Powerpoint, and there is no real language
definition underlying the notation. Thus, no consistency checking is
performed by the tool, and it is easy to make mistakes.
As you attempt to formalise the language, you will likely find that the
existing notation wasn’t all that well defined up to now. Again, this is
not surprising: without a language implementation and a way to play with
it, it takes a lot of discipline to use a notation consistently. You
will clear things up along the way.
Likely, as you do this, you
will recognise things that could be improved. Thus, once the language is
formalised and you have developed an IDE, you will probably start evolving it.
Either by actually changing the language definition (and migrating models as
you go), or by providing modular language extensions that add more and/or
better abstractions.
In some sense, this one is
the simplest way of building a DSL — you are essentially reduced to a tool
builder. While this sounds uninspiring, the undeniable benefit is that you know
very well what language to build, at least initially. Here is an example of an
insurance system where we formalised a highly structured Word notation into an
MPS language. Note the (intended) strong similarity between the two.
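To make the idea concrete, here is a minimal sketch (not the actual MPS implementation; all names are invented) of what formalising such a notation amounts to: giving the formerly Word-based structure a schema, plus the consistency checks the office tool never performed.

```python
from dataclasses import dataclass

# Hypothetical schema for a formerly Word-based tariff table.
@dataclass
class RateRow:
    age_from: int
    age_to: int
    rate: float

def check_consistency(rows):
    """Checks Word never did: age bands must be contiguous, rates positive."""
    errors = []
    rows = sorted(rows, key=lambda r: r.age_from)
    for prev, cur in zip(rows, rows[1:]):
        if cur.age_from != prev.age_to + 1:
            errors.append(f"gap or overlap between {prev.age_to} and {cur.age_from}")
    errors += [f"non-positive rate in band {r.age_from}-{r.age_to}"
               for r in rows if r.rate <= 0]
    return errors
```

In a language workbench these checks become live error annotations in the editor; the point is that the model finally has a definition against which mistakes can be detected.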
Extending a Base Language
In this case you take an
existing language, usually a general-purpose programming language such as Java
or C, and incrementally add domain-specific abstractions.
Also in this case you usually
know very well what to build, because the typical way of finding out which
extensions to add centers around analysing existing base language code, looking
for idioms, patterns or random duplication, and trying to come up with good
abstractions for those. You do have to find a set of representative samples in
order to abstract things in the right way. Semantics are usually defined by
desugaring to the base language, which is a fancy way of saying that you
generate the same base language code from which you created the abstracted
extensions in the first place.
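As a minimal illustration of desugaring (a made-up example, not how mbeddr or MPS do it internally), consider an invented `unless` extension that is reduced to the plain base-language idiom it abstracts:

```python
# Minimal desugaring sketch: a made-up "unless" extension node is
# reduced to the base language (here: plain Python source text).
def desugar_unless(condition: str, body: str) -> str:
    # The extension adds no new semantics; it just generates the
    # base-language code users would otherwise write by hand.
    return f"if not ({condition}):\n    {body}"

print(desugar_unless("balance > 0", "send_reminder()"))
```

Real systems transform ASTs rather than strings, but the principle is the same: the extension’s meaning is defined entirely by the base-language code it expands into.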
One big plus is that users of
the language and its extensions always also have the base language available.
So in contrast to the previous pattern, it is not necessary that you understand
and capture the whole domain — users can always fall back to coding things in
the base language. You can literally “grow” the extensions bottom up, as you
see the need. Check out Guy Steele’s wonderful keynote Growing
a Language. Same idea, though you probably won’t be doing it in Lisp :-)
The approach has a few of its own
challenges though. The mix of DSL code and base language code may make things
hard to analyse; the usually Turing-complete base language can be used (more or
less) everywhere, which makes programs complex. Also, you typically won’t want
to add the extensions by modifying the base language invasively, because this
would result in a Frankenstein monster of a language. A modular set of
(composable) extensions is more elegant, more maintainable and more usable.
But this requires your language workbench to support language extension and
extension composition, which is still relatively rare.
The example for incremental,
modular language extension is mbeddr, a set of 38 extensions of C, plus a whole bunch of
additional languages. mbeddr supports state machines, interfaces + components,
physical units and many other C extensions. Each of them is modular, but they
still can be composed (i.e., used together) in a single mbeddr program.
mbeddr is built on top of MPS. One of MPS’ strong
features is modular language extension, which mbeddr exploits to the max.
You’ll read much more about mbeddr and MPS on this channel.
A Completely New DSL
This is probably what pops
into your mind initially when you think about “developing a DSL”. However, as
you might have guessed from reading about this one last, it’s not necessarily
the most typical case; and it’s for sure not the simplest one.
To build a DSL “from scratch”, you first have to analyse the domain. You
can’t just exploit an existing notation, or mine a code base for
repetitive or idiomatic code. You actually have to identify the experts
in the domain as well as artifacts that represent the domain, and then
design suitable abstractions. This requires a lot of experience. I will
talk about domain analysis in a future post.
Even though a DSL built this
way is — almost by definition — unique to a particular domain, we have
identified a typical structure:
At the core are expressions.
Essentially all languages need number literals, basic arithmetic operations and
comparison operators. No need to reinvent the wheel here. Many languages also
need things like functions or even higher-order functions. Again, exploit
what’s there in functional programming. Note that the core consists of
expressions, not statements; at this level, side effects are considered evil.
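A side-effect-free expression core is small enough to sketch in a few lines. The following is an illustrative toy interpreter (nested tuples standing in for an AST), not the reusable module a language workbench would give you:

```python
import operator

# Sketch of a side-effect-free expression core: number literals,
# basic arithmetic and comparison, represented as nested tuples.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "/": operator.truediv, "<": operator.lt, "==": operator.eq}

def eval_expr(node):
    if isinstance(node, (int, float)):   # number literal
        return node
    op, left, right = node               # ("op", lhs, rhs)
    return OPS[op](eval_expr(left), eval_expr(right))

# (3 + 4) * 2 < 15  ->  True
print(eval_expr(("<", ("*", ("+", 3, 4), 2), 15)))
```

Note there is no assignment and no statement node anywhere: evaluating an expression never changes anything, which is exactly the property you want at this level.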
Making our way from the
inside out, we have the domain-specific behaviors layer. This layer represents
the coarse-grained behaviors (in contrast to the fine-grained stuff represented
by the expressions). Here you typically want to rely on something that has
well-defined semantics, i.e., grab an existing programming
paradigm and adapt it to your domain. Particularly common candidates
are imperative/procedural, object-oriented, state-based, dataflow-oriented and
of course, you can stick with functions here as well. You will adapt things.
For example, instead of calling something a state, you might want to
call it a step, a process activity, a work item, or whatever fits the
domain. The important thing is that you stick to the established
semantics of the selected paradigm. That’s much easier than coming up
with your own, and making sure it is sound!
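A sketch of what such an adaptation looks like (the names `Workflow`, `advance` and the claim example are invented for illustration): the vocabulary is domain-specific, but underneath it is an ordinary state machine with unchanged semantics.

```python
# Sketch: the state-machine paradigm with domain vocabulary swapped in.
# "step" instead of "state", "advance" instead of "fire transition".
class Workflow:
    def __init__(self, steps, transitions, initial):
        self.steps = steps
        self.transitions = transitions   # (step, event) -> next step
        self.current = initial

    def advance(self, event):
        # Established state-machine semantics, unchanged underneath.
        self.current = self.transitions[(self.current, event)]
        return self.current

claim = Workflow(
    steps={"filed", "reviewed", "settled"},
    transitions={("filed", "review"): "reviewed",
                 ("reviewed", "approve"): "settled"},
    initial="filed",
)
claim.advance("review")   # -> "reviewed"
```

Because the semantics are the well-known ones, questions such as determinism or reachability already have answers; only the surface vocabulary is new.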
Finally, the outermost layer
consists of domain-specific structures. These are usually really specific to a
particular domain. The set of structures is one of the main outcomes of the
domain analysis. Get those right, and your users will accept the language as
representing the domain. Get those wrong, and your language is rather useless.
Of course, these structures might also be borrowed from programming. For example,
in a project 10 years ago we modelled insurance products essentially like
classes: the class was the product, product characteristics were like fields,
calculations were like methods, we used OO’s inheritance to express variants,
and we even had something that resembled the Template
Method pattern. Again, reuse of an existing paradigm
(object-orientation), this time for structures.
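The OO mapping described above can be sketched as follows. All names and numbers here are invented for illustration; the actual project used a dedicated DSL, not Python classes:

```python
# Sketch of the mapping: product = class, characteristics = fields,
# calculations = methods, variants = subclasses, with a template-method
# style premium calculation.
class Product:
    base_rate = 0.0                       # characteristic (per-mille rate)

    def risk_factor(self, age):           # "hook" overridden by variants
        return 1.0

    def premium(self, age, insured_sum):  # template method: fixed skeleton
        return insured_sum * self.base_rate / 1000 * self.risk_factor(age)

class TermLife(Product):                  # variant via inheritance
    base_rate = 2.0

    def risk_factor(self, age):
        return 1.0 if age < 50 else 2.0   # variant-specific calculation

print(TermLife().premium(age=55, insured_sum=100_000))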
Now, because the expression level overlaps almost completely among DSLs,
it makes sense to actually reuse a language module here (if your language
workbench supports modular language composition). What you want is a
modular, embeddable, extensible and restrictable functional language that
can be flexibly integrated with DSLs. Stay tuned; I will get back to this.
The language extension pattern
described before also has an impact here. When building a new DSL, I tend to
initially start with a relatively generic language (such as the expression
core). This is a good idea for several reasons:
· Initially, when you start developing a
language, you might not yet know exactly which abstractions are relevant and
justify their own first-class language concepts. So you use less abstract ones,
which, however, can be “composed” to create the required behavior. As you
better understand the domain, you create additional, more abstract concepts.
· The beauty of building your language bottom
up is that if users encounter some kind of corner case for which no first-class
abstraction exists in the language, they can “drop one level down”. They can
always express things; they are never blocked.
· Another reason for building a language
bottom up is that in many organisations, even if the original plan is to build one DSL,
it turns out that there are multiple subdomains that can benefit from their own
specific language concepts. When you use the bottom-up approach, you can
provide several language extensions that specifically address each of the
subdomains (did I mention that modular language extension is a very, very
useful capability? :-))
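This bottom-up growth can be sketched in a few lines (all names here are invented): at first a calculation rule is just a generic callable; once a recurring pattern is understood, a first-class concept is layered on top of the generic core, while users can still drop down a level for corner cases.

```python
# Sketch of "growing" a DSL bottom-up (all names are invented).
def generic_rule(fn):
    # The generic level users can always "drop down" to.
    return fn

def tiered_discount(threshold, rate):
    # First-class abstraction, introduced later, built on the generic core.
    return generic_rule(lambda amount: amount * rate if amount >= threshold else 0.0)

rule = tiered_discount(threshold=100, rate=0.25)

# Corner case with no first-class concept: fall back one level.
corner_case = generic_rule(lambda amount: min(amount * 0.25, 40.0))

print(rule(200), corner_case(900))  # 50.0 40.0
```

The first-class concept is pure composition of the generic level, so adding it later neither breaks existing rules nor blocks anyone in the meantime.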
Wrap Up
Not every DSL is the same,
but a couple of patterns can be identified, based on our work. The three above
are the most common ones. Do you know of others? Let me know.