Notes and experiments on the D language,
part I

By leonardo maffi, V.1.3, Jan 9 2007
Keywords: D language, Python, programming

[Go back to the article index]

Software of this article: anagrams.zip

See the successive article too: d_notes2.html

The D language:
http://www.digitalmars.com/d/
http://en.wikipedia.org/wiki/D_programming_language

D looks good, but I think it can't teach much to a person that already knows Java or C++ languages. It's not revolutionary, it just improves and cleans C++ a bit. A programmer that already knows C++ and wants to learn some more computer science probably has to chose a very different language, like Haskell, Oz, Python, Ruby, Scheme, Lisp, OCaml, Dylan, etc.

D syntax is quite similar to C++ one, this is an advantage because lot of programmers know C and they can learn D in a very short time. But that's a disadvantage too, because D shares still some bad parts of the C++ syntax. I feel that in some cases D syntax isn't bold enough.

D has built-in strings, associative arrays, dynamic arrays and various other things. Someone may think that a programming language must be both simple and powerful, so it can load some kind of external library that implements strings, associative arrays, etc. Scheme language follows that line of thought. D language follows a more integrated approach, and Python has even more built-in things and features. Is having built-in associative arrays positive? I think so.

This is a quotation from an article that describes one of the best experimental comparisons between many programming languages languages:

http://page.mi.fu-berlin.de/~prechelt/Biblio/jccpprtTR.pdf

>Most of the programmers in the script group used the associative arrays provided by their language and stored the dictionary words to be retrieved by their number encodings. The search algorithm simply attempts to retrieve from this array, using prefixes of increasing length of the remaining rest of the current phone number as the key. Any match found leads to a new partial solution to be completed later. In contrast, essentially all of the non-script programmers chose either of the following solutions. In the simple case, they simply store the whole dictionary in an array, usually in both the original character form and the corresponding phone number representation. They then select and test one tenth of the whole dictionary for each digit of the phone number to be encoded, using only the first digit as a key to constrain the search space. This leads to a simple, but inefficient solution.<

Those Java and C++ programmers did have maps too, but they haven't used them. Most or all programmers that have used Python, Perl, etc have used associative arrays. For this program the best solution is to use associative arrays. So this study shows that if you don't have maps built into the language you may not use them even when you need them. D associative arrays have a very light and clear syntax, so probably D programmers use them to write this program, like those scripting language programmers.

Having built-in associative arrays has some other advantages:
- Resulting programs are shorter, and they have cleaner syntax; this reduces bugs and development time. A clean syntax helps to see the program instead of being distracted by syntactic noise. The syntax of the maps of the C++ STL is worse. Built-ins allow for a simpler syntax. See below for a simple code comparison example.
- Associative arrays are more integrated with the language, so all system code that's around the D compiler uses them, so they are often shorter, simpler, and more hackable.
- The compiler doesn't have to compile them every time, so compiling time is reduced (this can be felt for smaller programs). Generally the D compiler is much faster than MingGW (the Windows version of GCC). Compilation times for small programs are tents of seconds. The DMD compiler is so fast for small programs, that it has a "run" option that executes the code on the fly just after compiling it, without saving the executable. With this D can be used for certain kind of scripting duties too (the code isn't interpreted, it's compiled).
- I did study hashes in many situations, I have even implemented many kinds of them, but before using a scripting language (Python, but the same is true for other languages like Ruby or Perl, or Tcl, etc) I didn't grasp the usefulness of associative arrays, how many problems they can solve, etc. I didn't have them in the compiled languages that I did use. They are very powerful tools, they can often reduce complexity of an operation from O(n) or O(log n) to about O(1).
- Having built-in associative arrays allows the compiler to manage them at a bit higher level. This allows the user to modify, subclass them, and generally to modify them in a simpler way.
- A competent C++ programmer can write a program that uses STL strings and it's very fast, but when the program grows to larger sizes and complexity increases, the ability of the compiler to compile the growing castle of abstractions, or the programmer ability to fully understand what the compiler is doing, decreases, the result is that a program like ShedSkin (a Python => C++ compiler) often manages strings (STL ones) is a slower way than Python. You need to fully understand what the compiler is doing to write efficient C++ code.

That's why I think it's positive to have built-in strings and associative arrays for a C++-like language.
Here is an example program using dicts/maps/associative arrays, translated into D, C++ and Python versions (this example comes partially from the Wikipedia):

C++ version:

#include <iostream>
#include <map>
#include <string>
using namespace std;
int main() {
    map<string, string> phone_book;
    phone_book["Sally Smart"] = "555-9999";
    phone_book["John Doe"] = "555-1212";
    phone_book["J. Random Hacker"] = "553-1337";
    map<string, string>::const_iterator iter;
    for (iter = phone_book.begin(); iter != phone_book.end(); ++iter) {
        cout << "Number for " << iter->first << ": " << iter->second << endl;
    }
    phone_book.erase("Sally Smart");
    return 0;
}

D version:

import std.stdio;
void main() {
    char[][char[]] phone_book;
    phone_book["Sally Smart"] = "555-9999";
    phone_book["John Doe"] = "555-1212";
    phone_book["J. Random Hacker"] = "553-1337";
    foreach (key, value; phone_book)
        writefln("Number for ", key, ": ", value);
    phone_book.remove("Sally Smart");
}

Python version:

phonebook = {'Sally Smart' : '555-9999',
             'John Doe' : '555-1212',
             'J. Random Hacker' : '553-1337'}
for name, number in phonebook.iteritems():
    print "Number for", name + ":", number
del phonebook['Sally Smart']

Notes:
- Such programs are tiny. To do a more meaningful comparison it may need a program about 10000 line long.
- On my very old PC The compilation of this tiny C++ program requires about 3.6 seconds with MinGW, and about 0.27 seconds with the DMD D compiler (such small times aren't much significant).
- ShedSkin is able to translate that little Python program to essentially a version of the first C++ program, so the third program has all the semantics of the first one, with just a lighter semantics (but C++ is more powerful, it allows more control on things, so for example with C++ it's easy to use unsigned integers as dict values, while with Python that's a bit more difficult).

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

To start learning a new language I implement some basic programs of mine, like this one that finds all the anagrams inside a file of words, here is the Python 2.5 version:

from collections import defaultdict
sign_words = defaultdict(list)

for word in file("words.txt"):
    word = word.strip()
    sign_words[ "".join(sorted(word.lower())) ].append(word)

for anagrams in sign_words.itervalues():
    if len(anagrams) > 1:
        print "    ".join(anagrams)

I have implemented it in D too (this is my third version):

import std.stdio, std.stream, std.string;

void main() {
    char[][][char[]] words_sign;
    auto all_txt = cast(char[])std.file.read("words.txt");

    foreach(word; splitlines(all_txt)) {
        word = strip(word);
        words_sign[ tolower(word).sort ] ~= word;
    }

    foreach(word, anags; words_sign)
        if (anags.length > 1)
            writefln(join(anags, " "));
}

The result is quite good, some notes:
- The D version is quite short too. Try to implement that program with C only!
- I am ignorant still about D, and I am far from being an expert about C++ too, and D online documentation isn't much good, but despite that I have found little problems to implement this code. Things usually just work, and that's very good for me. That's the basic requirement for me to start trying to learn a new programming language.
- Some things are even shorter in D compared to Python ones, like tolower(word).sort instead of "".join(sorted(word.lower()))
In Python I'd like to have the sort method for strings too, but Python strings are immutable, and sort acts in place, so it can't be done. Python sort must return an iterable, so it returns a list of chars, it can't return a single object like a string. Seen from Python is this a D inconsistency.
- The D program contains very few explicit types, only the char[][][char[]] and cast(char[]). All other types are inferred by the compiler.
- all_txt contains all the input text, the input dictionary. I load all the file because I don't know how to iterate on its lines in D. forach on a textual file can be used to iterate on chars, avoiding the creation of the big list produced by splitlines. Modern Python is better here, because it allows to iterate without creating huge intermediate datasets. On the other hand I am quite ignorant of D, so maybe my program can be improved. It can be good for D to have a way to iterate on lines too, like Python files do.
- It seems that splitlines works both with and without a newline at the end of the file. Good.
- It seems D doesn't have string methods (beside length, ~ and few others), so you need to use string functions, inside std.string. I don't know how to manage namespaces to avoid name collisions yet. std.string contains a lot of string functions, some of them seem silly/rarely needed, so maybe it's not easy to remember them all. A good API is the one that doesn't have spurious functions that are less easy to remember. Adding more and more string functions/methods isn't a good way to improve an API.
- ~= is the array append. (I don't know how to extend the array with the chars of a string, but this isn't needed by this program).
- I have tried to replace foreach(word, anags; words_sign) with foreach(anags; words_sign.values) but the resulting code is slower, because words_sign.values creates a new list, while the first version iterates on the associative array itself.

D language has many little nice things, like Functions as Array Properties: if the first parameter to a function is an array, the function can be called as if it were a property of the array:

int[] array;
void foo(int[] a, int x);
foo(array, 3);
array.foo(3);   // means the same thing

This is a cute thing, and the resulting program has the same speed. I don't know if it's a good or useful thing, maybe it's bad because they aren't methods, they are functions, so some problems may occur if those functions aren't the correct ones, etc. Using that trick the look of program becomes:

import std.stdio, std.stream, std.string;

void main() {
    char[][][char[]] words_sign;
    auto all_txt = cast(char[])std.file.read("words.txt");

    foreach(word; all_txt.splitlines()) {
        word = word.strip();
        words_sign[ word.tolower().sort ] ~= word;
    }

    foreach(word, anags; words_sign)
        if (anags.length > 1)
            writefln(anags.join(" "));
}

Sort is called without () and tolower with, coming from Python I see this an inconsistency.
On a dictionary file of 400 kb, printing only nags.length>1 the running times are about 0.84 seconds for the D program, and about 3 seconds for the Python program. Moving the Python code inside a function, and using Psyco the Python code needs about 2.4 seconds. I think that a person that knows D can write the D version in about the same time needed to write the Python version (and the total line count is similar for this little program, for other programs results may differ. I think that for bigger programs D code becomes longer than the Python one). For this little test the overall results are quite good. Maybe ShedSkin has to use D instead C++ as target language.


Update V.1.1, Jan 5 2007:

I have tried to speed up a bit the D program, using some lower-level code:

import std.stdio, std.stream, std.string;

void main() {
    size_t line_start;
    char[][][char[]] words_sign;
    auto input_txt = cast(char[])std.file.read("words.txt");

    foreach (current_pos, c; input_txt) {
        if (c == '\n') {
            auto word = input_txt[line_start .. current_pos].strip();
            words_sign[ word.tolower().sort ] ~= word;
            line_start = current_pos+1;
        }
    }
    auto word = input_txt[line_start .. input_txt.length].strip();
    words_sign[ word.tolower().sort ] ~= word;

    foreach(word, anags; words_sign)
        if (anags.length > 1)
            writefln(">", anags.join(" "), "<");
}

This is faster because it doesn't create the large list of all the lines. For such small timings the version with the list is probably better. Coming from a scripting language helps you remember that only certain parts of a program may need faster (and lower-level C-like) code, for lot of other not-time-critical parts of a program you can use a higher styled code, that allows you to write less code lines, to program faster, and to debug the code in less time. That's an example of the big advantage of learning various kind of languages instead of just one/few, they teach you how to mix paradigms, and how to do things in various smarter ways more fit for every specific situation.


Update V.1.2, Jan 6 2007:

I have found how to lazily load file lines, this takes 2.67 seconds:

import std.stdio, std.stream, std.string;

void main() {
    char[][][char[]] words_sign;
    auto fin = new File("words.txt", FileMode.In); // Unbuffered
    foreach(char[] line; fin) {
        auto word = line.dup.strip(); // dup is necessary
        words_sign[ word.tolower().sort ] ~= word;
    }

    foreach(word, anags; words_sign)
        if (anags.length > 1)
            writefln(anags.join(" "));
}

Note the dup, it's necessary (I don't know why yet). This is the buffered version, it takes 0.72 second, this is probably the best compromise between speed, programming simplicity, and low line count:

import std.stdio, std.stream, std.string;

void main() {
    char[][][char[]] words_sign;
    auto fin = new BufferedFile("words.txt", FileMode.In); // Buffered
    foreach(char[] line; fin) {
        auto word = line.dup.strip(); // dup is necessary
        words_sign[ word.tolower().sort ] ~= word;
    }

    foreach(word, anags; words_sign)
        if (anags.length > 1)
            writefln(anags.join(" "));
}

Update V.1.3, Jan 9 2007:

Here are some more versions of the anagrams testing program.
Version for ShedSkin Python => C++ compiler:

sign_words = {}

for word in file("words.txt"):
    word = word.strip()
    wsignature = "".join(sorted(word.lower()))
    if wsignature in sign_words:
        sign_words[wsignature].append(word)
    else:
        sign_words[wsignature] = [word]

for anagrams in sign_words.values():
    if len(anagrams) > 1:
        print " ".join(anagrams)


C++ version 1 (for MinGW compiler):

#include <string>
#include <iostream>
#include <fstream>
#include <map>
#include <vector>
#include <algorithm>
#include <ctype.h>

using namespace std;

int main() {
    map<string, vector<string> > words_sign;
    string line;
    ifstream stream_in("words.txt");

    while(getline(stream_in, line)) {
        int start = line.find_first_not_of(" \r\n\t");
        int end = line.find_last_not_of(" \n\r\t");
        string word = line.substr(start, (end - start + 1));

        string key = string(word); // maybe this copy can be avoided
        // http://lists.debian.org/debian-gcc/2002/04/msg00092.html
        transform(key.begin(), key.end(), key.begin(), (int(*)(int))std::tolower);

        sort(key.begin(), key.end());
        words_sign[key].push_back(word);
    }

    map<string, vector<string> >::const_iterator map_iter;
    for (map_iter = words_sign.begin(); map_iter != words_sign.end(); ++map_iter)
        if (map_iter->second.size() > 1) {
            for (int i = 0; i < map_iter->second.size(); ++i) {
                cout << map_iter->second[i];
                if (i < map_iter->second.size()-1)
                    cout << " ";
            }
            cout << endl;
        }

    return 0;
}


C++ version 2, about the faster version I am currently able to create, it used hash-based maps instead of regular ones:

#include <string>
#include <fstream>
#include <ext/hash_map>
#include <vector>
#include <algorithm>
#include <ctype.h>
#include <ext/hash_map>

// http://gcc.gnu.org/ml/libstdc++/2002-04/msg00107.html
namespace __gnu_cxx {
    template<> struct hash< std::string > {
        size_t operator()( const std::string& x ) const {
            return hash< const char* >()( x.c_str() );
        }
    };
}

using namespace std;

int main() {
    __gnu_cxx::hash_map<string, vector<string> > words_sign;
    string line;
    ifstream stream_in("words.txt");

    while(getline(stream_in, line)) {
        int start = line.find_first_not_of(" \r\n\t");
        int end = line.find_last_not_of(" \n\r\t");
        string word = line.substr(start, (end - start + 1));

        string key = string(word);
        for (int pos=0; pos<word.size(); pos--)
            key[pos] = tolower(key[pos]);

        sort(key.begin(), key.end());
        words_sign[key].push_back(word);
    }

    __gnu_cxx::hash_map<string, vector<string> >::const_iterator hash_map_iter;
    for (hash_map_iter = words_sign.begin(); hash_map_iter != words_sign.end(); ++hash_map_iter)
        if (hash_map_iter->second.size() > 1) {
            vector<string> words = hash_map_iter->second;
            int nwords = words.size();
            for (int i = 0; i < nwords-1; ++i)
                printf("%s ", words[i].c_str());
            printf("%s\n", words[nwords-1].c_str());
        }

    return 0;
}

I have done some tests, using a bigger file of English words (about 2.04 MB of words, that is about 216_000 words), the timings are best of three runs:


Source name    Running   Zipped   Compiled   Compilation
               time (s)  source   size (KB)  time (s)
                         (bytes)
D (buffered)    4.23      288       158      0.50 DMD
C++ hash map    6.09      701       288      8.85 MinGW
C++ map         7.06      601       288      8.20 MinGW
Python+Psyco   11.74      209       ---      Very fast
Python         14.95      184       ---      ---
ShedSkin       25.8       190       687      23+35 SS+MinGW

Few notes:
- Despite my attempts, my C++ version is quite slower than the D version. I don't know why yet.
- Despite knowing about as much D as C++, to write the D version I have needed about 1/4 of the time needed to write the C++ version, and I have written three D versions first.
- anagrams_ccp2.cpp uses hashes so it's faster (it also contains few other optimisations). But I've had to look on the net to find the right way to use hash_map with the MinGW compiler. For Microsoft compiler the code may be different.
- D associative arrays use a hash.
- I've had to patch a kind of bug to use the transform inside the C++ versions. This is silly.
- I don't understand why D doesn't have 'true' string methods instead of string functions inside a separated module.
- For my tastes there are too many ways of loading a file in D. I think all those file open and loading/streaming can be reorganised into a single most commonly used class, with some parameters to chose the kind of file, to buffer or an-buffer the file, etc. For special situations there can be 2-3 more file/streaming ways inside some libraries.
- Designers of the C++ STL string class seem to have missed some very commonly used string methods.

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Some more notes on D:

I don't like the "writef" and "writefln" names of the printing function, "write" e "writenl" seem better.

This D code:

import std.stdio;
void main() {
  char[][] a = ["1","2","3"];
  writefln(a, " ", 1.0);
}

prints (no "" or .0):

[1,2,3] 1

And this:

import std.stdio;
void main() {
  writefln(["1","2","3"], " ", 1.0);
}

Gives this runtime error:
[Error: Access Violation
writefln has to become smarter to avoid such error, or the compiler must catch this problem at compile time.

This works correctly:

import std.stdio;
void main() {
  int[][] c = [[1,2,3],[4,5,6]];
  writefln(c);
}

but it seems writefln() isn't able to print associative arrays:

import std.stdio;
void main() {
  int[int] a;
  a[1] = 2;
  a[3] = 4;
  writefln(a);
}

foreach() has a nice implicit enumeration capability too:

import std.stdio;
void main() {
  char[][] a = ["1","2","3"];
  foreach(el; a)
    writefln(el);
  foreach(i, el; a)
    writefln(i, " ", el);
}


D templates are quite powerful, and their syntax is simpler and much more clean than C++ ones:
http://www.digitalmars.com/d/templates-revisited.html
Using static if and other things Tomasz Stachowiak has even created a compile-time-only ray-tracer:
http://www-users.mat.uni.torun.pl/~h3r3tic/ctrace/

Templates are a purely functional language built on top of C++/D mostly used to manages types, while Lisp/Scheme languages allow to create macros using the language itself.

Software of this article: anagrams.zip

[Go back to the article index]