Digital Analogue: March 2013 Archives

4 Mar 2013

Reason #73 why C++ is a terrible intro language

This particular one is so boneheaded I think I can explain it to you even if you have no particular programming background. Imagine that I have a set of instructions that goes something like this:

Write down your name.
Make a pile of pennies with as many pennies as there are letters in your name.
Keep doing this:

Write down an 'X'
Remove one of the pennies

as long as the number of pennies is less than the number of letters in your name.

That part in bold---this obviously won't work, right? Once you run out of pennies, either you declare yourself unable to follow one of the instructions ("remove one of the pennies") next time round (which we call "crashing"), or else you cleverly let yourself go into penny-debt and keep track of negative pennies, in which case the debt keeps growing and you still never have more pennies than your name is long, and so you get stuck running these instructions forever (which we call "hanging").

The following C++ is exactly precisely the same set of instructions as written above:***

    string name;
    cin >> name;
    int pennies = name.length();
    do
    {
        cout << 'X';
        pennies = pennies - 1;
    } while (pennies < name.length());

If a student writes that, it's by accident; they need to see that there's an error in the specification, because they really wanted that last condition to be pennies >= 0, that is, "as long as you haven't yet gone into penny-debt (negative numbers of pennies)". Or perhaps something else, but definitely not what was written. Because the algorithm, as written, is obviously wrong, and needs to crash or print waaaay too many Xs or whatever.

Except, in C++, it works.

The proximate reason for this has to do with implicit type promotion, and I won't explain it in detail except to note that if you try to compare a near-zero negative number to a number that is mandatorily non-negative (which we call an "unsigned" number), the negative number is silently converted into a very large positive number. No warnings, no errors. So what happens here is that you decrease the number of pennies from zero, so you have -1 pennies; and when it asks if that number is less than the number of words in the name, it does one of these silent conversions, so it's actually asking if the number 4,294,967,295 is less than the number of letters in the name. (Or maybe, if the number 18,446,744,073,709,551,615 is less than the number of letters in the name, or even on an older system if 65,535 is less than the number of letters in the name, but the principle's the same in any case.) Since this huge number is, well, huge, the instructions say, yup we're done here.

I don't know of a single other modern programming language (other than C itself) that has this same problem.*

So, okay, this is a teaching moment, right? We can learn to see in the documentation that name.length() would give us an unsigned number, right? Let's see: docs say it would give us a size_type. What's that? That's not one of the types we've seen before.

Right, so it turns out that in C++ you can make different names for the same type. This is helpful from a software engineering perspective, since you can give more things more meaningfully distinct names, and communicate with the other programmers on your project what the role of something is. It's a little rougher on the beginners, though. Is size_type the same thing as unsigned int?

Well, maybe. If you search on this question you find a lot of answers from some very self-righteous software engineers talking about how important it is not to assume anything about what type size_type represents, because, as it turns out, it may vary by what particular thing you're getting the length of. Strings, such as in the code above, could have one kind of size_type, while dictionaries or sets or tables might, in theory, have another.

Fine, fine, I just goddamn want to know if it's signed or unsigned. Maybe I can just look up its actual definition in our current installation! A teaching moment after all. Since I want to model for the student how to discover this information, I could start at either of the two libraries they could possibly know where to look: iostream, which is the only library they're explicitly telling the compiler to include, or string, which they might notice at the top of the documentation is the place where the string-related definitions are. But from either starting point, if I look at the included definitions, what I actually find is a series of instructions to include other definitions files, which each include others, and so on, with no index and no organisation that would be evident to an intro student.

So it's another teaching moment, I guess, where I teach them about the command-line tool grep, which lets them search a whole bunch of files at once. It turns up many places that use size_type and a few different places that define it. Aha! A definition! And it turns out that size_type is an alias for... size_t.

O. M. F. G.

It really just never ends. In this case, whereas size_type is specific to the particular data type being used (strings, dictionaries, etc), size_t is a more global definition that applies throughout the language. So it's a different kind of alias for one of the basic types of the language, this one baked into the language specification. As a result, I can actually find it on a documentation site, which tells me that it is, indeed, an unsigned integral type (but it's not actually specified which integral type, of course).

So, to recap: a student with a weak understanding of an algorithm writes it in a way that is unquestionably incorrect. However, it "works" anyway, due to a quirk of C++ that requires twenty minutes of explanations involving three different library files and several websites. (And that's not even counting the explanation, which I skipped above, of why a near-zero negative number would be rendered into a very large positive number, which has to do with a binary representation called "two's complement" that we've covered but didn't sink in for all of them.)

Are we done yet? Of course not. Because although we tracked down what our library uses for size_type, there is not a specification in the standard that it must be unsigned; which means that not only does this "work" when it's not "supposed" to, its behaviour may or may not vary between different systems. Fortunately, this semester I have everyone working on the department server, so at least I can test it in the same environment they do. But still, having to explain all of these quirks to students is a) difficult given their lack of experience, and b) taking up time that I'd rather spend teaching them how to think. It's one thing to tell an advanced software engineer that in order to understand what a program would do they need a deep multi-year understanding of the language itself, every layer of abstraction it uses, and full understanding of the hardware it's running on. It's borderline malpractice to force that on someone who has been programming for less than two months.

C++ is a terrible, terrible introductory programming language.

* Though to be fair, PHP and Javascript are also notorious for having some bizarre WTF** consequences of a few of their choices for language semantics, and have whole websites devoted to cataloguing them. I wouldn't recommend them as intro languages, either.

** Worse Than Failure

***Well, as close as one could reasonably get, anyway. There's a little bit of legerdemain around "writing down" being input vs output, but none of the adaptations affect the larger point.

BELATED UPDATE: A link to this post was posted to Reddit in September, where it accumulated a few comments in addition to the ones posted here directly. Hi, Reddit!

"Gaping at the color balance on the map is ridiculous because Republicans have proven beyond any doubt in the past 30 years that they are absolutely dominant in areas where no one lives." --Ed, ginandtacos.com

Posted by blahedo at 11:03pm | Comments (7)