March 04, 2013

Reason #73 why C++ is a terrible intro language

This particular one is so boneheaded I think I can explain it to you even if you have no particular programming background. Imagine that I have a set of instructions that goes something like this:

Write down your name.
Make a pile of pennies with as many pennies as there are letters in your name.
Keep doing this:
Write down an 'X'
Remove one of the pennies
as long as the number of pennies is less than the number of letters in your name.
That part in bold---this obviously won't work, right? Once you run out of pennies, either you declare yourself unable to follow one of the instructions ("remove one of the pennies") next time round (which we call "crashing"), or else you cleverly let yourself go into penny-debt and keep track of negative pennies, in which case the debt keeps growing and you still never have more pennies than your name is long, and so you get stuck running these instructions forever (which we call "hanging").

The following C++ is exactly precisely the same set of instructions as written above:***

    string name;
    cin >> name;
    int pennies = name.length();
    do
    {
        cout << 'X';
        pennies = pennies - 1;
    } while (pennies < name.length());
If a student writes that, it's by accident; they need to see that there's an error in the specification, because they really wanted that last condition to be pennies >= 0, that is, "as long as you haven't yet gone into penny-debt (negative numbers of pennies)". Or perhaps something else, but definitely not what was written. Because the algorithm, as written, is obviously wrong, and needs to crash or print waaaay too many Xs or whatever.

Except, in C++, it works.

The proximate reason for this has to do with implicit type promotion, and I won't explain it in detail except to note that if you try to compare a near-zero negative number to a number that is mandatorily non-negative (which we call an "unsigned" number), the negative number is silently converted into a very large positive number. No warnings, no errors. So what happens here is that you decrease the number of pennies from zero, so you have -1 pennies; and when it asks if that number is less than the number of words in the name, it does one of these silent conversions, so it's actually asking if the number 4,294,967,295 is less than the number of letters in the name. (Or maybe, if the number 18,446,744,073,709,551,615 is less than the number of letters in the name, or even on an older system if 65,535 is less than the number of letters in the name, but the principle's the same in any case.) Since this huge number is, well, huge, the instructions say, yup we're done here.

I don't know of a single other modern programming language (other than C itself) that has this same problem.*

So, okay, this is a teaching moment, right? We can learn to see in the documentation that name.length() would give us an unsigned number, right? Let's see: docs say it would give us a size_type. What's that? That's not one of the types we've seen before.

Right, so it turns out that in C++ you can make different names for the same type. This is helpful from a software engineering perspective, since you can give more things more meaningfully distinct names, and communicate with the other programmers on your project what the role of something is. It's a little rougher on the beginners, though. Is size_type the same thing as unsigned int?

Well, maybe. If you search on this question you find a lot of answers from some very self-righteous software engineers talking about how important it is not to assume anything about what type size_type represents, because, as it turns out, it may vary by what particular thing you're getting the length of. Strings, such as in the code above, could have one kind of size_type, while dictionaries or sets or tables might, in theory, have another.

Fine, fine, I just goddamn want to know if it's signed or unsigned. Maybe I can just look up its actual definition in our current installation! A teaching moment after all. Since I want to model for the student how to discover this information, I could start at either of the two libraries they could possibly know where to look: iostream, which is the only library they're explicitly telling the compiler to include, or string, which they might notice at the top of the documentation is the place where the string-related definitions are. But from either starting point, if I look at the included definitions, what I actually find is a series of instructions to include other definitions files, which each include others, and so on, with no index and no organisation that would be evident to an intro student.

So it's another teaching moment, I guess, where I teach them about the command-line tool grep, which lets them search a whole bunch of files at once. It turns up many places that use size_type and a few different places that define it. Aha! A definition! And it turns out that size_type is an alias for... size_t.

O. M. F. G.

It really just never ends. In this case, whereas size_type is specific to the particular data type being used (strings, dictionaries, etc), size_t is a more global definition that applies throughout the language. So it's a different kind of alias for one of the basic types of the language, this one baked into the language specification. As a result, I can actually find it on a documentation site, which tells me that it is, indeed, an unsigned integral type (but it's not actually specified which integral type, of course).

So, to recap: a student with a weak understanding of an algorithm writes it in a way that is unquestionably incorrect. However, it "works" anyway, due to a quirk of C++ that requires twenty minutes of explanations involving three different library files and several websites. (And that's not even counting the explanation, which I skipped above, of why a near-zero negative number would be rendered into a very large positive number, which has to do with a binary representation called "two's complement" that we've covered but didn't sink in for all of them.)

Are we done yet? Of course not. Because although we tracked down what our library uses for size_type, there is not a specification in the standard that it must be unsigned; which means that not only does this "work" when it's not "supposed" to, its behaviour may or may not vary between different systems. Fortunately, this semester I have everyone working on the department server, so at least I can test it in the same environment they do. But still, having to explain all of these quirks to students is a) difficult given their lack of experience, and b) taking up time that I'd rather spend teaching them how to think. It's one thing to tell an advanced software engineer that in order to understand what a program would do they need a deep multi-year understanding of the language itself, every layer of abstraction it uses, and full understanding of the hardware it's running on. It's borderline malpractice to force that on someone who has been programming for less than two months.

C++ is a terrible, terrible introductory programming language.

* Though to be fair, PHP and Javascript are also notorious for having some bizarre WTF** consequences of a few of their choices for language semantics, and have whole websites devoted to cataloguing them. I wouldn't recommend them as intro languages, either.

** Worse Than Failure

***Well, as close as one could reasonably get, anyway. There's a little bit of legerdemain around "writing down" being input vs output, but none of the adaptations affect the larger point.

BELATED UPDATE: A link to this post was posted to Reddit in September, where it accumulated a few comments in addition to the ones posted here directly. Hi, Reddit!

"Gaping at the color balance on the map is ridiculous because Republicans have proven beyond any doubt in the past 30 years that they are absolutely dominant in areas where no one lives." --Ed, ginandtacos.com

Posted by blahedo at 11:03pm on 4 Mar 2013
Comments
So Don, I agree with you completely that C++ is a rather poor intro language (it can be a bear of a development language as well unless you can justify the performance gains). If you were king, which language(s) would you choose and why? Posted by Chris Bortz at 5:46pm on 10 Sep 2013
Lua is the best introductory programming language. Fact. Posted by Jon at 5:47pm on 10 Sep 2013
Well, any programming language that I know of has its quirks, and it's possible to write bad code in all of them. The issue you highlight should yield a compiler warning however, so I'm not entirely sure that this qualifies as a case in point for C++ being a bad language for beginners. Posted by GrandMaster789 at 6:27pm on 10 Sep 2013
I agree this is painful, and like you I would not recommend C++ to students as a first programming language. However, as I work with C++ and run into things like these, I find looking at what the standard says to be somewhat helpful in cases like these. Often more so than library files. For example, for this particular issue, it looks like X::size_type itself is baked into the specification of C++, just like size_t. The fact that size_type for std::string resolves to size_t seems to be an implementation detail in itself. I could be wrong, but it looks like size_type for all types (std::string included) is defined as a part of allocator_traits, and they specify: "X::size_type: unsigned integer type a type that can represent the size of the largest object in the allocation model." Of course, having to look and half-understand the standard for someone new to C++ is unreasonable, and it is possible it doesn't match what the compiler does. But looking at standard library implementation files could be misleading as well (besides painful). Posted by Oscar at 6:29pm on 10 Sep 2013
C and C++ are rather low level, so they definitely need a little bit more experience, but bad code can be written in any language. I would rather choose Python, Ruby or Lua as an introductory language. Posted by Studiosi at 2:30am on 11 Sep 2013
"[...] the negative number is silently converted into a very large positive number. No warnings, no errors." What are you talking about? "warning: comparison between signed and unsigned integer expressions [-Wsign-compare]". But yes, I do see your point. Posted by Johannes at 4:39pm on 17 Sep 2013
When I said "no warnings, no errors", I meant by default. If I compile -Wall (or -Wsign-compare) then yes, it does give a warning; but it gives that warning even if you are writing the correct version of the code! If you count up from zero and make the comparison "while (pennies < names.length())" you get precisely the same warning. So this is not at all helpful from a pedagogy perspective.

The standard is written that this kind of type conversion is normally silent, and the real problem here is that it's not even clear what names.length() returns (signed/unsigned), it's slightly difficult to find that out (and nearly impossible if you don't already have some idea of the potential problem), and at the point in a CS1 course where you might be doing this, the students may not even know the difference between signed and unsigned!

C++ has its place in the world. Intro CS courses are not that place.

Posted by blahedo at 8:11pm on 20 Oct 2013
Valid XHTML 1.0!