April 02, 2013

Java Scanner silent fail

I just made a discovery that will be of no interest to the non-technical folks out there.

If you use Java's built-in Scanner (as I've done hundreds of times) to read in a bunch of text, it turns out that if any of that text is not valid in the expected encoding, the Scanner silently treats it as unreadable: hasNext() returns false, and for all intents and purposes it looks as if you've hit the end of the file. Here's the catch: this happens as soon as the Scanner reads the bad character into its buffer, *not* when your cursor catches up to the bad character.

The way this manifests is that your data seems to be silently truncated for no apparent reason. If you look at the portion of the file where it stops, there appears to be nothing wrong there---and there isn't. The problem is somewhere in the next few hundred characters.
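To see the truncation in isolation, here is a minimal sketch; the class name, temp file, and token counts are all made up for illustration. It writes a file with a single byte that is not valid UTF-8 partway through, counts the tokens a Scanner hands back, and then calls the Scanner's ioException() method, which returns the last IOException thrown by the underlying source (or null if there was none).

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Scanner;

class ScannerTruncationDemo {
    // Hypothetical sizes, chosen so the bad byte sits well past the first buffer fill.
    static final int TOKENS_BEFORE = 5000;
    static final int TOKENS_AFTER = 10;

    /** What the Scanner delivered, plus the exception it swallowed (null if none). */
    static final class Result {
        final int tokensRead;
        final IOException swallowed;

        Result(int tokensRead, IOException swallowed) {
            this.tokensRead = tokensRead;
            this.swallowed = swallowed;
        }
    }

    /** Writes valid tokens, then one byte that is never valid UTF-8, then more valid tokens. */
    static File writeNoisyFile() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            for (int i = 0; i < TOKENS_BEFORE; i++) {
                bytes.write(("word" + i + " ").getBytes(StandardCharsets.UTF_8));
            }
            bytes.write(0xFF); // 0xFF cannot occur anywhere in well-formed UTF-8
            for (int i = 0; i < TOKENS_AFTER; i++) {
                bytes.write((" tail" + i).getBytes(StandardCharsets.UTF_8));
            }
            File f = File.createTempFile("noisy", ".txt");
            f.deleteOnExit();
            Files.write(f.toPath(), bytes.toByteArray());
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Counts tokens until hasNext() says stop, then asks the Scanner what went wrong. */
    static Result countTokens(File f, String charsetName) {
        try (Scanner in = new Scanner(f, charsetName)) {
            int n = 0;
            while (in.hasNext()) {
                in.next();
                n++;
            }
            return new Result(n, in.ioException());
        } catch (IOException e) { // the file could not be opened
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File f = writeNoisyFile();
        for (String cs : new String[] {"UTF-8", "ISO-8859-1"}) {
            Result r = countTokens(f, cs);
            System.out.println(cs + ": read " + r.tokensRead + " tokens (wrote "
                    + (TOKENS_BEFORE + TOKENS_AFTER) + "), swallowed: " + r.swallowed);
        }
    }
}
```

The UTF-8 pass should come back with fewer tokens than were written and a non-null exception, while the ISO-8859-1 pass reads to the real end of the file, since every byte is a valid ISO-8859-1 character (the stray byte just shows up as one extra token). Checking ioException() after the read loop is a cheap way to tell a real end of file from a swallowed decoding error.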

The workaround, if you know what encoding your input uses (and you're sure there's no noise in it), is to specify it:

  Scanner in = new Scanner (new File (filename), "ISO-8859-1");

(or, similarly, "UTF-8"). If you expect your data might be noisy and you can't get at it in advance to clean it up, I'm not sure that you can use a Scanner, although you may be able to do something involving rolling your own BufferedReader.
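One possible shape for the roll-your-own-reader idea, offered as a sketch rather than a tested recipe: build a CharsetDecoder that replaces undecodable bytes instead of reporting them, wrap it in an InputStreamReader and BufferedReader, and hand that reader to the Scanner. The class and helper names here (LenientScannerDemo, openLenient, noisySample, tokensOf) are hypothetical.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class LenientScannerDemo {
    /** Opens a Scanner over f that replaces undecodable bytes instead of giving up. */
    static Scanner openLenient(File f, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return new Scanner(new BufferedReader(
                    new InputStreamReader(new FileInputStream(f), decoder)));
        } catch (IOException e) { // the file could not be opened
            throw new UncheckedIOException(e);
        }
    }

    /** Sample input: "alpha beta", one byte that is invalid in UTF-8, then "gamma delta". */
    static byte[] noisySample() {
        byte[] head = "alpha beta ".getBytes(StandardCharsets.UTF_8);
        byte[] tail = " gamma delta".getBytes(StandardCharsets.UTF_8);
        byte[] all = new byte[head.length + 1 + tail.length];
        System.arraycopy(head, 0, all, 0, head.length);
        all[head.length] = (byte) 0xFF;
        System.arraycopy(tail, 0, all, head.length + 1, tail.length);
        return all;
    }

    /** Dumps the bytes to a temp file and returns every token the lenient Scanner sees. */
    static List<String> tokensOf(byte[] data) {
        try {
            File f = File.createTempFile("lenient", ".txt");
            f.deleteOnExit();
            Files.write(f.toPath(), data);
            List<String> tokens = new ArrayList<>();
            try (Scanner in = openLenient(f, StandardCharsets.UTF_8)) {
                while (in.hasNext()) {
                    tokens.add(in.next());
                }
            }
            return tokens;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokensOf(noisySample()));
    }
}
```

The tradeoff is that each bad byte turns into the decoder's replacement character (U+FFFD by default), so the noise is still in the data, but the text after it is no longer lost. Using CodingErrorAction.IGNORE instead would drop the bad bytes entirely.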

That took a stupid amount of time to track down, though. "What do you mean, you're at the end of the file? I can see more data RIGHT THERE."

"When judging the relative merits of programming languages, some still seem to equate 'the ease of programming' with the ease of making undetected mistakes." --Edsger Dijkstra

Posted by blahedo at 4:30pm on 2 Apr 2013