May 02, 2008

What I've been up to, Part 1/N: NLP

For the last few weeks I've been on what Knox calls "junior leave", a sort of mini-sabbatical where I don't have to teach and I'm supposed to get stuff done in preparation for submitting a tenure application next year. I've been doing various things; I'll try to write about some of them, but the series is open-ended so I don't know how many parts it'll have. ;)

In terms of NLP research, I'm not much less stuck than I've been. Having come off teaching NLP just last term, I did have a couple ideas of things I'd like to play with, one involving multi-lingual Wikipedia as a highly-linked aid to various tasks, and another involving transliteration that sounded neat. I worked on the Wikipedia one a very little bit before deciding it wouldn't give me quite the connections I wanted, not least because the dumps of WP content are not synched across languages. I may come back to that at some point. The transliteration I spent a bunch more time on.

But in the end, I didn't come up with much. I could basically replicate the results in another paper, but as I started poking around with improving it, I found that the only things that worked were a little too specific to the exact task at hand, and therefore not very interesting (or, probably, publishable). The really clever parts of it, involving time series analysis of news corpora in different languages, were in the other guy's paper. In the end I found I didn't have a whole lot to add; a deadline passed, and I've basically dropped that, too.

My problem is, ultimately, that I'm not nearly as creative as a lot of people seem to think I am. The kind of creativity I'm good at tends to be more in small-scale cleverness, making highly derivative stuff that is tweaked and better in various small ways, or distilling a complex thing into its simpler core. That's great for teaching, but not so much for research. It's both funny and frustrating: when I read about experimental results (not just in CS), I almost always will come up with a list of additional things I'd like to know about the data, slightly different experiments that would flesh out the finding. But they're always so small that it might be worth the original authors doing them and folding them into their next publication, but not worth me trying to pick it up and do them myself.

I think, actually, that I'd be a pretty good research lab assistant, for roughly those reasons. What I want to do, though, is teach CS to smart college students, and that, paradoxically, means I need to be a good researcher. So there I sit, trying to come up with ideas.

There's a regional conference in the field being held at MSU next weekend, and I'd been planning to go to that, to chat with other NLP folks and to bounce ideas off people. I'm not even at the level of having ideas to bounce, and it seems like every time I take a couple days to go do something, I lose a whole precious week due to all the other stuff and the distraction. So I'd just about come to the conclusion that I'd be better off skipping it.

And as I looked over the site one last time and was preparing an email to send my regrets to its organiser (who I'd previously told I'd probably go), I discovered I didn't really want to send it, I really did want to go. I'm starting to get a little embarrassed to go to NLP conferences, not having any of my own work to show, now, for several years; and yet not going feels like I'm abandoning the field, which I also really don't want to do. I still find it all very interesting and understandable, and I've been mostly keeping up with my reading—which should make it easy enough to jump back in if only I can find the right topic.

Part of the topic problem, too, is that the main topic of my thesis is pretty dead-end-y. Looking back, that was already becoming true while I was still in grad school, and even if I had condensed it into a journal article right away I'm not sure it could've gotten published. There was a certain forest-for-the-trees aspect to it, since the sort of analysis my programs were doing (function tagging) was fairly surface level and required a lot of hand-tagged data (making it hard to apply outside English) and the hand-tagged data I was using was being superseded by a different set of data (making even the English applications a little iffy). The other corpora were using a much more detailed analytical form, which I still think might be overkill for a lot of applications (it certainly is harder to get good results on), but has now become quite standard. The reviewers of anything I write to extend my thesis work (if I were even interested in doing so) would be primarily chosen from among those people who had worked with the other corpus, who with some justification look at the linguistic model I was working under as too primitive and ad hoc, making it ever harder for this sort of thing to get accepted.

Which brings me back to finding a new topic. It needs to be something interesting, and it needs (for practical reasons) to be something I can do with the corpora I already have, because new corpora typically cost a $2K membership in the Linguistic Data Consortium, plus individual corpus costs, and I can't really justify that until I get something done with what I've already got. The topic needs (for other practical reasons) to be something I can make publishable inroads in by July, the deadline for an October conference, and the last conference deadline until next January or so (which is a little late for the tenure review).

So that's one of the things I've been up to for the last month and a half. I guess I'll go ahead and go to the conference next weekend and see if it triggers any ideas. And hope it doesn't stall me too much on getting any other work done.

"A wise and benevolent dictator in particular can still fall at the opposite far end of that spectrum. Because in general we're not able to find such leaders amongst humanity, I favor democracy and consensus building in politics. But because we're so easily able to IMAGINE a leader who outshines the self-centered compromises afforded by democracy, I favor deference to that ideal God as a framework for religion." --Jonathan Prykop

Posted by blahedo at 1:46pm on 2 May 2008