Friday, January 23, 2015

Final Version of Written Corpus

Today I have uploaded and linked the final version of the Gachon Learner Corpus.

We got over 2.5 million words, so I'm pretty happy about that. I felt that number was big enough and that it's time to start doing some analysis myself.

I'll hopefully be presenting on the project at the 2015 KOTESOL international conference, so maybe I'll see some of you there.

The spoken corpus continues to be a monstrous task, but I'm still optimistic that I will eventually be able to get the transcripts completed and uploaded...but I'm not sure when.

Thanks for all the interest and support. Please email me with any questions.

Tuesday, October 7, 2014


So, obviously I have not gotten around to any of the things I thought I would have done by now.

The spoken learner corpus is still very far off and I haven't yet added the written data from last semester.

However, now that the KOTESOL conference is behind me and I'm settling into the routines of having a second child (yay for me!), I hope to have the new, larger version of the GLC up in a few days...maybe by the weekend.

Saturday, March 15, 2014

Status of Spoken Learner Corpus

Well, it's halfway into March, and I still haven't finished transcribing and sorting the first version of the spoken corpus.

Sorry to anyone out there who might be waiting for it.

To be honest, I'm not sure when it will be done. The two weeks during the winter break I was going to use to complete the project were lost to unforeseeable troubles completely unrelated to the project.

Now we are in the full swing of the spring semester and I'm swamped with work. I have six classes this semester with a total of around 160 students. That's a lot of marking and classroom management to be done. On top of that, I am conducting a very large DDL/student concordancing experiment with two other professors this semester, which takes up a lot of time (but will be worth it: it's the largest experiment of this kind ever done!). On top of that top, I've got to edit my conference proceedings submission from last year's KOTESOL conference and get my proposal done for next year's (it's gonna be on, you guessed it: DDL). Finally, I'm taking an edX course on statistics, which is hopefully gonna reinforce that I actually understand what all those numbers coming out of SPSS mean.

In any case, I don't expect to post the spoken corpus until sometime this summer.

But, there will be a new version of the written corpus out just after this semester ends. We might finally get to 2 million words. We already have the largest corpus of English produced by Korean learners (by far). My goal is to keep pushing and hopefully inspire others to keep producing larger corpora...and give them away for free.

These things really shouldn't be sold. Not when it costs so little to make them and so few of us love them enough to use them in our research.

Sunday, January 5, 2014

GLC 2.1 Text Files from Jae-Woong Choe

I've uploaded a nice gift from Jae-Woong Choe, a professor in the Department of Linguistics at Korea University.

He has created individual text files for each of the texts in the current corpus along with more accurate token counts for the overall corpus.

You can download these files here.

Many thanks to him for creating these files!

Thursday, December 19, 2013

The Gachon Learner Corpus Version 2.1

The newest version of the Gachon Learner Corpus is here!

This version includes all of the most recently collected texts.

Version 2.1 contains 1,609,517 tokens from 16,113 individual texts produced by 1,607 participants.

I made sure the data entry with all the new texts and learner profiles is consistent and doesn't have any of the errors that had been pointed out in the past.

Check back for more updates soon.

Monday, December 9, 2013

How-to File and Data Cleaning

First, many thanks to the Korean Association of Corpus Linguistics for inviting me to speak about the Gachon Learner Corpus at Korea University in Seoul this past Saturday.

It was an excellent conference, and I was really excited to hear Laurence Anthony give a tutorial on AntConc. It was very nice to meet him.

So, today I finally had a chance to make a tutorial on how to use the Gachon Learner Corpus.

You can check it out here.

I also had a chance to go through ver2.0 of the corpus and fix various issues with incorrectly entered data in the learner profiles.

Finally, I was able to get a more accurate word count for the corpus (it's actually slightly larger than I had thought) and a better count on the number of participants.

Thanks for bearing with me. I'm doing all this in my spare time, so it's slow going, but we're getting there.


1 - The newest version of the corpus should be available in one to two weeks. Once the semester has finished, I'll compile all the writing assignments for this semester and add them to the corpus. Check back for that.

2 - The Gachon Spoken Learner Corpus is coming soon! I hope to have finalized transcripts of the speaking exams uploaded by the end of February. The exams are five-minute conversations between students in the same courses as those involved in the written corpus, using the same book, and responding to exactly the same questions as the writing assignments. I'll keep a better log of updates from here on out.