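It's the end of the month and I had been meaning to write things down again, so I finally made time for my notes on the Text Normalization Challenge, which ended ten days ago (honestly, I was just being lazy =3=).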
Example 1.
Original: A baby giraffe is 6ft tall and weighs 150lb.
Normalized: A baby giraffe is six feet tall and weighs one hundred fifty pounds sil
Example 2.
Original: $22,750
Normalized: twenty two thousand seven hundred fifty dollars
Example 3.
Original: September 5, 1895
Normalized: september fifth eighteen ninety five
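According to the competition's definition, text in the English track falls into 17 classes, such as dates, numbers, addresses, and so on. Within these classes the conversions follow recognizable patterns, but ambiguous cases occasionally show up. For example, the same 1972 becomes nineteen seventy two under the date class, but one thousand nine hundred seventy two under the numeric class.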
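To make the 1972 ambiguity concrete, here is a minimal sketch of the two readings. The function names (`cardinal_words`, `year_words`) are my own for illustration, and the handling of years such as 1905 or 2005 is an assumed convention, not something taken from the competition data.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Read 0..99 without hyphens, e.g. 72 -> 'seventy two'."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def below_thousand(n):
    """Read 1..999, e.g. 750 -> 'seven hundred fifty'."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest:
        parts.append(two_digits(rest))
    return " ".join(parts)


def cardinal_words(n):
    """Read 0..999999 as a plain number."""
    if not 0 <= n <= 999_999:
        raise ValueError("this sketch only covers 0..999999")
    if n == 0:
        return "zero"
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands:
        parts.append(f"{below_thousand(thousands)} thousand")
    if rest:
        parts.append(below_thousand(rest))
    return " ".join(parts)


def year_words(n):
    """Read a 4-digit year in pairs (assumed convention for edge cases)."""
    if not 1000 <= n <= 9999:
        raise ValueError("this sketch only covers 4-digit years")
    high, low = divmod(n, 100)
    if low == 0:
        # 2000 -> 'two thousand', 1900 -> 'nineteen hundred'
        if high % 10 == 0:
            return f"{ONES[high // 10]} thousand"
        return f"{two_digits(high)} hundred"
    if low < 10:
        # 2005 -> 'two thousand five', 1905 -> 'nineteen o five' (assumption)
        if high % 10 == 0:
            return f"{ONES[high // 10]} thousand {ONES[low]}"
        return f"{two_digits(high)} o {ONES[low]}"
    return f"{two_digits(high)} {two_digits(low)}"


print(year_words(1972))      # nineteen seventy two
print(cardinal_words(1972))  # one thousand nine hundred seventy two
```

The same token gives two different outputs, so the class has to be decided before any normalization function is chosen.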
My solution is based on BingQing Wei's public kernel, which I then improved in several steps:
1. Use xgboost to predict the class of each test case:
The model is similar to XGboost With Context Label Data (ACC: 99.637%) (also by BingQing Wei; big thanks for his work).
In addition, I use an extra xgboost model to predict whether a 4-digit number is 'DATE' or 'CARDINAL'.
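As a rough illustration of what "context" means for the classifier, the sketch below turns a token and its two neighbours into a fixed-length numeric vector. This is my own simplified version of the general idea, not the kernel's actual code: the names `encode_token` and `context_features` and the width of 10 characters per token are assumptions.

```python
def encode_token(token, width=10):
    """Encode a token as `width` character codes, zero-padded on the right."""
    codes = [ord(c) for c in token[:width]]
    return codes + [0] * (width - len(codes))


def context_features(tokens, index, width=10):
    """Concatenate the encodings of the previous, current and next token.

    Missing neighbours at sentence boundaries are encoded as all zeros.
    """
    prev_tok = tokens[index - 1] if index > 0 else ""
    next_tok = tokens[index + 1] if index + 1 < len(tokens) else ""
    return (encode_token(prev_tok, width)
            + encode_token(tokens[index], width)
            + encode_token(next_tok, width))


sentence = ["born", "in", "1972", "."]
features = context_features(sentence, 2)
print(len(features))  # 30
```

A vector like this, one per token, is the kind of input a gradient-boosted classifier can be trained on, with the class label ('DATE', 'CARDINAL', ...) as the target. The preceding word ("in" here) is what lets the model lean towards 'DATE' for a 4-digit number.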
2. For some classes, use a customized normalization function:
I handle MEASURE, DATE, MONEY, DECIMAL, CARDINAL, and DIGIT this way, because they have specific forms. Each customized normalization function reaches between 98.9% and 99.7% accuracy. (My functions can't handle some rare cases, such as Sept. 21th 2017. I wonder whether the top teams had a smarter way.)
For example, to deal with the 'DECIMAL' class, I use a dedicated function to normalize it.
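A minimal sketch of what such a DECIMAL function could look like is shown below. It is not my competition code: the name `normalize_decimal`, the helper `small_cardinal`, the limit of 0..999 for the integer part, and reading a zero digit as "zero" are all assumptions made to keep the example short.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def small_cardinal(n):
    """Read an integer in 0..999 as words (kept small for the sketch)."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0..999")
    if n < 20:
        return ONES[n]
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest >= 20:
        tens, ones = divmod(rest, 10)
        parts.append(TENS[tens])
        if ones:
            parts.append(ONES[ones])
    elif rest:
        parts.append(ONES[rest])
    return " ".join(parts)


def normalize_decimal(token):
    """Read '3.14' as 'three point one four'.

    The integer part is read as a cardinal, the fraction digit by digit.
    """
    text = token.replace(",", "")
    sign = ""
    if text.startswith("-"):
        sign, text = "minus ", text[1:]
    integer, _, fraction = text.partition(".")
    if not fraction.isdigit() or (integer and not integer.isdigit()):
        raise ValueError(f"not a decimal token: {token!r}")
    words = []
    if integer:
        words.append(small_cardinal(int(integer)))
    words.append("point")
    words.extend(ONES[int(d)] for d in fraction)
    return sign + " ".join(words)


print(normalize_decimal("3.14"))    # three point one four
print(normalize_decimal("150.25"))  # one hundred fifty point two five
```

Raising `ValueError` on anything that does not look like a decimal makes it easy to fall back to another strategy when the class prediction was wrong.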
3. Use xgboost to deal with binary ambiguous cases:
Binary ambiguous cases are tokens such as '-' and ':', which have two possible normalized targets: the original character or 'to'. The xgboost model handles these with ~98% precision and ~99.3% recall.
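For reference, this is how precision and recall can be computed for such a two-way decision. The sketch assumes that 'to' is treated as the positive class, and the toy labels are made up purely to show the arithmetic; they are not competition results.

```python
def precision_recall(y_true, y_pred, positive="to"):
    """Return (precision, recall) for the `positive` label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: the correct target for five '-' tokens and a model's guesses.
y_true = ["to", "-", "to", "to", "-"]
y_pred = ["to", "to", "to", "-", "-"]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here two of the three predicted 'to' are correct and two of the three true 'to' are found, so both values come out as 2/3. High precision matters in this step because a wrong 'to' replaces a character that should have been left alone.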
I'll post my approach for the Russian track another day when I have time; Russian grammar is seriously hard (╯°Д°)╯ ┻━┻ .

To be continued..