2017-11-30

[Kaggle] Text Normalization Challenge - English Language

前言
心得

前言

月底了，心中一直想著要再記錄一下，於是我終於抽空寫了十天前就結束的Text Normalization Challenge 的心得(其實只是懶 =3=)。
這項競賽一樣是在Kaggle上發起的競賽，根據競賽描述，我們要設計能將文章語句轉換成口說語法的機器學習模型，舉個例子：

Example 1.
原文： A baby giraffe is 6ft tall and weighs 150lb.
轉換： A baby giraffe is six feet tall and weighs one hundred fifty pounds sil

Example 2.
原文： $22,750
轉換： twenty two thousand seven hundred fifty dollars

Example 3.
原文： September 5, 1895
轉換： september fifth eighteen ninety five

Text Normalization Challenge - English Language

心得

根據競賽的定義，在英文項目底下的文字總共可以分成17個類別，比如日期、數字、地址…，這些類別裡，文字的轉換是有跡可循的，但是偶爾會出現不明確的轉換，比如同樣是1972，在日期類別底下會轉換成nineteen seventy two，但是在數值類別的話就是one thousand nine hundred seventy two。

因為有用英文寫了一下大概的解題思路，因此這邊也就懶的翻譯了，直接上原文說明：

My solution is based on BingQing Wei’s public kernel, then I use several step to optimized it:

1. Use xgboost to predict test cases’ class:
The model is similar to XGboost With Context Label Data (ACC: 99.637%) (the author is also BingQing Wei, big thanks to his work)

In addition, I use extra xgboost model to predict a 4 digit number is ‘DATE’ or ‘CARDINAL’.

2. For some class, use customized normalize function to deal with it:
I treat MEASURE, DATE, MONEY, DECIMAL, CARDINAL, and DIGIT. Because they have specific form. Each customized normalize function can reach from 98.9% to 99.7% acc. (But my customized normalize function can’t handle some rare case, such like Sept. 21th 2017. I’m wondering that did the top team have smarter way.)

For example, to deal with the ‘DECIMAL’ class. I will use a function to normalized it.

def decimal2word(key):
#
# 100% acc if change
#
if(len(key.split()) == 2):
    # e.g. 0.21 million
    unit_words = ['hundred', 'thousand', 'million', 'billion']
    if(not is_decimal(key.split()[0])):
        return key
    else:
        if(((key.split()[1]).lower() in unit_words):
            return decimal2word(key.split()[0]) + ' ' + (key.split()[1]).lower()
        else:
            return key
else:
    if(not is_decimal(key)):
        return key
    digit_dict = {'0': 'o', '1': 'one', '2': 'two', '3': 'three', '4': 'four', '5': 'five', '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine'}
    out = []
    if(key[0] == '.'):
        # e.g. .021 to point o two one
        out.append('point')
        for v in key.replace('.',''):
            out.append(digit_dict[v])
    else:
        n1, n2 = str(int(key.split('.')[0])), key.split('.')[1]
        
        out.append(digit2word(n1))
        out.append('point')
        if(len(n2) == 1 and n2[0] == '0'):
            out.append('zero')
        else:            
            for v in n2:
                out.append(digit_dict[v])
    word = ' '.join(out)
    return word

3. Use xgboost to deal with binary ambiguous case:
Binary ambiguous case such like the ‘-‘ and the ‘:’, which have two target norm, original char and ‘to’. With xgboost model, it’s able to handel ~98% precision and ~99.3% recall.

最後我以99.32%的accuacy完成了英文項目的競賽，算是滿意了，而且透過閱讀前十名參賽者的解題思路，也學習到了很多新的、有趣的技術..不過還是要等下個機會實際應用，才能更得其精隨。

改天有空再發俄文版的解題思路，俄文的語法真的有夠難 (╯°Д°)╯ ┻━┻ 。

To be continued..

InfiniteWing

Till Dreams Come True

[Kaggle] Text Normalization Challenge - English Language

前言

心得