InfiniteWing

Till Dreams Come True

  • 2017-11-30

    [Kaggle] Text Normalization Challenge - English Language

    1. Introduction
    2. Takeaways

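    Introduction

    It's the end of the month and I had been meaning to write things down again, so I finally made time to write up my notes on the Text Normalization Challenge, which ended ten days ago (honestly, I was just being lazy =3=).
    Like the previous one, this competition was hosted on Kaggle. According to the competition description, the task is to build a machine learning model that converts written text into its spoken form. For example: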

    Example 1.
    Original: A baby giraffe is 6ft tall and weighs 150lb.
    Normalized: A baby giraffe is six feet tall and weighs one hundred fifty pounds sil

    Example 2.
    Original: $22,750
    Normalized: twenty two thousand seven hundred fifty dollars

    Example 3.
    Original: September 5, 1895
    Normalized: september fifth eighteen ninety five

    Text Normalization Challenge - English Language

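    Takeaways

    According to the competition's definition, tokens in the English track fall into 17 classes, such as dates, numbers, and addresses. Within each class the conversion follows recognizable patterns, but ambiguous cases do come up. For example, the same token 1972 becomes nineteen seventy two in the DATE class, but one thousand nine hundred seventy two in the CARDINAL class.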
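To make the 1972 ambiguity concrete, here is a minimal sketch of the two readings of a four-digit token. The function names are made up for illustration and are not taken from the competition code. The year reading only covers years whose last two digits are 10 or more; years such as 1900 or 2005 follow other conventions and are deliberately left out.

```python
ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
        'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
        'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen']
TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy',
        'eighty', 'ninety']


def two_digits(n):
    """Spell out an integer in 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + ' ' + ONES[ones]


def as_cardinal(n):
    """Spell out 1000..9999 as a plain number (CARDINAL reading)."""
    if not 1000 <= n <= 9999:
        raise ValueError('this sketch only covers four-digit numbers')
    thousands, rest = divmod(n, 1000)
    hundreds, last_two = divmod(rest, 100)
    words = [ONES[thousands], 'thousand']
    if hundreds:
        words += [ONES[hundreds], 'hundred']
    if last_two:
        words.append(two_digits(last_two))
    return ' '.join(words)


def as_year(n):
    """Spell out a four-digit year as two pairs of digits (DATE reading)."""
    first, last = divmod(n, 100)
    if not 1000 <= n <= 9999 or last < 10:
        raise ValueError('year form not covered by this sketch')
    return two_digits(first) + ' ' + two_digits(last)


print(as_year(1972))      # nineteen seventy two
print(as_cardinal(1972))  # one thousand nine hundred seventy two
```

Both outputs are valid normalizations of the same string, so the surrounding context has to decide which one applies.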

    I had already written up my general approach in English, so I am reproducing that write-up here:


    My solution builds on BingQing Wei's public kernel, which I then improved in several steps:

    1. Use xgboost to predict the class of each test token:
    The model is similar to the kernel XGboost With Context Label Data (ACC: 99.637%), also by BingQing Wei (big thanks for his work).

    In addition, I use an extra xgboost model to predict whether a four-digit number is 'DATE' or 'CARDINAL'.
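The write-up does not list the features fed to the classifier. As a rough illustration of what "context" features could look like, the sketch below encodes a token together with its left and right neighbours as fixed-width character codes. The encoding, the width of 10, and the function names are all assumptions made for this example, not the kernel's actual code.

```python
def encode_token(token, width=10):
    """Map a token to `width` character codes, zero-padded on the right."""
    codes = [ord(ch) for ch in token[:width]]
    return codes + [0] * (width - len(codes))


def context_features(tokens, i, width=10):
    """Feature vector for tokens[i]: previous token, the token itself, next token."""
    prev_tok = tokens[i - 1] if i > 0 else ''
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else ''
    return (encode_token(prev_tok, width)
            + encode_token(tokens[i], width)
            + encode_token(next_tok, width))


sentence = ['Born', 'in', '1972', '.']
features = context_features(sentence, 2)
print(len(features))  # 30
```

Vectors like these, paired with the class labels from the training data, could then be passed to a gradient-boosted classifier. The same kind of input would also suit the extra DATE versus CARDINAL model, since a neighbouring word like "in" hints at a year.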

    2. For some classes, use a customized normalization function:
    I handle MEASURE, DATE, MONEY, DECIMAL, CARDINAL, and DIGIT this way because they have specific forms. Each customized function reaches between 98.9% and 99.7% accuracy. (My functions cannot handle some rare cases, such as Sept. 21th 2017. I wonder whether the top teams had a smarter way.)
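One simple way to wire the predicted class to a class-specific function is a lookup table with a pass-through fallback. The sketch below is only an assumption about how such routing might be organized; `normalize_token`, `spell_digits`, and `NORMALIZERS` are hypothetical names, and the DIGIT rule shown is a toy stand-in rather than the solution's real function.

```python
DIGIT_NAMES = {'0': 'o', '1': 'one', '2': 'two', '3': 'three', '4': 'four',
               '5': 'five', '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine'}


def spell_digits(token):
    """Toy DIGIT normalizer: read each digit out loud, skipping other characters."""
    return ' '.join(DIGIT_NAMES[ch] for ch in token if ch in DIGIT_NAMES)


def normalize_token(token, token_class, normalizers):
    """Apply the normalizer registered for the class, or keep the token unchanged."""
    func = normalizers.get(token_class)
    if func is None:
        return token
    return func(token)


# Only one class is registered here; a full solution would add one entry per class.
NORMALIZERS = {'DIGIT': spell_digits}

print(normalize_token('2017', 'DIGIT', NORMALIZERS))     # two o one seven
print(normalize_token('giraffe', 'PLAIN', NORMALIZERS))  # giraffe
```

Keeping the fallback as "return the token unchanged" means a token whose class has no registered function is simply left as written.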

    For example, to deal with the 'DECIMAL' class, I use the following function to normalize it:

    def decimal2word(key):
        # Relies on is_decimal() and digit2word(), helpers defined elsewhere
        # in the solution and not shown in this post.
        #
        # 100% acc if change
        #
        if len(key.split()) == 2:
            # e.g. 0.21 million
            unit_words = ['hundred', 'thousand', 'million', 'billion']
            if not is_decimal(key.split()[0]):
                return key
            else:
                if key.split()[1].lower() in unit_words:
                    # Normalize the number, then append the lowercased unit word
                    return decimal2word(key.split()[0]) + ' ' + key.split()[1].lower()
                else:
                    return key
        else:
            if not is_decimal(key):
                return key
            digit_dict = {'0': 'o', '1': 'one', '2': 'two', '3': 'three', '4': 'four', '5': 'five', '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine'}
            out = []
            if key[0] == '.':
                # e.g. .021 to point o two one
                out.append('point')
                for v in key.replace('.', ''):
                    out.append(digit_dict[v])
            else:
                # Integer part is read as a number, fractional part digit by digit
                n1, n2 = str(int(key.split('.')[0])), key.split('.')[1]
                out.append(digit2word(n1))
                out.append('point')
                if len(n2) == 1 and n2[0] == '0':
                    # e.g. 3.0 to three point zero
                    out.append('zero')
                else:
                    for v in n2:
                        out.append(digit_dict[v])
            word = ' '.join(out)
            return word
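The function above calls two helpers, is_decimal and digit2word, that the post does not show. The sketch below is a guess at what minimal stand-ins might look like, not the author's code. It assumes is_decimal accepts only strings of the form digits, dot, digits (the integer part may be empty), and that digit2word spells out a non-negative integer below one trillion.

```python
import re

ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
        'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
        'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen']
TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy',
        'eighty', 'ninety']
SCALES = ['', 'thousand', 'million', 'billion']

DECIMAL_RE = re.compile(r'[0-9]*\.[0-9]+')


def is_decimal(text):
    """True for strings such as '0.21' or '.021' (assumed definition)."""
    return DECIMAL_RE.fullmatch(text) is not None


def below_thousand(n):
    """Words for an integer in 1..999."""
    words = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        words += [ONES[hundreds], 'hundred']
    if rest >= 20:
        tens, ones = divmod(rest, 10)
        words.append(TENS[tens])
        if ones:
            words.append(ONES[ones])
    elif rest:
        words.append(ONES[rest])
    return words


def digit2word(text):
    """Spell out a non-negative integer string below one trillion."""
    n = int(text)
    if n == 0:
        return 'zero'
    words = []
    scale = 0
    while n:
        n, group = divmod(n, 1000)
        if group:
            chunk = below_thousand(group)
            if SCALES[scale]:
                chunk.append(SCALES[scale])
            words = chunk + words
        scale += 1
    return ' '.join(words)


print(is_decimal('0.21'))   # True
print(digit2word('22750'))  # twenty two thousand seven hundred fifty
```

With stand-ins like these, decimal2word('0.21') would produce zero point two one and decimal2word('.021') would produce point o two one. Whether the original helpers read a leading 0 as "zero" is an assumption here.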

    3. Use xgboost to deal with binary ambiguous cases:
    Binary ambiguous cases are tokens such as '-' and ':', which have two possible normalizations: the original character or 'to'. The xgboost model handles them with ~98% precision and ~99.3% recall.
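The post does not say which inputs this binary model sees. As an illustration only, the sketch below builds a few hand-picked binary features for a separator token from its neighbours, on the assumption that numeric neighbours (as in 1914 - 1918) make the 'to' reading more likely. The feature set and the names `looks_numeric` and `separator_features` are invented for this example.

```python
def looks_numeric(token):
    """True for tokens made of ASCII digits, allowing commas and one dot."""
    digits = token.replace(',', '').replace('.', '', 1)
    return digits != '' and all(ch in '0123456789' for ch in digits)


def separator_features(tokens, i):
    """Binary features for the separator tokens[i] ('-' or ':')."""
    prev_tok = tokens[i - 1] if i > 0 else ''
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else ''
    return [
        int(tokens[i] == '-'),                                # which separator
        int(looks_numeric(prev_tok)),                         # number on the left
        int(looks_numeric(next_tok)),                         # number on the right
        int(len(prev_tok) == 4 and looks_numeric(prev_tok)),  # left looks like a year
        int(len(next_tok) == 4 and looks_numeric(next_tok)),  # right looks like a year
    ]


print(separator_features(['1914', '-', '1918'], 1))   # [1, 1, 1, 1, 1]
print(separator_features(['well', '-', 'known'], 1))  # [1, 0, 0, 0, 0]
```

A binary classifier trained on such vectors, with the label "becomes 'to'" versus "stays as is", would be one way to reproduce the setup described in this step.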

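    In the end I finished the English track with 99.32% accuracy, which I am fairly happy with. Reading the write-ups from the top ten teams also taught me a lot of new and interesting techniques, though I will need a chance to actually apply them before I really get the hang of them.

    I'll post my approach for the Russian track some other day when I have time. Russian grammar is seriously hard (╯°Д°)╯ ┻━┻.

    That's a wrap!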
    To be continued..
    Posted at 2017-11-30 03:42:24