# Fairseq Hangul to IPA (2024-02-17)

Implement [Hangul to IPA](https://linguisting.tistory.com/84) using Fairseq

Refer to the fairseq [GitHub README](https://github.com/facebookresearch/fairseq/blob/main/examples/translation/README.md#training-a-new-model) for documentation.



1.   Input: syllabic Hangul strings
2.   Output: sequential IPA strings
3.   When training, adjust parameters, especially those related to the number of layers.
4.   For the `--arch` parameter, try `transformer_tiny`. [See the documentation](https://fairseq.readthedocs.io/en/latest/command_line_tools.html#Model%20configuration).
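
The input format in points 1 and 2 can be illustrated with a minimal sketch that decomposes syllabic Hangul into the jamo sequences seen in the corpus samples. The names `to_jamo` and `drop_initial_ieung` are hypothetical, and dropping the silent initial ㅇ is an assumption inferred from sample rows such as `ㅈㅔ ㅣㄹㅡㅁㅡㄴ`; the original preprocessing script is not part of this notebook and may differ (for example in how complex codas are handled).

```python
# Compatibility jamo in Unicode Hangul-syllable order.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"           # 19 onsets
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 codas

SYLLABLE_BASE = 0xAC00   # code point of '가'
SYLLABLE_COUNT = 11172   # 19 * 21 * 28 precomposed syllables


def to_jamo(text, drop_initial_ieung=True):
    """Decompose precomposed Hangul syllables into compatibility jamo.

    Assumption: the silent onset 'ㅇ' is dropped, as the sample data suggests.
    Characters outside the Hangul syllable block are copied unchanged.
    """
    out = []
    for ch in text:
        offset = ord(ch) - SYLLABLE_BASE
        if 0 <= offset < SYLLABLE_COUNT:
            lead, rest = divmod(offset, 21 * 28)
            vowel, tail = divmod(rest, 28)
            onset = CHOSEONG[lead]
            if drop_initial_ieung and onset == "ㅇ":
                onset = ""
            out.append(onset + JUNGSEONG[vowel] + JONGSEONG[tail])
        else:
            out.append(ch)
    return "".join(out)


print(to_jamo("다키포스트"))
print(to_jamo("제 이름은"))
```

The first call produces the same jamo string that is passed to `translate` in the test-driving cell at the end of the notebook.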




## Installation

In [None]:
!pip install fairseq==0.12.2

Collecting fairseq==0.12.2
  Downloading fairseq-0.12.2.tar.gz (9.6 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting hydra-core<1.1,>=1.0.7 (from fairseq==0.12.2)
  Downloading hydra_core-1.0.7-py3-none-any.whl (123 kB)
Collecting omegaconf<2.1 (from fairseq==0.12.2)
  Downloading omegaconf-2.0.6-py3-none-any.whl (36 kB)
Collecting sacrebleu>=1.4.12 (from fairseq==0.12.2)
  Downloading sacrebleu-2.4.0-py3-none-any.whl (106 kB)


### Sentence->subwords tools

In [None]:
!git clone https://github.com/google/sentencepiece.git
!cd sentencepiece && git checkout v0.1.97

Cloning into 'sentencepiece'...
remote: Enumerating objects: 5101, done.
remote: Counting objects: 100% (2133/2133), done.
remote: Compressing objects: 100% (351/351), done.
remote: Total 5101 (delta 1855), reused 1851 (delta 1771), pack-reused 2968
Receiving objects: 100% (5101/5101), 26.81 MiB | 14.85 MiB/s, done.
Resolving deltas: 100% (3508/3508), done.
Note: switching to 'v0.1.97'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 58f256c Updated the document


In [None]:
!mkdir sentencepiece/build
!cd sentencepiece/build && cmake ..
!cd sentencepiece/build && make -j $(nproc)

  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

-- VERSION: 0.1.97
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Not Found TCMalloc: TCMALLOC_LIB-NOTFOUND
-- Configuring done (0.7s)
-- Generating done (0.0s)
-- Build files hav

In [None]:
%env SPM=/content/sentencepiece/build/src
!echo $SPM

env: SPM=/content/sentencepiece/build/src
/content/sentencepiece/build/src


## Train a new model

Training a new model takes three steps:

1.   Parallel corpora
2.   Data pre-processing
3.   Training





### Parallel corpora

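Parallel Korean word lists, one entry per line:

- `words.ur-sr.ur`: the input side, Hangul written as jamo sequences

- `words.ur-sr.sr`: the output side, space-separated IPA symbols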

In [None]:
!wget -O words.ur-sr.ur [path to ur-sr.ur or upload it manually]
!wget -O words.ur-sr.sr [path to ur-sr.sr or upload it manually]

In [None]:
!ls words*

words.ur-sr.sr	words.ur-sr.ur


In [None]:
!head -n 5 words.ur-sr.ur

ㄴㅔ
ㅈㅔ ㅣㄹㅡㅁㅡㄴ
ㅣㄱㅗㅛ
ㅏ ㅕㄹㅕㅅㅓㅅ ㅅㅏㄹ ㅁㅏㄴ ㅕㄹㅕㅅㅓㅅ ㅅㅏㄹㅣㅂㄴㅣㄷㅏ
ㄴㅏㅁㅈㅏㅛ ㄴㅔ


In [None]:
!head -n 5 words.ur-sr.sr

n ɛ
tɕ ɛ i l ɯ m ɯ n
i k u jʌ
ɑ jʌ l l jʌ s ʌ s* ɑ l m ɑ n jʌ l l jʌ s ʌ s* ɑ l i m m i t ɑ
n ɑ m tɕ ɑ ja n ɛ


### Divide into training, development, and test sets

Split `words.ur-sr.*` into

1.   Training: 80%
2.   Development: 10%
3.   Test: 10%

The corpus has 57,657 rows in total, so the splits are:
1. Training: 46,126
2. Dev: 5,765
3. Test: 5,766
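
The arithmetic behind these counts can be checked with a short sketch. The function name `split_sizes` and the rounding rule (round the training share half up, round the development share down, give the remainder to test) are assumptions chosen because they reproduce the numbers above; the notebook does not state how the counts were derived.

```python
def split_sizes(total, train_tenths=8, dev_tenths=1):
    """Return (train, dev, test) row counts for an 80/10/10 style split.

    Integer arithmetic avoids floating-point surprises:
    train is rounded half up, dev is rounded down, test takes the rest.
    """
    train = (total * train_tenths + 5) // 10
    dev = (total * dev_tenths) // 10
    test = total - train - dev
    return train, dev, test


train, dev, test = split_sizes(57657)
print(train, dev, test)   # sizes of the three splits
print(train + dev)        # rows taken by the first `head` call (train + dev)
```

The sum of the training and development sizes is the value used for `head -n 51891` in the next cell, and the test size is the value used for `tail -n 5766`.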

In [None]:
!head -n 51891 words.ur-sr.ur > train_dev.ur-sr.ur
!head -n 51891 words.ur-sr.sr > train_dev.ur-sr.sr
!tail -n 5766 words.ur-sr.ur > test.ur-sr.ur
!tail -n 5766 words.ur-sr.sr > test.ur-sr.sr
!head -n 46126 train_dev.ur-sr.ur > train.ur-sr.ur
!head -n 46126 train_dev.ur-sr.sr > train.ur-sr.sr
!tail -n 5765 train_dev.ur-sr.ur > dev.ur-sr.ur
!tail -n 5765 train_dev.ur-sr.sr > dev.ur-sr.sr

In [None]:
!ls train*

train_dev.ur-sr.sr  train_dev.ur-sr.ur	train.ur-sr.sr	train.ur-sr.ur
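
Because the split relies on `head` and `tail` applied separately to each side, it is worth confirming that every source file still has exactly as many lines as its target file. The sketch below is one way to do that; `count_lines` and `check_parallel` are hypothetical helper names, and the file prefixes are the ones created in the cell above.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines of a UTF-8 text file without loading it into memory."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_parallel(prefix, src="ur", tgt="sr"):
    """Return the shared line count of prefix.src and prefix.tgt.

    Raises ValueError when the two sides differ, which would mean the
    source and target rows are no longer aligned.
    """
    n_src = count_lines(f"{prefix}.{src}")
    n_tgt = count_lines(f"{prefix}.{tgt}")
    if n_src != n_tgt:
        raise ValueError(f"{prefix}: {n_src} source lines vs {n_tgt} target lines")
    return n_src


# Only check the splits that actually exist in the working directory.
for name in ("train", "dev", "test"):
    prefix = f"{name}.ur-sr"
    if Path(f"{prefix}.ur").exists() and Path(f"{prefix}.sr").exists():
        print(name, check_parallel(prefix))
```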


### Sentencepiece model
Train the SentencePiece model on the training split only. Including validation or test data would leak that data into the vocabulary.

In [None]:
!$SPM/spm_train --input="train.ur-sr.ur,train.ur-sr.sr" \
    --vocab_size=16000 \
    --character_coverage=1 \
    --num_threads=8 \
    --max_sentence_length=500 \
    --model_prefix="spm" \
    --model_type=unigram \
    --bos_id=0 --pad_id=1 --eos_id=2 --unk_id=3

sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: train.ur-sr.ur
  input: train.ur-sr.sr
  input_format: 
  model_prefix: spm
  model_type: UNIGRAM
  vocab_size: 16000
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 500
  num_threads: 8
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 3
  bos_id: 0
  eos_id: 2
  pad_id: 1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
  differential_privacy_noise_level: 0
  differential

In [None]:
!head -n 7 spm.vocab

<s>	0
<pad>	0
</s>	0
<unk>	0
▁k	-2.56554
▁ɑ	-2.58604
▁n	-2.68373


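Create the dictionary file `dict.txt`, which is passed to `fairseq-preprocess` through the `--srcdict` parameter. The command skips the first four lines of `spm.vocab` (the special tokens, which fairseq adds on its own) and gives every remaining piece a placeholder count of 100.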

In [None]:
!cut -f1 "spm.vocab" | tail -n +5 | sed "s/$/ 100/g" > "dict.txt"
!head -n 3 dict.txt

▁k 100
▁ɑ 100
▁n 100
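
For readers less familiar with the `cut | tail | sed` pipeline, the same transformation can be sketched in Python. The function name `vocab_to_dict_lines` and its parameters are hypothetical; the sample lines are the first seven rows of `spm.vocab` shown above.

```python
def vocab_to_dict_lines(vocab_lines, n_special=4, dummy_count=100):
    """Convert SentencePiece vocab lines ("piece<TAB>score") to fairseq
    dictionary lines ("piece count").

    The first n_special lines (special tokens) are skipped, mirroring
    `tail -n +5`, and every piece receives the same placeholder count,
    mirroring `sed "s/$/ 100/g"`.
    """
    pieces = [line.split("\t")[0] for line in vocab_lines[n_special:] if line.strip()]
    return [f"{piece} {dummy_count}" for piece in pieces]


sample_vocab = [
    "<s>\t0",
    "<pad>\t0",
    "</s>\t0",
    "<unk>\t0",
    "▁k\t-2.56554",
    "▁ɑ\t-2.58604",
    "▁n\t-2.68373",
]

for line in vocab_to_dict_lines(sample_vocab):
    print(line)
```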


### Dataset pre-processing

Here, we have 2 tasks to do:

1.   Use the SentencePiece model to pre-process the data. It splits each line into **subwords** and marks the first subword of every word with the special symbol `▁`.
2.   Binarize the data into the fairseq format.

In [None]:
!$SPM/spm_encode --model="spm.model" --output_format=piece < "train.ur-sr.ur" > train.ur-sr.spm.ur
!$SPM/spm_encode --model="spm.model" --output_format=piece < "train.ur-sr.sr" > train.ur-sr.spm.sr
!$SPM/spm_encode --model="spm.model" --output_format=piece < "dev.ur-sr.ur" > dev.ur-sr.spm.ur
!$SPM/spm_encode --model="spm.model" --output_format=piece < "dev.ur-sr.sr" > dev.ur-sr.spm.sr

In [None]:
!head -n 3 train.ur-sr.spm*

==> train.ur-sr.spm.sr <==
▁n ▁ɛ
▁tɕ ▁ɛ ▁i ▁l ▁ɯ ▁m ▁ɯ ▁n
▁i ▁k ▁u ▁jʌ

==> train.ur-sr.spm.ur <==
▁네
▁제 ▁ᅵ르므ᄂ
▁ᅵ 고ᅭ
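
The encoded source side above contains composed syllables such as `네`, although the raw input was the jamo sequence `ㄴㅔ`. A likely explanation is SentencePiece's default normalization rule (`nmt_nfkc`), which is based on Unicode NFKC: compatibility jamo are mapped to conjoining jamo, and an onset followed by a vowel is then composed into a syllable. The sketch below reproduces that effect with Python's standard library only; it illustrates NFKC itself and is not SentencePiece's own code. The helper name `nfkc` is hypothetical.

```python
import unicodedata


def nfkc(text):
    """Apply Unicode NFKC normalization to a string."""
    return unicodedata.normalize("NFKC", text)


raw = "ㄴㅔ"                  # two compatibility jamo, as in words.ur-sr.ur
normalized = nfkc(raw)

print(normalized)             # the composed form after normalization
print(len(raw), len(normalized))
print([hex(ord(c)) for c in normalized])
```

If the jamo should stay separate, `spm_train` accepts `--normalization_rule_name=identity`, which disables this normalization. Whether that improves results for this task has not been tested here.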


`--source-lang "ur"` and `--target-lang "sr"` name the input and output sides, respectively, as if we were translating from UR to SR.

In [None]:
!pip install tensorboardX
!fairseq-preprocess \
    --source-lang "ur" \
    --target-lang "sr" \
    --trainpref "train.ur-sr.spm" \
    --validpref "dev.ur-sr.spm" \
    --destdir "bin" \
    --joined-dictionary \
    --srcdict "dict.txt" \
    --bpe sentencepiece \
    --workers 8


Collecting tensorboardX
  Downloading tensorboardX-2.6.2.2-py2.py3-none-any.whl (101 kB)
Installing collected packages: tensorboardX
Successfully installed tensorboardX-2.6.2.2
2024-02-19 03:48:15.546691: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-19 03:48:15.546749: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-19 03:48:15.548068: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for pl

In [None]:
!ls bin

dict.sr.txt	train.ur-sr.sr.bin  train.ur-sr.ur.idx	valid.ur-sr.ur.bin
dict.ur.txt	train.ur-sr.sr.idx  valid.ur-sr.sr.bin	valid.ur-sr.ur.idx
preprocess.log	train.ur-sr.ur.bin  valid.ur-sr.sr.idx


Binarized data (raw bytes, so not human-readable):

In [None]:
!head -n 3 bin/train.ur-sr.sr.*

==> bin/train.ur-sr.sr.bin <==
 	   	                   
             
                   	   	                

==> bin/train.ur-sr.sr.idx <==
MMIDIDX         .�         	         	                        	                     C                           &   '      
   !         
               '   8   5         


In [None]:
!ls

bin	      dev.ur-sr.ur  drive	 train.ur-sr.sr  words.ur-sr.sr
dev.ur-sr.sr  dict.txt	    sample_data  train.ur-sr.ur  words.ur-sr.ur


### Training

Training is the longest step, so make sure a GPU is enabled. The hyperparameters below are a starting point and may work as they are, but they have not been tuned and may need adjusting.
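
To get a feel for the learning-rate settings in the command below (`--lr 5e-4`, `--lr-scheduler inverse_sqrt`, `--warmup-updates 4000`, `--warmup-init-lr 1e-07`), here is a simplified sketch of an inverse square root schedule: a linear warmup from the initial rate to the peak rate, followed by decay proportional to the inverse square root of the update number. It is a re-implementation for illustration, not fairseq's code, and the function name `inverse_sqrt_lr` is hypothetical.

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_updates=4000, warmup_init_lr=1e-7):
    """Learning rate at a given update for an inverse-sqrt schedule.

    Warmup: linear increase from warmup_init_lr to peak_lr.
    Afterwards: peak_lr * sqrt(warmup_updates / step).
    """
    if step < warmup_updates:
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    return peak_lr * (warmup_updates / step) ** 0.5


for step in (0, 2000, 4000, 16000, 64000):
    print(step, f"{inverse_sqrt_lr(step):.2e}")
```

With these settings the rate peaks at update 4,000 and has halved by update 16,000, so a short run may spend most of its updates in warmup. This is one reason to revisit `--warmup-updates` when the dataset or epoch count is small.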

In [None]:
!fairseq-train \
        "bin" \
        --fp16 \
        --arch transformer_tiny \
        --task translation \
        --source-lang ur \
        --target-lang sr \
        --share-all-embeddings \
        --reset-optimizer \
        --optimizer adam \
        --adam-betas '(0.9, 0.98)' \
        --clip-norm 0.0 \
        --lr 5e-4 \
        --lr-scheduler inverse_sqrt \
        --warmup-updates 4000 \
        --warmup-init-lr 1e-07 \
        --dropout 0.1 \
        --weight-decay 0.0 \
        --criterion cross_entropy \
        --save-dir "model_output_transformer" \
        --log-format json \
        --log-interval 100 \
        --max-tokens 8000 \
        --max-epoch 200 \
        --seed 3921 \
        --batch-size 32


2024-02-19 05:15:08.930277: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-19 05:15:08.930325: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-19 05:15:08.931705: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-19 05:15:08.939169: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-19 05:15:11 | INFO | numexpr.utils | 

## Test driving

source: https://github.com/facebookresearch/fairseq/blob/main/examples/translation/README.md#example-usage-torchhub

In [None]:
from fairseq.models.transformer import TransformerModel

DATA_BIN = '/content/bin/'
MODEL = '/content/model_output_transformer/'


ur2sr = TransformerModel.from_pretrained(
  MODEL,
  checkpoint_file='checkpoint200.pt',
  data_name_or_path=DATA_BIN,
  bpe='sentencepiece',
  sentencepiece_model='/content/spm.model',
  source_lang='ur',
  target_lang='sr'
)

# from training set
print(f"다키포스트:\t{ur2sr.translate('ㄷㅏㅋㅣㅍㅗㅅㅡㅌㅡ')}")  # 't ɑ kʰ i pʰ o s ɯ t ɯ'


북스:	p u k tɕh o k s ɯ
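
The held-out test split created earlier is not evaluated in this notebook. One possible way to score predictions against `test.ur-sr.sr` is a phoneme error rate: the edit distance between the predicted and reference IPA symbol sequences, divided by the number of reference symbols. The sketch below is an assumption about how such an evaluation could look, not part of the original workflow; `edit_distance` and `phoneme_error_rate` are hypothetical names, and the example strings are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """Total edit distance divided by total reference length.

    Both arguments are lists of space-separated IPA strings.
    """
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    total = sum(len(r.split()) for r in references)
    return errors / total if total else 0.0


# Illustrative strings only, not real model output.
refs = ["t ɑ kʰ i", "n ɛ"]
hyps = ["t ɑ k i", "n ɛ"]
print(phoneme_error_rate(refs, hyps))
```

In practice, `hypotheses` would be filled by calling `ur2sr.translate` on each line of `test.ur-sr.ur`.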
