GitHub - khangt1k25/Punctuation-Restoration: Project ML&DM

Punctuation Restoration for Vietnamese

Our implementation of seq2punct. The task is to restore missing punctuation in Vietnamese sentences. We restrict the problem to two marks: the comma and the period.

Frameworks used in the project

Python 3
PyTorch
Flask
Matplotlib

Result

Dataset format

Input: a sequence of words.

Output: a sequence of punctuation labels, one per word.

Example: the text "Một đêm nọ tôi nằm mơ thấy em" has the label sequence "0 1 0 0 0 0 0 2", where 0, 1, and 2 denote space (no punctuation), comma, and period respectively.
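As a rough illustration of this label scheme, the sketch below converts punctuated text into a word sequence and a label sequence, and back again. It is not taken from the repository: the helper names `encode` and `decode` are hypothetical, and it assumes that each label describes the punctuation mark that directly follows its word.

```python
# Label ids as in the dataset format: 0 = space (no punctuation), 1 = comma, 2 = period.
PUNCT_TO_LABEL = {",": 1, ".": 2}
LABEL_TO_PUNCT = {0: "", 1: ",", 2: "."}


def encode(text):
    """Split punctuated text into (words, labels).

    Assumption: label i is the mark attached to the end of word i.
    """
    words, labels = [], []
    for token in text.split():
        label = 0
        if token[-1] in PUNCT_TO_LABEL:
            # Strip one trailing comma or period and record it as the label.
            label = PUNCT_TO_LABEL[token[-1]]
            token = token[:-1]
        if token:  # skip tokens that were punctuation only
            words.append(token)
            labels.append(label)
    return words, labels


def decode(words, labels):
    """Re-attach punctuation to each word according to its label."""
    return " ".join(w + LABEL_TO_PUNCT[l] for w, l in zip(words, labels))


words, labels = encode("Tôi là Khang, tôi là sinh viên.")
restored = decode(words, labels)
```

Here `words` holds the unpunctuated input a model would see, `labels` holds the target it would predict, and `decode` shows how predicted labels could be merged back into readable text.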

You can download the preprocessed data from: https://drive.google.com/drive/u/2/folders/1NfpLGRQAJPlURQa-3G6RFR2cKd1KKavt?fbclid=IwAR1-JV8NOTNGcbVtj7BkUbtNTAixxhisg4y1-qqeeOjzZAwaGcUlzAk3jtg

Usage

Clone this repo.
Install the requirements with pip (pip install -r requirements.txt).
Create a dumps/ folder for saving model checkpoints.
You can download the pretrained model from: https://drive.google.com/drive/folders/1pKeP6YGsYveJNiAhl9O8vdG_OAoIx1Gk?usp=sharing

Train

To train the model yourself (the values shown are examples):

python main.py --model RNN --n_layers 2 --embedding_size 256 --hidden_dim 256
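For readers unfamiliar with how such flags are usually consumed, here is a minimal sketch of a parser that accepts the four options from the training command. It is an assumption about how a script like this could be wired, not the repository's actual main.py. The `build_parser` helper is hypothetical, and the defaults are simply the example values from the command.

```python
import argparse


def build_parser():
    """Hypothetical parser accepting the flags shown in the training command."""
    parser = argparse.ArgumentParser(description="Train a punctuation restoration model")
    parser.add_argument("--model", default="RNN", help="model type (assumed default)")
    parser.add_argument("--n_layers", type=int, default=2, help="number of recurrent layers")
    parser.add_argument("--embedding_size", type=int, default=256, help="word embedding size")
    parser.add_argument("--hidden_dim", type=int, default=256, help="recurrent hidden size")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args(
    ["--model", "RNN", "--n_layers", "2", "--embedding_size", "256", "--hidden_dim", "256"]
)
print(vars(args))
```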

Inference

To run inference, set the path to the pretrained model and run the following:

from infere import pipeline

text = 'Tôi là Khang tôi là sinh viên trường Bách Khoa'

res = pipeline(text)

Inference with BERT

We use pretrained PhoBERT to extract features from the text, then add an RNN and an MLP to classify the punctuation. This reaches about 95% accuracy.

Train with training_bert.ipynb. Run inference with infere_bert.py.
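The classifier that sits on top of the extracted features can be sketched as below. This is an illustrative assumption rather than the code in training_bert.ipynb: the class name `PunctHead`, the choice of a bidirectional GRU, and the layer sizes are all hypothetical. Random tensors stand in for the PhoBERT features so the block runs without downloading a model, and the feature size of 768 assumes a base-sized encoder.

```python
import torch
from torch import nn


class PunctHead(nn.Module):
    """RNN + MLP over precomputed token features (hypothetical sketch)."""

    def __init__(self, feature_dim=768, hidden_dim=256, n_layers=2, n_classes=3):
        super().__init__()
        # Bidirectional GRU reads the feature sequence in both directions.
        self.rnn = nn.GRU(feature_dim, hidden_dim, num_layers=n_layers,
                          batch_first=True, bidirectional=True)
        # MLP maps each position to scores for space, comma, and period.
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, features):
        # features: (batch, seq_len, feature_dim)
        out, _ = self.rnn(features)
        return self.mlp(out)  # (batch, seq_len, n_classes)


# Random features stand in for encoder outputs: 2 sentences of 8 tokens each.
features = torch.randn(2, 8, 768)
logits = PunctHead()(features)
labels = logits.argmax(dim=-1)  # predicted label id per token
```

In a real pipeline the subword tokens produced by the encoder would also need to be aligned back to words before labels are attached, a step this sketch leaves out.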

Run the demo with Flask

python app.py

