What are the data formats of the `dataset` and `vocab` folders? #81

The pre-training [README](https://github.com/fastnlp/CPT/blob/master/pretrain/README.md) says that `dataset`, `vocab`, and `roberta_zh` have to be prepared before training.

Are there any examples of the files in the `dataset` and `vocab` folders?

Also, what do you mean by "Place the checkpoint of Chinese RoBERTa"? I would like to train Chinese BART.

Lastly, if I want to replace the `Jieba` tokenizer with my own custom tokenizer, how can I do so? Thanks.
