# json2binidx_tool

**Repository Path**: uniartisan2018/json2binidx_tool

## Basic Information

- **Project Name**: json2binidx_tool
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-04-16
- **Last Updated**: 2025-05-07

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# jsonl to binidx tool

This repository is a greatly simplified version of https://github.com/EleutherAI/gpt-neox that ONLY converts .jsonl files into .bin and .idx files. It can be used for dataset preparation for the RWKV model (see https://github.com/BlinkDL/RWKV-LM).

## The current RWKV models use the GPT-NeoX tokenizer 20B_tokenizer.json
```
python tools/preprocess_data.py --input ./sample.jsonl --output-prefix ./data/sample --vocab ./20B_tokenizer.json --dataset-impl mmap --tokenizer-type HFTokenizer --append-eod
```

## The multilingual rwkv-4-world models use a new tokenizer, rwkv_vocab_v20230424.txt
```
python tools/preprocess_data.py --input ./sample.jsonl --output-prefix ./data/sample --vocab ./rwkv_vocab_v20230424.txt --dataset-impl mmap --tokenizer-type RWKVTokenizer --append-eod
```

Sample jsonl input (one document per line):
```json
{"text": "This is the first document."}
{"text": "Hello\nWorld"}
{"text": "1+1=2\n1+2=3\n2+2=4"}
```
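A file in this format can be generated by code like this:
```python
import json  # meta, text, and out (an open text file) come from the caller
ss = json.dumps({"meta": meta, "text": text}, ensure_ascii=False)
out.write(ss + "\n")  # one JSON object per line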
```
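
The snippet above can be expanded into a small self-contained sketch that writes the sample documents to a jsonl file and checks that every line parses as a JSON object with a string `text` field. The helper names `write_jsonl` and `check_jsonl` and the temporary output path are hypothetical, chosen only for illustration; they are not part of this tool.

```python
import json
import os
import tempfile


def write_jsonl(docs, path):
    """Write one JSON object per line (hypothetical helper, not part of the tool)."""
    with open(path, "w", encoding="utf-8") as out:
        for doc in docs:
            # json.dumps escapes newlines inside strings as \n, so each
            # document stays on a single physical line of the file.
            out.write(json.dumps(doc, ensure_ascii=False) + "\n")


def check_jsonl(path, key="text"):
    """Return the number of documents; raise ValueError on a malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            # json.loads raises json.JSONDecodeError (a ValueError subclass)
            # if the line, including a blank one, is not valid JSON.
            doc = json.loads(line)
            if not isinstance(doc, dict) or not isinstance(doc.get(key), str):
                raise ValueError(f"line {lineno}: missing string field {key!r}")
            count += 1
    return count


docs = [
    {"text": "This is the first document."},
    {"text": "Hello\nWorld"},
    {"text": "1+1=2\n1+2=3\n2+2=4"},
]
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
write_jsonl(docs, path)
n_docs = check_jsonl(path)
print(n_docs)  # prints 3
```

Running a check like this before `tools/preprocess_data.py` can catch a stray unescaped newline or a missing `text` field early, before tokenization starts.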