# json2binidx_tool

**Repository Path**: uniartisan2018/json2binidx_tool

## Basic Information

- **Project Name**: json2binidx_tool
- **Default Branch**: main
- **Created**: 2025-04-16
- **Last Updated**: 2025-05-07

## README

# jsonl to binidx tool

This repository is a greatly simplified version of https://github.com/EleutherAI/gpt-neox that ONLY converts .jsonl files into .bin and .idx files. It can be used to prepare datasets for RWKV models (see https://github.com/BlinkDL/RWKV-LM).

## The current RWKV models use the GPT-NeoX tokenizer 20B_tokenizer.json

```
python tools/preprocess_data.py --input ./sample.jsonl --output-prefix ./data/sample --vocab ./20B_tokenizer.json --dataset-impl mmap --tokenizer-type HFTokenizer --append-eod
```

## The multilingual rwkv-4-world models use a new tokenizer, rwkv_vocab_v20230424.txt

```
python tools/preprocess_data.py --input ./sample.jsonl --output-prefix ./data/sample --vocab ./rwkv_vocab_v20230424.txt --dataset-impl mmap --tokenizer-type RWKVTokenizer --append-eod
```

A sample of the jsonl format (one line per document):

```json
{"text": "This is the first document."}
{"text": "Hello\nWorld"}
{"text": "1+1=2\n1+2=3\n2+2=4"}
```

Such a file can be generated by code like this:

```python
import json
# meta, text: one document's metadata and content; out: a file opened for writing as UTF-8
ss = json.dumps({"meta": meta, "text": text}, ensure_ascii=False)
out.write(ss + "\n")
```
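
The jsonl format described above (one JSON object per line, with the document body in a "text" field) can be written and sanity-checked before running the conversion. The following is a minimal sketch, not part of this repository: the helper names `write_jsonl` and `validate_jsonl`, the temporary output path, and the example `meta` value are all hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile


def write_jsonl(docs, path):
    """Write one JSON object per line; each doc is a dict with a "text" key."""
    with open(path, "w", encoding="utf-8") as out:
        for doc in docs:
            # json.dumps escapes embedded newlines as \n, so each document stays on one line.
            out.write(json.dumps(doc, ensure_ascii=False) + "\n")


def validate_jsonl(path):
    """Return the number of documents; raise ValueError on a malformed line."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)  # raises json.JSONDecodeError (a ValueError) on bad JSON
            if not isinstance(record, dict) or not isinstance(record.get("text"), str):
                raise ValueError(f"line {lineno}: expected an object with a string 'text' field")
            count += 1
    return count


# Example documents; the "meta" key mirrors the generation snippet above (value is made up).
docs = [
    {"text": "This is the first document."},
    {"text": "Hello\nWorld"},
    {"meta": {"source": "example"}, "text": "1+1=2\n1+2=3\n2+2=4"},
]
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")  # hypothetical output location
write_jsonl(docs, path)
num_docs = validate_jsonl(path)
print(num_docs)
```

Checking the file up front is a design convenience: a document that accidentally spans several physical lines, or a record without a string "text" field, is reported with its line number before the tokenization step is started.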