# llmcoderpk

**Repository Path**: samwan_9996/llmcoderpk

## Basic Information

- **Project Name**: llmcoderpk
- **Description**: LLM Code Generation Capability Comparison Tool - Where LLMs Judge LLMs
- **Primary Language**: Python
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-04-05
- **Last Updated**: 2025-04-05

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# LLM Code Generation Comparison Tool
中文 | English
A comprehensive framework for evaluating and comparing the code generation capabilities of different Large Language Models (LLMs).

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
- [Output](#output)
- [Evaluation Metrics](#evaluation-metrics)
- [Visualization](#visualization)
- [Project Structure](#project-structure)
- [Supported Providers and Token Limits](#supported-providers-and-token-limits)
- [Examples](#examples)
- [Contributing](#contributing)
- [License](#license)

## Overview

This tool provides a systematic approach to evaluating and comparing the code generation capabilities of various Large Language Models. It allows you to submit programming tasks to multiple LLMs, collect their generated code, evaluate the code quality across multiple dimensions, and generate comprehensive reports with visualizations.

## Features

- **Multi-Model Support**: Test multiple LLMs simultaneously with the same programming tasks
- **Comprehensive Evaluation**: Assess code across multiple dimensions, including correctness, quality, readability, and efficiency
- **Cross-Evaluation**: Models can evaluate each other's code for a more balanced assessment
- **Detailed Visualization**: Generate radar charts, bar graphs, and comparison tables
- **HTML Reports**: Create interactive HTML reports with detailed analysis
- **Customizable Metrics**: Adjust evaluation weights and criteria based on your needs
- **Code Execution**: Optionally execute generated code to verify functionality
- **Extensible Framework**: Easily add new models or evaluation criteria

## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/llm_compare.git
cd llm_compare

# Install dependencies
pip install -r requirements.txt
```

## Usage

### Usage help

```bash
$ python main.py
usage: main.py [-h] [-c CONFIG] [-d RESULTS_DIR] [-e EVAL_MODELS] [-j JUDGE_MODELS]
               [-l LANGUAGE] [-p PROMPT] [-P PROMPT_FILE] [-t MAX_TOKENS]

LLM Code Generation Capability Comparison Tool

options:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Configuration file path
  -d RESULTS_DIR, --results-dir RESULTS_DIR
                        Results save directory
  -e EVAL_MODELS, --eval-models EVAL_MODELS
                        Models to evaluate, comma separated (format: [provider:]model_name)
  -j JUDGE_MODELS, --judge-models JUDGE_MODELS
                        Judge models, comma separated (format: [provider:]model_name);
                        if not specified, the evaluation models are used
  -l LANGUAGE, --language LANGUAGE
                        Programming language
  -p PROMPT, --prompt PROMPT
                        Code generation prompt
  -P PROMPT_FILE, --prompt-file PROMPT_FILE
                        Code generation prompt file path
  -t MAX_TOKENS, --max-tokens MAX_TOKENS
                        Max tokens for code generation (minimum 200)
```

NOTE:

- Model names should follow the format `[provider:]model_name`.
- Model names and providers should match the entries in `config.toml`.
- For example:
  - `ollama:qwen2.5-coder:0.5b` refers to the qwen2.5-coder model (0.5b variant) from the ollama provider; the model should appear on the ollama host when running `ollama list`.
  - `siliconflow:Qwen/Qwen2.5-Coder-7B-Instruct` refers to the Qwen2.5-Coder-7B-Instruct model from the siliconflow provider; likewise, `modelscope:deepseek-ai/DeepSeek-V3-0324` refers to the DeepSeek-V3-0324 model from the modelscope provider.
- If the provider is omitted, the default provider from the configuration file is used.
- If no judge models are specified, the evaluation models are also used as judge models.
- If both judge models and evaluation models are specified, the judge models evaluate the code produced by the evaluation models.

## Configuration

The tool uses TOML configuration files located in the `config/` directory. The main configuration file is `config.toml`. Make a copy of `config_example.toml`, rename it to `config.toml`, and adjust it to your needs. See the [example configuration](config/config_example.toml).
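If you want to sanity-check your edited configuration before a run, a few lines of Python are enough to load it and confirm that the metric weights sum to 1 (see [Evaluation Metrics](#evaluation-metrics) below). This is only an illustrative sketch: the `metrics` table name and weight keys are assumptions, so substitute the names actually used in `config_example.toml`.

```python
# Illustrative sanity check for config.toml (requires Python 3.11+ for tomllib).
# NOTE: the "metrics" table and its keys are assumed names for this sketch;
# use whatever names config_example.toml actually defines.
import tomllib

with open("config/config.toml", "rb") as f:
    config = tomllib.load(f)

weights = config.get("metrics", {})  # e.g. {"correctness": 0.4, "quality": 0.3, ...}
total = sum(weights.values())
if abs(total - 1.0) > 1e-6:
    raise SystemExit(f"Metric weights must sum to 1, got {total}")
print("Metric weights OK:", weights)
```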
## Output

The tool generates the following outputs in the specified results directory:

- `report.html`: Interactive HTML report with visualizations
- `raw_results.json`: Raw evaluation data in JSON format
- `analysis.json`: Processed analysis results
- `figures/`: Directory containing visualization charts
- `code/`: Directory containing generated code files

## Evaluation Metrics

The tool evaluates code along four primary dimensions:

1. **Correctness (40%)**: Whether the code correctly implements the required functionality
2. **Quality (30%)**: The overall quality of the code, including proper structure and best practices
3. **Efficiency (20%)**: How efficiently the code performs in terms of time and space complexity
4. **Readability (10%)**: How easy the code is to read and understand

You can adjust the weights of these metrics in `config.toml`. However, all metrics must be configured and their weights must sum to 1.

## Visualization

The tool generates several visualizations to help understand the comparison results:

- **Overall Score Comparison**: Bar chart comparing the overall scores of each model
- **Dimension Comparison**: Bar chart comparing models across the different evaluation dimensions
- **Generation Time Comparison**: Bar chart comparing code generation time
- **Radar Charts**: Individual and comparative radar charts showing model strengths and weaknesses

## Project Structure

```plaintext
llm_compare/
├── config/
│   └── config.toml
├── main.py
├── modules/
│   ├── code.py
│   ├── constants.py
│   ├── custom_task_handler.py
│   ├── data_analyzer.py
│   ├── evaluator.py
│   ├── llm_interface.py
│   ├── template_renderer.py
│   └── utils.py
├── output/
│   └── [timestamp]/
│       ├── figures/
│       ├── code/
│       ├── report.html
│       ├── raw_results.json
│       └── analysis.json
├── templates/
│   └── report_template.html
└── requirements.txt
```

## Supported Providers and Token Limits

### Provider Support

The tool uses the `/chat/completions` API endpoint for model interactions. In principle, any API provider that supports this endpoint can be used with this tool, including but not limited to:

- OpenAI
- Anthropic
- Ollama
- ModelScope
- SiliconFlow
- OpenRouter
- And other compatible API providers

To add a new provider, simply configure it in the `config.toml` file with the appropriate API base URL and authentication details.

### Token Limits

Because different providers and models have different context length limits, API calls may fail if the token count exceeds the model's capacity. By default, the tool uses a maximum of 2000 tokens for context, but this can be adjusted through:

1. The command-line argument:

   ```bash
   python main.py --max-tokens 1000 ...
   ```

2. Configuration in `config.toml`:

   ```toml
   max_tokens = 1000
   ```

If you encounter API failures, start with a smaller token limit and increase it gradually as needed. The minimum allowed value is 200 tokens.
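To make the provider requirement concrete, the sketch below shows the general shape of an OpenAI-compatible `/chat/completions` request, including the `max_tokens` field discussed above. It is not the tool's internal code: the base URL, API key, and model name are placeholders, and in practice these values come from `config.toml`.

```python
# Minimal OpenAI-compatible /chat/completions request (illustration only; the
# tool reads the base URL, API key, and model name from config.toml).
import requests

BASE_URL = "https://api.example-provider.com/v1"  # placeholder provider endpoint
API_KEY = "sk-your-key"                           # placeholder credential

payload = {
    "model": "Qwen/Qwen2.5-Coder-7B-Instruct",
    "messages": [{"role": "user", "content": "Write a Python tic-tac-toe game."}],
    "max_tokens": 1000,  # keep within the model's context limit (minimum 200 in this tool)
}

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```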
## Examples

Here are some example reports generated by the tool.

### Example 1: Chinese Tic-Tac-Toe Game

This example compares the code generation capabilities of the Qwen2.5-Coder-7B-Instruct and DeepSeek-V3-0324 models on creating a tic-tac-toe game from a Chinese-language prompt.

Command used:

```bash
python main.py -e siliconflow:Qwen/Qwen2.5-Coder-7B-Instruct,modelscope:deepseek-ai/DeepSeek-V3-0324 -j siliconflow:Qwen/Qwen2.5-Coder-7B-Instruct,modelscope:deepseek-ai/DeepSeek-V3-0324 -p '做一个人和计算机对战的tik-tok程序'
```

The prompt asks, in Chinese, for a tic-tac-toe program in which a human plays against the computer.

[Example 1 Report](Examples/20250405_150342/report.html)

### Example 2: Multiple Model Comparison with Task File

This example compares four different Ollama-hosted models using a task description read from a file.

Command used:

```bash
python main.py -e "ollama:modelscope2ollama-registry.azurewebsites.net/AI-ModelScope/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q4_K_S","ollama:modelscope2ollama-registry.azurewebsites.net/qwen/qwen2.5-coder-14b-instruct-gguf:q4_k_m","ollama:DeepSeek-R1-Distill-Qwen-7B-Q4_K_M:latest","ollama:qwen2.5-coder:0.5b" -j "ollama:modelscope2ollama-registry.azurewebsites.net/AI-ModelScope/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q4_K_S","ollama:modelscope2ollama-registry.azurewebsites.net/qwen/qwen2.5-coder-14b-instruct-gguf:q4_k_m","ollama:DeepSeek-R1-Distill-Qwen-7B-Q4_K_M:latest","ollama:qwen2.5-coder:0.5b" -P taskfile.txt
```

[Example 2 Report](Examples/20250405_153552/report.html)

### Important Note on Evaluation Results

1. Even for functionally correct code, models may sometimes assign inaccurate evaluation scores.
2. Network or provider issues may cause evaluation results to fluctuate.
3. When possible, run multiple tests and personally review the generated code files.
4. Formulate task requirements that are specific to your own needs in order to obtain evaluation results that better align with your business requirements.
5. The example cases provided here are for demonstration purposes only and should not be treated as definitive benchmarks.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the [Apache-2.0 license](LICENSE).