VulCoCo
1. Tool Information
- Tool Name: VulCoCo
- Repository / URL: VulCoCo GitHub Repo
2. Authors and Contact
- Main Author(s): Tan Bui - ngoctanbui@smu.edu.sg
3. Overview
VulCoCo is a tool for vulnerable code clone detection that combines retrieval-based methods with LLM validation to identify code clones in software repositories.
Further details can be found in the VulCoCo paper.
4. Installation
Step 1: Fetch and Clone Repositories
This script fetches and clones the top repositories for analysis.
Step 2: Parse Source Files
Parses the cloned repositories to extract function-level code segments.
Step 3: Retrieval-based Clone Detection
python3 main.py --all_json_path 'path/to/source/data.jsonl' \
--funcs_dir 'path/to/function/json/files' \
--clones_dir 'path/to/output/directory' \
--threshold 0.7
Parameters:
- --all_json_path: Path to the JSONL source dataset
- --funcs_dir: Directory containing function JSON files from Step 2
- --clones_dir: Output directory for clone detection results
- --threshold: Similarity threshold for clone detection (default: 0.7)
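To illustrate what the --threshold parameter controls, here is a minimal sketch of threshold-based retrieval. It assumes each function has already been turned into a numeric vector and that similarity is cosine similarity; the README does not say which similarity measure main.py uses, so cosine_similarity and retrieve_candidates are hypothetical helpers, not the tool's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def retrieve_candidates(query_vec, func_vecs, threshold=0.7):
    """Return (name, score) pairs scoring at or above the threshold, best first.

    func_vecs is a hypothetical mapping from function name to vector.
    """
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in func_vecs.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, raising the threshold returns fewer, closer candidates, while lowering it passes more pairs on to the LLM validation step.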
Step 4: LLM Validation
python3 llm.py --results 'path/to/clone/results.json' \
--sources 'path/to/source/data.jsonl' \
--api-key 'your-anthropic-api-key' \
--output 'path/to/validated/output.json' \
--responses-dir 'path/to/llm/responses'
Parameters:
- --results: JSON file containing clone detection results from Step 3
- --sources: Path to the original JSONL source dataset
- --api-key: Your Anthropic API key for LLM validation
- --output: Output path for validated results
- --responses-dir: Directory to save raw LLM responses
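If you prefer not to type the API key on the command line, the Step 4 invocation can be wrapped in a small Python helper. The sketch below only assembles the flags documented above and reads the key from an environment variable; the variable name ANTHROPIC_API_KEY and the helper names build_llm_command and run_validation are assumptions for illustration, not part of VulCoCo.

```python
import os
import subprocess
import sys


def build_llm_command(results, sources, output, responses_dir, api_key):
    """Assemble the llm.py command line from the documented flags."""
    return [
        sys.executable, "llm.py",
        "--results", results,
        "--sources", sources,
        "--api-key", api_key,
        "--output", output,
        "--responses-dir", responses_dir,
    ]


def run_validation(results, sources, output, responses_dir):
    """Run llm.py with the key taken from the environment (assumed variable name)."""
    api_key = os.environ["ANTHROPIC_API_KEY"]  # assumption: key exported beforehand
    cmd = build_llm_command(results, sources, output, responses_dir, api_key)
    # check=True raises CalledProcessError if llm.py exits with a non-zero status
    subprocess.run(cmd, check=True)


# Example call (paths are placeholders):
# run_validation("path/to/clone/results.json", "path/to/source/data.jsonl",
#                "path/to/validated/output.json", "path/to/llm/responses")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.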
#### 6. Input and Output Format
* **Input format:** Set the configuration values in get_top_repos.py, such as LANGUAGES_REPOS, GITHUB_TOKEN, MIN_PR_MERGE_RATE, and DAYS_THRESHOLD.
* **Output format:** The tool generates:
  - Clone detection results in JSON format
  - LLM validation responses
  - Final validated clone pairs with confidence scores
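
As a rough sketch of how the validated output might be consumed, the snippet below loads a JSON file and keeps only pairs above a chosen confidence. The exact output schema is not documented here, so the assumed layout (a JSON list of objects with a "confidence" field) and the helper names are hypothetical; adjust them to the actual file.

```python
import json


def load_validated_pairs(path):
    """Load the validated output, assumed here to be a JSON list of objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def filter_by_confidence(pairs, min_confidence=0.8):
    """Keep pairs whose assumed 'confidence' field meets the minimum.

    Entries without the field are treated as confidence 0.0 and dropped
    for any positive minimum.
    """
    return [pair for pair in pairs
            if pair.get("confidence", 0.0) >= min_confidence]


# Example call (path is a placeholder):
# confident = filter_by_confidence(load_validated_pairs("path/to/validated/output.json"))
```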