| [中文](./README.md) | English |
|
|
| # S1-DeepResearch Inference Framework |
|
|
| ## Key Features |
|
|
| - **Multiple LLM clients**: Supports vLLM, Azure OpenAI, AIHubMix, and other LLM services |
| - **Rich toolset**: Nine tools covering search, web browsing, file parsing, code execution, multimodal Q&A, bash, and more |
| - **Batch inference**: Concurrent batch inference with resume-from-checkpoint and periodic result saving |
| - **Single-query inference**: Detailed debugging and testing for individual queries |
| - **Load balancing**: Multi-node LLM load balancing and consistent scheduling |
| - **Detailed logging**: Per-query log files for easier troubleshooting and analysis |
|
|
| ## Project Layout (current) |
|
|
| ```text |
| ./ |
| ├── run_batch_inference_demo.sh          # Local / vLLM script template
| ├── run_batch_inference_online_demo.sh   # Online platform script template
| ├── inference/
| │   ├── run_batch_inference.py
| │   └── run_single_inference.py
| ├── server/
| ├── tool_kits/
| ├── utils/
| │   └── config/
| │       ├── config.example.json
| │       └── README.md
| ├── models/tokenizer/
| └── test_all_tools.py
| ``` |
|
|
| ## Quick Start |
|
|
| ### 1. Install dependencies |
|
|
| ```bash |
| pip install -r requirements.txt |
| ``` |
|
|
| ### 2. Configuration (JSON or environment variables recommended) |
|
|
| Precedence: **custom JSON > environment variables > defaults in `utils/config.py`**. |
|
|
| Typical workflow: |
|
|
| ```bash |
| cp utils/config/config.example.json utils/config/config.local.json |
| ``` |
|
|
| Edit `config.local.json` as needed, for example: |
|
|
| - `TOOLS_SERVER_BASE_ENDPOINT_URL` |
| - `AIHUBMIX_KEY` / `AZURE_KEY` / `VOLCANO_KEY` / `ALIYUN_KEY` |
| - `CLIENT_TIMEOUT` |
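|
| For example, a minimal `config.local.json` might look like the sketch below. The keys come from the list above; the endpoint, API key, and timeout values (and their exact types) are placeholders, not shipped defaults.
|
| ```bash
| # Illustrative only: replace the placeholder values with your own endpoint and keys.
| cat > utils/config/config.local.json <<'EOF'
| {
|   "TOOLS_SERVER_BASE_ENDPOINT_URL": "http://localhost:8001",
|   "AIHUBMIX_KEY": "sk-your-key-here",
|   "CLIENT_TIMEOUT": 600
| }
| EOF
| ```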
|
|
| You can also override via environment variables, for example: |
|
|
| ```bash |
| export S1_DR_CONFIG_JSON="utils/config/config.local.json" |
| ``` |
|
|
| ### 3. Prepare input JSONL |
|
|
| Each line is one JSON object. At minimum include `question`; usually also `id` and `file_path`. |
|
|
| #### 3.1 JSONL example (file inputs) |
|
|
| ```json |
| {"id":"query_001","question":"When Alibaba was founded, what was the average age of the founders whose surnames are Ma, Cai, or Zhang among the 18 co-founders? Round to one decimal place.","file_path":[]} |
| {"id":"query_002","question":"According to the manual, for DJI's heaviest AIR-series drone by takeoff weight, how many mAh of battery energy remain after flying half a marathon? (Note 1: assume calm air; minimum energy use is flying at 60% of max speed. Note 2: power draw can be converted from max flight time.)","file_path":["/path/to/file.pdf"]} |
| ``` |
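|
| Before a large run, it can help to confirm that every line parses as JSON. A quick check (assumes `python3` is on the PATH; `input.jsonl` is a placeholder path):
|
| ```bash
| # Reports the first line that is not valid JSON; blank lines are skipped.
| python3 -c "
| import json, sys
| for i, line in enumerate(open('input.jsonl'), 1):
|     if not line.strip():
|         continue
|     try:
|         json.loads(line)
|     except ValueError as e:
|         sys.exit(f'line {i}: {e}')
| print('all lines OK')
| "
| ```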
|
|
| #### 3.2 JSONL example (using Skills) |
|
|
| ```json |
| {"id":"query_003","question":"Use pymatgen to build a simple TiO2 surface slab. Please generate a common low-index surface, report the Miller index, slab thickness, and vacuum size, and briefly describe the resulting surface structure.","skills":[{"name": "skill_name1", "description": "description1", "skill_path": "skill_path1"}, {"name": "skill_name2", "description": "description2", "skill_path": "skill_path2"}]} |
| ``` |
|
|
| ## Recommended workflow: copy a script, then run |
|
|
| ### A. Local / vLLM (`run_batch_inference_demo.sh`) |
| |
| ```bash |
| cp run_batch_inference_demo.sh run_batch_local.sh |
| mkdir -p run_logs |
| # Edit parameters inside run_batch_local.sh |
| bash run_batch_local.sh |
| ``` |
| |
| Notes: |
| |
| - The script starts Python with `nohup ... &` and prints the background PID. |
| - Tail logs: `tail -f run_logs/run.log` |
|
|
| ### B. Online platform (`run_batch_inference_online_demo.sh`) |
|
|
| ```bash |
| cp run_batch_inference_online_demo.sh run_batch_online.sh |
| mkdir -p run_logs |
| # Edit parameters inside run_batch_online.sh |
| bash run_batch_online.sh |
| ``` |
|
|
| Notes: |
|
|
| - Focus on: `LLM_CLIENT_URLS`, `LLM_CLIENT_MODELS`, `SYSTEM_FORMAT` |
| - Tail logs: `tail -f run_logs/run_batch_*.log` |
|
|
| ## Script parameters |
|
|
| ### Basic |
|
|
| - `LLM_CLIENT_URLS`: Model service URLs, space-separated (paired with the model list) |
| - `LLM_CLIENT_MODELS`: Model names, space-separated |
| - `TEST_DATA_FILE`: Input JSONL path |
| - `OUTPUT_FILE`: Output file when `ROLLOUT_NUM=1` |
| - `OUTPUT_DIR`: Output directory when `ROLLOUT_NUM>1` (e.g. `rollout_01.jsonl`, β¦) |
| - `ROLLOUT_NUM`: Number of rollouts per sample |
| - `RESUME_FROM_FILE`: Resume checkpoint file (may be empty) |
| - `AVAILABLE_TOOLS`: Enabled tools, space-separated |
| - `TASK_TYPE`: Whether to treat input as text-only; default `input_only` |
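|
| A sketch of how these might look inside a copied script such as `run_batch_local.sh` (every value below is a placeholder, not a shipped default):
|
| ```bash
| # Placeholder values for illustration; adjust to your own deployment and data.
| LLM_CLIENT_URLS="http://10.0.0.1:8000/v1 http://10.0.0.2:8000/v1"   # space-separated, paired with the models
| LLM_CLIENT_MODELS="my-model my-model"
| TEST_DATA_FILE="data/input.jsonl"
| OUTPUT_FILE="outputs/result.jsonl"    # used when ROLLOUT_NUM=1
| OUTPUT_DIR="outputs/rollouts"         # used when ROLLOUT_NUM>1
| ROLLOUT_NUM=1
| RESUME_FROM_FILE=""                   # empty = start fresh
| AVAILABLE_TOOLS="wide_search wide_visit execute_code"
| TASK_TYPE="input_only"
| ```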
|
|
| ### Inference control |
|
|
| - `MAX_ROUNDS`: Max rounds per query |
| - `CONCURRENCY_WORKERS`: Number of concurrent workers |
| - `SAVE_BATCH_SIZE`: Flush results to disk every N samples |
| - `TEMPERATURE`: Sampling temperature |
| - `TOP_P`: Top-p (included in `run_batch_inference_demo.sh`) |
| - `EXTRA_PAYLOAD`: Extra model payload (JSON string; included in `run_batch_inference_demo.sh`) |
| - `TIMEOUT_FOR_ONE_QUERY`: Per-query timeout (seconds) |
| - `LLM_API_RETRY_TIMES`: Retries after LLM failure (not counting the first attempt) |
| - `SYSTEM_PROMPT`: Custom system prompt; empty uses the built-in default |
| - `SYSTEM_FORMAT`: Platform format (mainly in `run_batch_inference_online_demo.sh`) |
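|
| For example (illustrative values only; `EXTRA_PAYLOAD` is a JSON string, so quote it carefully, and the keys it accepts depend on your model endpoint):
|
| ```bash
| # Illustrative values; tune for your model and workload.
| MAX_ROUNDS=30
| CONCURRENCY_WORKERS=8
| SAVE_BATCH_SIZE=10
| TEMPERATURE=0.7
| TOP_P=0.95
| EXTRA_PAYLOAD='{"max_tokens": 8192}'   # JSON string; keys are endpoint-specific
| TIMEOUT_FOR_ONE_QUERY=3600
| LLM_API_RETRY_TIMES=3
| ```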
|
|
| ### Context truncation |
|
|
| - `DISCARD_ALL_MODE`: Enable discard-all (`true`/`false`) |
| - `MODEL_MAX_CONTEXT_TOKENS`: Model max context length |
| - `DISCARD_RATIO`: Threshold ratio to trigger discard |
| - `TOKENIZER_PATH`: Path to tokenizer used for token counting |
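|
| A sketch with placeholder numbers (set `MODEL_MAX_CONTEXT_TOKENS` to your model's real limit; the tokenizer path is only an example location):
|
| ```bash
| # Placeholder numbers; discard is triggered once usage crosses DISCARD_RATIO of the context limit.
| DISCARD_ALL_MODE=true
| MODEL_MAX_CONTEXT_TOKENS=128000
| DISCARD_RATIO=0.9
| TOKENIZER_PATH="models/tokenizer/"
| ```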
|
|
| ### Logging |
|
|
| - `LOG_LABEL`: Log label; the per-run log directory becomes `logs/YYYY_MM_DD_<LOG_LABEL>/`
| - `LOG_FILE`: Script log file under `run_logs/*.log` |
| - `LOGGING_ROOT`: Log root (set in `run_batch_inference_demo.sh`; may be empty) |
|
|
| ## `SYSTEM_FORMAT` values |
| |
| `SYSTEM_FORMAT` selects platform-specific handling via keyword branches. |
|
|
| - `deep_research`: Local deep-research format (vLLM deployment) |
| - `azure`: Azure OpenAI |
| - `aihubmix`: AIHubMix (OpenAI-compatible) |
| - `aihubmix_claude`: AIHubMix Claude format |
| - `aihubmix_glm`: AIHubMix GLM format |
| - `volcano`: Volcano Engine |
| - `aliyun`: Alibaba Cloud Bailian format |
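|
| For instance, a run against Azure OpenAI would combine it with the matching URL and model settings (the endpoint and deployment name below are placeholders):
|
| ```bash
| # Placeholder endpoint/deployment; SYSTEM_FORMAT selects the Azure-specific handling.
| SYSTEM_FORMAT="azure"
| LLM_CLIENT_URLS="https://your-resource.openai.azure.com"
| LLM_CLIENT_MODELS="your-deployment-name"
| ```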
|
|
| ## Currently available tools (9) |
|
|
| - `wide_search`: General web search via Serp; multiple queries in one round |
| - `scholar_search`: Google Scholar academic search (+ web results) |
| - `image_search`: Image search; multiple queries supported |
| - `wide_visit`: Visit pages and summarize toward a `goal` |
| - `file_wide_parse`: Parse local/remote files (PDF, DOCX, MD, CSV, etc.) |
| - `execute_code`: Run Python code |
| - `ask_question_about_image`: Image understanding and Q&A |
| - `ask_question_about_video`: Video understanding and Q&A |
| - `bash`: Run shell commands |
|
|
| Tool schemas are defined in `DEEPRESEARCH_SYSTEM_PROMPT` in `utils/prompts.py`. |
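|
| To run with only a subset, list the tool names in `AVAILABLE_TOOLS` (space-separated, as described under "Script parameters"); for example:
|
| ```bash
| # Example subset; the names come from the list above.
| AVAILABLE_TOOLS="wide_search wide_visit file_wide_parse execute_code"
| ```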
|
|
| ## Outputs and logs |
|
|
| ### Output JSONL fields |
|
|
| Each line written by `run_batch_inference.py` contains: |
|
|
| - `time_stamp`: Write time for that row (`YYYY-MM-DD HH:MM:SS`). |
| - `query_id`: Batch-level query id (hash of `question`). |
| - `query`: This rowβs `question` text. |
| - `result`: Detailed result object for one segment (from `run_single_inference.py`). |
| - `status`: `success` / `timeout` / `error`. |
| - `discard_segments`: Segments truncated by discard-all and summarized (excluding the final segment). |
| - `elapsed_sec`: Total seconds for this rollout of the query. |
| - `rollout_idx`: Rollout index (1-based). |
| - `src`: Full original input line (often includes `id`, `question`, `file_path`, skills, etc.). |
| - `segment_idx`: Current segment index (1-based). |
| - `segment_total`: Total segments for this query; `0` if there is no valid `result`. |
|
|
| Common fields inside `result` (`run_single_inference.py`): |
|
|
| - `query_id`: Single-run instance id (includes a time suffix). |
| - `tools`: Enabled tool schemas (string form). |
| - `messages`: Messages for model reasoning and tool interaction. |
| - `final_answer`: Answer text for this segment. |
| - `transcript`: Fuller trajectory (including tool returns). |
| - `rounds`: Rounds executed in this segment. |
| - `stopped_reason`: Why it stopped (e.g. `no_tool_calls`, `discard_all_01`, `discard_all_final`, `max_rounds_exceeded`). |
| - `error`: Present only on failure. |
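|
| A quick way to inspect a finished run, assuming `jq` is installed and `outputs/result.jsonl` is your output file (both are assumptions, not part of the framework):
|
| ```bash
| # Count rows per status (success / timeout / error).
| jq -r '.status' outputs/result.jsonl | sort | uniq -c
|
| # Print the final answer of each successful row (final_answer lives inside .result).
| jq -r 'select(.status == "success") | .result.final_answer' outputs/result.jsonl
| ```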
|
|
| ### Log directories |
|
|
| Default layout when `LOGGING_ROOT` is empty: |
|
|
| ```text |
| logs/ |
| └── YYYY_MM_DD_<LOG_LABEL>/
|     ├── collect.log
|     └── <query_id>/
|         ├── run.log
|         └── result.json
| ``` |
|
|
| ## Tool tests |
|
|
| Run the tool test script: |
|
|
| ```bash |
| python test_all_tools.py |
| ``` |
|
|
| This exercises all registered tools and checks that basic behavior works. |
|
|