Commit d5c7574 by starriver030515 · verified · 1 parent: df54d71

Create README.md

README.md ADDED
@@ -0,0 +1,186 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ base_model:
+ - Qwen/Qwen2.5-Coder-7B-Instruct
+ pipeline_tag: text-generation
+ library_name: transformers
+ tags:
+ - chart
+ - code-generation
+ - visualization
+ - matplotlib
+ - data-visualization
+ - complexity-aware
+ datasets:
+ - opendatalab/ChartVerse-Coder-Data
+ ---
+
+ **ChartVerse-Coder** is a complexity-aware chart code generator that autonomously synthesizes diverse, high-complexity chart code from scratch. It was developed as part of the **[opendatalab/ChartVerse](https://huggingface.co/collections/opendatalab/chartverse)** project. For more details about our method, datasets, and full model series, please visit our [GitHub Repository](https://github.com/starriver030515/ChartVerse) and [Project Page](https://chartverse.github.io).
+
+ Unlike prior template-based or seed-conditioned approaches, ChartVerse-Coder generates chart code via high-temperature sampling. This lets it explore the long tail of the chart distribution and produce diverse, realistic charts with high structural complexity.
+
+ ## 🔥 Highlights
+
+ - **Autonomous Synthesis**: Generates diverse chart code from scratch, without templates or seed charts
+ - **Complexity-Aware**: Trained with Rollout Posterior Entropy (RPE)-guided filtering to master high-complexity visualizations
+ - **High Diversity**: Produces charts spanning 3D plots, hierarchical structures, multi-subplot layouts, and more
+ - **Iterative Self-Enhancement**: Progressively improves code quality through generation-filtering-retraining loops
+
+ ## 🔬 Method Overview
+
+ ### Rollout Posterior Entropy (RPE)
+
+ <div align="center">
+ <img src="https://raw.githubusercontent.com/chartverse/chartverse.github.io/main/static/images/rpe_illustration.png" width="100%" alt="RPE Illustration">
+ </div>
+
+ We propose **Rollout Posterior Entropy (RPE)** to quantify intrinsic chart complexity via generative stability:
+
+ 1. **VLM Rollout**: Given a chart, prompt a VLM to generate executable code 8 times at temperature 1.0
+ 2. **Feature Extraction**: Extract CLIP embeddings from the reconstructed images and compute their Gram matrix
+ 3. **Spectral Entropy**: Compute the entropy of the Gram matrix's normalized singular values
+
+ **Key Insight**: Simple charts yield consistent reconstructions (low RPE), while complex charts yield divergent ones (high RPE). We retain only samples with **RPE ≥ 0.4**.
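
The spectral-entropy step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the CLIP embeddings of the 8 rollouts are already available as rows of an array, and the function name `spectral_entropy`, the L2 normalization of each embedding, and the use of an unnormalized natural-log entropy are all assumptions. The README does not state how the entropy is scaled, so values from this sketch are not guaranteed to be comparable to the 0.4 threshold.

```python
import numpy as np


def spectral_entropy(embeddings, eps=1e-12):
    """Entropy of the normalized singular values of the rollout Gram matrix.

    `embeddings` is a (num_rollouts, dim) array; the name and the exact
    normalization are assumptions made for this sketch.
    """
    e = np.asarray(embeddings, dtype=np.float64)
    # L2-normalize each rollout embedding so the Gram matrix holds cosine similarities.
    e = e / (np.linalg.norm(e, axis=1, keepdims=True) + eps)
    gram = e @ e.T  # shape: (num_rollouts, num_rollouts)
    singular_values = np.linalg.svd(gram, compute_uv=False)
    # Turn the spectrum into a probability distribution.
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]  # drop numerically zero modes so log() stays finite
    return float(-(p * np.log(p)).sum())


# Eight identical rollouts: a single dominant mode, so the entropy is near 0.
consistent = np.tile(np.array([1.0, 2.0, 3.0]), (8, 1))
# Eight mutually orthogonal rollouts: the entropy reaches its maximum, log(8).
divergent = np.eye(8)

print(spectral_entropy(consistent))
print(spectral_entropy(divergent))
```

The two toy inputs mirror the key insight: identical reconstructions collapse the spectrum onto one singular value, while fully divergent reconstructions spread it evenly.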
+
+ ### Training Pipeline
+
+ <div align="center">
+ <img src="https://raw.githubusercontent.com/chartverse/chartverse.github.io/main/static/images/pipeline.png" width="100%" alt="ChartVerse Pipeline">
+ </div>
+
+ **Stage 1: Difficulty-Filtered Cold Start**
+ - Aggregate charts from existing datasets and filter by RPE ≥ 0.4
+ - Use Claude-4-Sonnet to infer source code for high-complexity charts
+ - Curate **60K** high-quality seed samples
+
+ **Stage 2: Iterative Self-Enhancement**
+ - Generate 2M raw candidates via high-temperature sampling
+ - Apply tri-fold filtering:
+   - ✅ Valid Execution
+   - ✅ High Complexity (RPE ≥ 0.4)
+   - ✅ Low Similarity to existing data (Cosine Sim ≤ 0.65)
+ - Retrain coder on expanded dataset
+ - Repeat for 2 iterations
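
The tri-fold filter in Stage 2 might look something like the sketch below. It is only an illustration under stated assumptions: the `Candidate` record, the `tri_fold_filter` function, and the idea that execution status, RPE, and an embedding are precomputed per candidate are all hypothetical. Comparing each candidate against previously accepted candidates, as well as against the existing data, is likewise an assumption rather than something the README states.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

RPE_MIN = 0.4   # complexity threshold from the pipeline description
SIM_MAX = 0.65  # maximum allowed cosine similarity to existing data


@dataclass
class Candidate:
    """Hypothetical record for one generated chart program."""
    code: str
    executed_ok: bool           # did the script run and render?
    rpe: float                  # precomputed Rollout Posterior Entropy
    embedding: Sequence[float]  # precomputed embedding of the rendered chart


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tri_fold_filter(candidates: List[Candidate],
                    existing: List[Sequence[float]]) -> List[Candidate]:
    """Keep candidates that execute, are complex enough, and are novel."""
    pool = list(existing)
    kept = []
    for cand in candidates:
        if not cand.executed_ok:
            continue  # filter 1: valid execution
        if cand.rpe < RPE_MIN:
            continue  # filter 2: high complexity
        if any(cosine(cand.embedding, ref) > SIM_MAX for ref in pool):
            continue  # filter 3: low similarity
        kept.append(cand)
        pool.append(cand.embedding)  # assumption: also dedupe among new samples
    return kept


existing_embeddings = [[1.0, 0.0]]
candidates = [
    Candidate("a", True, 0.9, [0.0, 1.0]),   # passes all three checks
    Candidate("b", False, 0.9, [0.0, 1.0]),  # failed to execute
    Candidate("c", True, 0.1, [-1.0, 0.0]),  # too simple
    Candidate("d", True, 0.8, [1.0, 0.1]),   # near-duplicate of existing data
]
print([c.code for c in tri_fold_filter(candidates, existing_embeddings)])
```

Running the checks from cheapest to most expensive keeps the similarity search, which scales with the size of the pool, for the candidates that survive the first two filters.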
+
+ **Final Output**: Generate **700K** high-complexity chart code samples for downstream QA synthesis.
+
+ ## 🏋️ Training Details
+
+ - **Base Model**: Qwen2.5-Coder-7B-Instruct
+ - **Cold Start Data**: 60K high-complexity samples
+ - **Boost Data**: 200K iteratively filtered samples
+ - **Training**: Full-parameter fine-tuning with LLaMA-Factory
+ - **Learning Rate**: 2.0 × 10⁻⁵
+ - **Batch Size**: 16
+ - **Context Length**: 4,096 tokens
+ - **Epochs**: 5
+ - **Precision**: BF16
+
+ ## 📊 Synthesized Data Quality
+
+ ### Comparison with Existing Datasets
+
+ <div align="center">
+ <img src="https://raw.githubusercontent.com/chartverse/chartverse.github.io/main/static/images/chart_cmp.png" width="100%" alt="Dataset Comparison">
+ </div>
+
+ ChartVerse-Coder synthesizes charts with higher complexity and diversity than the existing datasets compared in the figure above.
+
+ ### Synthesized Chart Examples
+
+ <div align="center">
+ <img src="https://raw.githubusercontent.com/chartverse/chartverse.github.io/main/static/images/complex_images.png" width="100%" alt="Complex Chart Examples">
+ </div>
+
+ Our synthesized charts span a wide range of chart types:
+ - **3D Visualizations**: Surface plots, 3D bar charts, scatter plots
+ - **Hierarchical Structures**: Treemaps, sunburst charts, dendrograms
+ - **Statistical Plots**: Violin plots, radar charts, box plots with annotations
+ - **Multi-Subplot Layouts**: Complex dashboards with mixed chart types
+ - **Specialized Charts**: Sankey diagrams, chord diagrams, heatmaps with clustering
+
+ ## 🚀 Quick Start
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load the model and tokenizer
+ model_path = "opendatalab/ChartVerse-Coder"
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path, torch_dtype="auto", device_map="auto"
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+ # Generation prompt (sent as the user message)
+ prompt = """You are a Python visualization expert. Generate a random Python visualization code focusing on charts, tables, or diagrams.
+
+ Requirements:
+ - Choose any visualization type (chart, table, flowchart, diagram, etc.)
+ - Create sample data
+ - Use Python visualization library (matplotlib, graphviz, etc.)
+ - Make it visually appealing with proper labels, titles, and colors
+ - Include sufficient visual elements
+ - Carefully design the layout to avoid any overlapping text or elements
+ - Adjust figure size, margins, and spacing for optimal clarity
+ - Make it visually appealing with proper labels, titles, and colors
+
+ Output format: Only output the Python visualization code wrapped in ```python```
+ """
+
+ # Generate Chart Code
+ messages = [
+     {"role": "user", "content": prompt}
+ ]
+
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ # Move inputs to wherever device_map="auto" placed the model
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
+
+ # High-temperature sampling for diversity
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=4096,
+     temperature=1.0,
+     top_p=0.95,
+     top_k=20,
+     do_sample=True
+ )
+
+ # Decode only the newly generated tokens, not the echoed prompt
+ new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
+ generated_code = tokenizer.decode(new_tokens, skip_special_tokens=True)
+ print(generated_code)
+ ```
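
To make the roles of `temperature`, `top_k`, and `top_p` in the call above more concrete, here is a simplified, pure-Python sketch of how one token could be drawn from a list of logits. It is an illustration only and not the `transformers` implementation. The function name `sample_token` is hypothetical, and the order of operations (temperature scaling, then top-k, then nucleus filtering) is an assumption chosen for this sketch.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=20, top_p=0.95, rng=random):
    """Draw one token index using temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature: values above 1 flatten the distribution, below 1 sharpen it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    order = order[:top_k]

    # Top-p: within those, keep the smallest prefix whose renormalized
    # probability mass reaches top_p.
    kept_mass = sum(probs[i] for i in order)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:
            break

    # Sample among the survivors in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# With top_k=1 the sampler is greedy and always returns the argmax.
print(sample_token([0.0, 5.0, 1.0], top_k=1))
```

With `temperature=1.0` the model's distribution is left unscaled, so diversity comes from sampling itself, while `top_k` and `top_p` trim the unlikely tail that tends to produce broken code.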
+
+ ### Execute Generated Code
+
+ ```python
+ import re
+ import matplotlib.pyplot as plt
+
+ # Extract code from response
+ code_match = re.search(r'```python\n(.*?)```', generated_code, re.DOTALL)
+ if code_match:
+     code = code_match.group(1)
+     # Runs the generated script in the current process;
+     # it is expected to save the figure as 'image.png'
+     exec(code)
+ ```
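
When many samples are generated, it can be convenient to run each script in a separate process with a timeout instead of calling `exec` in the current interpreter. The sketch below is one possible way to do that and is not part of the ChartVerse codebase: the helper names `extract_code` and `runs_cleanly`, the 60-second timeout, and the use of the `MPLBACKEND=Agg` environment variable for headless rendering are all assumptions.

```python
import os
import re
import subprocess
import sys
import tempfile

FENCE = "`" * 3  # the three-backtick markdown fence, built indirectly
CODE_RE = re.compile(
    re.escape(FENCE) + r"python\s*\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_code(response):
    """Return the first fenced Python block in a model response, or None."""
    match = CODE_RE.search(response)
    return match.group(1) if match else None


def runs_cleanly(code, timeout_s=60.0):
    """Run a script in a fresh interpreter; True if it exits with status 0."""
    # Agg is matplotlib's non-interactive backend, so no display is needed.
    env = dict(os.environ, MPLBACKEND="Agg")
    with tempfile.TemporaryDirectory() as workdir:
        script = os.path.join(workdir, "chart.py")
        with open(script, "w", encoding="utf-8") as handle:
            handle.write(code)
        try:
            result = subprocess.run(
                [sys.executable, script],
                cwd=workdir,          # files the script saves land in workdir
                env=env,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
    return result.returncode == 0


response = "Here is a chart:\n" + FENCE + "python\nprint('ok')\n" + FENCE
code = extract_code(response)
print(code is not None and runs_cleanly(code))
```

Because the working directory is temporary, any figure the script saves is deleted when the function returns; copy it out inside the `with` block if the rendered image is needed.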
+
+ ## 📖 Citation
+
+ ```bibtex
+ @article{chartverse2026,
+   title={ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch},
+   author={Anonymous Authors},
+   journal={Anonymous ACL Submission},
+   year={2026}
+ }
+ ```
+
+ ## 📄 License
+
+ This model is released under the Apache 2.0 License.
+
+ ## 🙏 Acknowledgements
+
+ - Base model: [Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct)
+ - Training framework: [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)
+ - Code inference: Claude-4-Sonnet for cold start data generation