|
|
--- |
|
|
license: cc-by-4.0 |
|
|
language: |
|
|
- en |
|
|
library_name: transformers |
|
|
tags: |
|
|
- grant-writing |
|
|
- research |
|
|
- STEM |
|
|
- biotech |
|
|
- fine-tuned |
|
|
- Qwen |
|
|
- text-generation |
|
|
- academic-writing |
|
|
- proposal-writing |
|
|
base_model: |
|
|
- unsloth/Qwen3-4B-GGUF |
|
|
datasets: |
|
|
- custom |
|
|
pipeline_tag: text-generation |
|
|
widget: |
|
|
- text: >- |
|
|
Write a Specific Aims section for an NIH R03 grant on developing |
|
|
CRISPR-based therapeutics for rare genetic disorders. Include 2 aims. |
|
|
example_title: Generate Specific Aims |
|
|
- text: >- |
|
|
Draft a Significance and Innovation section for an NSF grant on machine |
|
|
learning applications in protein structure prediction. |
|
|
example_title: Generate Significance |
|
|
- text: >- |
|
|
Review the following grant aims and provide feedback: Aim 1: Develop a novel |
|
|
CRISPR delivery system. Aim 2: Test efficacy in animal models. |
|
|
example_title: Review Grant Section |
|
|
model-index: |
|
|
- name: GrantsLLM |
|
|
results: [] |
|
|
--- |
|
|
|
|
|
# GrantsLLM |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
[](https://creativecommons.org/licenses/by/4.0/) |
|
|
[](https://huggingface.co/unsloth/Qwen3-4B-GGUF) |
|
|
|
|
|
**A specialized language model for STEM research grant writing and review** |
|
|
|
|
|
Developed by [Evionex](https://evionex.com) | Created by Kedar P. Navsariwala |
|
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**GrantsLLM** is a domain-specialized language model fine-tuned on 78 STEM research grant applications to assist researchers in drafting, refining, and reviewing grant proposals. Built on Qwen3-4B, this model has been trained to understand the structure, terminology, and writing style of successful research grants across NIH, NSF, and similar funding mechanisms.
|
|
|
|
|
- **Developed by:** Kedar P. Navsariwala, CTO & Co-Founder at Evionex |
|
|
- **Model type:** Causal Language Model (Decoder-only Transformer) |
|
|
- **Language(s):** English |
|
|
- **License:** CC BY 4.0 (requires attribution) |
|
|
- **Finetuned from:** unsloth/Qwen3-4B-GGUF |
|
|
--- |
|
|
|
|
|
## 🎯 Use Cases |
|
|
|
|
|
### What GrantsLLM Can Do |
|
|
|
|
|
- ✅ **Generate complete grant proposals** (NIH R03/R01/R21, NSF, etc.) |
|
|
- ✅ **Draft specific sections:** Specific Aims, Significance, Innovation, Approach, Research Strategy |
|
|
- ✅ **Improve existing text** for clarity, structure, and persuasiveness |
|
|
- ✅ **Provide review feedback** on grant coherence and alignment |
|
|
- ✅ **Expand bullet points** into full narrative sections |
|
|
- ✅ **Adapt tone** to academic/scientific writing standards |
|
|
|
|
|
### Intended Users |
|
|
|
|
|
- Principal Investigators (PIs) and research scientists |
|
|
- Postdoctoral researchers and graduate students |
|
|
- University grant support offices |
|
|
- Biotech and research startups |
|
|
- Academic research administrators |
|
|
|
|
|
### Out of Scope |
|
|
|
|
|
- ❌ Automated funding decisions or grant scoring |
|
|
- ❌ Legal, regulatory, or IRB compliance review |
|
|
- ❌ Generating fabricated data or citations |
|
|
- ❌ Non-STEM grants (output quality may be reduced for humanities, arts, and social sciences)
|
|
- ❌ Non-English grant applications |
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Quick Start |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install transformers torch accelerate |
|
|
|
|
|
``` |
|
|
### Basic Usage |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
|
|
# Load model and tokenizer |
|
|
model_name = "your-username/GrantsLLM" # Replace with actual repo |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_name, |
|
|
torch_dtype="auto", |
|
|
device_map="auto" |
|
|
) |
|
|
|
|
|
# Generate grant text |
|
|
prompt = """Write a Specific Aims section for an NIH R03 grant on developing novel CRISPR-based gene editing tools for treating sickle cell disease. Include 2-3 specific aims with clear objectives and expected outcomes.""" |
|
|
|
|
|
inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
|
|
outputs = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=512, |
|
|
temperature=0.7, |
|
|
top_p=0.9, |
|
|
do_sample=True |
|
|
) |
|
|
|
|
|
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
|
``` |
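Because the SFT stage used chat-style instruction pairs (see Training Data), wrapping the prompt in the tokenizer's chat template may improve instruction following. A minimal sketch, assuming the tokenizer ships a chat template and reusing `model`, `tokenizer`, and `prompt` from above:

```python
# Chat-format generation (sketch)
messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant-turn marker
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```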
|
|
|
|
|
### Using with Pipeline |
|
|
|
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
generator = pipeline( |
|
|
"text-generation", |
|
|
model="your-username/GrantsLLM", |
|
|
device_map="auto" |
|
|
) |
|
|
|
|
|
prompt = "Draft a Research Significance statement for a computational biology grant on protein folding prediction using deep learning." |
|
|
|
|
|
output = generator( |
|
|
prompt, |
|
|
max_new_tokens=400, |
|
|
temperature=0.7, |
|
|
top_p=0.9 |
|
|
) |
|
|
|
|
|
print(output[0]['generated_text'])
|
|
``` |
|
|
|
|
|
### Prompt Templates |
|
|
|
|
|
**For Section Generation:** |
|
|
``` |
|
|
Write a [Section] for a [Funder] [Mechanism] grant on [Topic]. |
|
|
Requirements: [Specific elements needed] |
|
|
Word limit: [Number] words |
|
|
``` |
|
|
|
|
|
**For Review/Feedback:** |
|
|
``` |
|
|
Review the following [Section] and provide feedback on clarity, structure, and alignment with [Funder] guidelines: |
|
|
|
|
|
[Paste text here] |
|
|
``` |
|
|
|
|
|
**Examples:** |
|
|
- `"Write Specific Aims for an NIH R01 grant on cancer immunotherapy"` |
|
|
- `"Draft Innovation section for NSF CAREER award on quantum computing"` |
|
|
- `"Review this Research Strategy for logical flow and hypothesis clarity"` |
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 Training Data |
|
|
|
|
|
### Dataset Composition |
|
|
|
|
|
- **Size:** 78 research grant applications |
|
|
- **Domains:** Biotechnology, Molecular Biology, Computational Biology, Chemistry, Biomedical Sciences |
|
|
- **Formats:** NIH (R01, R03, R21), NSF, and similar federal/institutional grant formats |
|
|
- **Sources:** Publicly available grant examples, institutional repositories, and NIH RePORTER |
|
|
- **Language:** English |
|
|
|
|
|
### Data Processing |
|
|
|
|
|
**Stage 1: Continued Pretraining (CPT)** |
|
|
- Raw grant text extracted and cleaned from PDFs/documents |
|
|
- Structured into single-column `text` format (JSONL/Parquet) |
|
|
- Preserves section structure and domain terminology |
|
|
|
|
|
**Stage 2: Supervised Fine-Tuning (SFT)** |
|
|
- Chat-style instruction pairs using ChatML template |
|
|
- Tasks include: section generation, expansion, refinement, review |
|
|
- Format: `{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}` |
|
|
|
|
|
--- |
|
|
|
|
|
## 🔧 Training Procedure |
|
|
|
|
|
### Training Hyperparameters |
|
|
|
|
|
- **Base Model:** unsloth/Qwen3-4B-GGUF (~4B parameters) |
|
|
- **Training Framework:** Unsloth + PyTorch |
|
|
- **Hardware:** Google Colab (single GPU, T4/V100) |
|
|
- **Fine-tuning Method:** LoRA/QLoRA (Parameter-Efficient Fine-Tuning) |
|
|
- **Training Stages:** |
|
|
1. Continued Pretraining on grant corpus |
|
|
2. Supervised Fine-Tuning on instruction/response pairs
|
|
- **Optimizer:** AdamW |
|
|
- **Learning Rate:** Low rate to prevent catastrophic forgetting |
|
|
- **Training monitored for:** Overfitting, repetition, coherence |
|
|
|
|
|
### Training Details |
|
|
|
|
|
```yaml |
|
|
Training Type: Parameter-efficient fine-tuning with LoRA adapters
|
|
Epochs: [Adjusted based on validation performance] |
|
|
Batch Size: Optimized for a 4B model on a single GPU
|
|
Context Length: Inherited from base model (Qwen3-4B natively supports ~32K tokens)
|
|
Loss Function: Causal Language Modeling (CLM) loss |
|
|
Validation Strategy: Qualitative evaluation on held-out grant examples |
|
|
``` |
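For reference, a minimal sketch of the LoRA setup described above, using Unsloth with TRL's `SFTTrainer`. All hyperparameter values are illustrative, not the values used in training, and the `SFTTrainer` keyword arguments vary across TRL versions; note also that training loads the standard safetensors build of the base model, since GGUF files are inference-only:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load the base model with 4-bit quantization (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B",  # non-GGUF build; GGUF is not trainable
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters (illustrative rank/alpha)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,  # hypothetical dataset of formatted grant text
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,  # LoRA adapters tolerate a higher LR than full fine-tuning
        num_train_epochs=3,
        output_dir="outputs",
    ),
)
trainer.train()
```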
|
|
|
|
|
--- |
|
|
|
|
|
## 📈 Performance & Evaluation |
|
|
|
|
|
### Evaluation Methodology |
|
|
|
|
|
**Qualitative Assessment:** |
|
|
- Human expert review of generated grant sections |
|
|
- Evaluation criteria: coherence, structure, domain accuracy, persuasiveness |
|
|
- Practical testing on mock NIH/NSF grant prompts |
|
|
|
|
|
### Known Strengths |
|
|
|
|
|
- ✅ Strong grasp of STEM grant structure (Aims, Significance, Innovation, Approach) |
|
|
- ✅ Effective expansion of bullet points to narrative |
|
|
- ✅ Appropriate academic/scientific tone |
|
|
- ✅ Good understanding of NIH/NSF terminology and conventions |
|
|
- ✅ Maintains logical flow between sections |
|
|
|
|
|
### Known Limitations |
|
|
|
|
|
- ⚠️ **Hallucination Risk:** May generate plausible but incorrect citations, grant numbers, or policies |
|
|
- ⚠️ **Format Bias:** Optimized for NIH/NSF; other formats (European, private foundations) may be weaker |
|
|
- ⚠️ **Domain Bias:** Best for biotech/life sciences; physics/engineering grants may be less polished |
|
|
- ⚠️ **Repetition:** Can produce repetitive text if prompt lacks detail or structure |
|
|
- ⚠️ **Context Limits:** Long grants may need to be drafted in sections |
|
|
- ⚠️ **Recency:** Training data may not reflect latest funder guidelines (post-2025) |
|
|
|
|
|
--- |
|
|
|
|
|
## ⚠️ Bias, Risks, and Limitations |
|
|
|
|
|
### Bias Sources |
|
|
|
|
|
**Domain Bias:** Model is optimized for STEM fields represented in training data (biotech, molecular biology, computational biology). Grants in underrepresented fields may receive lower quality outputs. |
|
|
|
|
|
**Institutional Bias:** Writing style may reflect patterns from R1 research universities and well-funded institutions present in training examples. |
|
|
|
|
|
**Funding Mechanism Bias:** Strongest performance on NIH R-series and NSF standard grants; less reliable for fellowships, training grants, or international formats. |
|
|
|
|
|
**Historical Bias:** May reinforce language patterns from historically funded research areas, potentially disadvantaging emerging or interdisciplinary fields. |
|
|
|
|
|
### Risks |
|
|
|
|
|
**Fabrication:** Model may generate convincing but false information including: |
|
|
- Non-existent citations and references |
|
|
- Incorrect grant mechanism details |
|
|
- Fabricated preliminary data or results |
|
|
- Inaccurate funder policies |
|
|
|
|
|
**Over-reliance:** Users may trust outputs without verification, risking submission of flawed proposals. |
|
|
|
|
|
**Privacy:** Users may inadvertently input confidential research ideas or unpublished data. |
|
|
|
|
|
### Recommendations |
|
|
|
|
|
1. **Always verify:** Check all factual claims, citations, and funder guidelines |
|
|
2. **Human review required:** Never submit AI-generated grants without expert review |
|
|
3. **Iterative refinement:** Use as drafting assistant, not final author |
|
|
4. **Protect IP:** Don't input confidential or proprietary information |
|
|
5. **Disclose usage:** Be transparent with collaborators and (when appropriate) funders about AI assistance |
|
|
6. **Update manually:** Cross-reference current funder guidelines and requirements |
|
|
|
|
|
--- |
|
|
|
|
|
## 🔐 Ethical Considerations |
|
|
|
|
|
### Responsible Use |
|
|
|
|
|
- **Transparency:** Disclose AI assistance to co-authors and collaborators |
|
|
- **Human oversight:** Keep domain experts in the loop for all submissions |
|
|
- **Academic integrity:** Ensure outputs align with your institution's policies on AI use |
|
|
- **Verification:** Validate all scientific claims and citations independently |
|
|
- **Privacy:** Avoid inputting sensitive, unpublished, or identifiable information |
|
|
|
|
|
### Funder Policies |
|
|
|
|
|
As of February 2026, grant-writing AI policies vary by funder: |
|
|
- **NIH:** Generally permits AI assistance for writing, but PIs remain responsible for all content |
|
|
- **NSF:** Similar stance; emphasizes researcher accountability |
|
|
- **Check specific RFAs** for any AI-related restrictions or disclosure requirements |
|
|
|
|
|
**When in doubt:** Contact your program officer or sponsored research office. |
|
|
|
|
|
--- |
|
|
|
|
|
## 📜 Licensing & Attribution |
|
|
|
|
|
### License: CC BY 4.0 |
|
|
|
|
|
This model is licensed under [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/). |
|
|
|
|
|
### You Must: |
|
|
|
|
|
✅ **Give appropriate credit** to Evionex and Kedar P. Navsariwala |
|
|
✅ **Provide a link** to the license |
|
|
✅ **Indicate if changes** were made to the model |
|
|
✅ **Retain attribution** in any derivative works or applications |
|
|
|
|
|
### Citation |
|
|
|
|
|
If you use GrantsLLM in your research or projects, please cite: |
|
|
|
|
|
```bibtex |
|
|
@software{grantsllm2026, |
|
|
author = {Navsariwala, Kedar P.}, |
|
|
title = {GrantsLLM: A Fine-Tuned Language Model for STEM Grant Writing}, |
|
|
year = {2026}, |
|
|
publisher = {Hugging Face}, |
|
|
organization = {Evionex}, |
|
|
howpublished = {\url{https://huggingface.co/KedarPN/GrantsLLM}},
|
|
license = {CC-BY-4.0} |
|
|
} |
|
|
``` |
|
|
|
|
|
### Attribution Example |
|
|
|
|
|
``` |
|
|
Grant drafting assistance provided by GrantsLLM (Navsariwala, 2026), |
|
|
developed by Evionex. Available at https://huggingface.co/KedarPN/GrantsLLM
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 🛠️ Technical Specifications |
|
|
|
|
|
### Model Architecture |
|
|
|
|
|
- **Architecture:** Qwen3 (Decoder-only Transformer)
|
|
- **Parameters:** ~4 billion
|
|
- **Layers:** [Inherited from base model] |
|
|
- **Hidden Size:** [Inherited from base model] |
|
|
- **Attention Heads:** [Inherited from base model] |
|
|
- **Vocabulary Size:** [Inherited from base model] |
|
|
- **Context Window:** ~32K tokens (native to Qwen3-4B)
|
|
|
|
|
### Software Stack |
|
|
|
|
|
- **Training:** Unsloth, PyTorch, Hugging Face Transformers |
|
|
- **Fine-tuning:** LoRA/QLoRA with PEFT |
|
|
- **Environment:** Google Colab (GPU) |
|
|
- **Export Formats:** |
|
|
- Hugging Face Transformers checkpoint |
|
|
- GGUF |
|
|
|
|
|
### Hardware Requirements |
|
|
|
|
|
**Inference:** |
|
|
- Minimum: 8GB VRAM (GPU) or 16GB RAM (CPU with quantization) |
|
|
- Recommended: 16GB+ VRAM for optimal speed |
|
|
- CPU inference: Possible but slower; consider GGUF quantized versions |
|
|
|
|
|
**Formats for Different Hardware:** |
|
|
- Full precision: 16GB+ VRAM |
|
|
- GGUF Q4_K_M: 4-8GB VRAM or CPU |
|
|
- GGUF Q8_0: 8-12GB VRAM |
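For CPU or low-VRAM inference with the GGUF variants, `llama-cpp-python` is one option. A minimal sketch (the local GGUF file name is hypothetical):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="GrantsLLM-Q4_K_M.gguf",  # hypothetical local file name
    n_ctx=4096,  # context window for this session
)

out = llm(
    "Write a Specific Aims section for an NIH R03 grant on CRISPR-based "
    "therapeutics for rare genetic disorders. Include 2 aims.",
    max_tokens=512,
    temperature=0.7,
    top_p=0.9,
)
print(out["choices"][0]["text"])
```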
|
|
|
|
|
--- |
|
|
|
|
|
## 📦 Model Variants |
|
|
|
|
|
| Variant | Size | Use Case | Hardware | |
|
|
|---------|------|----------|----------| |
|
|
| Full precision (bf16) | ~8GB | Maximum quality | 16GB+ VRAM |
|
|
| GGUF Q8_0 | ~4.3GB | Balanced quality/speed | 8GB+ VRAM or CPU |
|
|
| GGUF Q4_K_M | ~2.5GB | Fast inference | 4GB+ VRAM or CPU |
|
|
|
|
|
--- |
|
|
|
|
|
## 🤝 Acknowledgments |
|
|
|
|
|
### Built With |
|
|
|
|
|
- **Base Model:** [Qwen3-4B](https://huggingface.co/unsloth/Qwen3-4B-GGUF) by the Qwen team (Unsloth GGUF build)
|
|
- **Training Framework:** [Unsloth](https://github.com/unslothai/unsloth) for efficient fine-tuning |
|
|
- **ML Libraries:** PyTorch, Hugging Face Transformers |
|
|
- **Infrastructure:** Google Colab |
|
|
|
|
|
### Special Thanks |
|
|
|
|
|
- Open-source grant examples from NIH RePORTER and NSF Award Search |
|
|
- Academic institutions sharing grant templates and examples |
|
|
- Unsloth team for efficient fine-tuning tools |
|
|
- Hugging Face for model hosting and inference infrastructure |
|
|
|
|
|
--- |
|
|
|
|
|
## 📞 Contact & Support |
|
|
|
|
|
**Developer:** Kedar P. Navsariwala |
|
|
**Organization:** Evionex |
|
|
**Website:** [www.evionex.com](https://www.evionex.com) |
|
|
**Model Repository:** [KedarPN/GrantsLLM](https://huggingface.co/KedarPN/GrantsLLM) |
|
|
|
|
|
### Issues & Feedback |
|
|
|
|
|
- Report bugs or issues in the [Discussion tab](https://huggingface.co/KedarPN/GrantsLLM/discussions) |
|
|
- Share use cases and success stories |
|
|
- Request features or improvements |
|
|
- Contribute to model evaluation |
|
|
|
|
|
--- |
|
|
|
|
|
## 📌 Disclaimer |
|
|
|
|
|
GrantsLLM is an **assistive tool** designed to support the grant writing process. It does not: |
|
|
- Guarantee grant success or funding approval |
|
|
- Replace domain expertise or scientific judgment |
|
|
- Ensure compliance with all funder requirements |
|
|
- Eliminate the need for human review and verification |
|
|
|
|
|
**Always consult official funder guidelines and domain experts before grant submission.** |
|
|
|
|
|
--- |
|
|
|
|
|
## 🔄 Version History |
|
|
|
|
|
**v1.0** (February 2026) |
|
|
- Initial release |
|
|
- Trained on 78 STEM grant applications |
|
|
- Base model: Qwen3-4B-GGUF |
|
|
- Supports NIH and NSF formats |
|
|
|
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**© 2026 Evionex | Licensed under CC BY 4.0** |
|
|
|
|
|
Made with ❤️ for the research community |
|
|
|
|
|
</div> |
|
|
|
|
|
|
|
This Qwen3 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.