Sentence Similarity
sentence-transformers
English
Arabic
aviation
gacar
saudi-arabia
retrieval
rag
bilingual
embeddings
Update README.md
README.md
CHANGED
|
@@ -1,26 +1,32 @@
|
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
language:
|
| 4 |
-
|
| 5 |
-
|
| 6 |
tags:
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
---
|
| 14 |
|
| 15 |
# CaptAdel — Fly GACA
|
| 16 |
|
| 17 |
-
Status: in development. This repository is the future home of the **GACAR retrieval embedding model** that powers Captain Adel, Fly GACA's independent, educational AI flight instructor for Saudi civil aviation. Model weights are not published here yet.
|
| 18 |
|
| 19 |
## What Captain Adel is
|
| 20 |
|
| 21 |
-
Captain Adel is a retrieval-augmented assistant: it answers GACAR questions from a curated corpus with the relevant Part cited, and refuses rather than guess when it can't ground an answer. Its answers come from retrieval over source regulations — not from a model that has memorised them. This repo
|
| 22 |
|
| 23 |
-
It is a retrieval component, not a knowledge store. It does not "know" or generate regulations.
|
| 24 |
|
| 25 |
## Unofficial & educational
|
| 26 |
|
|
@@ -29,15 +35,86 @@ Fly GACA is independent and not affiliated with, endorsed by, or operated by the
|
|
| 29 |
## Intended use (once weights are published)
|
| 30 |
|
| 31 |
- ✅ Embedding GACAR / aviation text and queries for semantic retrieval (EN/AR)
|
| 32 |
-
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
language:
|
| 4 |
+
- en
|
| 5 |
+
- ar
|
| 6 |
+
pipeline_tag: sentence-similarity
|
| 7 |
+
library_name: sentence-transformers
|
| 8 |
tags:
|
| 9 |
+
- aviation
|
| 10 |
+
- gacar
|
| 11 |
+
- saudi-arabia
|
| 12 |
+
- retrieval
|
| 13 |
+
- rag
|
| 14 |
+
- bilingual
|
| 15 |
+
- sentence-transformers
|
| 16 |
+
- embeddings
|
| 17 |
---
|
| 18 |
|
| 19 |
# CaptAdel — Fly GACA
|
| 20 |
|
| 21 |
+
**Status: in development.** This repository is the future home of the **GACAR retrieval embedding model** that powers Captain Adel, Fly GACA's independent, educational AI flight instructor for Saudi civil aviation. **Model weights are not published here yet.**
|
| 22 |
+
|
| 23 |
+
<!-- TODO: once the base model is chosen and weights are published, add it to the YAML above, e.g. base_model: intfloat/multilingual-e5-large -->
|
| 24 |
|
| 25 |
## What Captain Adel is
|
| 26 |
|
| 27 |
+
Captain Adel is a retrieval-augmented assistant: it answers GACAR questions from a curated corpus with the relevant Part cited, and refuses rather than guess when it can't ground an answer. Its answers come from retrieval over source regulations — not from a model that has memorised them. This repo holds the retrieval piece: a bilingual (Arabic / English) embedding model fine-tuned to find the right regulation for a query.
|
| 28 |
|
| 29 |
+
It is a **retrieval component, not a knowledge store**. It does not "know" or generate regulations.
|
| 30 |
|
| 31 |
## Unofficial & educational
|
| 32 |
|
|
|
|
| 35 |
## Intended use (once weights are published)
|
| 36 |
|
| 37 |
- ✅ Embedding GACAR / aviation text and queries for semantic retrieval (EN/AR)
|
| 38 |
+
- ✅ Powering the Fly GACA RAG pipeline
|
| 39 |
+
|
| 40 |
+
**Out of scope:**
|
| 41 |
+
|
| 42 |
+
- ❌ Not a source of truth for regulations
|
| 43 |
+
- ❌ Not official
|
| 44 |
+
- ❌ Not for operational decisions
|
| 45 |
+
|
| 46 |
+
## Usage
|
| 47 |
+
|
| 48 |
+
> Note: this snippet will work once weights are published.
|
| 49 |
+
|
| 50 |
+
```python
|
| 51 |
+
from sentence_transformers import SentenceTransformer
|
| 52 |
+
|
| 53 |
+
model = SentenceTransformer("flygaca/CaptAdel")
|
| 54 |
+
|
| 55 |
+
# If the base model requires prefixes (e.g. E5-style), keep them:
|
| 56 |
+
query = "query: What are the medical requirements for a Part 61 PPL?"
|
| 57 |
+
passages = [
|
| 58 |
+
"passage: GACAR Part 67 sets out the medical standards ...",
|
| 59 |
+
"passage: GACAR Part 61 covers certification of pilots ...",
|
| 60 |
+
]
|
| 61 |
+
|
| 62 |
+
q_emb = model.encode(query, normalize_embeddings=True)
|
| 63 |
+
p_emb = model.encode(passages, normalize_embeddings=True)
|
| 64 |
+
|
| 65 |
+
scores = q_emb @ p_emb.T
|
| 66 |
+
print(scores)
|
| 67 |
+
```
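The `scores` array above holds one similarity per passage. A minimal sketch of turning those scores into a ranked result, with an optional cut-off so that an empty result can be treated as "nothing grounded, refuse". The helper `rank_passages`, its `min_score` parameter, and the toy values are hypothetical illustrations, not part of the Fly GACA pipeline:

```python
from typing import List, Optional, Tuple


def rank_passages(
    scores: List[float],
    passages: List[str],
    top_k: int = 3,
    min_score: Optional[float] = None,
) -> List[Tuple[str, float]]:
    """Return up to top_k (passage, score) pairs, best first.

    min_score is a hypothetical cut-off: passages scoring below it are
    dropped, so an empty list can signal "no grounded answer".
    """
    if len(scores) != len(passages):
        raise ValueError("scores and passages must have the same length")
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    if min_score is not None:
        ranked = [pair for pair in ranked if pair[1] >= min_score]
    return ranked[:top_k]


# Invented scores standing in for `scores.tolist()`; not real model output.
toy_passages = ["passage A", "passage B"]
toy_scores = [0.82, 0.64]

hits = rank_passages(toy_scores, toy_passages, top_k=1, min_score=0.5)
no_hits = rank_passages(toy_scores, toy_passages, min_score=0.9)
print(hits)
print(no_hits)
```

Any real cut-off would need to be tuned on held-out queries; the values here are placeholders.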
|
| 68 |
+
|
| 69 |
+
**Model specifications**
|
| 70 |
+
|
| 71 |
+
| Property | Value |
|
| 72 |
+
|---|---|
|
| 73 |
+
| Base model | TBD |
|
| 74 |
+
| Embedding dimension | TBD |
|
| 75 |
+
| Max sequence length | TBD |
|
| 76 |
+
| Query/passage prefix required? | TBD (e.g. `query:` / `passage:`) |
|
| 77 |
+
| Similarity function | Cosine |
|
| 78 |
+
|
| 79 |
+
## Evaluation
|
| 80 |
+
|
| 81 |
+
Evaluation uses the Fly GACA retrieval harness against a held-out bilingual query set. Metrics are reported per language, since cross-lingual retrieval quality is typically asymmetric.
|
| 82 |
+
|
| 83 |
+
| Model | Lang | Recall@1 | Recall@5 | MRR@10 | nDCG@10 |
|
| 84 |
+
|---|---|---|---|---|---|
|
| 85 |
+
| CaptAdel | EN | TBD | TBD | TBD | TBD |
|
| 86 |
+
| CaptAdel | AR | TBD | TBD | TBD | TBD |
|
| 87 |
+
| Base model (baseline) | EN | TBD | TBD | TBD | TBD |
|
| 88 |
+
| Base model (baseline) | AR | TBD | TBD | TBD | TBD |
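For readers unfamiliar with the column headers, here is a minimal per-query sketch of the three metrics, assuming binary relevance (a chunk is either relevant or not). The function names and the toy ids are hypothetical, and the Fly GACA harness may define the metrics differently, for example with graded relevance:

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mrr_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant id in the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """nDCG with binary gains: DCG of the ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal if ideal else 0.0


# Toy example: the single relevant chunk is retrieved at rank 2.
ranked_ids = ["c7", "c2", "c9"]
relevant_ids = {"c2"}

print(recall_at_k(ranked_ids, relevant_ids, 1))
print(recall_at_k(ranked_ids, relevant_ids, 5))
print(mrr_at_k(ranked_ids, relevant_ids, 10))
print(ndcg_at_k(ranked_ids, relevant_ids, 10))
```

Each table cell would then be the mean of the per-query value over the queries in that language.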
|
| 89 |
+
|
| 90 |
+
## Training data
|
| 91 |
+
|
| 92 |
+
Fine-tuned on query–passage pairs derived from the curated Fly GACA corpus (the 74 GACAR Parts, topical handbooks and the reference shelf). The source regulations remain GACA's; this model is trained only to retrieve over them.
|
| 93 |
+
|
| 94 |
+
- **Corpus scope:** GACAR / AIP source regulations (EN + AR)
|
| 95 |
+
- **Pair mining:** TBD (describe how query–passage pairs were generated)
|
| 96 |
+
- **Language balance:** TBD
|
| 97 |
+
- **Chunking:** TBD (chunk size / overlap)
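To make the "chunk size / overlap" placeholder concrete, here is a minimal sketch of one common approach, a sliding window over whitespace-separated words. This is an assumption for illustration only: the function `chunk_words` and the sizes used are hypothetical, and the actual Fly GACA chunking scheme is still TBD:

```python
from typing import List


def chunk_words(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into windows of chunk_size words.

    Consecutive windows share `overlap` words, so a sentence cut at one
    window boundary still appears whole in a neighbouring window.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Toy input; real chunks would be far longer than four words.
chunks = chunk_words("a b c d e f g", chunk_size=4, overlap=2)
print(chunks)
```

Splitting on regulation structure (sections, paragraphs) instead of a fixed word count is an equally plausible choice; which one applies here is not stated.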
|
| 98 |
+
|
| 99 |
+
## Limitations & bias
|
| 100 |
+
|
| 101 |
+
- May underperform on out-of-domain queries (non-GACAR aviation topics).
|
| 102 |
+
- Arabic performance may vary between Modern Standard Arabic and dialectal phrasing.
|
| 103 |
+
- Retrieval reflects the corpus snapshot it was trained on — it will not be aware of amended or superseded regulations published after that snapshot.
|
| 104 |
+
- It retrieves; it does not verify currency. Always confirm against the latest official GACA publication.
|
| 105 |
+
|
| 106 |
+
## Versioning
|
| 107 |
+
|
| 108 |
+
| Version | Corpus snapshot | Base model | Notes |
|
| 109 |
+
|---|---|---|---|
|
| 110 |
+
| TBD | TBD | TBD | Initial release |
|
| 111 |
+
|
| 112 |
+
## Links
|
| 113 |
+
|
| 114 |
+
- **Project:** https://flygaca.com
|
| 115 |
+
- **Source:** https://github.com/FlyGACA/flygaca
|
| 116 |
+
- **Contact:** hello@flygaca.com
|
| 117 |
+
|
| 118 |
+
---
|
| 119 |
+
|
| 120 |
+
© Fly GACA · Independent of GACA · Made in the Kingdom · صُنع في السعودية
|