Commit 29870ea by flygaca (verified) · 1 Parent(s): 2ef8d84

Update README.md

Files changed (1)
  1. README.md +100 -23
README.md CHANGED
@@ -1,26 +1,32 @@
1
  ---
2
  license: apache-2.0
3
  language:
4
- - en
5
- - ar
 
 
6
  tags:
7
- - aviation
8
- - gacar
9
- - saudi-arabia
10
- - retrieval
11
- - rag
12
- - bilingual
 
 
13
  ---
14
 
15
  # CaptAdel — Fly GACA
16
 
17
- Status: in development. This repository is the future home of the **GACAR retrieval embedding model** that powers Captain Adel, Fly GACA's independent, educational AI flight instructor for Saudi civil aviation. Model weights are not published here yet.
 
 
18
 
19
  ## What Captain Adel is
20
 
21
- Captain Adel is a retrieval-augmented assistant: it answers GACAR questions from a curated corpus with the relevant Part cited, and refuses rather than guess when it can't ground an answer. Its answers come from retrieval over source regulations — not from a model that has memorised them. This repo will hold the retrieval piece: a bilingual (Arabic / English) embedding model fine-tuned to find the right regulation for a query.
22
 
23
- It is a retrieval component, not a knowledge store. It does not "know" or generate regulations.
24
 
25
  ## Unofficial & educational
26
 
@@ -29,15 +35,86 @@ Fly GACA is independent and not affiliated with, endorsed by, or operated by the
29
  ## Intended use (once weights are published)
30
 
31
  - ✅ Embedding GACAR / aviation text and queries for semantic retrieval (EN/AR)
32
- - - ✅ Powering the Fly GACA RAG pipeline
33
- - - ❌ Not a source of truth for regulations · ❌ not official · ❌ not for operational decisions
34
-
35
- - ## Evaluation
36
-
37
- - Evaluated with the Fly GACA retrieval harness (recall@k, MRR, nDCG) against a held-out bilingual query set; numbers reported here when the model ships.
38
- - ## Links
39
- - Project: https://flygaca.com
40
- - - Source: https://github.com/FlyGACA/flygaca
41
- - - Contact: hello@flygaca.com
42
-
43
- - © Fly GACA · Independent of GACA · Made in the Kingdom · صُنع في السعودية
 
1
  ---
2
  license: apache-2.0
3
  language:
4
+ - en
5
+ - ar
6
+ pipeline_tag: sentence-similarity
7
+ library_name: sentence-transformers
8
  tags:
9
+ - aviation
10
+ - gacar
11
+ - saudi-arabia
12
+ - retrieval
13
+ - rag
14
+ - bilingual
15
+ - sentence-transformers
16
+ - embeddings
17
  ---
18
 
19
  # CaptAdel — Fly GACA
20
 
21
+ **Status: in development.** This repository is the future home of the **GACAR retrieval embedding model** that powers Captain Adel, Fly GACA's independent, educational AI flight instructor for Saudi civil aviation. **Model weights are not published here yet.**
22
+
23
+ <!-- TODO: once the base model is chosen and weights are published, add it to the YAML above, e.g. base_model: intfloat/multilingual-e5-large -->
24
 
25
  ## What Captain Adel is
26
 
27
+ Captain Adel is a retrieval-augmented assistant: it answers GACAR questions from a curated corpus with the relevant Part cited, and refuses rather than guess when it can't ground an answer. Its answers come from retrieval over source regulations, not from a model that has memorised them. This repo will hold the retrieval piece: a bilingual (Arabic / English) embedding model fine-tuned to find the right regulation for a query.
28
 
29
+ It is a **retrieval component, not a knowledge store**. It does not "know" or generate regulations.
30
 
31
  ## Unofficial & educational
32
 
 
35
  ## Intended use (once weights are published)
36
 
37
  - ✅ Embedding GACAR / aviation text and queries for semantic retrieval (EN/AR)
38
+ - ✅ Powering the Fly GACA RAG pipeline
39
+
40
+ **Out of scope:**
41
+
42
+ - ❌ Not a source of truth for regulations
43
+ - ❌ Not official
44
+ - ❌ Not for operational decisions
45
+
46
+ ## Usage
47
+
48
+ > Note: this snippet will work once weights are published.
49
+
50
+ ```python
51
+ from sentence_transformers import SentenceTransformer
52
+
53
+ model = SentenceTransformer("flygaca/CaptAdel")
54
+
55
+ # If the base model requires prefixes (e.g. E5-style), keep them:
56
+ query = "query: What are the medical requirements for a Part 61 PPL?"
57
+ passages = [
58
+ "passage: GACAR Part 67 sets out the medical standards ...",
59
+ "passage: GACAR Part 61 covers certification of pilots ...",
60
+ ]
61
+
62
+ q_emb = model.encode(query, normalize_embeddings=True)
63
+ p_emb = model.encode(passages, normalize_embeddings=True)
64
+
65
+ scores = q_emb @ p_emb.T
66
+ print(scores)
67
+ ```
68
+
69
+ **Model specifications**
70
+
71
+ | Property | Value |
72
+ |---|---|
73
+ | Base model | TBD |
74
+ | Embedding dimension | TBD |
75
+ | Max sequence length | TBD |
76
+ | Query/passage prefix required? | TBD (e.g. `query:` / `passage:`) |
77
+ | Similarity function | Cosine |
78
+
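The specification table lists cosine as the similarity function, and the usage snippet passes `normalize_embeddings=True` before taking a dot product. The following is a minimal, dependency-free sketch of why those two agree: for L2-normalized vectors, the dot product equals cosine similarity. The helper names (`l2_normalize`, `dot`, `cosine`, `rank_passages`) and the toy 2-dimensional vectors are illustrative assumptions, not part of the Fly GACA pipeline.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper for illustration)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def rank_passages(query_vec, passage_vecs):
    """Return (passage indices sorted best-first, raw scores).

    Scores are dot products of L2-normalized vectors, which is the same
    quantity as cosine similarity of the original vectors.
    """
    q = l2_normalize(query_vec)
    scores = [dot(q, l2_normalize(p)) for p in passage_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


if __name__ == "__main__":
    # Toy vectors standing in for real embeddings (assumption).
    query = [1.0, 0.0]
    passages = [[0.0, 2.0], [3.0, 3.0], [5.0, 0.0]]
    order, scores = rank_passages(query, passages)
    print(order, scores)
```

Normalizing once at indexing time means each query only needs a dot product per passage, which is why embedding pipelines often store unit-length vectors.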
79
+ ## Evaluation
80
+
81
+ To be evaluated with the Fly GACA retrieval harness against a held-out bilingual query set; the numbers below will be filled in when the model ships. Metrics are reported per language, since cross-lingual retrieval quality is typically asymmetric.
82
+
83
+ | Model | Lang | Recall@1 | Recall@5 | MRR@10 | nDCG@10 |
84
+ |---|---|---|---|---|---|
85
+ | CaptAdel | EN | TBD | TBD | TBD | TBD |
86
+ | CaptAdel | AR | TBD | TBD | TBD | TBD |
87
+ | Base model (baseline) | EN | TBD | TBD | TBD | TBD |
88
+ | Base model (baseline) | AR | TBD | TBD | TBD | TBD |
89
+
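For readers unfamiliar with the column headers in the table above, here is a rough sketch of how Recall@k, MRR@10 and nDCG@10 are commonly computed for a single query with binary relevance. It is not the Fly GACA retrieval harness; the function names and the passage IDs in the example are assumptions made for illustration, and the ranked list is assumed to contain no duplicate IDs.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant passages that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant passage within the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


if __name__ == "__main__":
    # Hypothetical passage IDs; the relevant passage is ranked second.
    ranked = ["part-61", "part-67", "part-91"]
    relevant = {"part-67"}
    print(recall_at_k(ranked, relevant, 1))
    print(recall_at_k(ranked, relevant, 5))
    print(mrr_at_k(ranked, relevant))
    print(ndcg_at_k(ranked, relevant))
```

Per-query values like these would be averaged over the query set, separately for English and Arabic queries, to fill one row of the table.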
90
+ ## Training data
91
+
92
+ Fine-tuned on query–passage pairs derived from the curated Fly GACA corpus (the 74 GACAR Parts, topical handbooks and the reference shelf). The source regulations remain GACA's; this model is trained only to retrieve over them.
93
+
94
+ - **Corpus scope:** GACAR / AIP source regulations (EN + AR)
95
+ - **Pair mining:** TBD (describe how query–passage pairs were generated)
96
+ - **Language balance:** TBD
97
+ - **Chunking:** TBD (chunk size / overlap)
98
+
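The chunking parameters above are still TBD. Purely as an illustration of what "chunk size / overlap" means, the sketch below splits text into fixed-size word windows that share a few words with their neighbours. The function name `chunk_words`, the word-based splitting, and the default sizes are all assumptions; the actual Fly GACA chunking strategy is not documented here and may differ (for example, it could split on regulation section boundaries instead).

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into word windows of `chunk_size` sharing `overlap` words.

    Hypothetical defaults; the real values are listed as TBD in this README.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(sample, chunk_size=4, overlap=1):
        print(chunk)
```

Overlap trades index size for robustness: a sentence that straddles a chunk boundary still appears whole in at least one chunk when the overlap is long enough.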
99
+ ## Limitations & bias
100
+
101
+ - May underperform on out-of-domain queries (non-GACAR aviation topics).
102
+ - Arabic performance may vary between Modern Standard Arabic and dialectal phrasing.
103
+ - Retrieval reflects the corpus snapshot it was trained on — it will not be aware of amended or superseded regulations published after that snapshot.
104
+ - It retrieves; it does not verify currency. Always confirm against the latest official GACA publication.
105
+
106
+ ## Versioning
107
+
108
+ | Version | Corpus snapshot | Base model | Notes |
109
+ |---|---|---|---|
110
+ | TBD | TBD | TBD | Initial release |
111
+
112
+ ## Links
113
+
114
+ - **Project:** https://flygaca.com
115
+ - **Source:** https://github.com/FlyGACA/flygaca
116
+ - **Contact:** hello@flygaca.com
117
+
118
+ ---
119
+
120
+ © Fly GACA · Independent of GACA · Made in the Kingdom · صُنع في السعودية