cmd0160 committed
Commit 698ce25 · 1 Parent(s): 263fa05

Adding auto embedding

.DS_Store CHANGED
Binary files a/.DS_Store and b/.DS_Store differ
 
RAG_APP_README.md ADDED
@@ -0,0 +1,54 @@
+ # Abalone RAG Chatbot
+
+ This project implements a Retrieval-Augmented Generation (RAG) chatbot about Abalone using LangChain + OpenAI with a Streamlit frontend. It's designed to be deployed on Hugging Face Spaces.
+
+ Contents
+ - `app.py` - Streamlit app entrypoint
+ - `src/ingest.py` - Ingest files from `data/` into a persisted Chroma vectorstore
+ - `src/vectorstore.py` - Helpers to build/load the Chroma vectorstore and return a retriever
+ - `src/qa_chain.py` - Build the conversational retrieval QA chain
+ - `data/` - Put Abalone source files here (CSV/MD/TXT/PDF)
+ - `vectorstore/` - Persisted vectorstore directory (created by ingestion)
+
+ Quickstart (local)
+
+ 1. Create a venv and install dependencies:
+
+ ```bash
+ python -m venv .venv
+ source .venv/bin/activate
+ pip install -r requirements.txt
+ ```
+
+ 2. Set your OpenAI API key:
+
+ ```bash
+ export OPENAI_API_KEY="sk-..."
+ ```
+
+ 3. Add Abalone files into `data/` (for example `abalone.csv`).
+
+ 4. Build the vectorstore:
+
+ ```bash
+ python -m src.ingest --data-dir ./data --persist-dir ./vectorstore
+ ```
+
+ 5. Run the Streamlit app:
+
+ ```bash
+ streamlit run app.py
+ ```
+
+ Deploying to Hugging Face Spaces
+
+ - Add `OPENAI_API_KEY` in the Spaces secrets (Settings -> Secrets).
+ - Push this repository to your HF Space. HF will install `requirements.txt` and run the Streamlit app.
+ - On first run, click the "Ingest data" button or allow the app to rebuild the index.
+
+ Security
+ - Do NOT commit your OpenAI API key. Use HF Spaces Secrets for deployment.
+
+ License
+ - MIT
+
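
Note on the README's ingestion step: this commit does not show `src/ingest.py`, only its command line. As a rough, dependency-free sketch of what an ingestion step typically does before embedding, the snippet below reads text files from a directory and splits them into overlapping chunks. The names `chunk_text` and `load_chunks`, the chunk size, and the overlap are all hypothetical; the real module may well rely on LangChain loaders and splitters instead.

```python
from pathlib import Path


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at
    a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


def load_chunks(data_dir: str, pattern: str = "*.txt") -> list[dict]:
    """Read every matching file in data_dir and return one record per chunk.

    Hypothetical record layout: the "source" key mirrors the metadata key
    that the UI code in this commit looks up first.
    """
    records = []
    for path in sorted(Path(data_dir).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        for i, chunk in enumerate(chunk_text(text)):
            records.append({"source": str(path), "chunk": i, "text": chunk})
    return records


# Small in-memory demonstration of the overlap behaviour.
print(chunk_text("abcdefghij", chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

Each record would then be embedded and written to the persisted store; that part depends on the embedding and vector store libraries in `requirements.txt` and is left out here.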
app.py CHANGED
@@ -109,16 +109,32 @@ st.success("Vectorstore and retriever are ready.")

  chain = make_conversational_chain(retriever, model_name=model_name)

+ def format_source_label(meta: dict, index: int) -> str:
+     source = (
+         meta.get("source")
+         or meta.get("file_path")
+         or meta.get("path")
+         or meta.get("document_id")
+         or "Unknown source"
+     )
+     return f"[{index}] {source}"
+
  if st.session_state["chat_history"]:
      st.subheader("Conversation")
      for i, turn in enumerate(st.session_state["chat_history"]):
          st.markdown(f"**You:** {turn['question']}")
          st.markdown(f"**Abalone Bot:** {turn['answer']}")
-         if turn.get("sources"):
-             with st.expander(f"Show sources for question {i + 1}"):
-                 for j, src in enumerate(turn["sources"], start=1):
-                     st.markdown(f"**Source {j}:**")
-                     meta = src.get("metadata", {})
+         sources = turn.get("sources") or []
+         if sources:
+             st.markdown("**Sources:**")
+             for src in sources:
+                 label = format_source_label(src.get("metadata", {}) or {}, src.get("index", 0))
+                 st.markdown(f"- {label}")
+             with st.expander(f"Show source details for question {i + 1}"):
+                 for src in sources:
+                     label = format_source_label(src.get("metadata", {}) or {}, src.get("index", 0))
+                     st.markdown(f"**{label}**")
+                     meta = src.get("metadata", {}) or {}
                      if meta:
                          st.write(meta)
                      preview = src.get("content_preview", "")
@@ -156,7 +172,7 @@ if send_clicked and user_input:
      source_docs = result.get("source_documents") or []

      sources_for_ui = []
-     for sd in source_docs:
+     for idx, sd in enumerate(source_docs, start=1):
          if isinstance(sd, dict):
              meta = sd.get("metadata", {}) or {}
              content_preview = sd.get("page_content") or sd.get("content") or sd.get("text", "")
@@ -169,6 +185,7 @@ if send_clicked and user_input:
              content_preview = ""
          sources_for_ui.append(
              {
+                 "index": idx,
                  "metadata": meta,
                  "content_preview": str(content_preview)[:500],
              }
@@ -188,13 +205,20 @@ if send_clicked and user_input:
      st.markdown(f"**Abalone Bot:** {answer}")

      if sources_for_ui:
-         with st.expander("Show sources for this answer"):
-             for i, src in enumerate(sources_for_ui, start=1):
-                 st.markdown(f"**Source {i}:**")
-                 if src["metadata"]:
-                     st.write(src["metadata"])
-                 if src["content_preview"]:
-                     st.write(src["content_preview"])
+         st.markdown("**Sources:**")
+         for src in sources_for_ui:
+             label = format_source_label(src.get("metadata", {}) or {}, src.get("index", 0))
+             st.markdown(f"- {label}")
+         with st.expander("Show source details for this answer"):
+             for src in sources_for_ui:
+                 label = format_source_label(src.get("metadata", {}) or {}, src.get("index", 0))
+                 st.markdown(f"**{label}**")
+                 meta = src.get("metadata", {}) or {}
+                 if meta:
+                     st.write(meta)
+                 preview = src.get("content_preview", "")
+                 if preview:
+                     st.write(preview)

  elif send_clicked and not user_input:
      st.warning("Please enter a question before clicking Send.")
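
Note on the `app.py` change: the source-label logic can be exercised without Streamlit. The sketch below copies `format_source_label` as added in this commit and pairs it with a hypothetical `normalize_sources` function that mirrors the numbering loop. The `FakeDocument` class and the non-dict branch are assumptions, since the diff elides how non-dict documents are handled.

```python
def format_source_label(meta: dict, index: int) -> str:
    """Same fallback order as the helper added in this commit."""
    source = (
        meta.get("source")
        or meta.get("file_path")
        or meta.get("path")
        or meta.get("document_id")
        or "Unknown source"
    )
    return f"[{index}] {source}"


class FakeDocument:
    """Hypothetical stand-in for a retriever document object."""

    def __init__(self, page_content: str, metadata: dict):
        self.page_content = page_content
        self.metadata = metadata


def normalize_sources(source_docs: list) -> list[dict]:
    """Number retrieved documents from 1 and reduce them to plain dicts."""
    sources_for_ui = []
    for idx, sd in enumerate(source_docs, start=1):
        if isinstance(sd, dict):
            meta = sd.get("metadata", {}) or {}
            content = sd.get("page_content") or sd.get("content") or sd.get("text", "")
        else:
            # Assumed branch: read attributes, falling back to empty values.
            meta = getattr(sd, "metadata", None) or {}
            content = getattr(sd, "page_content", "") or ""
        sources_for_ui.append(
            {
                "index": idx,
                "metadata": meta,
                "content_preview": str(content)[:500],
            }
        )
    return sources_for_ui


docs = [
    {"metadata": {"source": "data/abalone.txt"}, "page_content": "Abalone shells are rounded or oval."},
    FakeDocument("Red abalone reach 12.3 inches in length.", {"file_path": "data/abalone.txt"}),
    {"metadata": None, "text": "no metadata here"},
]
labels = [format_source_label(s["metadata"], s["index"]) for s in normalize_sources(docs)]
print(labels)
# ['[1] data/abalone.txt', '[2] data/abalone.txt', '[3] Unknown source']
```

Storing `index` alongside each source keeps the labels stable when the chat history is re-rendered. Entries saved before this change have no `index` key, so `src.get("index", 0)` labels them `[0]`.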
data/abalone.txt ADDED
@@ -0,0 +1,165 @@
+ Abalone
+ Description
+ Abalones are members of a large class (Gastropoda) of molluscs having one-piece shells. They belong to
+ the family Haliotidae and the genus Haliotis, which means sea ear, referring to the flattened shape of the
+ shell.
+ Abalone shells are rounded or oval with a large dome towards one end. The shell has a row of respiratory
+ pores. The muscular foot has strong suction power permitting the abalone to clamp tightly to rocky
+ surfaces. An epipodium, a sensory structure and extension of the foot that bears tentacles, circles the foot
+ and projects beyond the shell edge in the living abalone. Nine species of abalone occur in North America:
+ black (H. cracherodii), flat (H. walallensis), green (H. fulgens), pink (H. corrugata), pinto (H.
+ kamtschatkana), red (H. rufescens), threaded (H. assimilis), Western Atlantic (H. pourtalesii), and white
+ (H. sorenseni) abalone.
+ Black abalone (H. cracherodii) have black and smooth epipodium and tentacles. The shell surface is
+ black or dark blue, and smooth. There are 5 to 9 open pores, and the pores are flush with the shell
+ surface. Black abalone range from Mendocino County, California to southern Baja California. They are
+ found in intertidal and shallow subtidal zones down to a depth of about 20 feet. Black abalone reach 7.75
+ inches in length, but are commonly 5 to 6 inches long.
+ Flat abalone (H. walallensis) have a mottled yellowish and brown epipodium, with a pebbly appearing
+ surface and lacy edge. The tentacles are greenish and slender. The shell is flattened, narrow, and
+ marked with low ribs. There are 5 to 6 open pores, and the pore edges are moderately elevated above
+ the shell surface. Flat abalone range from British Columbia, Canada to San Diego, California. They are
+ found in the subtidal zone from 20 feet down to at least 70 feet. Flat abalone reach 7 inches in length, but
+ are commonly under 5 inches.
+ Green abalone (H. fulgens) have a mottled cream and brown epipodium, with tubercles scattered on the
+ surface and a frilly edge. The tentacles are olive green. The shell is usually brown, and its surface marked
+ with many low, flat-topped ribs that run parallel to the pores. There are 5 to 7 open pores, and the pore
+ edges are elevated above the shell surface. A groove often parallels the outer edge of the line of pores.
+ Green abalone range from Point Conception, California to Bahia Magdalena, Baja California. They are
+ found in the intertidal and subtidal zones down to at least 30 feet. Green abalone are often found in
+ crevices where surfgrass and algal cover is dense. They reach 10 inches in length, but are generally
+ smaller.
+ Pink abalone (H. corrugata) have a mottled black and white epipodium with many tubercles on the
+ surface and a lacy edge. The foot is yellow to light orange. The tentacles are black. The shell is thick and
+ its surface is marked with wavy corrugations. There are 2 to 4 open pores, and pore edges are strongly
+ elevated above the surface. Pink abalone range from Point Conception, California to Santa Maria Bay,
+ Baja California. They are found in the subtidal zone from 20 feet down to at least 120 feet, commonly in
+ beds of giant kelp. Pink abalone reach 10 inches in length, but individuals over 7 inches long are now
+ rare.
+ Pinto abalone (H. kamtschatkana) have a mottled pale yellow to dark brown epipodium, with a pebbly
+ appearing surface and lacy edge. Tentacles are yellowish brown, or occasionally green, and thin. The
+ shell is irregularly mottled and narrow. There are 3 to 6 open pores, and the pore edges are elevated
+ above the shell surface. A groove often parallels the line of pores. Pinto abalone range from Sitka, Alaska
+ to Monterey, California. They are found in the intertidal and subtidal zones down to at least 70 feet. Pinto
+ abalone reach 6.49 inches in length, but are commonly 4 inches long. Pinto abalone are also known
+ regionally as northern abalone.
+ Red abalone (H. rufescens) usually have a black epipodium, but some specimens have a barred black
+ and cream pattern on their epipodium. The surface of the epipodium is smooth and broadly scalloped
+ along the edge. The area around the foot is black and the sole is tan to grey. The tentacles are black. The
+ shell surface is generally brick red and the inside edge is often red. There are 3 to 4 open pores, and the
+ pores are moderately elevated above the shell surface. Red abalone range from Sunset Bay, Oregon to
+ Tortugas, Baja California. North of Point Conception, they are found in the intertidal and subtidal zones
+ down to at least 60 feet. South of Point Conception, they are found in the subtidal zone down to over 100
+ feet. Red abalone reach 12.3 inches in length, but are commonly 7 to 9 inches long.
+ Threaded abalone (H. assimilis) have a mottled pale yellow to dark brown epipodium with a pebbly
+ appearing surface and frilly edge. The tentacles are yellowish brown, short and thin. The shell is oval and
+ the surface is marked with prominent ribs interspersed with narrow ones. There are 4 to 6 open pores,
+ and the pores are moderately elevated above the shell surface. Threaded abalone range from San Luis
+ Obispo County, California to Bahia Tortugas, Baja California. They are found in the subtidal zone from 20
+ feet down to at least 80 feet, commonly on rock surfaces. Threaded abalone reach 6 inches in length, but
+ are commonly smaller. Threaded abalone are considered a subspecies of the pinto abalone by some
+ scientists.
+ Western Atlantic abalone (H. pourtalesii) have a yellowish epipodium with large and small sensory
+ tentacles. The sole of the foot is tan. The shell is reddish-orange. Western Atlantic abalone range from
+ North Carolina through the Gulf of Mexico to Brazil. They are found from 187 feet down to at least 1,200
+ feet on hard substrates. The largest recorded shell had a length of about 1.2 inches.
+ White abalone (H. sorenseni) have a tan and pebbly epipodium. The sole of the foot is orange. The shell
+ is deep, thin and oval. There are 3 to 5 open pores, and the edges of the pores are elevated above the
+ shell surface. White abalone range from Point Conception to Bahia Tortugas, Baja California. Most white
+ abalone are found in the Channel Islands in California. White abalone are found in the subtidal zone
+ down to at least 200 feet. They are commonly found in open, exposed areas. White abalone reach 10
+ inches in length, but are commonly 5 to 8 inches long.
+ Natural History
+ Abalones reach sexual maturity at a small size, and fertility is high and increases exponentially with size.
+ Sexes are separate and fertilization is external. The eggs and sperm are broadcast into the water through the
+ pores with the respiratory current. A 1.5 inch abalone may spawn 10,000 eggs or more at a time, while an
+ 8 inch abalone may spawn 11 million or more eggs. The spawning season varies among species with
+ black, green and pink abalone spawning between spring and fall, and pinto abalone spawning during the
+ summer. Red abalone in some locations spawn throughout the year. The fertilized eggs hatch into
+ floating larvae that feed on plankton until their shells begin to form. Once the shell forms, the juvenile
+ abalone sinks to the bottom where it clings to rocks and crevices with its single powerful foot. Settling
+ rates appear to be variable. After settling, abalones change their diet and feed on macroalgae.
+ Except for black abalone, hybridization for abalone species is not uncommon in areas where several
+ species occur together. There are 12 recognized hybrids in southern California and northern Baja
+ California.
+ Limited growth information is available for abalones. Commercial sizes of 6.25 inches for pinks, seven
+ inches for greens and 7.75 inches for reds are reached after a minimum of 10 to 15 years in southern
+ California. Pinto abalone reach about 2.5 inches in a minimum of 6 years.
+ Juvenile abalones feed on rock-encrusting coralline algae and on diatom and bacterial films. Adult
+ abalones feed primarily on loose pieces of marine algae drifting with the surge or current. Large brown
+ algae such as giant kelp, bull kelp, feather boa kelp and elk kelp are preferred, although other species of
+ algae may be eaten at various times.
+ Abalone eggs and larvae are consumed by filter-feeding fish and shellfish. Predators of juvenile abalones
+ include crabs, lobsters, gastropods, octopuses, seastars, and fishes. The bat ray in southern California
+ and the sea otter in central California prey selectively on larger abalones.
+ Production
+ In decreasing order of total catch between 1950 and 1995, red (46.6%), pink (41.2%), black (8.7%), green
+ (3.5%), and white (<1%) abalones have all been harvested in California. Since 1993, only red abalone
+ have been taken commercially, and the Fish and Game Commission closed all red abalone harvest south
+ of San Francisco in May 1997. Pinto abalone are commercially harvested in Alaska and British Columbia.
+ Flat and threaded abalones have limited distributions and neither is common. The Western Atlantic
+ abalone is rare and is not fished commercially.
+ Aquaculture of red, pink, and green abalones occurs in California. There is limited aquaculture of green
+ and H. diversicolor supertexta abalones in Hawaii.
+ California. The commercial fishery for abalones in California began in the 1850's. Chinese Americans
+ initially harvested intertidal green and black abalones with skiffs using long, hooked poles. This fishery
+ was eliminated in California in 1900 by closure of shallow waters to commercial harvest. Japanese
+ American divers followed the Chinese Americans as the fishery moved to the subtidal zone. Initially, free
+ divers working from barrel floats harvested abalones. Later, hard-hat divers harvested abalones from
+ deeper waters. In the late 1950's, "hooka" gear, which supplied air from the surface to divers using light
+ masks, fins and wet suits, began replacing hard-hat gear. Since the 1970's, multi-hose hooka gear and
+ specialized, high-speed, seaworthy boats have become common in the fishery.
+ In California, abalone divers must use underwater diving gear consisting of an above-surface air pump
+ operated from a boat and at least 100 feet of air hose, and must be fully submerged while taking abalone.
+ Abalones may be taken only by hand or with abalone irons. An abalone iron is a flat device not more than
+ 36 inches long and not less than 1/16 inch thick, with rounded smooth edges and a curve with a radius of less
+ than 18 inches. The commercial abalone fishery in California is managed through size limits, limits on the
+ number of permits for commercial abalone divers, and restrictions on harvesting areas. Minimum
+ commercial size limits in California are: 7-3/4 inches for red abalone, 7 inches for green abalone, 6-1/4
+ inches for pink or white abalone, 5-3/4 inches for black abalone, and 4 inches for pinto, threaded, and flat
+ abalone. Commercial harvesting is prohibited during January, February and August. A moratorium on
+ commercial harvesting of black abalone began in July, 1993, and extends through January 1, 1997. It is
+ unlikely that stocks of black abalone will recover enough for the fishery to reopen. In June, 1994, the
+ California Department of Fish and Game proposed and the Fish and Game Commission adopted
+ effective January 1, 1995 a two-year closure on sport and commercial harvesting of pink, green and white
+ abalone. Prices to fishermen for red abalone were around $500 to $600 per dozen in 1993-94.
+ The California commercial abalone harvest reached a record 5.4 million pounds in 1957. Since then,
+ commercial harvests have declined dramatically to about 461,376 pounds in 1993. Current stocks of most
+ abalone species in central and southern California are overutilized. This is the combined result of
+ commercial harvest efficiency, increased market demand, sport fishery expansion, an expanding
+ population of sea otters, pollution of mainland habitat, unexplained mortalities of black abalone due to a
+ condition known as "withering syndrome," and loss of kelp populations associated with El Niño events.
+ Management efforts through size limits and limits on commercial harvesting permits have been
+ ineffective. Reseeding experiments have not been successful. Commercial abalone harvesting in
+ California may be eliminated if the sea otter range is not contained. Studies in a California fishery reserve
+ have shown that even protected populations cannot support a fishery within the sea otter range in central
+ California. New laws pending in the 1997 Legislature would establish a multi-year moratorium on the
+ commercial and recreational harvest of all species of abalone south of the entrance to San Francisco Bay
+ until stocks have demonstrated some level of recovery and a new management plan is in effect.
+ Alaska. The southeast Alaska commercial abalone fishery was sporadic and local prior to 1971. Shore
+ picking was the primary harvesting method, but after 1960 some scuba gear was used. The fishery
+ increased dramatically during the 1970's due to improved scuba gear, increased product demand, and
+ the use of larger vessels. The Alaska abalone harvest reached a record 315,000 pounds in 1978-79, and
+ then fell to about 36,000 pounds in 1992-93 when a minimum size limit was instituted. The Alaska pinto
+ abalone fishery is managed through guideline harvest ranges, a minimum legal size of 3.75 inches, a
+ restrictive season, and local area closures for conservation and food fisheries. The fishery opens in
+ October to remain outside spawning and settling periods. Guideline harvests prior to 1988-89 varied
+ from 33,000 to 57,000 pounds per year. The season was shortened each year, and in 1993-94 the most
+ productive areas were closed after 6 days and a catch of 37,000 pounds.
+ British Columbia. Prior to 1971, the British Columbia commercial pinto abalone fishery was sporadic
+ and local. Shore picking was the main harvest method, but after 1960 some scuba gear was used. The
+ fishery accelerated rapidly during the 1970's due to improved scuba gear, reduced access to herring and
+ salmon fisheries, acceptance of the pinto abalone in the Japanese market, increased product demand,
+ and the introduction of larger vessels with freezer capacity. Abalone landings peaked in 1977 at 474.8
+ metric tons and then declined rapidly. The fishery was later closed to rebuild stocks.
+ Products
+ During the early years of the abalone fishery, abalones were dried and smoked, or canned for export, and
+ sold fresh for local markets. Currently, most abalones are exported to Japan, either fresh or frozen whole.
+ The U.S. market is primarily in California for live abalone for the sashimi market, and for some fresh and
+ frozen steaks for restaurants.
+ A major shift in U.S. marketing occurred after the black abalone moratorium in 1993. Red abalone became
+ the primary export product. High prices increased the incentive for illegal harvesting in closed areas.
+ Abalone steaks are produced by removing the foot, trimming, slicing, and tenderizing. Yield from a live
+ abalone is roughly 15 percent. Shells are used for mother-of-pearl, souvenirs, and jewelry.
+ References
+ (References section preserved exactly as in the PDF.)
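
Note on the fecundity figures in `data/abalone.txt`: the text says fertility "increases exponentially with size" and quotes two data points (about 10,000 eggs at 1.5 inches, about 11 million at 8 inches). As a back-of-the-envelope sketch only, the snippet below fits a curve of the form `eggs = a * exp(b * length)` through those two points. The single-exponential model and the function names are assumptions made for illustration, and the source gives both figures as lower bounds ("or more").

```python
import math


def fit_exponential(x1: float, y1: float, x2: float, y2: float) -> tuple[float, float]:
    """Return (a, b) such that y = a * exp(b * x) passes through both points."""
    b = math.log(y2 / y1) / (x2 - x1)
    a = y1 / math.exp(b * x1)
    return a, b


# The two data points quoted in the text (length in inches, eggs per spawning).
a, b = fit_exponential(1.5, 10_000, 8.0, 11_000_000)


def eggs(length_inches: float) -> float:
    """Egg count predicted by the fitted curve (illustrative only)."""
    return a * math.exp(b * length_inches)


print(f"growth rate b = {b:.3f} per inch")
print(f"egg count doubles every {math.log(2) / b:.2f} inches of shell length")
```

Under this assumed model the rate works out to roughly 1.08 per inch, which means egg output nearly triples with each additional inch of shell length.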
src/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (164 Bytes).

src/__pycache__/ingest.cpython-310.pyc ADDED
Binary file (2.67 kB).

src/__pycache__/qa_chain.cpython-310.pyc ADDED
Binary file (1.45 kB).

src/__pycache__/vectorstore.cpython-310.pyc ADDED
Binary file (1.18 kB).
 
vectorstore/chroma-collections.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d2f2f346a015c1ffec6f8a3c535ac4ea2a99fe14f441a424e373b42248ac0fbe
+ size 601
vectorstore/chroma-embeddings.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7f26b3ec8f49b9a9e03eebdc6185167fcb1a871609e75b9471817a8e0c5cdaf3
+ size 208
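
Note on the two `vectorstore/*.parquet` entries: what the diff shows is a Git LFS pointer file, not the parquet data itself. A pointer is a short text file with `version`, `oid`, and `size` lines. The sketch below parses one into a dictionary; `parse_lfs_pointer` is a hypothetical helper written for illustration, not part of this repository or of the Git LFS tooling.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    if not fields.get("version", "").startswith("https://git-lfs.github.com/spec/"):
        raise ValueError("not a Git LFS pointer file")
    # The oid line has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields.get("oid", "").partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),  # size of the real object in bytes
    }


pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:d2f2f346a015c1ffec6f8a3c535ac4ea2a99fe14f441a424e373b42248ac0fbe
size 601
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])  # sha256 601
```

The `size` field is the byte length of the stored object, so the two files added here are 601 and 208 bytes respectively.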