AngelaColmen committed
Commit 369a077 · verified · 1 Parent(s): f80bc88

Initial Commit

Files changed (7)
  1. README +115 -0
  2. app.py +66 -0
  3. build_index.py +33 -0
  4. catalog.csv +21 -0
  5. index.html +135 -0
  6. requirements.txt +5 -0
  7. search.py +31 -0
README ADDED
@@ -0,0 +1,115 @@
+ # 📚 Semantic Library Search
+
+ An AI-powered library search engine that finds books by meaning, not just keywords. Built as a sample project for AI/ML.
+
+ ## What Is Semantic Search?
+
+ Traditional library search looks for exact keyword matches. If you search for "books about the cosmos" you might miss books that use the word "universe" or "astronomy" instead.
+
+ Semantic search understands *meaning*. It knows that "cosmos", "universe", "space", and "astronomy" are all related concepts — and returns relevant results even when the exact words don't match.
+
+ ## How It Works
+
+ 1. **Catalog Loading** — Book metadata is loaded from a CSV catalog file
+ 2. **Embedding Generation** — Each book description is converted into a mathematical "meaning fingerprint" using a Sentence Transformer AI model
+ 3. **Vector Indexing** — These fingerprints are stored in a FAISS index for fast similarity searching
+ 4. **Query Processing** — When a user searches, their query is converted into a fingerprint and compared against all books in the index
+ 5. **Results Returned** — The closest matching books are returned ranked by semantic similarity
+
+ ## Technologies Used
+
+ - **Python 3.14** — Core programming language
+ - **Sentence Transformers** — AI model for generating semantic embeddings (all-MiniLM-L6-v2)
+ - **FAISS** — Facebook AI Similarity Search for fast vector search
+ - **FastAPI** — Modern Python web framework for the search API
+ - **Uvicorn** — ASGI server for running the API
+ - **Pandas** — Data manipulation and catalog management
+ - **HTML/CSS/JavaScript** — Frontend search interface
+
+ ## Project Structure
+
+ ```
+ semantic-library-search/
+ ├── catalog.csv             # Library catalog with 20 books
+ ├── build_index.py          # Builds the AI search index
+ ├── search.py               # Command-line search interface
+ ├── app.py                  # FastAPI search API
+ ├── index.html              # Web search interface
+ ├── library.index           # Generated FAISS vector index
+ ├── catalog_processed.csv   # Processed catalog data
+ └── embeddings.pkl          # Saved book embeddings
+ ```
+
+ ## Getting Started
+
+ ### Prerequisites
+ - Python 3.9 or higher
+ - pip package manager
+
+ ### Installation
+
+ 1. Clone the repository:
+ ```
+ git clone https://github.com/angelacolmen/semantic-library-search.git
+ cd semantic-library-search
+ ```
+
+ 2. Create and activate a virtual environment:
+ ```
+ python -m venv venv
+ venv\Scripts\activate      # Windows
+ source venv/bin/activate   # Mac/Linux
+ ```
+
+ 3. Install dependencies:
+ ```
+ pip install sentence-transformers faiss-cpu pandas fastapi uvicorn python-multipart
+ ```
+
+ 4. Build the search index:
+ ```
+ python build_index.py
+ ```
+
+ 5. Start the API:
+ ```
+ uvicorn app:app --reload
+ ```
+
+ 6. Open your browser and go to:
+ ```
+ http://127.0.0.1:8000
+ ```
+
+ ## Example Searches
+
+ Try these searches to see semantic search in action:
+
+ - `books about space and the universe`
+ - `stories about race and justice in America`
+ - `women who made a difference in science`
+ - `how governments control people`
+ - `survival against the odds`
+
+ Notice how results appear even when the exact search words don't appear in the book titles or descriptions!
+
+ ## Library Science Applications
+
+ This project demonstrates several real-world library applications:
+
+ - **Reference Services** — Patrons can describe their research need in plain language and receive relevant resource recommendations
+ - **Collection Development** — Identify gaps in a collection by searching for topics and seeing what's missing
+ - **Catalog Enhancement** — Improve discoverability of items that may be poorly described in traditional catalog records
+ - **Accessibility** — Helps patrons who don't know the exact terminology used in library classification systems
+
+ ## Future Enhancements
+
+ - Connect to a live library catalog via API (e.g. WorldCat, Open Library)
+ - Add Library of Congress Subject Heading suggestions
+ - Implement user feedback to improve search results over time
+ - Scale to larger collections using cloud-based vector databases
+ - Add multilingual search support
+
+ ## Author
+
+ Built by Angela Colmenares as a sample project for AI/ML.
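
Steps 3 to 5 of "How It Works" in the README amount to a nearest-neighbour lookup over vectors. The sketch below illustrates that idea in plain Python, without the model or FAISS. The three-dimensional vectors, the `TOY_INDEX` name, and the `rank_by_distance` helper are made up for illustration; real all-MiniLM-L6-v2 embeddings have 384 dimensions and are not reproduced here.

```python
import math

# Toy "embeddings": hand-picked 3-d vectors, NOT real model output.
TOY_INDEX = {
    "Cosmos": [0.9, 0.1, 0.0],
    "Just Mercy": [0.0, 0.2, 0.9],
    "The Martian": [0.7, 0.6, 0.1],
}


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(query_vec, index, k=2):
    """Return the k titles whose vectors lie closest to query_vec."""
    scored = sorted(index.items(), key=lambda item: l2_distance(query_vec, item[1]))
    return [title for title, _ in scored[:k]]


# A query vector pointing mostly along the first (hypothetical "space") axis.
top = rank_by_distance([1.0, 0.2, 0.0], TOY_INDEX)
print(top)
```

A flat index compares the query against every stored vector in the same exhaustive way, which is reasonable for a 20-book catalog.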
app.py ADDED
@@ -0,0 +1,66 @@
+ import subprocess
+ import sys
+
+ import faiss
+ import gradio as gr
+ import numpy as np
+ import pandas as pd
+ from sentence_transformers import SentenceTransformer
+
+ # Load the embedding model
+ print("Loading search engine...")
+ model = SentenceTransformer("all-MiniLM-L6-v2")
+
+ # Build the index on startup with the same interpreter that runs this app;
+ # check=True aborts startup if the build fails
+ subprocess.run([sys.executable, "build_index.py"], check=True)
+
+ index = faiss.read_index("library.index")
+ df = pd.read_csv("catalog_processed.csv")
+ print("Ready!")
+
+
+ def search(query, num_results=3):
+     """Return the closest catalog entries for a query, formatted as Markdown."""
+     if not query or not query.strip():
+         return "Please enter a search query."
+
+     # FAISS expects a float32 matrix and an integer k (the slider may pass a float)
+     query_embedding = np.asarray(model.encode([query]), dtype="float32")
+     distances, indices = index.search(query_embedding, int(num_results))
+
+     results = ""
+     for i, idx in enumerate(indices[0]):
+         if idx < 0:  # FAISS pads with -1 when fewer than k entries exist
+             continue
+         book = df.iloc[idx]
+         results += f"### {i+1}. {book['title']}\n"
+         results += f"**Author:** {book['author']}\n"
+         results += f"**Subject:** {book['subject']}\n"
+         results += f"{book['description']}\n\n"
+
+     return results
+
+
+ # Create Gradio interface
+ demo = gr.Interface(
+     fn=search,
+     inputs=[
+         gr.Textbox(
+             label="Search Query",
+             placeholder="e.g. books about space and the universe...",
+             lines=2
+         ),
+         gr.Slider(
+             minimum=1,
+             maximum=5,
+             value=3,
+             step=1,
+             label="Number of Results"
+         )
+     ],
+     outputs=gr.Markdown(label="Search Results"),
+     title="📚 Semantic Library Search",
+     description="Search by meaning, not just keywords. Try searching for 'books about space and the universe' or 'stories about race and justice in America'.",
+     examples=[
+         ["books about space and the universe", 3],
+         ["stories about race and justice in America", 3],
+         ["women who made a difference in science", 3],
+         ["how governments control people", 3],
+         ["survival against the odds", 3]
+     ]
+ )
+
+ if __name__ == "__main__":
+     demo.launch()
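
app.py rebuilds the index every time it starts. One possible refinement, sketched below under assumptions, is to rebuild only when `library.index` is missing or older than `catalog.csv`. The `needs_rebuild` helper is hypothetical and not part of this commit; the temporary files only stand in for the real catalog and index.

```python
import os
import tempfile


def needs_rebuild(catalog_path, index_path):
    """True when the index file is missing or older than the catalog."""
    if not os.path.exists(index_path):
        return True
    return os.path.getmtime(index_path) < os.path.getmtime(catalog_path)


# Demonstration with temporary stand-ins for catalog.csv and library.index.
with tempfile.TemporaryDirectory() as tmp:
    catalog = os.path.join(tmp, "catalog.csv")
    index_file = os.path.join(tmp, "library.index")
    with open(catalog, "w") as f:
        f.write("id,title\n")
    missing_index = needs_rebuild(catalog, index_file)

    with open(index_file, "w") as f:
        f.write("placeholder")
    # Set explicit timestamps instead of relying on filesystem clock resolution.
    os.utime(catalog, (1_000, 1_000))
    os.utime(index_file, (2_000, 2_000))
    fresh_index = needs_rebuild(catalog, index_file)

    os.utime(catalog, (3_000, 3_000))  # catalog edited after the index was built
    stale_index = needs_rebuild(catalog, index_file)

print(missing_index, fresh_index, stale_index)
```

In app.py, such a check could guard the `subprocess.run` call so that an unchanged catalog does not trigger a fresh embedding pass.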
build_index.py ADDED
@@ -0,0 +1,33 @@
+ import pandas as pd
+ from sentence_transformers import SentenceTransformer
+ import faiss
+ import numpy as np
+ import pickle
+
+ # Step 1: Load your catalog
+ print("Loading catalog...")
+ df = pd.read_csv("catalog.csv")
+
+ # Step 2: Combine the important fields into one sentence per book
+ df["combined"] = df["title"] + " by " + df["author"] + ". " + df["description"]
+
+ # Step 3: Load the AI model
+ print("Loading AI model (this may take a minute first time)...")
+ model = SentenceTransformer("all-MiniLM-L6-v2")
+
+ # Step 4: Turn each book description into a vector (list of numbers)
+ print("Creating embeddings...")
+ embeddings = model.encode(df["combined"].tolist())
+
+ # Step 5: Build the search index
+ print("Building search index...")
+ index = faiss.IndexFlatL2(embeddings.shape[1])
+ index.add(np.array(embeddings))
+
+ # Step 6: Save everything for later
+ faiss.write_index(index, "library.index")
+ df.to_csv("catalog_processed.csv", index=False)
+ with open("embeddings.pkl", "wb") as f:
+     pickle.dump(embeddings, f)
+
+ print("Done! Your library search index is ready.")
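
build_index.py uses `faiss.IndexFlatL2`, which ranks by Euclidean distance and reports squared distances. If cosine similarity is the intended notion of "closeness", a common approach is to scale every vector to unit length first, because for unit vectors the squared L2 distance equals `2 - 2 * cosine`, so both give the same ranking. The plain-Python sketch below checks that identity on toy vectors; the `normalize`, `dot`, and `squared_l2` helpers are illustrative and not part of the commit. Whether the embeddings produced here are already unit length is not shown in this commit; `SentenceTransformer.encode` accepts `normalize_embeddings=True` if explicit normalization is wanted.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors, not real embeddings.
a = normalize([3.0, 4.0])
b = normalize([1.0, 0.0])

cosine = dot(a, b)
dist_sq = squared_l2(a, b)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(cosine, dist_sq, 2 - 2 * cosine)
```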
catalog.csv ADDED
@@ -0,0 +1,21 @@
+ id,title,author,subject,description
+ 1,The Great Gatsby,F. Scott Fitzgerald,Fiction,A story about wealth and the American dream in the 1920s
+ 2,A Brief History of Time,Stephen Hawking,Science,An introduction to cosmology and the nature of the universe
+ 3,To Kill a Mockingbird,Harper Lee,Fiction,A lawyer defends a Black man accused of a crime in the American South
+ 4,The Selfish Gene,Richard Dawkins,Science,An exploration of evolution from the perspective of genes
+ 5,Sapiens,Yuval Noah Harari,History,A sweeping history of humankind from prehistoric times to the present
+ 6,Cosmos,Carl Sagan,Science,A journey through the universe exploring astronomy and the nature of space
+ 7,The Immortal Life of Henrietta Lacks,Rebecca Skloot,Science,The story of a Black woman whose cancer cells were taken without consent and used for medical research
+ 8,Just Mercy,Bryan Stevenson,Law,A lawyer fights for wrongly condemned prisoners on death row in America
+ 9,The Origin of Species,Charles Darwin,Science,The foundational text of evolutionary biology explaining natural selection
+ 10,Guns Germs and Steel,Jared Diamond,History,An explanation of why some civilizations came to dominate others throughout history
+ 11,The Color of Law,Richard Rothstein,History,How the American government segregated the country through housing policies
+ 12,Hidden Figures,Margot Lee Shetterly,History,The story of Black female mathematicians who worked at NASA during the space race
+ 13,The Martian,Andy Weir,Fiction,An astronaut is stranded alone on Mars and must use science to survive
+ 14,Astrophysics for People in a Hurry,Neil deGrasse Tyson,Science,A quick guide to the biggest ideas in astrophysics and the cosmos
+ 15,The New Jim Crow,Michelle Alexander,Law,How mass incarceration functions as a system of racial control in America
+ 16,A Short History of Nearly Everything,Bill Bryson,Science,A journey through the history of science and how humans came to understand the world
+ 17,The Warmth of Other Suns,Isabel Wilkerson,History,The story of the Great Migration of Black Americans from the South to the North
+ 18,Contact,Carl Sagan,Fiction,A scientist receives a mysterious signal from deep space and must decide how humanity should respond
+ 19,The Sixth Extinction,Elizabeth Kolbert,Science,An investigation into how human activity is causing a mass extinction of species on Earth
+ 20,Between the World and Me,Ta-Nehisi Coates,History,A father's letter to his son about the history and reality of being Black in America
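
The catalog's `title`, `author`, and `description` columns are what build_index.py joins into one sentence per book. The sketch below reproduces that join with only the standard library, adding a column check and a blank-cell fallback that the commit itself does not have. `SAMPLE` copies two rows from catalog.csv; `REQUIRED` and `combined_texts` are hypothetical names used for illustration.

```python
import csv
import io

# Two rows copied from catalog.csv, kept inline so the example is self-contained.
SAMPLE = """id,title,author,subject,description
2,A Brief History of Time,Stephen Hawking,Science,An introduction to cosmology and the nature of the universe
6,Cosmos,Carl Sagan,Science,A journey through the universe exploring astronomy and the nature of space
"""

REQUIRED = ("title", "author", "description")


def combined_texts(csv_text):
    """Build one 'title by author. description' sentence per catalog row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"catalog is missing columns: {missing}")
    texts = []
    for row in reader:
        # Treat absent values as empty strings so one blank cell does not break the row.
        title, author, description = (row[c] or "" for c in REQUIRED)
        texts.append(f"{title} by {author}. {description}")
    return texts


texts = combined_texts(SAMPLE)
for text in texts:
    print(text)
```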
index.html ADDED
@@ -0,0 +1,135 @@
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+     <meta charset="UTF-8">
+     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+     <title>Semantic Library Search</title>
+     <style>
+         body {
+             font-family: Arial, sans-serif;
+             max-width: 800px;
+             margin: 40px auto;
+             padding: 20px;
+             background-color: #f5f5f5;
+         }
+         h1 {
+             color: #2c3e50;
+             text-align: center;
+         }
+         p.subtitle {
+             text-align: center;
+             color: #7f8c8d;
+             margin-bottom: 30px;
+         }
+         .search-box {
+             display: flex;
+             gap: 10px;
+             margin-bottom: 30px;
+         }
+         input {
+             flex: 1;
+             padding: 12px;
+             font-size: 16px;
+             border: 2px solid #bdc3c7;
+             border-radius: 6px;
+         }
+         button {
+             padding: 12px 24px;
+             background-color: #2980b9;
+             color: white;
+             border: none;
+             border-radius: 6px;
+             font-size: 16px;
+             cursor: pointer;
+         }
+         button:hover {
+             background-color: #2471a3;
+         }
+         .result {
+             background: white;
+             padding: 20px;
+             margin-bottom: 15px;
+             border-radius: 8px;
+             border-left: 5px solid #2980b9;
+             box-shadow: 0 2px 4px rgba(0,0,0,0.1);
+         }
+         .result h3 {
+             margin: 0 0 5px 0;
+             color: #2c3e50;
+         }
+         .result .author {
+             color: #7f8c8d;
+             font-style: italic;
+             margin-bottom: 8px;
+         }
+         .result .subject {
+             display: inline-block;
+             background: #eaf4fb;
+             color: #2980b9;
+             padding: 3px 10px;
+             border-radius: 12px;
+             font-size: 13px;
+             margin-bottom: 8px;
+         }
+         .result p {
+             color: #555;
+             margin: 0;
+         }
+         #status {
+             text-align: center;
+             color: #7f8c8d;
+             font-style: italic;
+         }
+     </style>
+ </head>
+ <body>
+     <h1>📚 Semantic Library Search</h1>
+     <p class="subtitle">Search by meaning, not just keywords</p>
+
+     <div class="search-box">
+         <input type="text" id="query" placeholder="e.g. books about space and the universe..." />
+         <button onclick="search()">Search</button>
+     </div>
+
+     <div id="status"></div>
+     <div id="results"></div>
+
+     <script>
+         document.getElementById('query').addEventListener('keydown', function(e) {
+             if (e.key === 'Enter') search();
+         });
+
+         // Create an element and set its text via textContent,
+         // so catalog text is never parsed as HTML
+         function makeElement(tag, className, text) {
+             const node = document.createElement(tag);
+             if (className) node.className = className;
+             node.textContent = text;
+             return node;
+         }
+
+         async function search() {
+             const query = document.getElementById('query').value.trim();
+             if (!query) return;
+
+             const status = document.getElementById('status');
+             const results = document.getElementById('results');
+             status.textContent = 'Searching...';
+             results.replaceChildren();
+
+             try {
+                 const response = await fetch('http://127.0.0.1:8000/search', {
+                     method: 'POST',
+                     headers: { 'Content-Type': 'application/json' },
+                     body: JSON.stringify({ query: query, num_results: 3 })
+                 });
+                 // fetch() only rejects on network failure, so check the status too
+                 if (!response.ok) throw new Error('HTTP ' + response.status);
+
+                 const data = await response.json();
+                 status.textContent = '';
+
+                 data.results.forEach(book => {
+                     const card = makeElement('div', 'result', '');
+                     card.append(
+                         makeElement('h3', '', book.title),
+                         makeElement('div', 'author', 'by ' + book.author),
+                         makeElement('span', 'subject', book.subject),
+                         makeElement('p', '', book.description)
+                     );
+                     results.append(card);
+                 });
+             } catch (error) {
+                 status.textContent = 'Error connecting to search engine. Make sure the API is running!';
+             }
+         }
+     </script>
+ </body>
+ </html>
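
index.html posts JSON of the form `{"query": ..., "num_results": 3}` to `/search` and reads `data.results`, a list of objects with `title`, `author`, `subject`, and `description`. No handler for that route appears in this commit, so the sketch below only illustrates the request and response shapes the page relies on. `handle_search`, `fake_search`, and the `max_results` cap are hypothetical; a real handler would call the model and FAISS index instead of the stand-in.

```python
import json

FIELDS = ("title", "author", "subject", "description")


def handle_search(raw_body, search_fn, max_results=5):
    """Decode the request body the page sends and encode the reply it reads."""
    payload = json.loads(raw_body)
    query = str(payload.get("query", "")).strip()
    # Clamp the requested count to a sane range before searching.
    k = max(1, min(int(payload.get("num_results", 3)), max_results))
    books = search_fn(query, k) if query else []
    # Return only the fields the page renders.
    return json.dumps({"results": [{f: b[f] for f in FIELDS} for b in books]})


def fake_search(query, k):
    """Stand-in for the embedding + FAISS lookup; returns one catalog row."""
    catalog = [{
        "id": 6,
        "title": "Cosmos",
        "author": "Carl Sagan",
        "subject": "Science",
        "description": "A journey through the universe exploring astronomy and the nature of space",
    }]
    return catalog[:k]


reply = handle_search(b'{"query": "space", "num_results": 3}', fake_search)
print(reply)
```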
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ sentence-transformers
+ faiss-cpu
+ pandas
+ gradio
+ numpy
search.py ADDED
@@ -0,0 +1,31 @@
+ import pandas as pd
+ from sentence_transformers import SentenceTransformer
+ import faiss
+ import numpy as np
+
+ # Load everything we built
+ print("Loading search engine...")
+ model = SentenceTransformer("all-MiniLM-L6-v2")
+ index = faiss.read_index("library.index")
+ df = pd.read_csv("catalog_processed.csv")
+
+ print("Ready! Type your search question below.")
+ print("Type 'quit' to exit\n")
+
+ # Search loop
+ while True:
+     query = input("Search: ").strip()
+     if query.lower() == "quit":
+         break
+     if not query:
+         # Skip blank input instead of searching for an empty string
+         continue
+
+     # Turn your search query into a meaning fingerprint
+     query_embedding = model.encode([query])
+
+     # Find the 3 most similar books
+     distances, indices = index.search(np.array(query_embedding), 3)
+
+     print("\nTop results:")
+     for i, idx in enumerate(indices[0]):
+         if idx < 0:  # FAISS pads with -1 when fewer than 3 entries exist
+             continue
+         book = df.iloc[idx]
+         print(f"{i+1}. {book['title']} by {book['author']}")
+         print(f"   {book['description']}\n")
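
The loop in search.py mixes input handling with the model and index, which makes it hard to exercise without loading either. A minimal sketch of one way to separate the two follows, assuming the search itself is passed in as a function. `run_search_loop` and its parameters are hypothetical, and the lambda below is a stand-in for the real embedding lookup.

```python
def run_search_loop(read_line, search_fn, write):
    """Read queries until 'quit', skipping blank input, and write ranked titles."""
    while True:
        query = read_line().strip()
        if query.lower() == "quit":
            break
        if not query:
            continue
        for rank, title in enumerate(search_fn(query), start=1):
            write(f"{rank}. {title}")


# Scripted input in place of input(), and a list in place of print().
inputs = iter(["", "space", "quit"])
output = []
run_search_loop(lambda: next(inputs), lambda q: [f"result for {q}"], output.append)
print(output)
```

With this shape, the interactive script would pass `lambda: input("Search: ")`, a function wrapping `model.encode` and `index.search`, and `print`.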