Commit 2137c11 by romybeaute · 1 parent: de1be28

reorganised

Files changed (1)
  1. README.md +94 -16
README.md CHANGED
@@ -31,13 +31,31 @@ The tool is designed for consciousness researchers, phenomenologists, and qualit
31
  - **LLM topic labelling** — automatic generation of interpretable labels (full version)
32
  - **Python API** — `mosaic_core` library for programmatic use and batch processing
33
 
34
- ## Installation
35
 
36
- ### Web app (no installation)
37
 
38
- Visit [huggingface.co/spaces/romybeaute/MOSAICapp](https://huggingface.co/spaces/romybeaute/MOSAICapp)
39
 
40
- ### Local installation
41
 
42
  ```bash
43
  git clone https://github.com/romybeaute/MOSAICapp.git
@@ -53,37 +71,107 @@ pip install .
53
 
54
  # Download NLTK data (required for segmentation)
55
  python -c "import nltk; nltk.download('punkt')"
 
56
 
57
- # Run the app
58
  streamlit run app.py
59
  ```
60
 
61
- ### Library usage
62
 
63
  ```python
64
  from mosaic_core.core_functions import preprocess_and_embed, run_topic_model
65
 
 
66
  docs, embeddings = preprocess_and_embed("data.csv", text_col="report")
67
 
 
68
  config = {
69
  "umap_params": {"n_neighbors": 15, "n_components": 5},
70
  "hdbscan_params": {"min_cluster_size": 10},
71
  "bt_params": {"nr_topics": "auto"}
72
  }
73
 
 
74
  model, reduced_embeddings, topics = run_topic_model(docs, embeddings, config)
75
  ```
76
 
77
  ## Input format
78
 
79
  CSV file with a text column. The app auto-detects columns named `text`, `report`, `reflection_answer`, or `reflection_answer_english`. Any column can also be selected manually.
80
 
81
  ## How it works
82
 
83
  MOSAICapp implements a BERTopic pipeline: texts are embedded using sentence transformers, reduced with UMAP, clustered with HDBSCAN, and labelled using c-TF-IDF (with optional LLM refinement). This approach captures semantic context better than older bag-of-words methods like LDA.
84
 
85
  For methodological details, see the [MOSAIC paper](https://arxiv.org/abs/2502.18318).
86
 
87
  ## Research applications
88
 
89
  MOSAICapp has been used to analyse:
@@ -109,17 +197,7 @@ MOSAICapp has been used to analyse:
109
 
110
  See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on reporting bugs, suggesting features, and contributing code.
111
 
112
- ## Tests
113
 
114
- **Run everything:**
115
- ```bash
116
- pytest tests/ -v
117
- ```
118
-
119
- **Run only fast tests:**
120
- ```bash
121
- pytest tests/test_core_functions.py -v
122
- ```
123
 
124
  ## License
125
 
 
31
  - **LLM topic labelling** — automatic generation of interpretable labels (full version)
32
  - **Python API** — `mosaic_core` library for programmatic use and batch processing
33
 
 
34
 
 
35
 
36
+ ---
37
+
38
+ ## 1. Quick Start (No Installation)
39
+
40
+ The easiest way to use MOSAICapp is via the hosted web interface. No coding or installation is required.
41
+
42
+ **[Launch MOSAICapp on Hugging Face](https://huggingface.co/spaces/romybeaute/MOSAICapp)**
43
+
44
+ *Note: The hosted version runs on shared resources. For large datasets or privacy-sensitive data, we recommend the local installation below.*
45
+
46
+
47
+ ---
48
+
49
+ ## 2. Local Installation
50
 
51
+ Run the app on your own machine to use your own GPU, keep sensitive data local, or modify the code.
52
+
53
+ ### Prerequisites
54
+ - Python 3.9+
55
+ - Git
56
+
57
+
58
+ ### Setup steps
59
 
60
  ```bash
61
  git clone https://github.com/romybeaute/MOSAICapp.git
 
71
 
72
  # Download NLTK data (required for segmentation)
73
  python -c "import nltk; nltk.download('punkt')"
74
+ ```
75
 
76
+ ---
77
+
78
+ ## 3. Configuration & Running
79
+
80
+
81
+ ### Run the app
82
+ ```bash
83
  streamlit run app.py
84
  ```
85
 
86
+ ### LLM Setup (Optional)
87
+ To use the Automated Topic Labelling feature (Llama-3), you must provide a Hugging Face Access Token. The app uses this token to access the inference API.
88
+
89
+ 1. Get a Token: Log in to Hugging Face and create a token with "Read" permissions.
90
+
91
+ 2. Configure Local App:
92
+
93
+ - Create a folder named `.streamlit` in the project root directory.
94
+
95
+ - Inside it, create a file named `secrets.toml`.
96
+
97
+ - Add your token to the TOML file:
98
+ ```toml
99
+ HF_TOKEN = "hf_..."
100
+ ```
101
+
102
+ - Note: This file is ignored by Git to protect your credentials.
103
+
104
+
105
+ ---
106
+
107
+ ## 4. Running Tests
108
+ The repository includes a test suite that verifies the installation and core logic. Running it is a quick way to check that your environment is set up correctly.
109
+
110
+ **Run everything:**
111
+ ```bash
112
+ pytest tests/ -v
113
+ ```
114
 
115
+ **Run only fast tests:**
116
+ ```bash
117
+ pytest tests/test_core_functions.py -v
118
+ ```
119
+
120
+ This will automatically load a dummy dataset included in the repo and verify:
121
+
122
+ - Data loading (CSV parsing)
123
+
124
+ - Embedding generation
125
+
126
+ - Topic modelling pipeline
127
+
128
+ - Visualisation outputs
129
+
130
+ ---
131
+
132
+ ## 5. Python API (Advanced Usage)
133
+ MOSAICapp is also a Python library. You can import `mosaic_core` in your own scripts or Jupyter Notebooks for batch processing or custom analysis pipelines.
134
+
135
+ ### Library usage
136
  ```python
137
  from mosaic_core.core_functions import preprocess_and_embed, run_topic_model
138
 
139
+ # 1. Load and Preprocess
140
  docs, embeddings = preprocess_and_embed("data.csv", text_col="report")
141
 
142
+ # 2. Configure Parameters
143
  config = {
144
  "umap_params": {"n_neighbors": 15, "n_components": 5},
145
  "hdbscan_params": {"min_cluster_size": 10},
146
  "bt_params": {"nr_topics": "auto"}
147
  }
148
 
149
+ # 3. Run Model
150
  model, reduced_embeddings, topics = run_topic_model(docs, embeddings, config)
151
  ```
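Once the model has run, a common next step is to see how many documents landed in each topic. The sketch below assumes that `topics` is a sequence of integer topic ids, one per document, with `-1` marking documents left unassigned, as is the convention in BERTopic and HDBSCAN. The README does not state the return type, so treat this as an assumption; `summarise_topics` is a hypothetical helper, not part of `mosaic_core`.

```python
from collections import Counter


def summarise_topics(topics):
    """Count documents per topic, reporting outliers (-1) separately."""
    counts = Counter(topics)
    n_outliers = counts.pop(-1, 0)
    # Largest topics first; ties broken by topic id for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked, n_outliers


# Hypothetical assignments standing in for the `topics` value returned above.
example_topics = [0, 1, 0, -1, 2, 0, 1, -1]
ranked, n_outliers = summarise_topics(example_topics)
print(ranked)      # [(0, 3), (1, 2), (2, 1)]
print(n_outliers)  # 2
```

A large outlier count relative to the dataset size can be a hint that `min_cluster_size` is too high for the data.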
152
 
153
+
154
+
155
+
156
+
157
  ## Input format
158
 
159
  CSV file with a text column. The app auto-detects columns named `text`, `report`, `reflection_answer`, or `reflection_answer_english`. Any column can also be selected manually.
160
 
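The auto-detection described above can be pictured as a lookup over the CSV header. The following is a minimal sketch using only the standard library; it is not the app's actual code, and both the helper name `detect_text_column` and the priority order (the order in which the names are listed above) are assumptions.

```python
import csv
import io

# Column names the README lists as auto-detected, in the order given there.
CANDIDATE_COLUMNS = ("text", "report", "reflection_answer", "reflection_answer_english")


def detect_text_column(csv_file):
    """Return the first candidate column present in the CSV header, else None."""
    header = next(csv.reader(csv_file), [])
    names = [name.strip() for name in header]
    for candidate in CANDIDATE_COLUMNS:
        if candidate in names:
            return candidate
    return None


sample = io.StringIO("participant,report\n1,I felt a sense of calm\n")
print(detect_text_column(sample))  # report
```

When the function returns `None`, the caller would fall back to manual column selection, matching the behaviour described above.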
161
+
162
+ ---
163
+
164
+
165
  ## How it works
166
 
167
  MOSAICapp implements a BERTopic pipeline: texts are embedded using sentence transformers, reduced with UMAP, clustered with HDBSCAN, and labelled using c-TF-IDF (with optional LLM refinement). This approach captures semantic context better than older bag-of-words methods like LDA.
168
 
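To make the labelling step more concrete, here is a small pure-Python sketch of class-based TF-IDF as formulated in the BERTopic documentation: the weight of term t in topic c is its normalised frequency in that topic multiplied by log(1 + A / f_t), where A is the average number of tokens per topic and f_t is the term's frequency across all topics. This is an illustration of the formula only; it is not MOSAICapp's or BERTopic's implementation, and the function name and toy data are made up.

```python
import math
from collections import Counter


def c_tf_idf(tokens_per_topic):
    """Score terms per topic with class-based TF-IDF.

    tokens_per_topic maps a topic id to the tokens of all documents in that
    topic, so each topic is treated as one long document.
    """
    if not tokens_per_topic:
        return {}
    term_counts = {topic: Counter(tokens) for topic, tokens in tokens_per_topic.items()}
    # f_t: frequency of each term across all topics.
    total_freq = Counter()
    for counts in term_counts.values():
        total_freq.update(counts)
    # A: average number of tokens per topic.
    avg_tokens = sum(total_freq.values()) / len(term_counts)

    scores = {}
    for topic, counts in term_counts.items():
        n_tokens = sum(counts.values())
        scores[topic] = {
            term: (count / n_tokens) * math.log(1 + avg_tokens / total_freq[term])
            for term, count in counts.items()
        }
    return scores


# Toy example: two topics with one shared term ("body").
toy = {
    0: ["calm", "calm", "breath", "body"],
    1: ["light", "colour", "light", "body"],
}
scores = c_tf_idf(toy)
print(max(scores[0], key=scores[0].get))  # calm
print(max(scores[1], key=scores[1].get))  # light
```

Terms shared across topics, such as "body" here, receive a lower weight than terms concentrated in a single topic, which is why the top-scoring terms tend to make distinctive labels.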
169
  For methodological details, see the [MOSAIC paper](https://arxiv.org/abs/2502.18318).
170
 
171
+
172
+
173
+ ---
174
+
175
  ## Research applications
176
 
177
  MOSAICapp has been used to analyse:
 
197
 
198
  See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on reporting bugs, suggesting features, and contributing code.
199
 
 
200
 
201
 
202
  ## License
203