htaf committed on
Commit
a67789e
·
1 Parent(s): e439243

added data extractor

1972-03-01_llresearch.json DELETED
@@ -1,19 +0,0 @@
1
- {
2
- "session_title": "March 1, 1972",
3
- "session_date": "1972-03-01",
4
- "session_type": "Channeling Session",
5
- "synopsis": "Hatonn: I am speaking through this instrument. I will speak to you on the subject of maturity, for this you have requested. As you have already indicated, this concept may be somewhat different from that which is generally appreciated by your peoples.\n\nMaturity, my friends, is in truth a maturity of the spirit, for in truth there is nothing but the spirit. Physical illusion which you appreciate in your daily lives is of no consequence other than for its result in action upon the spiritual self.",
6
- "turns": [
7
- {
8
- "role": null,
9
- "speaker": null,
10
- "content": "[The day of this channeling is unknown.] 1\n\n(Don channeling)"
11
- },
12
- {
13
- "role": "assistant",
14
- "speaker": "Hatonn",
15
- "channeler": null,
16
- "content": "I am Hatonn. I greet you, my friends, in the love and the light of our infinite Creator. It is a great privilege to be with you once more. I am always privileged to be with you. These are the thoughts of the one known as Hatonn.\n\nI am speaking through this instrument. I will speak to you on the subject of maturity, for this you have requested. As you have already indicated, this concept may be somewhat different from that which is generally appreciated by your peoples.\n\nMaturity, my friends, is in truth a maturity of the spirit, for in truth there is nothing but the spirit. Physical illusion which you appreciate in your daily lives is of no consequence other than for its result in action upon the spiritual self.\n\nMaturity, my friends, is first the realization of this fact. Secondly, maturity is the ability to control one's own consciousness in such a way so as to propagate the continuance of this maturing process.\n\nMaturity, my friends, then is realized in the ability to control one's consciousness. Unfortunately, upon the planet which you now enjoy, there is but very little true control of the basic consciousness. And, therefore, there is very, very little maturity. It is necessary first to realize the value of each thought you have, and then to reject those of little or no value. Most of the thoughts that we are able to discern occurring in the daily lives of those who dwell upon this planet lack maturity. For this reason, you might consider the planet upon which you live a planet of children. Their daily thoughts, communicated to one another, hold them in this state. 
It is a self-propagating thing, communicated from one to the other.\n\nIt is necessary to reject thoughts that continually infringe upon your mind from your present environment and to carefully select each thought that you generate in order to reach a state of true mental maturity.\n\nYou might ask how it is possible to select thoughts of value from thoughts that are meaningless, or of little value. It is very simple, my friends. All that is necessary is for you to analyze the thought with respect to the real objectives of your person. If the thought has true consequence, if the thought is of a true developmental nature-that is to say, if it develops either your consciousness or the consciousness of someone else with whom you are communicating-then it is a worthwhile thought. If it does not develop the consciousness, then it is probably of very little value.\n\nNow, how will a thought or concept develop into consciousness? There are several ways. One technique of development is simply evolving the ability of analyzing the merits of your thoughts. After this has been done, the thoughts themselves will act as generators of the maturing process.\n\nEach thought you have is important. It is important either in a negative or a positive sense. If it is a thought that is of no consequence, it is important to recognize this thought as being of no value. If it is a thought of consequence, then it is necessary that you amplify it and utilize it and communicate it, or it too will be of very little value.\n\nMaturity, my friends, is first the ability to think in this manner. It is second to act in the manner in which you think.\n\nThere are millions and millions of thoughts generated by the people of your planet each day. A very, very small percentage of these thoughts have to do with maturity. That is, a very, very small percentage of them have to do with creating a better environment for the growth of the spiritual self. 
By this I mean actively causing spiritual development.\n\nWhat is spiritual development? It is the process of maturing; the process of maturing, the process of analyzing everything that you are aware of in a true and unbiased sense. In order to do this, one must be able to recognize truth. It is only possible for one to recognize truth by the process of allowing truth to communicate the absolute base for truth which is ever present throughout the universe. This communication is accomplished primarily through the technique of meditation.\n\nThere is a separation of maturity into primarily two aspects: intellectual and spiritual maturity. They go hand in hand, and one generates the other. However, it is not necessary to acquire intellectual maturity in order to acquire spiritual maturity. It is, however, necessary to acquire spiritual maturity in order to acquire intellectual maturity, for the intellect cannot accurately evaluate concepts without a true spiritual basis.\n\nThere are three more things which I would like to speak about concerning maturity: the concept of infantile maturity; the concept of general or induced maturity; the concept of absolute or total maturity.\n\nThe concept of infantile maturity is highly misunderstood upon this planet. An infant, upon incarnating into your environment, has a certain amount of maturity that he normally brings with him. It is not necessary to induce this quality of maturity through any system of education to the infant. It is only necessary that he be alerted to the possibility of generating a continuance, through his own intellectual processes, of his own spiritual evolution and, consequently, spiritual maturity. 
Unfortunately, your religious systems do not provide, for the most part, this stimulus.\n\nIt is recommended that, in order for infantile maturity to progress at an acceptable rate, the infant be made aware at the earliest age possible of his responsibility in creating an intellectual communication with his total self. This is usually done through techniques of ritual and appreciation of the natural forces of the universe. The ritual that is employed by most of your religious systems upon your planet is highly ineffective, since it is generated primarily by force, and is not freely offered, to be accepted or rejected.\n\nThose, even in an extreme infantile state, who are appreciative, due to their previous growth of the proper ritualistic communications, will accept them, and continue, at their own pace, and should not be forced to attend weekly meetings at specific hours for these purposes, since they reach a peak of spiritual attunement that is a function of their own cyclical activities, and therefore should be able to seek out, at any time, spiritual communications and should be provided with a place for seeking. And this should be the limit of that which is expected of them. Your present system drives most of your people from spiritual seeking at a very early age due to the aspect of force which should be totally removed. This is what we have experienced, and what we have found to be most beneficial.\n\nThe second aspect of which I speak is that of induced maturity, occurring in most unusual aspect among the peoples of your present society. This maturity, which is a false maturity, is induced by the social systems which are presently in effect upon your planet. Each system intellectually communicates an aspect of assumed maturity, which has nothing to do with real or absolute maturity. 
Therefore, much strife and confusion is realized by those who attempt to orient their thinking so as to reach the accepted state or level of the assumed concept of the mature mind. This concept is usually heavily intellectual, for your society at present is primarily an intellectual society, with very, very little awareness of the existence or function of what you would call a spiritual society.\n\nTherefore, to mature within the boundaries of your present society and be accepted as a mature person, it is necessary to be able to communicate with it in its accepted intellectual jargon, which includes primarily a ridiculously long list of totally meaningless concepts. These should be, if one is to attain true maturity, rejected as meaningless, for they are extreme transients and have nothing to do with spiritual maturity.\n\nThe last aspect of maturity upon which I wish to speak is that of real maturity. My friends, there is only one way to reach real maturity: that is through meditation. We have said this many times. You cannot get there by intellectual mechanisms. You cannot get there by analyzing each of your thoughts, and labeling it either worthwhile or worthless. All of these things are aids, but without the foundation of daily meditation you cannot use this analysis, for the result of this analysis is insulated from the total self by a boundary. This boundary is permeable, but this boundary is only permeable when the mind is conditioned through meditation. Lack of meditation reduces this boundary to an impermeable state, all intellectual functions occurring on the surface, and having little effect upon the growth of the true self.\n\nSo you see, my friends, there is a dual process occurring. However, the meditation is always of the primary and greater importance. Once, however the art of meditation has been fully mastered, the intellectual mind becomes a useful tool in the development of spirit. 
It is of little consequence until this state of communication between the two is mastered. Therefore, my friends, all is of no avail until receptivity is made possible through daily meditation. This not only breaks down the barrier between the intellect and the spirit, it also breaks down all other barriers between the spirit and the one great All.\n\nI hope that I have been of some service to you in this discussion. I realize that it is difficult to speak to you on this subject because it is a difficult subject if one is to use the parameters of your present society. We consider it very, very simple. Therefore, any true discussion of the subject should require no more than a few sentences.\n\nIt has been a privilege to speak with you. Adonai, my friends. I am Hatonn.\n\n[At the end of this transcript is the following announcement.]"
17
- }
18
- ]
19
- }
LICENSE ADDED
@@ -0,0 +1,19 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ Copyright 2025 Andrii Zvorygin
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
cleanup.sh ADDED
@@ -0,0 +1,32 @@
+ #!/usr/bin/env bash
+ set -e
+
+ echo "[cleanup] Starting distill_rag cleanup…"
+
+ # Remove backup files
+ echo "[cleanup] Removing *.bak files…"
+ find . -type f -name "*.bak" -print -delete
+
+ # Remove Node leftover logs
+ echo "[cleanup] Removing npm debug logs…"
+ find . -type f -name "npm-debug.log*" -print -delete
+
+ # Clean old Elasticsearch index (if running)
+ if [ -n "${ES_DISTILL_INDEX}" ]; then
+   echo "[cleanup] Deleting Elasticsearch index: ${ES_DISTILL_INDEX}"
+   curl -X DELETE "${ELASTICSEARCH_NODE:-http://localhost:9200}/${ES_DISTILL_INDEX}" || true
+ else
+   echo "[cleanup] Skip index delete: ES_DISTILL_INDEX not set"
+ fi
+
+ # Remove build cache
+ echo "[cleanup] Removing Jest cache…"
+ rm -rf ./node_modules/.cache || true
+
+ # Optional: remove node_modules entirely
+ if [[ "$1" == "--deep" ]]; then
+   echo "[cleanup] Deep mode: removing node_modules…"
+   rm -rf node_modules
+ fi
+
+ echo "[cleanup] Done!"
data_extraction/README.md ADDED
@@ -0,0 +1,26 @@
+ # Raw Data Extraction
+
+ Most users start with messy HTML/TXT/PDF dumps.
+
+ This module converts raw files → structured JSON sessions compatible with the distill_rag indexer.
+
+ ## Usage
+
+ 1. Put raw files into a directory, e.g.:
+
+     raw_corpus/file1.html
+     raw_corpus/file2.txt
+
+ 2. Run extraction:
+
+     node data_extraction/walk_and_extract.js raw_corpus extracted_json
+
+ 3. The output will be JSON files with:
+
+     - title
+     - session_date
+     - turns[]
+
+ You can now index these with:
+
+     QUO_JSON_DIR=extracted_json node indexing/index_distill_chunks.js
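The field list in step 3 of the README can be checked mechanically. The sketch below is a minimal, hypothetical validator (`missingSessionFields` and `README_FIELDS` are illustrative names, not part of this commit) that reports which of the three listed fields are absent from a session object. As written in this commit, `convertFile` sets only `title` and `turns`, so a session it produces would be reported as missing `session_date`.

```javascript
// Fields the README lists for an extracted session.
const README_FIELDS = ["title", "session_date", "turns"];

// Hypothetical helper (not in this commit): return the README fields that
// are absent from a session object or present with an unexpected shape.
function missingSessionFields(session) {
  return README_FIELDS.filter((field) => {
    if (!(field in session)) return true;
    if (field === "turns") return !Array.isArray(session.turns);
    return typeof session[field] !== "string";
  });
}

// Shape that convertFile() writes in this commit: title and turns only.
const fromConvertFile = {
  title: "examle_source_1.html",
  turns: [
    { role: "user", content: "Q: How do I serve others?" },
    {
      role: "assistant",
      content: "A: Service begins with small daily gestures of kindness.",
    },
  ],
};

console.log(missingSessionFields(fromConvertFile));
```

A check like this could run before indexing, so that sessions lacking a date are caught early rather than inside the indexer.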
data_extraction/clean_html.js ADDED
@@ -0,0 +1,28 @@
+ // data_extraction/clean_html.js
+ const cheerio = require("cheerio");
+
+ function cleanHTML(html) {
+   // Load HTML into cheerio
+   const $ = cheerio.load(html);
+
+   const kill = [
+     "script",
+     "style",
+     "nav",
+     "header",
+     "footer",
+     ".ads",
+     ".advertisement",
+     "#sidebar",
+   ];
+
+   // Remove unwanted elements
+   kill.forEach(sel => $(sel).remove());
+
+   // Extract body text, normalize whitespace
+   let text = $("body").text();
+
+   return text.replace(/\s+/g, " ").trim();
+ }
+
+ module.exports = { cleanHTML };
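The final step of `cleanHTML` does not depend on cheerio and can be tried on a plain string. This is a minimal sketch, with `normalizeWhitespace` as a hypothetical name for the `replace(...).trim()` expression in the function above:

```javascript
// Hypothetical standalone version of the last line of cleanHTML():
// collapse every run of whitespace (spaces, tabs, newlines) to one space,
// then strip leading and trailing whitespace.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, " ").trim();
}

const sample =
  "\n  Q: How do I serve others?\n\n\tA: Service begins   with kindness.\n";
console.log(normalizeWhitespace(sample));
```

Because newlines are collapsed along with spaces, the result is a single line; paragraph boundaries in the source HTML are not preserved by this step.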
data_extraction/convert_raw_to_sessions.js ADDED
@@ -0,0 +1,40 @@
+
+ // data_extraction/convert_raw_to_sessions.js
+ const fs = require("fs");
+ const path = require("path");
+ const { extractFromHTML } = require("./extractor");
+
+ function convertFile(inputFile, outputFile) {
+   const turns = extractFromHTML(inputFile);
+
+   const session = {
+     title: path.basename(inputFile),
+     turns,
+   };
+
+   fs.writeFileSync(outputFile, JSON.stringify(session, null, 2));
+ }
+
+ // Convert all .html files in a directory
+ function convertDir(inputDir, outputDir) {
+   if (!fs.existsSync(outputDir)) {
+     fs.mkdirSync(outputDir, { recursive: true });
+   }
+
+   const files = fs.readdirSync(inputDir);
+
+   files.forEach((file) => {
+     if (!file.endsWith(".html")) return;
+
+     const fullIn = path.join(inputDir, file);
+     const outName = file.replace(/\.html$/, ".json");
+     const fullOut = path.join(outputDir, outName);
+
+     convertFile(fullIn, fullOut);
+   });
+ }
+
+ module.exports = {
+   convertFile,
+   convertDir,
+ };
data_extraction/example_corpus/examle_source_1.html ADDED
@@ -0,0 +1,6 @@
+ <html>
+   <body>
+     <h1>Q: How do I serve others?</h1>
+     <p>A: Service begins with small daily gestures of kindness.</p>
+   </body>
+ </html>
data_extraction/extractor.js ADDED
@@ -0,0 +1,34 @@
+ // data_extraction/extractor.js
+ const fs = require("fs");
+ const cheerio = require("cheerio");
+
+ function extractFromHTML(filePath) {
+   const raw = fs.readFileSync(filePath, "utf8");
+   const $ = cheerio.load(raw);
+
+   // Select paragraphs AND headings
+   const blocks = $("p, h1, h2, h3, h4, h5, h6")
+     .map((i, el) => $(el).text().trim())
+     .get()
+     .filter((t) => t.length > 0);
+
+   // If no structured blocks exist, fallback to body text split
+   let lines = blocks;
+   if (lines.length === 0) {
+     lines = $("body")
+       .text()
+       .split("\n")
+       .map((l) => l.trim())
+       .filter((l) => l.length > 0);
+   }
+
+   // Convert each block into a turn
+   const turns = lines.map((text, idx) => ({
+     role: idx === 0 ? "user" : "assistant",
+     content: text,
+   }));
+
+   return turns;
+ }
+
+ module.exports = { extractFromHTML };
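The role assignment and the plain-text fallback in `extractFromHTML` can be illustrated without cheerio. The sketch below assumes the text has already been pulled out of the document; `splitBodyText` and `toTurns` are hypothetical names that mirror the two mapping steps in the function above and are not part of this commit.

```javascript
// Hypothetical mirror of the fallback path: split raw body text into
// trimmed, non-empty lines.
function splitBodyText(bodyText) {
  return bodyText
    .split("\n")
    .map((l) => l.trim())
    .filter((l) => l.length > 0);
}

// Hypothetical mirror of the turn mapping: the first block becomes the
// "user" turn and every later block becomes an "assistant" turn.
function toTurns(lines) {
  return lines.map((text, idx) => ({
    role: idx === 0 ? "user" : "assistant",
    content: text,
  }));
}

const turns = toTurns(splitBodyText("Q: one\n\n A: two \nQ: three\n"));
console.log(turns);
```

With this mapping, a file that contains several question/answer pairs still yields a single `user` turn: in the sample above, the later `Q: three` line is labelled `assistant` because only index 0 is treated as the user.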
data_extraction/walk_and_extract.js ADDED
@@ -0,0 +1,69 @@
+ // data_extraction/walk_and_extract.js
+
+ const fs = require("fs");
+ const path = require("path");
+ const { convertFile } = require("./convert_raw_to_sessions");
+
+ // Recursively walk directory and return list of full file paths
+ function walkDir(dir) {
+   let results = [];
+   const list = fs.readdirSync(dir);
+
+   list.forEach((file) => {
+     const filePath = path.join(dir, file);
+     const stat = fs.statSync(filePath);
+
+     if (stat && stat.isDirectory()) {
+       results = results.concat(walkDir(filePath));
+     } else {
+       results.push(filePath);
+     }
+   });
+
+   return results;
+ }
+
+ // Convert all HTML files from inputDir → outputDir
+ function mainCLI() {
+   const args = process.argv.slice(2);
+
+   // ---------------------------
+   // Usage print (fix for Jest)
+   // ---------------------------
+   if (args.length < 2) {
+     console.log("Usage: node walk_and_extract.js <input_dir> <output_dir>");
+     return; // no exit needed; Jest now sees output correctly
+   }
+
+   const inputDir = args[0];
+   const outputDir = args[1];
+
+   if (!fs.existsSync(inputDir)) {
+     console.error(`Input directory not found: ${inputDir}`);
+     process.exit(1);
+   }
+
+   if (!fs.existsSync(outputDir)) {
+     fs.mkdirSync(outputDir, { recursive: true });
+   }
+
+   const files = walkDir(inputDir).filter((f) => f.endsWith(".html"));
+
+   files.forEach((filePath) => {
+     const outName = path.basename(filePath).replace(/\.html$/, ".json");
+     const outPath = path.join(outputDir, outName);
+     convertFile(filePath, outPath);
+   });
+
+   console.log(`[walk_and_extract] Processed ${files.length} HTML files.`);
+ }
+
+ // Only run if executed directly
+ if (require.main === module) {
+   mainCLI();
+ }
+
+ module.exports = {
+   walkDir,
+   mainCLI,
+ };
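`mainCLI` derives each output name from `path.basename`, so two inputs such as `a/session.html` and `b/session.html` would be written to the same JSON file, with the later one overwriting the earlier. One possible way around this, sketched here with a hypothetical `outNameFor` helper that is not part of the commit, is to fold the path relative to the input directory into the file name. The `__` separator is an arbitrary choice for illustration.

```javascript
const path = require("path");

// Hypothetical helper: build a flat output file name from the path of
// filePath relative to inputDir, so files with the same basename in
// different subdirectories map to different JSON files.
function outNameFor(inputDir, filePath) {
  const rel = path.relative(inputDir, filePath);
  return rel.split(path.sep).join("__").replace(/\.html$/, ".json");
}

const first = outNameFor("raw_corpus", path.join("raw_corpus", "a", "session.html"));
const second = outNameFor("raw_corpus", path.join("raw_corpus", "b", "session.html"));
console.log(first, second);
```

Files directly inside the input directory keep the name they get today (`session.html` becomes `session.json`), so only nested files would be renamed under this scheme.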
jest.config.js CHANGED
@@ -2,5 +2,8 @@
  module.exports = {
    testEnvironment: "node",
    verbose: true,
-   forceExit: true,
+   forceExit: true, // force process exit
+   detectOpenHandles: false, // explicitly disable warnings
+   silent: true, // suppress worker chatter
+   testMatch: ["**/tests/**/*.test.js"]
  };
package-lock.json CHANGED
@@ -12,6 +12,7 @@
12
  "@elastic/elasticsearch": "^8.17.0",
13
  "@tensorflow/tfjs-node": "^4.22.0",
14
  "@types/elasticsearch": "^5.0.43",
15
  "dotenv": "^16.4.7",
16
  "elasticsearch": "^16.7.3",
17
  "fs-extra": "^11.3.0",
@@ -2163,6 +2164,11 @@
2163
  "baseline-browser-mapping": "dist/cli.js"
2164
  }
2165
  },
2166
  "node_modules/brace-expansion": {
2167
  "version": "1.1.12",
2168
  "resolved": "https://registry.npmjs.org/brace-expansion/-/brace-expansion-1.1.12.tgz",
@@ -2335,6 +2341,65 @@
2335
  "node": ">=10"
2336
  }
2337
  },
2338
  "node_modules/chownr": {
2339
  "version": "2.0.0",
2340
  "resolved": "https://registry.npmjs.org/chownr/-/chownr-2.0.0.tgz",
@@ -2534,6 +2599,32 @@
2534
  "urix": "~0.1.0"
2535
  }
2536
  },
2537
  "node_modules/debug": {
2538
  "version": "4.4.3",
2539
  "resolved": "https://registry.npmjs.org/debug/-/debug-4.4.3.tgz",
@@ -2627,6 +2718,68 @@
2627
  "integrity": "sha512-BSHWgDSAiKs50o2Re8ppvp3seVHXSRM44cdSsT9FfNEUUZLOGWVCsiWaRPWM1Znn+mqZ1OfVZ3z3DWEzSp7hRA==",
2628
  "dev": true
2629
  },
2630
  "node_modules/dotenv": {
2631
  "version": "16.6.1",
2632
  "resolved": "https://registry.npmjs.org/dotenv/-/dotenv-16.6.1.tgz",
@@ -2744,6 +2897,29 @@
2744
  "resolved": "https://registry.npmjs.org/emoji-regex/-/emoji-regex-8.0.0.tgz",
2745
  "integrity": "sha512-MSjYzcWNOA0ewAHpz0MxpYFvwg6yjy1NG3xteoqz644VCo/RPgnr1/GGt+ic3iJTzQ8Eu3TdM14SawnVUmGE6A=="
2746
  },
2747
  "node_modules/error-ex": {
2748
  "version": "1.3.4",
2749
  "resolved": "https://registry.npmjs.org/error-ex/-/error-ex-1.3.4.tgz",
@@ -3290,6 +3466,24 @@
3290
  "integrity": "sha512-H2iMtd0I4Mt5eYiapRdIDjp+XzelXQ0tFE4JS7YFwFevXXMmOp9myNrUvCg0D6ws8iqkRPBfKHgbwig1SmlLfg==",
3291
  "dev": true
3292
  },
3293
  "node_modules/https-proxy-agent": {
3294
  "version": "2.2.4",
3295
  "resolved": "https://registry.npmjs.org/https-proxy-agent/-/https-proxy-agent-2.2.4.tgz",
@@ -3327,6 +3521,17 @@
3327
  "ms": "^2.0.0"
3328
  }
3329
  },
3330
  "node_modules/import-local": {
3331
  "version": "3.2.0",
3332
  "resolved": "https://registry.npmjs.org/import-local/-/import-local-3.2.0.tgz",
@@ -4627,6 +4832,17 @@
4627
  "set-blocking": "^2.0.0"
4628
  }
4629
  },
4630
  "node_modules/object-assign": {
4631
  "version": "4.1.1",
4632
  "resolved": "https://registry.npmjs.org/object-assign/-/object-assign-4.1.1.tgz",
@@ -4745,6 +4961,51 @@
4745
  "url": "https://github.com/sponsors/sindresorhus"
4746
  }
4747
  },
4748
  "node_modules/path-exists": {
4749
  "version": "4.0.0",
4750
  "resolved": "https://registry.npmjs.org/path-exists/-/path-exists-4.0.0.tgz",
@@ -4995,6 +5256,11 @@
4995
  }
4996
  ]
4997
  },
4998
  "node_modules/secure-json-parse": {
4999
  "version": "3.0.2",
5000
  "resolved": "https://registry.npmjs.org/secure-json-parse/-/secure-json-parse-3.0.2.tgz",
@@ -5653,6 +5919,25 @@
5653
  "resolved": "https://registry.npmjs.org/webidl-conversions/-/webidl-conversions-3.0.1.tgz",
5654
  "integrity": "sha512-2JAn3z8AR6rjK8Sm8orRC0h/bcl/DqL7tRPdGZ4I1CjdF+EaMLmYxBHyXuKL849eucPFhvBoxMsflfOb8kxaeQ=="
5655
  },
5656
  "node_modules/whatwg-url": {
5657
  "version": "5.0.0",
5658
  "resolved": "https://registry.npmjs.org/whatwg-url/-/whatwg-url-5.0.0.tgz",
 
12
  "@elastic/elasticsearch": "^8.17.0",
13
  "@tensorflow/tfjs-node": "^4.22.0",
14
  "@types/elasticsearch": "^5.0.43",
15
+ "cheerio": "^1.1.2",
16
  "dotenv": "^16.4.7",
17
  "elasticsearch": "^16.7.3",
18
  "fs-extra": "^11.3.0",
 
2164
  "baseline-browser-mapping": "dist/cli.js"
2165
  }
2166
  },
2167
+ "node_modules/boolbase": {
2168
+ "version": "1.0.0",
2169
+ "resolved": "https://registry.npmjs.org/boolbase/-/boolbase-1.0.0.tgz",
2170
+ "integrity": "sha512-JZOSA7Mo9sNGB8+UjSgzdLtokWAky1zbztM3WRLCbZ70/3cTANmQmOdR7y2g+J0e2WXywy1yS468tY+IruqEww=="
2171
+ },
2172
  "node_modules/brace-expansion": {
2173
  "version": "1.1.12",
2174
  "resolved": "https://registry.npmjs.org/brace-expansion/-/brace-expansion-1.1.12.tgz",
 
2341
  "node": ">=10"
2342
  }
2343
  },
2344
+ "node_modules/cheerio": {
2345
+ "version": "1.1.2",
2346
+ "resolved": "https://registry.npmjs.org/cheerio/-/cheerio-1.1.2.tgz",
2347
+ "integrity": "sha512-IkxPpb5rS/d1IiLbHMgfPuS0FgiWTtFIm/Nj+2woXDLTZ7fOT2eqzgYbdMlLweqlHbsZjxEChoVK+7iph7jyQg==",
2348
+ "dependencies": {
2349
+ "cheerio-select": "^2.1.0",
2350
+ "dom-serializer": "^2.0.0",
2351
+ "domhandler": "^5.0.3",
2352
+ "domutils": "^3.2.2",
2353
+ "encoding-sniffer": "^0.2.1",
2354
+ "htmlparser2": "^10.0.0",
2355
+ "parse5": "^7.3.0",
2356
+ "parse5-htmlparser2-tree-adapter": "^7.1.0",
2357
+ "parse5-parser-stream": "^7.1.2",
2358
+ "undici": "^7.12.0",
2359
+ "whatwg-mimetype": "^4.0.0"
2360
+ },
2361
+ "engines": {
2362
+ "node": ">=20.18.1"
2363
+ },
2364
+ "funding": {
2365
+ "url": "https://github.com/cheeriojs/cheerio?sponsor=1"
2366
+ }
2367
+ },
2368
+ "node_modules/cheerio-select": {
2369
+ "version": "2.1.0",
2370
+ "resolved": "https://registry.npmjs.org/cheerio-select/-/cheerio-select-2.1.0.tgz",
2371
+ "integrity": "sha512-9v9kG0LvzrlcungtnJtpGNxY+fzECQKhK4EGJX2vByejiMX84MFNQw4UxPJl3bFbTMw+Dfs37XaIkCwTZfLh4g==",
2372
+ "dependencies": {
2373
+ "boolbase": "^1.0.0",
2374
+ "css-select": "^5.1.0",
2375
+ "css-what": "^6.1.0",
2376
+ "domelementtype": "^2.3.0",
2377
+ "domhandler": "^5.0.3",
2378
+ "domutils": "^3.0.1"
2379
+ },
2380
+ "funding": {
2381
+ "url": "https://github.com/sponsors/fb55"
2382
+ }
2383
+ },
2384
+ "node_modules/cheerio/node_modules/parse5": {
2385
+ "version": "7.3.0",
2386
+ "resolved": "https://registry.npmjs.org/parse5/-/parse5-7.3.0.tgz",
2387
+ "integrity": "sha512-IInvU7fabl34qmi9gY8XOVxhYyMyuH2xUNpb2q8/Y+7552KlejkRvqvD19nMoUW/uQGGbqNpA6Tufu5FL5BZgw==",
2388
+ "dependencies": {
2389
+ "entities": "^6.0.0"
2390
+ },
2391
+ "funding": {
2392
+ "url": "https://github.com/inikulin/parse5?sponsor=1"
2393
+ }
2394
+ },
2395
+ "node_modules/cheerio/node_modules/undici": {
2396
+ "version": "7.16.0",
2397
+ "resolved": "https://registry.npmjs.org/undici/-/undici-7.16.0.tgz",
2398
+ "integrity": "sha512-QEg3HPMll0o3t2ourKwOeUAZ159Kn9mx5pnzHRQO8+Wixmh88YdZRiIwat0iNzNNXn0yoEtXJqFpyW7eM8BV7g==",
2399
+ "engines": {
2400
+ "node": ">=20.18.1"
2401
+ }
2402
+ },
2403
  "node_modules/chownr": {
2404
  "version": "2.0.0",
2405
  "resolved": "https://registry.npmjs.org/chownr/-/chownr-2.0.0.tgz",
 
2599
  "urix": "~0.1.0"
2600
  }
2601
  },
2602
+ "node_modules/css-select": {
2603
+ "version": "5.2.2",
2604
+ "resolved": "https://registry.npmjs.org/css-select/-/css-select-5.2.2.tgz",
2605
+ "integrity": "sha512-TizTzUddG/xYLA3NXodFM0fSbNizXjOKhqiQQwvhlspadZokn1KDy0NZFS0wuEubIYAV5/c1/lAr0TaaFXEXzw==",
2606
+ "dependencies": {
2607
+ "boolbase": "^1.0.0",
2608
+ "css-what": "^6.1.0",
2609
+ "domhandler": "^5.0.2",
2610
+ "domutils": "^3.0.1",
2611
+ "nth-check": "^2.0.1"
2612
+ },
2613
+ "funding": {
2614
+ "url": "https://github.com/sponsors/fb55"
2615
+ }
2616
+ },
2617
+ "node_modules/css-what": {
2618
+ "version": "6.2.2",
2619
+ "resolved": "https://registry.npmjs.org/css-what/-/css-what-6.2.2.tgz",
2620
+ "integrity": "sha512-u/O3vwbptzhMs3L1fQE82ZSLHQQfto5gyZzwteVIEyeaY5Fc7R4dapF/BvRoSYFeqfBk4m0V1Vafq5Pjv25wvA==",
2621
+ "engines": {
2622
+ "node": ">= 6"
2623
+ },
2624
+ "funding": {
2625
+ "url": "https://github.com/sponsors/fb55"
2626
+ }
2627
+ },
2628
  "node_modules/debug": {
2629
  "version": "4.4.3",
2630
  "resolved": "https://registry.npmjs.org/debug/-/debug-4.4.3.tgz",
 
2718
  "integrity": "sha512-BSHWgDSAiKs50o2Re8ppvp3seVHXSRM44cdSsT9FfNEUUZLOGWVCsiWaRPWM1Znn+mqZ1OfVZ3z3DWEzSp7hRA==",
2719
  "dev": true
2720
  },
2721
+ "node_modules/dom-serializer": {
2722
+ "version": "2.0.0",
2723
+ "resolved": "https://registry.npmjs.org/dom-serializer/-/dom-serializer-2.0.0.tgz",
2724
+ "integrity": "sha512-wIkAryiqt/nV5EQKqQpo3SToSOV9J0DnbJqwK7Wv/Trc92zIAYZ4FlMu+JPFW1DfGFt81ZTCGgDEabffXeLyJg==",
2725
+ "dependencies": {
2726
+ "domelementtype": "^2.3.0",
2727
+ "domhandler": "^5.0.2",
2728
+ "entities": "^4.2.0"
2729
+ },
2730
+ "funding": {
2731
+ "url": "https://github.com/cheeriojs/dom-serializer?sponsor=1"
2732
+ }
2733
+ },
2734
+ "node_modules/dom-serializer/node_modules/entities": {
2735
+ "version": "4.5.0",
2736
+ "resolved": "https://registry.npmjs.org/entities/-/entities-4.5.0.tgz",
2737
+ "integrity": "sha512-V0hjH4dGPh9Ao5p0MoRY6BVqtwCjhz6vI5LT8AJ55H+4g9/4vbHx1I54fS0XuclLhDHArPQCiMjDxjaL8fPxhw==",
2738
+ "engines": {
2739
+ "node": ">=0.12"
2740
+ },
2741
+ "funding": {
2742
+ "url": "https://github.com/fb55/entities?sponsor=1"
2743
+ }
2744
+ },
2745
+ "node_modules/domelementtype": {
2746
+ "version": "2.3.0",
2747
+ "resolved": "https://registry.npmjs.org/domelementtype/-/domelementtype-2.3.0.tgz",
2748
+ "integrity": "sha512-OLETBj6w0OsagBwdXnPdN0cnMfF9opN69co+7ZrbfPGrdpPVNBUj02spi6B1N7wChLQiPn4CSH/zJvXw56gmHw==",
2749
+ "funding": [
2750
+ {
2751
+ "type": "github",
2752
+ "url": "https://github.com/sponsors/fb55"
2753
+ }
2754
+ ]
2755
+ },
2756
+ "node_modules/domhandler": {
2757
+ "version": "5.0.3",
2758
+ "resolved": "https://registry.npmjs.org/domhandler/-/domhandler-5.0.3.tgz",
2759
+ "integrity": "sha512-cgwlv/1iFQiFnU96XXgROh8xTeetsnJiDsTc7TYCLFd9+/WNkIqPTxiM/8pSd8VIrhXGTf1Ny1q1hquVqDJB5w==",
2760
+ "dependencies": {
2761
+ "domelementtype": "^2.3.0"
2762
+ },
2763
+ "engines": {
2764
+ "node": ">= 4"
2765
+ },
2766
+ "funding": {
2767
+ "url": "https://github.com/fb55/domhandler?sponsor=1"
2768
+ }
2769
+ },
2770
+ "node_modules/domutils": {
2771
+ "version": "3.2.2",
2772
+ "resolved": "https://registry.npmjs.org/domutils/-/domutils-3.2.2.tgz",
2773
+ "integrity": "sha512-6kZKyUajlDuqlHKVX1w7gyslj9MPIXzIFiz/rGu35uC1wMi+kMhQwGhl4lt9unC9Vb9INnY9Z3/ZA3+FhASLaw==",
2774
+ "dependencies": {
2775
+ "dom-serializer": "^2.0.0",
2776
+ "domelementtype": "^2.3.0",
2777
+ "domhandler": "^5.0.3"
2778
+ },
2779
+ "funding": {
2780
+ "url": "https://github.com/fb55/domutils?sponsor=1"
2781
+ }
2782
+ },
2783
  "node_modules/dotenv": {
2784
  "version": "16.6.1",
2785
  "resolved": "https://registry.npmjs.org/dotenv/-/dotenv-16.6.1.tgz",
 
2897
  "resolved": "https://registry.npmjs.org/emoji-regex/-/emoji-regex-8.0.0.tgz",
2898
  "integrity": "sha512-MSjYzcWNOA0ewAHpz0MxpYFvwg6yjy1NG3xteoqz644VCo/RPgnr1/GGt+ic3iJTzQ8Eu3TdM14SawnVUmGE6A=="
2899
  },
2900
+ "node_modules/encoding-sniffer": {
2901
+ "version": "0.2.1",
2902
+ "resolved": "https://registry.npmjs.org/encoding-sniffer/-/encoding-sniffer-0.2.1.tgz",
2903
+ "integrity": "sha512-5gvq20T6vfpekVtqrYQsSCFZ1wEg5+wW0/QaZMWkFr6BqD3NfKs0rLCx4rrVlSWJeZb5NBJgVLswK/w2MWU+Gw==",
2904
+ "dependencies": {
2905
+ "iconv-lite": "^0.6.3",
2906
+ "whatwg-encoding": "^3.1.1"
2907
+ },
2908
+ "funding": {
2909
+ "url": "https://github.com/fb55/encoding-sniffer?sponsor=1"
2910
+ }
2911
+ },
2912
+ "node_modules/entities": {
2913
+ "version": "6.0.1",
2914
+ "resolved": "https://registry.npmjs.org/entities/-/entities-6.0.1.tgz",
2915
+ "integrity": "sha512-aN97NXWF6AWBTahfVOIrB/NShkzi5H7F9r1s9mD3cDj4Ko5f2qhhVoYMibXF7GlLveb/D2ioWay8lxI97Ven3g==",
2916
+ "engines": {
2917
+ "node": ">=0.12"
2918
+ },
2919
+ "funding": {
2920
+ "url": "https://github.com/fb55/entities?sponsor=1"
2921
+ }
2922
+ },
2923
  "node_modules/error-ex": {
2924
  "version": "1.3.4",
2925
  "resolved": "https://registry.npmjs.org/error-ex/-/error-ex-1.3.4.tgz",
 
3466
  "integrity": "sha512-H2iMtd0I4Mt5eYiapRdIDjp+XzelXQ0tFE4JS7YFwFevXXMmOp9myNrUvCg0D6ws8iqkRPBfKHgbwig1SmlLfg==",
3467
  "dev": true
3468
  },
3469
+ "node_modules/htmlparser2": {
3470
+ "version": "10.0.0",
3471
+ "resolved": "https://registry.npmjs.org/htmlparser2/-/htmlparser2-10.0.0.tgz",
3472
+ "integrity": "sha512-TwAZM+zE5Tq3lrEHvOlvwgj1XLWQCtaaibSN11Q+gGBAS7Y1uZSWwXXRe4iF6OXnaq1riyQAPFOBtYc77Mxq0g==",
3473
+ "funding": [
3474
+ "https://github.com/fb55/htmlparser2?sponsor=1",
3475
+ {
3476
+ "type": "github",
3477
+ "url": "https://github.com/sponsors/fb55"
3478
+ }
3479
+ ],
3480
+ "dependencies": {
3481
+ "domelementtype": "^2.3.0",
3482
+ "domhandler": "^5.0.3",
3483
+ "domutils": "^3.2.1",
3484
+ "entities": "^6.0.0"
3485
+ }
3486
+ },
3487
  "node_modules/https-proxy-agent": {
3488
  "version": "2.2.4",
3489
  "resolved": "https://registry.npmjs.org/https-proxy-agent/-/https-proxy-agent-2.2.4.tgz",
 
3521
  "ms": "^2.0.0"
3522
  }
3523
  },
3524
+ "node_modules/iconv-lite": {
3525
+ "version": "0.6.3",
3526
+ "resolved": "https://registry.npmjs.org/iconv-lite/-/iconv-lite-0.6.3.tgz",
3527
+ "integrity": "sha512-4fCk79wshMdzMp2rH06qWrJE4iolqLhCUH+OiuIgU++RB0+94NlDL81atO7GX55uUKueo0txHNtvEyI6D7WdMw==",
3528
+ "dependencies": {
3529
+ "safer-buffer": ">= 2.1.2 < 3.0.0"
3530
+ },
3531
+ "engines": {
3532
+ "node": ">=0.10.0"
3533
+ }
3534
+ },
3535
  "node_modules/import-local": {
3536
  "version": "3.2.0",
3537
  "resolved": "https://registry.npmjs.org/import-local/-/import-local-3.2.0.tgz",
 
4832
  "set-blocking": "^2.0.0"
4833
  }
4834
  },
4835
+ "node_modules/nth-check": {
4836
+ "version": "2.1.1",
4837
+ "resolved": "https://registry.npmjs.org/nth-check/-/nth-check-2.1.1.tgz",
4838
+ "integrity": "sha512-lqjrjmaOoAnWfMmBPL+XNnynZh2+swxiX3WUE0s4yEHI6m+AwrK2UZOimIRl3X/4QctVqS8AiZjFqyOGrMXb/w==",
4839
+ "dependencies": {
4840
+ "boolbase": "^1.0.0"
4841
+ },
4842
+ "funding": {
4843
+ "url": "https://github.com/fb55/nth-check?sponsor=1"
4844
+ }
4845
+ },
4846
  "node_modules/object-assign": {
4847
  "version": "4.1.1",
4848
  "resolved": "https://registry.npmjs.org/object-assign/-/object-assign-4.1.1.tgz",
 
4961
  "url": "https://github.com/sponsors/sindresorhus"
4962
  }
4963
  },
4964
+ "node_modules/parse5-htmlparser2-tree-adapter": {
4965
+ "version": "7.1.0",
4966
+ "resolved": "https://registry.npmjs.org/parse5-htmlparser2-tree-adapter/-/parse5-htmlparser2-tree-adapter-7.1.0.tgz",
4967
+ "integrity": "sha512-ruw5xyKs6lrpo9x9rCZqZZnIUntICjQAd0Wsmp396Ul9lN/h+ifgVV1x1gZHi8euej6wTfpqX8j+BFQxF0NS/g==",
4968
+ "dependencies": {
4969
+ "domhandler": "^5.0.3",
4970
+ "parse5": "^7.0.0"
4971
+ },
4972
+ "funding": {
4973
+ "url": "https://github.com/inikulin/parse5?sponsor=1"
4974
+ }
4975
+ },
4976
+ "node_modules/parse5-htmlparser2-tree-adapter/node_modules/parse5": {
4977
+ "version": "7.3.0",
4978
+ "resolved": "https://registry.npmjs.org/parse5/-/parse5-7.3.0.tgz",
4979
+ "integrity": "sha512-IInvU7fabl34qmi9gY8XOVxhYyMyuH2xUNpb2q8/Y+7552KlejkRvqvD19nMoUW/uQGGbqNpA6Tufu5FL5BZgw==",
4980
+ "dependencies": {
4981
+ "entities": "^6.0.0"
4982
+ },
4983
+ "funding": {
4984
+ "url": "https://github.com/inikulin/parse5?sponsor=1"
4985
+ }
4986
+ },
4987
+ "node_modules/parse5-parser-stream": {
4988
+ "version": "7.1.2",
4989
+ "resolved": "https://registry.npmjs.org/parse5-parser-stream/-/parse5-parser-stream-7.1.2.tgz",
4990
+ "integrity": "sha512-JyeQc9iwFLn5TbvvqACIF/VXG6abODeB3Fwmv/TGdLk2LfbWkaySGY72at4+Ty7EkPZj854u4CrICqNk2qIbow==",
4991
+ "dependencies": {
4992
+ "parse5": "^7.0.0"
4993
+ },
4994
+ "funding": {
4995
+ "url": "https://github.com/inikulin/parse5?sponsor=1"
4996
+ }
4997
+ },
4998
+ "node_modules/parse5-parser-stream/node_modules/parse5": {
4999
+ "version": "7.3.0",
5000
+ "resolved": "https://registry.npmjs.org/parse5/-/parse5-7.3.0.tgz",
5001
+ "integrity": "sha512-IInvU7fabl34qmi9gY8XOVxhYyMyuH2xUNpb2q8/Y+7552KlejkRvqvD19nMoUW/uQGGbqNpA6Tufu5FL5BZgw==",
5002
+ "dependencies": {
5003
+ "entities": "^6.0.0"
5004
+ },
5005
+ "funding": {
5006
+ "url": "https://github.com/inikulin/parse5?sponsor=1"
5007
+ }
5008
+ },
5009
  "node_modules/path-exists": {
5010
  "version": "4.0.0",
5011
  "resolved": "https://registry.npmjs.org/path-exists/-/path-exists-4.0.0.tgz",
 
5256
  }
5257
  ]
5258
  },
5259
+ "node_modules/safer-buffer": {
5260
+ "version": "2.1.2",
5261
+ "resolved": "https://registry.npmjs.org/safer-buffer/-/safer-buffer-2.1.2.tgz",
5262
+ "integrity": "sha512-YZo3K82SD7Riyi0E1EQPojLz7kpepnSQI9IyPbHHg1XXXevb5dJI7tpyN2ADxGcQbHG7vcyRHk0cbwqcQriUtg=="
5263
+ },
5264
  "node_modules/secure-json-parse": {
5265
  "version": "3.0.2",
5266
  "resolved": "https://registry.npmjs.org/secure-json-parse/-/secure-json-parse-3.0.2.tgz",
 
5919
  "resolved": "https://registry.npmjs.org/webidl-conversions/-/webidl-conversions-3.0.1.tgz",
5920
  "integrity": "sha512-2JAn3z8AR6rjK8Sm8orRC0h/bcl/DqL7tRPdGZ4I1CjdF+EaMLmYxBHyXuKL849eucPFhvBoxMsflfOb8kxaeQ=="
5921
  },
5922
+ "node_modules/whatwg-encoding": {
5923
+ "version": "3.1.1",
5924
+ "resolved": "https://registry.npmjs.org/whatwg-encoding/-/whatwg-encoding-3.1.1.tgz",
5925
+ "integrity": "sha512-6qN4hJdMwfYBtE3YBTTHhoeuUrDBPZmbQaxWAqSALV/MeEnR5z1xd8UKud2RAkFoPkmB+hli1TZSnyi84xz1vQ==",
5926
+ "dependencies": {
5927
+ "iconv-lite": "0.6.3"
5928
+ },
5929
+ "engines": {
5930
+ "node": ">=18"
5931
+ }
5932
+ },
5933
+ "node_modules/whatwg-mimetype": {
5934
+ "version": "4.0.0",
5935
+ "resolved": "https://registry.npmjs.org/whatwg-mimetype/-/whatwg-mimetype-4.0.0.tgz",
5936
+ "integrity": "sha512-QaKxh0eNIi2mE9p2vEdzfagOKHCcj1pJ56EEHGQOVxp8r9/iszLUUV7v89x9O1p/T+NlTM5W7jW6+cz4Fq1YVg==",
5937
+ "engines": {
5938
+ "node": ">=18"
5939
+ }
5940
+ },
5941
  "node_modules/whatwg-url": {
5942
  "version": "5.0.0",
5943
  "resolved": "https://registry.npmjs.org/whatwg-url/-/whatwg-url-5.0.0.tgz",
package.json CHANGED
@@ -4,7 +4,7 @@
  "description": "",
  "main": "index.js",
  "scripts": {
- "test": "jest"
+ "test": "jest"
  },
  "keywords": [],
  "author": "",
@@ -13,6 +13,7 @@
  "@elastic/elasticsearch": "^8.17.0",
  "@tensorflow/tfjs-node": "^4.22.0",
  "@types/elasticsearch": "^5.0.43",
+ "cheerio": "^1.1.2",
  "dotenv": "^16.4.7",
  "elasticsearch": "^16.7.3",
  "fs-extra": "^11.3.0",
@@ -24,4 +25,4 @@
  "jest": "^30.2.0",
  "supertest": "^7.1.4"
  }
- }
+ }
tests/extraction/clean_html.test.js ADDED
@@ -0,0 +1,29 @@
+ const { cleanHTML } = require("../../data_extraction/clean_html");
+
+ describe("cleanHTML", () => {
+   test("removes scripts, styles, headers, footers, ads", () => {
+     const html = `
+       <html>
+         <head><style>.x{}</style></head>
+         <body>
+           <header>HEADER</header>
+           <script>alert("x")</script>
+           <div class="ads">Buy NOW</div>
+           <p>Hello world</p>
+           <footer>FOOTER</footer>
+         </body>
+       </html>
+     `;
+
+     const out = cleanHTML(html);
+
+     expect(out).toContain("Hello world");
+     expect(out).not.toMatch(/HEADER|FOOTER|Buy NOW|alert/);
+   });
+
+   test("collapses whitespace", () => {
+     const html = `<p>Text</p>\n\n<p>More</p>`;
+     const out = cleanHTML(html);
+     expect(out).toBe("Text More");
+   });
+ });
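The two tests above pin down the observable behaviour of `cleanHTML`: script, style, header, footer, and `.ads` content disappears, the remaining text survives, and whitespace collapses to single spaces. The module itself is not shown in this part of the diff, and it presumably relies on the `cheerio` dependency this commit adds. The following is only a minimal, dependency-free sketch of the same behaviour using regular expressions. The name `cleanHTMLSketch` is hypothetical, and the approach assumes the ad `<div>` contains no nested `<div>` elements.

```javascript
// Hypothetical stand-in for cleanHTML; NOT the module under test.
// Regex-based so it runs without cheerio; real HTML is better handled by a parser.
function cleanHTMLSketch(html) {
  return html
    // Drop whole elements whose content should never reach the output.
    .replace(/<(script|style|header|footer)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    // Drop <div> blocks whose class list includes "ads" (assumes no nested <div>).
    .replace(/<div\b[^>]*class="[^"]*\bads\b[^"]*"[^>]*>[\s\S]*?<\/div>/gi, " ")
    // Strip the remaining tags but keep the text between them.
    .replace(/<[^>]+>/g, " ")
    // Collapse runs of whitespace and trim both ends.
    .replace(/\s+/g, " ")
    .trim();
}

console.log(cleanHTMLSketch("<p>Text</p>\n\n<p>More</p>"));
```

Replacing each removed tag with a space rather than an empty string keeps adjacent paragraphs from fusing into one word before the whitespace pass runs.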
tests/extraction/convert_raw_to_sessions.test.js ADDED
@@ -0,0 +1,38 @@
+ const fs = require("fs");
+ const path = require("path");
+ const { convertDir } = require("../../data_extraction/convert_raw_to_sessions");
+
+ describe("convert_raw_to_sessions", () => {
+   const rawDir = path.join(__dirname, "raw_data");
+   const outDir = path.join(__dirname, "out_data");
+
+   beforeAll(() => {
+     fs.mkdirSync(rawDir, { recursive: true });
+
+     fs.writeFileSync(
+       path.join(rawDir, "example.html"),
+       `
+       <h1>Q: How do I serve?</h1>
+       <p>A: Begin with kindness.</p>
+       `
+     );
+   });
+
+   afterAll(() => {
+     fs.rmSync(rawDir, { recursive: true, force: true }); // force: do not throw if missing
+     fs.rmSync(outDir, { recursive: true, force: true }); // outDir is absent if convertDir failed
+   });
+
+   test("converts raw HTML to JSON session", () => {
+     convertDir(rawDir, outDir);
+
+     const outFile = path.join(outDir, "example.json");
+     expect(fs.existsSync(outFile)).toBe(true);
+
+     const session = JSON.parse(fs.readFileSync(outFile, "utf8"));
+
+     expect(session.title).toBe("example.html");
+     expect(session.turns.length).toBe(2);
+     expect(session.turns[0].role).toBe("user");
+   });
+ });
tests/extraction/extractor.test.js ADDED
@@ -0,0 +1,33 @@
+ const fs = require("fs");
+ const path = require("path");
+ const { extractFromHTML } = require("../../data_extraction/extractor");
+
+ describe("extractor", () => {
+   const sampleFile = path.join(__dirname, "sample.html");
+
+   beforeAll(() => {
+     fs.writeFileSync(
+       sampleFile,
+       `
+       <h1>Q: What is service?</h1>
+       <p>A: Service is love made visible.</p>
+       `
+     );
+   });
+
+   afterAll(() => {
+     fs.unlinkSync(sampleFile);
+   });
+
+   test("extracts paragraphs and roles", () => {
+     const turns = extractFromHTML(sampleFile);
+
+     expect(turns.length).toBe(2);
+
+     expect(turns[0].role).toBe("user");
+     expect(turns[0].content).toContain("What is service");
+
+     expect(turns[1].role).toBe("assistant");
+     expect(turns[1].content).toContain("love made visible");
+   });
+ });
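The extractor test above suggests that text blocks starting with `Q:` become `user` turns and blocks starting with `A:` become `assistant` turns. How `extractFromHTML` does this is not visible in this part of the diff, so the following is only a sketch of one way such a mapping could work. The helper name `assignRoles`, the choice to strip the prefix from `content`, and the `null` role for unprefixed blocks are all assumptions. The tests only check `toContain`, so they do not settle whether the prefix is kept.

```javascript
// Hypothetical helper; NOT taken from data_extraction/extractor.
// Maps already-extracted text blocks to turns by their "Q:" / "A:" prefix.
function assignRoles(blocks) {
  return blocks
    .map((text) => text.trim())
    // Skip blocks that are empty after trimming.
    .filter((text) => text.length > 0)
    .map((text) => {
      if (/^Q:/i.test(text)) {
        return { role: "user", content: text.replace(/^Q:\s*/i, "") };
      }
      if (/^A:/i.test(text)) {
        return { role: "assistant", content: text.replace(/^A:\s*/i, "") };
      }
      // Assumption: blocks without a recognised prefix carry no role.
      return { role: null, content: text };
    });
}

console.log(assignRoles(["Q: What is service?", "A: Service is love made visible."]));
```

Separating role assignment from HTML parsing would let this step be tested without writing temporary files, which the test above needs because `extractFromHTML` takes a file path.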
tests/extraction/walk_and_extract.test.js ADDED
@@ -0,0 +1,40 @@
+ const fs = require("fs");
+ const path = require("path");
+ const { execFileSync } = require("child_process");
+
+ describe("walk_and_extract CLI", () => {
+   const cli = path.resolve(__dirname, "../../data_extraction/walk_and_extract.js");
+   const rawDir = path.join(__dirname, "raw_cli");
+   const outDir = path.join(__dirname, "out_cli");
+
+   beforeAll(() => {
+     fs.mkdirSync(rawDir, { recursive: true });
+     fs.writeFileSync(
+       path.join(rawDir, "file1.html"),
+       `
+       <h1>Q: What is love?</h1>
+       <p>A: Love is unity.</p>
+       `
+     );
+   });
+
+   afterAll(() => {
+     fs.rmSync(rawDir, { recursive: true, force: true }); // force: do not throw if missing
+     fs.rmSync(outDir, { recursive: true, force: true }); // outDir is absent if the CLI failed
+   });
+
+   test("prints usage with no args", () => {
+     const output = execFileSync(process.execPath, [cli], { encoding: "utf8" }); // no shell: paths with spaces are safe
+     expect(output).toMatch(/Usage:/);
+   });
+
+   test("extracts files when given args", () => {
+     execFileSync(process.execPath, [cli, rawDir, outDir]);
+
+     const outFile = path.join(outDir, "file1.json");
+     expect(fs.existsSync(outFile)).toBe(true);
+
+     const json = JSON.parse(fs.readFileSync(outFile, "utf8"));
+     expect(json.turns.length).toBeGreaterThan(0);
+   });
+ });