// ChatIPC := Chat Incremental Pattern Constructor
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <mutex>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#else
inline int omp_get_max_threads(){ return 1; }
inline int omp_get_thread_num(){ return 0; }
#endif

extern unsigned char dictionary_json[];   // provide dictionary.cpp to embed dictionary JSON bytes
extern unsigned int dictionary_json_len;

// --------------------------- Short utility functions ----------------------
static inline bool is_space(char c){ return std::isspace(static_cast<unsigned char>(c)) != 0; }
static inline char to_low(char c){ return static_cast<char>(std::tolower(static_cast<unsigned char>(c))); }
static inline void safe_flush(std::ostream &os){ os.flush(); }

// Tokenize by whitespace
static std::vector<std::string> tokenize_whitespace(const std::string &s){
    std::istringstream iss(s);
    std::vector<std::string> out;
    std::string t;
    while (iss >> t) out.push_back(t);
    return out;
}

// Tokenize by non-alphanumeric characters (for definitions)
static std::vector<std::string> tokenize_non_alnum(const std::string &s){
    std::vector<std::string> out;
    std::string cur;
    for (char ch : s){
        if (std::isalnum(static_cast<unsigned char>(ch)) || ch == '-' || ch == '\''){
            cur.push_back(to_low(ch));
        } else if (!cur.empty()){
            out.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) out.push_back(cur);
    return out;
}

// --------------------------- String interning (short methods) --------------
struct StringInterner {
    std::unordered_set<std::string> pool;
    std::mutex m;
    const std::string* intern(const std::string &s){
        std::lock_guard<std::mutex> lk(m);
        auto it = pool.find(s);
        if (it != pool.end()) return &*it;
        auto pr = pool.insert(s);
        return &*pr.first;
    }
};

// ---------- Global parsed dictionary (populated once in main) ----------
static std::unordered_map<std::string, std::string> g_raw_dict;

static std::unordered_set<std::string> def_tokens_from_text(const std::string &s){
    auto toks = tokenize_non_alnum(s);
    return std::unordered_set<std::string>(toks.begin(), toks.end());
}

// --------------------------- Knowledge base (short methods) ----------------
using StrPtr = const std::string*;
struct PtrHash { size_t operator()(StrPtr p) const noexcept { return std::hash<std::string>()(*p); } };
struct PtrEq { bool operator()(StrPtr a, StrPtr b) const noexcept { return *a == *b; } };
using NextSet = std::vector<StrPtr>;

struct KnowledgeBase {
    StringInterner interner;
    std::unordered_map<StrPtr, NextSet, PtrHash, PtrEq> next;
    std::mutex m;
    // def-index: for each interned word pointer -> list of interned tokens (definition expansion)
    std::unordered_map<StrPtr, std::vector<StrPtr>, PtrHash, PtrEq> def_index;
    std::mutex def_m;
    int def_depth = 0;

    void add_pair_interned(StrPtr k, StrPtr v){
        std::lock_guard<std::mutex> lk(m);
        auto &vec = next[k];
        for (auto p : vec) if (*p == *v) return;
        vec.push_back(v);
    }

    // Set def depth; if changed, drop previously computed def expansions.
    void set_def_depth(int D){
        std::lock_guard<std::mutex> lk(def_m);
        if (D != def_depth){
            def_index.clear();
            def_depth = D;
        }
    }

    // Compute the definition expansion for a single interned word (if needed).
    void ensure_def_for_interned(StrPtr wp){
        // quick no-op checks
        if (wp == nullptr) return;
        if (def_depth <= 0) return;
        // double-checked locking: bail out cheaply if already computed
        {
            std::lock_guard<std::mutex> lk(def_m);
            if (def_index.find(wp) != def_index.end()) return;
        }
        // compute expansion (breadth-first over definitions) using g_raw_dict
        std::unordered_set<std::string> acc;
        std::vector<std::string> frontier;
        auto it_raw = g_raw_dict.find(*wp);
        if (it_raw != g_raw_dict.end()){
            auto toks = def_tokens_from_text(it_raw->second);
            for (auto &t : toks){
                if (acc.insert(t).second) frontier.push_back(t);
            }
        }
        for (int depth = 1; depth < def_depth && !frontier.empty(); ++depth){
            std::vector<std::string> nextf;
            for (auto &w : frontier){
                auto it2 = g_raw_dict.find(w);
                if (it2 == g_raw_dict.end()) continue;
                auto toks2 = def_tokens_from_text(it2->second);
                for (auto &t : toks2){
                    if (acc.insert(t).second) nextf.push_back(t);
                }
            }
            frontier.swap(nextf);
        }
        // intern all accumulated tokens and store pointers
        std::vector<StrPtr> out;
        out.reserve(acc.size());
        for (auto &s : acc) out.push_back(interner.intern(s));
        // store atomically (prevent double insertion)
        {
            std::lock_guard<std::mutex> lk(def_m);
            // another thread may have inserted meanwhile; do not overwrite
            if (def_index.find(wp) == def_index.end()){
                def_index.emplace(wp, std::move(out));
            }
        }
    }

    // Public add_pair; builds the def-expansion for both words as soon as they are seen.
    void add_pair(const std::string &k, const std::string &v){
        StrPtr kp = interner.intern(k);
        StrPtr vp = interner.intern(v);
        ensure_def_for_interned(kp);
        ensure_def_for_interned(vp);
        add_pair_interned(kp, vp);
    }

    // PtrHash/PtrEq hash and compare by value, so a stack address works as a probe key.
    std::optional<NextSet> lookup_by_string(const std::string &k) const {
        auto it = next.find(&k);
        if (it == next.end()) return std::nullopt;
        return it->second;
    }
    std::optional<NextSet> lookup_by_ptr(StrPtr k) const {
        auto it = next.find(k);
        if (it == next.end()) return std::nullopt;
        return it->second;
    }
};

// Thread-safe snapshot of kb.def_index as a string-based def-index.
static std::unordered_map<std::string, std::unordered_set<std::string>>
snapshot_def_index(KnowledgeBase &kb){
    std::unordered_map<std::string, std::unordered_set<std::string>> out;
    std::lock_guard<std::mutex> lk(kb.def_m);
    out.reserve(kb.def_index.size());
    for (auto &pr : kb.def_index){
        std::unordered_set<std::string> s;
        s.reserve(pr.second.size());
        for (auto p : pr.second) s.insert(*p);
        out.emplace(*pr.first, std::move(s));
    }
    return out;
}

// --------------------------- Small JSON parse helpers ----------------------
static inline bool json_valid_index(size_t i, size_t n){ return i < n; }

static std::string parse_quoted_string(const std::string &text, size_t &i){
    std::string out;
    if (!json_valid_index(i, text.size()) || text[i] != '"') throw std::runtime_error("expected '\"'");
    ++i;
    while (json_valid_index(i, text.size())){
        char c = text[i++];
        if (c == '"') break;
        if (c == '\\'){
            if (!json_valid_index(i, text.size())) break;
            char e = text[i++];
            if (e == 'n') out.push_back('\n');
            else if (e == 't') out.push_back('\t');
            else out.push_back(e);
        } else out.push_back(c);
    }
    return out;
}

static void skip_spaces(const std::string &s, size_t &i){
    while (json_valid_index(i, s.size()) && is_space(s[i])) ++i;
}

// Very small JSON-like parser tailored to the dictionary_json structure.
static std::unordered_map<std::string, std::string> parse_dictionary_json(){
    std::unordered_map<std::string, std::string> dict;
    if (dictionary_json_len == 0) return dict;
    std::string text;
    text.reserve(dictionary_json_len + 1);
    for (unsigned int b = 0; b < dictionary_json_len; ++b)
        text.push_back(static_cast<char>(dictionary_json[b]));
    size_t i = 0;
    skip_spaces(text, i);
    if (!json_valid_index(i, text.size()) || text[i] != '{') return dict;
    ++i;
    while (true){
        skip_spaces(text, i);
        if (!json_valid_index(i, text.size())) break;
        if (text[i] == '}'){ ++i; break; }
        std::string key = parse_quoted_string(text, i);
        skip_spaces(text, i);
        if (!json_valid_index(i, text.size()) || text[i] != ':') break;
        ++i;
        skip_spaces(text, i);
        std::string val;
        if (json_valid_index(i, text.size()) && text[i] == '"') val = parse_quoted_string(text, i);
        else {
            size_t start = i;
            while (json_valid_index(i, text.size()) && text[i] != ',' && text[i] != '}') ++i;
            val = text.substr(start, i - start);
        }
        dict.emplace(std::move(key), std::move(val));
        skip_spaces(text, i);
        if (json_valid_index(i, text.size()) && text[i] == ','){ ++i; continue; }
        if (json_valid_index(i, text.size()) && text[i] == '}'){ ++i; break; }
    }
    return dict;
}

// --------------------------- Similarity helpers (very small) ----------------
static double jaccard_similarity(const std::unordered_set<std::string> &A,
                                 const std::unordered_set<std::string> &B){
    if (A.empty() && B.empty()) return 1.0;
    size_t inter = 0;
    if (A.size() < B.size()){
        for (const auto &x : A) if (B.count(x)) ++inter;
    } else {
        for (const auto &x : B) if (A.count(x)) ++inter;
    }
    size_t uni = A.size() + B.size() - inter;
    if (uni == 0) return 0.0;
    return static_cast<double>(inter) / static_cast<double>(uni);
}

static std::unordered_set<std::string> aggregate_sets(
        const std::vector<std::string> &tokens,
        const std::unordered_map<std::string, std::unordered_set<std::string>> &def_index){
    std::unordered_set<std::string> agg;
    for (auto &t : tokens){
        agg.insert(t);
        auto it = def_index.find(t);
        if (it != def_index.end()){
            for (auto &d : it->second) agg.insert(d);
        }
    }
    return agg;
}

// --------------------------- Candidate selection (short funcs) ---------------
static std::string best_candidate_by_similarity(
        const NextSet &cands,
        const std::vector<std::string> &prompt_toks,
        const std::vector<std::string> &resp_toks,
        const std::unordered_map<std::string, std::unordered_set<std::string>> &def_index,
        const std::unordered_map<std::string, int> &recent_counts,
        double repeat_penalty){
    if (cands.empty()) return std::string();
    if (cands.size() == 1) return *cands[0];
    auto agg = aggregate_sets(prompt_toks, def_index);
    for (auto &r : resp_toks){
        auto it = def_index.find(r);
        if (it != def_index.end())
            for (auto &d : it->second) agg.insert(d);
    }
    double best = -1e9;
    std::string best_tok;
    size_t M = cands.size();
    std::vector<double> scores(M, 0.0);
    #pragma omp parallel for schedule(static)
    for (ptrdiff_t i = 0; i < static_cast<ptrdiff_t>(M); ++i){
        std::unordered_set<std::string> candset;
        candset.insert(*cands[(size_t)i]);
        auto it = def_index.find(*cands[(size_t)i]);
        if (it != def_index.end())
            for (auto &d : it->second) candset.insert(d);
        scores[(size_t)i] = jaccard_similarity(agg, candset);
    }
    for (size_t i = 0; i < M; ++i){
        const std::string &tok = *cands[i];
        int cnt = 0;
        auto itc = recent_counts.find(tok);
        if (itc != recent_counts.end()) cnt = itc->second;
        double adjusted = scores[i] - repeat_penalty * static_cast<double>(cnt);
        // ties broken lexicographically for deterministic output
        if (adjusted > best || (adjusted == best && tok < best_tok)){
            best = adjusted;
            best_tok = tok;
        }
    }
    return best_tok;
}

// --------------------------- Response constructor (short units) ---------------
static std::vector<std::string> construct_response(
        KnowledgeBase &kb,
        const std::vector<std::string> &prompt_toks,
        size_t maxlen,
        const std::unordered_map<std::string, std::unordered_set<std::string>> &def_index,
        double repeat_penalty){
    std::vector<std::string> resp;
    if (prompt_toks.empty() || maxlen == 0) return resp;
    std::unordered_map<std::string, int> recent_counts;
    auto would_create_2_cycle = [&](const std::string &cand) -> bool {
        // Check alternation: if resp ends ... Y X Y, then candidate == X would
        // extend a trivial X Y X Y cycle. This is a cheap conservative check;
        // the main guard is repeat_penalty plus the single-candidate rule.
        if (resp.size() < 3) return false;
        const std::string &last = resp.back();
        const std::string &prev = resp[resp.size() - 2];
        return cand == prev && last == resp[resp.size() - 3];
    };
    std::string last_printed;
    for (size_t step = 0; step < maxlen; ++step){
        NextSet candidates;
        bool found = false;
        if (last_printed.empty()){
            for (ptrdiff_t p = static_cast<ptrdiff_t>(prompt_toks.size()) - 1; p >= 0; --p){
                auto opt = kb.lookup_by_string(prompt_toks[(size_t)p]);
                if (opt){ candidates = *opt; found = true; break; }
            }
        } else {
            auto opt = kb.lookup_by_string(last_printed);
            if (opt){ candidates = *opt; found = true; }
            else {
                for (ptrdiff_t p = static_cast<ptrdiff_t>(prompt_toks.size()) - 1; p >= 0; --p){
                    auto opt2 = kb.lookup_by_string(prompt_toks[(size_t)p]);
                    if (opt2){ candidates = *opt2; found = true; break; }
                }
            }
        }
        if (!found || candidates.empty()) break;
        // If only one candidate and it already appeared, stop to avoid a 1-cycle.
        if (candidates.size() == 1){
            std::string only = *candidates[0];
            if (recent_counts[only] > 0) break;
            resp.push_back(only);
            recent_counts[only] += 1;
            last_printed = only;
            std::cout << only << ' ' << std::flush;  // print immediately
            continue;
        }
        // choose best with repeat penalty
        std::string chosen = best_candidate_by_similarity(candidates, prompt_toks, resp,
                                                          def_index, recent_counts, repeat_penalty);
        if (chosen.empty()) break;
        // cheap 2-cycle avoider: if this would continue a trivial alternation, stop
        if (would_create_2_cycle(chosen)) break;
        resp.push_back(chosen);
        recent_counts[chosen] += 1;
        last_printed = chosen;
        std::cout << chosen << ' ' << std::flush;  // print immediately
    }
    return resp;
}

// --------------------------- Learning from files (short) -------------------
static void learn_from_file(KnowledgeBase &kb, const std::string &fname){
    std::ifstream ifs(fname);
    if (!ifs) return;
    std::string tok;
    std::string prev;
    bool have_prev = false;
    while (ifs >> tok){
        if (have_prev) kb.add_pair(prev, tok);
        prev = tok;
        have_prev = true;
    }
}

static void learn_files_parallel(KnowledgeBase &kb, const std::vector<std::string> &files){
    #pragma omp parallel for schedule(dynamic)
    for (ptrdiff_t i = 0; i < static_cast<ptrdiff_t>(files.size()); ++i)
        learn_from_file(kb, files[(size_t)i]);
}

// --------------------------- Serialization (short functions) ----------------
static void save_kb_binary(const KnowledgeBase &kb, const std::string &fname){
    std::ofstream ofs(fname, std::ios::binary);
    if (!ofs) throw std::runtime_error("cannot open save file");
    // interned strings snapshot (must include all tokens used by def_index)
    std::vector<const std::string*> interned;
    interned.reserve(kb.interner.pool.size());
    for (auto &s : kb.interner.pool) interned.push_back(&s);
    // pointer -> index map: all StrPtrs in next/def_index come from the
    // interner, so pointer identity implies value identity and a single map
    // replaces a linear search per edge
    std::unordered_map<const std::string*, uint64_t> index_of;
    index_of.reserve(interned.size());
    for (size_t i = 0; i < interned.size(); ++i) index_of.emplace(interned[i], (uint64_t)i);
    auto idx_of = [&](StrPtr p) -> uint64_t {
        auto it = index_of.find(p);
        if (it == index_of.end()) throw std::runtime_error("save index error");
        return it->second;
    };
    uint64_t N = interned.size();
    ofs.write(reinterpret_cast<const char*>(&N), sizeof(N));
    for (auto p : interned){
        uint64_t L = p->size();
        ofs.write(reinterpret_cast<const char*>(&L), sizeof(L));
        ofs.write(p->data(), static_cast<std::streamsize>(L));
    }
    // edges
    uint64_t E = kb.next.size();
    ofs.write(reinterpret_cast<const char*>(&E), sizeof(E));
    for (auto &pr : kb.next){
        uint64_t key_idx = idx_of(pr.first);
        ofs.write(reinterpret_cast<const char*>(&key_idx), sizeof(key_idx));
        uint64_t M = pr.second.size();
        ofs.write(reinterpret_cast<const char*>(&M), sizeof(M));
        for (auto nxt : pr.second){
            uint64_t v_idx = idx_of(nxt);
            ofs.write(reinterpret_cast<const char*>(&v_idx), sizeof(v_idx));
        }
    }
    // --- write definition expansion section ---
    uint64_t D = static_cast<uint64_t>(kb.def_depth);
    ofs.write(reinterpret_cast<const char*>(&D), sizeof(D));
    // def entries: number of keys with a stored expansion
    uint64_t K = kb.def_index.size();
    ofs.write(reinterpret_cast<const char*>(&K), sizeof(K));
    for (auto &pr : kb.def_index){
        uint64_t key_idx = idx_of(pr.first);
        ofs.write(reinterpret_cast<const char*>(&key_idx), sizeof(key_idx));
        uint64_t M = pr.second.size();
        ofs.write(reinterpret_cast<const char*>(&M), sizeof(M));
        for (auto tokp : pr.second){
            uint64_t v_idx = idx_of(tokp);
            ofs.write(reinterpret_cast<const char*>(&v_idx), sizeof(v_idx));
        }
    }
    safe_flush(ofs);
}

static void load_kb_binary(KnowledgeBase &kb, const std::string &fname, int cli_dict_depth){
    std::ifstream ifs(fname, std::ios::binary);
    if (!ifs) throw std::runtime_error("cannot open load file");
    uint64_t N;
    ifs.read(reinterpret_cast<char*>(&N), sizeof(N));
    std::vector<std::string> strings;
    strings.reserve((size_t)N);
    for (uint64_t i = 0; i < N; ++i){
        uint64_t L;
        ifs.read(reinterpret_cast<char*>(&L), sizeof(L));
        std::string s;
        s.resize((size_t)L);
        ifs.read(&s[0], static_cast<std::streamsize>(L));
        strings.push_back(std::move(s));
    }
    std::vector<StrPtr> ptrs;
    ptrs.reserve(strings.size());
    for (auto &s : strings) ptrs.push_back(kb.interner.intern(s));
    uint64_t E;
    ifs.read(reinterpret_cast<char*>(&E), sizeof(E));
    for (uint64_t i = 0; i < E; ++i){
        uint64_t key_idx;
        ifs.read(reinterpret_cast<char*>(&key_idx), sizeof(key_idx));
        uint64_t M;
        ifs.read(reinterpret_cast<char*>(&M), sizeof(M));
        StrPtr key_ptr = ptrs.at((size_t)key_idx);
        NextSet vec;
        vec.reserve((size_t)M);
        for (uint64_t j = 0; j < M; ++j){
            uint64_t v_idx;
            ifs.read(reinterpret_cast<char*>(&v_idx), sizeof(v_idx));
            vec.push_back(ptrs.at((size_t)v_idx));
        }
        kb.next.emplace(key_ptr, std::move(vec));
    }
    // read def-expansion section (new format)
    uint64_t file_def_depth;
    ifs.read(reinterpret_cast<char*>(&file_def_depth), sizeof(file_def_depth));
    uint64_t K;
    ifs.read(reinterpret_cast<char*>(&K), sizeof(K));
    // populate kb.def_index from file
    {
        std::lock_guard<std::mutex> lk(kb.def_m);
        kb.def_index.clear();
        kb.def_depth = static_cast<int>(file_def_depth);
    }
    for (uint64_t i = 0; i < K; ++i){
        uint64_t key_idx;
        ifs.read(reinterpret_cast<char*>(&key_idx), sizeof(key_idx));
        uint64_t M;
        ifs.read(reinterpret_cast<char*>(&M), sizeof(M));
        std::vector<StrPtr> tokens;
        tokens.reserve((size_t)M);
        for (uint64_t j = 0; j < M; ++j){
            uint64_t v_idx;
            ifs.read(reinterpret_cast<char*>(&v_idx), sizeof(v_idx));
            tokens.push_back(ptrs.at((size_t)v_idx));
        }
        kb.def_index.emplace(ptrs.at((size_t)key_idx), std::move(tokens));
    }
    // If the CLI requested a different dict depth, clear and recompute the
    // expansions for loaded words only.
    if (cli_dict_depth != kb.def_depth){
        kb.set_def_depth(cli_dict_depth);
        // build deduplicated union of "words present" = saved strings ∪ KB words (keys and neighbors)
        std::vector<StrPtr> targets;
        targets.reserve(ptrs.size() + kb.next.size() * 2);
        {
            std::unordered_set<StrPtr> seen;
            // include all strings from the saved file
            for (auto p : ptrs){ if (seen.insert(p).second) targets.push_back(p); }
            // include all words present in KB edges (keys and their neighbors)
            for (auto &pr : kb.next){
                if (seen.insert(pr.first).second) targets.push_back(pr.first);
                for (auto v : pr.second){ if (seen.insert(v).second) targets.push_back(v); }
            }
        }
        // recompute the definition expansion for each target in parallel
        #pragma omp parallel for schedule(dynamic)
        for (ptrdiff_t i = 0; i < static_cast<ptrdiff_t>(targets.size()); ++i)
            kb.ensure_def_for_interned(targets[(size_t)i]);
    }
}

// --------------------------- CLI + Interactive loop (short) ----------------
static void print_usage(const char *p){
    std::cout << "Usage: " << p << " [--maxlen N] [--save FILE] [--load-kb FILE] [--dict-depth D] [--learn f1 f2 ...] [--repeat-penalty P] [--help]\n";
    std::cout << "  --maxlen N           Maximum number of tokens constructed in a response.\n";
    std::cout << "  --save FILE          Save the knowledge base and dictionary expansions to a binary file.\n";
    std::cout << "  --load-kb FILE       Load a previously saved knowledge base (and dictionary expansions) from a binary file.\n";
    std::cout << "  --dict-depth D       Depth of dictionary-definition expansion used during learning.\n";
    std::cout << "  --learn f1 f2 ...    Learn from one or more text files to update the knowledge base.\n";
    std::cout << "  --repeat-penalty P   Penalize repeated tokens during response generation (higher values discourage repetition).\n";
    std::cout << "  --help               Show command-line interface options for ChatIPC usage.\n";
}

int main(int argc, char **argv){
    size_t maxlen = 100;
    std::string savefile;
    std::string load_kb;
    int dict_depth = 2;
    double repeat_penalty = 0.7;  // default λ
    std::vector<std::string> learn_files;
    // NOTE: the option-parsing body was lost in this copy of the source; the
    // loop below is reconstructed from the flags documented in print_usage().
    for (int i = 1; i < argc; ++i){
        std::string a = argv[i];
        if (a == "--maxlen" && i + 1 < argc) maxlen = static_cast<size_t>(std::stoul(argv[++i]));
        else if (a == "--save" && i + 1 < argc) savefile = argv[++i];
        else if (a == "--load-kb" && i + 1 < argc) load_kb = argv[++i];
        else if (a == "--dict-depth" && i + 1 < argc) dict_depth = std::stoi(argv[++i]);
        else if (a == "--repeat-penalty" && i + 1 < argc) repeat_penalty = std::stod(argv[++i]);
        else if (a == "--learn"){
            while (i + 1 < argc && argv[i + 1][0] != '-') learn_files.push_back(argv[++i]);
        }
        else if (a == "--help"){ print_usage(argv[0]); return 0; }
        else { print_usage(argv[0]); return 1; }
    }

    g_raw_dict = parse_dictionary_json();

    KnowledgeBase kb;
    if (!load_kb.empty()) load_kb_binary(kb, load_kb, dict_depth);
    else kb.set_def_depth(dict_depth);
    if (!learn_files.empty()) learn_files_parallel(kb, learn_files);

    // Interactive loop: learn adjacent-token pairs from the prompt, construct
    // a reply, then learn from prompt + reply combined.
    std::string line;
    while (std::cout << "> ", std::getline(std::cin, line)){
        if (line.empty()){ std::cout << "\n"; continue; }
        auto prompt_toks = tokenize_whitespace(line);
        for (size_t i = 1; i < prompt_toks.size(); ++i)
            kb.add_pair(prompt_toks[i - 1], prompt_toks[i]);
        auto def_index = snapshot_def_index(kb);
        auto resp = construct_response(kb, prompt_toks, maxlen, def_index, repeat_penalty);
        std::cout << "\n";
        std::vector<std::string> combined = prompt_toks;
        combined.insert(combined.end(), resp.begin(), resp.end());
        for (size_t i = 1; i < combined.size(); ++i)
            kb.add_pair(combined[i - 1], combined[i]);
    }
    if (!savefile.empty()) save_kb_binary(kb, savefile);
    return 0;
}
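// ---------------------------------------------------------------------------
// Example build and run (a sketch; "chatipc.cpp" and "corpus.txt" are
// placeholder names, and dictionary.cpp must define dictionary_json /
// dictionary_json_len):
//
//   g++ -std=c++17 -O2 -fopenmp chatipc.cpp dictionary.cpp -o chatipc
//   ./chatipc --dict-depth 2 --repeat-penalty 0.7 --learn corpus.txt --save kb.bin
//
// Without OpenMP support, drop -fopenmp; the omp_get_* stubs at the top of
// this file keep the build working single-threaded.
// ---------------------------------------------------------------------------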