// ChatIPC := Chat Incremental Pattern Constructor

// Standard headers, listed according to the identifiers this file uses.
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <functional>
#include <iostream>
#include <mutex>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>
#include <sys/types.h>   // ssize_t (POSIX)

#ifdef _OPENMP
#include <omp.h>
#else
inline int omp_get_max_threads(){ return 1; }
inline int omp_get_thread_num(){ return 0; }
#endif

extern unsigned char dictionary_json[];   // provide dictionary.cpp to embed dictionary JSON bytes
extern unsigned int dictionary_json_len;

// --------------------------- Short utility functions ----------------------
static inline bool is_space(char c){ return std::isspace(static_cast<unsigned char>(c)) != 0; }
static inline char to_low(char c){ return static_cast<char>(std::tolower(static_cast<unsigned char>(c))); }
static inline void safe_flush(std::ostream &os){ os.flush(); }

// Tokenize by whitespace
static std::vector<std::string> tokenize_whitespace(const std::string &s){
    std::istringstream iss(s);
    std::vector<std::string> out;
    std::string t;
    while (iss >> t) out.push_back(t);
    return out;
}

// Tokenize by non-alphanumeric characters (for definitions).
// Hyphens and apostrophes are kept inside tokens; output is lower-cased.
static std::vector<std::string> tokenize_non_alnum(const std::string &s){
    std::vector<std::string> out;
    std::string cur;
    for (char ch : s){
        if (std::isalnum(static_cast<unsigned char>(ch)) || ch=='-' || ch=='\''){
            cur.push_back(to_low(ch));
        } else {
            if (!cur.empty()){ out.push_back(cur); cur.clear(); }
        }
    }
    if (!cur.empty()) out.push_back(cur);
    return out;
}

// --------------------------- String interning (short methods) --------------
struct StringInterner {
    std::unordered_set<std::string> pool;
    std::mutex m;
    // Returns a stable pointer: unordered_set never relocates its elements.
    const std::string* intern(const std::string &s){
        std::lock_guard<std::mutex> lk(m);
        auto it = pool.emplace(s).first;
        return &*it;
    }
};

// ---------- Global parsed dictionary (populated once in main) ----------
static std::unordered_map<std::string, std::string> global_def_dict;
static std::unordered_map<std::string, std::vector<std::string>> global_def_tokens_cache;

static void build_def_tokens_cache(){
    global_def_tokens_cache.clear();
    global_def_tokens_cache.reserve(global_def_dict.size());
    for (const auto &pr : global_def_dict){
        auto toks =
            tokenize_non_alnum(pr.second);
        std::sort(toks.begin(), toks.end());
        toks.erase(std::unique(toks.begin(), toks.end()), toks.end());
        global_def_tokens_cache.emplace(pr.first, std::move(toks));
    }
}

// --------------------------- Knowledge base (short methods) --------------
using StrPtr = const std::string*;
struct PtrHash { size_t operator()(StrPtr p) const noexcept { return std::hash<StrPtr>()(p); } };
struct PtrEq { bool operator()(StrPtr a, StrPtr b) const noexcept { return a == b; } };
using NextSet = std::vector<StrPtr>;

struct KnowledgeBase {
    StringInterner interner;
    std::unordered_map<StrPtr, NextSet, PtrHash, PtrEq> next;
    std::unordered_map<std::string, StrPtr> next_key_index;
    mutable std::mutex m;

    // def-index: for each interned word pointer -> list of interned tokens (definition expansion)
    std::unordered_map<StrPtr, std::vector<StrPtr>, PtrHash, PtrEq> def_index;
    mutable std::mutex def_m;
    int def_depth = 0;

    void add_pair_interned(StrPtr k, StrPtr v){
        std::lock_guard<std::mutex> lk(m);
        next_key_index.emplace(*k, k);
        auto &vec = next[k];
        for (auto p : vec) if (p == v) return;   // successor already recorded
        vec.push_back(v);
    }

    // set def depth; if changed, drop previously computed def expansions
    void set_def_depth(int D){
        std::lock_guard<std::mutex> lk(def_m);
        if (D != def_depth){
            def_index.clear();
            def_depth = D;
        }
    }

    void ensure_def_for_interned(StrPtr wp){
        if (wp == nullptr) return;
        if (def_depth <= 0) return;
        {
            std::lock_guard<std::mutex> lk(def_m);
            if (def_index.find(wp) != def_index.end()) return;
        }
        std::unordered_set<StrPtr, PtrHash, PtrEq> acc;
        std::vector<StrPtr> frontier;
        // Depth 1: tokens of the word's own definition.
        auto it_def = global_def_tokens_cache.find(*wp);
        if (it_def != global_def_tokens_cache.end()){
            for (const auto &t : it_def->second){
                StrPtr tp = interner.intern(t);
                if (acc.insert(tp).second) frontier.push_back(tp);
            }
        }
        // Depths 2..def_depth: expand the definitions of newly reached tokens.
        for (int depth = 1; depth < def_depth && !frontier.empty(); ++depth){
            std::vector<StrPtr> nextf;
            for (StrPtr w : frontier){
                auto it2 = global_def_tokens_cache.find(*w);
                if (it2 == global_def_tokens_cache.end()) continue;
                for (const auto &t : it2->second){
                    StrPtr tp = interner.intern(t);
                    if (acc.insert(tp).second) nextf.push_back(tp);
                }
            }
            frontier.swap(nextf);
        }
        std::vector<StrPtr> out;
        out.reserve(acc.size());
        for (StrPtr p : acc) out.push_back(p);
        {
            std::lock_guard<std::mutex> lk(def_m);
            // Another thread may have stored the expansion in the meantime.
            if (def_index.find(wp) == def_index.end()){
                def_index.emplace(wp, std::move(out));
            }
        }
    }

    // existing public add_pair but now ensure def-expansion is built immediately
    void add_pair(const std::string &k, const std::string &v){
        StrPtr kp = interner.intern(k);
        StrPtr vp = interner.intern(v);
        // ensure definition expansion for both words as soon as they are seen
        ensure_def_for_interned(kp);
        ensure_def_for_interned(vp);
        add_pair_interned(kp, vp);
    }

    std::optional<NextSet> lookup_by_string(const std::string &k) const {
        std::lock_guard<std::mutex> lk(m);
        auto kit = next_key_index.find(k);
        if (kit == next_key_index.end()) return std::nullopt;
        auto it = next.find(kit->second);
        if (it == next.end()) return std::nullopt;
        return it->second;
    }

    std::optional<NextSet> lookup_by_ptr(StrPtr k) const {
        std::lock_guard<std::mutex> lk(m);
        auto it = next.find(k);
        if (it == next.end()) return std::nullopt;
        return it->second;
    }
};

static std::vector<StrPtr> intern_tokens(KnowledgeBase &kb, const std::vector<std::string> &tokens)
{
    std::vector<StrPtr> out;
    out.reserve(tokens.size());
    for (const auto &t : tokens) out.push_back(kb.interner.intern(t));
    return out;
}

// Union of the tokens themselves and their definition expansions.
static std::unordered_set<std::string> aggregate_sets(
    const std::vector<StrPtr> &tokens,
    const std::unordered_map<StrPtr, std::vector<StrPtr>, PtrHash, PtrEq> &def_index)
{
    std::unordered_set<std::string> agg;
    for (StrPtr t : tokens){
        agg.insert(*t);
        auto it = def_index.find(t);
        if (it != def_index.end()){
            for (StrPtr d : it->second) agg.insert(*d);
        }
    }
    return agg;
}

// --------------------------- Small JSON parse helpers ----------------------
static inline bool json_valid_index(size_t i, size_t n){ return i < n; }

static std::string parse_quoted_string(const std::string &text, size_t &i){
    std::string out;
    if (!json_valid_index(i, text.size()) || text[i] != '"')
        throw std::runtime_error("expected '\"'");
    ++i;
    while (json_valid_index(i, text.size())){
        char c = text[i++];
        if (c == '"') break;
        if (c == '\\'){
            if (!json_valid_index(i, text.size())) break;
            char e = text[i++];
            if (e=='n') out.push_back('\n');
            else if (e=='t') out.push_back('\t');
            else out.push_back(e);   // covers \" and \\; other escapes pass through literally
        } else out.push_back(c);
    }
    return out;
}

static void skip_spaces(const std::string &s, size_t &i){
    while (json_valid_index(i, s.size()) && is_space(s[i])) ++i;
}

// Very small JSON-like parser tailored to dictionary_json structure
static std::unordered_map<std::string, std::string> parse_dictionary_json(){
    std::unordered_map<std::string, std::string> dict;
    if (dictionary_json_len == 0) return dict;
    std::string text;
    text.reserve(dictionary_json_len + 1);
    for (unsigned int b=0; b < dictionary_json_len; ++b)
        text.push_back(static_cast<char>(dictionary_json[b]));
    size_t i = 0;
    skip_spaces(text,i);
    if (!json_valid_index(i,text.size()) || text[i] != '{') return dict;
    ++i;
    while (true){
        skip_spaces(text,i);
        if (!json_valid_index(i,text.size())) break;
        if (text[i] == '}'){ ++i; break; }
        std::string key = parse_quoted_string(text,i);
        skip_spaces(text,i);
        if (!json_valid_index(i,text.size()) || text[i] != ':') break;
        ++i;
        skip_spaces(text,i);
        std::string val;
        if (json_valid_index(i,text.size()) && text[i] == '"') val = parse_quoted_string(text,i);
        else {
            // Unquoted value: take everything up to the next ',' or '}'.
            size_t start = i;
            while (json_valid_index(i,text.size()) && text[i] != ',' && text[i] != '}') ++i;
            val = text.substr(start, i-start);
        }
        dict.emplace(std::move(key), std::move(val));
        skip_spaces(text,i);
        if (json_valid_index(i,text.size()) && text[i] == ','){ ++i; continue; }
        if (json_valid_index(i,text.size()) && text[i] == '}'){ ++i; break; }
    }
    return dict;
}

// --------------------------- Candidate selection (short funcs) ---------------
static std::string best_candidate_by_similarity(
    const NextSet &cands,
    const std::vector<StrPtr> &prompt_ptrs,
    const std::vector<StrPtr> &resp_ptrs,
    const std::unordered_map<StrPtr, std::vector<StrPtr>, PtrHash, PtrEq> &def_index,
    const std::unordered_map<std::string,int> &recent_counts,
    double repeat_penalty)
{
    if (cands.empty()) return std::string();
    if (cands.size() == 1) return *cands[0];
    auto agg = aggregate_sets(prompt_ptrs, def_index);
    // Response tokens contribute only their definition expansions.
    for (StrPtr r : resp_ptrs){
        auto it =
            def_index.find(r);
        if (it != def_index.end()){
            for (StrPtr d : it->second) agg.insert(*d);
        }
    }
    double best = -1e9;
    std::string best_tok;
    size_t M = cands.size();
    std::vector<double> scores(M, 0.0);

    // Jaccard similarity between the context set (agg) and each candidate's
    // set {candidate} ∪ definition expansion. Iterations write disjoint slots.
    #pragma omp parallel for schedule(static)
    for (ptrdiff_t i = 0; i < static_cast<ptrdiff_t>(M); ++i){
        const StrPtr cand = cands[(size_t)i];
        size_t inter = agg.count(*cand) ? 1 : 0;
        size_t cand_size = 1;
        auto it = def_index.find(cand);
        if (it != def_index.end()){
            cand_size += it->second.size();
            for (StrPtr d : it->second){
                if (agg.count(*d)) ++inter;
            }
            // The candidate appearing in its own expansion was counted twice in cand_size.
            if (std::find(it->second.begin(), it->second.end(), cand) != it->second.end()){
                --cand_size;
            }
        }
        size_t uni = agg.size() + cand_size - inter;
        double s = uni ? static_cast<double>(inter) / static_cast<double>(uni) : 0.0;
        scores[(size_t)i] = s;
    }

    // Apply the repeat penalty; ties go to the lexicographically smaller token.
    for (size_t i = 0; i < M; ++i){
        const std::string &tok = *cands[i];
        double s = scores[i];
        auto rc_it = recent_counts.find(tok);
        int cnt = (rc_it == recent_counts.end() ? 0 : rc_it->second);
        double adjusted = s - repeat_penalty * static_cast<double>(cnt);
        if (adjusted > best || (adjusted == best && tok < best_tok)){
            best = adjusted;
            best_tok = tok;
        }
    }
    return best_tok;
}

// --------------------------- Response constructor (short units) ---------------
static std::vector<std::string> construct_response(
    KnowledgeBase &kb,
    const std::vector<std::string> &prompt_toks,
    size_t maxlen,
    double repeat_penalty)
{
    std::vector<std::string> resp;
    if (prompt_toks.empty() || maxlen == 0) return resp;
    auto prompt_ptrs = intern_tokens(kb, prompt_toks);
    std::vector<StrPtr> resp_ptrs;
    std::unordered_map<std::string,int> recent_counts;

    // True when appending cand would extend an A B A pattern to A B A B.
    auto would_create_2_cycle = [&](const std::string &cand) -> bool {
        if (resp.size() < 2) return false;
        const std::string &last = resp.back();
        const std::string &prev = resp[resp.size()-2];
        return (cand == prev && last == resp[resp.size()-3 < resp.size() ?
                                             resp.size()-3 : 0]);
    };

    std::string last_printed;
    for (size_t step = 0; step < maxlen; ++step){
        NextSet candidates;
        bool found = false;
        if (step == 0){
            // Seed from the last prompt token that has known successors.
            for (ssize_t p = static_cast<ssize_t>(prompt_toks.size()) - 1; p >= 0; --p){
                auto opt = kb.lookup_by_string(prompt_toks[(size_t)p]);
                if (opt){ candidates = *opt; found = true; break; }
            }
        } else {
            auto opt = kb.lookup_by_string(last_printed);
            if (opt){ candidates = *opt; found = true; }
            else {
                // Dead end: fall back to the prompt tokens, last to first.
                for (ssize_t p = static_cast<ssize_t>(prompt_toks.size()) - 1; p >= 0; --p){
                    auto opt2 = kb.lookup_by_string(prompt_toks[(size_t)p]);
                    if (opt2){ candidates = *opt2; found = true; break; }
                }
            }
        }
        if (!found || candidates.empty()) break;

        if (candidates.size() == 1){
            std::string only = *candidates[0];
            if (recent_counts[only] > 0) break;   // forced repeat: stop instead of looping
            resp.push_back(only);
            resp_ptrs.push_back(kb.interner.intern(only));
            recent_counts[only] += 1;
            last_printed = only;
            std::cout << only << ' ' << std::flush;
            continue;
        }

        std::string chosen = best_candidate_by_similarity(candidates, prompt_ptrs, resp_ptrs,
                                                          kb.def_index, recent_counts, repeat_penalty);
        if (chosen.empty()) break;
        if (would_create_2_cycle(chosen)) break;
        resp.push_back(chosen);
        resp_ptrs.push_back(kb.interner.intern(chosen));
        recent_counts[chosen] += 1;
        last_printed = chosen;
        std::cout << chosen << ' ' << std::flush;
    }
    return resp;
}

// --------------------------- Learning from files (short) -------------------
static void learn_from_file(KnowledgeBase &kb, const std::string &fname){
    std::ifstream ifs(fname);
    if (!ifs) return;
    std::string tok;
    std::string prev;
    bool have_prev = false;
    // Record every adjacent pair of whitespace-separated tokens.
    while (ifs >> tok){
        if (have_prev) kb.add_pair(prev, tok);
        prev = tok;
        have_prev = true;
    }
}

static void learn_files_parallel(KnowledgeBase &kb, const std::vector<std::string> &files){
    #pragma omp parallel for schedule(dynamic)
    for (ptrdiff_t i=0;i<static_cast<ptrdiff_t>(files.size());++i)
        learn_from_file(kb, files[(size_t)i]);
}

// --------------------------- Serialization (short functions) ----------------
static void save_kb_binary(const KnowledgeBase &kb,
                           const std::string &fname){
    std::ofstream ofs(fname, std::ios::binary);
    if (!ofs) throw std::runtime_error("cannot open save file");

    // interned strings snapshot (must include all tokens used by def_index)
    std::vector<StrPtr> interned;
    interned.reserve(kb.interner.pool.size());
    for (auto &s : kb.interner.pool) interned.push_back(&s);
    uint64_t N = interned.size();
    ofs.write(reinterpret_cast<const char*>(&N), sizeof(N));
    for (auto p : interned){
        uint64_t L = p->size();
        ofs.write(reinterpret_cast<const char*>(&L), sizeof(L));
        ofs.write(p->data(), static_cast<std::streamsize>(L));
    }
    std::unordered_map<StrPtr, uint64_t, PtrHash, PtrEq> ptr_to_idx;
    ptr_to_idx.reserve(interned.size());
    for (uint64_t i = 0; i < N; ++i){
        ptr_to_idx.emplace(interned[(size_t)i], i);
    }

    // edges
    uint64_t E = kb.next.size();
    ofs.write(reinterpret_cast<const char*>(&E), sizeof(E));
    for (auto &pr : kb.next){
        uint64_t key_idx = ptr_to_idx.at(pr.first);
        ofs.write(reinterpret_cast<const char*>(&key_idx), sizeof(key_idx));
        uint64_t M = pr.second.size();
        ofs.write(reinterpret_cast<const char*>(&M), sizeof(M));
        for (auto nxt : pr.second){
            uint64_t v_idx = ptr_to_idx.at(nxt);
            ofs.write(reinterpret_cast<const char*>(&v_idx), sizeof(v_idx));
        }
    }

    // --- write definition expansion section ---
    uint64_t D = static_cast<uint64_t>(kb.def_depth);
    ofs.write(reinterpret_cast<const char*>(&D), sizeof(D));

    // def entries: number of keys with a stored expansion
    uint64_t K = kb.def_index.size();
    ofs.write(reinterpret_cast<const char*>(&K), sizeof(K));
    for (auto &pr : kb.def_index){
        uint64_t key_idx = ptr_to_idx.at(pr.first);
        ofs.write(reinterpret_cast<const char*>(&key_idx), sizeof(key_idx));
        uint64_t M = pr.second.size();
        ofs.write(reinterpret_cast<const char*>(&M), sizeof(M));
        for (auto tokp : pr.second){
            uint64_t v_idx = ptr_to_idx.at(tokp);
            ofs.write(reinterpret_cast<const char*>(&v_idx), sizeof(v_idx));
        }
    }
    safe_flush(ofs);
}

static void load_kb_binary(KnowledgeBase &kb, const std::string &fname, int cli_dict_depth){
    std::ifstream ifs(fname, std::ios::binary);
    if (!ifs) throw std::runtime_error("cannot open load file");

    uint64_t N;
    ifs.read(reinterpret_cast<char*>(&N), sizeof(N));
    std::vector<std::string> strings;
    strings.reserve((size_t)N);
    kb.interner.pool.reserve((size_t)N);
    for (uint64_t i=0;i<N;++i){
        uint64_t L;
        ifs.read(reinterpret_cast<char*>(&L), sizeof(L));
        std::string s;
        s.resize((size_t)L);
        ifs.read(&s[0], static_cast<std::streamsize>(L));
        strings.push_back(std::move(s));
    }
    std::vector<StrPtr> ptrs;
    ptrs.reserve(strings.size());
    for (auto &s : strings) ptrs.push_back(kb.interner.intern(s));

    uint64_t E;
    ifs.read(reinterpret_cast<char*>(&E), sizeof(E));
    kb.next.reserve((size_t)E);
    kb.next_key_index.reserve((size_t)E);
    for (uint64_t i=0;i<E;++i){
        uint64_t key_idx;
        ifs.read(reinterpret_cast<char*>(&key_idx), sizeof(key_idx));
        uint64_t M;
        ifs.read(reinterpret_cast<char*>(&M), sizeof(M));
        StrPtr key_ptr = ptrs.at((size_t)key_idx);
        NextSet vec;
        vec.reserve((size_t)M);
        for (uint64_t j=0;j<M;++j){
            uint64_t v_idx;
            ifs.read(reinterpret_cast<char*>(&v_idx), sizeof(v_idx));
            vec.push_back(ptrs.at((size_t)v_idx));
        }
        kb.next.emplace(key_ptr, std::move(vec));
        // Register the key so lookup_by_string() can find loaded entries.
        kb.next_key_index.emplace(*key_ptr, key_ptr);
    }

    // read def-expansion section (new-format)
    uint64_t file_def_depth;
    ifs.read(reinterpret_cast<char*>(&file_def_depth), sizeof(file_def_depth));
    uint64_t K;
    ifs.read(reinterpret_cast<char*>(&K), sizeof(K));

    // populate kb.def_index from file
    {
        std::lock_guard<std::mutex> lk(kb.def_m);
        kb.def_index.clear();
        kb.def_index.reserve((size_t)K);
        kb.def_depth = static_cast<int>(file_def_depth);
    }
    for (uint64_t i=0;i<K;++i){
        uint64_t key_idx;
        ifs.read(reinterpret_cast<char*>(&key_idx), sizeof(key_idx));
        uint64_t M;
        ifs.read(reinterpret_cast<char*>(&M), sizeof(M));
        std::vector<StrPtr> tokens;
        tokens.reserve((size_t)M);
        for (uint64_t j=0;j<M;++j){
            uint64_t v_idx;
            ifs.read(reinterpret_cast<char*>(&v_idx), sizeof(v_idx));
            tokens.push_back(ptrs.at((size_t)v_idx));
        }
        kb.def_index.emplace(ptrs.at((size_t)key_idx), std::move(tokens));
    }

    // If CLI requested a different dict depth, clear and recompute expansion for loaded words only
    if (cli_dict_depth != kb.def_depth){
        kb.set_def_depth(cli_dict_depth);

        // --- build deduplicated union of "words present" = saved strings (ptrs) ∪ KB words (keys and neighbors)
        std::vector<StrPtr> targets;
        targets.reserve(ptrs.size() + kb.next.size()*2);
        {
            std::unordered_set<StrPtr, PtrHash, PtrEq> seen;
            // include all strings from the saved file
            for (auto p : ptrs) {
                if (seen.insert(p).second) targets.push_back(p);
            }
            // include all words present in KB edges (keys and their neighbors)
            for (auto &pr : kb.next) {
                if
                   (seen.insert(pr.first).second) targets.push_back(pr.first);
                for (auto v : pr.second) {
                    if (seen.insert(v).second) targets.push_back(v);
                }
            }
        }

        // --- recompute definition expansion for each target in parallel
        #pragma omp parallel for schedule(dynamic)
        for (ptrdiff_t i = 0; i < static_cast<ptrdiff_t>(targets.size()); ++i) {
            kb.ensure_def_for_interned(targets[(size_t)i]);
        }
    }
}

// --------------------------- CLI + Interactive loop (short functions) -----------
static void print_usage(const char *p){
    std::cout << "Usage: " << p << " [--maxlen N] [--save FILE] [--load-kb FILE] [--dict-depth D] [--learn f1 f2 ...] [--repeat-penalty P] [--help]\n";
    std::cout << " --maxlen N Maximum number of tokens constructed in a response.\n";
    std::cout << " --save FILE Save the knowledge-base and dictionary expansions to a binary file.\n";
    std::cout << " --load-kb FILE Load a previously saved knowledge-base (and dictionary expansions) from a binary file.\n";
    std::cout << " --dict-depth D Depth of dictionary-definition expansion used during learning.\n";
    std::cout << " --learn f1 f2 ... Learn from one or more text files to update the knowledge base.\n";
    std::cout << " --repeat-penalty P Penalize repeated tokens during response generation (higher values discourage repetition).\n";
    std::cout << " --help Show command-line interface options for ChatIPC usage.\n";
}

int main(int argc, char **argv){
    size_t maxlen = 100;
    std::string savefile;
    std::string load_txt;
    std::string load_kb;
    int dict_depth = 2;
    double repeat_penalty = 0.7; // default λ
    std::vector<std::string> learn_files;
    for (int i=1;i " , std::getline(std::cin, line)){
        if (line.empty()){ std::cout << "\n"; continue; }
        auto prompt_toks = tokenize_whitespace(line);
        for (size_t i=1;i combined = prompt_toks;
        combined.insert(combined.end(), resp.begin(), resp.end());
        for (size_t i=1;i