Spaces: htg2501/AI

Build error

htg2501 committed on Feb 1

Commit 21b70be (verified) · 1 Parent(s): dff3652

Upload 12 files

Files changed (13)

.gitattributes +2 -0
Dockerfile +23 -20
LDA_models.pkl +3 -0
Logo_UET.png +0 -0
README.md +84 -19
__init__.py +0 -0
app.py +147 -0
c_25_0.3071.mdl +3 -0
dict_map.json +1 -0
e_25_0.3071.mdl +3 -0
requirements.txt +11 -3
summarization.py +616 -0
vietnamese-stopwords-dash.txt +1942 -0

.gitattributes CHANGED

@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+c_25_0.3071.mdl filter=lfs diff=lfs merge=lfs -text
+e_25_0.3071.mdl filter=lfs diff=lfs merge=lfs -text

Dockerfile CHANGED

@@ -1,20 +1,23 @@
-FROM python:3.13.5-slim
-WORKDIR /app
-RUN apt-get update && apt-get install -y \
-    build-essential \
-    curl \
-    git \
-    && rm -rf /var/lib/apt/lists/*
-COPY requirements.txt ./
-COPY src/ ./src/
-RUN pip3 install -r requirements.txt
-EXPOSE 8501
-HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health
-ENTRYPOINT ["streamlit", "run", "src/streamlit_app.py", "--server.port=8501", "--server.address=0.0.0.0"]

+FROM python:3.9-slim
+WORKDIR /app
+# Install the required system libraries
+RUN apt-get update && apt-get install -y \
+    build-essential \
+    git \
+    && rm -rf /var/lib/apt/lists/*
+# Copy requirements first so the dependency layer can be cached
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+# Copy all source code and model files
+COPY . .
+# Expose the port for Hugging Face Spaces
+EXPOSE 7860
+# Run Streamlit
+CMD ["streamlit", "run", "app.py", "--server.port=7860", "--server.address=0.0.0.0"]

LDA_models.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7fe550070538e5cf9f47f4f94dacb170662b607a69c1d294ef03c6c5400031d7
+size 1983272
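The three added lines above are a Git LFS pointer, not the pickle itself. If the LFS object is not fetched at build time, `pickle.load` on `LDA_models.pkl` would read this pointer text instead of the model. The following is a minimal sketch, assuming you want to check a downloaded file against its pointer; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of this commit.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Return True if raw bytes have the size and SHA-256 digest the pointer records."""
    if len(data) != pointer["size"]:
        return False
    return hashlib.sha256(data).hexdigest() == pointer["digest"]


# Pointer text copied from the LDA_models.pkl diff above.
pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:7fe550070538e5cf9f47f4f94dacb170662b607a69c1d294ef03c6c5400031d7
size 1983272
"""
pointer = parse_lfs_pointer(pointer_text)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer(open("LDA_models.pkl", "rb").read(), pointer)` before unpickling would distinguish a real model file from an unfetched pointer.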

Logo_UET.png ADDED

README.md CHANGED

@@ -1,19 +1,84 @@
----
-title: AI
-emoji: 🚀
-colorFrom: red
-colorTo: red
-sdk: docker
-app_port: 8501
-tags:
-- streamlit
-pinned: false
-short_description: Streamlit template space
----
-# Welcome to Streamlit!
-Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
-forums](https://discuss.streamlit.io).

+# VNU Summarizer - Vietnamese Multi-Document Summarization System
+![Logo UET](./Logo_UET.png)
+## Introduction
+VNU Summarizer is a web application that provides multi-document summarization for Vietnamese. It is built on Streamlit and offers an intuitive, easy-to-use interface.
+## Goals
+- Produce high-quality summaries from multiple input documents
+- Support both summarization methods: extractive and abstractive
+- Provide a tool for evaluating summary quality with ROUGE metrics
+- Offer a friendly, easy-to-use user interface
+## Main features
+### 1. Flexible input
+- **Direct text entry**: Users can add multiple text input areas
+- **File upload**: Supports common file formats (txt, pdf, docx)
+### 2. Summarization methods
+- **Extractive summarization**: Extracts the important sentences from the source text
+- **Abstractive summarization**: Generates a new summary in its own wording
+### 3. Parameter tuning
+- **Compression ratio**: Users can choose a compression ratio from 0 to 50%
+- **Number of output sentences**: Users can specify how many sentences the summary should contain
+### 4. Quality evaluation
+- **ROUGE metrics**: The system reports ROUGE-1, ROUGE-2, and ROUGE-L to evaluate summary quality
+- **Reference summary**: Users can enter a reference summary to compare against the system output
+## Usage
+1. **Enter text**:
+   - Choose an input method (direct entry or file upload)
+   - For direct entry, use the "Thêm vùng nhập văn bản" ("Add text input area") button to add more documents
+   - For file upload, drag and drop the files into the designated area
+2. **Enter a reference summary** (optional):
+   - Enter a reference summary for the extractive method
+   - Enter a reference summary for the abstractive method
+3. **Configure summarization**:
+   - Choose the shortening method (ratio or sentence count)
+   - Adjust the compression ratio or the number of output sentences as needed
+4. **View results**:
+   - Click the "Tóm tắt" ("Summarize") button to see the results
+   - The results show both summarization methods together with their ROUGE scores
+## Source code structure
+The application is built on the following main components:
+- `streamlit`: Framework for building the web interface
+- `api.summarization.MultiDocSummarizationAPI`: Main API for multi-document summarization
+- `fitz`: PDF processing library
+- `docx`: Word document processing library
+## System requirements
+- Python 3.11
+- Streamlit
+- PyMuPDF (fitz)
+- python-docx
+- Other dependencies listed in requirements.txt
+## Installation and running
+```bash
+# Clone the repository
+git clone <repository-url>
+# Change into the project directory
+cd vnu-summarizer
+# Install the dependencies
+pip install -r requirements.txt
+# Run the application
+streamlit run app.py
+```
+Also download checkpoint-2200 from Notion.
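The README names ROUGE-1, ROUGE-2, and ROUGE-L as its quality metrics. As a rough illustration of what ROUGE-N measures, here is a simplified sketch of clipped n-gram overlap. It assumes whitespace tokenization and is not the implementation used by the `rouge` package listed in requirements.txt; `ngram_counts` and `rouge_n` are illustrative names.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Return precision, recall and F1 of n-gram overlap (whitespace tokens)."""
    ref = ngram_counts(reference.lower().split(), n)
    cand = ngram_counts(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    overlap = sum((ref & cand).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"p": precision, "r": recall, "f": f1}


print(rouge_n("hà nội là thủ đô", "hà nội là thành phố", n=2))
```

For Vietnamese, scores depend heavily on tokenization, so a word segmenter would normally run before counting n-grams.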

__init__.py ADDED

File without changes

app.py ADDED

	@@ -0,0 +1,147 @@

+import streamlit as st
+st.set_page_config(page_title="VNU Summarizer", layout="wide")
+# Inject JavaScript to change the page title immediately
+st.markdown(
+    """
+    <script>
+        document.title = "VNU Summarizer";
+    </script>
+    """,
+    unsafe_allow_html=True
+)
+from summarization import MultiDocSummarizationAPI
+import fitz
+from docx import Document
+# Configure the page title right from the start
+# Hide the "Made with Streamlit" footer
+hide_streamlit_style = """
+    <style>
+        #MainMenu {visibility: hidden;}
+        footer {visibility: hidden;}
+    </style>
+"""
+st.markdown(hide_streamlit_style, unsafe_allow_html=True)
+st.image("./Logo_UET.png", width=150)
+def extract_text_from_pdf(uploaded_file):
+    pdf_text = ""
+    try:
+        with fitz.open(stream=uploaded_file.read(), filetype="pdf") as doc:
+            for page in doc:
+                pdf_text += page.get_text("text") + "\n"
+    except Exception as e:
+        st.error(f"Lỗi khi xử lý PDF: {e}")
+    return pdf_text
+def extract_text_from_docx(uploaded_file):
+    try:
+        doc = Document(uploaded_file)
+        return "\n".join([para.text for para in doc.paragraphs])
+    except Exception as e:
+        st.error(f"Lỗi khi xử lý DOCX: {e}")
+        return ""
+def add_text_area():
+    st.session_state.additional_texts.append("")
+def remove_text_area(index):
+    st.session_state.additional_texts.pop(index)
+if "show_summary" not in st.session_state:
+    st.session_state.show_summary = False
+if "additional_texts" not in st.session_state:
+    st.session_state.additional_texts = []
+st.markdown("<h1>Hệ thống tóm tắt đa văn bản tiếng Việt</h1>", unsafe_allow_html=True)
+col1, col2 = st.columns([3, 1])
+with col1:
+    st.markdown("### ✍️ Nhập văn bản")
+    input_method = st.radio("Phương thức nhập liệu:", ["Nhập văn bản", "Kéo thả tệp"], horizontal=True)
+    texts = []
+    if input_method == "Nhập văn bản":
+        if st.button("➕ Thêm vùng nhập văn bản"):
+            add_text_area()
+        for i, text in enumerate(st.session_state.additional_texts):
+            with st.expander(f"📌 Văn bản {i + 1}", expanded=True):
+                col_expander = st.columns([13, 0.5])
+                with col_expander[0]:
+                    updated_text = st.text_area("", text, height=200, key=f"text_{i}")
+                    st.session_state.additional_texts[i] = updated_text
+                with col_expander[1]:
+                    if st.button("🗑", key=f"delete_{i}", help="Xóa văn bản"):
+                        remove_text_area(i)
+                        st.experimental_rerun()
+            texts.append(st.session_state.additional_texts[i])
+    else:
+        uploaded_files = st.file_uploader(
+            "📂 Kéo thả tệp văn bản:",
+            type=["txt", "pdf", "docx"],
+            accept_multiple_files=True
+        )
+        if uploaded_files:
+            for uploaded_file in uploaded_files:
+                all_texts = ""
+                if uploaded_file.type == "text/plain":
+                    all_texts = uploaded_file.getvalue().decode("utf-8")
+                elif uploaded_file.type == "application/pdf":
+                    all_texts = extract_text_from_pdf(uploaded_file)
+                elif uploaded_file.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
+                    all_texts = extract_text_from_docx(uploaded_file)
+                texts.append(all_texts)
+    # st.markdown("### 🎯 Nhập tóm tắt mẫu")
+    # golden_ext = st.text_area("📑 Tóm tắt tóm lược", height=100)
+    # golden_abs = st.text_area("📝 Tóm tắt trích rút", height=100)
+with col2:
+    st.markdown("### ⚙️ Tuỳ chọn tóm tắt")
+    summary_method = st.selectbox("Chọn phương thức rút gọn:", ["Số câu", "Tỷ lệ"])
+    if summary_method == "Tỷ lệ":
+        compress_ratio = st.slider("🔽 Chọn tỷ lệ rút gọn:", 0, 50, 15, step=1, format="%d%%") / 100
+    else:
+        compress_ratio = st.number_input("🔢 Số câu đầu ra:", min_value=1, max_value=20, value=5, step=1)
+    if st.button("🚀 Tóm tắt") and any(texts):
+        summary_results = MultiDocSummarizationAPI(
+            texts, compress_ratio#, golden_ext=golden_ext or None, golden_abs=golden_abs or None
+        )
+        st.session_state.extractive_summary = summary_results.get("extractive_summ", "Không có kết quả")
+        st.session_state.abstractive_summary = summary_results.get("abstractive_summ", "Không có kết quả")
+        st.session_state.rouge_ext = summary_results.get("score_ext", ("None", "None", "None"))
+        st.session_state.rouge_abs = summary_results.get("score_abs", ("None", "None", "None"))
+        st.session_state.show_summary = True
+        st.experimental_rerun()
+if st.session_state.get("show_summary", False):
+    col_summary = st.columns(2)
+    rouge_ext = st.session_state.rouge_ext if st.session_state.rouge_ext is not None else ("None", "None", "None")
+    rouge_abs = st.session_state.rouge_abs if st.session_state.rouge_abs is not None else ("None", "None", "None")
+    with col_summary[0]:
+        st.markdown("### 📑 Tóm tắt tóm lược")
+        # st.markdown(f"**🔹 ROUGE 1:** {rouge_ext[0]}")
+        # st.markdown(f"**🔹 ROUGE 2:** {rouge_ext[1]}")
+        # st.markdown(f"**🔹 ROUGE L:** {rouge_ext[2]}")
+        st.text_area("📑 Tóm tắt trích lược:", st.session_state.extractive_summary, height=250)
+    # with col_summary[1]:
+    #     st.markdown("### 📝 Tóm tắt trích rút")
+    #     st.markdown(f"**🔹 ROUGE 1:** {rouge_abs[0]}")
+    #     st.markdown(f"**🔹 ROUGE 2:** {rouge_abs[1]}")
+    #     st.markdown(f"**🔹 ROUGE L:** {rouge_abs[2]}")
+    #     st.text_area("Văn bản tóm tắt trích rút:", st.session_state.abstractive_summary, height=250)
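In app.py, `compress_ratio` carries two different meanings: a float between 0.0 and 0.5 when the "Tỷ lệ" (ratio) slider is used, or an integer from 1 to 20 when "Số câu" (sentence count) is selected. The part of `MultiDocSummarizationAPI` that interprets this value is not visible in this diff, so the following is only a sketch of one way to disambiguate it. `resolve_sentence_budget` is a hypothetical helper, and rounding up is an assumption.

```python
import math


def resolve_sentence_budget(total_sentences, compress_ratio):
    """Hypothetical helper: turn the UI value into a number of sentences to keep.

    A float below 1 is treated as a fraction of the input sentences;
    anything else is treated as an absolute sentence count.
    """
    if total_sentences <= 0:
        return 0
    if isinstance(compress_ratio, float) and compress_ratio < 1:
        wanted = math.ceil(total_sentences * compress_ratio)
    else:
        wanted = int(compress_ratio)
    # Keep at least one sentence and never more than the input contains.
    return max(1, min(wanted, total_sentences))


print(resolve_sentence_budget(40, 0.25))
print(resolve_sentence_budget(40, 5))
```

The clamp to at least one sentence matters because the slider allows 0%, which would otherwise request an empty summary.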

c_25_0.3071.mdl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a3064b88582b598217f7c9a1f1dd307fc9b3838c5df5f00f97ffa15261d17e19
+size 41996419

dict_map.json ADDED

	@@ -0,0 +1 @@

+ {"\u00f2a": "o\u00e0", "\u00d2a": "O\u00e0", "\u00d2A": "O\u00c0", "\u00f3a": "o\u00e1", "\u00d3a": "O\u00e1", "\u00d3A": "O\u00c1", "\u1ecfa": "o\u1ea3", "\u1ecea": "O\u1ea3", "\u1eceA": "O\u1ea2", "\u00f5a": "o\u00e3", "\u00d5a": "O\u00e3", "\u00d5A": "O\u00c3", "\u1ecda": "o\u1ea1", "\u1ecca": "O\u1ea1", "\u1eccA": "O\u1ea0", "\u00f2e": "o\u00e8", "\u00d2e": "O\u00e8", "\u00d2E": "O\u00c8", "\u00f3e": "o\u00e9", "\u00d3e": "O\u00e9", "\u00d3E": "O\u00c9", "\u1ecfe": "o\u1ebb", "\u1ecee": "O\u1ebb", "\u1eceE": "O\u1eba", "\u00f5e": "o\u1ebd", "\u00d5e": "O\u1ebd", "\u00d5E": "O\u1ebc", "\u1ecde": "o\u1eb9", "\u1ecce": "O\u1eb9", "\u1eccE": "O\u1eb8", "\u00f9y": "u\u1ef3", "\u00d9y": "U\u1ef3", "\u00d9Y": "U\u1ef2", "\u00fay": "u\u00fd", "\u00day": "U\u00fd", "\u00daY": "U\u00dd", "\u1ee7y": "u\u1ef7", "\u1ee6y": "U\u1ef7", "\u1ee6Y": "U\u1ef6", "\u0169y": "u\u1ef9", "\u0168y": "U\u1ef9", "\u0168Y": "U\u1ef8", "\u1ee5y": "u\u1ef5", "\u1ee4y": "U\u1ef5", "\u1ee4Y": "U\u1ef4", "\u2026": "."}
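Each pair in dict_map.json moves a tone mark from the first vowel to the second in the clusters `oa`, `oe`, and `uy` (for example `òa` becomes `oà`), and the last pair maps an ellipsis character to a period. The code that consumes this file is not shown in the visible part of the diff, so the following is only a sketch of how such a map could be applied. `normalize_tone_marks` is a hypothetical name, and the input is assumed to use precomposed (NFC) characters.

```python
import json

# Small excerpt of dict_map.json; the full file holds more pairs.
DICT_MAP_JSON = r'{"\u00f2a": "o\u00e0", "\u00f3a": "o\u00e1", "\u00f9y": "u\u1ef3", "\u2026": "."}'


def normalize_tone_marks(text, mapping):
    """Apply every old -> new substring replacement from the mapping."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text


mapping = json.loads(DICT_MAP_JSON)
print(normalize_tone_marks("hòa bình…", mapping))
```

Plain substring replacement is enough here because no replacement output contains another key, so the order of the pairs does not matter.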

e_25_0.3071.mdl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3977998fae042e01c9b8d5c56303ace9cfa696cf016ea6fc924332f63cc85afc
+size 42000261

requirements.txt CHANGED

@@ -1,3 +1,11 @@
-altair
-pandas
-streamlit

+streamlit == 1.25.0
+torch
+torchvision
+torchaudio
+transformers
+sentencepiece
+numpy
+rouge
+underthesea
+pymupdf
+python-docx

summarization.py ADDED

	@@ -0,0 +1,616 @@

+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import pickle
+import numpy as np
+from rouge import Rouge
+import string
+import re
+from transformers import AutoModel, AutoModelForSeq2SeqLM, AutoTokenizer
+from underthesea import sent_tokenize, word_tokenize
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+abstract_tokenizer_path = "vinai/bartpho-syllable-base"
+abstract_model_path = "htg2501/checkpoint"
+extractive_model_path = "./e_25_0.3071.mdl"
+contrastive_model_path = "./c_25_0.3071.mdl"
+stopword_path = "./vietnamese-stopwords-dash.txt"
+LDA_model_path = "./LDA_models.pkl"
+phobert = AutoModel.from_pretrained("vinai/phobert-base-v2").to(device)
+phobert_tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
+model_summarization = AutoModelForSeq2SeqLM.from_pretrained(abstract_model_path).to(device)
+tokenizer_summarization = AutoTokenizer.from_pretrained(abstract_tokenizer_path)
+"""# Extractive model"""
+def getRouge2(ref, pred, kind):
+    try:
+        return round(Rouge().get_scores(pred.lower(), ref.lower())[0]['rouge-2'][kind], 4)
+    except ValueError:
+        return 0.0
+class MLP(nn.Module):
+    def __init__(self, dims: list, layers=2, act=nn.LeakyReLU(), dropout_p=0.1, keep_last_layer=False):
+        super(MLP, self).__init__()
+        assert len(dims) == layers + 1
+        self.layers = layers
+        self.act = act
+        self.dropout = nn.Dropout(dropout_p)
+        self.keep_last = keep_last_layer
+        self.mlp_layers = nn.ModuleList([])
+        for i in range(self.layers):
+            self.mlp_layers.append(nn.Linear(dims[i], dims[i + 1]))
+    def forward(self, x):
+        for i in range(len(self.mlp_layers) - 1):
+            x = self.dropout(self.act(self.mlp_layers[i](x)))
+        if self.keep_last:
+            x = self.mlp_layers[-1](x)
+        else:
+            x = self.act(self.mlp_layers[-1](x))
+        return x
+class GraphAttentionLayer(nn.Module):
+    def __init__(self, in_features: int, out_features: int, n_heads: int,
+                 is_concat: bool = True,
+                 dropout: float = 0.6,
+                 leaky_relu_negative_slope: float = 0.2):
+        super().__init__()
+        self.is_concat = is_concat
+        self.n_heads = n_heads
+        # Calculate the number of dimensions per head
+        if is_concat:
+            assert out_features % n_heads == 0
+            self.n_hidden = out_features // n_heads
+        else:
+            self.n_hidden = out_features
+        self.linear = nn.Linear(in_features, self.n_hidden * n_heads, bias=False)
+        self.attn = nn.Linear(self.n_hidden * 2, 1, bias=False)
+        self.activation = nn.LeakyReLU(negative_slope=leaky_relu_negative_slope)
+        self.softmax = nn.Softmax(dim=1)
+        self.dropout = nn.Dropout(dropout)
+    def forward(self, h: torch.Tensor, adj_mat: torch.Tensor, docnum, secnum):
+        n_nodes = h.shape[0]
+        g = self.linear(h).view(n_nodes, self.n_heads, self.n_hidden)
+        g_repeat = g.repeat(n_nodes, 1, 1)
+        g_repeat_interleave = g.repeat_interleave(n_nodes, dim=0)
+        g_concat = torch.cat([g_repeat_interleave, g_repeat], dim=-1)
+        g_concat = g_concat.view(n_nodes, n_nodes, self.n_heads, 2 * self.n_hidden)
+        e = self.activation(self.attn(g_concat))
+        e = e.squeeze(-1)
+        # The adjacency matrix should have shape
+        # `[n_nodes, n_nodes, n_heads]` or `[n_nodes, n_nodes, 1]`
+        assert adj_mat.shape[0] == 1 or adj_mat.shape[0] == n_nodes
+        assert adj_mat.shape[1] == 1 or adj_mat.shape[1] == n_nodes
+        assert adj_mat.shape[2] == 1 or adj_mat.shape[2] == self.n_heads
+        # Mask $e_{ij}$ based on adjacency matrix.
+        # $e_{ij}$ is set to $- \infty$ if there is no edge from $i$ to $j$.
+        e = e.masked_fill(adj_mat == 0, float(-1e9))
+        a = self.softmax(e)
+        a = self.dropout(a)
+        attn_res = torch.einsum('ijh,jhf->ihf', a, g)
+        # Concatenate the heads
+        if self.is_concat:
+            return attn_res.reshape(n_nodes, self.n_heads * self.n_hidden)
+        # Take the mean of the heads
+        else:
+            return attn_res.mean(dim=1)
+class GAT(nn.Module):
+    def __init__(self, in_features: int, n_hidden: int, n_classes: int, n_heads: int, dropout: float):
+        super().__init__()
+        self.layer1 = GraphAttentionLayer(in_features, n_hidden, n_heads, is_concat=True, dropout=dropout)
+        self.activation = nn.ELU()
+        self.output = GraphAttentionLayer(n_hidden, n_classes, 1, is_concat=False, dropout=dropout)
+        self.dropout = nn.Dropout(dropout)
+    def forward(self, x: torch.Tensor, adj_mat: torch.Tensor, docnum, secnum):
+        x = x.squeeze(0)
+        adj_mat = adj_mat.squeeze(0)
+        adj_x = adj_mat.clone().sum(dim=1, keepdim=True).repeat(1, x.shape[1]).bool()
+        adj_mat = adj_mat.unsqueeze(-1).bool()
+        x = self.dropout(x)
+        x = self.layer1(x, adj_mat, docnum, secnum)
+        x = self.activation(x)
+        x = self.dropout(x)
+        x = self.output(x, adj_mat, docnum, secnum).masked_fill(adj_x == 0, float(0))
+        return x.unsqueeze(0)
+class StepWiseGraphConvLayer(nn.Module):
+    def __init__(self, in_dim, hid_dim, dropout_p=0.1, act=nn.LeakyReLU(), nheads=6, iter=1, final="att"):
+        super().__init__()
+        self.act = act
+        self.dropout = nn.Dropout(dropout_p)
+        self.iter = iter
+        self.in_dim = in_dim
+        self.gat = nn.ModuleList([GAT(in_features=in_dim, n_hidden=hid_dim, n_classes=in_dim,
+                                      dropout=dropout_p, n_heads=nheads) for _ in range(iter)])
+        self.gat2 = nn.ModuleList([GAT(in_features=in_dim, n_hidden=hid_dim, n_classes=in_dim,
+                                       dropout=dropout_p, n_heads=nheads) for _ in range(iter)])
+        self.gat3 = nn.ModuleList([GAT(in_features=in_dim, n_hidden=hid_dim, n_classes=in_dim,
+                                       dropout=dropout_p, n_heads=nheads) for _ in range(iter)])
+        self.out_ffn = MLP([in_dim * 3, hid_dim, hid_dim, in_dim], layers=3, dropout_p=dropout_p)
+    def forward(self, feature, adj, docnum, secnum):
+        sen_adj = adj.clone()
+        sen_adj[:, -docnum - secnum - 1:, :] = sen_adj[:, :, -docnum - secnum - 1:] = 0
+        sec_adj = adj.clone()
+        sec_adj[:, :-docnum - secnum - 1, :] = sec_adj[:, -docnum - 1:, :] = sec_adj[:, :, -docnum - 1:] = 0
+        doc_adj = adj.clone()
+        doc_adj[:, :-docnum - 1, :] = 0
+        feature_sen = feature.clone()
+        feature_resi = feature
+        feature_sen_re = feature_sen.clone()
+        for i in range(0, self.iter):
+            feature_sen = self.gat[i](feature_sen, sen_adj, docnum, secnum)
+        feature_sen = F.layer_norm(feature_sen + feature_sen_re, [self.in_dim])
+        feature_sec = feature_sen.clone()
+        feature_sec_re = feature_sec.clone()
+        for i in range(0, self.iter):
+            feature_sec = self.gat2[i](feature_sec, sec_adj, docnum, secnum)
+        feature_sec = F.layer_norm(feature_sec + feature_sec_re, [self.in_dim])
+        feature_doc = feature_sec.clone()
+        feature_doc_re = feature_doc.clone()
+        for i in range(0, self.iter):
+            feature_doc = self.gat3[i](feature_doc, doc_adj, docnum, secnum)
+        feature_doc = F.layer_norm(feature_doc + feature_doc_re, [self.in_dim])
+        feature_sec[:, :-docnum - secnum - 1, :] = adj[:, :-docnum - secnum - 1,
+                                                   -docnum - secnum - 1:-docnum - 1] @ feature_sec[:,
+                                                                                       -docnum - secnum - 1:-docnum - 1,
+                                                                                       :]
+        feature_doc[:, -docnum - secnum - 1:-docnum - 1, :] = adj[:, -docnum - secnum - 1:-docnum - 1,
+                                                              -docnum - 1:] @ feature_doc[:, -docnum - 1:, :]
+        feature_doc[:, :-docnum - secnum - 1, :] = adj[:, :-docnum - secnum - 1,
+                                                   -docnum - secnum - 1:-docnum - 1] @ feature_doc[:,
+                                                                                       -docnum - secnum - 1:-docnum - 1,
+                                                                                       :]
+        feature = torch.concat([feature_doc, feature_sec, feature_sen], dim=-1)
+        feature = F.layer_norm(self.out_ffn(feature) + feature_resi, [self.in_dim])
+        return feature
+class Contrast_Encoder(nn.Module):
+    def __init__(self, input_dim, hidden_dim, heads, act=nn.LeakyReLU(0.1), dropout_p=0.1):
+        super(Contrast_Encoder, self).__init__()
+        self.graph_encoder = StepWiseGraphConvLayer(in_dim=input_dim, hid_dim=hidden_dim,
+                                                    dropout_p=dropout_p, act=act, nheads=heads, iter=1)
+        self.common_proj_mlp = MLP([input_dim, hidden_dim, input_dim], layers=2, dropout_p=dropout_p, act=act,
+                                   keep_last_layer=False)
+    def forward(self, p_gfeature, doc_lens, p_adj, docnum, secnum):
+        posVec = torch.cat(
+            [PositionVec[:l] for l in doc_lens] + [torch.zeros(secnum + docnum + 1, 768).float().to(device)], dim=0)
+        p_gfeature = p_gfeature + posVec.unsqueeze(0)
+        pg = self.graph_encoder(p_gfeature, p_adj, docnum, secnum)
+        pg = self.common_proj_mlp(pg)
+        return pg
+class End2End_Encoder(nn.Module):
+    def __init__(self, input_dim, hidden_dim, heads, act=nn.LeakyReLU(0.1), dropout_p=0.3):
+        super(End2End_Encoder, self).__init__()
+        self.graph_encoder = StepWiseGraphConvLayer(in_dim=input_dim, hid_dim=hidden_dim,
+                                                    dropout_p=dropout_p, act=act, nheads=heads, iter=1)
+        self.dropout = nn.Dropout(dropout_p)
+        self.out_proj_layer_mlp = MLP([input_dim, hidden_dim, input_dim], layers=2, dropout_p=dropout_p, act=act,
+                                      keep_last_layer=False)
+        self.linear = MLP([input_dim, 1], layers=1, dropout_p=dropout_p, act=act, keep_last_layer=True)
+    def forward(self, x, doc_lens, adj, docnum, secnum):
+        x = self.graph_encoder(x, adj, docnum, secnum)
+        x = self.out_proj_layer_mlp(x)
+        return self.linear(x)[:, :-docnum - secnum - 1, :]
+def _similarity(h1: torch.Tensor, h2: torch.Tensor):
+    h1 = F.normalize(h1)
+    h2 = F.normalize(h2)
+    return h1 @ h2.t()
+class InfoNCE(nn.Module):
+    def __init__(self, tau):
+        super(InfoNCE, self).__init__()
+        self.tau = tau
+    def forward(self, anchor, sample, pos_mask, *args, **kwargs):
+        sim = _similarity(anchor, sample) / self.tau
+        if len(anchor) > 1:
+            sim, _ = torch.max(sim, dim=0, keepdim=True)
+        exp_sim = torch.exp(sim)
+        loss = torch.log((exp_sim * pos_mask).sum(dim=1)) - torch.log(exp_sim.sum(dim=1))
+        return -loss.mean()
+class Cluster:
+    def __init__(self, sent_texts, sent_vecs, doc_lens, doc_sec_mask, sec_sen_mask):
+        assert len(sent_vecs) == len(sent_texts)
+        self.docnum = len(doc_sec_mask)
+        self.secnum = len(sec_sen_mask)
+        self.feature = torch.cat(
+            (torch.stack(sent_vecs, dim=0), torch.zeros((self.secnum + self.docnum + 1, sent_vecs[0].shape[0]))),
+            dim=0).to(device)
+        self.adj = torch.from_numpy(self.mask_to_adj(doc_sec_mask, sec_sen_mask)).float().to(device)
+        self.sent_text = np.array(sent_texts)
+        self.doc_lens = doc_lens
+        self.init_node_vec()
+        self.feature = self.feature.float()
+    def init_node_vec(self):
+        docnum, secnum = self.docnum, self.secnum
+        for i in range(-secnum - docnum - 1, -docnum - 1):
+            mask = self.adj[i].clone()
+            mask[-secnum - docnum - 1:] = 0
+            self.feature[i] = torch.mean(self.feature[mask.bool()], dim=0)
+        for i in range(-docnum - 1, -1):
+            mask = self.adj[i].clone()
+            mask[-docnum - 1:] = 0
+            self.feature[i] = torch.mean(self.feature[mask.bool()], dim=0)
+        self.feature[-1] = torch.mean(self.feature[-docnum - 1:-1], dim=0)
+    def mask_to_adj(self, doc_sec_mask, sec_sen_mask):
+        sen_num = sec_sen_mask.shape[1]
+        sec_num = sec_sen_mask.shape[0]
+        doc_num = doc_sec_mask.shape[0]
+        adj = np.zeros((sen_num + sec_num + doc_num + 1, sen_num + sec_num + doc_num + 1))
+        # section connection
+        adj[-sec_num - doc_num - 1:-doc_num - 1, 0:-sec_num - doc_num - 1] = sec_sen_mask
+        adj[0:-sec_num - doc_num - 1, -sec_num - doc_num - 1:-doc_num - 1] = sec_sen_mask.T
+        for i in range(0, doc_num):
+            doc_mask = doc_sec_mask[i]
+            doc_mask = doc_mask.reshape((1, len(doc_mask)))
+            adj[sen_num:-doc_num - 1, sen_num:-doc_num - 1] += doc_mask * doc_mask.T
+        # doc connection
+        adj[-doc_num - 1:-1, -sec_num - doc_num - 1:-doc_num - 1] = doc_sec_mask
+        adj[-sec_num - doc_num - 1:-doc_num - 1, -doc_num - 1:-1] = doc_sec_mask.T
+        adj[-doc_num - 1:, -doc_num - 1:] = 1
+        # build sentence connection
+        for i in range(0, sec_num):
+            sec_mask = sec_sen_mask[i]
+            sec_mask = sec_mask.reshape((1, len(sec_mask)))
+            adj[:sen_num, :sen_num] += sec_mask * sec_mask.T
+        return adj
+def meanTokenVecs(text):
+    sent = text.lower()
+    input_ids = torch.tensor([phobert_tokenizer.encode(sent)])
+    tokenized_text = phobert_tokenizer.tokenize(sent)
+    with torch.no_grad():
+        features = phobert(input_ids.to(device))
+    wordVecs, buffer, buffer_str = {}, [], ''
+    for token in zip(tokenized_text, features.last_hidden_state[0, 1:-1, :]):
+        if token[0][-2:] == '@@':
+            buffer.append(token[1])
+            buffer_str += token[0][:-2]
+            continue
+        if buffer:
+            buffer.append(token[1])
+            buffer_str += token[0]
+            wordVecs[buffer_str] = torch.mean(torch.stack(buffer), dim=0)
+            buffer, buffer_str = [], ''
+        else:
+            wordVecs[token[0]] = token[1]
+    return torch.mean(torch.stack([vec for w, vec in wordVecs.items() if w not in string.punctuation]), dim=0).to(
+        torch.device('cpu'))
+def getPositionEncoding(pos, d=768, n=10000):
+    P = np.zeros(d)
+    for i in np.arange(int(d / 2)):
+        denominator = np.power(n, 2 * i / d)
+        P[2 * i] = np.sin(pos / denominator)
+        P[2 * i + 1] = np.cos(pos / denominator)
+    return P
+def removeRedundant(text):
+    text = text.lower()
+    words = [w for w in text.split(' ') if w not in stop_w]
+    return ' '.join(words)
+def divideSection(doc_text, category='Giáo dục'):
+    sent_para, para_sec, sent_sec = {}, {}, {}
+    paras = [para for para in doc_text.split('\n') if para != '']
+    all_sents = []
+    # prepare sent_Para
+    sentcnt = 0
+    for i, para in enumerate(paras):
+        sents = [word_tokenize(sent, format="text") for sent in sent_tokenize(para) if sent != '' and len(sent) > 4]
+        all_sents.extend(sents)
+        for ii, sent in enumerate(sents):
+            sent_para[sentcnt + ii] = i
+            sent = removeRedundant(sent)
+        sentcnt += len(sents)
+    # prepare para_sec
+    paras = [removeRedundant(para) for para in paras]
+    tf, lda_model = cate_models[category]
+    X = tf.transform(paras)
+    lda_top = lda_model.transform(X)
+    for i, para_top in enumerate(lda_top):
+        para_sec[i] = para_top.argmax()
+    # output sent_sec
+    for k, v in sent_para.items():
+        sent_sec[k] = para_sec[v]
+    return sent_sec, all_sents
+def loadClusterData(docs_org, category='Giáo dục'):  # docs_org: list of text for each document
+    seclist, docs = {}, []
+    for d, doc in enumerate(docs_org):
+        seclist[d], sentTexts = divideSection(doc, category)
+        docs.append(sentTexts)
+    secnum = 0
+    for k, val_dict in seclist.items():
+        vals = set(val_dict.values())
+        for ki, vi in val_dict.items():
+            for i, v in enumerate(vals):
+                if vi == v:
+                    val_dict[ki] = i + secnum
+                    break
+        seclist[k] = val_dict
+        secnum += len(vals)
+    sents, sentVecs, secIDs, doc_lens = [], [], [], []
+    sentnum = sum([len(doc.values()) for doc in seclist.values()])
+    doc_sec_mask = np.zeros((len(docs), secnum))
+    sec_sen_mask = np.zeros((secnum, sentnum))
+    cursec, cursent = 0, 0
+    for d, doc in enumerate(docs):
+        doc_lens.append(len(doc))
+        doc_endsec = max(seclist[d].values())
+        doc_sec_mask[d][cursec:doc_endsec + 1] = 1
+        cursec = doc_endsec + 1
+        for s, sent in enumerate(doc):
+            sents.append(sent)
+            sentVecs.append(meanTokenVecs(sent))
+            sec_sen_mask[seclist[d][s], cursent] = 1
+            cursent += 1
+    return Cluster(sents, sentVecs, doc_lens, doc_sec_mask, sec_sen_mask)
+def val_e2e(data):
+    feature = data.feature.unsqueeze(0)
+    doc_lens = data.doc_lens
+    adj = data.adj.unsqueeze(0)
+    docnum = data.docnum
+    secnum = data.secnum
+    with torch.no_grad():
+        feature = c_model(feature, doc_lens, adj, docnum, secnum)
+        x = model(feature, doc_lens, adj, docnum, secnum)
+        scores = torch.sigmoid(x.squeeze(-1))
+    return scores, data.sent_text
+def normalize_text(text):
+    text = str(text).replace('_', ' ')
+    text = re.sub(r'\s+', ' ', text)
+    text = re.sub(r'\s+([.,;:?)/!?”])', r'\1', text)
+    text = re.sub(r'([\(“])\s+', r'\1', text)
+    return text
+def track_changes(old_words, new_words):
+    # Find the longest common subsequence (LCS) between the two word sequences
+    def get_lcs_matrix(words1, words2):
+        m, n = len(words1), len(words2)
+        dp = [[0] * (n + 1) for _ in range(m + 1)]
+        for i in range(1, m + 1):
+            for j in range(1, n + 1):
+                if words1[i - 1] == words2[j - 1]:
+                    dp[i][j] = dp[i - 1][j - 1] + 1
+                else:
+                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
+        return dp
+    def get_lcs(words1, words2, dp):
+        i, j = len(words1), len(words2)
+        lcs = []
+        while i > 0 and j > 0:
+            if words1[i - 1] == words2[j - 1]:
+                lcs.append((i - 1, j - 1))
+                i -= 1
+                j -= 1
+            elif dp[i - 1][j] > dp[i][j - 1]:
+                i -= 1
+            else:
+                j -= 1
+        return sorted(lcs)
+    # Find the changed segments at word level
+    dp_matrix = get_lcs_matrix(old_words, new_words)
+    lcs_positions = get_lcs(old_words, new_words, dp_matrix)
+    changes = []
+    old_pos = 0
+    new_pos = 0
+    # Process matching and non-matching segments
+    for old_idx, new_idx in lcs_positions:
+        # If there's a gap before this match, it's a change
+        if old_idx > old_pos or new_idx > new_pos:
+            changes.append((old_pos, old_idx, new_pos, new_idx))
+        # Move positions after the match
+        old_pos = old_idx + 1
+        new_pos = new_idx + 1
+    # Check if there's a change at the end
+    if old_pos < len(old_words) or new_pos < len(new_words):
+        changes.append((old_pos, len(old_words), new_pos, len(new_words)))
+    return changes
+class Abstractive_Summarization:
+    @staticmethod
+    def generateSummaryBySent(texts, batch=32):
+        model_summarization.eval()
+        predictions = []
+        with torch.no_grad():
+            for i in range(0, len(texts), batch):
+                batch_texts = texts[i:i + batch]
+                inputs = tokenizer_summarization(batch_texts, padding=True, max_length=1024, truncation=True,
+                                                 return_tensors='pt').to(device)
+                outputs = model_summarization.generate(**inputs, num_beams=5,
+                                                       early_stopping=True, no_repeat_ngram_size=3)
+                prediction = tokenizer_summarization.batch_decode(outputs, skip_special_tokens=True)
+                predictions.extend(prediction)
+        return predictions
+PositionVec = torch.stack([torch.from_numpy(getPositionEncoding(i, d=768)) for i in range(200)], dim=0).float().to(
+    device)
+stop_w = ['...']
+# with open(stopword_path, 'r', encoding='utf-8') as f:
+#     for w in f.readlines():
+#         stop_w.append(w.strip())
+stop_w.extend([c for c in '!"#$%&\'()*+,./:;<=>?@[\\]^`{|}~…“”’‘'])
+with open(LDA_model_path, mode='rb') as fp:
+    cate_models = pickle.load(fp)
+c_model = Contrast_Encoder(768, 1024, 4).to(device)
+model = End2End_Encoder(768, 1024, 4).to(device)
+model.load_state_dict(torch.load(extractive_model_path, map_location=device), strict=False)
+c_model.load_state_dict(torch.load(contrastive_model_path, map_location=device), strict=False)
+model.eval()
+c_model.eval()
+def get_summary(scores, sents, max_sent=5):
+    ranked_score_idxs = torch.argsort(scores[0], dim=0, descending=True)
+    sents = [s.replace('_', ' ') for s in sents]
+    summSentIDList = []
+    for i in ranked_score_idxs:
+        if len(summSentIDList) >= max_sent: break
+        s = sents[i]
+        replicated, delIDs = False, []
+        for chosedID in summSentIDList:
+            if getRouge2(s, sents[chosedID], 'p') >= 0.45:
+                delIDs.append(chosedID)
+            if getRouge2(sents[chosedID], s, 'p') >= 0.45:
+                replicated = True
+                break
+        if replicated: continue
+        for delID in delIDs:
+            del summSentIDList[summSentIDList.index(delID)]
+        summSentIDList.append(i)
+    summSentIDList = sorted(summSentIDList)
+    return [s for i, s in enumerate(sents) if i in summSentIDList]
+def MultiDocSummarizationAPI(texts, compress_ratio):
+    """
+    Summarizes a list of documents using both extractive and abstractive methods.
+    Parameters:
+    - texts (list of str): A list of document texts to be summarized.
+    - compress_ratio (float): A ratio or count determining the number of sentences in the summary.
+      If less than 1, it represents the fraction of the original sentences to include in the summary.
+      If 1 or greater, it represents the exact number of sentences to include in the summary.
+    Returns:
+    - dict: A dictionary containing:
+        - 'extractive_summ' (str): The extractive summary of the documents.
+        - 'abstractive_summ' (str): The abstractive summary of the documents.
+    """
+    assert compress_ratio > 0, "compress_ratio must be greater than 0."
+    docs = [text.strip() for text in texts]
+    data_tree = loadClusterData(docs)
+    scores, sents = val_e2e(data_tree)
+    output_sent_cnt = int(len(sents) * compress_ratio) if compress_ratio < 1 else int(compress_ratio)
+    print('Expected sentence count:', output_sent_cnt)
+    extractive_summ_sents = [normalize_text(sent) for sent in get_summary(scores, sents, max_sent=output_sent_cnt)]
+    extractive_summ = ' '.join(extractive_summ_sents)
+    abstractive_summ_sents = Abstractive_Summarization.generateSummaryBySent(extractive_summ_sents)
+    abstractive_summ_sents = [normalize_text(s) for s in abstractive_summ_sents]
+    final_sents = []
+    for ii, (ext, abs) in enumerate(zip(extractive_summ_sents, abstractive_summ_sents)):
+        if ii == 0:
+            final_sents.append(ext)
+            continue
+        abs_splits, ext_splits = word_tokenize(abs), word_tokenize(ext)
+        abs_splits_cop, ext_splits_cop = abs_splits.copy(), ext_splits.copy()
+        if len(abs_splits_cop):
+            abs_splits_cop[-1] = abs_splits[-1][:-1] if len(abs_splits[-1]) and abs_splits[-1][-1] == '.' else abs_splits[-1]
+        if len(ext_splits_cop):
+            ext_splits_cop[-1] = ext_splits[-1][:-1] if len(ext_splits[-1]) and ext_splits[-1][-1] == '.' else ext_splits[-1]
+        changes, abs_parts = track_changes(ext_splits_cop, abs_splits_cop), [(0, len(abs_splits))]
+        for start_old, end_old, start_new, end_new in changes:
+            old_part = ' '.join(ext_splits[start_old:end_old])
+            # Revert changes that introduce capitalized words (e.g. misspelled names) absent from the extractive sentence
+            revert, ignoreFirstSentWord = False, 1 if start_old == 0 else 0
+            old_names = {}
+            for w in ext_splits_cop[start_old + ignoreFirstSentWord:end_old]:
+                if len(w) == 0: continue
+                if 'A'<=w[0]<='Z' or w[0] in ['Ă', 'Â', 'Đ', 'Ê', 'Ô', 'Ơ', 'Ư']:
+                    if w in old_names:
+                        old_names[w] += 1
+                    else:
+                        old_names[w] = 1
+            for w in abs_splits_cop[start_new + ignoreFirstSentWord:end_new]:
+                if len(w) == 0: continue
+                if 'A'<=w[0]<='Z' or w[0] in ['Ă', 'Â', 'Đ', 'Ê', 'Ô', 'Ơ', 'Ư']:
+                    if w in old_names:
+                        old_names[w] -= 1
+                        if old_names[w] < 0:
+                            revert = True
+                            break
+                    else:
+                        revert = True
+                        break
+            if revert:
+                pop_part = abs_parts[-1]
+                abs_parts.pop()
+                abs_parts.extend([(pop_part[0], start_new), old_part, (end_new, pop_part[1])])
+                # print('\nOLD:', old_part, '\n', ' '.join(abs_splits[start_new:end_new]))
+                # print(ext, '\n', abs)
+        abs = ' '.join([part if isinstance(part, str) else ' '.join(abs_splits[part[0]:part[1]]) for part in abs_parts])
+        final_sents.append(normalize_text(abs))
+    abstract_summ = ' '.join(final_sents)
+    return {'extractive_summ': extractive_summ,
+            'abstractive_summ': abstract_summ}
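
Note on `track_changes` in summarization.py: it reports every non-matching run between the extractive and abstractive word lists as a span `(old_start, old_end, new_start, new_end)`, and `MultiDocSummarizationAPI` uses those spans to decide which rewrites to revert. The sketch below reproduces that span format with the standard library's `difflib.SequenceMatcher` rather than the hand-written LCS table from the commit. `changed_spans` is a hypothetical name that does not appear in the commit, and `SequenceMatcher` may align repeated words differently from a true LCS, so treat it as an illustration of the output shape, not a drop-in replacement.

```python
from difflib import SequenceMatcher


def changed_spans(old_words, new_words):
    """Return (old_start, old_end, new_start, new_end) for each non-matching run.

    Hypothetical helper: same span format as track_changes in the commit,
    but computed with difflib instead of an explicit LCS matrix.
    """
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    return [
        (i1, i2, j1, j2)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replace / delete / insert runs
    ]


# Illustrative sentences (not taken from the project data).
old = "the minister Lan spoke at the meeting".split()
new = "the minister Lam spoke in the meeting".split()

for old_start, old_end, new_start, new_end in changed_spans(old, new):
    # Each span maps a slice of the old sentence to a slice of the new one.
    print(old[old_start:old_end], "->", new[new_start:new_end])
```

In the commit, a span whose new side contains a capitalized word missing from the old side (here `Lam` replacing `Lan`) is the case that gets reverted to the extractive wording.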

vietnamese-stopwords-dash.txt ADDED

	@@ -0,0 +1,1942 @@

+a_lô
+a_ha
+ai
+ai_ai
+ai_nấy
+ai_đó
+alô
+amen
+anh
+anh_ấy
+ba
+ba_ba
+ba_bản
+ba_cùng
+ba_họ
+ba_ngày
+ba_ngôi
+ba_tăng
+bao_giờ
+bao_lâu
+bao_nhiêu
+bao_nả
+bay_biến
+biết
+biết_bao
+biết_bao_nhiêu
+biết_chắc
+biết_chừng_nào
+biết_mình
+biết_mấy
+biết_thế
+biết_trước
+biết_việc
+biết_đâu
+biết_đâu_chừng
+biết_đâu_đấy
+biết_được
+buổi
+buổi_làm
+buổi_mới
+buổi_ngày
+buổi_sớm
+bà
+bà_ấy
+bài
+bài_bác
+bài_bỏ
+bài_cái
+bác
+bán
+bán_cấp
+bán_dạ
+bán_thế
+bây_bẩy
+bây_chừ
+bây_giờ
+bây_nhiêu
+bèn
+béng
+bên
+bên_bị
+bên_có
+bên_cạnh
+bông
+bước
+bước_khỏi
+bước_tới
+bước_đi
+bạn
+bản
+bản_bộ
+bản_riêng
+bản_thân
+bản_ý
+bất_chợt
+bất_cứ
+bất_giác
+bất_kì
+bất_kể
+bất_kỳ
+bất_luận
+bất_ngờ
+bất_nhược
+bất_quá
+bất_quá_chỉ
+bất_thình_lình
+bất_tử
+bất_đồ
+bấy
+bấy_chầy
+bấy_chừ
+bấy_giờ
+bấy_lâu
+bấy_lâu_nay
+bấy_nay
+bấy_nhiêu
+bập_bà_bập_bõm
+bập_bõm
+bắt_đầu
+bắt_đầu_từ
+bằng
+bằng_cứ
+bằng_không
+bằng_người
+bằng_nhau
+bằng_như
+bằng_nào
+bằng_nấy
+bằng_vào
+bằng_được
+bằng_ấy
+bển
+bệt
+bị
+bị_chú
+bị_vì
+bỏ
+bỏ_bà
+bỏ_cha
+bỏ_cuộc
+bỏ_không
+bỏ_lại
+bỏ_mình
+bỏ_mất
+bỏ_mẹ
+bỏ_nhỏ
+bỏ_quá
+bỏ_ra
+bỏ_riêng
+bỏ_việc
+bỏ_xa
+bỗng
+bỗng_chốc
+bỗng_dưng
+bỗng_không
+bỗng_nhiên
+bỗng_nhưng
+bỗng_thấy
+bỗng_đâu
+bộ
+bộ_thuộc
+bộ_điều
+bội_phần
+bớ
+bởi
+bởi_ai
+bởi_chưng
+bởi_nhưng
+bởi_sao
+bởi_thế
+bởi_thế_cho_nên
+bởi_tại
+bởi_vì
+bởi_vậy
+bởi_đâu
+bức
+cao
+cao_lâu
+cao_ráo
+cao_răng
+cao_sang
+cao_số
+cao_thấp
+cao_thế
+cao_xa
+cha
+cha_chả
+chao_ôi
+chia_sẻ
+chiếc
+cho
+cho_biết
+cho_chắc
+cho_hay
+cho_nhau
+cho_nên
+cho_rằng
+cho_rồi
+cho_thấy
+cho_tin
+cho_tới
+cho_tới_khi
+cho_về
+cho_ăn
+cho_đang
+cho_được
+cho_đến
+cho_đến_khi
+cho_đến_nỗi
+choa
+chu_cha
+chui_cha
+chung
+chung_cho
+chung_chung
+chung_cuộc
+chung_cục
+chung_nhau
+chung_qui
+chung_quy
+chung_quy_lại
+chung_ái
+chuyển
+chuyển_tự
+chuyển_đạt
+chuyện
+chuẩn_bị
+chành_chạnh
+chí_chết
+chính
+chính_bản
+chính_giữa
+chính_là
+chính_thị
+chính_điểm
+chùn_chùn
+chùn_chũn
+chú
+chú_dẫn
+chú_khách
+chú_mày
+chú_mình
+chúng
+chúng_mình
+chúng_ta
+chúng_tôi
+chúng_ông
+chăn_chắn
+chăng
+chăng_chắc
+chăng_nữa
+chơi
+chơi_họ
+chưa
+chưa_bao_giờ
+chưa_chắc
+chưa_có
+chưa_cần
+chưa_dùng
+chưa_dễ
+chưa_kể
+chưa_tính
+chưa_từng
+chầm_chập
+chậc
+chắc
+chắc_chắn
+chắc_dạ
+chắc_hẳn
+chắc_lòng
+chắc_người
+chắc_vào
+chắc_ăn
+chẳng_lẽ
+chẳng_những
+chẳng_nữa
+chẳng_phải
+chết_nỗi
+chết_thật
+chết_tiệt
+chỉ
+chỉ_chính
+chỉ_có
+chỉ_là
+chỉ_tên
+chỉn
+chị
+chị_bộ
+chị_ấy
+chịu
+chịu_chưa
+chịu_lời
+chịu_tốt
+chịu_ăn
+chọn
+chọn_bên
+chọn_ra
+chốc_chốc
+chớ
+chớ_chi
+chớ_gì
+chớ_không
+chớ_kể
+chớ_như
+chợt
+chợt_nghe
+chợt_nhìn
+chủn
+chứ
+chứ_ai
+chứ_còn
+chứ_gì
+chứ_không
+chứ_không_phải
+chứ_lại
+chứ_lị
+chứ_như
+chứ_sao
+coi_bộ
+coi_mòi
+con
+con_con
+con_dạ
+con_nhà
+con_tính
+cu_cậu
+cuối
+cuối_cùng
+cuối_điểm
+cuốn
+cuộc
+càng
+càng_càng
+càng_hay
+cá_nhân
+các
+các_cậu
+cách
+cách_bức
+cách_không
+cách_nhau
+cách_đều
+cái
+cái_gì
+cái_họ
+cái_đã
+cái_đó
+cái_ấy
+câu_hỏi
+cây
+cây_nước
+còn
+còn_như
+còn_nữa
+còn_thời_gian
+còn_về
+có
+có_ai
+có_chuyện
+có_chăng
+có_chăng_là
+có_chứ
+có_cơ
+có_dễ
+có_họ
+có_khi
+có_ngày
+có_người
+có_nhiều
+có_nhà
+có_phải
+có_số
+có_tháng
+có_thế
+có_thể
+có_vẻ
+có_ý
+có_ăn
+có_điều
+có_điều_kiện
+có_đáng
+có_đâu
+có_được
+cóc_khô
+cô
+cô_mình
+cô_quả
+cô_tăng
+cô_ấy
+công_nhiên
+cùng
+cùng_chung
+cùng_cực
+cùng_nhau
+cùng_tuổi
+cùng_tột
+cùng_với
+cùng_ăn
+căn
+căn_cái
+căn_cắt
+căn_tính
+cũng
+cũng_như
+cũng_nên
+cũng_thế
+cũng_vậy
+cũng_vậy_thôi
+cũng_được
+cơ
+cơ_chỉ
+cơ_chừng
+cơ_cùng
+cơ_dẫn
+cơ_hồ
+cơ_hội
+cơ_mà
+cơn
+cả
+cả_nghe
+cả_nghĩ
+cả_ngày
+cả_người
+cả_nhà
+cả_năm
+cả_thảy
+cả_thể
+cả_tin
+cả_ăn
+cả_đến
+cảm_thấy
+cảm_ơn
+cấp
+cấp_số
+cấp_trực_tiếp
+cần
+cần_cấp
+cần_gì
+cần_số
+cật_lực
+cật_sức
+cậu
+cổ_lai
+cụ_thể
+cụ_thể_là
+cụ_thể_như
+của
+của_ngọt
+của_tin
+cứ
+cứ_như
+cứ_việc
+cứ_điểm
+cực_lực
+do
+do_vì
+do_vậy
+do_đó
+duy
+duy_chỉ
+duy_có
+dài
+dài_lời
+dài_ra
+dành
+dành_dành
+dào
+dì
+dù
+dù_cho
+dù_dì
+dù_gì
+dù_rằng
+dù_sao
+dùng
+dùng_cho
+dùng_hết
+dùng_làm
+dùng_đến
+dưới
+dưới_nước
+dạ
+dạ_bán
+dạ_con
+dạ_dài
+dạ_dạ
+dạ_khách
+dần_dà
+dần_dần
+dầu_sao
+dẫn
+dẫu
+dẫu_mà
+dẫu_rằng
+dẫu_sao
+dễ
+dễ_dùng
+dễ_gì
+dễ_khiến
+dễ_nghe
+dễ_ngươi
+dễ_như_chơi
+dễ_sợ
+dễ_sử_dụng
+dễ_thường
+dễ_thấy
+dễ_ăn
+dễ_đâu
+dở_chừng
+dữ
+dữ_cách
+em
+em_em
+giá_trị
+giá_trị_thực_tế
+giảm
+giảm_chính
+giảm_thấp
+giảm_thế
+giống
+giống_người
+giống_nhau
+giống_như
+giờ
+giờ_lâu
+giờ_này
+giờ_đi
+giờ_đây
+giờ_đến
+giữ
+giữ_lấy
+giữ_ý
+giữa
+giữa_lúc
+gây
+gây_cho
+gây_giống
+gây_ra
+gây_thêm
+gì
+gì_gì
+gì_đó
+gần
+gần_bên
+gần_hết
+gần_ngày
+gần_như
+gần_xa
+gần_đây
+gần_đến
+gặp
+gặp_khó_khăn
+gặp_phải
+gồm
+hay
+hay_biết
+hay_hay
+hay_không
+hay_là
+hay_làm
+hay_nhỉ
+hay_nói
+hay_sao
+hay_tin
+hay_đâu
+hiểu
+hiện_nay
+hiện_tại
+hoàn_toàn
+hoặc
+hoặc_là
+hãy
+hãy_còn
+hơn
+hơn_cả
+hơn_hết
+hơn_là
+hơn_nữa
+hơn_trước
+hầu_hết
+hết
+hết_chuyện
+hết_cả
+hết_của
+hết_nói
+hết_ráo
+hết_rồi
+hết_ý
+họ
+họ_gần
+họ_xa
+hỏi
+hỏi_lại
+hỏi_xem
+hỏi_xin
+hỗ_trợ
+khi
+khi_khác
+khi_không
+khi_nào
+khi_nên
+khi_trước
+khiến
+khoảng
+khoảng_cách
+khoảng_không
+khá
+khá_tốt
+khác
+khác_gì
+khác_khác
+khác_nhau
+khác_nào
+khác_thường
+khác_xa
+khách
+khó
+khó_biết
+khó_chơi
+khó_khăn
+khó_làm
+khó_mở
+khó_nghe
+khó_nghĩ
+khó_nói
+khó_thấy
+khó_tránh
+không
+không_ai
+không_bao_giờ
+không_bao_lâu
+không_biết
+không_bán
+không_chỉ
+không_còn
+không_có
+không_có_gì
+không_cùng
+không_cần
+không_cứ
+không_dùng
+không_gì
+không_hay
+không_khỏi
+không_kể
+không_ngoài
+không_nhận
+không_những
+không_phải
+không_phải_không
+không_thể
+không_tính
+không_điều_kiện
+không_được
+không_đầy
+không_để
+khẳng_định
+khỏi
+khỏi_nói
+kể
+kể_cả
+kể_như
+kể_tới
+kể_từ
+liên_quan
+loại
+loại_từ
+luôn
+luôn_cả
+luôn_luôn
+luôn_tay
+là
+là_cùng
+là_là
+là_nhiều
+là_phải
+là_thế_nào
+là_vì
+là_ít
+làm
+làm_bằng
+làm_cho
+làm_dần_dần
+làm_gì
+làm_lòng
+làm_lại
+làm_lấy
+làm_mất
+làm_ngay
+làm_như
+làm_nên
+làm_ra
+làm_riêng
+làm_sao
+làm_theo
+làm_thế_nào
+làm_tin
+làm_tôi
+làm_tăng
+làm_tại
+làm_tắp_lự
+làm_vì
+làm_đúng
+làm_được
+lâu
+lâu_các
+lâu_lâu
+lâu_nay
+lâu_ngày
+lên
+lên_cao
+lên_cơn
+lên_mạnh
+lên_ngôi
+lên_nước
+lên_số
+lên_xuống
+lên_đến
+lòng
+lòng_không
+lúc
+lúc_khác
+lúc_lâu
+lúc_nào
+lúc_này
+lúc_sáng
+lúc_trước
+lúc_đi
+lúc_đó
+lúc_đến
+lúc_ấy
+lý_do
+lượng
+lượng_cả
+lượng_số
+lượng_từ
+lại
+lại_bộ
+lại_cái
+lại_còn
+lại_giống
+lại_làm
+lại_người
+lại_nói
+lại_nữa
+lại_quả
+lại_thôi
+lại_ăn
+lại_đây
+lấy
+lấy_có
+lấy_cả
+lấy_giống
+lấy_làm
+lấy_lý_do
+lấy_lại
+lấy_ra
+lấy_ráo
+lấy_sau
+lấy_số
+lấy_thêm
+lấy_thế
+lấy_vào
+lấy_xuống
+lấy_được
+lấy_để
+lần
+lần_khác
+lần_lần
+lần_nào
+lần_này
+lần_sang
+lần_sau
+lần_theo
+lần_trước
+lần_tìm
+lớn
+lớn_lên
+lớn_nhỏ
+lời
+lời_chú
+lời_nói
+mang
+mang_lại
+mang_mang
+mang_nặng
+mang_về
+muốn
+mà
+mà_cả
+mà_không
+mà_lại
+mà_thôi
+mà_vẫn
+mình
+mạnh
+mất
+mất_còn
+mọi
+mọi_giờ
+mọi_khi
+mọi_lúc
+mọi_người
+mọi_nơi
+mọi_sự
+mọi_thứ
+mọi_việc
+mối
+mỗi
+mỗi_lúc
+mỗi_lần
+mỗi_một
+mỗi_ngày
+mỗi_người
+một
+một_cách
+một_cơn
+một_khi
+một_lúc
+một_số
+một_vài
+một_ít
+mới
+mới_hay
+mới_rồi
+mới_đây
+mở
+mở_mang
+mở_nước
+mở_ra
+mợ
+mức
+nay
+ngay
+ngay_bây_giờ
+ngay_cả
+ngay_khi
+ngay_khi_đến
+ngay_lúc
+ngay_lúc_này
+ngay_lập_tức
+ngay_thật
+ngay_tức_khắc
+ngay_tức_thì
+ngay_từ
+nghe
+nghe_chừng
+nghe_hiểu
+nghe_không
+nghe_lại
+nghe_nhìn
+nghe_như
+nghe_nói
+nghe_ra
+nghe_rõ
+nghe_thấy
+nghe_tin
+nghe_trực_tiếp
+nghe_đâu
+nghe_đâu_như
+nghe_được
+nghen
+nghiễm_nhiên
+nghĩ
+nghĩ_lại
+nghĩ_ra
+nghĩ_tới
+nghĩ_xa
+nghĩ_đến
+nghỉm
+ngoài
+ngoài_này
+ngoài_ra
+ngoài_xa
+ngoải
+nguồn
+ngày
+ngày_càng
+ngày_cấp
+ngày_giờ
+ngày_ngày
+ngày_nào
+ngày_này
+ngày_nọ
+ngày_qua
+ngày_rày
+ngày_tháng
+ngày_xưa
+ngày_xửa
+ngày_đến
+ngày_ấy
+ngôi
+ngôi_nhà
+ngôi_thứ
+ngõ_hầu
+ngăn_ngắt
+ngươi
+người
+người_hỏi
+người_khác
+người_khách
+người_mình
+người_nghe
+người_người
+người_nhận
+ngọn
+ngọn_nguồn
+ngọt
+ngồi
+ngồi_bệt
+ngồi_không
+ngồi_sau
+ngồi_trệt
+ngộ_nhỡ
+nhanh
+nhanh_lên
+nhanh_tay
+nhau
+nhiên_hậu
+nhiều
+nhiều_ít
+nhiệt_liệt
+nhung_nhăng
+nhà
+nhà_chung
+nhà_khó
+nhà_làm
+nhà_ngoài
+nhà_ngươi
+nhà_tôi
+nhà_việc
+nhân_dịp
+nhân_tiện
+nhé
+nhìn
+nhìn_chung
+nhìn_lại
+nhìn_nhận
+nhìn_theo
+nhìn_thấy
+nhìn_xuống
+nhóm
+nhón_nhén
+như
+như_ai
+như_chơi
+như_không
+như_là
+như_nhau
+như_quả
+như_sau
+như_thường
+như_thế
+như_thế_nào
+như_thể
+như_trên
+như_trước
+như_tuồng
+như_vậy
+như_ý
+nhưng
+nhưng_mà
+nhược_bằng
+nhất
+nhất_loạt
+nhất_luật
+nhất_là
+nhất_mực
+nhất_nhất
+nhất_quyết
+nhất_sinh
+nhất_thiết
+nhất_thì
+nhất_tâm
+nhất_tề
+nhất_đán
+nhất_định
+nhận
+nhận_biết
+nhận_họ
+nhận_làm
+nhận_nhau
+nhận_ra
+nhận_thấy
+nhận_việc
+nhận_được
+nhằm
+nhằm_khi
+nhằm_lúc
+nhằm_vào
+nhằm_để
+nhỉ
+nhỏ
+nhỏ_người
+nhớ
+nhớ_bập_bõm
+nhớ_lại
+nhớ_lấy
+nhớ_ra
+nhờ
+nhờ_chuyển
+nhờ_có
+nhờ_nhờ
+nhờ_đó
+nhỡ_ra
+những
+những_ai
+những_khi
+những_là
+những_lúc
+những_muốn
+những_như
+nào
+nào_cũng
+nào_hay
+nào_là
+nào_phải
+nào_đâu
+nào_đó
+này
+này_nọ
+nên
+nên_chi
+nên_chăng
+nên_làm
+nên_người
+nên_tránh
+nó
+nóc
+nói
+nói_bông
+nói_chung
+nói_khó
+nói_là
+nói_lên
+nói_lại
+nói_nhỏ
+nói_phải
+nói_qua
+nói_ra
+nói_riêng
+nói_rõ
+nói_thêm
+nói_thật
+nói_toẹt
+nói_trước
+nói_tốt
+nói_với
+nói_xa
+nói_ý
+nói_đến
+nói_đủ
+năm
+năm_tháng
+nơi
+nơi_nơi
+nước
+nước_bài
+nước_cùng
+nước_lên
+nước_nặng
+nước_quả
+nước_xuống
+nước_ăn
+nước_đến
+nấy
+nặng
+nặng_căn
+nặng_mình
+nặng_về
+nếu
+nếu_có
+nếu_cần
+nếu_không
+nếu_mà
+nếu_như
+nếu_thế
+nếu_vậy
+nếu_được
+nền
+nọ
+nớ
+nức_nở
+nữa
+nữa_khi
+nữa_là
+nữa_rồi
+oai_oái
+oái
+pho
+phè
+phè_phè
+phía
+phía_bên
+phía_bạn
+phía_dưới
+phía_sau
+phía_trong
+phía_trên
+phía_trước
+phóc
+phót
+phù_hợp
+phăn_phắt
+phương_chi
+phải
+phải_biết
+phải_chi
+phải_chăng
+phải_cách
+phải_cái
+phải_giờ
+phải_khi
+phải_không
+phải_lại
+phải_lời
+phải_người
+phải_như
+phải_rồi
+phải_tay
+phần
+phần_lớn
+phần_nhiều
+phần_nào
+phần_sau
+phần_việc
+phắt
+phỉ_phui
+phỏng
+phỏng_như
+phỏng_nước
+phỏng_theo
+phỏng_tính
+phốc
+phụt
+phứt
+qua
+qua_chuyện
+qua_khỏi
+qua_lại
+qua_lần
+qua_ngày
+qua_tay
+qua_thì
+qua_đi
+quan_trọng
+quan_trọng_vấn_đề
+quan_tâm
+quay
+quay_bước
+quay_lại
+quay_số
+quay_đi
+quá
+quá_bán
+quá_bộ
+quá_giờ
+quá_lời
+quá_mức
+quá_nhiều
+quá_tay
+quá_thì
+quá_tin
+quá_trình
+quá_tuổi
+quá_đáng
+quá_ư
+quả
+quả_là
+quả_thật
+quả_thế
+quả_vậy
+quận
+ra
+ra_bài
+ra_bộ
+ra_chơi
+ra_gì
+ra_lại
+ra_lời
+ra_ngôi
+ra_người
+ra_sao
+ra_tay
+ra_vào
+ra_ý
+ra_điều
+ra_đây
+ren_rén
+riu_ríu
+riêng
+riêng_từng
+riệt
+rày
+ráo
+ráo_cả
+ráo_nước
+ráo_trọi
+rén
+rén_bước
+rích
+rón_rén
+rõ
+rõ_là
+rõ_thật
+rút_cục
+răng
+răng_răng
+rất
+rất_lâu
+rằng
+rằng_là
+rốt_cuộc
+rốt_cục
+rồi
+rồi_nữa
+rồi_ra
+rồi_sao
+rồi_sau
+rồi_tay
+rồi_thì
+rồi_xem
+rồi_đây
+rứa
+sa_sả
+sang
+sang_năm
+sang_sáng
+sang_tay
+sao
+sao_bản
+sao_bằng
+sao_cho
+sao_vậy
+sao_đang
+sau
+sau_chót
+sau_cuối
+sau_cùng
+sau_hết
+sau_này
+sau_nữa
+sau_sau
+sau_đây
+sau_đó
+so
+so_với
+song_le
+suýt
+suýt_nữa
+sáng
+sáng_ngày
+sáng_rõ
+sáng_thế
+sáng_ý
+sì
+sì_sì
+sất
+sắp
+sắp_đặt
+sẽ
+sẽ_biết
+sẽ_hay
+số
+số_cho_biết
+số_cụ_thể
+số_loại
+số_là
+số_người
+số_phần
+số_thiếu
+sốt_sột
+sớm
+sớm_ngày
+sở_dĩ
+sử_dụng
+sự
+sự_thế
+sự_việc
+tanh
+tanh_tanh
+tay
+tay_quay
+tha_hồ
+tha_hồ_chơi
+tha_hồ_ăn
+than_ôi
+thanh
+thanh_ba
+thanh_chuyển
+thanh_không
+thanh_thanh
+thanh_tính
+thanh_điều_kiện
+thanh_điểm
+thay_đổi
+thay_đổi_tình_trạng
+theo
+theo_bước
+theo_như
+theo_tin
+thi_thoảng
+thiếu
+thiếu_gì
+thiếu_điểm
+thoạt
+thoạt_nghe
+thoạt_nhiên
+thoắt
+thuần
+thuần_ái
+thuộc
+thuộc_bài
+thuộc_cách
+thuộc_lại
+thuộc_từ
+thà
+thà_là
+thà_rằng
+thành_ra
+thành_thử
+thái_quá
+tháng
+tháng_ngày
+tháng_năm
+tháng_tháng
+thêm
+thêm_chuyện
+thêm_giờ
+thêm_vào
+thì
+thì_giờ
+thì_là
+thì_phải
+thì_ra
+thì_thôi
+thình_lình
+thích
+thích_cứ
+thích_thuộc
+thích_tự
+thích_ý
+thím
+thôi
+thôi_việc
+thúng_thắng
+thương_ôi
+thường
+thường_bị
+thường_hay
+thường_khi
+thường_số
+thường_sự
+thường_thôi
+thường_thường
+thường_tính
+thường_tại
+thường_xuất_hiện
+thường_đến
+thảo_hèn
+thảo_nào
+thấp
+thấp_cơ
+thấp_thỏm
+thấp_xuống
+thấy
+thấy_tháng
+thẩy
+thậm
+thậm_chí
+thậm_cấp
+thậm_từ
+thật
+thật_chắc
+thật_là
+thật_lực
+thật_quả
+thật_ra
+thật_sự
+thật_thà
+thật_tốt
+thật_vậy
+thế
+thế_chuẩn_bị
+thế_là
+thế_lại
+thế_mà
+thế_nào
+thế_nên
+thế_ra
+thế_sự
+thế_thì
+thế_thôi
+thế_thường
+thế_thế
+thế_à
+thế_đó
+thếch
+thỉnh_thoảng
+thỏm
+thốc
+thốc_tháo
+thốt
+thốt_nhiên
+thốt_nói
+thốt_thôi
+thộc
+thời_gian
+thời_gian_sử_dụng
+thời_gian_tính
+thời_điểm
+thục_mạng
+thứ
+thứ_bản
+thứ_đến
+thửa
+thực_hiện
+thực_hiện_đúng
+thực_ra
+thực_sự
+thực_tế
+thực_vậy
+tin
+tin_thêm
+tin_vào
+tiếp_theo
+tiếp_tục
+tiếp_đó
+tiện_thể
+toà
+toé_khói
+toẹt
+trong
+trong_khi
+trong_lúc
+trong_mình
+trong_ngoài
+trong_này
+trong_số
+trong_vùng
+trong_đó
+trong_ấy
+tránh
+tránh_khỏi
+tránh_ra
+tránh_tình_trạng
+tránh_xa
+trên
+trên_bộ
+trên_dưới
+trước
+trước_hết
+trước_khi
+trước_kia
+trước_nay
+trước_ngày
+trước_nhất
+trước_sau
+trước_tiên
+trước_tuổi
+trước_đây
+trước_đó
+trả
+trả_của
+trả_lại
+trả_ngay
+trả_trước
+trếu_tráo
+trển
+trệt
+trệu_trạo
+trỏng
+trời_đất_ơi
+trở_thành
+trừ_phi
+trực_tiếp
+trực_tiếp_làm
+tuy
+tuy_có
+tuy_là
+tuy_nhiên
+tuy_rằng
+tuy_thế
+tuy_vậy
+tuy_đã
+tuyệt_nhiên
+tuần_tự
+tuốt_luốt
+tuốt_tuồn_tuột
+tuốt_tuột
+tuổi
+tuổi_cả
+tuổi_tôi
+tà_tà
+tên
+tên_chính
+tên_cái
+tên_họ
+tên_tự
+tênh
+tênh_tênh
+tìm
+tìm_bạn
+tìm_cách
+tìm_hiểu
+tìm_ra
+tìm_việc
+tình_trạng
+tính
+tính_cách
+tính_căn
+tính_người
+tính_phỏng
+tính_từ
+tít_mù
+tò_te
+tôi
+tôi_con
+tông_tốc
+tù_tì
+tăm_tắp
+tăng
+tăng_chúng
+tăng_cấp
+tăng_giảm
+tăng_thêm
+tăng_thế
+tại
+tại_lòng
+tại_nơi
+tại_sao
+tại_tôi
+tại_vì
+tại_đâu
+tại_đây
+tại_đó
+tạo
+tạo_cơ_hội
+tạo_nên
+tạo_ra
+tạo_ý
+tạo_điều_kiện
+tấm
+tấm_bản
+tấm_các
+tấn
+tấn_tới
+tất_cả
+tất_cả_bao_nhiêu
+tất_thảy
+tất_tần_tật
+tất_tật
+tập_trung
+tắp
+tắp_lự
+tắp_tắp
+tọt
+tỏ_ra
+tỏ_vẻ
+tốc_tả
+tối_ư
+tốt
+tốt_bạn
+tốt_bộ
+tốt_hơn
+tốt_mối
+tốt_ngày
+tột
+tột_cùng
+tớ
+tới
+tới_gần
+tới_mức
+tới_nơi
+tới_thì
+tức_thì
+tức_tốc
+từ
+từ_căn
+từ_giờ
+từ_khi
+từ_loại
+từ_nay
+từ_thế
+từ_tính
+từ_tại
+từ_từ
+từ_ái
+từ_điều
+từ_đó
+từ_ấy
+từng
+từng_cái
+từng_giờ
+từng_nhà
+từng_phần
+từng_thời_gian
+từng_đơn_vị
+từng_ấy
+tự
+tự_cao
+tự_khi
+tự_lượng
+tự_tính
+tự_tạo
+tự_vì
+tự_ý
+tự_ăn
+tựu_trung
+veo
+veo_veo
+việc
+việc_gì
+vung_thiên_địa
+vung_tàn_tán
+vung_tán_tàn
+và
+vài
+vài_ba
+vài_người
+vài_nhà
+vài_nơi
+vài_tên
+vài_điều
+vào
+vào_gặp
+vào_khoảng
+vào_lúc
+vào_vùng
+vào_đến
+vâng
+vâng_chịu
+vâng_dạ
+vâng_vâng
+vâng_ý
+vèo
+vèo_vèo
+vì
+vì_chưng
+vì_rằng
+vì_sao
+vì_thế
+vì_vậy
+ví_bằng
+ví_dù
+ví_phỏng
+ví_thử
+vô_hình_trung
+vô_kể
+vô_luận
+vô_vàn
+vùng
+vùng_lên
+vùng_nước
+văng_tê
+vượt
+vượt_khỏi
+vượt_quá
+vạn_nhất
+vả_chăng
+vả_lại
+vấn_đề
+vấn_đề_quan_trọng
+vẫn
+vẫn_thế
+vậy
+vậy_là
+vậy_mà
+vậy_nên
+vậy_ra
+vậy_thì
+vậy_ư
+về
+về_không
+về_nước
+về_phần
+về_sau
+về_tay
+vị_trí
+vị_tất
+vốn_dĩ
+với
+với_lại
+với_nhau
+vở
+vụt
+vừa
+vừa_khi
+vừa_lúc
+vừa_mới
+vừa_qua
+vừa_rồi
+vừa_vừa
+xa
+xa_cách
+xa_gần
+xa_nhà
+xa_tanh
+xa_tắp
+xa_xa
+xa_xả
+xem
+xem_lại
+xem_ra
+xem_số
+xin
+xin_gặp
+xin_vâng
+xiết_bao
+xon_xón
+xoành_xoạch
+xoét
+xoẳn
+xoẹt
+xuất_hiện
+xuất_kì_bất_ý
+xuất_kỳ_bất_ý
+xuể
+xuống
+xăm_xúi
+xăm_xăm
+xăm_xắm
+xảy_ra
+xềnh_xệch
+xệp
+xử_lý
+yêu_cầu
+à
+à_này
+à_ơi
+ào
+ào_vào
+ào_ào
+á
+á_à
+ái
+ái_chà
+ái_dà
+áng
+áng_như
+âu_là
+ít
+ít_biết
+ít_có
+ít_hơn
+ít_khi
+ít_lâu
+ít_nhiều
+ít_nhất
+ít_nữa
+ít_quá
+ít_ra
+ít_thôi
+ít_thấy
+ô_hay
+ô_hô
+ô_kê
+ô_kìa
+ôi_chao
+ôi_thôi
+ông
+ông_nhỏ
+ông_tạo
+ông_từ
+ông_ấy
+ông_ổng
+úi
+úi_chà
+úi_dào
+ý
+ý_chừng
+ý_da
+ý_hoặc
+ăn
+ăn_chung
+ăn_chắc
+ăn_chịu
+ăn_cuộc
+ăn_hết
+ăn_hỏi
+ăn_làm
+ăn_người
+ăn_ngồi
+ăn_quá
+ăn_riêng
+ăn_sáng
+ăn_tay
+ăn_trên
+ăn_về
+đang
+đang_tay
+đang_thì
+điều
+điều_gì
+điều_kiện
+điểm
+điểm_chính
+điểm_gặp
+điểm_đầu_tiên
+đành_đạch
+đáng
+đáng_kể
+đáng_lí
+đáng_lý
+đáng_lẽ
+đáng_số
+đánh_giá
+đánh_đùng
+đáo_để
+đâu
+đâu_có
+đâu_cũng
+đâu_như
+đâu_nào
+đâu_phải
+đâu_đâu
+đâu_đây
+đâu_đó
+đây
+đây_này
+đây_rồi
+đây_đó
+đã
+đã_hay
+đã_không
+đã_là
+đã_lâu
+đã_thế
+đã_vậy
+đã_đủ
+đó
+đó_đây
+đúng
+đúng_ngày
+đúng_ra
+đúng_tuổi
+đúng_với
+đơn_vị
+đưa
+đưa_cho
+đưa_chuyện
+đưa_em
+đưa_ra
+đưa_tay
+đưa_tin
+đưa_tới
+đưa_vào
+đưa_về
+đưa_xuống
+đưa_đến
+được
+được_cái
+được_lời
+được_nước
+được_tin
+đại_loại
+đại_nhân
+đại_phàm
+đại_để
+đạt
+đảm_bảo
+đầu_tiên
+đầy
+đầy_năm
+đầy_phè
+đầy_tuổi
+đặc_biệt
+đặt
+đặt_làm
+đặt_mình
+đặt_mức
+đặt_ra
+đặt_trước
+đặt_để
+đến
+đến_bao_giờ
+đến_cùng
+đến_cùng_cực
+đến_cả
+đến_giờ
+đến_gần
+đến_hay
+đến_khi
+đến_lúc
+đến_lời
+đến_nay
+đến_ngày
+đến_nơi
+đến_nỗi
+đến_thì
+đến_thế
+đến_tuổi
+đến_xem
+đến_điều
+đến_đâu
+đều
+đều_bước
+đều_nhau
+đều_đều
+để
+để_cho
+để_giống
+để_không
+để_lòng
+để_lại
+để_mà
+để_phần
+để_được
+để_đến_nỗi
+đối_với
+đồng_thời
+đủ
+đủ_dùng
+đủ_nơi
+đủ_số
+đủ_điều
+đủ_điểm
+ơ
+ơ_hay
+ơ_kìa
+ơi
+ơi_là
+ư
+ạ
+ạ_ơi
+ấy
+ấy_là
+ầu_ơ
+ắt
+ắt_hẳn
+ắt_là
+ắt_phải
+ắt_thật
+ối_dào
+ối_giời
+ối_giời_ơi
+ồ
+ồ_ồ
+ổng
+ớ
+ớ_này
+ờ
+ờ_ờ
+ở
+ở_lại
+ở_như
+ở_nhờ
+ở_năm
+ở_trên
+ở_vào
+ở_đây
+ở_đó
+ở_được
+ủa
+ứ_hự
+ứ_ừ
+ừ
+ừ_nhé
+ừ_thì
+ừ_ào
+ừ_ừ
+ử
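
Note on using this file: it holds one stop word per line, with the syllables of multi-syllable words joined by `_`, the same convention summarization.py undoes with `replace('_', ' ')`. The loader in summarization.py that would read it (via `stopword_path`) is commented out in this commit, so the sketch below is only an assumption about how the list could be consumed. `parse_stopwords`, `load_stopwords`, and `remove_stopwords` are hypothetical names, not part of the commit.

```python
from pathlib import Path


def parse_stopwords(lines):
    """Build a set from an iterable of lines, one stop word per line."""
    return {line.strip() for line in lines if line.strip()}


def load_stopwords(path):
    """Read a UTF-8 stop-word file such as vietnamese-stopwords-dash.txt."""
    return parse_stopwords(Path(path).read_text(encoding="utf-8").splitlines())


def remove_stopwords(tokens, stopwords):
    """Drop tokens found in the stop-word set (tokens use '_' between syllables)."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


# Small inline sample so the example runs without the real file.
sample = parse_stopwords(["chúng_tôi\n", "và\n", "\n", "đánh_giá\n"])
tokens = ["chúng_tôi", "phát_triển", "mô_hình", "và", "đánh_giá"]
print(remove_stopwords(tokens, sample))
```

A set is used so membership checks stay constant-time even with the 1,942 entries in the full list.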