# ViSoNorm: Vietnamese Text Normalization Model

ViSoNorm is a state-of-the-art Vietnamese text normalization model that converts informal, non-standard Vietnamese text into standard Vietnamese. The model uses a multi-task learning approach with NSW (Non-Standard Word) detection, mask prediction, and lexical normalization heads.

## Model Architecture

- **Base Model**: ViSoBERT (Vietnamese Social Media BERT)
- **Multi-task Heads**:
  - NSW Detection: Identifies tokens that need normalization
  - Mask Prediction: Determines how many masks to add for multi-token expansions
  - Lexical Normalization: Predicts normalized tokens
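
The sketch below illustrates how these three heads can sit on top of the shared encoder. It is a minimal illustration of the layout, not the released implementation: the class and attribute names, the `max_masks` limit, and the layer sizes are assumptions (768-dimensional hidden states for a base-size encoder; the vocabulary size is a placeholder).

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Illustrative sketch of the three task heads (names and sizes are assumptions)."""

    def __init__(self, hidden_size: int = 768, vocab_size: int = 15000, max_masks: int = 4):
        super().__init__()
        # NSW detection: per-token binary tag (standard vs. non-standard)
        self.nsw_head = nn.Linear(hidden_size, 2)
        # Mask prediction: how many mask slots a NSW token expands into
        self.mask_head = nn.Linear(hidden_size, max_masks + 1)
        # Lexical normalization: vocabulary distribution for each masked slot
        self.norm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: [batch, seq_len, hidden_size] from the shared encoder
        return (
            self.nsw_head(hidden_states),
            self.mask_head(hidden_states),
            self.norm_head(hidden_states),
        )

# Smoke test with dummy encoder outputs
nsw_logits, mask_logits, norm_logits = MultiTaskHeads()(torch.randn(1, 16, 768))
```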

## Features

- **Self-contained inference**: Built-in `normalize_text` method
- **NSW detection**: Built-in `detect_nsw` method for detailed analysis
- **HuggingFace compatible**: Works seamlessly with `AutoModelForMaskedLM`
- **Production ready**: No hardcoded patterns, works for any Vietnamese text
- **Multi-token expansion**: Handles cases like "sv" → "sinh viên", "ctrai" → "con trai"
- **Confidence scoring**: Provides confidence scores for NSW detection and normalization

## Installation

```bash
pip install transformers torch
```

## Usage

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load model and tokenizer
model_repo = "hadung1802/visobert-normalizer"
tokenizer = AutoTokenizer.from_pretrained(model_repo)
model = AutoModelForMaskedLM.from_pretrained(model_repo, trust_remote_code=True)

# Normalize text
text = "sv dh gia dinh chua cho di lam :))"
normalized_text, source_tokens, predicted_tokens = model.normalize_text(
    tokenizer, text, device='cpu'
)

print(f"Original: {text}")
print(f"Normalized: {normalized_text}")
```
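
`normalize_text` also returns the source and predicted token sequences, which you can use to inspect individual substitutions, including multi-token expansions. A minimal follow-up, assuming the two lists are aligned position by position (an assumption, not a documented guarantee):

```python
# Show only the tokens the model actually changed
for src, pred in zip(source_tokens, predicted_tokens):
    if src != pred:
        print(f"{src} → {pred}")
```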

### NSW Detection

```python
# Detect Non-Standard Words (NSW) in text
text = "nhìn thôi cung thấy đau long quá đi :))"
nsw_results = model.detect_nsw(tokenizer, text, device='cpu')

print(f"Text: {text}")
for result in nsw_results:
    print(f"NSW: '{result['nsw']}' → '{result['prediction']}' (confidence: {result['confidence_score']})")
```

### Batch Processing

```python
texts = [
    "sv dh gia dinh chua cho di lam :))",
    "chúng nó bảo em là ctrai",
    "t vs b chơi vs nhau đã lâu"
]

for text in texts:
    normalized_text, _, _ = model.normalize_text(tokenizer, text, device='cpu')
    print(f"{text} → {normalized_text}")
```

### Expected Output

#### Text Normalization
```
sv dh gia dinh chua cho di lam :)) → sinh viên đại học gia đình chưa cho đi làm :))
chúng nó bảo em là ctrai → chúng nó bảo em là con trai
t vs b chơi vs nhau đã lâu → tôi với bạn chơi với nhau đã lâu
```

#### NSW Detection
```python
# Input: "nhìn thôi cung thấy đau long quá đi :))"
[
  {
    "index": 3,
    "start_index": 10,
    "end_index": 14,
    "nsw": "cung",
    "prediction": "cũng",
    "confidence_score": 0.9415
  },
  {
    "index": 6,
    "start_index": 24,
    "end_index": 28,
    "nsw": "long",
    "prediction": "lòng",
    "confidence_score": 0.7056
  }
]
```

### NSW Detection Output Format

The `detect_nsw` method returns a list of dictionaries with the following structure:

- **`index`**: Position of the token in the sequence
- **`start_index`**: Start character position in the original text
- **`end_index`**: End character position in the original text
- **`nsw`**: The original non-standard word (detokenized)
- **`prediction`**: The predicted normalized word (detokenized)
- **`confidence_score`**: Combined confidence score (0.0 to 1.0)
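
These fields are enough to splice the predictions back into the original string yourself, for example to keep only high-confidence fixes. A minimal sketch: the helper name and the 0.5 threshold are illustrative, and the exclusive `end_index` convention is inferred from the example spans above.

```python
def apply_nsw_fixes(text, nsw_results, min_confidence=0.5):
    # Replace right to left so earlier character offsets stay valid
    for r in sorted(nsw_results, key=lambda r: r["start_index"], reverse=True):
        if r["confidence_score"] >= min_confidence:
            text = text[:r["start_index"]] + r["prediction"] + text[r["end_index"]:]
    return text

print(apply_nsw_fixes("nhìn thôi cung thấy đau long quá đi :))", nsw_results))
# nhìn thôi cũng thấy đau lòng quá đi :))
```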