Leonardo
committed on
Update scripts/text_cleaner_tool.py
- scripts/text_cleaner_tool.py +20 -1
scripts/text_cleaner_tool.py
CHANGED
|
@@ -11,7 +11,7 @@ class TextCleanerTool(Tool):
|
|
| 11 |
name = "clean_text"
|
| 12 |
description = (
|
| 13 |
"Cleans and normalizes text using the cleantext library. "
|
| 14 |
-
"
|
| 15 |
)
|
| 16 |
inputs = {
|
| 17 |
"text": {"type": "string", "description": "The input text to clean"},
|
|
@@ -33,6 +33,25 @@ class TextCleanerTool(Tool):
|
|
| 33 |
"""
|
| 34 |
Clean text using the cleantext library with flexible options.
|
| 35 |
|
| 36 |
Example API:
|
| 37 |
clean("some input",
|
| 38 |
fix_unicode=True, # fix various unicode errors
|
|
|
|
| 11 |
name = "clean_text"
|
| 12 |
description = (
|
| 13 |
"Cleans and normalizes text using the cleantext library. "
|
| 14 |
+
"Transforms messy user-generated content into normalized text."
|
| 15 |
)
|
| 16 |
inputs = {
|
| 17 |
"text": {"type": "string", "description": "The input text to clean"},
|
|
|
|
| 33 |
"""
|
| 34 |
Clean text using the cleantext library with flexible options.
|
| 35 |
|
| 36 |
+
User-generated content on the Web and in social media is often dirty. Preprocess your scraped data with `clean-text` to create a normalized text representation. For instance, turn this corrupted input:
|
| 37 |
+
|
| 38 |
+
```
|
| 39 |
+
A bunch of \\u2018new\\u2019 references, including [Moana](https://en.wikipedia.org/wiki/Moana_%282016_film%29).
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
»Yóù àré rïght <3!«
|
| 43 |
+
```
|
| 44 |
+
|
| 45 |
+
into this clean output:
|
| 46 |
+
|
| 47 |
+
```
|
| 48 |
+
A bunch of 'new' references, including [moana](<URL>).
|
| 49 |
+
|
| 50 |
+
"you are right <3!"
|
| 51 |
+
```
|
| 52 |
+
|
| 53 |
+
`clean-text` uses ftfy, unidecode and numerous hand-crafted rules, i.e., RegEx.
|
| 54 |
+
|
| 55 |
Example API:
|
| 56 |
clean("some input",
|
| 57 |
fix_unicode=True, # fix various unicode errors
|
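The transformation described in the added docstring can be approximated with the standard library alone. The sketch below is not the `clean-text` implementation (which, per the docstring, relies on ftfy, unidecode, and hand-crafted regular expressions); the function name `rough_clean` and the `URL_PATTERN` regex are hypothetical names introduced here purely for illustration.

```python
import re
import unicodedata

# Hypothetical, simplified URL matcher; clean-text's own rules are more thorough.
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")

# Map typographic quotes and guillemets to plain ASCII quotes.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u00ab": '"',
    "\u00bb": '"',
})


def rough_clean(text: str) -> str:
    """Rough stdlib-only approximation of the cleanup the docstring describes."""
    text = text.translate(QUOTE_MAP)
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase before inserting the placeholder so "<URL>" keeps its case.
    text = text.lower()
    text = URL_PATTERN.sub("<URL>", text)
    # Collapse runs of spaces and tabs into a single space.
    return re.sub(r"[ \t]+", " ", text).strip()


if __name__ == "__main__":
    print(rough_clean("\u00bbY\u00f3\u00f9 \u00e0r\u00e9   r\u00efght <3!\u00ab"))
```

Unlike the sample output quoted in the docstring, this sketch lowercases the whole string, so it should be read as an illustration of the individual steps rather than a drop-in replacement for `clean()`.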