Leonardo committed
Commit fe63cd7 (verified)
1 Parent(s): 8520a66

Update scripts/text_cleaner_tool.py

Files changed (1)
  1. scripts/text_cleaner_tool.py +20 -1
scripts/text_cleaner_tool.py CHANGED
@@ -11,7 +11,7 @@ class TextCleanerTool(Tool):
     name = "clean_text"
     description = (
         "Cleans and normalizes text using the cleantext library. "
-        "Example usage: clean_text(text='Your text here', options={'lower': True, 'no_urls': True})"
+        "Transforms messy user-generated content into normalized text."
     )
     inputs = {
         "text": {"type": "string", "description": "The input text to clean"},
@@ -33,6 +33,25 @@ class TextCleanerTool(Tool):
         """
         Clean text using the cleantext library with flexible options.
 
+        User-generated content on the Web and in social media is often dirty. Preprocess your scraped data with `clean-text` to create a normalized text representation. For instance, turn this corrupted input:
+
+        ```
+        A bunch of \\u2018new\\u2019 references, including [Moana](https://en.wikipedia.org/wiki/Moana_%282016_film%29).
+
+
+        »Yóù àré rïght <3!«
+        ```
+
+        into this clean output:
+
+        ```
+        A bunch of 'new' references, including [moana](<URL>).
+
+        "you are right <3!"
+        ```
+
+        `clean-text` uses ftfy, unidecode and numerous hand-crafted rules, i.e., RegEx.
+
         Example API:
         clean("some input",
             fix_unicode=True,            # fix various unicode errors
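For readers without the `clean-text` package installed, here is a minimal stdlib-only sketch of the kind of normalization it performs. The helper name `simple_clean` and its options are hypothetical illustrations, not part of the committed tool, and this covers only a fraction of what `clean-text` (via ftfy and unidecode) actually does:

```python
import re
import unicodedata

def simple_clean(text: str, lower: bool = True, no_urls: bool = True) -> str:
    """Toy stand-in for cleantext.clean(): unicode normalization,
    optional lowercasing, and URL replacement with a <URL> token."""
    # NFKC folds compatibility characters, e.g. a non-breaking space
    # becomes a plain space.
    text = unicodedata.normalize("NFKC", text)
    if lower:
        text = text.lower()
    if no_urls:
        # Replace URLs after lowercasing so the placeholder keeps its case.
        text = re.sub(r"https?://\S+", "<URL>", text)
    # Collapse runs of spaces/tabs left behind by the substitutions.
    return re.sub(r"[ \t]+", " ", text).strip()

print(simple_clean("Visit https://example.com NOW"))  # visit <URL> now
```

The real library exposes many more switches (`fix_unicode`, `to_ascii`, `no_emails`, `no_punct`, etc.), but the pipeline shape — normalize, case-fold, substitute, collapse whitespace — is the same idea the docstring's before/after example demonstrates.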