Spaces:
Running
Running
| title: Text Duplicates | |
| emoji: π€ | |
| colorFrom: green | |
| colorTo: purple | |
| sdk: gradio | |
| sdk_version: 3.0.2 | |
| app_file: app.py | |
| pinned: false | |
| tags: | |
| - evaluate | |
| - measurement | |
| description: >- | |
| Returns the duplicate fraction of duplicate strings in the input. | |
| # Measurement Card for Text Duplicates | |
| ## Measurement Description | |
| The `text_duplicates` measurement returns the fraction of duplicated strings in the input data. | |
| ## How to Use | |
| This measurement requires a list of strings as input: | |
| ```python | |
| >>> data = ["hello sun","hello moon", "hello sun"] | |
| >>> duplicates = evaluate.load("text_duplicates") | |
| >>> results = duplicates.compute(data=data) | |
| ``` | |
| ### Inputs | |
| - **data** (list of `str`): The input list of strings for which the duplicates are calculated. | |
| ### Output Values | |
| - **duplicate_fraction**(`float`): the fraction of duplicates in the input string(s). | |
| - **duplicates_dict**(`list`): (optional) a list of tuples with the duplicate strings and the number of times they are repeated. | |
| By default, this measurement outputs a dictionary containing the fraction of duplicates in the input string(s) (`duplicate_fraction`): | |
| ) | |
| ```python | |
| {'duplicate_fraction': 0.33333333333333337} | |
| ``` | |
| With the `list_duplicates=True` option, this measurement will also output a dictionary of tuples with duplicate strings and their counts. | |
| ```python | |
| {'duplicate_fraction': 0.33333333333333337, 'duplicates_dict': {'hello sun': 2}} | |
| ``` | |
| Warning: the `list_duplicates=True` function can be memory-intensive for large datasets. | |
| ### Examples | |
| Example with no duplicates | |
| ```python | |
| >>> data = ["foo", "bar", "foobar"] | |
| >>> duplicates = evaluate.load("text_duplicates") | |
| >>> results = duplicates.compute(data=data) | |
| >>> print(results) | |
| {'duplicate_fraction': 0.0} | |
| ``` | |
| Example with multiple duplicates and `list_duplicates=True`: | |
| ```python | |
| >>> data = ["hello sun", "goodbye moon", "hello sun", "foo bar", "foo bar"] | |
| >>> duplicates = evaluate.load("text_duplicates") | |
| >>> results = duplicates.compute(data=data, list_duplicates=True) | |
| >>> print(results) | |
| {'duplicate_fraction': 0.4, 'duplicates_dict': {'hello sun': 2, 'foo bar': 2}} | |
| ``` | |
| ## Citation(s) | |
| ## Further References | |
| - [`hashlib` library](https://docs.python.org/3/library/hashlib.html) | |