---
base_model: pawan2411/address_net
datasets: []
language: []
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:4008
- loss:MultipleNegativesRankingLoss
widget:
- source_sentence: Orchard Road 313, Singapore 238895
sentences:
- Orchard Rd 313, Singapore 238895
- 15 Rue de la Paix/75002/France
- NY, 5th Avenue and 57th Street
- source_sentence: 1 Raffles Place, One Raffles Place, Singapore 048616
sentences:
- 1 Raffles Place, Singapore 048616
- Madrid 28001 Spain Calle Serrano 30
- Kurfürstendamm 185/10707 Berlin/Germany
- source_sentence: Kurfürstendamm 207-208, 10719 Berlin, Germany
sentences:
- Argentina CABA C1073ABA 1925 Avenida 9 de Julio
- Kurfürstendamm ๒๐๗-๒๐๘, ๑๐๗๑๙ Berlin, Germany
- 123 Main St, Anytown, AB T1A 1A1
- source_sentence: Via Tornabuoni, 50123 Firenze FI, Italy
sentences:
- Hamngatan 18-20, Stockholm, Sweden
- 1 Florida, Argentina
- Tornabuoni St, 50123 Italy
- source_sentence: Nanjing Road Pedestrian Street, Huangpu, Shanghai 200001, China
sentences:
- Nanjing Rd Ped St, Huangpu Dist, Shanghai, China
- 5 Rue du Faubourg Saint-Honoré, Paris, France
- 6 Place d'Italie, Paris
---
## Address Embedding Model
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b697a4ce95149b1205f652/X7wZsnDYXkZilbCaa2U8g.png)
This model generates embeddings for postal addresses, designed to support address matching, deduplication, and standardization tasks.
## Model description
The Address Matching Embedding Model is designed to create vector representations of addresses that capture semantic similarities, making it easier to match and deduplicate addresses across different formats and styles.
- **Model Type:** Sentence Transformer
- **Base model:** [pawan2411/address_net](https://huggingface.co/pawan2411/address_net) <!-- at revision 59a25ad94c91cf025ae8d44f21e404c387065b4b -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
### Model Sources
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
### Full Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
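The Pooling module above uses mean pooling (`pooling_mode_mean_tokens: True`): token embeddings are averaged over non-padding positions to produce a single sentence vector. As a rough illustration, here is a numpy sketch of that step, using toy dimensions rather than the model's actual 768:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over non-padding positions (mask == 1)."""
    mask = attention_mask[..., None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # sum real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid divide-by-zero
    return summed / counts

# Toy example: one sequence of 3 token vectors, last position padded.
tokens = np.array([[[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(tokens, mask))  # [[2. 2.]] (mean of the two real tokens)
```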
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("pawan2411/address-emnet")
# Run inference
sentences = [
    '60 Ratchadaphisek Rd, Khwaeng Khlong Toei, Khet Khlong Toei, Krung Thep Maha Nakhon 10110',
    '60 Ratchadaphisek Road, Krung Thep Maha Nakhon, Thailand',
    '61 Ratchadaphisek Road, Krung Thep Maha Nakhon, Thailand'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
```
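A common use of these similarity scores is flagging near-duplicate addresses. Below is a minimal sketch of that, not part of the library: a hypothetical `find_duplicates` helper that pairs rows whose cosine similarity exceeds a threshold. The 0.9 threshold is an illustrative assumption, not a tuned value, and the toy vectors stand in for real `model.encode(...)` output.

```python
import numpy as np

def find_duplicates(embeddings, threshold=0.9):
    """Return index pairs (i, j), i < j, whose cosine similarity
    meets or exceeds `threshold` (an illustrative cutoff)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    sims = normalized @ normalized.T        # pairwise cosine similarities
    pairs = []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                pairs.append((i, j))
    return pairs

# Toy stand-ins for model.encode(addresses); the first two are near-identical.
vecs = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
print(find_duplicates(vecs))  # [(0, 1)]
```

With real data, you would call `find_duplicates(model.encode(addresses))` and tune the threshold against labeled duplicate pairs for your address formats.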
## Training Details
* Size: 4,008 training samples
* Columns: <code>sentence_0</code> and <code>sentence_1</code>
* Approximate statistics based on the first 1,000 samples:

| | sentence_0 | sentence_1 |
|:--------|:-----------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|
| type | string | string |
| details | <ul><li>min: 10 tokens</li><li>mean: 16.73 tokens</li><li>max: 29 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 11.4 tokens</li><li>max: 27 tokens</li></ul> |

* Samples:

| sentence_0 | sentence_1 |
|:------------------------------------------------------------------------------------|:------------------------------------------------|
| <code>1-7-1 Konan, Minato City, Tokyo 108-0075, Japan</code> | <code>1-7-1 Konan, Tokyo 108-0075, Japan</code> |
| <code>Avenida Paulista, 1000 - Bela Vista, São Paulo - SP, 01310-100, Brazil</code> | <code>Bela Vista 01310-100</code> |
| <code>Strada Lipscani 25, București 030031, Romania</code> | <code>Strada Lipscani București</code> |

* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
```json
{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
```
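The training config lists `MultipleNegativesRankingLoss` with a scale of 20.0 and cosine similarity. As a hedged illustration of what that loss computes, here is a toy numpy version, not the sentence-transformers implementation: each anchor in a batch is scored against every positive, and cross-entropy pushes the true pair's scaled cosine similarity above the in-batch negatives.

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """Toy MultipleNegativesRankingLoss: row i of `positives` is the true
    match for row i of `anchors`; all other rows serve as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)  # (batch, batch) scaled cosine similarities
    # log-softmax per row; the correct pair sits on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Matched pairs give a near-zero loss; mismatched pairs a large one.
emb = np.eye(3)
print(mnr_loss(emb, emb) < mnr_loss(emb, np.roll(emb, 1, axis=0)))  # True
```

This in-batch-negatives setup is why the dataset needs only positive (address, variant) pairs: every other pair in the batch acts as a negative automatically.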
## Citation
### BibTeX
#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```