lamossta committed · Commit 4820148 · 1 Parent(s): 06c5510

env config files

Dockerfile ADDED
@@ -0,0 +1,35 @@
+ FROM python:3.12-slim
+
+ WORKDIR /app
+
+ RUN apt-get update && apt-get install -y --no-install-recommends build-essential make nginx && \
+     rm -rf /var/lib/apt/lists/*
+
+ COPY requirements.txt .
+ COPY requirements-cpu.txt .
+ COPY Makefile .
+ RUN make install
+
+ COPY src/ src/
+ COPY data/ data/
+ COPY pages/ pages/
+ COPY app.py .
+ COPY main.py .
+ COPY nginx.conf .
+ COPY start.sh .
+
+ RUN make preprocess && make augment
+
+ RUN chmod +x start.sh
+
+ RUN mkdir -p /tmp/nginx && \
+     chmod -R 777 /var/log/nginx /var/lib/nginx /tmp/nginx
+
+ RUN useradd -m -u 1000 user && \
+     mkdir -p /app/models && \
+     chown -R 1000:1000 /app/models
+ USER user
+
+ EXPOSE 7860
+
+ CMD ["./start.sh"]
assignment.md ADDED
@@ -0,0 +1,112 @@
+ # Entity Sentiment
+
+ Your task is to create a FastAPI application which will classify sentiment with respect to specific entities in a text.
+
+ For example, given a text such as `Google had solid Q4 2025 earnings but Microsoft's were below expectations`, the system should be able to say that for `Google` the sentiment is `positive`, but for `Microsoft` the sentiment is `negative`.
+
+ You will use `positive`, `neutral`, and `negative` as sentiment values.
+
+ # Data
+
+ The data is in `data.json`. It is an array of samples with the following format:
+
+ ```json
+ [
+   {
+     "id": int, sample ID,
+     "text": str, article text,
+     "entities": [
+       {
+         "entity_id": int, entity ID,
+         "entity_text": str, text of the entity,
+         "entity_type": str, one of ["company", "location"],
+         "positions": [
+           {
+             "position_text": str, text of the occurrence,
+             "length": int, length of the occurrence,
+             "offset": int, offset from the start of text
+           },
+           ... other positions
+         ],
+         "label": str, one of ["positive", "negative", "neutral"]
+       },
+       ... other entities
+     ],
+   },
+   ... other samples
+ ]
+ ```
+
+ Each sample can have multiple entities, each with its own label, and multiple positions per entity (an entity can occur multiple times in the text).
+
+ # Assignment
+
+ Create a FastAPI application which will expose a `/predict` endpoint. The endpoint will accept an array of samples in the same format as the example above, except for the `label` key.
+
+ ```json
+ [
+   {
+     "id": int, sample ID,
+     "text": str, article text,
+     "entities": [
+       {
+         "entity_id": int, entity ID,
+         "entity_text": str, text of the entity,
+         "entity_type": str, one of ["company", "location"],
+         "positions": [
+           {
+             "position_text": str, text of the occurrence,
+             "length": int, length of the occurrence,
+             "offset": int, offset from the start of text
+           },
+           ... other positions
+         ]
+       },
+       ... other entities
+     ]
+   },
+   ... other inputs
+ ]
+ ```
+
+ For each sample, it will perform the sentiment classification and output an object with the following shape:
+
+ ```json
+ [
+   {
+     "id": int, sample ID,
+     "entities": [
+       {
+         "entity_id": int, entity ID,
+         "entity_text": str, text of the entity,
+         "classification": str, one of ["positive", "negative", "neutral"]
+       },
+       ...
+     ]
+   },
+   ... other outputs
+ ]
+ ```
+
+ ## Data Preparation
+
+ Examine the data and build an understanding of the dataset. We are interested in data hygiene practices and general good data science practices.
+
+ ## Classification
+
+ Create a system which will perform the classification. We do not expect perfect metrics or extensive experiments (the point of this assignment is not to spend days training), but we do expect a well-reasoned approach which can deal with all the different edge cases this task has. Make sure to note the best metrics you achieve. Feel free to use any framework, architecture, and approach.
+
+ ## API and Docker
+
+ Create the FastAPI application, a `Dockerfile` for the application, and a `docker-compose.yml` file so that we can simply run the application with `docker compose up`.
+
+ # Deliverables and Documentation
+
+ Share a link to your GitHub repository or a zipped folder containing:
+ 1. The Python code (the application itself, any data analysis, preprocessing, training, and evaluation scripts)
+ 2. The `Dockerfile` and `docker-compose.yml`
+ 3. Short documentation
+
+ ## Documentation
+
+ The documentation should not be a formal report; it can be just structured notes. The goal is for us to understand your process, the decisions you made, why you made them, and what worked and what did not. We want to see how you approach a relatively open-ended problem and what solution you come up with. If you know your solution has shortcomings or edge cases it cannot deal with, note them.
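Note on `assignment.md`: the request and response shapes described there differ mainly in that the response drops `text`, `entity_type`, and `positions`, and adds a `classification` per entity. The sketch below illustrates that mapping in plain Python. It is not code from this commit: `build_response` and `always_neutral` are hypothetical names, and the placeholder classifier stands in for whatever model the repository actually uses.

```python
from typing import Callable

# The three sentiment values allowed by the assignment.
LABELS = ("positive", "negative", "neutral")


def build_response(samples: list[dict], classify: Callable[[str, dict], str]) -> list[dict]:
    """Map request samples to the response shape described in assignment.md."""
    response = []
    for sample in samples:
        entities_out = []
        for entity in sample["entities"]:
            label = classify(sample["text"], entity)
            if label not in LABELS:
                raise ValueError(f"unexpected label: {label!r}")
            # Only the ID, the entity text, and the predicted label are returned.
            entities_out.append({
                "entity_id": entity["entity_id"],
                "entity_text": entity["entity_text"],
                "classification": label,
            })
        response.append({"id": sample["id"], "entities": entities_out})
    return response


def always_neutral(text: str, entity: dict) -> str:
    # Hypothetical placeholder classifier, not the model from this repository.
    return "neutral"
```

A real `/predict` handler would presumably call something like `build_response(payload, model_predict)` after validating the payload; keeping the shaping logic separate from the model makes it easy to test without loading any weights.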
docker-compose.yml ADDED
@@ -0,0 +1,12 @@
+ services:
+   app:
+     build: .
+     ports:
+       - "7860:7860"
+     env_file:
+       - .env
+     volumes:
+       - models:/app/models
+
+ volumes:
+   models:
nginx.conf ADDED
@@ -0,0 +1,32 @@
+ pid /tmp/nginx/nginx.pid;
+
+ events {
+     worker_connections 1024;
+ }
+
+ http {
+     client_body_temp_path /tmp/nginx/client_body;
+     proxy_temp_path /tmp/nginx/proxy;
+     fastcgi_temp_path /tmp/nginx/fastcgi;
+     uwsgi_temp_path /tmp/nginx/uwsgi;
+     scgi_temp_path /tmp/nginx/scgi;
+
+     server {
+         listen 7860;
+
+         location /api/ {
+             proxy_pass http://127.0.0.1:8000/;
+             proxy_set_header Host $host;
+             proxy_set_header X-Real-IP $remote_addr;
+         }
+
+         location / {
+             proxy_pass http://127.0.0.1:8501;
+             proxy_set_header Host $host;
+             proxy_set_header X-Real-IP $remote_addr;
+             proxy_http_version 1.1;
+             proxy_set_header Upgrade $http_upgrade;
+             proxy_set_header Connection "upgrade";
+         }
+     }
+ }
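Note on the routing in `nginx.conf`: because `proxy_pass http://127.0.0.1:8000/;` carries a URI part (the trailing `/`), nginx replaces the matched `/api/` prefix with `/` before forwarding. A request to `/api/predict` should therefore reach the backend on port 8000 as `/predict`, while everything else is passed unchanged to port 8501. The Python sketch below is only a simplified model of that mapping. The `route` function is hypothetical, ignores edge cases such as a bare `/api` path, and is not part of the commit.

```python
def route(path: str) -> tuple[str, str]:
    """Return (upstream, forwarded_path), mirroring the two location blocks."""
    api_prefix = "/api/"
    if path.startswith(api_prefix):
        # proxy_pass ends in "/", so the matched "/api/" prefix is replaced by "/".
        return "http://127.0.0.1:8000", "/" + path[len(api_prefix):]
    # proxy_pass without a URI part forwards the original path as-is.
    return "http://127.0.0.1:8501", path


if __name__ == "__main__":
    for p in ("/api/predict", "/", "/some/page"):
        print(p, "->", route(p))
```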
requirements-cpu.txt ADDED
@@ -0,0 +1,2 @@
+ --index-url https://download.pytorch.org/whl/cpu
+ torch>=2.6.0
requirements-gpu.txt ADDED
@@ -0,0 +1,2 @@
+ torch>=2.6.0
+ torchvision>=0.21.0
requirements.txt ADDED
@@ -0,0 +1,17 @@
+ transformers>=4.50.3
+ pandas>=1.4.0
+ numpy<2
+ scikit-learn
+ matplotlib
+ seaborn
+ onnxruntime>=1.18.0
+ onnxscript>=0.1.0
+ fastapi>=0.121.0
+ uvicorn[standard]>=0.34.0
+ pydantic>=2.12.4
+ streamlit>=1.35.0
+ requests>=2.32.5
+ python-dotenv>=1.2.1
+ huggingface_hub>=0.23.0
+ fasttext-wheel
+ accelerate>=1.1.0
sample_input.json ADDED
@@ -0,0 +1,24 @@
+ [
+   {
+     "id": 0,
+     "text": "Google had solid Q4 2025 earnings but Microsoft's were not great.",
+     "entities": [
+       {
+         "entity_id": 0,
+         "entity_text": "Google",
+         "entity_type": "company",
+         "positions": [
+           {"position_text": "Google", "length": 6, "offset": 0}
+         ]
+       },
+       {
+         "entity_id": 1,
+         "entity_text": "Microsoft",
+         "entity_type": "company",
+         "positions": [
+           {"position_text": "Microsoft", "length": 9, "offset": 38}
+         ]
+       }
+     ]
+   }
+ ]
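Note on `sample_input.json`: each position is meant to satisfy `text[offset:offset + length] == position_text`, so hand-written samples are worth checking before they are sent to the API. The helper below is a minimal sketch of such a check. It assumes `offset` counts characters from the start of `text`; `find_position_mismatches` is a hypothetical name, not a function from this repository.

```python
def find_position_mismatches(sample: dict) -> list[tuple[int, dict]]:
    """Return (entity_id, position) pairs whose offset/length do not match the text."""
    text = sample["text"]
    mismatches = []
    for entity in sample["entities"]:
        for pos in entity["positions"]:
            start = pos["offset"]
            end = start + pos["length"]
            # The slice must reproduce the recorded occurrence exactly.
            if text[start:end] != pos["position_text"]:
                mismatches.append((entity["entity_id"], pos))
    return mismatches


if __name__ == "__main__":
    # Illustrative sample, not taken from the dataset.
    sample = {
        "id": 0,
        "text": "Apple rose while Tesla fell.",
        "entities": [
            {"entity_id": 0, "positions": [{"position_text": "Apple", "length": 5, "offset": 0}]},
            {"entity_id": 1, "positions": [{"position_text": "Tesla", "length": 5, "offset": 19}]},
        ],
    }
    print(find_position_mismatches(sample))  # reports entity 1: "Tesla" starts at 17, not 19
```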
start.sh ADDED
@@ -0,0 +1,10 @@
+ #!/bin/bash
+ set -e
+
+ make hf-download-all
+
+ uvicorn app:app --host 0.0.0.0 --port 8000 &
+
+ streamlit run main.py --server.port 8501 --server.address 0.0.0.0 --server.headless true &
+
+ nginx -g "daemon off;" -c /app/nginx.conf
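Note on calling the service: with the container started by `start.sh` and port 7860 published by `docker-compose.yml`, the API should be reachable through nginx under the `/api/` prefix. The sketch below builds such a request with the standard library only. The `/api/predict` path is an assumption based on the `/predict` endpoint named in `assignment.md` and the `/api/` location in `nginx.conf`, since `app.py` is not part of this diff; `build_predict_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_predict_request(samples: list, base_url: str = "http://localhost:7860") -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the samples as JSON."""
    body = json.dumps(samples).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/predict",  # assumed path, see the note above
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    with open("sample_input.json", encoding="utf-8") as fh:
        request = build_predict_request(json.load(fh))
    # Sending requires the stack to be running, e.g. via `docker compose up`.
    with urllib.request.urlopen(request) as response:
        print(json.load(response))
```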