Spaces: lamossta/sv-task

lamossta committed on Apr 20

Commit 4820148 (1 parent: 06c5510)

env config files


Files changed (9)

Dockerfile +35 -0
assignment.md +112 -0
docker-compose.yml +12 -0
nginx.conf +32 -0
requirements-cpu.txt +2 -0
requirements-gpu.txt +2 -0
requirements.txt +17 -0
sample_input.json +24 -0
start.sh +10 -0

Dockerfile ADDED

	@@ -0,0 +1,35 @@

+FROM python:3.12-slim
+WORKDIR /app
+RUN apt-get update && apt-get install -y --no-install-recommends build-essential make nginx && \
+    rm -rf /var/lib/apt/lists/*
+COPY requirements.txt .
+COPY requirements-cpu.txt .
+COPY Makefile .
+RUN make install
+COPY src/ src/
+COPY data/ data/
+COPY pages/ pages/
+COPY app.py .
+COPY main.py .
+COPY nginx.conf .
+COPY start.sh .
+RUN make preprocess && make augment
+RUN chmod +x start.sh
+RUN mkdir -p /tmp/nginx && \
+    chmod -R 777 /var/log/nginx /var/lib/nginx /tmp/nginx
+RUN useradd -m -u 1000 user && \
+    mkdir -p /app/models && \
+    chown -R 1000:1000 /app/models
+USER user
+EXPOSE 7860
+CMD ["./start.sh"]

assignment.md ADDED

	@@ -0,0 +1,112 @@

+# Entity Sentiment
+Your task is to create a FastAPI application which will classify sentiment with respect to specific entities in a text.
+For example, given a text such as `Google had solid Q4 2025 earnings but Microsoft's were below expectations`, the system should be able to say that for `Google` the sentiment is `positive`, but for `Microsoft` the sentiment is `negative`.
+You will use `positive`, `neutral`, and `negative` as sentiment values.
+# Data
+The data is in `data.json`. It is an array of samples with the following format:
+```json
+[
+    {
+        "id": int, sample ID,
+        "text": str, article text,
+        "entities": [
+            {
+                "entity_id": int, entity ID,
+                "entity_text": str, text of the entity,
+                "entity_type": str, one of ["company", "location"],
+                "positions": [
+                    {
+                        "position_text": str, text of the occurrence,
+                        "length": int, length of the occurrence,
+                        "offset": int, offset from the start of text
+                    },
+                    ... other positions
+                ],
+                "label": str, one of ["positive", "negative", "neutral"]
+            },
+            ... other entities
+        ]
+    },
+    ... other samples
+]
+```
+Each sample can have multiple entities, each with its own label, and multiple positions per entity (an entity can occur multiple times in the text).
+# Assignment
+Create a FastAPI application that exposes a `/predict` endpoint. The endpoint accepts an array of samples in the same format as the data above, except that each entity omits the `label` key.
+```json
+[
+    {
+        "id": int, sample ID,
+        "text": str, article text,
+        "entities": [
+            {
+                "entity_id": int, entity ID,
+                "entity_text": str, text of the entity,
+                "entity_type": str, one of ["company", "location"],
+                "positions": [
+                    {
+                        "position_text": str, text of the occurrence,
+                        "length": int, length of the occurrence,
+                        "offset": int, offset from the start of text
+                    },
+                    ... other positions
+                ]
+            },
+            ... other entities
+        ]
+    },
+    ... other inputs
+]
+```
+The endpoint classifies the sentiment of every entity in every sample and returns an array with the following shape, one object per input sample:
+```json
+[
+    {
+        "id": int, sample ID,
+        "entities": [
+            {
+                "entity_id": int, entity ID,
+                "entity_text": str, text of the entity,
+                "classification": str, one of ["positive", "negative", "neutral"]
+            },
+            ...
+        ]
+    },
+    ... other outputs
+]
+```
+## Data Preparation
+Examine the data and build an understanding of the dataset. We are interested in data hygiene and general good data science practice.
+## Classification
+Create a system which will perform the classification. We do not expect perfect metrics or extensive experiments—the point of this assignment is not to spend days training—but we do expect a well-reasoned approach which can deal with all the different edge cases this task has. Make sure to note the best metrics you achieve. Feel free to use any framework, architecture, and approach.
+## API and Docker
+Create the FastAPI application, a `Dockerfile` for the application, and a `docker-compose.yml` file so that we can simply run the application with `docker compose up`.
+# Deliverables and Documentation
+Share a link to your GitHub repository or a zipped folder containing:
+1. The Python code (the application itself, any data analysis, preprocessing, training, and evaluation scripts)
+2. The `Dockerfile` and `docker-compose.yml`
+3. Short documentation
+## Documentation
+The documentation should not be a formal report; it can be just structured notes. The goal is for us to understand your process, the decisions you made, why you made them, and what worked and what did not. We want to see how you approach a relatively open-ended problem and what solution you come up with. If you know your solution has shortcomings or edge cases it cannot deal with, note them.
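The request and response shapes described in assignment.md can be illustrated with a small, framework-free sketch. It is not the code from this repository (`app.py` is not part of this diff): the `predict` function, the `classify` callback, and the `always_neutral` placeholder are hypothetical names introduced here only to show how each input entity maps to one output entity with a `classification` key.

```python
from typing import Callable

# The three sentiment values named in assignment.md.
LABELS = ("positive", "negative", "neutral")


def predict(samples: list[dict], classify: Callable[[str, dict], str]) -> list[dict]:
    """Build the response array for a list of request samples.

    `classify` is a hypothetical callback: it receives the sample text and
    one entity dict and returns one of LABELS.
    """
    results = []
    for sample in samples:
        entities = []
        for entity in sample["entities"]:
            label = classify(sample["text"], entity)
            if label not in LABELS:
                raise ValueError(f"unexpected label: {label!r}")
            # The response keeps only the identifying fields plus the prediction.
            entities.append(
                {
                    "entity_id": entity["entity_id"],
                    "entity_text": entity["entity_text"],
                    "classification": label,
                }
            )
        results.append({"id": sample["id"], "entities": entities})
    return results


def always_neutral(text: str, entity: dict) -> str:
    # Placeholder standing in for a real model; always answers "neutral".
    return "neutral"
```

In a FastAPI application a function like this would typically sit behind a `POST /predict` handler, with the placeholder replaced by the actual model call.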

docker-compose.yml ADDED

	@@ -0,0 +1,12 @@

+services:
+  app:
+    build: .
+    ports:
+      - "7860:7860"
+    env_file:
+      - .env
+    volumes:
+      - models:/app/models
+volumes:
+  models:

nginx.conf ADDED

	@@ -0,0 +1,32 @@

+pid /tmp/nginx/nginx.pid;
+events {
+    worker_connections 1024;
+}
+http {
+    client_body_temp_path /tmp/nginx/client_body;
+    proxy_temp_path /tmp/nginx/proxy;
+    fastcgi_temp_path /tmp/nginx/fastcgi;
+    uwsgi_temp_path /tmp/nginx/uwsgi;
+    scgi_temp_path /tmp/nginx/scgi;
+    server {
+        listen 7860;
+        location /api/ {
+            proxy_pass http://127.0.0.1:8000/;
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+        }
+        location / {
+            proxy_pass http://127.0.0.1:8501;
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+            proxy_http_version 1.1;
+            proxy_set_header Upgrade $http_upgrade;
+            proxy_set_header Connection "upgrade";
+        }
+    }
+}
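Because `proxy_pass` in the `/api/` location ends with a URI (`/`), nginx replaces the matched `/api/` prefix with `/` before forwarding, so a request for `/api/predict` on port 7860 reaches the backend on port 8000 as `/predict`. The sketch below builds such a request with the standard library. It assumes the FastAPI app serves `/predict` at its root, as the assignment asks (`app.py` is not shown in this diff); `build_predict_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_predict_request(base_url: str, samples: list[dict]) -> urllib.request.Request:
    """Prepare a POST to the API as exposed through the nginx proxy.

    Assumption: the backend route is /predict, so the public path is /api/predict.
    """
    url = base_url.rstrip("/") + "/api/predict"
    body = json.dumps(samples).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage, once the container is running and publishing port 7860:
#   req = build_predict_request("http://localhost:7860", samples)
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```

Requests to any other path fall through to `location /` and are forwarded to the Streamlit server on port 8501.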

requirements-cpu.txt ADDED

	@@ -0,0 +1,2 @@


+--index-url https://download.pytorch.org/whl/cpu
+torch>=2.6.0

requirements-gpu.txt ADDED

	@@ -0,0 +1,2 @@


+torch>=2.6.0
+torchvision>=0.21.0

requirements.txt ADDED

	@@ -0,0 +1,17 @@

+transformers>=4.50.3
+pandas>=1.4.0
+numpy<2
+scikit-learn
+matplotlib
+seaborn
+onnxruntime>=1.18.0
+onnxscript>=0.1.0
+fastapi>=0.121.0
+uvicorn[standard]>=0.34.0
+pydantic>=2.12.4
+streamlit>=1.35.0
+requests>=2.32.5
+python-dotenv>=1.2.1
+huggingface_hub>=0.23.0
+fasttext-wheel
+accelerate>=1.1.0

sample_input.json ADDED

	@@ -0,0 +1,24 @@

+[
+  {
+    "id": 0,
+    "text": "Google had solid Q4 2025 earnings but Microsoft's were not great.",
+    "entities": [
+      {
+        "entity_id": 0,
+        "entity_text": "Google",
+        "entity_type": "company",
+        "positions": [
+          {"position_text": "Google", "length": 6, "offset": 0}
+        ]
+      },
+      {
+        "entity_id": 1,
+        "entity_text": "Microsoft",
+        "entity_type": "company",
+        "positions": [
+          {"position_text": "Microsoft", "length": 9, "offset": 38}
+        ]
+      }
+    ]
+  }
+]
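A quick consistency check is useful for inputs like this one: each position's `offset` and `length` should select exactly `position_text` from `text`. The sketch below assumes offsets are 0-based character indices into the Python string (the assignment only says "offset from the start of text", so the unit is an assumption); `find_position_mismatches` is a hypothetical helper, not part of the repository.

```python
def find_position_mismatches(sample: dict) -> list[tuple[int, str, str]]:
    """Return (entity_id, expected_text, found_text) for every position
    whose offset/length does not select position_text from the sample text.

    Assumes 0-based character offsets.
    """
    text = sample["text"]
    mismatches = []
    for entity in sample["entities"]:
        for pos in entity["positions"]:
            start = pos["offset"]
            found = text[start:start + pos["length"]]
            if found != pos["position_text"]:
                mismatches.append((entity["entity_id"], pos["position_text"], found))
    return mismatches
```

For the sentence `Google had solid Q4 2025 earnings but Microsoft's were not great.`, `Microsoft` starts at character 38. A position with offset 40 and length 9 would select `crosoft's` and be reported as a mismatch.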

start.sh ADDED

	@@ -0,0 +1,10 @@

+#!/bin/bash
+set -e
+make hf-download-all
+uvicorn app:app --host 0.0.0.0 --port 8000 &
+streamlit run main.py --server.port 8501 --server.address 0.0.0.0 --server.headless true &
+nginx -g "daemon off;" -c /app/nginx.conf
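start.sh launches uvicorn and Streamlit in the background and then starts nginx immediately, so requests that arrive before either upstream is listening would get a 502 from nginx. One way to observe or guard against that window is to poll the upstream ports. The sketch below is a hypothetical helper (`wait_for_port` is not part of the repository) that uses only the standard library; ports 8000 and 8501 are the ones used in start.sh.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds,
    or False if none succeeds within `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Usage inside the container, before relying on the proxy:
#   api_ready = wait_for_port("127.0.0.1", 8000)
#   ui_ready = wait_for_port("127.0.0.1", 8501)
```

A successful TCP connect only shows that the process is listening; it does not prove that models have finished loading.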