There is a folder named "data-persistent" in the running container that the code reads from and writes to, and I want to persist the changes made in that folder. When I use a persistent volume, it removes/hides the data in that folder and the code throws an error. What should my approach be?
FROM python:latest
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
#RUN mkdir data-persistent
ADD linkedin_scrape.py .
COPY requirements.txt ./requirements.txt
COPY final_links.csv ./final_links.csv
COPY credentials.txt ./credentials.txt
COPY vectorizer.pk ./vectorizer.pk
COPY model_IvE ./model_IvE
COPY model_JvP ./model_JvP
COPY model_NvS ./model_NvS
COPY model_TvF ./model_TvF
COPY nocopy.xlsx ./nocopy.xlsx
COPY data.db /data-persistent/
COPY textdata.txt /data-persistent/
RUN ls -la /data-persistent/*
RUN pip install -r requirements.txt
CMD python linkedin_scrape.py --bind 0.0.0.0:8080 --timeout 90
And here is my deployment file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-first-cluster1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: scrape
  template:
    metadata:
      labels:
        app: scrape
    spec:
      containers:
      - name: scraper
        image: image-name
        #
        ports:
        - containerPort: 8080
        env:
        - name: PORT
          value: "8080"
        volumeMounts:
        - mountPath: "/dev/shm"
          name: dshm
        - mountPath: "/data-persistent/"
          name: tester
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: tester
        persistentVolumeClaim:
          claimName: my-pvc-claim-1
Let me explain the workflow of the code. The code reads from the textdata.txt file, which contains the indices of the links to be scraped (e.g. from 100 to 150), then it scrapes the profiles, inserts them into the data.db file, and then writes to textdata.txt the range to be scraped in the next run (e.g. 150 to 200).
First, the Kubernetes volume mount point shadows the original filesystem content at /data-persistent/, which is why the files baked into the image disappear when the persistent volume is mounted.
To solve this you have several options.
Solution 1
It is not a good idea to copy data into Docker images: it increases the image size and couples the code and data change pipelines. It is better to keep the data in shared storage such as S3 and let an init container compare and sync it into the volume before the main container starts, as sketched below.
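For illustration, here is a minimal sketch of the init-container approach, meant to be merged into the pod template of the deployment above. The bucket name, image, and Secret name are assumptions for the example, not part of your setup.

# Fragment of the Deployment above (spec.template.spec); the init container
# pulls the seed files from S3 into the persistent volume before the
# scraper container starts.
spec:
  template:
    spec:
      initContainers:
      - name: sync-data
        image: amazon/aws-cli                  # official AWS CLI image
        # bucket path is hypothetical; sync only copies changed files
        command: ["aws", "s3", "sync", "s3://my-data-bucket/data-persistent/", "/data-persistent/"]
        envFrom:
        - secretRef:
            name: aws-credentials              # hypothetical Secret holding AWS keys
        volumeMounts:
        - mountPath: "/data-persistent/"
          name: tester

A sync in the opposite direction (for example a CronJob, or a final step in the script) could push the updated data.db and textdata.txt back to the bucket.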
Solution 2
If a cloud service like S3 is not available, you can use a persistent volume type that supports multiple read/write mounts (ReadWriteMany). Attach the same volume to a second, temporary deployment (a busybox image works well) and copy the files in and out with "kubectl cp". Scale the temporary deployment back to zero after the copy is finalized; you can also make this part of your CI pipeline. A sketch follows.
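Below is a minimal sketch of such a temporary copy deployment. It assumes the existing PVC my-pvc-claim-1 supports ReadWriteMany (required while the scraper pods are also mounting it); the deployment name and labels are examples.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pvc-copy-helper            # hypothetical helper name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pvc-copy-helper
  template:
    metadata:
      labels:
        app: pvc-copy-helper
    spec:
      containers:
      - name: copier
        image: busybox:1.36
        command: ["sleep", "3600"]   # keep the pod alive long enough to copy
        volumeMounts:
        - mountPath: "/data-persistent/"
          name: tester
      volumes:
      - name: tester
        persistentVolumeClaim:
          claimName: my-pvc-claim-1
# Then copy the seed files in (look up the pod name with
# "kubectl get pods -l app=pvc-copy-helper"):
#   kubectl cp data.db <copier-pod>:/data-persistent/data.db
#   kubectl cp textdata.txt <copier-pod>:/data-persistent/textdata.txt
# and scale the helper back to zero when done:
#   kubectl scale deployment pvc-copy-helper --replicas=0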