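I have successfully developed a super simple ETL process locally (called load_staging in the following) which extracts data from a remote location and then writes the unprocessed data into a MongoDB container on my local Windows machine. Now I want to schedule this process using the DockerOperator for every task, i.e. I want to create a docker image of my source code and then execute the source code in that image using the DockerOperator. Since I am working on a Windows machine, I can only use Airflow from inside a docker container.
I have started the Airflow container (called webserver in the following) and the MongoDB container (called mongo in the following) with "docker-compose up" and triggered the Airflow DAG manually. According to Airflow, the task is executed successfully, but it seems that the code inside the docker image is not executed: the task finishes far too quickly, and right after the docker container is started from my image, the task exits with return code 0, i.e. I don't see any log output from the task itself. See the logs below: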
[2020-01-20 17:09:44,444] {{docker_operator.py:194}} INFO - Starting docker container from image myaccount/myrepo:load_staging_op
[2020-01-20 17:09:50,473] {{logging_mixin.py:95}} INFO - [2020-01-20 17:09:50,472] {{local_task_job.py:105}} INFO - Task exited with return code 0
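So, my two questions are:
- Did I draw the correct conclusion? What else could be the root of this problem?
- How can I make sure that the code inside the image is always executed?
Below you can find further information about how I set up the DockerOperator, how I defined the image that the DockerOperator is supposed to execute, the docker-compose.yml file that starts the webserver and mongo containers, and the Dockerfile used to create the webserver container.
In my DAG definition file, I specified the DockerOperator like this: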
CONFIG_FILEPATH = "/configs/docker_execution.ini"
data_object_name = "some_name"
task_id_ = "{}_task".format(data_object_name)
cmd = "python /src/etl/load_staging_op/main.py --config_filepath={} --data_object_name={}".format(CONFIG_FILEPATH, data_object_name)
staging_op = DockerOperator(
    command=cmd,
    task_id=task_id_,
    image="myaccount/myrepo:load_staging_op",
    api_version="auto",
    auto_remove=True
)
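One thing worth checking with a command assembled via str.format is how it is tokenised before it reaches the container. The sketch below is not part of the original DAG. It assumes the command string is split shell-style (which is what Python's shlex.split does) and uses a hypothetical helper, build_command, to quote each value so that a data_object_name containing spaces stays a single argument.

```python
import shlex

CONFIG_FILEPATH = "/configs/docker_execution.ini"


def build_command(config_filepath, data_object_name):
    """Hypothetical helper: build the container command with each value shell-quoted."""
    template = (
        "python /src/etl/load_staging_op/main.py "
        "--config_filepath={} --data_object_name={}"
    )
    return template.format(shlex.quote(config_filepath), shlex.quote(data_object_name))


# Unquoted formatting: a name with a space is split into two arguments.
naive = "python main.py --data_object_name={}".format("some name")
naive_tokens = shlex.split(naive)

# Quoted formatting: the name survives as a single argument.
quoted_tokens = shlex.split(build_command(CONFIG_FILEPATH, "some name"))

print(naive_tokens)
print(quoted_tokens)
```

For a plain name such as some_name, the quoted and unquoted commands are identical, so the helper only changes behaviour for values that actually need quoting.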
The Dockerfile of the load_staging_op image mentioned above looks like this:
# Inherit from Python image
FROM python:3.7
# Install environment
USER root
COPY ./src/etl/load_staging_op/requirements.txt ./
RUN pip install -r requirements.txt
# Copy source code files into container
COPY ./configs /configs
COPY ./wsdl /wsdl
COPY ./src/all_constants.py /src/all_constants.py
COPY ./src/etl/load_staging_op/utils.py /src/etl/load_staging_op/utils.py
COPY ./src/etl/load_staging_op/main.py /src/etl/load_staging_op/main.py
# Extend python path so that custom modules are found
ENV PYTHONPATH "${PYTHONPATH}:/src"
ENTRYPOINT [ "sh", "-c"]
The relevant parts of the docker-compose.yml file look like this:
version: '2.1'
services:
  webserver:
    build: ./docker-airflow
    restart: always
    privileged: true
    depends_on:
      - mongo
      - mongo-express
    volumes:
      - ./docker-airflow/dags:/usr/local/airflow/dags
      # source code volume
      - ./src:/src
      - ./docker-airflow/workdir:/home/workdir
      # Mount the docker socket from the host (currently my laptop) into the webserver container
      # so that we can build docker images from inside the webserver container.
      - //var/run/docker.sock:/var/run/docker.sock # the two "//" are needed for windows OS
      - ./configs:/configs
      - ./wsdl:/wsdl
    ports:
      # Change port to 8081 to avoid Jupyter conflicts
      - 8081:8080
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
    networks:
      - mynet
  mongo:
    container_name: mymongo
    image: mongo
    restart: always
    ports:
      - 27017:27017
    networks:
      - mynet
The Dockerfile for the webserver container, which is referenced by the docker-compose.yml above, looks like this:
FROM puckel/docker-airflow:1.10.4
# Adds DAG folder to the PATH
ENV PYTHONPATH "${PYTHONPATH}:/src:/usr/local/airflow/dags"
# Install the optional packages
# (make sure something like docker==4.1.0 is in this requirements.txt file!)
COPY requirements.txt requirements.txt
USER root
RUN pip install -r requirements.txt
# Install docker inside the webserver container
RUN curl -sSL https://get.docker.com/ | sh
ENV SHARE_DIR /usr/local/share
# Install simple text editor for debugging
RUN ["apt-get", "update"]
RUN ["apt-get", "-y", "install", "vim"]
Thanks for your help, I really appreciate it!
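ENTRYPOINT ["sh", "-c"] mainly has the effect of making the container ignore all of its command-line arguments. I'd expect that setup to run python, ignore all of the other options, and exit immediately. You should be able to just delete that ENTRYPOINT line. (Also consider what happens if data_object_name has spaces or punctuation in it.) - David Maze
When I replace ENTRYPOINT [ "sh", "-c"] with CMD python /src/etl/load_staging_op/main.py --config_filepath=/configs/docker_execution.ini --data_object_name=some_name, then build the task's image with docker build -t myaccount/myrepo:load_staging_op -f path_to_dockerfile . and run docker run -it myaccount/myrepo:load_staging_op, the task obviously fails, but I can see some log output. I will test this tomorrow! - Kevin Südmersen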
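The behaviour David Maze describes can be reproduced without Docker. The following is a minimal sketch, assuming a POSIX sh is available on the PATH, with echo standing in for python. With sh -c, only the first word after -c is the command string; any further words become $0, $1, and so on, and are never handed to the command.

```python
import subprocess

# Mimics ENTRYPOINT ["sh", "-c"] followed by a command that was split into words:
# "echo" is the whole script, "hello" becomes $0 and "world" becomes $1.
split_words = subprocess.run(
    ["sh", "-c", "echo", "hello", "world"],
    stdout=subprocess.PIPE,
    universal_newlines=True,
)

# Passing the command as ONE string after -c runs it with its arguments.
single_string = subprocess.run(
    ["sh", "-c", "echo hello world"],
    stdout=subprocess.PIPE,
    universal_newlines=True,
)

print(repr(split_words.stdout))    # the arguments were dropped
print(repr(single_string.stdout))  # the arguments were used
```

By the same logic, sh -c python /src/etl/load_staging_op/main.py ... would presumably start a bare python with no script and no input, which exits with return code 0 almost immediately. That would be consistent with the log in the question.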