SageMaker endpoint stuck in "Creating" status and never progresses.


I am trying to deploy a SageMaker endpoint, but it is stuck indefinitely in the "Creating" stage. Below are my Dockerfile and the training/serving script. Model training completes without any issues; only the endpoint deployment hangs at "Creating".

Here is the folder structure:

|_code
   |_train_serve.py
|_Dockerfile

Here is the Dockerfile:
# ##########################################################

# Adapt your container (to work with SageMaker)
# # https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html
# # https://hub.docker.com/r/huanjason/scikit-learn/dockerfile

ARG REGION=us-east-1

FROM python:3.7

RUN apt-get update && apt-get -y install gcc

RUN pip3 install \
        # numpy==1.16.2 \
        numpy \
        # scikit-learn==0.20.2 \
        scikit-learn \
        pandas \
        # scipy==1.2.1 \
        scipy \
        mlflow

RUN rm -rf /root/.cache

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE

# Install sagemaker-training toolkit to enable SageMaker Python SDK
RUN pip3 install sagemaker-training

ENV PATH="/opt/ml/code:${PATH}"

# Copies the training code inside the container
COPY  /code /opt/ml/code

# Defines train_serve.py as script entrypoint
ENV SAGEMAKER_PROGRAM train_serve.py
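For reference, the sagemaker-training toolkit invokes the script named by `SAGEMAKER_PROGRAM`, passing hyperparameters as command-line flags and exposing channel paths through `SM_*` environment variables. A minimal sketch of that argument-parsing contract, mirroring the parser in `train_serve.py` (the environment values and hyperparameters below are illustrative, set by hand rather than by SageMaker):

```python
import argparse
import os

# Simulate the environment the sagemaker-training toolkit sets up inside the container.
os.environ["SM_MODEL_DIR"] = "/opt/ml/model"
os.environ["SM_CHANNEL_TRAIN"] = "/opt/ml/input/data/train"

parser = argparse.ArgumentParser()
parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
parser.add_argument("--features", type=str)
parser.add_argument("--target", type=str)

# Hyperparameters supplied to the estimator arrive as command-line flags.
args, _ = parser.parse_known_args(["--features", "sqft_living", "--target", "price"])
```

With this contract, `args.model_dir` resolves to `/opt/ml/model` from the environment, while `--features` and `--target` come from the hyperparameters.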

Here is the script used to train and serve the model:

train_serve.py

import os
import ast
import warnings
import sys
import json
import argparse
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import PolynomialFeatures
from urllib.parse import urlparse
import logging
import pickle

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    # Data, model, and output directories
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train-file', type=str, default='kc_house_data_train.csv')
    parser.add_argument('--test-file', type=str, default='kc_house_data_test.csv')
    parser.add_argument('--features', type=str)  # we ask user to explicitly name features
    parser.add_argument('--target', type=str) # we ask user to explicitly name the target

    args, _ = parser.parse_known_args()

    warnings.filterwarnings("ignore")
    np.random.seed(40)

    # Reading training and testing datasets
    logging.info('reading training and testing datasets')
    logging.info(f"{args.train} {args.train_file} {args.test} {args.test_file}")
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))
    
    logging.info(args.features.split(','))
    logging.info(args.target)
    train_x = np.array(train_df[args.features.split(',')]).reshape(-1,1)
    test_x = np.array(test_df[args.features.split(',')]).reshape(-1,1)
    train_y = np.array(train_df[args.target]).reshape(-1,1)
    test_y = np.array(test_df[args.target]).reshape(-1,1)  

    reg = linear_model.LinearRegression()

    reg.fit(train_x, train_y)
    predicted_price = reg.predict(test_x)
    (rmse, mae, r2) = eval_metrics(test_y, predicted_price)

    logging.info(f"        Linear model: (features={args.features}, target={args.target})")
    logging.info(f"            RMSE: {rmse}")
    logging.info(f"            MAE: {mae}")
    logging.info(f"            R2: {r2}")

    model_path = os.path.join(args.model_dir, "model.pkl")
    logging.info(f"saving to {model_path}")          
    logging.info(args.model_dir)
    with open(model_path, 'wb') as path:
        pickle.dump(reg, path)


def model_fn(model_dir):
    with open(os.path.join(model_dir, "model.pkl"), "rb") as input_model:
        model = pickle.load(input_model)
    return model
    
def predict_fn(input_object, model):
    _return = model.predict(input_object)
    return _return
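Before deploying, the serving functions can be smoke-tested locally. A minimal sketch that persists a model the same way `train_serve.py` does and then exercises `model_fn` and `predict_fn` (the toy data and temporary directory are illustrative; the two functions are redefined here so the sketch is self-contained):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn import linear_model

def model_fn(model_dir):
    with open(os.path.join(model_dir, "model.pkl"), "rb") as input_model:
        return pickle.load(input_model)

def predict_fn(input_object, model):
    return model.predict(input_object)

# Train a toy model and persist it the same way train_serve.py does.
model_dir = tempfile.mkdtemp()
reg = linear_model.LinearRegression()
train_x = np.array([[1.0], [2.0], [3.0]])
train_y = np.array([[2.0], [4.0], [6.0]])
reg.fit(train_x, train_y)
with open(os.path.join(model_dir, "model.pkl"), "wb") as f:
    pickle.dump(reg, f)

# Exercise the serving path: load the pickled model, then predict.
loaded = model_fn(model_dir)
preds = predict_fn(np.array([[4.0]]), loaded)
```

Since the toy data lie exactly on the line y = 2x, the prediction for 4.0 is 8.0, which confirms the round trip through pickle works before involving the endpoint.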

Are any errors logged in CloudWatch? How long has it been stuck? Could this be the cause? https://forums.aws.amazon.com/thread.jspa?threadID=320543 - rok
It was stuck for about 5 hours with no progress. Afterwards, I deleted the endpoint configuration and the model to force it to fail. Nothing showed up in CloudWatch. - jon
@jon, did you ever find a solution? - Abercrombie
1 Answer


One way to investigate is to try using the same model through the AWS console as part of a Batch Transform job, since that flow seems to surface better error messages and diagnostics than inference endpoint creation does.

In my case, this made me realize that the IAM role associated with the model no longer existed. I had overlooked this because the roles were managed by CDK and had been removed at some point, while the models were created dynamically by a Step Functions pipeline.

In any case, deploying with a nonexistent role left the SageMaker endpoint in "Creating" status for several hours before failing with "Request to service failed. If failure persists after retry, contact customer support", with no CloudWatch logs. Recreating the model with a valid role resolved the issue.
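A quick pre-deployment sanity check along these lines is to verify that the model's execution role still exists in IAM. A minimal sketch, written against an injected IAM client so it can be exercised without AWS credentials (the role name in the usage comment is hypothetical):

```python
# Hypothetical helper: confirm a SageMaker model's execution role still exists
# before creating an endpoint. Accepts any IAM-client-like object.

def role_exists(iam_client, role_name):
    """Return True if IAM knows the role, False if it reports NoSuchEntity."""
    try:
        iam_client.get_role(RoleName=role_name)
        return True
    except iam_client.exceptions.NoSuchEntityException:
        return False

# Typical usage against real AWS (requires credentials; role name illustrative):
#   import boto3
#   iam = boto3.client("iam")
#   print(role_exists(iam, "my-sagemaker-execution-role"))
```

Injecting the client keeps the check testable with a stub and makes the failure mode explicit, instead of discovering the missing role hours later via a stuck endpoint.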

Apologies if the above does not apply to the OP, who reported the same symptom with a different setup I am not familiar with. I am just sharing the outcome of a similar issue that led me to this page, in case it helps anyone in the future.


Page content provided by Stack Overflow.