Huawei Cloud User Manual

  • Supported features — Table 1 ToolKit (latest) feature list
    - SSH remote connection: Connect to a ModelArts notebook development environment over SSH. See: Configuring PyCharm ToolKit to Remotely Connect to a Notebook.
    - Model training: Submit locally developed code to ModelArts, automatically create a new-version training job, and fetch and display the training logs locally while the job runs. See: Submitting a Training Job (New Version); Stopping a Training Job; Viewing Training Logs.
    - OBS upload/download: Upload local files or folders to OBS, and download files or folders from OBS to the local machine. See: Uploading and Downloading Files in PyCharm.
  • Viewing in OBS — When you submit a training job, the system automatically creates a folder named after the job in the OBS path you configured, to store the model, logs, and code output by training. For example, when a job named "train-job-01" is submitted, a folder named "train-job-01" is created in the "test-modelarts2" bucket, with three subfolders, "output", "log", and "code", for the output model, the logs, and the training code respectively. The "output" folder additionally creates one subfolder per training job version. Example structure:
    test-modelarts2
    |---train-job-01
        |---output
        |---log
        |---code
  • Sample — Data: the first 120 rows of the public dataset AirPassengers.csv. Example rows:
    Month,Passengers
    1949-01,112
    1949-02,118
    1949-03,132
    1949-04,129
    1949-05,121
    1949-06,135
    1949-07,148
    1949-08,148
    1949-09,136
    1949-10,119
    1949-11,104
    1949-12,118
    1950-01,115
    1950-02,126
    1950-03,141
  • Overview — This operator automatically determines the ARIMA(p,d,q)(P,D,Q)m orders. Auto ARIMA selection process:
    1. Run differencing tests (the KPSS test and the ADF test) to determine the differencing order d.
    2. Fit models: search for the optimal parameters within the bounds given by start_p, max_p, start_q, and max_q. If the seasonal option is enabled, a Canova-Hansen test is also run to determine the optimal seasonal differencing order D, after which the best P and Q hyperparameters are determined.
    3. Auto ARIMA optimizes the model according to the given information_criterion, one of ('aic', 'aicc', 'bic', 'hqic', 'oob'), and determines the best parameter combination by computing AIC and BIC values. AIC and BIC are estimators for comparing models: the lower the value, the better the model.
    Currently the Auto ARIMA operator can only run in a Notebook environment; the DLI environment is not supported.
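The selection step above ranks candidate orders by an information criterion. As a minimal illustration of how AIC and BIC compare models (this is not the operator's actual implementation, and the log-likelihood values below are made up for the example), the formulas and a lowest-AIC pick can be sketched as:

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2*ln(L). Lower is better."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: k*ln(n) - 2*ln(L). Lower is better."""
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical candidates: (p, q) order -> (log-likelihood, parameter count).
candidates = {
    (0, 1): (-412.3, 2),
    (1, 1): (-398.7, 3),
    (2, 1): (-398.1, 4),
}
n = 120  # number of observations, e.g. the AirPassengers sample above

# Pick the order with the lowest AIC.
best_order = min(candidates, key=lambda o: aic(candidates[o][0], candidates[o][1]))
print(best_order)  # → (1, 1): the extra parameter of (2, 1) is not worth the tiny gain
```

Note how AIC penalizes the (2, 1) model: its log-likelihood is barely better than (1, 1), but the extra parameter adds 2 to the criterion, so the simpler model wins.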
  • Parameter description — Table 3
    - seq_col_name (required): Time series column, used to order value_col_name. Default: none.
    - value_col_name (required): Value column. Default: none.
    - group_col_names (optional): Grouping columns; separate multiple columns with commas, e.g. "col0,col1". One time series is built per group. Default: none.
    - frequency (optional): Time series frequency, a positive integer in (0, 12]. Default: 12 (meaning 12 months per year).
    - max_order (optional): Maximum of p and q, an integer in [0, 4]. Default: 2.
    - max_seasonal_order (optional): Maximum seasonal p and q, an integer in [0, 2]. Default: 1.
    - max_diff (optional): Maximum differencing order d, an integer in [0, 2]. Default: 2.
    - max_seasonal_diff (optional): Maximum seasonal differencing order, an integer in [0, 1]. Default: 1.
    - diff (optional): Differencing order d, an integer in [0, 2]. If both diff and max_diff are set, max_diff is ignored. diff and seasonal_diff must be set together. Default: -1 (meaning diff is not specified).
    - seasonal_diff (optional): Seasonal differencing order, an integer in [0, 1]. If both seasonal_diff and max_seasonal_diff are set, max_seasonal_diff is ignored. Default: -1 (meaning seasonal_diff is not specified).
    - max_iter (optional): Maximum number of iterations, a positive integer. Default: 1500.
    - tol (optional): Tolerance, double. Default: 1e-5.
    - predict_step (optional): Number of predicted points, a number in (0, 365]. Default: 12.
    - confidence_level (optional): Prediction confidence level, a number in (0, 1). Default: 0.95.
  • Scenario — During AI development, almost every developer runs into the question of how to upload files to a notebook quickly and conveniently. Previously, ModelArts limited direct uploads to a notebook to 100 MB, so larger files could not be uploaded directly. In addition, the files to upload are not always local: they may be a GitHub open-source repository, a remote file such as an open dataset (https://nodejs.org/dist/v12.4.0/node-v12.4.0-linux-x64.tar.xz), or a file stored in OBS, none of which ModelArts could previously upload to a notebook directly. Finally, users got no feedback during an upload, such as progress or speed. The ModelArts file-upload feature addresses these three problems: it provides more upload capabilities to meet user needs and surfaces more upload details, improving the user experience.
    The current file-upload feature supports:
    - Uploading local files
    - Cloning GitHub open-source repositories
    - Uploading OBS files
    - Uploading remote files
    - Visualizing upload details
    Parent topic: Uploading Files to JupyterLab
  • Upload requirements
    - Files up to 100 MB are uploaded directly, with details such as file size, upload progress, and speed displayed.
    - Files larger than 100 MB and up to 5 GB are transferred via OBS: the system first uploads the file to OBS (an object bucket or a parallel file system), then downloads it from OBS to the notebook. After the upload completes, the file is deleted from OBS.
    - Files larger than 5 GB are uploaded by calling the ModelArts SDK or MoXing.
    - If a file with the same name already exists in the notebook's current directory, you can either overwrite it and continue the upload, or cancel.
    - Up to 10 files can be uploaded at the same time; the remaining files show "Waiting to upload". Folders cannot be uploaded; compress a folder into an archive, upload the archive to the notebook, and then extract it in the Terminal:
      unzip xxx.zip  # extract in the directory containing xxx.zip
      For more on extraction commands, search for Linux extraction commands in a mainstream search engine.
    - When multiple files are uploaded at once, the bottom of the JupyterLab window shows the total number of files and the number already uploaded.
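For files above 5 GB, the MoXing path mentioned above can be sketched as follows inside a ModelArts notebook (the bucket name and file paths are placeholders; the moxing package is preinstalled in ModelArts notebook images and is not available elsewhere):

```python
import moxing as mox

# Copy a large file from OBS into the notebook's working directory.
# "obs://bucket-name/data/big_file.tar.gz" is a placeholder path.
mox.file.copy('obs://bucket-name/data/big_file.tar.gz',
              '/home/ma-user/work/big_file.tar.gz')

# Copy in the other direction to move results back to OBS.
mox.file.copy('/home/ma-user/work/output.txt',
              'obs://bucket-name/results/output.txt')
```

This is an environment-specific sketch: it only runs where the moxing package and OBS credentials are configured, as in a ModelArts notebook.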
  • Troubleshooting — If "Permission denied" is reported when downloading a file from OBS to a notebook, check the following in order:
    - Make sure the OBS bucket being read and the notebook are in the same region, for example both in CN North-Beijing4. Cross-region access to OBS buckets is not supported. For details, see How Do I Check Whether an OBS Bucket and ModelArts Are in the Same Region?
    - Make sure the account operating the notebook has permission to read the data in the OBS bucket. If it does not, see How Do I Access Another Account's OBS Bucket from a Notebook?
  • Parameter description
    - input_columns_str: Formatted string of the dataset's feature column names, e.g. "column_a" or "column_a,column_b"
    - label_col: Target column name
    - model_input_features_col: Column name of the feature vector
    - prediction_col: Column name of the prediction result during training; default "prediction"
    - max_depth: Maximum tree depth; default 5
    - max_bins: Maximum number of bins when splitting features; default 32
    - min_instances_per_node: Minimum number of instances each node must contain after a split; default 1
    - min_info_gain: Minimum information gain; default 0
    - subsampling_rate: Sampling rate of the training set when training each tree; default 1
    - max_iter: Maximum number of iterations; default 20
    - step_size: Step size; default 0.1
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "select_columns_str": "",  # @param {"label":"select_columns_str","type":"string","required":"false","helpTip":""}
        "bucket_num": 10  # @param {"label":"bucket_num","type":"integer","required":"true","range":"(0,2147483647)","helpTip":""}
    }
    bucket_statistics____id___ = MLSBucketStatistics(**params)
    bucket_statistics____id___.run()
    # @output {"label":"dataframe","name":"bucket_statistics____id___.get_outputs()['output_port_1']","type":"DataFrame"}
  • Sample
    inputs = {
        "dataframe": None,  # @input {"label":"dataframe","type":"DataFrame"}
        "pipeline_model": None,  # @input {"label":"pipeline_model","type":"PipelineModel"}
        "gbt_regressor_model": None
    }
    params = {
        "inputs": inputs,
        "input_columns_str": "",  # @param {"label": "input_columns_str", "type": "string", "required": "false", "helpTip": ""}
        "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
        "model_input_features_col": "model_features",  # @param {"label": "model_input_features_col", "type": "string", "required": "false", "helpTip": ""}
        "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "false", "helpTip": ""}
        "max_depth": 5,  # @param {"label": "max_depth", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
        "max_bins": 32,  # @param {"label": "max_bins", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
        "min_instances_per_node": 1,  # @param {"label": "min_instances_per_node", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
        "min_info_gain": 0.0,  # @param {"label": "min_info_gain", "type": "number", "required": "false", "helpTip": ""}
        "subsampling_rate": 1.0,  # @param {"label": "subsampling_rate", "type": "number", "required": "false", "helpTip": ""}
        "loss_type": "squared",  # @param {"label": "loss_type", "type": "enum", "required": "false", "options": "squared, absolute", "helpTip": ""}
        "max_iter": 20,  # @param {"label": "max_iter", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
        "step_size": 0.1,  # @param {"label": "step_size", "type": "number", "required": "false", "helpTip": ""}
        "impurity": "variance"
    }
    gbt_regression_feature_importance____id___ = MLSGBTRegressorFeatureImportance(**params)
    gbt_regression_feature_importance____id___.run()
    # @output {"label":"dataframe","name":"gbt_regression_feature_importance____id___.get_outputs()['output_port_1']","type":"DataFrame"}
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "select_column_name": "",  # @param {"label":"select_column_name","type":"string","required":"true","helpTip":""}
        "string_bucket_show_num": 10,  # @param {"label":"string_bucket_show_num","type":"integer","required":"true","helpTip":""}
        "numerical_bucket_show_num": 10,  # @param {"label":"numerical_bucket_show_num","type":"integer","required":"true","helpTip":""}
        "numerical_interval": 0.05  # @param {"label":"numerical_interval","type":"float","required":"true","helpTip":""}
    }
    plot_bar_chart____id___ = MLSPlotBarChart(**params)
    plot_bar_chart____id___.run()
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "select_column_name": "",  # @param {"label":"select_column_name","type":"string","required":"true","helpTip":""}
        "numeric_intervals_str": "",  # @param {"label":"numeric_intervals_str","type":"string","required":"false","helpTip":""}
        "numeric_interval_length": "",  # @param {"label":"numeric_interval_length","type":"string","required":"false","helpTip":""}
        "show_share_number": 5,  # @param {"label":"show_share_number","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
        "figure_length": "",  # @param {"label":"figure_length","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
        "figure_width": ""  # @param {"label":"figure_width","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    }
    plot_pie____id___ = MLSPlotPie(**params)
    plot_pie____id___.run()
  • Input
    - inputs.dataframe (required): The input dataset. If neither pipeline_model nor gbt_regressor_model is supplied, a gradient-boosted tree regression model is trained directly on the dataset to obtain feature importance.
    - inputs.pipeline_model (optional): If present, feature importance is computed from the upstream pyspark pipeline model object pipeline_model.
    - inputs.gbt_regressor_model (optional): If present, feature importance is computed from the upstream gbt_regressor_model object.
  • Parameter description
    - select_column_name: Name of the selected column
    - numeric_intervals_str: When plotting a pie chart, a comma-separated string of the lengths of each interval
    - numeric_interval_length: If numeric_intervals_str is not set, all pie-chart intervals default to the same length; numeric_interval_length is that length
    - show_share_number: Number of pie-chart shares; default 5
    - figure_length: Length of the figure
    - figure_width: Width of the figure
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "select_columns_str": "",  # @param {"label":"select_columns_str","type":"string","required":"true","helpTip":""}
        "start_index": 0,  # @param {"label":"start_index","type":"integer","required":"true","helpTip":""}
        "end_index": 0,  # @param {"label":"end_index","type":"integer","required":"true","helpTip":""}
        "figure_length": 30,  # @param {"label":"figure_length","type":"integer","required":"false","helpTip":""}
        "figure_width": 10  # @param {"label":"figure_width","type":"integer","required":"false","helpTip":""}
    }
    plot_line____id___ = MLSPlotLine(**params)
    plot_line____id___.run()
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "select_columns_str": ""  # @param {"label":"select_columns_str","type":"string","required":"true","helpTip":""}
    }
    box_plot____id___ = MLSBoxPlot(**params)
    box_plot____id___.run()
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "start_index": "",  # @param {"label":"start_index","type":"integer","required":"true","range":"[0,2147483647]","helpTip":""}
        "end_index": "",  # @param {"label":"end_index","type":"integer","required":"true","range":"[0,2147483647]","helpTip":""}
        "x_axis_column_name": "",  # @param {"label":"x_axis_column_name","type":"string","required":"false","helpTip":""}
        "y_axis_columns_str": "",  # @param {"label":"y_axis_columns_str","type":"string","required":"false","helpTip":""}
        "figure_length": "",  # @param {"label":"figure_length","type":"integer","required":"false","range":"[0,2147483647]","helpTip":""}
        "figure_width": ""  # @param {"label":"figure_width","type":"integer","required":"false","range":"[0,2147483647]","helpTip":""}
    }
    plot_scatter____id___ = MLSPlotScatter(**params)
    plot_scatter____id___.run()
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "new_column_name_map_str": ""  # @param {"label":"new_column_name_map_str","type":"string","required":"true","helpTip":""}
    }
    change_column_name____id___ = MLSChangeColumnName(**params)
    change_column_name____id___.run()
    # @output {"label":"dataframe","name":"change_column_name____id___.get_outputs()['output_port_1']","type":"DataFrame"}
  • Parameter description
    - start_index: The scatter chart is drawn only for elements within an interval of the array converted from the dataset; start_index is the start position of that interval
    - end_index: End position of the interval described above
    - x_axis_column_name: Column name for the x axis of the scatter chart
    - y_axis_columns_str: Columns for the y axis of the scatter chart, given as a comma-separated string of column names
    - figure_length: Length of the figure
    - figure_width: Width of the figure
  • Sample
    inputs = {
        "left_dataframe": None,  # @input {"label":"left_dataframe","type":"DataFrame"}
        "right_dataframe": None  # @input {"label":"right_dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs
    }
    column_append____id___ = MLSColumnAppend(**params)
    column_append____id___.run()
    # @output {"label":"dataframe","name":"column_append____id___.get_outputs()['output_port_1']","type":"DataFrame"}
  • Parameter description
    - input_columns_str: Formatted string of the dataset's feature column names, e.g. "column_a" or "column_a,column_b"
    - label_col: Target column name
    - model_input_features_col: Column name of the feature vector
    - prediction_col: Column name of the prediction result during training; default "prediction"
    - max_depth: Maximum tree depth; default 5
    - max_bins: Maximum number of bins when splitting features; default 32
    - min_instances_per_node: Minimum number of instances each node must contain after a split; default 1
    - min_info_gain: Minimum information gain; default 0.0
    - subsampling_rate: Sampling rate of the training set when training each tree; default 1.0
    - num_trees: Number of trees; default 20
    - feature_subset_strategy: Number of features used when splitting each tree node; default "auto"
  • Sample
    inputs = {
        "dataframe": None,  # @input {"label":"dataframe","type":"DataFrame"}
        "pipeline_model": None,  # @input {"label":"pipeline_model","type":"PipelineModel"}
        "random_forest_regressor_model": None
    }
    params = {
        "inputs": inputs,
        "input_columns_str": "",  # @param {"label": "input_columns_str", "type": "string", "required": "false", "helpTip": ""}
        "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
        "model_input_features_col": "model_features",  # @param {"label": "model_input_features_col", "type": "string", "required": "false", "helpTip": ""}
        "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "false", "helpTip": ""}
        "max_depth": 5,  # @param {"label": "max_depth", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
        "max_bins": 32,  # @param {"label": "max_bins", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
        "min_instances_per_node": 1,  # @param {"label": "min_instances_per_node", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
        "min_info_gain": 0.0,  # @param {"label": "min_info_gain", "type": "number", "required": "false", "helpTip": ""}
        "impurity": "variance",
        "subsampling_rate": 1.0,  # @param {"label": "subsampling_rate", "type": "number", "required": "false", "helpTip": ""}
        "num_trees": 20,  # @param {"label": "num_trees", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
        "feature_subset_strategy": "auto"  # @param {"label": "feature_subset_strategy", "type": "enum", "options":"auto,all,onethird,sqrt,log2", "required": "false", "helpTip": ""}
    }
    rf_regression_feature_importance____id___ = MLSRandomForestRegressorFeatureImportance(**params)
    rf_regression_feature_importance____id___.run()
    # @output {"label":"dataframe","name":"rf_regression_feature_importance____id___.get_outputs()['output_port_1']","type":"DataFrame"}
  • Input
    - inputs.dataframe (required): The input dataset. If neither pipeline_model nor random_forest_regressor_model is supplied, a random forest regression model is trained directly on the dataset to obtain feature importance.
    - inputs.pipeline_model (optional): If present, feature importance is computed from the upstream pyspark pipeline model object pipeline_model.
    - inputs.random_forest_regressor_model (optional): If present, feature importance is computed from the upstream random_forest_regressor_model object.
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "fraction": 0.7,  # @param {"label":"fraction","type":"number","required":"true","range":"(0.0,1.0)","helpTip": ""}
        "seed": 0  # @param {"label":"seed","type":"integer","required":"true","range":"(0,2147483647]","helpTip":"seed"}
    }
    dataset_sample____id___ = MLSDatasetSample(**params)
    dataset_sample____id___.run()
    # @output {"label":"dataframe","name":"dataset_sample____id___.get_outputs()['output_port_1']","type":"DataFrame"}
  • Sample
    inputs = {
        "dataframe": None,  # @input {"label":"dataframe","type":"DataFrame"}
        "pipeline_model": None,  # @input {"label":"pipeline_model","type":"PipelineModel"}
        "random_forest_classify_model": None
    }
    params = {
        "inputs": inputs,
        "input_columns_str": "",  # @param {"label": "input_columns_str", "type": "string", "required": "false", "helpTip": ""}
        "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
        "model_input_features_col": "model_features",  # @param {"label": "model_input_features_col", "type": "string", "required": "false", "helpTip": ""}
        "classifier_label_index_col": "label_index",  # @param {"label": "classifier_label_index_col", "type": "string", "required": "false", "helpTip": ""}
        "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "false", "helpTip": ""}
        "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "false", "helpTip": ""}
        "max_depth": 5,  # @param {"label": "max_depth", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
        "max_bins": 32,  # @param {"label": "max_bins", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
        "min_instances_per_node": 1,  # @param {"label": "min_instances_per_node", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
        "min_info_gain": 0.0,  # @param {"label": "min_info_gain", "type": "number", "required": "false", "helpTip": ""}
        "impurity": "gini",  # @param {"label": "impurity", "type": "enum", "required": "false", "options": "entropy,gini", "helpTip": ""}
        "num_trees": 20,  # @param {"label": "num_trees", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
        "feature_subset_strategy": "all",  # @param {"label": "feature_subset_strategy", "type": "enum", "options":"auto,all,onethird,sqrt,log2", "required": "false", "helpTip": ""}
        "subsampling_rate": 1.0,  # @param {"label": "subsampling_rate", "type": "number", "required": "false", "helpTip": ""}
        "seed": 0  # @param {"label": "seed", "type": "integer", "required": "false","range":"[0,2147483647]", "helpTip": ""}
    }
    rf_classify_feature_importance____id___ = MLSRandomForestClassifierFeatureImportance(**params)
    rf_classify_feature_importance____id___.run()
    # @output {"label":"dataframe","name":"rf_classify_feature_importance____id___.get_outputs()['output_port_1']","type":"DataFrame"}
  • Parameter description
    - input_columns_str: Formatted string of the dataset's feature column names, e.g. "column_a" or "column_a,column_b"
    - label_col: Target column name
    - model_input_features_col: Column name of the feature vector
    - classifier_label_index_col: Column name of the label-encoded target column; default "label_index"
    - prediction_index_col: Column name of the predicted label index during training; default "prediction_index"
    - prediction_col: Column name of the prediction result during training; default "prediction"
    - max_depth: Maximum tree depth; default 5
    - max_bins: Maximum number of bins when splitting features; default 32
    - min_instances_per_node: Minimum number of instances each node must contain after a split; default 1
    - min_info_gain: Minimum information gain; default 0.0
    - impurity: Impurity measure, "gini" or "entropy"; default "gini"
    - num_trees: Number of trees; default 20
    - feature_subset_strategy: Number of features used when splitting each tree node; default "all"
    - subsampling_rate: Sampling rate of the training set when training each tree; default 1.0
    - seed: Random seed; default 0
  • Input
    - inputs.dataframe (required): The input dataset. If neither pipeline_model nor random_forest_classify_model is supplied, a random forest classification model is trained directly on the dataset to obtain feature importance.
    - inputs.pipeline_model (optional): If present, feature importance is computed from the upstream pyspark pipeline model object pipeline_model.
    - inputs.random_forest_classify_model (optional): If present, feature importance is computed from the upstream random_forest_classify_model object.
  • Parameter description
    - agg_operators_str: Formatted string describing the aggregation operations, e.g.:
      "sum,old_column_a,new_column_a"
      "sum,old_column_a,new_column_a;covar,old_column_b,new_column_b,additional_column_b"
      Supported aggregation operations:
      sum: sum
      sum_distinct: sum after deduplication
      avg: mean
      avg_distinct: mean after deduplication
      min: minimum
      max: maximum
      count: count
      count_distinct: count after deduplication
      stddev_pop: population standard deviation
      stddev_samp: sample standard deviation
      var_pop: population variance
      var_samp: sample variance
      covar_pop: population covariance
      covar_samp: sample covariance
      corr: correlation coefficient
      percentile_approx: approximate percentile
    - group_by_columns_str: Formatted string of the columns to group by, e.g.:
      "column_a"
      "column_a,column_b,column_c"
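To illustrate the agg_operators_str format above (this is an explanatory sketch, not the operator's internal code): operations are separated by ";", fields within an operation by ",", and operators such as covar or corr carry an additional column as a fourth field. A small parser makes the structure explicit:

```python
def parse_agg_operators(agg_operators_str):
    """Split an agg_operators_str into (operator, old_column, new_column, *extra) tuples."""
    ops = []
    for spec in agg_operators_str.split(";"):
        # Each spec is "operator,old_column,new_column[,additional_column]".
        fields = [f.strip() for f in spec.split(",")]
        ops.append(tuple(fields))
    return ops

print(parse_agg_operators(
    "sum,old_column_a,new_column_a;covar,old_column_b,new_column_b,additional_column_b"))
# → [('sum', 'old_column_a', 'new_column_a'),
#    ('covar', 'old_column_b', 'new_column_b', 'additional_column_b')]
```

The first example string in the table is a single operation; the second combines two, the covariance one naming the extra column it correlates against.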
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "agg_operators_str": "",  # @param {"label":"agg_operators_str","type":"string","required":"true","helpTip":""}
        "group_by_columns_str": ""  # @param {"label":"group_by_columns_str","type":"string","required":"true","helpTip":""}
    }
    dataset_aggregate____id___ = MLSDatasetAggerate(**params)
    dataset_aggregate____id___.run()
    # @output {"label":"dataframe","name":"dataset_aggregate____id___.get_outputs()['output_port_1']","type":"DataFrame"}