Huawei Cloud User Manual

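  • Overview The "Naive Bayes" node produces a multi-class classification model. You must specify the "Role" field of the data; four types are supported by default ("Input", "Target", "Rejected", and "ID"), and exactly one of them must be chosen. Naive Bayes is a classification method based on Bayes' theorem and the assumption that features are conditionally independent. It is simple to implement and efficient in both training and prediction, which makes it a commonly used method. For a given training data set, it first learns the joint probability distribution of input and output under the conditional-independence assumption; then, for a given input x, it uses Bayes' theorem on this model to find the output y with the largest posterior probability.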
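The two steps in the overview above (learn the distribution, then pick the label with the largest posterior) can be sketched in plain Python. This is only a minimal illustration for categorical features; the add-one (Laplace) smoothing and the function names `fit_naive_bayes` and `predict` are assumptions made for the example and are not part of the node's API.

```python
import math
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels):
    """Count classes and, per (feature index, class), the observed feature values."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value -> count
    vocab = defaultdict(set)             # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            vocab[j].add(value)
    return class_counts, value_counts, vocab


def predict(model, row):
    """Return the label with the largest (log) posterior under feature independence."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior P(y)
        for j, value in enumerate(row):
            # Smoothed estimate of P(x_j | y); add-one smoothing is an assumption.
            numerator = value_counts[(j, label)][value] + 1
            denominator = count + len(vocab[j])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = fit_naive_bayes(rows, labels)
print(predict(model, ("rainy", "cool")))
```

Working in log space avoids numerical underflow when many per-feature probabilities are multiplied together.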
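  • Overview The "Multilayer Perceptron Classification" node builds a classification model based on a feedforward artificial neural network. A feedforward network has a one-directional, multi-layer structure. Each layer contains a number of neurons; neurons in the same layer are not connected to each other, and information passes between layers in one direction only. The first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. A feedforward network with K+1 layers can be written in matrix form, where X is the feature set, w the weights, b the biases, and y the predicted value. Nodes in the hidden layers use the sigmoid function, and nodes in the output layer use the softmax function. The number of nodes in the output layer equals the number of classes.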
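As a rough illustration of the structure described in the overview above, the following sketch runs a forward pass through a small network with sigmoid hidden layers and a softmax output layer. The weights, the layer sizes, and the helper names (`affine`, `forward`) are made up for the example; the node's actual implementation is not shown in the source.

```python
import math


def sigmoid(z):
    """Element-wise logistic function used by the hidden layers."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def softmax(z):
    """Normalised exponentials used by the output layer (shifted for stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def affine(weights, bias, x):
    """Compute w.x + b, where each row of `weights` feeds one node of the layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def forward(layers, x):
    """Apply sigmoid after every hidden layer and softmax after the last layer."""
    *hidden, (w_out, b_out) = layers
    for w, b in hidden:
        x = sigmoid(affine(w, b, x))
    return softmax(affine(w_out, b_out, x))


# Hypothetical network: 2 inputs -> 3 hidden nodes -> 2 classes.
layers = [
    ([[0.5, -0.5], [0.3, 0.8], [-0.2, 0.1]], [0.0, 0.0, 0.0]),
    ([[1.0, -1.0, 0.5], [-1.0, 1.0, 0.5]], [0.0, 0.0]),
]
probs = forward(layers, [1.0, 2.0])
print(probs)
```

The output has one probability per class, which is why the size of the output layer must match the number of classes.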
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "b_output_action": True,
        "b_use_default_encoder": True,  # @param {"label": "b_use_default_encoder", "type": "boolean", "required": "true", "helpTip": ""}
        "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
        "outer_pipeline_stages": None,
        "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
        "classifier_label_index_col": "label_index",  # @param {"label": "classifier_label_index_col", "type": "string", "required": "true", "helpTip": ""}
        "classifier_feature_vector_col": "model_features",  # @param {"label": "classifier_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
        "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
        "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "true", "helpTip": ""}
        "max_iter": 100,  # @param {"label": "max_iter", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "tol": 1e-6,  # @param {"label": "tol", "type": "number", "required": "true", "range": "(0,none)", "helpTip": ""}
        "seed": 0,  # @param {"label": "seed", "type": "integer", "required": "false", "range": "[0,2147483647]", "helpTip": ""}
        "layers_str": "",  # @param {"label": "layers_str", "type": "string", "required": "false", "helpTip": ""}
        "block_size": 128,
        "step_size": 0.03,  # @param {"label": "step_size", "type": "number", "required": "true", "range": "(0,none)", "helpTip": ""}
        "solver": "l-bfgs",  # @param {"label": "solver", "type": "enum", "required": "true", "options": "gd,l-bfgs", "helpTip": ""}
        "initial_weights_str": ""  # @param {"label": "initial_weights_str", "type": "string", "required": "false", "helpTip": ""}
    }
    multilayer_perception_classifier____id___ = MLSMultilayerPerceptronClassifier(**params)
    multilayer_perception_classifier____id___.run()
    # @output {"label":"pipeline_model","name":"multilayer_perception_classifier____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
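  • Parameters
    Parameter | Sub-parameter | Description
    b_use_default_encoder | - | Whether to use the default encoding. Default: True
    input_features_str | - | Input column names joined by commas into one string, for example "column_a" or "column_a,column_b"
    label_col | - | Target column
    classifier_label_index_col | - | Name of the new column that holds the label-encoded target column. Default: "label_index"
    classifier_feature_vector_col | - | Name of the feature vector column that the operator takes as input. Default: "model_features"
    prediction_col | - | Name of the column that holds the predicted label output by the operator. Default: "prediction"
    prediction_index_col | - | Name of the column that holds the label index of the predicted label. Default: "prediction_index"
    max_iter | - | Maximum number of iterations. Default: 100
    tol | - | Convergence threshold. Default: 1e-6
    seed | - | Random seed. Default: 0
    layers_str | - | Layer sizes joined by commas into one string, for example "2,3,4" or "3"
    step_size | - | Step size. Default: 0.03
    solver | - | Optimization algorithm; l-bfgs and gd are supported. Default: "l-bfgs"
    initial_weights_str | - | Initial weights joined by commas into one string, for example "0.01" or "0.01,0.02,0.04"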
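  • Parameters
    Parameter | Sub-parameter | Description
    b_use_default_encoder | - | Whether to use the default encoding. Default: True
    input_features_str | - | Input column names joined by commas into one string, for example "column_a" or "column_a,column_b"
    label_col | - | Target column
    classifier_label_index_col | - | Name of the new column that holds the label-encoded target column. Default: "label_index"
    classifier_feature_vector_col | - | Name of the feature vector column that the operator takes as input. Default: "model_features"
    prediction_col | - | Name of the column that holds the predicted label output by the operator. Default: "prediction"
    prediction_index_col | - | Name of the column that holds the label index of the predicted label. Default: "prediction_index"
    max_iter | - | Maximum number of iterations. Default: 100
    reg_param | - | Regularization parameter. Default: 0.0
    elastic_net_param | - | Elastic net parameter. Default: 0.0
    tol | - | Convergence threshold of the iterative algorithm. Default: 1e-6
    fit_intercept | - | Whether to fit an intercept. Default: True
    standardization | - | Whether to standardize the features. Default: True
    aggregation_depth | - | Aggregation depth. Default: 2
    family | - | Label distribution used in model training; auto, binomial, and multinomial are supported. Default: "auto"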
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "b_output_action": True,
        "b_use_default_encoder": True,  # @param {"label": "b_use_default_encoder", "type": "boolean", "required": "true", "helpTip": ""}
        "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
        "outer_pipeline_stages": None,
        "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": "target label column"}
        "classifier_label_index_col": "label_index",  # @param {"label": "classifier_label_index_col", "type": "string", "required": "true", "helpTip": ""}
        "classifier_feature_vector_col": "model_features",  # @param {"label": "classifier_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
        "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
        "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "true", "helpTip": ""}
        "max_iter": 100,  # @param {"label": "max_iter", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "reg_param": 0.0,  # @param {"label": "reg_param", "type": "number", "required": "true", "range": "[0,none)", "helpTip": ""}
        "elastic_net_param": 0.0,  # @param {"label": "elastic_net_param", "type": "number", "required": "true", "range": "[0,none)", "helpTip": ""}
        "tol": 1e-6,  # @param {"label": "tol", "type": "number", "required": "true", "range": "(0,none)", "helpTip": ""}
        "fit_intercept": True,  # @param {"label": "fit_intercept", "type": "boolean", "required": "true", "helpTip": ""}
        "standardization": True,  # @param {"label": "standardization", "type": "boolean", "required": "true", "helpTip": ""}
        "aggregation_depth": 2,  # @param {"label": "aggregation_depth", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "family": "auto",  # @param {"label": "family", "type": "enum", "required": "true", "options":"auto,binomial,multinomial", "helpTip": ""}
        "lower_bounds_on_coefficients": None,
        "upper_bounds_on_coefficients": None,
        "lower_bounds_on_intercepts": None,
        "upper_bounds_on_intercepts": None
    }
    lr_classifier____id___ = MLSLogisticRegressionClassifier(**params)
    lr_classifier____id___.run()
    # @output {"label":"pipeline_model","name":"lr_classifier____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
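  • Parameters
    Parameter | Sub-parameter | Description
    b_use_default_encoder | - | Whether to use the default encoding. Default: True
    input_features_str | - | Input column names joined by commas into one string, for example "column_a" or "column_a,column_b"
    cluster_feature_vector_col | - | Name of the feature vector column that the operator takes as input. Default: "model_features"
    prediction_col | - | Prediction column output by the pyspark k-means clusterer
    k | - | Number of clusters. Default: 2
    init_mode | - | Initialization algorithm used for clustering: random or k-means. Default: "random"
    init_steps | - | Number of steps for the k-means|| initialization mode. Default: 2
    max_iter | - | Maximum number of iterations. Default: 20
    tol | - | Convergence threshold of the iterative algorithm. Default: 1e-4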
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "b_output_action": True,
        "b_use_default_encoder": True,  # @param {"label": "b_use_default_encoder", "type": "boolean", "required": "true", "helpTip": ""}
        "outer_pipeline_stages": None,
        "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
        "cluster_feature_vector_col": "model_features",  # @param {"label": "cluster_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
        "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
        "k": 2,  # @param {"label": "k", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "init_mode": "random",  # @param {"label": "init_mode", "type": "string", "required": "true", "options": "random,k-means", "helpTip": ""}
        "init_steps": 2,  # @param {"label": "init_steps", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "max_iter": 20,  # @param {"label": "max_iter", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "tol": 1e-4  # @param {"label": "tol", "type": "number", "required": "true", "range": "(0.0,none)", "helpTip": ""}
    }
    kmeans____id___ = MLSKmeans(**params)
    kmeans____id___.run()
    # @output {"label":"pipeline_model","name":"kmeans____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
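  • Overview The "Support Vector Machine Classification" node builds a linear support vector machine model and supports both binary and multi-class classification. The node optimizes an L2-SVM model with the Trust Region Newton Method (TRON), which suits modeling on large-scale data and trains more efficiently. The implementation is summarized as follows. Binary classification: given a training set and a penalty coefficient, TRON solves the unconstrained L2-SVM optimization problem to obtain the weight vector and the bias, and a decision function then predicts the class label of a new sample. Multi-class classification: a one-vs-the-rest strategy is used. During training, the samples of one class are treated as one group and all remaining samples as the other group, in turn for each class; this turns the task into k binary problems and yields k binary SVM classifiers. At prediction time, an unknown sample is assigned to the class whose classifier gives the largest decision value.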
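The overview above does not reproduce the objective or the decision function, so the sketch below uses the commonly cited L2-SVM form (L2 regularization plus squared hinge loss) as an assumption, together with a sign-based decision rule and one-vs-the-rest prediction. All function names and the example weights are hypothetical; training with TRON is not shown.

```python
def decision_value(w, b, x):
    """Linear decision value w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def l2_svm_objective(w, b, X, y, C):
    """Assumed objective: 0.5 * ||w||^2 + C * sum(max(0, 1 - y_i * f(x_i))^2)."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    loss = sum(max(0.0, 1.0 - yi * decision_value(w, b, xi)) ** 2 for xi, yi in zip(X, y))
    return regularizer + C * loss


def predict_binary(w, b, x):
    """Return +1 or -1 from the sign of the decision value."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def predict_one_vs_rest(classifiers, x):
    """Pick the class whose binary classifier gives the largest decision value."""
    return max(classifiers, key=lambda label: decision_value(*classifiers[label], x))


# Hypothetical, already-trained binary classifiers, one (w, b) pair per class.
classifiers = {
    "a": ([1.0, 0.0], 0.0),
    "b": ([0.0, 1.0], 0.0),
    "c": ([-1.0, -1.0], 0.0),
}
print(predict_one_vs_rest(classifiers, [0.2, 0.9]))
print(l2_svm_objective([1.0, 0.0], 0.0, [[2.0, 0.0], [-0.5, 0.0]], [1, -1], C=1.0))
```

With k classes the dictionary holds k classifiers, matching the k binary problems described in the overview.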
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "b_output_action": True,
        "b_use_default_encoder": True,  # @param {"label": "b_use_default_encoder", "type": "boolean", "required": "true", "helpTip": ""}
        "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
        "outer_pipeline_stages": None,
        "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": "target label column"}
        "classifier_label_index_col": "label_index",  # @param {"label": "classifier_label_index_col", "type": "string", "required": "true", "helpTip": ""}
        "classifier_feature_vector_col": "model_features",  # @param {"label": "classifier_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
        "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "true", "helpTip": ""}
        "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
        "max_iter": 100,  # @param {"label": "max_iter", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "reg_param": 0.0,  # @param {"label": "reg_param", "type": "number", "required": "true", "range": "[0,none)", "helpTip": ""}
        "tol": 1e-6,  # @param {"label": "tol", "type": "number", "required": "true", "range": "(0,none)", "helpTip": ""}
        "fit_intercept": True,  # @param {"label": "fit_intercept", "type": "boolean", "required": "true", "helpTip": ""}
        "standardization": True,  # @param {"label": "standardization", "type": "boolean", "required": "true", "helpTip": ""}
        "aggregation_depth": 2  # @param {"label": "aggregation_depth", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    }
    linear_svc_classifier____id___ = MLSLinearSVCClassifier(**params)
    linear_svc_classifier____id___.run()
    # @output {"label":"pipeline_model","name":"linear_svc_classifier____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
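  • Parameters
    Parameter | Sub-parameter | Description
    b_use_default_encoder | - | Whether to use the default encoding. Default: True
    input_features_str | - | Input column names joined by commas into one string, for example "column_a" or "column_a,column_b"
    label_col | - | Target column
    classifier_label_index_col | - | Name of the new column that holds the label-encoded target column. Default: "label_index"
    classifier_feature_vector_col | - | Name of the feature vector column that the operator takes as input. Default: "model_features"
    prediction_index_col | - | Name of the column that holds the label index of the predicted label. Default: "prediction_index"
    prediction_col | - | Name of the column that holds the predicted label output by the operator. Default: "prediction"
    max_iter | - | Maximum number of iterations. Default: 100
    reg_param | - | Regularization coefficient. Default: 0.0
    tol | - | Convergence threshold. Default: 1e-6
    fit_intercept | - | Whether to fit an intercept. Default: True
    standardization | - | Whether to standardize the training features before training the model. Default: True
    aggregation_depth | - | Aggregation depth. Default: 2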
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "b_output_action": True,
        "b_use_default_encoder": True,  # @param {"label": "b_use_default_encoder", "type": "boolean", "required": "true", "helpTip": ""}
        "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
        "outer_pipeline_stages": None,
        "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": "target label column"}
        "classifier_label_index_col": "label_index",  # @param {"label": "classifier_label_index_col", "type": "string", "required": "true", "helpTip": ""}
        "classifier_feature_vector_col": "model_features",  # @param {"label": "classifier_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
        "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "true", "helpTip": ""}
        "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
        "max_depth": 5,  # @param {"label": "max_depth", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "max_bins": 32,  # @param {"label": "max_bins", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "min_instances_per_node": 1,  # @param {"label": "min_instances_per_node", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "min_info_gain": 0.0,  # @param {"label": "min_info_gain", "type": "number", "required": "true", "range": "[0,none)", "helpTip": ""}
        "impurity": "gini",  # @param {"label": "impurity", "type": "enum", "required": "true", "options": "entropy,gini", "helpTip": ""}
        "num_trees": 20,  # @param {"label": "num_trees", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "feature_subset_strategy": "all",  # @param {"label": "feature_subset_strategy", "type": "enum", "required": "true", "options":"auto,all,onethird,sqrt,log2", "helpTip": ""}
        "subsampling_rate": 1.0,  # @param {"label": "subsampling_rate", "type": "number", "required": "true", "range": "(0,1.0]", "helpTip": ""}
        "seed": 0  # @param {"label": "seed", "type": "integer", "required": "true", "range":"[0,2147483647]","helpTip": "seed"}
    }
    rf_classifier____id___ = MLSRandomForestClassifier(**params)
    rf_classifier____id___.run()
    # @output {"label":"pipeline_model","name":"rf_classifier____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
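  • Parameters
    Parameter | Sub-parameter | Description
    b_use_default_encoder | - | Whether to use the default encoding. Default: True
    input_features_str | - | Input column names joined by commas into one string, for example "column_a" or "column_a,column_b"
    label_col | - | Target column
    classifier_label_index_col | - | Target column passed to the classifier; must be a numeric column
    classifier_feature_vector_col | - | Feature column passed to the classifier; must be a vector column
    prediction_col | - | Name of the column that holds the predicted label output by the operator. Default: "prediction"
    prediction_index_col | - | Name of the column that holds the label index of the predicted label. Default: "prediction_index"
    max_depth | - | Maximum depth of a tree. Default: 5
    max_bins | - | Maximum number of bins. Default: 32
    min_instances_per_node | - | Minimum number of instances each child node must contain after a split. Default: 1
    min_info_gain | - | Minimum information gain required for a node to be split. Default: 0
    impurity | - | Method used to compute information gain; entropy and gini are supported. Default: "gini"
    num_trees | - | Number of trees. Default: 20
    feature_subset_strategy | - | Strategy for choosing the feature columns considered at each split; auto, all, onethird, sqrt, log2, and n are supported. Default: "all"
    subsampling_rate | - | Fraction of the training set used to learn each decision tree. Default: 1.0
    seed | - | Random seed. Default: 0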
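  • Overview The "Random Decision Forest Classification" node produces a binary or multi-class classification model. A random decision forest is a forest model built in a randomized way. The forest consists of many decision trees that are independent of each other. When a new sample arrives, every tree in the forest makes its own decision, and the sample is assigned to the class chosen by the most trees. The decision tree algorithm in this node uses Gini impurity or entropy to quantify how ordered a set is and to evaluate a split. Gini impurity is the expected error rate when a result drawn at random from the set is applied to a data item in the set. Entropy, a concept from information theory, measures the disorder of a set: the larger the entropy, the more mixed the set, and the smaller the entropy, the more ordered the set. In both measures, f_i is the fraction of all samples that belong to class i, and C is the number of classes.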
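The two impurity measures named in the overview above can be sketched with the usual textbook definitions in terms of f_i and C: Gini impurity as the sum of f_i(1 - f_i) over the C classes, and entropy as the negative sum of f_i log f_i. These definitions, the base-2 logarithm, and the helper names are assumptions for illustration, since the source's own formulas are not reproduced here.

```python
import math
from collections import Counter


def class_fractions(labels):
    """Return f_i, the fraction of samples in each class."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: sum over classes of f_i * (1 - f_i)."""
    return sum(f * (1.0 - f) for f in class_fractions(labels))


def entropy(labels):
    """Entropy: -sum over classes of f_i * log2(f_i); base 2 is an assumption."""
    return -sum(f * math.log2(f) for f in class_fractions(labels))


mixed = ["a", "a", "b", "b"]
pure = ["a", "a", "a", "a"]
print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
```

Both measures are zero for a pure set and largest when the classes are evenly mixed, which is why a split that lowers them is preferred.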
  • Sample
    inputs = {
        "dataframe": None,  # @input {"label":"dataframe","type":"DataFrame"}
        "pipeline_model": None  # @input {"label":"pipeline_model","type":"PipelineModel"}
    }
    params = {
        "inputs": inputs
    }
    model_predict____id___ = MLSModelPredict(**params)
    model_predict____id___.run()
    # @output {"label":"dataframe","name":"model_predict____id___.get_outputs()['output_port_1']","type":"DataFrame"}
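  • Overview Bisecting k-means is a form of hierarchical clustering, a commonly used method in cluster analysis. Hierarchical clustering generally follows one of two strategies:
    Agglomerative: a bottom-up approach in which every observation starts as its own cluster and clusters are then merged pairwise.
    Divisive: a top-down approach in which all observations start in one cluster, which is then split recursively. Bisecting k-means is a divisive method.
    Bisecting k-means improves on k-means: it runs faster because it needs fewer similarity computations, and it helps overcome the tendency of k-means to converge to a local minimum. The general procedure is:
    1. Initialize all data as a single cluster and split it into two clusters.
    2. Select a cluster that meets the condition for splitting. The condition takes into account both the number of elements in the cluster and the clustering cost, that is, the sum of squared errors (SSE), which is computed from a weight and the mean of all points in the cluster.
    3. Use k-means to split the selected cluster into two clusters.
    4. Repeat steps 2 and 3 until the stopping condition is met.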
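The procedure in the overview above can be sketched in plain Python. This is a simplified illustration under stated assumptions: the cluster to split is chosen by largest SSE only (the node also considers cluster size), all points carry equal weight, and the stopping condition is simply reaching k clusters. The function names are hypothetical.

```python
import random


def mean(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sse(points):
    """Sum of squared errors of a cluster around its mean (unweighted)."""
    center = mean(points)
    return sum(sq_dist(p, center) for p in points)


def two_means(points, rng, iters=20):
    """Split one cluster into two with a basic k-means loop (k = 2)."""
    centers = rng.sample(points, 2)
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            nearest = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            groups[nearest].append(p)
        if not groups[0] or not groups[1]:
            break
        centers = [mean(g) for g in groups]
    return groups


def bisecting_kmeans(points, k, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]  # step 1: everything starts in one cluster
    while len(clusters) < k:   # step 4: repeat until k clusters exist
        candidates = [c for c in clusters if len(set(c)) >= 2]
        if not candidates:
            break
        target = max(candidates, key=sse)     # step 2: pick the costliest cluster
        left, right = two_means(target, rng)  # step 3: split it in two
        if not left or not right:
            break
        clusters.remove(target)
        clusters += [left, right]
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = bisecting_kmeans(points, k=2)
print(sorted(sorted(c) for c in clusters))
```

Each split only compares points against two centers inside one cluster, which is where the saving in similarity computations comes from.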
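  • Parameters
    Parameter | Sub-parameter | Description
    input_features_str | - | Input column names joined by commas into one string, for example "column_a" or "column_a,column_b"
    cluster_feature_vector_col | - | Name of the feature vector column that the operator takes as input. Default: "model_features"
    prediction_col | - | Name of the column that holds the predicted label output by the operator. Default: "prediction"
    k | - | Desired number of clusters. Default: 2
    max_iter | - | Maximum number of iterations. Default: 100
    min_divisible_cluster_size | - | If the value is greater than or equal to 1, it is the minimum number of points in a divisible cluster; if it is less than 1, it is the minimum fraction of all points that a divisible cluster must contain. Default: 1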
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "b_output_action": True,
        "b_use_default_encoder": True,
        "outer_pipeline_stages": None,
        "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
        "cluster_feature_vector_col": "model_features",  # @param {"label": "cluster_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
        "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
        "k": 2,  # @param {"label": "k", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "max_iter": 100,  # @param {"label": "max_iter", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "min_divisible_cluster_size": 1.0  # @param {"label": "min_divisible_cluster_size", "type": "number", "required": "true", "range": "(0,none)", "helpTip": ""}
    }
    bisecting_kmeans____id___ = MLSBisectingKmeans(**params)
    bisecting_kmeans____id___.run()
    # @output {"label":"pipeline_model","name":"bisecting_kmeans____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
  • Sample
    inputs = {
        "predict_dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
        "probability_col": "probability",  # @param {"label": "probability_col", "type": "string", "required": "true", "helpTip": ""}
        "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "true", "helpTip": ""}
        "label_index_col": "label_index"  # @param {"label": "label_index_col", "type": "string", "required": "true", "helpTip": ""}
    }
    binary_class_evaluation____id___ = MLSBinaryClassEvaluation(**params)
    binary_class_evaluation____id___.run()
    # @output {"label":"dataframe","name":"binary_class_evaluation____id___.get_outputs()['output_port_1']","type":"DataFrame"}
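  • Parameters
    Parameter | Sub-parameter | Description
    input_features_str | - | Input column names joined by commas into one string, for example "column_a" or "column_a,column_b"
    cluster_feature_vector_col | - | Name of the feature vector column that the operator takes as input. Default: "model_features"
    prediction_col | - | Name of the column that holds the predicted label output by the operator. Default: "prediction"
    probability_col | - | Name of the column that holds the predicted probabilities output by the operator. Default: "probability"
    k | - | Number of clusters. Default: 2
    max_iter | - | Maximum number of iterations. Default: 100
    tol | - | Convergence threshold. Default: 0.01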
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "b_output_action": True,
        "b_use_default_encoder": True,
        "outer_pipeline_stages": None,
        "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
        "cluster_feature_vector_col": "model_features",  # @param {"label": "cluster_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
        "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
        "probability_col": "probability",  # @param {"label": "probability_col", "type": "string", "required": "true", "helpTip": ""}
        "k": 2,  # @param {"label": "k", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "max_iter": 100,  # @param {"label": "max_iter", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "tol": 0.01  # @param {"label": "tol", "type": "number", "required": "true", "range": "(0,none)", "helpTip": ""}
    }
    gaussian_mixture____id___ = MLSGaussianMixture(**params)
    gaussian_mixture____id___.run()
    # @output {"label":"pipeline_model","name":"gaussian_mixture____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
  • Sample
    inputs = {
        "predict_dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
        "prediction_col": "prediction"  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
    }
    regression_evaluation____id___ = MLSRegressionEvaluation(**params)
    regression_evaluation____id___.run()
    # @output {"label":"dataframe","name":"regression_evaluation____id___.get_outputs()['output_port_1']","type":"DataFrame"}
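  • Parameters
    Parameter | Sub-parameter | Description
    b_use_default_encoder | - | Whether to use the default encoding. Default: True
    input_features_str | - | Input column names joined by commas into one string, for example "column_a" or "column_a,column_b"
    label_col | - | Target column
    regressor_feature_vector_col | - | Name of the feature vector column that the operator takes as input. Default: "model_features"
    max_depth | - | Maximum depth of the tree. Default: 5
    max_bins | - | Maximum number of bins. Default: 32
    min_instances_per_node | - | Minimum number of instances each child node must contain after a split. Default: 1
    min_info_gain | - | Minimum information gain. Default: 0.0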
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "b_output_action": True,
        "b_use_default_encoder": True,  # @param {"label": "b_use_default_encoder", "type": "boolean", "required": "true", "helpTip": ""}
        "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
        "outer_pipeline_stages": None,
        "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
        "regressor_feature_vector_col": "model_features",  # @param {"label": "regressor_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
        "max_depth": 5,  # @param {"label": "max_depth", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "max_bins": 32,  # @param {"label": "max_bins", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "min_instances_per_node": 1,  # @param {"label": "min_instances_per_node", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "min_info_gain": 0.0,  # @param {"label": "min_info_gain", "type": "number", "required": "true", "range": "[0.0,none)", "helpTip": ""}
        "impurity": "variance"
    }
    dt_regressor____id___ = MLSDecisionTreeRegression(**params)
    dt_regressor____id___.run()
    # @output {"label":"pipeline_model","name":"dt_regressor____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
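  • Parameters
    Parameter | Sub-parameter | Description
    user_col | - | Name of the column that holds user IDs
    item_col | - | Name of the column that holds item IDs
    rating_col | - | Name of the column that holds ratings
    recommend_nums | - | Number of items to recommend. Default: 10
    prediction_col | - | Name of the prediction column. Default: "prediction"
    cold_start_strategy | - | Cold-start strategy. Default: "nan"
    alpha | - | Regularization coefficient of the matrix factorization. Default: 1.0
    implicit_prefs | - | Whether to use implicit preferences. Default: False
    max_iter | - | Maximum number of iterations. Default: 50
    non_negative | - | Whether to apply non-negativity constraints. Default: False
    rank | - | Rank of the factorization. Default: 10
    reg_param | - | Regularization coefficient. Default: 0.0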
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "b_output_action": True,
        "b_use_default_encoder": True,  # @param {"label": "b_use_default_encoder", "type": "boolean", "required": "true", "helpTip": ""}
        "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
        "outer_pipeline_stages": None,
        "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": "target label column"}
        "regressor_feature_vector_col": "model_features",  # @param {"label": "regressor_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
        "max_depth": 5,  # @param {"label": "max_depth", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "max_bins": 32,  # @param {"label": "max_bins", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "min_instances_per_node": 1,  # @param {"label": "min_instances_per_node", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "min_info_gain": 0.0,  # @param {"label": "min_info_gain", "type": "number", "required": "true", "range": "[0.0,none)", "helpTip": ""}
        "subsampling_rate": 1.0,  # @param {"label": "subsampling_rate", "type": "number", "required": "true", "range": "(0.0,1.0]", "helpTip": ""}
        "loss_type": "squared",  # @param {"label": "loss_type", "type": "enum", "required": "true", "options": "squared,absolute", "helpTip": ""}
        "max_iter": 20,  # @param {"label": "max_iter", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
        "step_size": 0.1,  # @param {"label": "step_size", "type": "number", "required": "true", "range": "(0.0,none)", "helpTip": ""}
        "impurity": "variance"
    }
    gbt_regressor____id___ = MLSGBTRegression(**params)
    gbt_regressor____id___.run()
    # @output {"label":"pipeline_model","name":"gbt_regressor____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
  • Sample
    inputs = {
        "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
    }
    params = {
        "inputs": inputs,
        "user_col": "",  # @param {"label":"user_col","type":"string","required":"true","helpTip":""}
        "item_col": "",  # @param {"label":"item_col","type":"string","required":"true","helpTip":""}
        "rating_col": "",  # @param {"label":"rating_col","type":"string","required":"true","helpTip":""}
        "recommend_nums": 10,  # @param {"label":"recommend_nums","type":"integer","required":"false","range":"(0,2147483647)","helpTip":""}
        "prediction_col": "prediction",  # @param {"label":"prediction_col","type":"string","required":"false","helpTip":""}
        "cold_start_strategy": "nan",  # @param {"label":"cold_start_strategy","type":"string","required":"false","helpTip":""}
        "alpha": 1,  # @param {"label":"alpha","type":"number","required":"false","range":"(none,none)","helpTip":""}
        "implicit_prefs": False,  # @param {"label":"implicit_prefs","type":"boolean","required":"false","helpTip":""}
        "max_iter": 10,  # @param {"label":"max_iter","type":"integer","required":"false","range":"(0,2147483647)","helpTip":""}
        "non_negative": False,  # @param {"label":"non_negative","type":"boolean","required":"false","helpTip":""}
        "rank": 10,  # @param {"label":"rank","type":"integer","required":"false","range":"(0,2147483647)","helpTip":""}
        "reg_param": 0.1  # @param {"label":"reg_param","type":"number","required":"false","range":"(none,none)","helpTip":""}
    }
    als____id___ = MLSALS(**params)
    als____id___.run()
    # @output {"label":"pipeline_model","name":"als____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
    # @output {"label":"dataframe","name":"als____id___.get_outputs()['output_port_2']","type":"DataFrame"}
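  • Parameters
    Parameter | Sub-parameter | Description
    b_use_default_encoder | - | Whether to use the default encoding. Default: True
    input_features_str | - | Input column names joined by commas into one string, for example "column_a" or "column_a,column_b"
    label_col | - | Target column
    regressor_feature_vector_col | - | Name of the feature vector column that the operator takes as input. Default: "model_features"
    max_depth | - | Maximum depth of a tree. Default: 5
    max_bins | - | Maximum number of bins. Default: 32
    min_instances_per_node | - | Minimum number of instances each child node must contain after a split. Default: 1
    min_info_gain | - | Minimum information gain required for a node to be split. Default: 0.0
    subsampling_rate | - | Sampling fraction of the training set used to learn each decision tree. Default: 1.0
    loss_type | - | Type of loss function; squared and absolute are supported. Default: "squared"
    max_iter | - | Maximum number of iterations. Default: 20
    step_size | - | Step size. Default: 0.1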
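  • Overview The "Gradient Boosted Tree Regression" node produces a regression model using an iterative regression algorithm based on decision trees. The algorithm builds decision trees one after another; each tree is built by following the gradient of the loss function, so that the prediction moves step by step from a baseline value toward the target value. Put simply, each new model corrects the cases that the previous model predicted wrongly, so the model keeps improving over the iterations and reaches a good predictive result. The loss function of gradient boosted tree regression is the squared-error loss, where N is the number of samples, x_i the features of sample i, y_i the label of sample i, and F(x_i) the predicted label of sample i.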
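The idea in the overview above, where each new tree corrects the errors of the model so far, can be sketched with one-split trees (stumps) on a single feature. This is only an illustration: the loss is taken as the mean of (y_i - F(x_i))^2 over the N samples, which is an assumption about the exact form, and `fit_stump` and `boost` are hypothetical helpers, not the node's implementation.

```python
def squared_loss(ys, preds):
    """Mean of (y_i - F(x_i))^2 over the N samples (assumed form of the loss)."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals; needs >= 2 distinct xs."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = sum(left) / len(left), sum(right) / len(right)
        error = sum(
            (r - (left_value if x <= threshold else right_value)) ** 2
            for x, r in zip(xs, residuals)
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, t, left_value, right_value = best
    return lambda x: left_value if x <= t else right_value


def boost(xs, ys, max_iter=20, step_size=0.1):
    """Start from the mean label, then repeatedly fit a stump to the residuals."""
    preds = [sum(ys) / len(ys)] * len(xs)
    history = [squared_loss(ys, preds)]
    for _ in range(max_iter):
        # For squared loss, the residual y - F(x) is the negative gradient direction.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        preds = [p + step_size * tree(x) for p, x in zip(preds, xs)]
        history.append(squared_loss(ys, preds))
    return preds, history


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
preds, history = boost(xs, ys)
print(history[0], history[-1])
```

The `max_iter` and `step_size` arguments mirror the roles of the node's parameters of the same names: more iterations add more trees, and a smaller step size makes each correction more cautious.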
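  • Parameters
    Parameter | Sub-parameter | Description
    input_features_str | - | Input column names joined by commas into one string, for example "column_a" or "column_a,column_b"
    label_col | - | Target column
    regressor_feature_vector_col | - | Name of the feature vector column that the operator takes as input. Default: "model_features"
    prediction_col | - | Name of the column that holds the predicted label output by the operator. Default: "prediction"
    objective | - | Objective function. Default: "regression"
    max_depth | - | Maximum depth of a tree. Default: -1
    num_iteration | - | Number of iterations. Default: 100
    learning_rate | - | Learning rate. Default: 0.1
    num_leaves | - | Number of leaves. Default: 31
    max_bin | - | Maximum number of bins. Default: 255
    bagging_fraction | - | Bagging fraction. Default: 1
    bagging_freq | - | Bagging frequency. Default: 0
    bagging_seed | - | Random seed used for bagging. Default: 3
    early_stopping_round | - | Number of rounds after which iteration stops early. Default: 0
    feature_fraction | - | Fraction of features. Default: 1.0
    min_sum_hessian_in_leaf | - | Minimum sum of hessians in one leaf. Value range: [0, 1]. Default: 1e-3
    boost_from_average | - | Whether to adjust the initial score to the mean of the labels to speed up convergence. Default: True
    boosting_type | - | Boosting type. Options: gbdt, gbrt, rf, dart, goss. Default: gbdt
    lambda_l1 | - | L1 regularization coefficient. Default: 0.0
    lambda_l2 | - | L2 regularization coefficient. Default: 0.0
    num_batches | - | If greater than 0, the data set is split into separate batches during training. Default: 0
    parallelism | - | Parallel method used when learning trees; data_parallel and voting_parallel are supported. Default: "data_parallel"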