iis服务器助手广告广告
返回顶部
首页 > 资讯 > 数据库 >基于华为云的在线拍卖数据分析
  • 353
分享到

基于华为云的在线拍卖数据分析

mysqlhadoophive大数据 2023-09-30 21:09:35 353人浏览 八月长安
摘要

实验设备与平台: mapReduce服务 MRS弹性公网IP弹性云服务器ECS基于aarch64架构的Mysql,MiniConda,Sqoop 1 数据集准备 删除csv文件中无意义的特征:

实验设备与平台:

1 数据集准备
  1. 删除csv文件中无意义的特征:ReturnsAccepted
  2. 通过WinSCP将csv文件传输到华为云文件系统/home/zkpk/raw/位置
mkdir /home/zkpkmkdir /home/zkpk/raw
  1. 通过shell命令去除文件的首行字段
cd /home/zkpk/raw/sed -i '1d' TrainingSet.csvsed -i '1d' TestSet.csv
  1. 将csv文件上传到hdfs
hadoop fs -put TrainingSet.csv /zkpk/raw/hadoop fs -put TestSet.csv /zkpk/raw/hadoop fs -ls /zkpk/raw/
2 数据集预处理
  1. 启动Hive,并创建数据库zkpk
SHOW DATABASES;CREATE DATABASE zkpk;SHOW DATABASES;
  1. 对训练集创建外部表traingingset_log并导入数据
CREATE EXTERNAL TABLE zkpk.trainingset_log(EbayID STRING,QuantitySold INT,Price FLOAT,PricePercent FLOAT,StartingBidPercent FLOAT,SellerName STRING,SellerClosePercent DOUBLE,CateGory INT,PersonID STRING,StartingBid FLOAT,AvgPrice FLOAT,EndDay STRING,HitCount INT,AuctionAvgHitCount INT,ItemAuctionSellPercent INT,SellerSaleAvgPriceRatio DOUBLE,SellerAvg DOUBLE,SellerItemAvg INT,AuctionHitCountAvgRatio INT,BestOffer DOUBLE,IsHOF INT,ItemListedCount INT,AuctionCount INT,AuctionSaleCount INT,SellerAuctionCount INT,SellerAuctionSaleCount INT,AuctionMedianPrice FLOAT)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','STORED AS TEXTFILE;load data inpath '/zkpk/raw/TrainingSet.csv' into table trainingset_log;
  1. 测试集创建外部表testset_log并导入数据
CREATE EXTERNAL TABLE zkpk.testset_log(EbayID STRING,QuantitySold INT,Price FLOAT,PricePercent FLOAT,StartingBidPercent FLOAT,SellerName STRING,SellerClosePercent DOUBLE,Category INT,PersonID STRING,StartingBid FLOAT,AvgPrice FLOAT,EndDay STRING,HitCount INT,AuctionAvgHitCount INT,ItemAuctionSellPercent INT,SellerSaleAvgPriceRatio DOUBLE,SellerAvg DOUBLE,SellerItemAvg INT,AuctionHitCountAvgRatio INT,BestOffer DOUBLE,IsHOF INT,ItemListedCount INT,AuctionCount INT,AuctionSaleCount INT,SellerAuctionCount INT,SellerAuctionSaleCount INT,AuctionMedianPrice FLOAT)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','STORED AS TEXTFILE;load data inpath '/zkpk/raw/TestSet.csv' into table testset_log;
3 数据集分析处理
  1. 统计 TrainingSet 中拍买成功交易的平均成交价并保存
CREATE TABLE avg_price(avg_price FLOAT)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;Insert OVERWRITE TABLE avg_priceSELECT avg(Price) FROM trainingset_log WHERE QuantitySold=1;
  1. 统计 TrainingSet 中金牌卖家的拍卖成功率,降序排列并保存
CREATE TABLE success_rate_temp(SellerName STRING,Rate DOUBLE)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;INSERT OVERWRITE TABLE success_rate_temp SELECT SellerName,sum(QuantitySold)/count(QuantitySold)FROM trainingset_log WHERE IsHOF=1 GROUP BY SellerName;CREATE TABLE success_rate_desc(SellerName STRING,Rate DOUBLE)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;INSERT OVERWRITE TABLE success_rate_desc SELECT * FROM success_rate_temp ORDER BY Rate DESC;drop table success_rate_temp;
  1. 统计TrainingSet和TestSet中周一到周日,每天拍卖成功的数量及拍卖成功率并保存
CREATE TABLE train_day_rate(EndDay STRING,Success INT,Rate DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;INSERT OVERWRITE TABLE train_day_rate SELECT EndDay,sum(QuantitySold),sum(QuantitySold)/count(QuantitySold) FROM trainingset_log GROUP BY EndDay;
CREATE TABLE test_day_rate(EndDay STRING,Success INT,Rate DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;INSERT OVERWRITE TABLE test_day_rate SELECT EndDay,sum(QuantitySold),sum(QuantitySold)/count(QuantitySold) FROM testset_log GROUP BY EndDay;
  1. 筛选出TrainingSet和TestSet数据中的EbayID,Quantitiysold字段,保存为train_label文件和test_label文件
CREATE TABLE train_label(EbayID STRING, QuantitySold Int)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','STORED AS TEXTFILE;INSERT OVERWRITE TABLE train_label SELECTEbayID, QuantitySold FROM trainingset_log;
CREATE TABLE test_label(EbayID STRING, QuantitySold Int)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','STORED AS TEXTFILE;INSERT OVERWRITE TABLE test_label SELECTEbayID, QuantitySold FROM testset_log;
  1. 从TrainingSet和TestSet数据中删除的SellerName,QuantiySold,EndDay字段,并将数据导出保存为train_data文件和test_data文件
CREATE TABLE train_data (EbayID STRING,Price FLOAT,PricePercent FLOAT,StartingBidPercent FLOAT,SellerClosePercent DOUBLE,Category INT,PersonID STRING,StartingBid FLOAT,AvgPrice FLOAT,HitCount INT,AuctionAvgHitCount INT,ItemAuctionSellPercent INT,SellerSaleAvgPriceRatio DOUBLE,SellerAvg DOUBLE,SellerItemAvg INT,AuctionHitCountAvgRatio INT,BestOffer DOUBLE,IsHOF INT,ItemListedCount INT,AuctionCount INT,AuctionSaleCount INT,SellerAuctionCount INT,SellerAuctionSaleCount INT,AuctionMedianPrice FLOAT)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','STORED AS TEXTFILE;INSERT OVERWRITE TABLE train_data SELECT EbayID,Price,PricePercent,StartingBidPercent,SellerClosePercent,Category,PersonID,StartingBid,AvgPrice,HitCount,AuctionAvgHitCount,ItemAuctionSellPercent,SellerSaleAvgPriceRatio,SellerAvg,SellerItemAvg,AuctionHitCountAvgRatio,BestOffer,IsHOF,ItemListedCount,AuctionCount,AuctionSaleCount,SellerAuctionCount,SellerAuctionSaleCount,AuctionMedianPrice FROM trainingset_log;
CREATE TABLE test_data (EbayID STRING,Price FLOAT,PricePercent FLOAT,StartingBidPercent FLOAT,SellerClosePercent DOUBLE,Category INT,PersonID STRING,StartingBid FLOAT,AvgPrice FLOAT,HitCount INT,AuctionAvgHitCount INT,ItemAuctionSellPercent INT,SellerSaleAvgPriceRatio DOUBLE,SellerAvg DOUBLE,SellerItemAvg INT,AuctionHitCountAvgRatio INT,BestOffer DOUBLE,IsHOF INT,ItemListedCount INT,AuctionCount INT,AuctionSaleCount INT,SellerAuctionCount INT,SellerAuctionSaleCount INT,AuctionMedianPrice FLOAT)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','STORED AS TEXTFILE;INSERT OVERWRITE TABLE test_data SELECT EbayID,Price,PricePercent,StartingBidPercent,SellerClosePercent,Category,PersonID,StartingBid,AvgPrice,HitCount,AuctionAvgHitCount,ItemAuctionSellPercent,SellerSaleAvgPriceRatio,SellerAvg,SellerItemAvg,AuctionHitCountAvgRatio,BestOffer,IsHOF,ItemListedCount,AuctionCount,AuctionSaleCount,SellerAuctionCount,SellerAuctionSaleCount,AuctionMedianPrice FROM testset_log;
4 数据集导出
  1. 安装并配置mysql环境
# 安装wgetyum -y install wget# 远程下载Mysql压缩包wget https://mirrors.tuna.tsinghua.edu.cn/mysql/downloads/MySQL-8.0/mysql-8.0.26-1.el8.aarch64.rpm-bundle.tar# 解压MySQL并安装tar -xvf mysql-8.0.26-1.el8.aarch64.rpm-bundle.taryum install *.rpm# 启动MySQL服务并查看运行状态systemstl start mysqlsystemstl status mysql# 设置开机启动MySQL服务systemctl enable mysqldsystemctl daemon-reload# 查看临时数据库密码grep 'temporary passWord' /var/log/mysqld.log# 修改密码 未修改密码策略ALTER USER 'root'@'localhost' IDENTIFIED BY '20001215,Cj';# 启动MySQL服务mysql -uroot -p
  1. 设置远程连接用户:通过主机的Navicat for MySQL连接云服务器的MySQL
# 创建远程连接用户create user 'zkpk'@'%' identified by '20001215,Cj';# 授予用户权限grant all on *.* to 'zkpk'@'%';# 更改加密方式ALTER USER 'zkpk'@'%' IDENTIFIED BY '20001215,Cj' PASSWORD EXPIRE NEVER;# 刷新权限flush privileges;
  1. 从HDFS导出数据到本地文件系统
# 查看表格在HDFS上的存储位置show create table 表格名;
# 查看表格在HDFS上的数据存放位置下的文件内容hadoop fs -ls location# 导出表格到本地文件系统hadoop fs -get location /home/zkpk/
  1. 从HDFS导出数据到MySQL数据库
# 获取表格在HDFS的存储位置use zkpk;show create table 表格名;
# 将HDFS中文件存储到本地文件系统hadoop fs -get location /home/zkpk/result/# 登录MySQLmysql --local_infile=1 -u root -p
# 在MySQL中创建对应表格# 加载本地文件系统的表格到MySQLLOAD DATA LOCAL INFILE '/home/zkpk/result/textfile' INTO TABLE zkpk.表格名 FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
5 数据可视化展示
  1. 安装配置miniconda,并管理依赖包
wget Https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-py39_4.11.0-linux-aarch64.shsh Miniconda3-py39_4.11.0-Linux-aarch64.sh
# 进入conda环境conda activate# 加载依赖项conda install 依赖项# 运行python文件Python 路径+文件名# 退出conda环境conda deactivate
  1. 编写可视化程序
import pymysqlimport pandas as pdimport matplotlib.pyplot as pltdb = pymysql.connect(host='localhost', user='root', password='20001215,Cj', database='zkpk')cursor = db.cursor()sql = 'select EndDay, Success from train_day_rate'# sql = 'select EndDay, Rate from train_day_rate'# sql = 'select EndDay, Success from test_day_rate'# sql = 'select EndDay, Rate from test_day_rate'cursor.execute(sql)data = cursor.fetchall()cursor.close()db.close()print(data)df = pd.DataFrame(list(data),columns=['endDay', 'amount'])plt.figureplt.xlabel('day')plt.ylabel('amount')plt.bar(df['endDay'],df['amount'])plt.show()
  1. 华为云DLV组件实现可视化展示
6 拍卖成功率预测
  1. 从数据库获取数据集train_labeltrain_datatest_labeltest_data
import pandas as pdimport numpy as npimport seaborn as snsimport matplotlib.pyplot as pltimport pymysqldef getdataset():    # 连接数据库    # db = pymysql.connect(host='localhost', user='root', password='20001215,Cj', database='zkpk')    db = pymysql.connect(host='124.70.59.198', user='zkpk', password='20001215,Cj', database='zkpk')    cursor = db.cursor()    # 读取数据集 训练集    sql = 'select * from train_data'    cursor.execute(sql)    data = cursor.fetchall()    train_set= pd.DataFrame(list(data),columns=['EbayID','Price', 'PricePercent', 'StartingBidPercent',                    'SellerClosePercent','Category','PersonID',                    'StartingBid', 'AvgPrice', 'HitCount',                      'AuctionAvgHitCount',                    'ItemAuctionSellPercent', 'SellerSaleAvgPriceRatio',                    'SellerAvg', 'SellerItemAvg', 'AuctionHitCountAvgRatio',                    'BestOffer', 'IsHOF', 'ItemListedCount',                    'AuctionCount', 'AuctionSaleCount',                    'SellerAuctionCount', 'SellerAuctionSaleCount',                    'AuctionMedianPrice'])    print("the shape of train_set:", train_set.shape)    #训练集label    sql = 'select QuantitySold from train_label'    cursor.execute(sql)    data = cursor.fetchall()    train_label= pd.DataFrame(list(data),columns=['QuantitySold'])    print("the shape of train_label:", train_label.shape)    #测试集    sql = 'select * from test_data'    cursor.execute(sql)    data = cursor.fetchall()    test_set= pd.DataFrame(list(data),columns=['EbayID','Price', 'PricePercent', 'StartingBidPercent',                    'SellerClosePercent','Category','PersonID',                    'StartingBid', 'AvgPrice', 'HitCount',                    'AuctionAvgHitCount',                    'ItemAuctionSellPercent', 'SellerSaleAvgPriceRatio',                    'SellerAvg', 'SellerItemAvg', 'AuctionHitCountAvgRatio',                    'BestOffer', 'IsHOF', 'ItemListedCount',                    'AuctionCount', 'AuctionSaleCount',                    'SellerAuctionCount', 'SellerAuctionSaleCount',                    'AuctionMedianPrice'])    #测试集label    sql = 'select QuantitySold from test_label'    cursor.execute(sql)    data = cursor.fetchall()    test_label= pd.DataFrame(list(data),columns=['QuantitySold'])    cursor.close()    db.close()    #去掉与拍卖成功概率无关的特征EbayID    train_data = train_set.drop(['EbayID'], axis=1)    test_data = test_set.drop(['EbayID'], axis=1)    n_items, n_features = train_data.shape    #the number of total features    train_data.head()    return train_set, train_data, train_label, test_data, test_label
  1. 拍卖成功预测

随机小批量梯度下降法:minibatchSGDClassification.py

import pandas as pdimport matplotlib.pyplot as pltimport pymysqlimport numpy as npfrom GetDataSet import getdatasetfrom sklearn.linear_model import SGDClassifierfrom sklearn.preprocessing import StandardScalerfrom sklearn.metrics import precision_score, recall_score, f1_scoretrain_set, train_data, train_label, test_data, test_label = getdataset()# The results of mini_batch learning for SGDClassifier in the training process were drawndef plot_learning(clf, title):    plt.figure()    # Record the prediction of the last training result in this training    validationScore = []    # Record the forecast situation after adding this training result    trainScore = []    # Minimum training frequency    mini_batch = 1000    n_items = train_set.shape[0]    for i in range(int(np.ceil(n_items / mini_batch))):        x_batch = train_data[i * mini_batch: min((i + 1) * mini_batch, n_items)]        y_batch = train_label[i * mini_batch: min((i + 1) * mini_batch, n_items)]        if i > 0:            validationScore.append(clf.score(x_batch, y_batch))        clf.partial_fit(x_batch, y_batch, classes=range(5))        if i > 0:            trainScore.append(clf.score(x_batch, y_batch))        plt.plot(trainScore, label="train_score")        plt.plot(validationScore, label="validation_score")        plt.xlabel("Mini_batch")        plt.ylabel("Score")        plt.grid()        plt.title(title)        plt.savefig('test.jpg')# 对训练数据进行正则化scaler = StandardScaler()train_data = scaler.fit_transform(train_set.drop(['EbayID'], axis=1))#SGD二分类clf = SGDClassifier(penalty='l2', alpha=0.0004)plot_learning(clf, 'SGDClassifier')test_data = scaler.fit_transform(test_data)train_pred = clf.predict(train_data)test_pred = clf.predict(test_data)print("SGDClassifier training performance on testing dataset:")print("\tPrecision:%1.3f" % precision_score(test_label, test_pred, average='micro'))print("\tRecall:%1.3f" % recall_score(train_label, train_pred))print("\tF1:%1.3f \n" % f1_score(train_label, train_pred))

决策树:ExecutiveTree

import pandas as pdimport numpy as npimport seaborn as snsimport matplotlib.pyplot as pltimport pymysqlfrom sklearn import treefrom GetDataSet import getdatasettrain_set, train_data, train_label, test_data, test_label = getdataset()clf = tree.DecisionTreeClassifier()clf = clf.fit(train_data, train_label)print("Precision: ", clf.score(test_data, test_label))

逻辑回归:LogisticRegression.py

import pandas as pdimport numpy as npfrom sklearn.linear_model import LoGISticRegression as LRfrom sklearn.preprocessing import StandardScalerfrom sklearn.metrics import classification_reportfrom GetDataSet import getdataset#获取数据集train_set, train_data, train_label, test_data, test_label = getdataset()#对train_data进行标准化std = StandardScaler()train_data = std.fit_transform(train_data)#进行逻辑回归二分类lg = LR(C=1.0)lg.fit(train_data, train_label)test_predict = lg.predict(test_data)print("准确率: " ,lg.score(test_data, test_label))

来源地址:https://blog.csdn.net/weixin_52430122/article/details/123619595

您可能感兴趣的文档:

--结束END--

本文标题: 基于华为云的在线拍卖数据分析

本文链接: https://www.lsjlt.com/news/422064.html(转载时请注明来源链接)

有问题或投稿请发送至: 邮箱/279061341@qq.com    QQ/279061341

本篇文章演示代码以及资料文档资料下载

下载Word文档到电脑,方便收藏和打印~

下载Word文档
猜你喜欢
  • oracle怎么显示表的字段
    如何显示 oracle 表的字段 在 Oracle 数据库中,可以使用 DESC 命令显示表的字段。 语法: DESC table_name 参数: table_name:要显示字段的表...
    99+
    2024-05-14
    oracle
  • oracle怎么看所有的表
    在 oracle 数据库中查看所有表的步骤:连接到数据库运行查询:select table_name from user_tables; 如何使用 Oracle 查看所有表 ...
    99+
    2024-05-14
    oracle
  • oracle怎么显示行数
    如何使用 oracle 显示行数 在 Oracle 数据库中,有两种主要方法可以显示行数: 1. 使用 COUNT 函数 SELECT COUNT(*) FROM table_n...
    99+
    2024-05-14
    oracle
  • oracle怎么显示百分比
    oracle中显示百分比的方法有:使用百分号“%”;使用to_char()函数;使用format()函数(oracle 18c及更高版本);创建自定义函数。 Oracle 显...
    99+
    2024-05-14
    oracle
  • oracle怎么删除列
    oracle 中删除列的方法有两种:1)使用 alter table table_name drop column column_name 语句;2)使用 drop colum...
    99+
    2024-05-14
    oracle
  • sql怎么查看表的索引
    通过查询系统表,可以获取表的索引信息,包括索引名称、是否唯一、索引类型、索引列和行数。常用系统表有:mysql 的 information_schema.statistics、postg...
    99+
    2024-05-14
    mysql oracle
  • sql怎么查看索引
    您可以使用 sql 通过以下方法查看索引:show indexes 语句:显示表中定义的索引列表及其信息。explain 语句:显示查询计划,其中包含用于执行查询的索引。informat...
    99+
    2024-05-14
  • sql怎么查看存储过程
    如何查看 sql 存储过程的源代码:使用 show create procedure 语句直接获取创建脚本。查询 information_schema.routines 表的 routi...
    99+
    2024-05-14
  • sql怎么查看视图表
    要查看视图表,可以使用以下步骤:使用 select 语句获取视图中的数据。使用 desc 语句查看视图的架构。使用 explain 语句分析视图的执行计划。使用 dbms 提供...
    99+
    2024-05-14
    oracle python
  • sql怎么查看创建的视图
    可以通过sql查询查看已创建的视图,具体步骤包括:连接到数据库并执行查询select * from information_schema.views;查询结果将显示视图的名称、...
    99+
    2024-05-14
    mysql
软考高级职称资格查询
编程网,编程工程师的家园,是目前国内优秀的开源技术社区之一,形成了由开源软件库、代码分享、资讯、协作翻译、讨论区和博客等几大频道内容,为IT开发者提供了一个发现、使用、并交流开源技术的平台。
  • 官方手机版

  • 微信公众号

  • 商务合作