fasttext的python实现---

fasttext python 2023-01-31 01:01:51 154人浏览泡泡鱼

Python 官方文档：入门教程 => 点击学习

摘要

转载请注明作者和出处：Http://blog.csdn.net/john_bh/ 一、简介二、FastText原理2.1 模型架构2.2 层次SoftMax2.3 N-gram特征三、基于fastText实现文本分类3.1 fast

转载请注明作者和出处：Http://blog.csdn.net/john_bh/

一、简介
二、FastText原理
- 2.1 模型架构
- 2.2 层次SoftMax
- 2.3 N-gram特征
三、基于fastText实现文本分类
- 3.1 fastText有监督学习分类
- 3.2 fastText有监督学习分类
三、总结
- 3.1 fastText和Word2vec的区别
- 3.2 小结

fasttext是facebook开源的一个词向量与文本分类工具，在2016年开源，典型应用场景是“带监督的文本分类问题”。提供简单而高效的文本分类和表征学习的方法，性能比肩深度学习而且速度更快。

fastText结合了自然语言处理和机器学习中最成功的理念。这些包括了使用词袋以及n-gram袋表征语句，还有使用子字(subword)信息，并通过隐藏表征在类别间共享信息。我们另外采用了一个softmax层级(利用了类别不均衡分布的优势)来加速运算过程。

这些不同概念被用于两个不同任务：

有效文本分类 ：有监督学习
学习词向量表征：无监督学习

举例来说：fastText能够学会“男孩”、“女孩”、“男人”、“女人”指代的是特定的性别，并且能够将这些数值存在相关文档中。然后，当某个程序在提出一个用户请求（假设是“我女友现在在儿？”），它能够马上在fastText生成的文档中进行查找并且理解用户想要问的是有关女性的问题。

fastText方法包含三部分，模型架构，层次SoftMax和N-gram特征。

2.1 模型架构

fastText的架构和word2vec中的CBOW的架构类似，因为它们的作者都是Facebook的科学家Tomas Mikolov，而且确实fastText也算是words2vec所衍生出来的。

Continuous Bog-Of-Words：

fastText

fastText 模型输入一个词的序列（一段文本或者一句话)，输出这个词序列属于不同类别的概率。
序列中的词和词组组成特征向量，特征向量通过线性变换映射到中间层，中间层再映射到标签。
fastText 在预测标签时使用了非线性激活函数，但在中间层不使用非线性激活函数。

fastText 模型架构和 Word2Vec 中的 CBOW 模型很类似。不同之处在于，fastText 预测标签，而 CBOW 模型预测中间词。

__label__2 , birchas chaim , yeshiva birchas chaim is a orthodox jewish mesivta high school in lakewood township new jersey . it was founded by rabbi shmuel zalmen stein in 2001 after his father rabbi chaim stein asked him to open a branch of telshe yeshiva in lakewood . as of the 2009-10 school year the school had an enrollment of 76 students and 6 . 6 classroom teachers ( on a fte basis ) for a student–teacher ratio of 11 . 5 1 . 
__label__6 , motor torpedo boat pt-41 , motor torpedo boat pt-41 was a pt-20-class motor torpedo boat of the united states navy built by the electric launch company of bayonne new jersey . the boat was laid down as motor boat submarine chaser ptc-21 but was reclassified as pt-41 prior to its launch on 8 july 1941 and was completed on 23 july 1941 . 
__label__11 , passiflora picturata , passiflora picturata is a species of passion flower in the passifloraceae family . 
__label__13 , naya din nai raat , naya din nai raat is a 1974 bollywood drama film directed by a . bhimsingh . the film is famous as sanjeev kumar reprised the nine-role epic perfORMance by sivaji ganesan in navarathri ( 1964 ) which was also previously reprised by akkineni nageswara rao in navarathri ( telugu 1966 ) . this film had enhanced his status and reputation as an actor in hindi cinema .

其中前面的label是前缀，也可以自己定义，label后接的为类别。

具体代码：

# -*- coding:utf-8 -*-
import pandas as pd
import random
import fasttext
import jieba
from sklearn.model_selection import train_test_split

cate_dic = {'technology': 1, 'car': 2, 'entertainment': 3, 'military': 4, 'sports': 5}
"""
函数说明：加载数据
"""
def loadData():


    #利用pandas把数据读进来
    df_technology = pd.read_csv("./data/technology_news.csv",encoding ="utf-8")
    df_technology=df_technology.dropna()    #去空行处理

    df_car = pd.read_csv("./data/car_news.csv",encoding ="utf-8")
    df_car=df_car.dropna()

    df_entertainment = pd.read_csv("./data/entertainment_news.csv",encoding ="utf-8")
    df_entertainment=df_entertainment.dropna()

    df_military = pd.read_csv("./data/military_news.csv",encoding ="utf-8")
    df_military=df_military.dropna()

    df_sports = pd.read_csv("./data/sports_news.csv",encoding ="utf-8")
    df_sports=df_sports.dropna()

    technology=df_technology.content.values.tolist()[1000:21000]
    car=df_car.content.values.tolist()[1000:21000]
    entertainment=df_entertainment.content.values.tolist()[:20000]
    military=df_military.content.values.tolist()[:20000]
    sports=df_sports.content.values.tolist()[:20000]

    return technology,car,entertainment,military,sports

"""
函数说明：停用词
参数说明：
    datapath：停用词路径
返回值：
    stopwords:停用词
"""
def getStopWords(datapath):
    stopwords=pd.read_csv(datapath,index_col=False,quoting=3,sep="\t",names=['stopword'], encoding='utf-8')
    stopwords=stopwords["stopword"].values
    return stopwords

"""
函数说明：去停用词
参数：
    content_line：文本数据
    sentences：存储的数据
    cateGory：文本类别
"""
def preprocess_text(content_line,sentences,category,stopwords):
    for line in content_line:
        try:
            segs=jieba.lcut(line)    #利用结巴分词进行中文分词
            segs=filter(lambda x:len(x)>1,segs)    #去掉长度小于1的词
            segs=filter(lambda x:x not in stopwords,segs)    #去掉停用词
            sentences.append("__lable__"+str(category)+" , "+" ".join(segs))    #把当前的文本和对应的类别拼接起来，组合成fasttext的文本格式
        except Exception as e:
            print (line)
            continue

"""
函数说明：把处理好的写入到文件中，备用
参数说明：

"""
def writeData(sentences,fileName):
    print("writing data to fasttext format...")
    out=open(fileName,'w')
    for sentence in sentences:
        out.write(sentence.encode('utf8')+"\n")
    print("done!")

"""
函数说明：数据处理
"""
def preprocessData(stopwords,saveDataFile):
    technology,car,entertainment,military,sports=loadData()    

    #去停用词，生成数据集
    sentences=[]
    preprocess_text(technology,sentences,cate_dic["technology"],stopwords)
    preprocess_text(car,sentences,cate_dic["car"],stopwords)
    preprocess_text(entertainment,sentences,cate_dic["entertainment"],stopwords)
    preprocess_text(military,sentences,cate_dic["military"],stopwords)
    preprocess_text(sports,sentences,cate_dic["sports"],stopwords)

    random.shuffle(sentences)    #做乱序处理，使得同类别的样本不至于扎堆

    writeData(sentences,saveDataFile)

if __name__=="__main__":
    stopwordsFile=r"./data/stopwords.txt"
    stopwords=getStopWords(stopwordsFile)
    saveDataFile=r'train_data.txt'
    preprocessData(stopwords,saveDataFile)
    #fasttext.supervised():有监督的学习
    classifier=fasttext.supervised(saveDataFile,'classifier.model',lable_prefix='__lable__')
    result = classifier.test(saveDataFile)
    print("P@1:",result.precision)    #准确率
    print("R@2:",result.recall)    #召回率
    print("Number of examples:",result.nexamples)    #预测错的例子

    #实际预测
    lable_to_cate={1:'technology'.1:'car',3:'entertainment',4:'military',5:'sports'}

    texts=['中新网 日电 2018 预赛 亚洲区 强赛 中国队 韩国队 较量 比赛 上半场 分钟 主场 作战 中国队 率先 打破 场上 僵局 利用 角球 机会 大宝 前点 攻门 得手 中国队 领先']
    lables=classifier.predict(texts)
    print(lables)
    print(lable_to_cate[int(lables[0][0])])

    #还可以得到类别+概率
    lables=classifier.predict_proba(texts)
    print(lables)

    #还可以得到前k个类别
    lables=classifier.predict(texts，k=3)
    print(lables)

    #还可以得到前k个类别+概率
    lables=classifier.predict_proba(texts，k=3)
    print(lables)

3.2 fastText有监督学习分类

# -*- coding:utf-8 -*-
import pandas as pd
import random
import fasttext
import jieba
from sklearn.model_selection import train_test_split

cate_dic = {'technology': 1, 'car': 2, 'entertainment': 3, 'military': 4, 'sports': 5}
"""
函数说明：加载数据
"""
def loadData():


    #利用pandas把数据读进来
    df_technology = pd.read_csv("./data/technology_news.csv",encoding ="utf-8")
    df_technology=df_technology.dropna()    #去空行处理

    df_car = pd.read_csv("./data/car_news.csv",encoding ="utf-8")
    df_car=df_car.dropna()

    df_entertainment = pd.read_csv("./data/entertainment_news.csv",encoding ="utf-8")
    df_entertainment=df_entertainment.dropna()

    df_military = pd.read_csv("./data/military_news.csv",encoding ="utf-8")
    df_military=df_military.dropna()

    df_sports = pd.read_csv("./data/sports_news.csv",encoding ="utf-8")
    df_sports=df_sports.dropna()

    technology=df_technology.content.values.tolist()[1000:21000]
    car=df_car.content.values.tolist()[1000:21000]
    entertainment=df_entertainment.content.values.tolist()[:20000]
    military=df_military.content.values.tolist()[:20000]
    sports=df_sports.content.values.tolist()[:20000]

    return technology,car,entertainment,military,sports

"""
函数说明：停用词
参数说明：
    datapath：停用词路径
返回值：
    stopwords:停用词
"""
def getStopWords(datapath):
    stopwords=pd.read_csv(datapath,index_col=False,quoting=3,sep="\t",names=['stopword'], encoding='utf-8')
    stopwords=stopwords["stopword"].values
    return stopwords

"""
函数说明：去停用词
参数：
    content_line：文本数据
    sentences：存储的数据
    category：文本类别
"""
def preprocess_text(content_line,sentences,stopwords):
    for line in content_line:
        try:
            segs=jieba.lcut(line)    #利用结巴分词进行中文分词
            segs=filter(lambda x:len(x)>1,segs)    #去掉长度小于1的词
            segs=filter(lambda x:x not in stopwords,segs)    #去掉停用词
            sentences.append(" ".join(segs))
        except Exception as e:
            print (line)
            continue

"""
函数说明：把处理好的写入到文件中，备用
参数说明：

"""
def writeData(sentences,fileName):
    print("writing data to fasttext format...")
    out=open(fileName,'w')
    for sentence in sentences:
        out.write(sentence.encode('utf8')+"\n")
    print("done!")

"""
函数说明：数据处理
"""
def preprocessData(stopwords,saveDataFile):
    technology,car,entertainment,military,sports=loadData()    

    #去停用词，生成数据集
    sentences=[]
    preprocess_text(technology,sentences,stopwords)
    preprocess_text(car,sentences,stopwords)
    preprocess_text(entertainment,sentences,stopwords)
    preprocess_text(military,sentences,stopwords)
    preprocess_text(sports,sentences,stopwords)

    random.shuffle(sentences)    #做乱序处理，使得同类别的样本不至于扎堆

    writeData(sentences,saveDataFile)


if __name__=="__main__":
    stopwordsFile=r"./data/stopwords.txt"
    stopwords=getStopWords(stopwordsFile)
    saveDataFile=r'unsupervised_train_data.txt'
    preprocessData(stopwords,saveDataFile)

    #fasttext.load_model:不管是有监督还是无监督的，都是载入一个模型
    #fasttext.skipgram(),fasttext.cbow()都是无监督的，用来训练词向量的

    model=fasttext.skipgram('unsupervised_train_data.txt','model')
    print(model.words)    #打印词向量

    #cbow model
    model=fasttext.cbow('unsupervised_train_data.txt','model')
    print(model.words)    #打印词向量

3.1 fastText和word2vec的区别

相似处：

图模型结构很像，都是采用embedding向量的形式，得到word的隐向量表达。
都采用很多相似的优化方法，比如使用Hierarchical softmax优化训练和预测中的打分速度。

不同处：

模型的输出层：word2vec的输出层，对应的是每一个term，计算某term的概率最大；而fasttext的输出层对应的是分类的label。不过不管输出层对应的是什么内容，起对应的vector都不会被保留和使用；
模型的输入层：word2vec的输出层，是 context window 内的term；而fasttext 对应的整个sentence的内容，包括term，也包括 n-gram的内容；

两者本质的不同，体现在 h-softmax的使用：

Wordvec的目的是得到词向量，该词向量最终是在输入层得到，输出层对应的 h-softmax
也会生成一系列的向量，但最终都被抛弃，不会使用。
fasttext则充分利用了h-softmax的分类功能，遍历分类树的所有叶节点，找到概率最大的label（一个或者N个）

3.2 小结

总的来说，fastText的学习速度比较快，效果还不错。fastText适用与分类类别非常大而且数据集足够多的情况，当分类类别比较小或者数据集比较少的话，很容易过拟合。

可以完成无监督的词向量的学习，可以学习出来词向量，来保持住词和词之间，相关词之间是一个距离比较近的情况；
也可以用于有监督学习的文本分类任务，（新闻文本分类，垃圾邮件分类、情感分析中文本情感分析，电商中用户评论的褒贬分析）

您可能感兴趣的文档:

点击免费下载>>软考高级考试备考技巧/历年真题/备考精华资料

--结束END--

本文标题: fasttext的python实现---

本文链接: https://www.lsjlt.com/news/184376.html(转载时请注明来源链接)

有问题或投稿请发送至: 邮箱/279061341@qq.com QQ/279061341

本篇文章演示代码以及资料文档资料下载

下载Word文档到电脑，方便收藏和打印～

下载Word文档

去做题

猜你喜欢

fasttext的python实现---

转载请注明作者和出处：http://blog.csdn.net/john_bh/ 一、简介二、FastText原理2.1 模型架构2.2 层次SoftMax2.3 N-gram特征三、基于fastText实现文本分类3.1 fast...

99+

2023-01-31

fasttext python
Python深度学习之FastText实现文本分类详解

FastText是一个三层的神经网络，输入层、隐含层和输出层。 FastText的优点：使用浅层的神经网络实现了word2vec以及文本分类功能，效果与深层网络差不多，节约资源，...

99+

2022-11-11
Python实现基于Fasttext的商品评论数据分类的操作流程

在以往的文本分类型的任务中，基本的流程主要是就是：文本数据加载数据清洗分词向量化分类模型训练性能评估这里面比如向量化和模型搭建是独立的两个节点，可以自由地进行设计，当然了也是一份...

99+

2022-11-11
Python的几种实现

Python自身作为一门编程语言，它有多种实现。这里的实现指的是符合Python语言规范的Python解释程序以及标准库等。这些实现虽然实现的是同一种语言，但是彼此之间，特别是与CPython之间还是有些差别的。下面分别列出几个主要的实现。...

99+

2023-01-31

几种 Python
python之switch的实现

防伪码：忘情公子著 switch是一种语法结构，在大多数的编程语言当中，都提供了switch语法结构。 switch语句的作用与优点： switch语句用于编写多分支结构的程序，类似于if... elif... else（if多分...

99+

2023-01-31

python switch
python实现java的hashcod

def convert_n_bytes(n, b): bits = b * 8 return (n + 2 ** (bits - 1)) % 2 ** bits - 2 ** (bits - 1)def convert_4_byte...

99+

2023-01-31

python java hashcod
python实现进度条的多种实现

有时候在使用Python处理比较耗时操作的时候，为了便于观察处理进度，这时候就需要通过进度条将处理情况进行可视化展示，以便我们能够及时了解情况。这对于第三方库非常丰富的Python来说，想要实现这一功能并不是什么难事...

99+

2022-06-02

python 进度条
python单链表的实现

''' 当加入第一个node节点的时候,会有几个值,(这里的self.tail.next 其实就是node.next) head = item = tail = Node(object element1 memory) item = hea...

99+

2023-01-31

链表 python
Python实现简单的API

代码实现 # coding:utf-8 import json from urlparse import parse_qs from wsgiref.simple_server import make_server # 定...

99+

2023-01-31

简单 Python API
Python交互Redis的实现

模块(Redis) Ubuntu sudo pip3 install redis 使用流程 import redis # 创建数据库连接对象 r = redis.Redis(host='127.0.0.1',port=6...

99+

2022-08-10

Python交互Redis
Python实现Singleton模式的

使用python实现设计模式中的单例模式。单例模式是一种比较常用的设计模式，其实现和使用场景判定都是相对容易的。本文将简要介绍一下python中实现单例模式的几种常见方式和原理。一方面可以加深对python的理解，另一方面可以更加深入的了...

99+

2023-01-30

模式 Python Singleton
现实世界中的 Python

Python 有多稳定？非常稳定。自 1991 年起大约每隔 6 到 18 个月就会推出新的稳定发布版，这种状态看来还将持续下去。目前主要发布版本的间隔通常为 18 个月左右。开发者也会推出旧版本的“问题修正”发布版，因此现有发...

99+

2023-01-31

现实世界 Python
Apriori算法的python实现

原始链接：基于Python的机器学习实战：Apriori 原始链接里的代码是在python2下写的，有的地方我看的不是太明白，在这里，我把它修改成能在python3下运行了，还加入了一些方便自己理解的注释。 Apriori算法的pyspa...

99+

2023-01-31

算法 Apriori python
python实现的WebSocket客户

安装 sudo pip install websocket-client 示例客户端代码： #!/usr/bin/python from websocket import create_connection ws = creat...

99+

2023-01-31

客户 python WebSocket
python字典的hashtable实现

python字典的key-value原理属于hashtable的范畴。关于hashtable的一篇博客关于python字典实现原理的一篇博客 ...

99+

2023-01-31

字典 python hashtable
pypy -- 用python实现的py

pypy 分为两部分：一个 python 的实现和一个编译器： pypy provides infrastructure for building interpreters in [r]python. This infrastructu...

99+

2023-01-31

pypy python py
KNN算法的Python实现

# KNN算法思路：#-----------------------------------------------------##step1:读入数据，存储为链表#step2:数据预处理，包括缺失值处理、归一化等#step3:设置K值#s...

99+

2023-01-31

算法 KNN Python
python实现的简版iconv

系统管理中，经常涉及的文件编码就是UTF8和GB1803，下面是实现iconv简化功能（UTF8,GB18030互转）的python代码：def to_unicode(str_a): if type(str_a) is unicode...

99+

2023-01-31

python iconv
用python 实现linux 的wc

#!/usr/bin/env python """file name: opt_wc.py""" import os import sys from optparse import OptionParser def opt(): ...

99+

2023-01-31

python linux wc
Python实现自己的AOP

Java中的AOP可以用JDK的动态代理和cglib来实现，将需要拦截的方法前后可以额外添加些功能。 Python中有许多方法去实现AOP，现在先介绍第一种比较简单的： 1）利用with...as... Python的with....

99+

2023-01-31

自己的 Python AOP