Search Resource List
word2vec
- word2vec implementation of word-vector generation, for use in natural language analysis, data mining, artificial intelligence, and similar applications.
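As a rough illustration of what this kind of tool produces, here is a minimal word2vec sketch using the gensim library; the toy corpus and parameter values are placeholders of mine, not part of this package:

```python
# Minimal word2vec sketch with gensim (assumes `pip install gensim`, 4.x API).
from gensim.models import Word2Vec

# Toy corpus: a real run would use a large, tokenized corpus.
sentences = [
    ["natural", "language", "processing", "is", "fun"],
    ["word", "vectors", "capture", "word", "meaning"],
    ["data", "mining", "uses", "word", "vectors"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["word"][:5])           # first 5 dimensions of the vector for "word"
print(model.wv.most_similar("word"))  # nearest neighbours in the embedding space
```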
Hadoop
- Developed with Hadoop; counts the frequency of user-defined keywords in the input files and ranks the texts by keyword frequency. The keywords and input files must be supplied by the user (a Hadoop Streaming sketch follows the WordCount2 entry below).
WordCount2
- A word-count program based on Hadoop 1.x; all required jar files are included, so it is ready to use after minimal configuration.
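For both Hadoop entries above, the core logic is the classic word-count pattern. A minimal sketch using Hadoop Streaming, with the mapper and reducer as plain Python scripts (file names are illustrative, and the keyword filtering of the first entry would be added in the mapper):

```python
# mapper.py -- emits "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sums the counts per word; Hadoop Streaming delivers keys sorted.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```

The pair would be submitted with the streaming jar shipped with the cluster (its path varies by installation), e.g. `hadoop jar hadoop-streaming.jar -input in -output out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py`.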
Enhancedtextmining
- An enhanced text-mining pipeline, covering word segmentation, classification and clustering, evaluation of segmentation results, and more.
1
- Detects the similarity of Chinese articles: first performs word segmentation on the text, then extracts features and computes the angle between the feature vectors to test whether the articles are similar.
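A minimal sketch of the feature-vector-angle idea in plain Python; whitespace tokenization stands in for the Chinese word-segmentation step the entry describes:

```python
# Cosine similarity between two bag-of-words feature vectors.
# For Chinese text, replace .split() with a segmenter (e.g. jieba.lcut).
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("the cat sat on the mat", "the cat lay on the mat"))
```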
distributed_word_embedding-master
- The Distributed Word Embedding tool is a parallelization of the Word2Vec algorithm on top of our DMTK parameter server. It scales word embedding efficiently to industry-size problems.
distributed_skipgram_mixture-master
- The Distributed Multisense Word Embedding (DMWE) tool is a parallelization of the Skip-Gram Mixture [1] algorithm on top of the DMTK parameter server. It scales multi-sense word embedding efficiently to industry-size problems.
LDA-topic-model
- A disclaimer first: this LDA topic-model code was written by someone else. I have tested it and it runs, but the output is a little unsatisfying. The input is word IDs together with each word's count in the document; it would be perfect if it could read documents directly. The output is the topics and the probability of each word under each topic, though I could not make sense of the topics it produced; I do not know whether that is a problem with the algorithm or with my own limited understanding. If you are studying LDA topic models, you can download it and give it a try.
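For comparison, here is a minimal LDA sketch with the gensim library, which builds exactly the (word id, count) input the entry describes via doc2bow; the documents and parameters are placeholders, not this package's code:

```python
# LDA topic-model sketch with gensim; documents are lists of tokens.
from gensim import corpora, models

docs = [["apple", "banana", "fruit"],
        ["football", "goal", "match"],
        ["fruit", "juice", "apple"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]   # (word id, count) pairs

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)
for topic_id, words in lda.print_topics():
    print(topic_id, words)   # each topic: words with their probabilities
```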
mixBern
- Just like the EM algorithm for a Gaussian Mixture Model, this is the EM algorithm for fitting a Bernoulli Mixture Model. GMM is useful for clustering real-valued data; for binary data (such as bag-of-words features), a Bernoulli mixture is more suitable (see the sketch below).
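Since the entry states the algorithm precisely, a compact numpy sketch of EM for a Bernoulli mixture may help; the variable names and toy data are mine, not the package's:

```python
# EM for a Bernoulli Mixture Model (sketch): X is an (N, D) binary matrix,
# K is the number of components; no convergence check, fixed iteration count.
import numpy as np

def bernoulli_mixture_em(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                   # mixing weights
    mu = rng.uniform(0.25, 0.75, size=(K, D))  # per-component Bernoulli means

    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T   # (N, K)
        log_r = np.log(pi) + log_p
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights and means from the soft counts.
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = np.clip((r.T @ X) / Nk[:, None], 1e-6, 1 - 1e-6)
    return pi, mu, r

X = np.array([[1, 1, 0, 0]] * 5 + [[0, 0, 1, 1]] * 5, dtype=float)
pi, mu, _ = bernoulli_mixture_em(X, K=2)
print(np.round(mu, 2))   # the two rows recover the two binary prototypes
```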
NLPLibSVM
- A Java version of the libsvm word-segmentation training set, including libsvm.jar and training-set samples.
ICTCLAS_api
- Performs word segmentation on a specified text, segmenting according to part of speech.
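The ICTCLAS API itself is a Java/C binding whose exact calls are not reproduced here; as a stand-in, the same idea (segmentation with part-of-speech tags) looks like this with the jieba library:

```python
# POS-tagged segmentation sketch; jieba substitutes for the ICTCLAS API here.
import jieba.posseg as pseg

for word, flag in pseg.cut("我爱自然语言处理"):
    print(word, flag)   # each word with its part-of-speech tag
```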
kmeansClassifier
- This program implements k-means classification, using IK word-segmentation technology to segment the text.
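A minimal sketch of the same pipeline with scikit-learn, where TF-IDF features stand in for the IK-segmented features of the Java package:

```python
# k-means document clustering sketch with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["hadoop mapreduce cluster", "word vectors embedding",
        "hadoop hdfs cluster", "embedding similarity vectors"]

X = TfidfVectorizer().fit_transform(docs)          # sparse TF-IDF features
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                                  # cluster id per document
```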
Naive-bayes
- This entry uses spelling correction as an example of how a Naive Bayes classifier is implemented. Given a word typed by the user (w), the spell checker tries to infer the most likely intended word (c); the input may, of course, already be correct. For example, a user who types thew may have meant the or thaw. The Naive Bayes classifier resolves this with the posterior probability P(c|w), the probability that the intended word is c given that w was observed: the most likely c is the one that maximizes P(c|w).
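A minimal sketch of that argmax in Python, in the spirit of the entry (and of Norvig's well-known essay on this example); the toy word frequencies and the crude error model are assumptions of mine:

```python
# Naive Bayes spelling correction (sketch): choose argmax_c P(c) * P(w|c).
from collections import Counter

# Toy language model: frequencies stand in for P(c).
WORDS = Counter(["the"] * 100 + ["they"] * 30 + ["thaw"] * 3)

def edits1(word):
    """All strings one edit away (deletes/transposes/replaces/inserts)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    # Crude P(w|c): a known word beats any 1-edit candidate.
    candidates = ({word} if word in WORDS else set()) or \
                 (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=lambda w: WORDS[w])

print(correct("thew"))   # -> "the" under this toy frequency model
```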
fenciledebeiyesi
- A Chinese text word-segmentation system plus source code for Bayes-based text classification, implemented in Matlab.
WordFrequenceCount
- Text-based word-frequency calculation: counts the words in a text, handles tens of thousands of words, and outputs the results in one pass.
Sogou-character-porfile
- Describes the process of building character (persona) tag profiles, covering data collection, word segmentation, preprocessing, algorithm selection, and presentation of the results.
Crawler.tar
- A crawler written in Python 3.5 that scrapes reviews of the film 《声之形》 (A Silent Voice) from Douban, counts the frequency of words in the reviews, and generates a word cloud.
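A minimal sketch of the counting and word-cloud stage (assumes `pip install jieba wordcloud`; the review text, the font path, and the output file name are placeholders, since the actual package scrapes the text from Douban first):

```python
# Word-frequency count and word cloud for Chinese review text (sketch).
from collections import Counter

import jieba
from wordcloud import WordCloud

text = "电影 很感人 画面 很美 剧情 感人"   # stand-in for the scraped reviews
freq = Counter(w for w in jieba.lcut(text) if len(w) > 1)

# A CJK-capable font_path is required for Chinese glyphs; the path is illustrative.
wc = WordCloud(font_path="simhei.ttf", width=800, height=400)
wc.generate_from_frequencies(freq).to_file("wordcloud.png")
```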
textclustering-master
- Clusters large texts by mining them. Instead of relying on the frequency of words in the text, the method considers the surrounding context: every character undergoes word-position feature learning based on predefined features, producing a trained model. Each character of the string to be segmented is then tagged with a word-position label, and the final segmentation is obtained from the word-position definitions (a decoding sketch follows below).
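The last step the entry describes, turning word-position labels back into words, is simple enough to sketch on its own; the tagging model itself is omitted, and the BMES tag set assumed here is the conventional one (B=begin, M=middle, E=end, S=single-character word):

```python
# Decode word-position (BMES) tags into a segmentation (sketch).
def decode_bmes(chars, tags):
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):   # a word ends at this character
            words.append(buf)
            buf = ""
    if buf:                     # tolerate a truncated tag sequence
        words.append(buf)
    return words

print(decode_bmes("我爱自然语言", ["S", "S", "B", "M", "M", "E"]))
# -> ['我', '爱', '自然语言']
```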