Jiahong的个人博客

凡事预则立不预则废

GitHub——主题推荐算法

Posted on 2018-05-30

Topic Suggestions for Millions of Repositories

Github官网原文

总体流程图

分为七个步骤：

Readme 预处理与清除

(Readme preprocessing and cleanup)

移除不要的文本块(Remove unwanted blocks)
- 去除没用文本：一些块是没用的，比如code, table 和image链接等
文本划分(Text segmentation)
- 提取有用的文本：一个启发式的README tagger，借助格式分析：分析缩进(indentation)，空格(spacing)，和反撇号( `, backtick)等决定是否是有用信息【说明：这里语法分析是没必要的，我们只关心有用的文本，其他的都是为噪音(noise)】
- 提取到有用文本后，删除拓展部分：HTML标签，路径和其他处理出来的部分等
- 最后将剩下的文本进行粗粒度分割【用标点符号(punctuation marks)和README中的一些符号，比如临近哈希符号(contiguous hash symbols)】

生成候选主题

(Generate candidate topics)

用自定义的停用词去将词单元划分出来(Use custom stop words to break text into word units)
- 停用词去除
- n-gram分段小于等于4(测试发现小于4的比较好，太长的主题过于具体了，不合适)

移除噪音主题

(Eliminate noisy topics)

用逻辑回归模型删减“bad”主题(Use a logistic regression model to prune “bad” topics)
- 监督逻辑回归模型(supervised logistic regression model)主要针对除了频数比较小的外，一些不好的，比如”great”, “cool”等，这里的模型是一个分类模型，分为两类(good[positive] and bad(negative)), 我们称之为关键词过滤模型
  - 手动收集大约300个数据集作为训练集
  - 单个动词一般都是Bad类请教师兄
  - 其他的(Other features we used were occurrence of user names, n-gram size of a phrase, length of a phrase, and numeric content within a phrase.)
  - 以后打算加入回馈机制(来自用户的)去更新这个关键词过滤模型：接受度高的词作为positive的，接受度低的作为停用词或者negative的
移除不满足最小频数的主题(Eliminate topics not satisfying minimum frequency count)

给主题评分

(Score Topics)

用混合tf-idf分数，以主题频率和n元词作为打分标准(Use combination of tf-idf scores, topic frequency and n-gram size for scoring)
- 评估多种指标后，选择了点互信息指标(PMI)： Pointwise Mutual Information
- 考虑另一种指标tf-idf：参考论文Using TF-IDF to Determine Word Relevance in Document Queries.pdf或者Python的实现
- The second approach we tried uses the average tf-idf scores of individual words in a phrase weighted by the phrase frequency (if it’s more than one word long) and n-gram size.
- 两种方法比较：
  - PMI：强调独特性，越特殊的短语评分越高，但有些可能只是拼写错误(Typo)或者是没有被删除的代码片段
  - tf-idf：不强调独特性，最终选择是tf-idf,原因是这个指标较好的平衡了独特性和主题与仓库(repository)的相关程度
- 在tf-idf的基础上还添加了一下其他的比如boosting scores等方法
- 下面是TF-IDF的说明*

规范化主题

(Canonicalize topics)

使用内部词典规范主题形式(Use an in-house dictionary to suggest canonical form of topics)
- 解决文字层面的差别和变化等,比如下面四个主题
  
  neural-network
  neural-networks
  neuralnetwork
  neuralnetworks

移除相似的主题

(Eliminate near similar topics)

用基于Jaccard相似性评分的贪心算法(Greedy approach using Jaccard similarity scoring)
- motivation：在得到Top-N的主题后，有些主题其实很相似，虽然都有用，但是他们只是在不同粒度描述了同一个主题而已，因此我们需要删除一些重复的，比如下面的例子
  
  machine learning
  deep learning
  general-purpose machine learning
  machine learning library
  machine learning algorithms
  distributed machine learning
  machine learning framework
  deep learning library
  support vector machine
  linear regression
- method：两个短语的相似性计算使用的是基于词的Jaccard相似性(两个短语的差集与并集的比值，因为它对较短的短语很有效，而且分数是[0-1]的，很方便设置阈值(thresholds))，用贪心算法，如果两个主题相似，去除分数较低的那一个，上面的例子去除相似主题后的结果是：
  
  machine learning
  deep learning
  support vector machine
  linear regression

返回Top-K主题作为最终结果