
Geometric-Topic-Modeling

This is a Python 2 implementation of the Geometric Dirichlet Means (GDM) algorithm for topic inference (M. Yurochkin, X. Nguyen, NIPS 2016) and the Conic Scan-and-Cover (CoSAC) algorithms for nonparametric topic modeling (M. Yurochkin, A. Guha, X. Nguyen, NIPS 2017). Code written by Mikhail Yurochkin.

Overview

This is a simple demonstration of GDM, CoSAC and a Gibbs sampler (from the lda package) on simulated data. A more extensive guide is in preparation.

all_func.py implements data simulation according to the LDA model, the GDM algorithm, and a projection estimate of the topic proportions $\theta$.

geom_tm.py implements the CoSAC algorithm for a sparse document-term matrix and wraps it as a scikit-learn class.

tester_CoSAC.py contains a simulated example.

The implementation is designed to be used interactively (e.g. in a Python IDE such as Spyder).
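For readers unfamiliar with what "data simulation according to the LDA model" involves, the following is a minimal sketch of the standard LDA generative process in NumPy. It is not the code from all_func.py: the helper name simulate_lda, its parameters and their default values are all assumptions chosen for illustration.

```python
import numpy as np


def simulate_lda(M=100, V=50, K=3, doc_len=200, alpha=0.1, eta=0.1, seed=0):
    """Hypothetical helper: draw a small corpus from the LDA generative model."""
    rng = np.random.RandomState(seed)
    # Topics: K probability vectors over a vocabulary of V words.
    beta = rng.dirichlet(np.full(V, eta), size=K)       # shape (K, V)
    # Topic proportions: one probability vector over K topics per document.
    theta = rng.dirichlet(np.full(K, alpha), size=M)    # shape (M, K)
    # Word distribution of each document is a mixture of the topics.
    probs = theta.dot(beta)                             # shape (M, V)
    probs = probs / probs.sum(axis=1, keepdims=True)    # guard against rounding
    # Each document is a multinomial draw of doc_len words.
    counts = np.vstack([rng.multinomial(doc_len, p) for p in probs])
    return counts, theta, beta


counts, theta, beta = simulate_lda()
```

Here counts is an $M \times V$ matrix of raw word counts, while theta and beta are the ground-truth proportions and topics that an estimate can be compared against.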

Usage guide for GDM algorithm

gdm(wdfn, K, ncores=-1)

wdfn: $M \times V$ matrix of normalized document-term counts

K: number of topics to fit

ncores: number of CPU cores to use for k-means

Returns: topic estimates
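As a sketch of how the wdfn input might be prepared, the snippet below turns raw counts into row-normalized word frequencies. It assumes that "normalized" means each document row sums to 1; the helper normalize_counts is hypothetical, and the import path in the trailing comment is inferred from the file list above rather than verified.

```python
import numpy as np


def normalize_counts(counts):
    """Hypothetical helper: divide each document row by its total word count."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    if np.any(totals == 0):
        raise ValueError("every document needs at least one word")
    return counts / totals


counts = np.array([[2, 1, 1],
                   [0, 3, 1],
                   [5, 0, 5]])
wdfn = normalize_counts(counts)  # M x V, each row sums to 1

# With the repository on the Python path, the call would then presumably be
# (assumed import location, not run here):
# from all_func import gdm
# topics = gdm(wdfn, K=2, ncores=-1)
```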

Usage guide for CoSAC algorithm

geom_tm(delta=0.4, prop_discard=0.5, prop_n=0.01, verbose=False)

Parameters:

delta: cosine cone radius $\omega$

prop_discard: quantile to compute $\mathcal{R}$

prop_n: proportion of data to be used as outlier threshold $\lambda$

verbose: if True, displays plots like those in Figure 2
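To give some intuition for the cone radius delta, the sketch below marks which documents fall inside a cone around a given direction, using cosine distance (1 minus cosine similarity) measured from the data mean. This illustrates the geometric idea only, under the assumption that the cone is defined this way; the helper in_cone is hypothetical and is not taken from geom_tm.py.

```python
import numpy as np


def in_cone(docs, center, direction, omega):
    """Hypothetical helper: boolean mask of documents inside the cone."""
    centered = np.asarray(docs, dtype=float) - center
    norms = np.linalg.norm(centered, axis=1) * np.linalg.norm(direction)
    safe = np.where(norms > 0, norms, 1.0)
    cos_sim = centered.dot(direction) / safe
    # A document sitting exactly at the center has no direction: treat as outside.
    cos_dist = np.where(norms > 0, 1.0 - cos_sim, np.inf)
    return cos_dist < omega


docs = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.1, 0.8]])
center = docs.mean(axis=0)
direction = docs[0] - center            # points toward the first document
mask = in_cone(docs, center, direction, omega=0.4)
```

In this toy example only the first document lies in the cone; a larger omega widens the cone so that it covers more documents.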

Methods:

fit_a(data, cent)

data: sparse $M \times V$ matrix of normalized document-term counts

cent: data mean $\hat C_p$

Returns:

a_betas_: topic estimates from Algorithm 2 without the spherical k-means step

K_: estimated number of topics
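The data and cent arguments could be built roughly as follows. This is a sketch under assumptions: rows are normalized to sum to 1, cent is passed as a 1-D array, and the commented-out import path and attribute access are guesses based on the names listed above, not verified against geom_tm.py.

```python
import numpy as np
from scipy import sparse

# Toy raw counts stored as a sparse M x V matrix.
counts = sparse.csr_matrix(np.array([[2, 0, 2],
                                     [0, 3, 1],
                                     [5, 5, 0]], dtype=float))

# Scale each row by 1 / (row total) with a diagonal matrix, which keeps sparsity.
row_totals = np.asarray(counts.sum(axis=1)).ravel()
data = sparse.diags(1.0 / row_totals).dot(counts).tocsr()

# Data mean: the average of the normalized document rows.
cent = np.asarray(data.mean(axis=0)).ravel()

# Presumed usage with the repository on the Python path (not run here):
# from geom_tm import geom_tm
# model = geom_tm(delta=0.4, prop_discard=0.5, prop_n=0.01)
# model.fit_a(data, cent)
# topics, n_topics = model.a_betas_, model.K_
```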

fit_sph(data, cent, init=None, it=10)

data: sparse $M \times V$ matrix of normalized document-term counts

cent: data mean $\hat C_p$

init, it: if init is None and fit_a has already been run, completes Algorithm 2 using the number of spherical k-means iterations given by it

Returns:

sph_betas_: updated topics

sph_clust_: cluster assignments

fit_all(data, cent, it=5)

Full run of Algorithm 2, followed by the number of spherical k-means post-processing iterations given by it
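For reference, a generic spherical k-means refinement step can be sketched as below: points are projected onto the unit sphere, assigned to the center with the highest cosine similarity, and each center is replaced by the normalized sum of its members. This is the textbook procedure, not the repository's code; the function spherical_kmeans and its arguments are hypothetical, and rows of X are assumed to be nonzero.

```python
import numpy as np


def spherical_kmeans(X, init, it=5):
    """Hypothetical helper: run `it` iterations of textbook spherical k-means."""
    X = np.asarray(X, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)          # unit-length rows
    centers = np.array(init, dtype=float)                     # copy of the init
    centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    for _ in range(it):
        # Assign each point to the center with the largest cosine similarity.
        labels = X.dot(centers.T).argmax(axis=1)
        for k in range(centers.shape[0]):
            total = X[labels == k].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:                                      # else keep old center
                centers[k] = total / norm
    labels = X.dot(centers.T).argmax(axis=1)
    return centers, labels


X = np.array([[1.0, 0.1], [1.0, -0.1], [0.1, 1.0], [-0.1, 1.0]])
centers, labels = spherical_kmeans(X, init=[[1.0, 0.0], [0.0, 1.0]], it=5)
```

In the context of CoSAC, such a step would presumably be applied to centered document directions, with the fit_a topic estimates serving as the initialization.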
