| Package Name | Comment |
| dragon.config |
Loads the resources and applications specified by an XML-based configuration file.
Package Specification
|
| dragon.ir.classification |
A package for text classification; feature selection methods and an evaluation program are also included.
Package Specification
All classifiers should implement the Classifier interface. A classifier should also support different feature selectors.
In other words, a user can switch feature selectors without changing the code of the underlying classifier.
|
| dragon.ir.classification.featureselection |
Feature selectors for text classification and related applications.
Package Specification
To create your own feature selector, implement the interface called FeatureSelector. You can extend the Abstract Feature Selector instead
of coding from scratch. A feature selector can be either supervised or unsupervised. For more details, see the following paper.
Yiming Yang and Jan O. Pedersen, A comparative study on feature selection in text categorization, Proceedings of ICML-97,
14th International Conference on Machine Learning, pp. 412-420
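As a rough illustration of an unsupervised selector, document-frequency thresholding (one of the methods compared in the paper cited above) might be sketched as follows. The class name, method name and signature are hypothetical and are not the toolkit's FeatureSelector interface.

```java
import java.util.*;

// Hypothetical sketch: NOT the toolkit's FeatureSelector interface.
class DocFrequencySelector {
    /** Returns the terms that occur in at least minDocFreq documents, in sorted order. */
    static SortedSet<String> select(List<List<String>> docs, int minDocFreq) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            // Count each term at most once per document.
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        SortedSet<String> selected = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() >= minDocFreq) {
                selected.add(e.getKey());
            }
        }
        return selected;
    }
}
```

Because the selector only looks at document frequencies and never at class labels, it is unsupervised; a supervised selector would additionally take the labels of the training documents.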
|
| dragon.ir.classification.multiclass |
Reducing multi-class classification to binary classifiers.
Package Specification
Some classifiers, such as support vector machines (SVM), can only handle two-class classification problems. If a task involves
more than two classes, it has to be reduced to a set of binary classifiers, whose results are then combined
to predict the label of an example.
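One common way to perform this reduction is the one-vs-rest scheme: train one binary classifier per class and predict the class whose classifier is most confident. The sketch below assumes a hypothetical BinaryScorer interface; it is not the toolkit's API, and the text above does not say which reduction schemes the package provides.

```java
import java.util.*;

// Hypothetical interface for illustration; not the toolkit's Classifier API.
interface BinaryScorer {
    /** Returns a confidence that the example belongs to the positive class. */
    double score(double[] example);
}

class OneVsRest {
    // scorers.get(k) separates class k from all other classes.
    private final List<BinaryScorer> scorers;

    OneVsRest(List<BinaryScorer> scorers) {
        this.scorers = scorers;
    }

    /** Predicts the class whose binary scorer is most confident. */
    int predict(double[] example) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < scorers.size(); k++) {
            double s = scorers.get(k).score(example);
            if (s > bestScore) {
                bestScore = s;
                best = k;
            }
        }
        return best;
    }
}
```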
|
| dragon.ir.clustering |
A package for document clustering and its evaluation
Package Specification
The toolkit implements two common clustering approaches, the agglomerative approach and the K-Means approach.
These two approaches have many variants in terms of similarity measures. The toolkit encapsulates the details
of different similarity measures into the implementations of two interfaces, Doc Distance and Cluster Model, respectively.
The Doc Distance interface computes the distance between two documents and is designed for agglomerative clustering approaches.
The Cluster Model interface computes the distance between a document and a cluster or the generative probability of a document by a cluster model.
To evaluate cluster quality, please call dragon.ir.clustering.ClusteringEva.
|
| dragon.ir.clustering.clustermodel |
Various cluster models for partitional clustering approaches
Package Specification
The Cluster Model interface computes the distance between a document and a cluster or the generative probability of a document by a cluster model.
|
| dragon.ir.clustering.docdistance |
Various similarity metrics for pairs of documents.
Package Specification
The Doc Distance interface computes the distance between two documents and is designed for agglomerative clustering approaches.
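As an example of what such a pairwise measure might compute, the sketch below returns one minus the cosine similarity of two sparse term-frequency vectors. The class and method are hypothetical; the actual Doc Distance interface may use different types and signatures.

```java
import java.util.*;

// Hypothetical sketch; the toolkit's Doc Distance interface may have a different signature.
class CosineDocDistance {
    /** Returns 1 - cosine similarity of two sparse vectors (term id -> weight). */
    static double distance(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 1.0; // an empty document is treated as maximally distant
        }
        return 1.0 - dot / (normA * normB);
    }

    private static double norm(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }
}
```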
|
| dragon.ir.clustering.featurefilter |
Feature Selectors for Text Clustering.
Package Specification
To create your own feature selector, implement the interface called FeatureSelector. You can extend the Abstract Feature Selector instead
of coding from scratch.
|
| dragon.ir.index |
A package for document indexing and for reading indexing results.
Package Specification
There are two important interfaces. One is the indexer, which indexes the articles in a corpus. The other is the index reader, which reads out the indexing results.
The dragon toolkit supports two modes of indexing. The first mode saves the indexing results to disk-based files and is usually suited to large
collections. The second mode keeps all information in memory. It is convenient and fast, but suitable for small collections only, e.g.
collections for text summarization.
|
| dragon.ir.index.sentence |
A package for sentence-level indexing and for reading indexing results
Package Specification
Sentence indexing is almost the same as basic indexing, except that it first splits an article into sentences and then treats each
sentence as a document for indexing. In other words, in the doc-term matrix produced by sentence indexing, each document actually
denotes a sentence. Sentence indexing can be used for sentence-level summarization. Sentence indexing also has two modes, online
mode and disk-based mode.
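The idea of treating each sentence as a document can be sketched as below. The sentence splitter is deliberately naive (it breaks after ., ! or ? followed by whitespace), and all names are hypothetical; this is not the toolkit's sentence indexer.

```java
import java.util.*;

// Hypothetical sketch of the idea only; not the toolkit's sentence indexer.
class SentenceAsDocument {
    /** Splits an article into sentences and returns one term-frequency row per sentence. */
    static List<Map<String, Integer>> index(String article) {
        List<Map<String, Integer>> rows = new ArrayList<>();
        // Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
        for (String sentence : article.trim().split("(?<=[.!?])\\s+")) {
            Map<String, Integer> row = new TreeMap<>();
            for (String token : sentence.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {
                    row.merge(token, 1, Integer::sum);
                }
            }
            if (!row.isEmpty()) {
                rows.add(row); // each row of the matrix is one sentence
            }
        }
        return rows;
    }
}
```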
|
| dragon.ir.index.sequence |
A package for sequence-sensitive indexing and for reading indexing results
Package Specification
Sequence indexing simply converts each word to an integer index and records the resulting sequence of indices. It may be used for any
sequence-sensitive application. Sequence indexing also has two modes, online mode and disk-based mode.
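A minimal sketch of the word-to-index conversion, assuming indices are assigned in order of first appearance (the toolkit may assign them differently). The class and method names are illustrative only.

```java
import java.util.*;

// Hypothetical sketch; class and method names are illustrative, not the toolkit's API.
class SequenceIndexSketch {
    private final Map<String, Integer> vocabulary = new HashMap<>();

    /** Maps each word to an integer index (assigned on first sight) and returns the index sequence. */
    int[] index(List<String> words) {
        int[] sequence = new int[words.size()];
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            Integer id = vocabulary.get(word);
            if (id == null) {
                id = vocabulary.size(); // next unused index
                vocabulary.put(word, id);
            }
            sequence[i] = id;
        }
        return sequence;
    }
}
```

Unlike a bag-of-words index, the returned array preserves word order, which is what sequence-sensitive applications need.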
|
| dragon.ir.kngbase |
A package for creating, storing and reading knowledge bases.
Package Specification
A knowledge base is essentially a matrix of semantic relationships between concepts. For example, a row may stand for a multiword phrase,
e.g. "space program", and a column may denote a single word. A cell in the matrix may then be interpreted as the probability of the phrase being translated
to the single word semantically.
|
| dragon.ir.query |
A package for structured queries and for converting natural language into structured queries.
Package Specification
TREC ad-hoc retrieval tasks usually describe the query topics in natural language. One can call the implemented query generators to convert
natural language descriptions into structured queries for search.
|
| dragon.ir.search |
A package for text retrieval and its evaluation
Package Specification
The toolkit provides a well-defined framework for text retrieval. The first step is to generate a query according to the topic descriptions
(such as TREC Topic files). Please refer to the package of dragon.ir.query for query generation. The second step is to create a searcher.
Since there are so many different retrieval models, the toolkit creates an interface called Smoother to hide the implementation details of
different models. Thus, the routine for searching is the same for different models. One can simply call a full rank searcher or a partial
rank searcher. The toolkit has implemented various language model smoothing methods as well as traditional probabilistic and vector space
models. Pseudo-relevance feedback and query expansion are two frequently used techniques for improving the effectiveness of IR. One can call
a feedback searcher or an expansion searcher to incorporate these two techniques, respectively. The details of the feedback approaches and
query expansion approaches are encapsulated in the implementation classes of two interfaces, Feedback and Expansion, respectively. To evaluate
the IR performance using TREC protocol, please call dragon.ir.search.evaluate.TrecEva.
|
| dragon.ir.search.evaluate |
Java implementation of TREC evaluation program.
Package Specification
|
| dragon.ir.search.expand |
Various query expansion approaches are included.
Package Specification
Query expansion techniques are applied before the actual search starts, and they often draw on prior knowledge. To develop a new query expansion
approach, one should implement an interface called QueryExpansion. To use query expansion for retrieval, one can simply call a query
expansion searcher (dragon.ir.search.QueryExpansionSearcher).
|
| dragon.ir.search.feedback |
Various pseudo-relevance feedback models.
Package Specification
Basically, a feedback returns a new query given the original query and an initial searcher. To develop a new feedback, one should implement
an interface called Feedback. To use pseudo-relevance feedback, one can simply call a feedback searcher (dragon.ir.search.FeedbackSearcher).
|
| dragon.ir.search.smooth |
Various term importance scoring algorithms, including language models, traditional probabilistic models and vector space models.
Package Specification
The toolkit has implemented various language model smoothing methods as well as traditional probabilistic and vector space models. To create
a new smoother, one should implement an interface called Smoother. Basically, given a document, a query term, and its frequency in the document,
the smoother should return a score to the searcher. For language models, the score is the probability of the document generating the term.
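To make the scoring contract concrete, the sketch below uses Jelinek-Mercer interpolation, one well-known language model smoothing method. The text above does not say which methods the toolkit includes, and the class, parameters and signature here are assumptions rather than the Smoother interface itself.

```java
// Hypothetical sketch; the toolkit's Smoother interface may differ in names and parameters.
class JelinekMercerSketch {
    private final double lambda; // weight of the collection model, in [0, 1]

    JelinekMercerSketch(double lambda) {
        this.lambda = lambda;
    }

    /**
     * Probability of the document generating the term:
     * (1 - lambda) * termFreq / docLength + lambda * collectionProb.
     */
    double score(int termFreq, int docLength, double collectionProb) {
        double docProb = docLength == 0 ? 0.0 : (double) termFreq / docLength;
        return (1.0 - lambda) * docProb + lambda * collectionProb;
    }
}
```

The collection term keeps the score above zero for query terms that do not occur in the document, which is the purpose of smoothing.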
|
| dragon.ir.summarize |
A package for generic multi-document summarization and its evaluation
Package Specification
To develop a summarizer, one should implement the interface called GenericMultiDocSummarizer. To evaluate the quality of a machine-generated summary,
one can call the ROUGE program. Summarization often deals with a small number of documents, so it is not necessary to index documents before
summarization. The summarizer can index the documents online by calling the online sentence indexer and then find the representative sentences.
|
| dragon.ir.topicmodel |
Various topic models such as LDA, Aspect Model and Simple Mixture Model
Package Specification
The toolkit implements several classic topic models such as LDA, Aspect Model and Simple Mixture Model. It also provides a program to output the models
to an Excel spreadsheet.
|
| dragon.matrix |
A package for storing, reading, writing and operating on matrices (both dense and sparse).
Package Specification
The dragon toolkit uses its own technique for sparse matrices. All sparse matrix classes should implement an interface called SparseMatrix. The
toolkit includes three implementations for matrices of different sizes: flat sparse matrix, super sparse matrix, and giant sparse matrix. The flat
sparse matrix loads all data into memory and is thus very fast, but is suitable for small datasets only. The super sparse matrix loads the index into memory
and caches a given number of rows. The giant sparse matrix loads nothing into memory except for caching the most recent row.
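For orientation, a fully in-memory sparse matrix can be laid out in the standard compressed sparse row form shown below. This is a generic sketch under that assumption; the toolkit describes its sparse storage as its own technique, so the actual layout and the SparseMatrix interface may differ.

```java
// Hypothetical sketch of an in-memory sparse matrix; not the toolkit's SparseMatrix interface.
class FlatSparseSketch {
    private final int[] rowStart;  // cells of row r are at positions rowStart[r] .. rowStart[r + 1] - 1
    private final int[] columns;   // column index of each stored cell, sorted within a row
    private final double[] values; // value of each stored cell

    FlatSparseSketch(int[] rowStart, int[] columns, double[] values) {
        this.rowStart = rowStart;
        this.columns = columns;
        this.values = values;
    }

    /** Returns the cell value, or 0 if the cell is not stored. */
    double get(int row, int col) {
        int pos = java.util.Arrays.binarySearch(columns, rowStart[row], rowStart[row + 1], col);
        return pos >= 0 ? values[pos] : 0.0;
    }
}
```

A disk-backed variant could keep only rowStart in memory and read the cells of a requested row on demand, which is roughly the trade-off the paragraph above describes between the three implementations.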
|
| dragon.matrix.factorize |
Functions related to matrix factorization such as SVD and NMF
Package Specification
|
| dragon.matrix.vector |
Defines the structure of vectors and implements some vector-related algorithms such as HITS and the Power Method.
Package Specification
|
| dragon.ml.seqmodel.crf |
A package for CRF-based training and labeling algorithms
Package Specification
The implementation of CRF is adapted from http://crf.sourceforge.net/. We only cleaned up and reorganized
the original package to make it more readable, so please give the credit to the original authors.
|
| dragon.ml.seqmodel.data |
The preparation of sequential data.
|
| dragon.ml.seqmodel.evaluate |
The evaluation program for sequence labeling tasks.
|
| dragon.ml.seqmodel.feature |
Various feature types for conditional random field applications.
Package Specification
A feature type is actually a feature generator. It can generate a given type of features from a fragment of a sequence. One should implement the interface called
FeatureType to create one's own feature types. To save time, one can extend the Abstract FeatureType instead of coding from scratch.
|
| dragon.ml.seqmodel.model |
Defines frequently used graphical sequential models
Package Specification
|
| dragon.nlp |
Defines data structures used for natural language processing
Package Specification
Document, Paragraph, Sentence and Word are designed for text parsing and tokenization during natural language processing. Three different
types of concepts are implemented for different uses. A token often corresponds to a single word. A phrase consists of multiple adjacent words.
A term denotes an ontological concept. Thus it can consist of multiple adjacent words, as a phrase does. Moreover, it often has a unique entry ID
defined in the domain ontology.
|
| dragon.nlp.compare |
Various comparators, e.g. for sorting by index, weight, frequency or name.
Package Specification
|
| dragon.nlp.extract |
Various concept extractors and relationship extractors.
Package Specification
The toolkit defines three types of concept extractors. The first is the token extractor, which extracts a sequence of individual words from a
sentence or a document. The second is the phrase extractor, which extracts multiword phrases from a sentence or a document. The phrase
extractor needs a phrase dictionary as input; the phrase dictionary can be built automatically by phrase tools such as Xtract. The third
is the term extractor, which extracts ontological terms from a sentence or a document.
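A dictionary-driven phrase extractor could work roughly as in the sketch below, which scans the tokens left to right and greedily takes the longest dictionary phrase at each position. Greedy longest match is an assumption made for illustration, and the class is hypothetical; the toolkit's extractor may match differently.

```java
import java.util.*;

// Hypothetical sketch; not the toolkit's extractor API.
class PhraseExtractorSketch {
    private final Set<String> dictionary; // phrases stored as space-joined words
    private final int maxWords;           // longest phrase length to try

    PhraseExtractorSketch(Set<String> dictionary, int maxWords) {
        this.dictionary = dictionary;
        this.maxWords = maxWords;
    }

    /** Scans left to right, emitting the longest dictionary phrase starting at each position. */
    List<String> extract(List<String> tokens) {
        List<String> phrases = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int matched = 0;
            // Try the longest candidate first; a multiword phrase has at least two words.
            for (int len = Math.min(maxWords, tokens.size() - i); len >= 2; len--) {
                String candidate = String.join(" ", tokens.subList(i, i + len));
                if (dictionary.contains(candidate)) {
                    phrases.add(candidate);
                    matched = len;
                    break;
                }
            }
            i += Math.max(matched, 1); // skip past a match, otherwise advance one token
        }
        return phrases;
    }
}
```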
|
| dragon.nlp.ontology |
A framework for ontology data structures as well as the implementation of UMLS and MeSH ontologies.
Package Specification
In the dragon toolkit, an ontology comprises three parts. The first part is the extraction of ontological concepts from texts.
The second is the semantic network. The third is the similarity metrics for pairs of ontological concepts.
|
| dragon.nlp.ontology.mesh |
An implementation of the MeSH ontology.
Package Specification
|
| dragon.nlp.ontology.umls |
The implementation of UMLS ontologies.
Package Specification
The extraction of UMLS concepts has two implementations. One is based on exact string matching. The other is based on approximate matching, which
yields much higher recall while retaining precision.
|
| dragon.nlp.tool |
Integration of external NLP tools such as taggers and stemmers.
Package Specification
|
| dragon.nlp.tool.lemmatiser |
A high-precision English lemmatiser adapted from WordNet.
Package Specification
The implementation is very similar to WordNet's. A large number of exceptions for medical words are included in addition to the WordNet exceptions.
|
| dragon.nlp.tool.xtract |
An implementation of Xtract which is for collocation extraction.
Package Specification
See the following paper for details of the algorithm.
Smadja, F., "Retrieving collocations from text: Xtract", Computational Linguistics, 1993, 19(1), pp. 143-177
|
| dragon.onlinedb |
A collection of Java files for textual corpus preparation.
Package Specification
The toolkit provides convenient ways to read articles from text collections in various formats. The interface called CollectionReader
defines the methods for extracting articles from collections. The interface called ArticleParser has a method, parse, which parses a sequence of
text into an article.
|
| dragon.onlinedb.amazon |
Functions related to downloading customer reviews from Amazon.com.
|
| dragon.onlinedb.bibtex |
Package for BibTeX format
Package Specification
|
| dragon.onlinedb.citeulike |
CiteULike Website Tag Query and Article Parser
Package Specification
|
| dragon.onlinedb.dm |
Article parsers for frequently used data mining collections such as 20-Newsgroup and Reuters collection.
|
| dragon.onlinedb.isi |
Article parsers for ISI-formatted Bibliometric data.
Package Specification
The entry page for downloading ISI bibliometric data: http://www.thomsonisi.com
|
| dragon.onlinedb.pubmed |
Functions related to downloading abstracts from PubMed
Package Specification
|
| dragon.onlinedb.searchengine |
Programming Interfaces for Search Engines such as Google
Package Specification
|
| dragon.onlinedb.trec |
Functions related to reading articles from TREC-style collections.
|
| dragon.util |
Frequently used miscellaneous functions for math, databases, formatting, HTTP, and file reading and writing.
|