Dragon Toolkit

Package Name    Comment
dragon.config Loads various resources and applications specified by an XML-based configuration file.

Package Specification

dragon.ir.classification A package for text classification; feature selection methods and an evaluation program are also included.

Package Specification

All classifiers should implement the Classifier interface. In addition, a classifier should support different feature selectors. In other words, a user can swap feature selectors without changing the code of the underlying classifier.
dragon.ir.classification.featureselection Feature Selectors for Text Classification or Other Related Applications.

Package Specification

To create your own feature selector, implement the interface called FeatureSelector. You can extend the Abstract Feature Selector instead of coding from scratch. A feature selector can be either supervised or unsupervised. Please read the paper by Yang and Pedersen for more details.

Yiming Yang and Jan O. Pedersen, A comparative study on feature selection in text categorization, Proceedings of ICML-97, 14th International Conference on Machine Learning, pp. 412-420
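As an illustration of a supervised feature selector, the chi-square statistic discussed in the paper above can be sketched as follows. This is a minimal standalone sketch, not the toolkit's FeatureSelector implementation; the class name ChiSquareSketch and the method signature are assumptions made for this example.

```java
// Hypothetical sketch, not part of the Dragon Toolkit API.
final class ChiSquareSketch {
    /**
     * Chi-square score of a term for a class, from a 2x2 contingency table:
     * a = documents in the class that contain the term
     * b = documents outside the class that contain the term
     * c = documents in the class that lack the term
     * d = documents outside the class that lack the term
     */
    static double chiSquare(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double denominator = (double) (a + c) * (b + d) * (a + b) * (c + d);
        if (denominator == 0.0) {
            return 0.0; // degenerate table: the term or the class never (or always) occurs
        }
        double diff = (double) a * d - (double) c * b;
        return n * diff * diff / denominator;
    }
}
```

Terms would then be ranked by this score (for example, by the maximum over classes) and the top-ranked terms kept as features.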
dragon.ir.classification.multiclass Reducing multi-class classification to binary classifiers.

Package Specification

Some classifiers, such as support vector machines (SVM), can only handle two-class classification problems. If the task involves multiple classes, the multi-class problem has to be reduced to a set of binary classifiers, whose results are then combined to predict the label of an example.
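One common reduction is one-vs-rest: train one binary scorer per class and predict the class whose scorer is most confident. The sketch below shows only the combination step. The class name OneVsRestSketch and the use of ToDoubleFunction as a stand-in for a trained binary classifier are assumptions for illustration; the toolkit may use a different reduction scheme.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch, not part of the Dragon Toolkit API.
final class OneVsRestSketch {
    /**
     * Combines binary scorers into a multi-class prediction.
     * binaryScorers.get(k) scores "class k versus all other classes";
     * the predicted label is the index of the highest score.
     */
    static int predict(List<ToDoubleFunction<double[]>> binaryScorers, double[] example) {
        int bestLabel = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int label = 0; label < binaryScorers.size(); label++) {
            double score = binaryScorers.get(label).applyAsDouble(example);
            if (score > bestScore) {
                bestScore = score;
                bestLabel = label;
            }
        }
        return bestLabel; // -1 only if no scorers were supplied
    }
}
```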
dragon.ir.clustering A package for document clustering and its evaluation

Package Specification

The toolkit implements two common clustering approaches, the agglomerative approach and the K-Means approach. These two approaches have many variants in terms of similarity measures. The toolkit encapsulates the details of different similarity measures into the implementations of two interfaces, Doc Distance and Cluster Model, respectively. The Doc Distance interface computes the distance between two documents and is designed for agglomerative clustering approaches. The Cluster Model interface computes the distance between a document and a cluster or the generative probability of a document by a cluster model. To evaluate cluster quality, please call dragon.ir.clustering.ClusteringEva.
dragon.ir.clustering.clustermodel Various cluster models for partitional clustering approaches

Package Specification

The Cluster Model interface computes the distance between a document and a cluster or the generative probability of a document by a cluster model.
dragon.ir.clustering.docdistance Various similarity metrics for pairs of documents.

Package Specification

The Doc Distance interface computes the distance between two documents and is designed for agglomerative clustering approaches.
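A document distance of this kind can be sketched with cosine distance over term-weight vectors. This is only an illustration: the class name CosineDistanceSketch and the Map-based document representation are assumptions, and the toolkit's own Doc Distance interface may look different.

```java
import java.util.Map;

// Hypothetical sketch, not part of the Dragon Toolkit API.
final class CosineDistanceSketch {
    /** Returns 1 - cosine similarity of two sparse term-weight vectors. */
    static double distance(Map<String, Double> docA, Map<String, Double> docB) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> entry : docA.entrySet()) {
            double weight = entry.getValue();
            normA += weight * weight;
            Double other = docB.get(entry.getKey());
            if (other != null) {
                dot += weight * other; // only shared terms contribute
            }
        }
        for (double weight : docB.values()) {
            normB += weight * weight;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 1.0; // an empty document is treated as maximally distant
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

An agglomerative clusterer would call such a function on every pair of documents and repeatedly merge the closest pair.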
dragon.ir.clustering.featurefilter Feature Selectors for Text Clustering.

Package Specification

To create your own feature selector, implement the interface called FeatureSelector. You can extend the Abstract Feature Selector instead of coding from scratch.
dragon.ir.index A package for document indexing and for reading indexing results.

Package Specification

There are two important interfaces. One is the indexer, which indexes the articles in a corpus. The other is the index reader, which reads out the indexing results. The dragon toolkit supports two modes of indexing. The first mode saves the indexing results into disk-based files and is usually suited to large collections. The second mode keeps all information in memory. It is very convenient and fast, but suits small collections only, e.g. collections for text summarization.
dragon.ir.index.sentence A package for sentence-level indexing and for reading indexing results
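The in-memory mode can be pictured as an inverted index held in a hash map, with the indexing and reading roles side by side. The sketch below is an assumption-laden illustration: the class InMemoryIndexSketch, its method names, and the simple regex tokenizer are all hypothetical and do not reflect the toolkit's indexer or index reader interfaces.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not part of the Dragon Toolkit API.
final class InMemoryIndexSketch {
    // term -> (document id -> term frequency)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    private int documentCount = 0;

    /** Indexer role: tokenizes the text, updates the postings, returns the new document id. */
    int addDocument(String text) {
        int docId = documentCount++;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, k -> new TreeMap<>()).merge(docId, 1, Integer::sum);
        }
        return docId;
    }

    /** Index reader role: frequency of a term in one document. */
    int termFrequency(String term, int docId) {
        Map<Integer, Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? 0 : docs.getOrDefault(docId, 0);
    }

    /** Index reader role: number of documents that contain the term. */
    int documentFrequency(String term) {
        Map<Integer, Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? 0 : docs.size();
    }
}
```

A disk-based mode would expose the same read operations but back them with files instead of the map.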

Package Specification

Sentence indexing is almost the same as basic indexing, except that it first splits an article into sentences and then treats each sentence as a document for indexing. In other words, in the doc-term matrix produced by sentence indexing, each document actually denotes a sentence. Sentence indexing can be used for sentence-level summarization. It also has two modes, online mode and disk-based mode.
dragon.ir.index.sequence A package for sequence-sensitive indexing and for reading indexing results
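The splitting step described above can be sketched as follows. The class name SentenceDocsSketch and the naive punctuation-based splitter are assumptions for illustration; a real sentence splitter would also need to handle abbreviations and similar cases.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of the Dragon Toolkit API.
final class SentenceDocsSketch {
    /**
     * Splits an article into sentences so that each sentence can be
     * handed to an indexer as if it were a separate document.
     * Naive rule: break at whitespace that follows '.', '!' or '?'.
     */
    static List<String> toSentenceDocuments(String article) {
        List<String> sentences = new ArrayList<>();
        for (String sentence : article.trim().split("(?<=[.!?])\\s+")) {
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```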

Package Specification

Sequence indexing simply converts each word to an integer-based index and records a sequence of indices. It may be used for any sequence-sensitive application. Sequence indexing also has two modes, online mode and disk-based mode.
dragon.ir.kngbase A package for knowledge base creation, storage, and reading.
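The word-to-integer conversion can be sketched as below. The class name SequenceIndexSketch, the whitespace tokenizer, and the first-seen id assignment are assumptions made for this example, not the toolkit's actual scheme.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of the Dragon Toolkit API.
final class SequenceIndexSketch {
    private final Map<String, Integer> vocabulary = new HashMap<>();

    /** Encodes a text as a sequence of word ids, assigning new ids in order of first appearance. */
    int[] encode(String text) {
        List<Integer> sequence = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            Integer id = vocabulary.get(token);
            if (id == null) {
                id = vocabulary.size();
                vocabulary.put(token, id);
            }
            sequence.add(id);
        }
        int[] result = new int[sequence.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = sequence.get(i);
        }
        return result;
    }

    int vocabularySize() {
        return vocabulary.size();
    }
}
```

Unlike a bag-of-words index, the returned array preserves word order, which is what sequence-sensitive applications need.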

Package Specification

A knowledge base is actually a matrix of semantic relationships between various concepts. For example, a row may stand for a multiword phrase, e.g. "space program", and a column may denote a single word. A cell in the matrix may then be interpreted as the probability of the phrase being semantically translated to the single word.
dragon.ir.query A package for structured queries as well as the conversion from natural language to structured queries.

Package Specification

TREC ad-hoc retrieval tasks usually describe the query topics in natural language. One can call the implemented query generators to convert natural language descriptions into structured queries for search.
dragon.ir.search A package for text retrieval and its evaluation

Package Specification

The toolkit provides a well-defined framework for text retrieval. The first step is to generate a query according to the topic descriptions (such as TREC topic files). Please refer to the dragon.ir.query package for query generation. The second step is to create a searcher. Since there are many different retrieval models, the toolkit defines an interface called Smoother to hide the implementation details of the different models. Thus, the search routine is the same for all models. One can simply call a full rank searcher or a partial rank searcher. The toolkit has implemented various language model smoothing methods as well as traditional probabilistic and vector space models. Pseudo-relevance feedback and query expansion are two frequently used techniques for improving the effectiveness of IR. One can call a feedback searcher or an expansion searcher to incorporate these two techniques, respectively. The details of the feedback and query expansion approaches are encapsulated in the implementation classes of two interfaces, Feedback and Expansion, respectively. To evaluate IR performance using the TREC protocol, please call dragon.ir.search.evaluate.TrecEva.
dragon.ir.search.evaluate Java implementation of TREC evaluation program.

Package Specification

dragon.ir.search.expand Various query expansion approaches are included.

Package Specification

Query expansion techniques are applied before the real search starts. Prior knowledge is often used for query expansion. To develop a new query expansion approach, one should implement an interface called QueryExpansion. To use query expansion for retrieval, one can simply call a query expansion searcher (dragon.ir.search.QueryExpansionSearcher).
dragon.ir.search.feedback Various pseudo-relevance feedback models.

Package Specification

Basically, a feedback returns a new query given the original query and an initial searcher. To develop a new feedback, one should implement an interface called Feedback. To use pseudo-relevance feedback, one can simply call a feedback searcher (dragon.ir.search.FeedbackSearcher).
dragon.ir.search.smooth Various term importance scoring algorithms including language models, traditional probabilistic and vector space models.

Package Specification

The toolkit has implemented various language model smoothing methods as well as traditional probabilistic and vector space models. To create a new smoother, one should implement an interface called Smoother. Basically, given a document, a query term, and its frequency in the document, the smoother should return a score to the searcher. For language models, the score means the probability of the document generating the term.
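To make the language model case concrete, here is a minimal sketch of Jelinek-Mercer smoothing, one well-known smoothing method. Whether the toolkit implements this exact form is not stated here; the class name JelinekMercerSketch and the score signature are assumptions and do not mirror the Smoother interface.

```java
// Hypothetical sketch, not part of the Dragon Toolkit API.
final class JelinekMercerSketch {
    private final double lambda; // weight given to the collection model, in [0, 1]

    JelinekMercerSketch(double lambda) {
        if (lambda < 0.0 || lambda > 1.0) {
            throw new IllegalArgumentException("lambda must be in [0, 1]");
        }
        this.lambda = lambda;
    }

    /**
     * Probability of the document generating the term:
     * (1 - lambda) * tf / docLength + lambda * p(term | collection).
     */
    double score(int termFrequencyInDoc, int docLength, double collectionProbability) {
        double maximumLikelihood =
                docLength > 0 ? (double) termFrequencyInDoc / docLength : 0.0;
        return (1.0 - lambda) * maximumLikelihood + lambda * collectionProbability;
    }
}
```

Mixing in the collection probability keeps the score above zero for query terms that do not occur in the document, which is the main reason for smoothing.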
dragon.ir.summarize A package for generic multi-document summarization and its evaluation

Package Specification

To develop a summarizer, one should implement the interface called GenericMultiDocSummarizer. To evaluate the quality of a machine-generated summary, one can call the ROUGE program. Summarization often deals with a small number of documents, so it is not required to index documents beforehand. The summarizer can index the documents online by calling the online sentence indexer and then find representative sentences.
dragon.ir.topicmodel Various topic models such as LDA, Aspect Model and Simple Mixture Model

Package Specification

The toolkit implements several classic topic models such as LDA, Aspect Model and Simple Mixture Model. It also provides a program to output the models to an Excel spreadsheet.
dragon.matrix A package for matrix (both dense and sparse) storage, reading, writing and operations.

Package Specification

The dragon toolkit uses its own technique for sparse matrices. All sparse matrix classes should implement an interface called SparseMatrix. The toolkit includes three implementations, flat sparse matrix, super sparse matrix, and giant sparse matrix, for matrices of different sizes. The flat sparse matrix loads all data into memory and is thus very fast, but suits small datasets only. The super sparse matrix loads the index into memory and caches a given number of rows. The giant sparse matrix loads nothing into memory except a cache of the most recent row.
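For readers unfamiliar with row-oriented sparse storage, the fully in-memory case can be illustrated with the generic compressed sparse row (CSR) layout. This is a textbook layout chosen for illustration, not the toolkit's own technique; the class CsrMatrixSketch and its methods are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch, not part of the Dragon Toolkit API.
final class CsrMatrixSketch {
    private final int[] rowStart;   // rowStart[r]..rowStart[r + 1] delimits row r; length = rows + 1
    private final int[] columnIndex; // column of each stored cell, sorted within each row
    private final double[] values;   // value of each stored cell

    CsrMatrixSketch(int[] rowStart, int[] columnIndex, double[] values) {
        this.rowStart = rowStart.clone();
        this.columnIndex = columnIndex.clone();
        this.values = values.clone();
    }

    /** Returns the cell value, or 0.0 if the cell is not stored. */
    double get(int row, int column) {
        int position = Arrays.binarySearch(columnIndex, rowStart[row], rowStart[row + 1], column);
        return position >= 0 ? values[position] : 0.0;
    }

    /** Number of stored (non-zero) cells in a row. */
    int nonZerosInRow(int row) {
        return rowStart[row + 1] - rowStart[row];
    }
}
```

The row offsets play the role of an index: a variant that keeps only the offsets in memory and reads row contents from disk on demand would resemble the trade-off described for the larger matrix types.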
dragon.matrix.factorize Functions related to matrix factorization such as SVD and NMF

Package Specification

dragon.matrix.vector Defines the structure of vectors and implements some vector-related algorithms such as HITS and the Power Method.
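As background, the power method finds the dominant eigenvector of a matrix by repeated multiplication and normalization. The following is a minimal dense-matrix sketch; the class name PowerMethodSketch, the fixed iteration count, and the uniform starting vector are assumptions and do not describe the toolkit's implementation.

```java
import java.util.Arrays;

// Hypothetical sketch, not part of the Dragon Toolkit API.
final class PowerMethodSketch {
    /** Approximates the dominant eigenvector (unit length) of a square matrix. */
    static double[] dominantEigenvector(double[][] matrix, int iterations) {
        int n = matrix.length;
        double[] vector = new double[n];
        Arrays.fill(vector, 1.0 / Math.sqrt(n)); // uniform unit-length start
        for (int iteration = 0; iteration < iterations; iteration++) {
            double[] next = new double[n];
            double sumOfSquares = 0.0;
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    next[i] += matrix[i][j] * vector[j];
                }
                sumOfSquares += next[i] * next[i];
            }
            double norm = Math.sqrt(sumOfSquares);
            if (norm == 0.0) {
                break; // the matrix mapped the vector to zero; keep the last estimate
            }
            for (int i = 0; i < n; i++) {
                next[i] /= norm;
            }
            vector = next;
        }
        return vector;
    }
}
```

HITS uses the same idea, alternating updates of hub and authority scores until they stabilize.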

Package Specification

dragon.ml.seqmodel.crf A package for CRF-based training and labeling algorithms

Package Specification

The implementation of CRF is adapted from http://crf.sourceforge.net/. We only cleaned up and reorganized the original package to make it more readable, so please give credit to the original authors.
dragon.ml.seqmodel.data The preparation of sequential data.
dragon.ml.seqmodel.evaluate The evaluation program for sequence labeling tasks.
dragon.ml.seqmodel.feature Various feature types for conditional random field applications.

Package Specification

A feature type is actually a feature generator. It can generate a given type of feature from a fragment of a sequence. To create your own feature types, implement the interface called FeatureType. To save time, you can extend the Abstract FeatureType instead of coding from scratch.
dragon.ml.seqmodel.model Define frequently used graphical sequential models

Package Specification

dragon.nlp Define data structures used for natural language processing

Package Specification

Document, Paragraph, Sentence and Word are designed for text parsing and tokenization during natural language processing. Three different types of concepts are implemented for different uses. A token often corresponds to a single word. A phrase consists of multiple adjacent words. A term denotes an ontological concept. Thus it can consist of multiple adjacent words, as a phrase does. Moreover, it often has a unique entry ID defined in the domain ontology.
dragon.nlp.compare Various comparators such as sorting according to index, weight, frequency and name.

Package Specification

dragon.nlp.extract Various concept extractors and relationship extractors.

Package Specification

The toolkit defines three types of concept extractors. The first is token extractor, which extracts a sequence of individual words from a sentence or a document. The second is phrase extractor, namely extracting multiword phrases from a sentence or a document. The phrase extractor needs a phrase dictionary as input; the phrase dictionary could be automatically built by phrase tools such as Xtract. The third is term extractor, which extracts ontological terms from a sentence or a document.
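Dictionary-driven phrase extraction can be sketched as a greedy longest-match scan over the tokens of a sentence. The class PhraseExtractorSketch, the lowercase-dictionary convention, and the maxWords parameter are assumptions made for this example, not the toolkit's phrase extractor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not part of the Dragon Toolkit API.
final class PhraseExtractorSketch {
    /**
     * Extracts multiword phrases found in the dictionary.
     * The dictionary is assumed to hold lowercase phrases with single spaces.
     * At each position the longest matching phrase (up to maxWords words) wins.
     */
    static List<String> extract(String sentence, Set<String> phraseDictionary, int maxWords) {
        String[] tokens = sentence.toLowerCase(Locale.ROOT).split("\\W+");
        List<String> phrases = new ArrayList<>();
        int position = 0;
        while (position < tokens.length) {
            int matchedLength = 0;
            int longest = Math.min(maxWords, tokens.length - position);
            for (int length = longest; length >= 2; length--) {
                String candidate =
                        String.join(" ", Arrays.copyOfRange(tokens, position, position + length));
                if (phraseDictionary.contains(candidate)) {
                    phrases.add(candidate);
                    matchedLength = length;
                    break;
                }
            }
            position += matchedLength > 0 ? matchedLength : 1; // skip past a match, else advance one word
        }
        return phrases;
    }
}
```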
dragon.nlp.ontology A framework for ontology data structures as well as the implementation of UMLS and MeSH ontologies.

Package Specification

In the dragon toolkit, an ontology comprises three parts. The first part is the extraction of ontological concepts from texts. The second is the semantic network. The third is the similarity metrics for pairs of ontological concepts.
dragon.nlp.ontology.mesh An implementation of MeSH ontology.

Package Specification

dragon.nlp.ontology.umls The implementation of UMLS ontologies.

Package Specification

The extraction of UMLS concepts has two implementations. One is based on exact string match. The other is based on approximate match, which yields much higher recall while retaining precision.
dragon.nlp.tool Integration of external NLP tools such as taggers and stemmers.

Package Specification

dragon.nlp.tool.lemmatiser A high-precision English lemmatiser adapted from WordNet.

Package Specification

The implementation is very similar to WordNet. A large number of exceptions for medical words are included in addition to the WordNet exceptions.
dragon.nlp.tool.xtract An implementation of Xtract which is for collocation extraction.

Package Specification

See the paper below for details of the algorithm.

Smadja, F., "Retrieving collocations from text: Xtract", Computational Linguistics, 1993, 19(1), pp. 143-177
dragon.onlinedb A collection of java files for textual corpus preparation.

Package Specification

The toolkit provides convenient ways to read articles from text collections in various formats. The interface called CollectionReader defines the methods for extracting articles from collections. The interface called ArticleParser has a method, parse, which parses a sequence of text into an article.
dragon.onlinedb.amazon Functions related to downloading customer reviews from Amazon.com.
dragon.onlinedb.bibtex Package for BibTeX format

Package Specification

dragon.onlinedb.citeulike CiteULike Website Tag Query and Article Parser

Package Specification

dragon.onlinedb.dm Article parsers for frequently used data mining collections such as 20-Newsgroup and Reuters collection.
dragon.onlinedb.isi Article parsers for ISI-formatted Bibliometric data.

Package Specification

The entry page for downloading ISI bibliometric data: http://www.thomsonisi.com
dragon.onlinedb.pubmed Functions related to downloading abstracts from PubMed

Package Specification

dragon.onlinedb.searchengine Programming Interfaces for Search Engines such as Google

Package Specification

dragon.onlinedb.trec Functions related to reading articles from TREC-styled collections.
dragon.util Frequently used miscellaneous functions for math, database, format, http, file reading and writing.