Skip to Content

Pentaho 3.2 Data Integration: Beginner's Guide Table Of Contents


Table of Contents

Preface
Chapter 1: Getting started with Pentaho Data Integration
Chapter 2: Getting Started with Transformations
Chapter 3: Basic data manipulation
Chapter 4: Controlling the Flow of Data
Chapter 5: Transforming Your Data with JavaScript Code and the JavaScript Step
Chapter 6: Transforming the Row Set
Chapter 7: Validating Data and Handling Errors
Chapter 8: Working with Databases
Chapter 9: Performing Advanced Operations with Databases
Chapter 10: Creating Basic Task Flows
Chapter 11: Creating Advanced Transformations and Jobs
Chapter 12: Developing and Implementing a Simple Datamart
Chapter 13: Taking it Further
Appendix A: Working with Repositories
Appendix B: Pan and Kitchen: Launching Transformations and Jobs from the Command Line
Appendix C: Quick Reference: Steps and Job Entries
Appendix D: Spoon Shortcuts
Appendix E: Introducing PDI 4 Features
Appendix F: Pop Quiz Answers
Index

  • Chapter 1: Getting started with Pentaho Data Integration
    • Pentaho Data Integration and Pentaho BI Suite
      • Exploring the Pentaho Demo
    • Pentaho Data Integration
      • Using PDI in real world scenarios
        • Loading data warehouses or data marts
        • Integrating data
        • Data cleansing
        • Migrating information
        • Exporting data
        • Integrating PDI using Pentaho BI
    • Installing PDI
    • Time for action – installing PDI
    • Launching the PDI graphical designer: Spoon
    • Time for action – starting and customizing Spoon
      • Spoon
        • Setting preferences in the Options window
        • Storing transformations and jobs in a repository
      • Creating your first transformation
    • Time for action – creating a hello world transformation
      • Directing the Kettle engine with transformations
      • Exploring the Spoon interface
      • Running and previewing the transformation
  • Time for action – running and previewing the hello_world transformation
  • Installing MySQL
  • Time for action – installing MySQL on Windows
  • Time for action – installing MySQL on Ubuntu
  • Summary
  • Chapter 2: Getting Started with Transformations
    • Reading data from files
    • Time for action – reading results of football matches from files
      • Input files
        • Input steps
      • Reading several files at once
    • Time for action – reading all your files at a time using a single Text file input step
    • Time for action – reading all your files at a time using a single Text file input step and regular expressions
      • Regular expressions
    • Grids
  • Sending data to files
  • Time for action – sending the results of matches to a plain file
    • Output files
      • Output steps
    • Some data definitions
      • Rowset
      • Streams
    • The Select values step
  • Getting system information
  • Time for action – updating a file with news about examinations
    • Getting information by using Get System Info step
    • Data types
      • Date fields
      • Numeric fields
    • Running transformations from a terminal window
  • Time for action – running the examination transformation from a terminal window
  • XML files
  • Time for action – getting data from an XML file with information about countries
    • What is XML
      • PDI transformation files
    • Getting data from XML files
      • XPath
      • Configuring the Get data from XML step
    • Kettle variables
      • How and when you can use variables
  • Summary
  • Chapter 3: Basic data manipulation
    • Basic calculations
    • Time for action – reviewing examinations by using the Calculator step
      • Adding or modifying fields by using different PDI steps
        • The Calculator step
        • The Formula step
    • Time for action – reviewing examinations by using the Formula step
    • Calculations on groups of rows
    • Time for action – calculating World Cup statistics by grouping data
      • Group by step
    • Filtering
    • Time for action – counting frequent words by filtering
      • Filtering rows using the Filter rows step
    • Looking up data
    • Time for action – finding out which language people speak
      • The Stream lookup step
    • Summary
  • Chapter 4: Controlling the Flow of Data
    • Splitting streams
    • Time for action – browsing new PDI features by copying a dataset
      • Copying rows
      • Distributing rows
    • Time for action – assigning tasks by distributing
    • Splitting the stream based on conditions
    • Time for action – assigning tasks by filtering priorities with the Filter rows step
      • PDI steps for splitting the stream based on conditions
    • Time for action – assigning tasks by filtering priorities with the Switch/ Case step
    • Merging streams
    • Time for action – gathering progress and merging all together
      • PDI options for merging streams
    • Time for action – giving priority to Bouchard by using Append Stream
    • Summary
  • Chapter 5: Transforming Your Data with JavaScript Code and the JavaScript Step
    • Doing simple tasks with the JavaScript step
    • Time for action – calculating scores with JavaScript
      • Using the JavaScript language in PDI
      • Inserting JavaScript code using the Modified Java Script Value step
        • Adding fields
        • Modifying fields
        • Turning on the compatibility switch
      • Testing your code
    • Time for action – testing the calculation of averages
      • Testing the script using the Test script button
  • Enriching the code
  • Time for action – calculating flexible scores by using variables
    • Using named parameters
    • Using the special Start, Main, and End scripts
    • Using transformation predefined constants
  • Reading and parsing unstructured files
  • Time for action – changing a list of house descriptions with JavaScript
    • Looking at previous rows
  • Avoiding coding by using purpose-built steps
  • Summary
  • Chapter 6: Transforming the Row Set
    • Converting rows to columns
    • Time for action – enhancing a films file by converting rows to columns
      • Converting row data to column data by using the Row denormalizer step
      • Aggregating data with a Row denormalizer step
    • Time for action – calculating total scores by performances by country
      • Using Row denormalizer for aggregating data
    • Normalizing data
    • Time for action – enhancing the matches file by normalizing the dataset
      • Modifying the dataset with a Row Normalizer step
      • Summarizing the PDI steps that operate on sets of rows
    • Generating a custom time dimension dataset by using Kettle variables
    • Time for action – creating the time dimension dataset
      • Getting variables
    • Time for action – getting variables for setting the default starting date
      • Using the Get Variables step
  • Summary
  • Chapter 7: Validating Data and Handling Errors
    • Capturing errors
    • Time for action – capturing errors while calculating the age of a film
      • Using PDI error handling functionality
      • Aborting a transformation
    • Time for action – aborting when there are too many errors
      • Aborting a transformation using the Abort step
    • Fixing captured errors
  • Time for action – treating errors that may appear
    • Treating rows coming to the error stream
  • Avoiding unexpected errors by validating data
  • Time for action – validating genres with a Regex Evaluation step
    • Validating data
  • Time for action – checking films file with the Data Validator
    • Defining simple validation rules using the Data Validator
  • Cleansing data
  • Summary
  • Chapter 8: Working with Databases
    • Introducing the Steel Wheels sample database
      • Connecting to the Steel Wheels database
    • Time for action – creating a connection with the Steel Wheels database
      • Connecting with Relational Database Management Systems
    • Exploring the Steel Wheels database
  • Time for action – exploring the sample database
    • A brief word about SQL
    • Exploring any configured database with the PDI Database explorer
  • Querying a database
  • Time for action – getting data about shipped orders
    • Getting data from the database with the Table input step
    • Using the SELECT statement for generating a new dataset
      • Making flexible queries by using parameters
  • Time for action – getting orders in a range of dates by using parameters
    • Making flexible queries by using Kettle variables
  • Time for action – getting orders in a range of dates by using variables
  • Sending data to a database
  • Time for action – loading a table with a list of manufacturers
    • Inserting new data into a database table with the Table output step
    • Inserting or updating data by using other PDI steps
  • Time for action – inserting new products or updating existent ones
  • Time for action – testing the update of existing products
    • Inserting or updating data with the Insert/Update step
  • Eliminating data from a database
  • Time for action – deleting data about discontinued items
    • Deleting records of a database table with the Delete step
  • Summary
  • Chapter 9: Performing Advanced Operations with Databases
    • Preparing the environment
    • Time for action – populating the Jigsaw database
      • Exploring the Jigsaw database model
    • Looking up data in a database
      • Doing simple lookups
    • Time for action – using a Database lookup step to create a list of products to buy
      • Looking up values in a database with the Database lookup step
    • Doing complex lookups
  • Time for action – using a Database join step to create a list of suggested products to buy
    • Joining data from the database to the stream data by using a Database join step
  • Introducing dimensional modeling
  • Loading dimensions with data
  • Time for action – loading a region dimension with a Combination lookup/update step
  • Time for action – testing the transformation that loads the region dimension
    • Describing data with dimensions
      • Loading Type I SCD with a Combination lookup/update step
    • Keeping a history of changes
  • Time for action – keeping a history of product changes with the Dimension lookup/update step
  • Time for action – testing the transformation that keeps a history of product changes
    • Keeping an entire history of data with a Type II slowly changing dimension
  • Loading Type II SCDs with the Dimension lookup/update step
  • Summary
  • Chapter 10: Creating Basic Task Flows
    • Introducing PDI jobs
    • Time for action – creating a simple hello world job
      • Executing processes with PDI jobs
        • Using Spoon to design and run jobs
      • Using the transformation job entry
    • Receiving arguments and parameters in a job
    • Time for action – customizing the hello world file with arguments and parameters
      • Using named parameters in jobs
    • Running jobs from a terminal window
    • Time for action – executing the hello world job from a terminal window
    • Using named parameters and command-line arguments in transformations
    • Time for action – calling the hello world transformation with fixed arguments and parameters
    • Deciding between the use of a command-line argument and a named parameter
    • Running job entries under conditions
    • Time for action – sending a sales report and warning the administrator if something is wrong
      • Changing the flow of execution on the basis of conditions
      • Creating and using a file results list
    • Summary
  • Chapter 11: Creating Advanced Transformations and Jobs
    • Enhancing your processes with the use of variables
    • Time for action – updating a file with news about examinations by setting a variable with the name of the file
      • Setting variables inside a transformation
    • Enhancing the design of your processes
    • Time for action – generating files with top scores
      • Reusing part of your transformations
    • Time for action – calculating the top scores with a subtransformation
      • Creating and using subtransformations
    • Creating a job as a process flow
  • Time for action – splitting the generation of top scores by copying and getting rows
    • Transferring data between transformations by using the copy /get rows mechanism
  • Nesting jobs
  • Time for action – generating the files with top scores by nesting jobs
    • Running a job inside another job with a job entry
    • Understanding the scope of variables
  • Iterating jobs and transformations
  • Time for action – generating custom files by executing a transformation for every input row
    • Executing for each row
  • Summary
  • Chapter 12: Developing and Implementing a Simple Datamart
    • Exploring the sales datamart
      • Deciding the level of granularity
    • Loading the dimensions
    • Time for action – loading dimensions for the sales datamart
    • Extending the sales datamart model
    • Loading a fact table with aggregated data
    • Time for action – loading the sales fact table by looking up dimensions
      • Getting the information from the source with SQL queries
      • Translating the business keys into surrogate keys
        • Obtaining the surrogate key for a Type I SCD
        • Obtaining the surrogate key for a Type II SCD
        • Obtaining the surrogate key for the Junk dimension
        • Obtaining the surrogate key for the Time dimension
    • Getting facts and dimensions together
    • Time for action – loading the fact table using a range of dates obtained from the command line
    • Time for action – loading the sales star
    • Getting rid of administrative tasks
    • Time for action – automating the loading of the sales datamart
    • Summary
  • Chapter 13: Taking it Further
    • PDI best practices
    • Getting the most out of PDI
      • Extending Kettle with plugins
      • Overcoming real world risks with some remote execution
      • Scaling out to overcome bigger risks
    • Integrating PDI and the Pentaho BI suite
      • PDI as a process action
      • PDI as a datasource
      • More about the Pentaho suite
    • PDI Enterprise Edition and Kettle Developer Support
    • Summary
  • Appendix A: Working with Repositories
    • Creating a repository
    • Time for action – creating a PDI repository
      • Creating repositories to store your transformations and jobs
    • Working with the repository storage system
    • Time for action – logging into a repository
      • Logging into a repository by using credentials
        • Defining repository user accounts
      • Creating transformations and jobs in repository folders
      • Creating database connections, partitions, servers, and clusters
      • Backing up and restoring a repository
    • Examining and modifying the contents of a repository with the Repository explorer
    • Migrating from a file-based system to a repository-based system and vice-versa
    • Summary
  • Appendix E: Introducing PDI 4 Features
    • Agile BI
    • Visual improvements for designing transformations and jobs
      • Experiencing the mouse-over assistance
    • Time for action – creating a hop with the mouse-over assistance
      • Using the mouse-over assistance toolbar
    • Experiencing the sniff-testing feature
    • Experiencing the job drill-down feature
    • Experiencing even more visual changes
  • Enterprise features
  • Summary
  • Appendix F: Pop Quiz Answers
    • Chapter 1
      • PDI data sources
      • PDI prerequisites
      • PDI basics
    • Chapter 2
      • formatting data
    • Chapter 3
      • concatenating strings
    • Chapter 4
      • data movement (copying and distributing)
      • splitting a stream
    • Chapter 5
      • finding the seven errors
    • Chapter 6
      • using Kettle variables inside transformations
    • Chapter 7
      • PDI error handling
    • Chapter 8
      • defining database connections
      • database datatypes versus PDI datatypes
      • Insert/Update step versus Table Output/Update steps
      • filtering the first 10 rows
    • Chapter 9
      • loading slowly changing dimensions
      • loading type III slowly changing dimensions
    • Chapter 10
      • defining PDI jobs
    • Chapter 11
      • using the Add sequence step
      • deciding the scope of variables
    • Chapter 12
      • modifying a star model and loading the star with PDI
    • Chapter 13
      • remote execution and clustering

Awards Voting Nominations Previous Winners
Judges Open Source CMS Hall Of Fame CMS Most Promising Open Source Project Open Source E-Commerce Applications Open Source JavaScript Library Open Source Graphics Software
Resources
Open Source CMS Hall Of Fame CMS Most Promising Open Source Project Open Source E-Commerce Applications Open Source JavaScript Library Open Source Graphics Software
Open Source Content Management Customer Relationship Management e-Commerce e-Learning Java Linux Servers Networking & Telephony PHP Web Graphics & Video Web Development
Enterprise BPEL Microsoft Oracle SOA Web Services
Other Packt Books .Net Web Graphics & Video Beginner Guides Cookbooks