SQL Server provides several "standard" techniques by which
to read and write to files but, just occasionally, they aren't quite up to
the task at hand – especially when dealing with large strings or relatively
unstructured data. Phil Factor provides some T-SQL stored procedures, based
on use of the FileSystem Object (FSO), that may just get you out of a tight
corner… For more techniques, see
The TSQL of Text Files
SQL Server has never been short of ways to read from and write to files
and it is always better to use the standard techniques provided by SQL
Server where possible. However, most of them are really designed for reading
and writing tabular data and aren't always trouble-free when used with large
strings or relatively unstructured data.
For reading tabular data from a file, whether character-delimited or
binary, there is nothing that replaces the hoary old Bulk Copy Program (BCP),
which underlies more esoteric methods such as Bulk Insert. It is possible to
read text-based delimited files with ODBC, simple files can be read and
written-to using xp_cmdshell, and you will find that OSQL is
wonderful for writing results to file, but occasionally I've found I need to
do more than this.
Thankfully, when armed with OLE Automation and the FileSystem Object (FSO),
all sorts of things are possible. The FileSystem Object was introduced into
Windows to provide a single common file-system COM interface for scripting
languages. It provides a number of handy services that can be accessed from
TSQL. In this article, I provide examples of stored procedures that use this
interface to allow you to:
- Read lines of text from a file
- Read a file into a SQL Server data type
- Write out to a file
- Get file details
- Tidy up XML, XHTML or HTML code
I'll provide a few details on the FSO along the way, but let's start with
examples of some of these procedures in action. You'll need to enable OLE
Automation on your test server in order to follow along.
Reading lines from a file
I have often pined for a simple function that will read information a
line at a time, and to present to me a 'fake table' where each line of text
is a row, with a primary key based on the line number. With such a function,
one can then do one's own parsing and checking of data.
Well, here it is. Create the
uftReadFileAsTable stored procedure in your test database, and try
it out with something like:
Select
line from
Dbo.uftReadfileAsTable('MyPath','MyFileName')
where line
not like
'#%'
--where line doesnt begin with a hash
Just fill in an existing file name and path to the file you wish to read,
instead of 'MyPath' and 'MyFileName', and away you go.
This is a method I use for reading web logs and gathering usage
statistics. It is also useful where the data feed has to be validated before
one can parse it into the final SQL data format.
Reading a file into a SQL Server data type
This is all very well, but how about something that reads a file in one
gulp into a varchar or XML datatype? Perhaps you need to extract data from
HTML, XHTML or some other format. Create the
ufsReadfileAsString procedure and try something like…
Select
dbo.ufsReadfileAsString
('MyPath','MyFileName')
Writing out a file
No problem – just create
spWriteStringToFile and try:
execute
spWriteStringToFile 'This article describes
how to fully access the
local filesystem from SQL Server. It shows a
way of reading and writing data to file, and
accessing the details of the server's
filesystem using OLE Automation to access
the filesystem object'
,
'MyPath','MyFileName'
The path you use instead of 'MyPath' will have to exist, in this example.
Getting file details
If you need to find out the attributes of a particular file, then try out
spFileDetails:
Execute
spFileDetails 'c:\autoexec.bat'
More on the FSO and OLE Automation
There are all sorts of things you can do with the FSO. You can copy
files, move files, create folders, delete files, get the names of special
directories, and so on. This may sound esoteric, but sometimes the simplest
backup procedures require such operations. I pause here just to give some
minimum necessary background details on FSO, and on OLE Automation.
The Filesystem Object, on which all the stored procedures in this article
rely, is a component of the scripting runtime library. As it is a COM
object, it is readily accessible from the set or stored procedures built-in
to SQL Server called the OLE Automation Stored Procedures, These allow a
connection, through T-SQL commands, to create and use COM-based objects.
OLE Automation
Each OLE Automation stored procedure returns an integer code that is the
HRESULT returned by the underlying OLE Automation operation. When an
error happens, the HRESULT returns an error code rather than the 0
which signals success, which then has to be turned into a meaningful error
message with the sp_OAGetErrorInfo procedure.
The full set of OLE automation procedures are:
- sp_OACreate
- sp_OADestroy
- sp_OAGetProperty
- sp_OASetProperty
- sp_OAMethod
- sp_OAGetErrorInfo
- sp_OAStop
The attached files have many examples on how to use these procedures and
they are well covered on Books on Line, so I won't bother to repeat the
details here.
The FileSystem Object
The FileSystem Object is a COM object that is provided primarily for
scripting languages such as Jscript, ASP, and VBscript. It provides
everything that is necessary for file or folder operations. It also allows
for most simple file reading and writing operations, and to gather
information about drives, directories and files.
The FSO allows you to manipulate and shred path names, to copy, move,
delete or create files or folders. You can check for their existence or get
a number of objects and collections. These are…
This provides information about a logical drive (physical, or
network) attached to the system, such as its share name, type, total
space, and how much room is available
This will give a list of the physical or logical drives attached to
the system and includes all drives, regardless of type. (Removable-media
drives need not have media inserted for them to appear in this
collection).
This will allow you to create, delete, or move a file, and to find
out the file name, path, and other properties. It has a method to open a
file as a text stream.
This provides a list of all files contained within a folder.
The Folder object has methods that allow you to create, delete, or
move folders. Also there are several properties that return folder
names, creation date, paths and so on.
This provides a list of all the folders within a Folder.
This object is extraordinarily useful as it provides the means to
read and write to text files.
All of these methods, properties and collections are readily available
from Books on Line and
W3Schools.
I also find Dino Esposito's book 'Windows Script Host' (Wrox 1999) very
useful
The HTML Tidier
We end up with a rather more esoteric procedure that we will throw into
the pot just to show what can be achieved with FSO. This is a procedure that
corrects and tidies up any XML, XHTML or HTML. It will take any file and
format it prettily and check for any errors. It is an example which
illustrates several techniques and shows how one can integrate a DOS
application.
This stored procedure requires the latest version of HTMLTidy
(March 2007) to be installed on your server. I put it in System32, as
it is a simple MSDOS.EXE file.
HTML Tidy was written by Dave Raggett, who works on assignment to the
World Wide Web Consortium where he is the W3C lead for Voice and Multimodal.
It has become the standard way of checking XML and HTML files for syntax. It
is a remarkable tool that combines a Lint and a prettifier. It has now been
adopted by an enthusiastic band of people who work through W3C and
SourceForge to maintain it. It also requires you to set the configuration
that I have embedded in the source of the stored procedure, but which you
can supply as a parameter to over-ride the defaults I have chosen. Before
twiddling with these parameters you'd be best advised to check the
documentation at
http://tidy.sourceforge.net/docs/quickref.html
Once everything is in place, all you have to do is to supply the input as
a string and the
spHTMLtidy stored procedure will put the tidied version in the
output string and fill another string with all the warning and error
information.
Declare
@ProcessedFile varchar(max)
Declare @Messages
varchar(max)
Declare @UnprocessedFile
varchar(max)
Select @UnprocessedFile
= '<html>
<head <title>awful HTML</head>
<body>
<p>This is my first real HTML<br><div>
<table><tr><th>What a mess</tr>
<tr><td></td>this is supposed to be in the table</tr><td></td>
<tr> this is awful<td>and this is a line</td></tr>
</table>
</body>
</html>
'
Execute spHTMLtidy
@UnprocessedFile, @processedFile
output,
@Messages output
Select [after]=@ProcessedFile
This will produce a tidied up XHTML like this…
<?xml
version="1.0"
encoding="us-ascii"?>
<!DOCTYPE
html PUBLIC
"-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html
xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
awful HTML
</title>
</head>
<body>
<p>
This is my first real HTML
<br
/>
</p>
<div>
this is supposed to be in the table this
is awful
<table>
<tr>
<th>
What a mess
</th>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td>
and this is a line
</td>
</tr>
</table>
</div>
</body>
</html>
If you are using SQL Server 2000, then I'm sorry to say you will be
restricted to source code of 8000 characters, but I guessed that this
routine would be more useful in SQL Server 2005 which has a native XML data
type for which this routine would be ideal.
Summary
SQL Server is not designed as a means of file and directory manipulation,
and for most of the time it is not the appropriate place in an application
to do it. However, now and then in the life of a DBA or database developer,
it is nice to have the functionality to hand and, for this sort of work, the
File System Object is ideal. I've used it mostly in combination with DMO for
automating routine DBA chores, such as copying and updating server
configurations. It is particularly handy for creating test cells, and test
configurations. I also find it is ideal for the occasional one-off task,
such as importing and extracting the data from a large number of XHTML
files, or indexing a large text-based repository of data.
Phil returns to this theme in
The TSQL of Text Files and also, with Robyn Page, in workbenches such as
Importing Text-based data: Workbench. they show more advanced ways of
processing text-based information in
Getting HTML Data: Workbench and
TSQL Regular Expression Workbench
Even Phil Factor codes faster with SQL Prompt 4...download a free trial version and check it out.