ADO.NET 2.0: Understanding the Changes

Mar 4, 08:13 pm
tags:

Author: Sahil Malik
Editor: David Schultz
Reviewer(s): Akash Aggarwal, Damien Foggon

Introduction

One of the most important parts of the architecture of many applications is the data access layer. Not only is it at the heart of many applications, it is also hard to imagine an application that does not work with some kind of data. And along with this, each generation of Microsoft technology has brought with it a new incarnation of data access methodology. There have been various flavors in the names of DAO, RDO, ADO, and finally with the .NET framework came ADO.NET, that part of the Microsoft .NET framework that allows you to interact with your application's data source. While there were very few changes in ADO.NET between .NET frameworks 1.0 and 1.1, there are quite a few significant changes between ADO.NET 1.1 and 2.0, and in this article I'll work through some of those changes.

Every framework or development platform brings with it a culture and a mindset. However, what sets ADO.NET apart from previous data access technologies is that in most previous data access technologies, functionalities such as disconnected data access, XML functionality, connection pooling etc. were retrofitted into an existing architecture. ADO.NET has been designed from the ground up with these considerations in mind. For you as a system architect or a developer, this means better performance, simpler code, and much improved flexibility. However, more profound than the technologies themselves is their impact and the power they give you as a developer. With power however comes responsibility and unfortunately misuse of power. As a completely redesigned generation one product, ADO.NET in .NET versions 1.0 and 1.1 did fairly well but very often developers hit a wall when trying to solve common problems. ADO.NET 2.0 not only considers many of these problems, it in fact includes many new enhancements that make working with ADO.NET a pleasure, and make it harder to introduce architectural mistakes into your design.

This article focuses on the most common done operations in any data access architecture, and the common issues faced in the .NET framework 1.1. It then compares those with .NET framework 2.0 and the seemingly insignificant, but very well thought out enhancements in the framework.

Without further ado, let us begin with a rather common gripe about the .NET 1.1 framework, DataSet serialization.

System Requirements

To run the code sample you should have the following:

  • Windows XP / 200X or greater
  • The .NET framework 2.0 Beta 2
  • Visual Studio 2005 Beta 2
  • MSDE or SQL Server 2005 with the Adventureworks database installed

Installing and Compiling the Sample Code

The sample download for this article contains a VS.NET 2005 Beta 2 solution written in C#. The solution contains two projects - AsyncCommand and DS_Serialization .

To install the sample, unzip the download to a directory and open the ArticleCode.sln file in Visual Studio 2005. Be sure to change the connectionstring within the button click event to match your SQL Server environment.

DataSet Serialization

In a typical enterprise application, it is difficult, if not impossible for every part of the application to remain constantly connected with the database. While it is possible to leverage mechanisms such as connection pooling to avoid the requirement of a constantly open connection between every client and the database server, equally important is the presence of an object that can act as the cache in which you can easily store disconnected data.

The role of a disconnected cache that allows you to store disconnected data is served by the DataSet object and its constituents such as DataTable , DataRelation etc. which appear under the System.Data namespace. By the virtue of appearing in the System.Data namespace and not under any specific provider, the message is given that these objects are independent of a specific .NET data provider such as SqlClient or OracleClient . In other words, to hold data from an Oracle database, or a SQL Server database, you would, in both cases use the same DataSet class or even a single instance of the DataSet class. As a matter of fact, the DataSet almost mimics itself to be a small relational database - except that it is completely in memory.

Now, just because the DataSet masquerades itself to be a small in memory relational database doesn't mean it is one, and most definitely it should not be abused as one. While it is unreasonable to expect the DataSet to have the power of a database, what a DataSet does provide you is the ease by which it can be sent over process and machine boundaries.

Sending the DataSet over process and machine boundaries requires the DataSet object to go through a dehydration process commonly referred to as serialization. Serialization refers to the process of converting an in memory object to a serial stream of bytes. The problem however is that the DataSet in the .NET framework insists on serializing itself as XML, even if you specify to serialize it using the BinaryFormatter - which obviously results in an extremely bloated serialized stream. Consider the following code:

BinaryFormatter bf = new BinaryFormatter();
FileStream fs = new FileStream(
  System.AppDomain.CurrentDomain.BaseDirectory.ToString() + "ds.dat", 
  FileMode.OpenOrCreate);
DataSet ds = GiveMeADataSet() ;
bf.Serialize(fs, ds);
fs.Close();

The code above will work either in ADO.NET v1.1 or 2.0. When the code is run however, and the file ds.dat is opened in notepad, it will similar to Figure 1.

Figure 1. DataSet serialization – the old way

As you can see, the entire DataSet contents have been serialized as XML, even though you clearly specified the BinaryFormatter as the serializer of your choice. There are three significant downsides to this:

1. Given that XML tags are verbose and repeated frequently, the above serialized stream is huge.

2. It takes longer to serialize and deserialize the data.

3. Another hidden problem is when a DataSet is sent over time zone boundaries, the time zone shift, thanks to XML serialization, occurs twice - once during serialization, and second time during deserialization.

Clearly these can be serious problems if the data is of any size. ADO.NET 2.0 lets you solve all these problems by making a very minor change to your code as shown below.

BinaryFormatter bf = new BinaryFormatter();
FileStream fs = new FileStream(
  System.AppDomain.CurrentDomain.BaseDirectory.ToString() + "ds.dat", 
  FileMode.OpenOrCreate);
DataSet ds = GiveMeADataSet() ;
ds.RemotingFormat = SerializationFormat.Binary


bf.Serialize(fs, ds);
fs.Close();

Thus by changing a newly available property RemotingFormat to SerializationFormat.Binary , you force the serialization format of the DataSet to a true binary format. Not only is this more efficient, it also does not suffer from the time zone shift problem described above. The default value however is still XML for backward compatibility.

In the code download for this article, you will find a project called DS_Serialization . Simply run the project with both SerializationFormat.Binary and SerializationFormat.Xml . You will see that the deserialization process with SerializationFormat.Xml takes about tens of times longer than using the SerializationFormat.Binary .

Updates to the DataTable

DataSets have the ability to store hierarchical data in multiple DataTable s. While that sounds useful, in many cases the complexity is not worth it. By having too many DataTable s in your code, your concurrency check logic tends to get extremely complex. Also, by having more than one DataTable inside a DataSet , the time required to Merge() or perform a GetChanges() on the DataSet rises significantly.

Even though it might not be completely unavoidable to use DataSet s with multiple tables in them, in many situations you will find it more reasonable to work with a single DataTable instead. However, in .NET 1.1, DataTable s were not serializable which meant, you had to put them inside a DataSet in order to serialize them and send them over a web service for instance.

Not only that, by the virtue of the disconnected nature of these objects, frequently you would find yourself with the need to extract the changes out of a DataSet or DataTable and the need to merge two DataSet s or DataTable s together. In order to be able to do a Merge() on a DataTable in the .NET framework 1.1 you had to first put it inside a DataSet and then merge the DataSet instead, similarly to extract changes out of a DataTable , you had to first put it in DataSet and then perform GetChanges() on the DataSet instead.

ADO.NET 2.0 however provides you with a fully serializable DataTable with methods such as ReadXml() , WriteXml() , ReadXmlSchema() , WriteXmlSchema() etc. and the same Merge() and GetChanges() abilities as the DataSet (albeit you still can't merge with a DataRow array at the DataTable level).

Therefore, when your logic allows, you can now leverage a smaller memory footprint object, the DataTable in lieu of the DataSet .

Streaming Support in DataTables

In a bid to reduce roundtrips to the database, it is quite tempting to load as much data as possible in one database call. While that is a good practice, this often leads to the tendency to have large in memory objects as instances of DataTable s or DataSet s. Something I frequently like to refer to as "Abuse of DataSets as Relational Databases" . This is a bit like charging you 110% tax, either way, you can't win.

Also given the nature of the disconnected objects within ADO.NET ( DataTable s, DataSet s, DataColumn s, DataRow s etc.) you may not want to work with the original copy in many instances - for example because you might disturb the row states. Another situation could be where you might want to fill more than one DataTable with the same data, but you do not wish to pass an open database connection around. Not passing the connection around would make sure that you do not keep the connection open for an inordinately long period of time, thus impacting connection pooling performance.

In ADO.NET 1.1, you could either clone the disconnected data cache by serializing and deserializing it to a new object instance and having a complete copy before use or you could pass around a new instance of an open SqlDataReader to achieve the same results.

ADO.NET 2.0 however has a new CreateDataReader() method on the DataTable object, which returns you a DataTableReader object. The DataTableReader object, just like the SqlDataReader object, implements the IDbDataReader interface, which means it allows you to work with the data contents in a streaming forward only, read only manner.

The one big difference between DataTableReader and DataReader however is that, the DataTableReader object is indeed disconnected from the database, thus while you are using a DataTableReader object, you are not locking a valuable resource such as an open database connection.

The SqlDataReader or the IDataReader in general provide you with a forward only, read only firehose cursor to the queried data. While it does hold an open connection, it is however a much faster way to access data in a single user scenario. As the number of users may increase and if your application does a lot of processing between individual rows, a fully disconnected architecture is a much better choice.

However for situations where you wish to do a quick read of information - perfectly suited to a SqlDataReader object - you could not databind the SqlDataReader directly to a Windows form datagrid. Interestingly enough though, an ASP.NET datagrid will databind directly to a SqlDataReader , but that is only because the SqlDataReader object implements IEnumerable in addition to IDataReader . Even then it suffers with the major disadvantage of keeping the connection open for however long it may take to databind with the datagrid (or any other control for that matter). Databinding however is an expensive process - and, depending upon the exact query, it may take up to ten times as much time to bind the data as it did to read it from the database in the first place (this of course depends on how optimum your SQL query is). This clearly keeps the connection open for much longer than required, and thus affecting the performance of a multi user, highly concurrent application extremely negatively.

While ADO.NET 1.1 left you with no choice, (with the exception of loading DBDataRecords in an Arraylist yourself), ADO.NET 2.0 provides a new method called Load() on a DataTable to directly load a DataTable from a DataReader . This is easily achieved using the following code snippet:

DataTable dt = new DataTable() ;
dt.Load(myDataReader);

This in fact not only allows you to create a DataTable from a SqlDataReader objects directly, it also gives you a convenient mechanism to clone a DataTable with minimal code using the DataTableReader object. You can also specify two additional parameters to control the behavior of the DataTable load when it already contains data. This is done by specifying a LoadOption to an overload as shown below:

dt.Load(
   myReader,
   LoadOption.OverwriteChanges,
   new FillErrorEventHandler(errorHandler))

The third parameter passed to the function above is the event handler which will be triggered in case an error occurs during the fill operation.

Batch Updates - Saving on Database Roundtrips

DataAdapter s are awesome. They act as the sentry between the disconnected data cache objects (for example objects such as the DataSet , DataTable etc.) and the connected world of databases. They provide you with four properties, SelectCommand , InsertCommand , UpdateCommand , and DeleteCommand where you can specify Command objects of your choice. When you call a method such as DataAdapter.Fill() , the object being filled - DataSet or DataTable is filled with the results of the execution of the command held in the SelectCommand property.

Similarly, as you make changes to DataRow s contained inside the various DataTable s, their RowState changes accordingly. The RowState allows the DataAdapter to uniquely identify individual rows that might need updates, inserts, or deletes.

Thus for every changed row, the DataAdapter will communicate with the database, execute the necessary command (while putting in values for parameters), and then for every subsequent row it will do the same process all over again - this time possibly with a different command, depending upon the RowState .

Therein lays the problem - too many network roundtrips.

But why are too many network roundtrips a problem? Well, because the speed of light is limited, and our brilliant scientists will probably not break it in our lifetimes. About 5 years back, if I were to ping a server in Tokyo from Washington DC, it took me half a second. But downloading 500K took about half an hour. But today, while it takes maybe a few seconds to download 500K (if that), pinging still takes about the same time. This is because my signal is transmitted over copper wires, optical fibers, or something similar, and my signal speed is limited by either the speed of light in the best case, or a constant fraction of the speed of light. Not to mention the millions of transistors en-route that take their own sweet time toggling between the digital 1 and 0 states.

In order to save on too many network roundtrips, .NET 2.0 has introduced a new property on SqlDataAdapter called UpdateBatchSize which simply gets or sets the number of rows that are processed in each round-trip to the server. You can in fact set it to 0 and allow SqlDataAdapter to automatically use the largest batch size the server can handle. But it doesn't simply concatenate a number of queries and execute them in a batched fashion.

Batch updating works in a rather interesting fashion. For example let's take the insert command as shown below:

INSERT INTO Customer (id, name) VALUES (@id, @name);

and the update command's text as shown below:

UPDATE Customer SET name = @name WHERE id=@id;

Now, the logical way to think is that, maybe the DataAdapter , at least in the case of SqlClient should just batch these two commands together as shown below:

INSERT INTO Customer (id, name) VALUES (@id, @name);
UPDATE Customer SET name = @name WHERE id=@id;

The problem is immediately apparent. The same parameter name is used multiple times. Now maybe the DataAdapter could parse through the query contents and rename the parameters, but not only is that a lot of work that requires a lot of code, (which means it is prone to a lot of bugs), there is also an upper limit of 2100 parameters that a SqlCommand can have. Because individual commands may have a varying number of parameters, batching the commands is not the answer. Another significant disadvantage is that the query structure can change with the DataTable change, which means query plans cached will not be used optimally in the database.

Instead, by setting a BatchSize on the SqlDataAdapter object, it instead uses a much simpler approach at the TDS protocol level. Put in simplistic terms, if a single command execution is done using a command start marker, command execution, followed by command end marker, for batched statements, instead of executing one command, and putting an end marker on the communication, ADO.NET simply puts another command start marker and executes another command. This way, it saves on multiple round trips in a simple and elegant manner under the scenes.

In order to see batching work, the only difference you will see as a programmer is a significant performance gain. If you wanted to delve behind the scenes however, you would have to use a network sniffer to examine the packets as SqlProfiler will still show you the individual commands being executed one by one.

Asynchronous Command Execution

When architecting an application, you want the application to be quick and responsive to the end user's whims and commands. This can be tough to achieve as, for various reasons, it is difficult to come up with a data model that performs efficiently in all scenarios. While a normalized data model may work very well for inserts, it may not work well for selects. While it is true that the first venue of solution to slow running queries is a better data model, the harsh reality is that every now and then you may need to run queries that take excessively long durations to run. At times the problem may be completely unavoidable because there is simply too much data to process or the process is extremely complex and long running.

In .NET 1.1 you could have gotten around this problem by firing off another thread, creating a DataReader on that thread and then having that thread notify the main thread that it was ready to retrieve the data. The right way to then show the data would be to use the Form.Invoke() method to switch thread contexts and then databind the fetched data on the UI in the context of the thread the Windows forms UI was created upon.

If you didn't have to read the previous paragraph at least twice to understand what I just said you are probably a really advanced .NET programmer. And if you did find yourself scratching your head a little bit, take heart, this was a fairly complicated workaround to a situation that might occur much too often. Keeping that in mind, .NET 2.0 introduced asynchronous execution on many commands using the popular Begin/End method paradigm and IAsyncResult data types.

This is actually much simpler than it sounds. All it means is that instead of executing the ExecuteReader() method directly, you would instead leverage a pair of methods called BeginExecuteReader() and EndExecuteReader() . The Begin/End pattern repeats itself all over the .NET framework. If there was a method called XXX that is allowed to be executed asynchronously, I would immediately look for a BeginXXX() and a matching EndXXX() method.

In other words, when you wish to execute a command asynchronously, you would initiate the execution of the command by executing the code as shown below:

AsyncCallback callback = new AsyncCallback(DataReaderIsReady);
IAsyncResult asyncresult =  
   testCommand.BeginExecuteReader(callback, testCommand);

Notable in the above two lines is the AsyncCallback object callback , which lets the framework know that when the command is done executing it should call a method called DataReaderIsReady() , which can be seen in the code below:

private void DataReaderIsReady(IAsyncResult result)
{
  MessageBox.Show("Results Load Complete", "I'm Done");
  SqlCommand testCommand = (SqlCommand)result.AsyncState;
  SqlDataReader sqlDr = testCommand.EndExecuteReader(result);
  if (sqlDr.HasRows)
 {
    foreach (DbDataRecord rec in sqlDr)
    {
      dbRecordsHolder.Add(rec);
    }
  }
  sqlDr.Close();
  testCommand.Connection.Dispose(); // Call atleast close, if not dispose.
}

The DataReaderIsReady() method is called once BeginExecuteReader() is done with its processing. When you started BeginExecuteReader() , you passed in the Command object. That object can now be retrieved using the IAsyncResult.AsyncState property. Once you have the SqlCommand object back, you can now get the prepared SqlDataReader by executing the EndExecuteReader() method. From then on you can use the SqlDataReader object, sqlDr just like any other SqlDataReader .

Just remember to close and dispose of the connection as a good coding practice in the end. Do note however that you don't have the using block or code limited to a single method's scope to do this for you here, so you need to remember to do this cleanup. Not doing this clean up will lead to a serious performance degradation and runtime errors as the connection pool gets used up that will manifest themselves only in production environments under high load conditions. This would be a perfect example of additional power, which might result in misuse.

A rather thoughtful addition to ADO.NET 2.0 and SQL Server 2005 is that any features that could potentially be misused, as a result for example of bad architecture or simply poor knowledge by the programmer, needs to be consciously allowed before it can be used. As mentioned above, asynchronous command execution although essential to an application has the possibility of being misused or improperly used. In order to be able to use it, you have to consciously make the decision and allow it in your connection string. A change needs to be made to the connection string to allow your code to use asynchronous commands. To allow asynchronous commands, you need to add the following to your connection string:

Asynchronous Processing=true

You can find a full running demonstration of executing a command asynchronously in the code download in the AsyncCommand project.

Integration with System.Transactions

As applications grow more and more complex, their needs grow. At times, it may be a good idea to split the application database into multiple databases or you may have more then a single datasource. However, in order to maintain data integrity, you might want to wrap a number of commands as one atomic block that either fails or succeeds all together. The standard method to do this is by using a database transaction.

While the SqlTransaction or OracleTransaction objects work just fine for one database, when the transaction needs to span over multiple databases you need to look elsewhere. Commonly referred to as distributed transactions, these transactions are usually coordinated by the Microsoft Distributed Transaction coordinator (MSDTC).

Before ADO.NET 2.0, in order to implement a distributed transaction, you had to leverage the System.EnterpriseServices namespace's ServicedComponent class. In order to create a component that enlisted itself in distributed transactions, you had to inherit from the ServicedComponent class as shown below:

[Transaction]
public class DoWork : ServicedComponent
{
  [AutoComplete]
  void WorkUnit(int someparameter)
  {
  // Do the work, throw exception to terminate transaction 
  // or else transaction commits
  }
}

Not only did you have to inherit from ServicedComponent , you also had to strongly name the assembly and put it in the GAC. This obviously is extremely inconvenient and inflexible.

A slightly better approach that could have been taken in Windows XP SP2 and Windows 2003 operating systems was by using the ServiceConfig class. This could be done as shown in the code below.

using System;
using System.EnterpriseServices;
namespace SDSample
{
  class MyClass
  {
    [MTAThread]       
    static void Main(string[] args)
    {
       ServiceConfig cfg = new ServiceConfig(); 
       cfg.Transaction = TransactionOption.Required; 
       ServiceDomain.Enter(cfg);
       try
       {
          // Write transactional code here
       }
       catch(Exception e)
       {
          Console.WriteLine(e.Message);
          // Exception occurred - abort the transaction.
          ContextUtil.SetAbort();
       }
       finally
       {
          ServiceDomain.Leave(); 
       }
    }
  }
}

Again, this approach wasn't perfect. Not only was it limited to certain operating systems, it also meant that as long as you had a single connection enlisting in the transaction, even the single connection would pay the same performance penalty that a full fledged distributed transaction would.

Various connection objects in ADO.NET 2.0's such as the SqlConnection object on the other hand makes it easy for you by working with the newly introduced System.Transactions namespace in the .NET 2.0 framework. The usage can be seen in the code below:

using (TransactionScope ts = new TransactionScope())
{
  using (SqlConnection cn = new SqlConnection(connectionString1))
  {
    SqlCommand cmd = cn.CreateCommand();
    cmd.CommandText = 
      "Insert into Monkey(MonkeyName) Values ('Bill Ryan')";
    cn.Open();
    cmd.ExecuteNonQuery();
    cn.Close();
  }
  using (SqlConnection cn = new SqlConnection(connectionString2))
  {
    SqlCommand cmd = cn.CreateCommand();
    cmd.CommandText = 
      "Insert into Monkey(MonkeyName) Values ('Frans Bouma')";
    cn.Open();
    cmd.ExecuteNonQuery();
    cn.Close();
  }
  ts.Complete();
}

In the above code, I have instantiated a TransactionScope object. Once this is done the two SqlConnection objects and all commands executing on them contained inside the TransactionScope automatically enlist themselves into the current running transaction. In other words, they both enlist themselves within one MSDTC transaction.

But it gets even better - if the second connection had never come into the picture, this would simply be a ReadCommitted transaction within one database. MSDTC kicks in as soon as a second Resource Manager (RM) comes into the picture that is either not in the same app Domain, or is non-durable. You need to be aware that MSDTC does generate lots of network traffic, and it could get stuck because of firewalls, or simply escalates the isolation level. Also with the introduction of the MSDTC, the transaction is promoted to a non-lightweight transaction - which is no longer managed by LTM (Lightweight Transaction Manager), but instead by the MSDTC.

In other words, you almost don't have to write any code to take advantage of distributed transactions. Also, you definitely don't have to write any code in order to take advantage of the light weight transaction manager.

Conclusion

In this article, I have touched upon a few of the many ground breaking changes ADO.NET 2.0 brings. Like the rest of the framework, a lot of thoughtful additions have been made that facilitate the programming of commonly required tasks. Bear in mind that I've only covered a few of the changes, which I consider particularly significant - there are other new features too. Many of these changes are small but significant changes such as better support for transactional updates from a DataTable , multiple active result sets, support for the XML data type, and lastly of course the new CLR support inside the SQL Server 2005 database.

Founders at Work


    1. Quite Usefull article.



    1. There is also another change. Previously I had some code that had been ported through various versions of VB to .NET 1.1. This used DataTable.Rows.Add(Object()) passing in a string array containing the column data. This worked fine with the items in the string array being assigned to the columns in the added row. Under .NET 2.0 I find this string array is now stored in the first column and the rest are set to DBNulls. The solution is to stick with the specification of the .Add method and pass in an object array rather than a string array.




Add your comments

Please keep your comments relevant to this blog entry: inappropriate or purely promotional comments may be removed. To add hyperlink, please follow this example: "your link text":http://your.link.url