Ever since I went to a full dedicated server for my websites a little over 6 months ago, I’ve had zero problems. Zero as in none. 100% uptime for months on end. This has been great, so good in fact that I almost forgot what it was like doing battle with shared hosting services. If you’re at the point where you are questioning whether to go from shared services to a dedicated server (and by this I mean shared as in VPS or shared as in ‘all on one server’) – do it now. Even though it costs a little bit more, the savings in less downtime more than offset the difference.
But all good things come to an end, and about 12 hours ago, my site started going up and down like a yo-yo. Once all the usual suspects have been sorted out (did I forget to renew the domain?) it became clear that this was a software issue on the server itself.
If you use DotNetNuke as your platform of choice, and you’re faced with this situation, the first place to always go and look is the DotNetNuke scheduler.
This is a great piece of software that allows even people on shared hosting to run scheduled services. I posted a couple of weeks back the horror of having to work with a web platform that doesn’t have something like this built in.
But with great power comes the possibility of bad things happening. And so it can with the scheduler.
The problem with all scheduled services (and by this I mean the universe of unattended software, not restricted to any one platform) is that they can go rogue. By rogue I mean consuming all of the resources of the host machine, and generally causing crashing and destruction.
And so it goes that this would appear to be the cause in this case.
This blog post is going to be about the troubleshooting process that I went through, to solve this problem. Why post this? So that others, if they find themselves in the same situation, might find the information and manage to solve the problem themselves.
Finding out what is wrong with your non-responsive DotNetNuke website
The first thing to do is get the server to the point where you can actually access it. Sometimes this might require a reboot – if so, get that done.
If you can get a remote desktop connection to the server, or even just a remote Sql Server connection, get that going.
If you can get an FTP connection going to the server, upload an ‘app_offline.html’ file to the website root. This essentially shuts down further traffic to the site. See here for more information about app_offline.htm. The first thing to do is to stop the server crashing again.
Once you’ve got the ability to look at your DNN install, try and ascertain the size of both the ScheduleHistory and EventLog tables. If these are large, you’ve probably identified your problem.
In my case the ScheduleHistory table had grown to over 4 million rows. Yikes. Now that’s a problem. It should be in the thousands, not the millions. Note there is no exactly correct number, because the number depends on traffic and how many scheduled items you have, as well as how many errors you get.
When your ScheduleHistory or EventLog tables grow to this size, they basically can’t be deleted properly, because deleting that amount of data from the tables fills up the transaction logs in your database. The DNN Scheduler has two tasks in it, one which deletes excess rows from the SiteLog table, and one which deletes excess rows from the ScheduleHistory table. And when you try and run a backup on that size of data, the backup can take forever and also consume too much CPU and memory. Result : dead server. The Scheduled task can’t delete the rows, and it kills the server stone dead while trying to. It’s in a death spiral, and the only solution is to clean out those tables.
But before you do that, you need to know what is causing the error. If you’ve got a Sql Connection, I recommend selecting the top 50 or so rows from both the ScheduleHistory and the EventLog tables.
The commands are these:
select top 50 * from dnn_EventLog order by LogCreateDate Desc
select top 50 * from dnn_ScheduleHistory order by startDate Desc
Note: dont’ do a ‘select *’ from these tables, or you’ll just cause more grief. It’s imperative to limit the size of the result set.
When doing this, I recommend copy/pasting the ‘LogProperties’ column of the EventLog table into a text file, and renaming it something like ‘errors.xml’ in your local machine (put it in a temp directory). Then open that file by double clicking on it – this will open up the xml output in a browser window. Reading raw XML with no formatting is an exercise in generating headaches, so I recommend doing it this way to save you grief. You can copy/paste the LogProperties from multiple rows into the one Xml file, and have a way of viewing a selection of the errors.
Of course, if you can still actually log onto your server and view the Admin –> Log Viewer page, then do it that way!
The Discovered Issue - PurgeModuleCache Failure
What I found from all of this was this error appearing over and over again:
5/22/2012 8:24:12 PM
Scheduler Event Failure
THREAD ID 20 TYPE DotNetNuke.Services.ModuleCache.PurgeModuleCache EXCEPTI
THREAD ID: 20
TYPE: DotNetNuke.Services.ModuleCache.PurgeModuleCache
EXCEPTION: String was not recognized as a valid DateTime.
RESCHEDULED FOR: 5/22/2012 8:24:42 PM
SOURCE: STARTED_FROM_BEGIN_REQUEST
ACTIVE THREADS: -1
FREE THREADS: 2
READER TIMEOUTS: 0
WRITER TIMEOUTS: 0
IN PROGRESS: 0
IN QUEUE: 9
Just below it, in every case, were two other exceptions, which looked like this:
DefaultDataProvider: DotNetNuke.Data.SqlDataProvider, DotNetNuke.SqlDataProvider
ExceptionGUID: 5965365b-9f4c-40ab-b3fc-1b3268ded02e
InnerException: String was not recognized as a valid DateTime.
FileName:
FileLineNumber: 0
FileColumnNumber: 0
Method: System.DateTimeParse.Parse
StackTrace:
Message: System.FormatException: String was not recognized as a valid DateTime. at System.DateTimeParse.Parse(String s, DateTimeFormatInfo dtfi, DateTimeStyles styles) at DotNetNuke.Services.ModuleCache.FileProvider.IsFileExpired(String file) at DotNetNuke.Services.ModuleCache.FileProvider.PurgeExpiredItems(Int32 portalId) at DotNetNuke.Services.ModuleCache.PurgeModuleCache.DoWork()
So, from this it was pretty obvious that the ‘Purge Module Cache’ scheduled task was failing every time it ran. Each time it failed, it added 3 exceptions to the event log, and another two entries to the ScheduleHistory table. Eventually so many errors are logged that the PurgeHistoryLog scheduled items start failing as well, which then sets off a snowball effect, the end result of which is an unresponsive server.
Now, in my particular case I just happen to have a copy of the source code of the particular DNN version I was running (6.0.1, as it happened to be). So I went into the DotNetNuke.Services.ModuleCache.FileProvider.IsFileExpired source, and took a look at what was there:
private bool IsFileExpired(string file)
{
StreamReader oRead = null;
try
{
oRead = File.OpenText(file);
DateTime expires = DateTime.Parse(oRead.ReadLine(), CultureInfo.InvariantCulture);
if (expires < DateTime.UtcNow)
{
return true;
}
else
{
return false;
}
}
finally
{
if (oRead != null)
{
oRead.Close();
}
}
}
Straight away, I could see that there was a looming bug in this code - there is no handled exception if the first line of the read file doesn't contain a valid date.
So, next thing to do was to check the source code of the latest version of DotNetNuke available (6.1.5 at time of writing).
private bool IsFileExpired(string file)
{
StreamReader oRead = null;
try
{
oRead = File.OpenText(file);
DateTime expires = DateTime.Parse(oRead.ReadLine(), CultureInfo.InvariantCulture);
if (expires < DateTime.UtcNow)
{
return true;
}
else
{
return false;
}
}
catch
{
//if check expire time failed, then force to expire the cache.
return true;
}
finally
{
if (oRead != null)
{
oRead.Close();
}
}
}
Here we can see that the bug has been fixed - there is now a catch statement to catch an exception if the file doesn't exist, can't be read or doesn't include a valid date/time.
The Easy Fix!
Once you have worked out what the problem is, then you no longer need the records in the EventLog or ScheduleHistory tables. But you can't just delete them - deleting records puts them into the transaction log file of the database, so the deletes could be reversed if necessary to rollback the database. This is what causes half the problems in the first place. Instead you need to Truncate the tables. In Sql Server, 'Truncate' is the command used to delete the rows from the table and don't bother with adding them to the transaction log. Think of it like deleting something from your computer and not putting it in the recycle bin.
The commands for this are:
truncate table ScheduleHistory
truncate table EventLog
Now you've identified the error, cleaned out the oversized database tables, but you still haven't fixed the original problem.
At this point, it's obvious that the correct solution is to upgrade DotNetNuke to the latest version. So that's what I did. Problem solved - no more logs filled up with errors, site not crashing, everyone happy. All that was needed was to delete the app_offline file, and let visitors back onto the site.
Thanks goes to Antony Slater of WEBPC for both the server and the useful help in determining the problem.
Summary
If you face a problem like this:
- 1. Stop it from happening by restarting your server if possible, and upload an 'app_offline.htm' to your site
- 2. Ascertain it if is a schedule / event log oversize table related issue
- 3. Assuming (2) above is the correct diagnosis, work out what is filling the log and/or schedule history
- 4. Fix whatever that problem is (in this case, fixed by upgrading DNN to the latest version)
- 5. Truncate any oversized database tables
- 6. Remove the app_offline.htm file again
- 7. Closely monitor the server to make sure it's running OK again
More ...