It’s been several years since I last updated the skin on my site. Thanks to Ralph Williams, the site is looking really good again and it’s responsive. If you need a new design for your site, I highly recommend him. However, the process to get to this blog post was quite tumultuous – chock full of hours of troubleshooting and continuous frustration spread over the past week. Simply put, my site “appeared” to be loading okay whenever I came to it, but I rarely browsed my own site. Rolling out my new design revealed that “okay” couldn’t have been further from the truth.
Before I go any further, you should know that everything I am about to describe about getting the site back to “normal” could have been easily avoided. Rule #1 of running your own site end-to-end is this – NEVER neglect your site or server duties. This was my downfall. I had neglected mine for a few years.
What’s Old is New Again
Everything I went through involved things that I used to do on a daily basis. I was one of those people who wore every hat possible and I did it well. However, it’s been several years since I wore all of the hats, so this took much longer than it should have. Despite the frustration and lost time, it was quite fun to learn about some of the new tools and methods used to fix these kinds of issues.
Every page load on my primary site (WillStrohl.com) seemed to be slower than desired. Then, about every 3rd or 4th page load, it would take an excessively long time. We’re talking up to 60 seconds or so. There are several other sites on my server as well, but their page loads only appeared to be affected when my primary site was having one of those excessively long page loads.
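If you would rather measure this pattern than eyeball it, a minimal sketch along these lines can time a series of page loads and point out the slow ones. This is not what I used at the time; the function names, the example URL, and the 10-second threshold are all made up for illustration.

```python
import time
import urllib.request
from typing import Callable, List


def fetch(url: str, timeout: float = 120.0) -> None:
    """Download one page and discard the body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def time_loads(load: Callable[[], None], count: int) -> List[float]:
    """Call load() count times and return each call's duration in seconds."""
    durations = []
    for _ in range(count):
        start = time.perf_counter()
        load()
        durations.append(time.perf_counter() - start)
    return durations


def slow_loads(durations: List[float], threshold: float) -> List[int]:
    """Return the 1-based positions of loads slower than threshold seconds."""
    return [i for i, d in enumerate(durations, start=1) if d > threshold]


# Hypothetical usage (the URL is a placeholder, not a real endpoint):
# durations = time_loads(lambda: fetch("https://example.com/"), 12)
# print(slow_loads(durations, threshold=10.0))
```

Passing the page load in as a callable keeps the timing logic separate from the network call, so the same helper can time anything you want to repeat.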
I had my task manager up during most of the troubleshooting and noticed that when there was an excessively long page load, both the memory and CPU levels would spike and stay pegged. My server has a decent amount of resources, so this was troubling.
The obvious thing to do was to look at the Event Log in DNN (since that’s what my site runs), and in the event log in Windows. No matter what site software/CMS you run, the Windows event log will generally have information that your web-based event log won’t have.
The DNN event log had a number of module and page load exceptions showing, but nothing that made me think that my site was running slowly due to the software. Exceptions happen and the frequency simply wasn’t there. The Windows event log was another story altogether – there were nothing but exceptions showing. The most common ones were related to the w3wp.exe process failing.
I happened to be chatting with a friend who owns arguably one of the best Windows web hosting companies out there, Applied Innovations (shameless plug for my generous web host), and he turned me on to LeanSentry. I had conveyed the same information that you now know, and he mentioned how their services allow you to get better insights into the seemingly generic errors that are found in the server’s event log. Their UI leaves much to be desired, but it is indeed a great way to get to know what’s happening at a glance. What I was dealing with originally is below.
If that image looks crazy to you, you’re right. The very active parts on either end are when I was actively troubleshooting the site and server. The middle part is where I had dinner and slept. The consistent things you’ll notice are the orange and red lines. These lines track worker process crashes and application restarts. As you can see, even with little to no traffic, the server and my site were not having a good time. Something was majorly wrong.
Keeping It Simple: Cleanup First
Like I said, there are a number of sites on this server – mostly for my pet projects, friends, and family. They’re pretty much all running DNN too, so first things first – upgrade them all to the most current version and get rid of any that don’t need to be around any longer. I had a number of sites that had been shut down over the years, but were still loaded on the server in various ways. First, I deleted all of the unnecessary sites and their references. Then, I upgraded the few that were left and disabled 51 Degrees on all of them.
With my primary site, upgrading wasn’t as simple. The others were quite clean, but I’ve done a lot on my site over the years. Combine that with my new site getting rid of some of the previous features, and there was a lot more work to do. I had to upgrade a few modules and uninstall some others. In fact, I also found that one of the modules had a repeating error that couldn’t be fixed, so I had to find a replacement for it and uninstall it as well.
As a result of this cleanup, all of my page load times had decreased dramatically, but it didn’t fix the real issue. The crashes and exceptions in the Windows event log were still occurring.
The reasoning behind the upgrades is primarily two-fold: performance improvements in more recent versions of DNN, and the related security updates. Both mattered here because a couple of the sites were surprisingly still running DNN 5.x! Following the site upgrades and module upgrades, I installed and ran the new Security Analyzer module by Cathal Connolly. It did indeed note a few issues on a few of my sites. No performance was gained here, but I definitely felt a little peace of mind. This was especially true of my primary site, since it had a PHP file buried deep inside one of the HTML editor providers.
Many of you are thinking, “Duh, Will.” The reality is that it’s far too easy to forget to run Windows Update on your server – especially when you log in from time to time and see notifications that Windows updates have been applied on your behalf since your last login. However, I still had 1 critical update and 6 optional updates to apply. In total, 3 of them appeared to be related to performance improvements. This proved to be true. There was a noticeable improvement in page loads and even a noticeable drop in the frequency of crashes in the event log, but this was not the fix.
During another IM conversation with some friends, Clint Patterson reminded me of a performance blog post by Shaun Walker where the FCNMode setting resolved performance issues on DNNSoftware.com. While my site doesn’t have anywhere near the same user base and traffic, I figured it wouldn’t hurt anything to switch it. Sure enough, there was again a noticeable performance improvement. My application restart times were cut by at least half – but still, the other issues remained.
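For anyone who hasn’t seen it, FCNMode is the `fcnMode` attribute on the `httpRuntime` element in web.config (available in ASP.NET 4.5 and later), and it controls how ASP.NET monitors folders for file changes. The post above doesn’t say which value was used, so the `Single` default in this rough sketch is an assumption, and the helper name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def set_fcn_mode(config_xml: str, mode: str = "Single") -> str:
    """Return config_xml with fcnMode set on <system.web>/<httpRuntime>.

    The "Single" default is an assumption, not a value taken from the post.
    """
    root = ET.fromstring(config_xml)
    runtime = root.find("system.web/httpRuntime")
    if runtime is None:
        raise ValueError("no <system.web>/<httpRuntime> element found")
    # Existing attributes on the element are left untouched.
    runtime.set("fcnMode", mode)
    return ET.tostring(root, encoding="unicode")
```

Note that ElementTree drops XML comments when it parses, so for a real web.config it is probably safer to add the attribute by hand and keep a backup of the file.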
Debugging the Crashes
I was getting nowhere pretty fast with the previous attempts at narrowing down causes, so I decided to debug this further by using the Debug Diagnostic Tool (Debug Diag). There’s a blog post by Tess Ferrandez that walks you through debugging ASP.NET crashes incredibly well. This step didn’t help me fix anything, but it did help me to narrow my search for causes down to one of two exceptions that were occurring. A little tip here… You’re going to be generating dump files if you do this – don’t forget to save your dump files to a drive with plenty of available space.
Getting to the Bottom of it All
The issue that was appearing the most in my application monitoring and event log was the fact that w3wp.exe (the process that runs the websites) was crashing quite often. There were a few other common exceptions. One appeared to be related to Windows complaining about a culture missing. This turned out to not be the real issue, but something that happens a lot and is often expected. So, I cleared the temporary ASP.NET files and continued looking into the other common exception.
The other exception was one that kept complaining about the Lucene.Net.dll file not being found.
Unable to find assembly 'Lucene.Net, Version=188.8.131.52, Culture=neutral, PublicKeyToken=85089178b9ac3181'.
Unfortunately, not only was the file present everywhere that it should be, but it was the correct version as well. It just didn’t make sense to me at all. I pored over search after search to try and figure out why this might happen. I even tried explicitly wiring up the DLL and version in the web.config of my site. Nothing seemed to prevent the worker process crashes.
Now here’s what was interesting about this to me. I performed several searches related to this error – some specific to DNN and some much more general. Regardless, I kept seeing results appear with Umbraco mentioned. Originally, I kept ignoring those results in favor of others. In the end though, I got desperate, so I began reading those as well. As it turned out, many Umbraco implementations were having the exact same issue and the fix was astoundingly simple… Delete the search index files from the App_Data folder. In DNN, this folder would be as follows:
I deleted all of the files in that folder and recycled the application pools. (Don’t worry, the files will be re-created.)
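I did this by hand, but if you have several sites to clean up, the deletion step could be scripted. Here is a minimal sketch, assuming you pass in the search index folder yourself; the function name and the `index_dir` parameter are hypothetical. It removes files only, leaves the folders in place, and does not recycle the application pools for you.

```python
from pathlib import Path


def clear_index_files(index_dir: str) -> int:
    """Delete every file under index_dir and return how many were removed.

    Folders are left in place so the application can re-create its files.
    """
    removed = 0
    # Build the full list first so files are not deleted mid-iteration.
    for path in list(Path(index_dir).rglob("*")):
        if path.is_file():
            path.unlink()
            removed += 1
    return removed
```

Stopping the site or application pool before running something like this is probably wise, since the worker process may be holding some of the index files open.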
I immediately saw a performance increase like none before it. Every page was loading snappy quick. The occasional long page loads were gone. The worker process crashes appeared to be no more as well. I sat there… I clicked on my site aimlessly for about 15 minutes before I finally rejoiced! No crashes, and none of the old problems were coming back. The site was back to normal!
This fix was something so simple… and it had been so difficult to track down. The end results are shown below. Note the dramatic dip near the center when the search files were deleted. All sites on my server have run smoothly since.
At the end of the day, all of these tools and methods helped to get all of my sites back to a state of being highly responsive in both ways that a site owner would care about. I highly recommend them all, especially LeanSentry, though it’s expensive for a non-business site owner like me. If you ever hear of them having some kind of freemium edition, don’t forget to let me know.