1

Recently I had to restart my server because it was not responding. I've been looking through the logs, but I can't find anything that tells me what the error was.

The droplet's CPU was at 100% for hours. Here is the screenshot:

[Screenshot: droplet CPU usage graph]

While the droplet was having problems, the site wasn't available and neither was shell access.

I don't know what else I can do to find the error or its possible causes. Where should I start looking? Which specific logs would be most useful here?

Everything is fine now, after the restart, but it could happen again.

Help me please. Thanks.

Beto Aveiga

2 Answers

2

Before messing with Nagios and the like, I suggest you install sar to keep your server monitored. It requires basically no configuration, yet it collects many key stats about what is running and happening on your server.
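As a rough sketch of what that looks like in practice: sar ships in the sysstat package, and once it has been collecting for a while you can read back the recorded CPU figures for the hours before a hang. The snippet below assumes a Debian/Ubuntu droplet (the question doesn't say which distro is in use), and `cpu_history` is just a hypothetical helper name, not something sysstat provides.

```shell
#!/bin/sh
# Assumption: Debian/Ubuntu. Install with:
#   sudo apt-get install sysstat
# On those distros, collection may also need ENABLED="true"
# in /etc/default/sysstat before any history is recorded.

# cpu_history [DAY]: print the CPU utilisation samples recorded by sar.
# DAY is a two-digit day of the month; without it, today's data is shown.
cpu_history() {
    if ! command -v sar >/dev/null 2>&1; then
        echo "sar not found; install the sysstat package first" >&2
        return 1
    fi
    if [ -n "${1:-}" ]; then
        # Assumption: Debian/Ubuntu path and file naming. RHEL-like
        # systems keep these files under /var/log/sa instead.
        sar -u -f "/var/log/sysstat/sa$1"
    else
        sar -u
    fi
}
```

For example, `cpu_history 19` would show the samples from the 19th, provided sar was already collecting on that day. `sar -q` (load average and run queue) and `sar -r` (memory) take the same `-f` option and are worth checking for the same time window.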

shodanshok
  • Thanks @shodanshok, I will look into "sar". It's interesting to have a program that monitors performance and logs details that are not always easy to find in the logs. I will try it. So far the server is still working fine. – Beto Aveiga Nov 19 '16 at 13:08
1

Well, first and foremost, do you have any monitoring tools, such as Nagios, to alert you to these events in real time? I would suggest configuring one to monitor your server. It can run a number of SNMP-based checks and provides features such as:

Service Monitoring

Event Handling

Multiple Host Monitoring

For more details, look at the following add-on here:

======= 
USAGE: 
======= 

./checkProcessesviaSNMP.sh <community-string> <remote-host> <process-names> <warning> <critical> <type> 

This tool should be able to monitor a number of real-time events on your server and alert you via email (provided you configure SMTP).

This solution will not stop the fault, but it should give you a real-time alert about what is happening.

  • When the droplet was having the issue, it was impossible to log in through SSH. I would also use "top" to monitor, but the server was almost inoperable. – Beto Aveiga Nov 19 '16 at 13:05
  • @BetoAveiga Beto, I doubt there is anything you can do in that situation. If you monitor in real time (and collect the logs), you can potentially solve the issue before it occurs again. – DankyNanky Nov 20 '16 at 00:19
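
One way to collect that kind of log ahead of time, since "top" is no use once SSH stops responding, is to have the server write a small snapshot to disk at regular intervals. This is only a minimal sketch, assuming a Linux box with procps `ps`; the function name `snapshot_load` and the log path in the usage note are made up for illustration.

```shell
#!/bin/sh
# snapshot_load LOGFILE: append a timestamp, the load averages and the
# five most CPU-hungry processes to LOGFILE.
snapshot_load() {
    log=$1
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        # Linux-specific: 1, 5 and 15 minute load averages
        cat /proc/loadavg
        # procps ps, sorted by CPU share, highest first: header plus five
        # rows. Note that ps reports CPU averaged over each process's
        # lifetime, not the instantaneous figure that top shows.
        ps -eo pid,user,pcpu,pmem,comm --sort=-pcpu | head -n 6
    } >> "$log"
}
```

To use it, you could save this as a script ending with a call such as `snapshot_load /var/log/load-snapshots.log` (a hypothetical path) and run it from cron every minute. After a forced restart, the last few entries in that file show what was using the CPU just before the server became unreachable.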