53

I manage 20 software engineers divided into 4 sub-teams. Every team has good work standards and a high level of ownership except one. That team has one senior guy and three juniors. Every time there is a critical bug (one impacting the business), this senior guy pushes the work to the next day by saying things like "I can't finish it today," "I will look into it tomorrow," "Do we really need it today?," or "How are we going to test that tonight?" Even when I told him I needed it now, he said he had something else to do and sneaked off when I was not there. He has also told the juniors to push back their work.

Last week, I told them in a team meeting that I expect a higher level of ownership. If they promise something, they should do it. If there is a critical bug, they must fix it even if they have to stay late.

Today, there was a critical bug and this senior guy said the same thing again: "I can't finish it today. I have a meeting with friends and I have to go." Then he sneaked out while I was talking to my manager.

This is not the mentality I want my team to have. I plan to tell him that he has to change his work style or find a new job, and then wait for his answer. Is that too direct? Is there an alternative way to deal with issues like this?

Update

In this particular example, the bug prevents 90+% of users from logging into the system. On average, this has happened once a month this year, while it happened twice in all of last year. Critical bugs are well defined: they are bugs that 1) prevent users from logging into the system or 2) prevent users from purchasing products. Only these two types of bugs count as critical.

What we did to prepare every release:

  1. We had thorough plans where everyone understood the requirements; we actually planned down to field names and functions. I implemented a rule for all teams that requirements can't change after the sprint starts. We also had test cases ready before the sprint started.
  2. We add a buffer to all tasks: if we think we can finish something in 1 day, we put down 1.5 days. We found that some people consistently underestimate tasks.
  3. The first deadline was the end of January, which is when they thought they could get it done, including tests. This is another rule I implemented in all teams: POs tell us what they want and we tell them how long it will take. So I told the other teams that everything would be ready by the third week of February.
  4. By the end of January they said all features were done, with tests for the test cases. We deployed them to our test environment and found a bug where users can't log in. It turned out that they had not written all the tests. I asked them how long it would take to fix the bugs and write the tests; they said two weeks.
  5. For the first two weeks of February, I told everyone that we would only test and fix critical bugs. Again, critical bugs are either 1) users can't log in or 2) users can't purchase products in the app. Everything else went into our backlog.
  6. Weeks 3-4 of February came after we released it to customers. We spent these two weeks fixing the non-critical bugs (the ones we logged in #4): reproducible crashes and other less important bugs like layout issues. Again, all these fixes have tests.
  7. We released it to customers with all tests green. After deployment, we found that some numbers were off, so we retested everything and found the same issue coming back: users can't log in.
  8. The last time they stayed late at night, I gave them an extra 2 days off.
Pierre Arlaud
Code Project
  • 2
    Comments are not for extended discussion; this conversation has been moved to chat. – Neo Mar 15 '19 at 12:07
  • 40
    Can I suggest that everybody rethinks their answers in the light of the above edits. – DJClayworth Mar 15 '19 at 13:25
  • More questions for CodeProject: Is this senior you write about the only person who could fix this bug? And who is responsible for pre-deployment testing? – DJClayworth Mar 15 '19 at 13:59
  • 35
    Are you telling me (on average), once a month, 90% of your user base is not able to use your service (during the day)?. Boy, you have to have bad QA. You must review your processes ASAP! – Marcel Mar 15 '19 at 15:18
  • @DJClayworth no he is not the only one who can fix it but it will be faster because he wrote it. The team owns everything from writing code, testing, and deploying. – Code Project Mar 15 '19 at 16:18
  • 2
    @CodeProject Does your team have dedicated roles for testing and deploying, or do you just have a group of developers that do everything? – 17 of 26 Mar 15 '19 at 16:31
  • 1
    Is there a consequence to this team if they stop what they're doing to immediately swap contexts? It sounds like you're really tight about releases. If they're under constant pressure and consistently burned out, and there are harsh penalties for falling behind on anything or for making mistakes on critical fixes, I'd completely get them being reluctant to jump into critical bugs. – AJFaraday Mar 15 '19 at 16:43
  • @17of26 We do not have dedicated testers. Developers do everything from writing code, writing tests (unit and UI), and releasing – Code Project Mar 15 '19 at 16:43
  • 24
    That's one of your problems. Developers do not make good testers. https://www.joelonsoftware.com/2000/04/30/top-five-wrong-reasons-you-dont-have-testers/ If instead of 20 devs you had 15 devs and 5 testers, your product would be in better shape and you'd be paying less salary. – 17 of 26 Mar 15 '19 at 16:45
  • 1
    Do you measure code coverage metrics for the tests? – Technophile Mar 15 '19 at 17:10
  • 6
    If the product is critical to your business, I don't understand why you don't have an out of hours team on rotation for situations just like this one? – JoeTomks Mar 15 '19 at 17:25
  • 39
    "they must fix it even if they have to stay late." uhh oh nope you're dead wrong –  Mar 15 '19 at 17:42
  • 3
    I agree with getting a set of dedicated testers, ideally one per developer team. One thing I haven't seen anyone mention is that the concept of estimating against time is frowned on in Agile, because humans suck at estimating time. This is why estimates are usually done with points in relation to some non-linear sequence. The idea is that after several sprints, you should have an idea of how long all of the work will take on average, but individual stories may take longer than their point estimates. I.e. a 2 point story will take a day on average, but might take 3 once in a while. – jhyatt Mar 15 '19 at 18:19
  • 2
    @17of26 I disagree with your assertion, as a developer-in-test :-) I find that having been a developer for two decades makes me a very good tester: I know how developers think, how they'll take shortcuts, and it helps me design fun tests to trip them up during integration testing :-D – Aaron F Mar 15 '19 at 18:41
  • 10
    @AaronF You are the exception to the rule because you actually like testing :) – 17 of 26 Mar 15 '19 at 18:45
  • 18
    If there is a critical bug, they must fix it even if they have to stay late. Is this a joke OP? Be less of a slave driver and more of a leader. RE: Last time they stayed late at night, I gave them extra 2 days off. This is also just stupid. Have you considered that 2 extra days don't matter if they had something important you made them miss on the one day they stayed late? Are you aware how bad sleep and burnout accumulate doing this? Their obligation to you is limited, you are using ownership as some sort of culty weapon to control them and it appears the senior sees right through it. – CL40 Mar 16 '19 at 03:03
  • 10
    Reading your update, the critical bug apparently has existed for over a month, and blocks one of the most basic functionality (login). This tells me that something is seriously wrong in the structure of your dev/integration environments, service architecture and/or planning. Staying late fixing the bug is not going to work. – frankhond Mar 16 '19 at 08:38
  • 3
    "If there is a critical bug, they must fix it even if they have to stay late." The critical bug more than likely got introduced by poor planning - maybe not enough testing or trying to rush your developers. This is your fault. You need to take responsibility for the poor planning. It's starting to sound like the senior engineer on that team is the only senior there, as they are protecting their juniors from you. – UKMonkey Mar 16 '19 at 09:28
  • 3
    @AaronF I think the core idea is that a developer makes for a bad tester of their own code. I like Joel's writing but the QA one is a bit wonky even if the overall advice is sound. I also don't like the idea that testers are cheap and you should somehow base your decision on that. A good tester is worth their weight in gold. Aside from the metaphor In some areas they really aren't cheaper than developers but run around the same price. And even then I don't think "how much do I have to pay people" should be a factor. You need testers, just like you need developers or desks. – VLAZ Mar 16 '19 at 11:07
  • 6
    "the bug prevents 90+% of users from logging into the system. On average, this happens once a month this year while it happened twice last year." If you can quote statistics on a single bug that covers a period of months (even years), then it is not a critical, because apparently it hasn't been important at all these past few months to fix, which means that your senior may be correct to ignore your call for urgency. Why is it critical now, when it apparently wasn't critical all these months? – Mark Rotteveel Mar 16 '19 at 11:28
  • 3
    "Ownership" != "company vassal". – BittermanAndy Mar 16 '19 at 15:49
  • 7
    @Aaron You misunderstood Joel's post. You shouldn't test your own code, because most people will subconsciously avoid the areas they know are brittle. A developer makes a great tester for other people's code for the reason you named. But considering the going rate for developers compared to testers that's a rather expensive proposition. – Voo Mar 16 '19 at 17:10
  • 1
    Step 5-6 is where you went wrong. Inadequate test cases (I realise tests can't cover 100% of everything, but you stated that they "didn't write all the tests" which I assume it means test that were planned to be written, but not implemented) then took the decision to release anyway with only critical bugs fixed. I would have cancelled/postponed that release in favour of a more thorough investigation of the testing etc (because of "unknown unknowns"). Not an answer because it doesn't address your problem, but is this something you considered? – seventyeightist Mar 16 '19 at 19:07
  • 7
    In this particular example, the bug prevents 90+% of users from logging into the system. On average, this happens once a month this year while it happened twice last year - say WHAT?!?!? Your product has gone BELLY UP once a month this year, and a couple times last year AND YOU ALLOW THIS TO CONTINUE?!?!? *ARE YOU KIDDING ME?!?!?!?* Your absolutely top priority is to stabilize this system - not just band-aid it, but do whatever is necessary so that this cannot happen again. You CANNOT be locking your users out, or they WILL find someone more reliable. COUNT ON IT! – Bob Jarvis - Слава Україні Mar 16 '19 at 23:36
  • 6
    This question seems to have confused "ownership" with "willing to work arbitrarily large amounts of unpaid overtime with little to no advance notice". They're not at all the same thing. Though ensuring your developers have some actual ownership (as in, shares/equity in the business) can help. Otherwise, why should they care? – aroth Mar 17 '19 at 01:18
  • 1
    @Voo It's not just that a dev may avoid a difficult area to test (anyone might). It's that a dev has certain assumptions about how things will be used that testers won't ("Why did you click that?" "Why did you do it in that order?") and that if they didn't think of a scenario in dev, they won't in test either. A new set of eyes will. Which doesn't mean that the dev should do no testing; they should write unit tests. It's just that unit tests are not the only form of test and should not be relied on to find all bugs. Or as the only thing you do. You need integration and smoke tests. – Gabe Sechan Mar 17 '19 at 07:02
  • 2
    Nowhere in your lengthy question did you mention the root-cause of "[Jan] test-cases supposedly finished. Found bug where user can't login... [Feb] released... [After deployment] retested everything and found the same issue coming back - [up to 90% of] users can't login." So what's your root-cause analysis of how that slipped past your tests? Did you ever write a testcase that detected that? If no, why not? If yes, why did you release before the bug was fixed? Like, what on earth does "After deployment, we found that some numbers are off (?!) so we retested everything" mean?! – smci Mar 17 '19 at 08:20
  • 2
    Honestly it sounds like you're ineffective at testing things, even when you know where the bugs are. You need to do a post-mortem with someone and figure out what your test process flaw was on that one. Not flog the employees into unpaid overtime. I also don't like that when you mention mistakes that sound clearly like your responsibility, it's always "we" released the product knowing it had severe bugs that the testcases probably didn't cover, or that you can't measure test coverage accurately. That one's on you, stop trying to pass the buck. – smci Mar 17 '19 at 08:25
  • 4
    Unpaid overtime, daily floggings and so on do not make up for a basic lack of methodology. If you're embarrassed at what a methodology review might reveal about your process, hire an external consultant. One obvious thing mentioned already by others is people should not just be testing their own code. Hiring one tester per team is also a good idea. – smci Mar 17 '19 at 08:29
  • 6
    I really strongly object to your title. More like "Our product has severe useability bugs, we've never written testcases that adequately cover them, our coverage metrics are unreliable, yet we keep releasing new code - what should we do differently?" – smci Mar 17 '19 at 08:37
  • @seventyeightist I found that out after the app was released, and since it's on the App Store, it's out of my hands. Our plan to tackle it is to make tests more visible on a dashboard. – Code Project Mar 17 '19 at 16:24
  • 1
    How often do emergencies happen? If an employer plays the "you must stay late to fix an urgent issue" all the time, then the employee is correct that it's not genuinely urgent. – Harper - Reinstate Monica Mar 17 '19 at 18:14
  • @smci They caught the bug the first time, and again on retest. It's the engineer in question that's unreliable. – Mars Mar 18 '19 at 00:14
  • @Harper OP literally states it's normally twice a year, yet this year, with this team (not the other 3 under OP's care), it's twice a month. And if you read OP's definition of critical, it's pretty critical.... – Mars Mar 18 '19 at 00:16
  • Do the other teams get these "critical bugs" as well or is it just that one team with the senior dev? I'm assuming (!) that you have each team correcting their own issues (with some particular area of the product)? – seventyeightist Mar 18 '19 at 19:21
  • @Mars: totally wrong. The OP states *"the [login] bug prevents 90+% of users from logging into the system. On average, this happens once a month this year while it happened twice last year". So how I summarized it is totally correct: "severe useability bugs, we've never written testcases that adequately cover them, our coverage metrics are unreliable..."*. Do not buy the rest of what the OP says, semantics about "ownership" are a smokescreen. Clearly product has always had critical bug(s) in login, they've never shipped a release that fixed it, nor ever written a testcase that covered it – smci Mar 20 '19 at 08:59
  • @Mars: ...and the killer part is *"7. We released it to customers with all tests green. After deployment, we found that some numbers are off so we retested everything and found the same issue coming back - users can't login."* Can you spot the process errors in that? The coverage metrics (of login functionality) are garbage. That simple. Do you honestly believe that after a year of this, they have a testcase that covers it or not? Almost noone here does. Where does the OP talk about root-causing it? Nowhere. Where's the process? – smci Mar 20 '19 at 09:11
  • @smci It looks like we're reading it 2 different ways. I believe OP means 2 crit bugs all year, not 2 per month for a year. Crit bugs don't necessarily mean login bugs, so I think its wrong to assume its always the same, or even related, bugs. – Mars Mar 20 '19 at 12:37
  • @smci #7 says "no, we designed it right and someone screwed up. " or it says "the senior engineer in question lied." The reason OP isn't talking about the root-cause is because that's a separate question... Question a (root-cause) = What how do we fix it so this doesn't happen again? question b (the question being asked) = Someone on this team screwed up and the person responsible isn't willing to do what's necessary to minimize the damage – Mars Mar 20 '19 at 12:41
  • @smci I'm curious about 2 things: If the senior IS at fault, do you feel they should put in overtime to fix it? (For the sake of leniency, let's say paid OT). And under what condition is it not senior's mistake? – Mars Mar 20 '19 at 13:02
  • @Mars ^^^ It doesn't help our opinion of the OP's basic communication that armies of people here still can't decode precisely what they're saying happened, and OP now won't respond to clarify basics. ^^ To me, no it doesn't say either of those. It says the bug existed a year before this release, they never wrote tests that adequately covered it, OP was aware of that before he/she authorized this release, now they want to somehow belatedly use this as a pretext for never-ending firefighting. Clearly the OP did not have a releasable product. Do not buy the story about scapegoating the developer. – smci Mar 20 '19 at 13:04
  • @smci We're not here to discuss the reality, we're here to discuss the story. If you want to ask what I think if the situation isn't as OP says, then you should ask that question ;) – Mars Mar 20 '19 at 13:06
  • ^^ Given it is impossible to even understand precisely what the OP is blaming the SrDev for, but that all the bugs should have been caught with any decent software process, the SrDev is not responsible for the OP's behavior and general disregard for software process. I don't think many are interested in a sideshow about how many unplanned nights of overtime the OP believes they are entitled to. – smci Mar 20 '19 at 13:09
  • "We're not here to discuss the reality, we're here to discuss the story." is a very strange proposition to make. Where the OP is vague or self-contradictory on so many basics that it calls into question both their set of facts, competence and basic communication, we absolutely have to try to find the objective reality. – smci Mar 20 '19 at 13:11
  • @smci Sorry, I don't see where the OP is contradicting them self... – Mars Mar 20 '19 at 13:13
  • You know what I think the most likely situation was? A merge conflict handled incorrectly without proper retesting. Or a seemingly unrelated bug, where the person who fixed it didn't imagine the repercussion and only tested related to the bug. This happens often when tests are not automated--you test only what you fixed (yes, that is not ideal and I push for test automation, but in my experience, it isn't very common) – Mars Mar 20 '19 at 13:16
  • Which is a miss that needs to be fixed through a better process. Doesn't change the fact that it was the senior's miss or that immediate firefighting may be required. I think this question is step 1 and the process fix is step 2 – Mars Mar 20 '19 at 13:20
  • @smci By the end of Jan they said all features are done with tests in test cases. We deployed them to our test environment and found a bug where user can't login. It turned out that they did not write all the tests I did A. Oh, actually I didn't do A. And then again for this bug. I did B. Oh... I guess I didn't do B. – Mars Mar 20 '19 at 13:28
  • @CodeProject you should read https://www.amazon.com/Phoenix-Project-DevOps-Helping-Business/dp/0988262592 seems you have constant fire fighting and unplanned work in your organization – Nahum Mar 20 '19 at 13:37
  • @Mars: there's no need to keep repeating back to me the same unconvincing story which I'd already read five times before you posted. Note that every time OP is involved in a mistake, the royal 'we' slips in, never 'I'. ('We' knew critical login bug(s) existed for a year ongoing and never fixed it(/them), 'we' never wrote the testcase, 'we' decided to release it, 'we' found that some numbers are off. In your hypothesis, 'we' failed to notice the merge conflict and 'we' failed to rerun the test suite before release. This is just blame-spreading)... – smci Mar 20 '19 at 22:48
  • As to the OP deciding to release without all the tests written (and they should be prioritized, with the login bug tests high up), that was the OP right? And I don't like the vagueness of "It turned out that they did not write all the tests" a) which specific people? SrDev or others? b) how the hell did the OP not notice? c) when did "it turn out" - just before the release? That's not a software process. Trying to blame this on one failed merge by one person is not cutting it. The OP owns the process.) – smci Mar 20 '19 at 22:54
  • @smci Very interesting take! I saw that as the exact opposite! Instead of saying "SrDev/SrDev's juniors didn't write the tests," OP put themselves in there to make it a "we". to make it sound as if OP wasn't placing blame. How the hell did OP not notice? OP is running a team of 20. Chances are, OP isn't the one reviewing code, or even looking at the code period. OP might not even know how to read code. That how the hell did the OP not notice should probably be directed at SrDev...OP owns the process OP may not know anything more than what is reported to OP.... – Mars Mar 21 '19 at 05:18
  • A lot of this convo is just projecting based on our own environments/experiences and not really helpful.. I think its time to put it to rest :) – Mars Mar 21 '19 at 05:19
  • 1
    @Mars: I agree that the OP's account and overuse of 'we did X' or the passive voice, when describing things that went wrong, makes it impossible to know who did/did not do what and who was/was not responsible, and renders this objectively unanswerable. But it also doesn't inspire confidence in the OP's communication, and hence their version of things. Anyway yes might as well leave it. I doubt OP will come back and clarify the missing information. – smci Mar 21 '19 at 05:21
  • If there is a critical bug, they must fix it even if they have to stay late. How about you as a manager, buddy? I hope you fix at least half of them before asking others. Because you are a leader, and you explained nothing understandable about how you keep a buffer for critical bugs or how you pay overtime. – Prasad Raghavendra Dec 09 '19 at 17:29

21 Answers

344

You seem to be confusing two things:

  • Them working any number of hours to deal with unexpected or unplanned issues.

  • Them being responsible and providing quality work in a predictable way.

Ownership is not about the team working the whole night to meet your promises to customers. Ownership is about knowing what's in the code and how it works, having a plan, and being able to tell you how and when things will be done. Ownership is developers making the right decisions so the code works correctly not just tonight, but in the years to come.

Sorry if this is a bit rough, but I've had too many managers tell me variations of your post. More often than not it boils down to:

  • lack of clear mandate
  • changing requirements
  • short term focus
  • constant urgency

Would you please elaborate, in the question, on what you as a manager did to prepare those releases and empower your team, and on how you listened to their feedback? Then we can talk about ownership.

Jeffrey
  • 2
    Sometimes it's difficult for me to put my thoughts into words. You've done it very succinctly. Thank you. If I could up-vote this to infinity, I would. – joeqwerty Mar 14 '19 at 18:04
  • 173
    Well said, especially the last point. If everything is urgent, then nothing is. – 17 of 26 Mar 14 '19 at 18:10
  • 18
    While I kind of agree in principle, IMHO we're ignoring an important fact. If your manager asks you to start fixing something TODAY then you DO it today. You do not postpone it to tomorrow because you think it can wait, no matter what. You may not finish (because, rightly, you may not want to work overtime) but you start ASAP; it's not your call. It's close to insubordination and as such you risk being disciplined. – Adriano Repetti Mar 14 '19 at 18:46
  • 132
    @AdrianoRepetti I strongly disagree. Poor planning on my manager's part does not mean life-or-death urgency on my part. Yes, I do what my manager wants to the best of my ability, but I also try to keep my manager's expectations in check. If they are asking me to do something that is unreasonable, I am not going to stress myself out trying to do it. – David K Mar 14 '19 at 19:49
  • 28
    @AdrianoRepetti In the original question it mentions that the employee says that he "can't finish it today", so for all you know he does start working on it, but it isn't possible to finish until the end of the workday. – Maxim Mar 14 '19 at 20:18
  • 3
    @AdrianoRepetti I will push the "START" task button today - marking the task as "IN PROGRESS" and tomorrow I will think about working on it. – emory Mar 14 '19 at 22:37
  • 6
    Without commenting on the merits of the answer, it doesn't answer the question OP (currently has) posted. – user1717828 Mar 15 '19 at 01:06
  • 7
    +1 for the logic and explanation. But nothing here has to do with ownership, unless the employees participate in ownership (real ownership) of the company. Sometimes people talk loosely about a feeling of ownership, meaning a feeling of responsibility, but that's not ownership. A business or other property owner may want others to feel or act like they too participate in owning his property, but that does not make that ownership real. Such use is essentially ideological. – Drew Mar 15 '19 at 04:34
  • 2
    this is exactly what I've always wanted to say to my managers but never could – Pixelomo Mar 15 '19 at 05:37
  • 3
    @david as an employee you do not educate your manager disobeying his orders. People are rightly fired because of that. Yes, I can see that many people dream to do it but that doesn't make it a sensible advice. – Adriano Repetti Mar 15 '19 at 07:15
  • 5
    @DavidK Throwing around words like "poor planning" when talking about Critical Bugs is pointless rhetoric. If your manager could plan for that, they would be better served using their precognitive abilities to play the stock-market or to win the lottery. The only "poor planning" is not ensuring that there is always someone "on-call" to provide out-of-hours support when (absolutely) necessary. (And "90% of users can't even get past the login screen" is a situation I would call "absolutely necessary") – Chronocidal Mar 15 '19 at 08:52
  • 4
    How is this being upvoted? This should be a comment as it boils down to asking for rephrasing and elaborating. – LVDV Mar 15 '19 at 09:41
  • 2
    @LVDV The whole answer, including "Would you please elaborate", seems to me less of a genuine request for rephrasing and more of a thinly veiled slight at the OP's world view. And 'Re-examine your view of the situation because X, Y and Z' is kind of an answer... – Brent Hackers Mar 15 '19 at 12:30
  • 5
    @AdrianoRepetti As a manager, you don't ignore employee feedback. As a manager, you don't ignore time or resource constraints. As a manager, you don't ignore issues raised during work that are blocking progress - and still all of those are very common "manager practices". If your employee isn't obeying you, find out why. Sometimes the problem is with them, but sometimes the problem is with you. – T. Sar Mar 15 '19 at 12:39
  • 20
    @Chronocidal Critical bugs in production can almost always be traced back to poor planning. If a critical bug gets into production, it means that not enough time or resources were spent in the testing phase. Critical bugs will often get to the testing phase because not enough time or resources were spent on development. – 17 of 26 Mar 15 '19 at 12:46
  • 7
    @17of26 You are a 100% correct. Heck, this is a login error that affects 90%+ of the user base - how in the world this wasn't caught? – T. Sar Mar 15 '19 at 13:18
  • 13
    @AdrianoRepetti: Real world example: Over 6 months, about 80 of my days (= 4 months) have been spent on "urgent fire" emergencies. Every single project has developer absences every single day because some developer needs to emergency fix something in a past project. Over these 6 months, management has specifically blocked me from implementing any form of testing or code review because "it takes extra time". Projects are sometimes intentionally put on hold until they become more urgent than the others. Jumping when management tells you to jump is exactly how this situation came to be. – Flater Mar 15 '19 at 14:25
  • 2
    While I can agree on taking ownership being more about delivering quality than putting in overtime, the rest of your answer doesn't fit the updated question. Their methodology and planning seems spot-on. It looks like the team and senior dev in question don't keep to the same standards as the others by omitting mandatory tests and undermining quality. Good manager, bad dev / team performance. – Søren D. Ptæus Mar 15 '19 at 15:21
  • 5
    @SørenD.Ptæus The management failure in this scenario is not getting to the root causes of critical bugs continually making it into production. Making devs work overtime to fix the critical bugs is treating the symptoms and not the disease. – 17 of 26 Mar 15 '19 at 15:30
  • 1
    The "a critical bug affecting 90% of users shouldn't have gotten into production" hindsight approach is nice and all, but I don't get how most people here agree with working on your features instead of fixing the way the company makes money. Even if you take a "it's treating the symptoms and not the disease" perspective, the lead dev was obviously not going back to his desk to fix the disease (that would be more of a manager's duty). – R. Schmitz Mar 15 '19 at 17:08
  • 3
    Something like this - if 5% of users can't login, that's a developer problem (wrote it in a way that's got a bad path that can fail, QA didn't have visibility (or time to play)). If 90% of users can't login, that's called "QA did nothing". Or the app's UX is horrendous (i.e. the "happy path" is weird to go through and there's not enough guidance or error messaging). – Delioth Mar 15 '19 at 18:31
124

Even when I told him I needed it now, he said he had something else to do and sneaked off when I was not there.

Today, there was a critical bug and this senior guy said the same thing again: "I can't finish it today. I have a meeting with friends and I have to go." Then he sneaked out while I was talking to my manager.

In both of these examples, you refer to him as sneaking off, but by your own words he told you that he wasn't going to do this work and then didn't do it. Sneaking off implies he's being deceptive or dishonest, but it sounds like he's being transparent, and you ought to recognize that. I've worked with people who say they'll handle things and then disappear, and those people deserve to be fired. Someone who informs you of their bandwidth and then follows through is different entirely. This person's integrity isn't an issue; he is only unemployable if his results aren't sufficient.

Last week, I told them in a team meeting that I expect a higher level of ownership. If they promise something, they should do it. If there is a critical bug, they must fix it even if they have to stay late.

This is a reasonable statement and a level of ownership that senior engineers should generally accept with some caveats:

  1. Critical bugs must actually be critical. For example, in my own career I have stayed late to fix "critical" bugs that were then not deployed into production for two months. In those cases, it was a manager freaking out about something and wanting it now instead of actually a critical bug. Of course, there have been actually critical issues as well.
  2. Staffing levels must be generally sufficient. Meeting release dates and fixing issues are important, but if we are always late because we have 3 people doing 4+ people's work, that's a different situation.

Is there an alternative way to deal with issues like this?

Some development methodologies have built-in ways to manage these issues. In Agile development, for example, sprints are a way of promising what work will be delivered. It also includes built-in ways of measuring velocity (the amount of work being accomplished) and usually goes along with software (JIRA is the most popular one, I believe) that makes it visible whether a team or individuals are meeting those goals. In Agile development, if you need to change course mid-sprint, such as taking time to fix a critical bug, it inherently means you're changing the scope. Normally, you take things out in order to add whatever it is that must be added. This process makes it really easy to evaluate whether "I can't get to it today" is because he's working hard on other important goals or because he is just being difficult.
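
As a rough illustration of that scope trade-off (the numbers and task names below are invented, not taken from the question), here is a minimal sketch of the arithmetic a team can walk through when an unplanned critical bug lands mid-sprint:

```python
# Hypothetical sprint-scope check: an unplanned critical bug has to displace
# roughly its own size in planned work. All names and figures are illustrative.

sprint_capacity_points = 20          # what the team committed to
completed_points = 8                 # already finished this sprint
critical_bug_points = 3              # rough size of the unplanned fix

planned_remaining = [("feature A", 5), ("feature B", 4), ("refactor C", 3)]

remaining_capacity = sprint_capacity_points - completed_points
overrun = (sum(p for _, p in planned_remaining) + critical_bug_points) - remaining_capacity

# Defer the smallest planned items until the overrun is covered.
deferred, freed = [], 0
for name, points in sorted(planned_remaining, key=lambda item: item[1]):
    if freed >= overrun:
        break
    deferred.append(name)
    freed += points

print(f"Over capacity by {overrun} points; defer: {deferred}")
```

Making the displaced work explicit ("we take the bug, and refactor C moves to the next sprint") turns "stay late" into a visible scope decision.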

IMO, it's a fantastic method of software development that unfortunately is almost never done correctly.

UPDATE: In response to the question edits, this bug is absolutely critical in nature (at most companies it'd be called a showstopper rather than merely critical) and should be fixed immediately. I would follow the technique I described above: taking things off his plate in exchange for him working on it now.

It sounds like this project has been a mess and very stressful for everyone involved, but a bug that makes it so 90% of users can't log in is worth staying late for. You need to assess whether or not this employee has completely checked out (in which case you have to help him move onto other employment) or if the project has just worn him down and he needs a break.

dbeer
  • 6
    I agree with most of your comment but I'm not sure Agile applies here. I'm assuming when the OP says "critical bug" he means something that has come up in released software that really has to be fixed right now (e.g. the recent Facebook outage... I suspect a lot of people were burning the midnight oil). It's true that Agile will let you measure impact on the schedule but OP doesn't even mention the existing work schedule. – DaveG Mar 14 '19 at 19:26
  • 58
    +1 for "Critical bugs must actually be critical." Just last week I saw a "critical" item that was ultimately ranked 6th in priority... I've learned, for better or worse, that "critical" is a word which can just be ignored. – zr00 Mar 14 '19 at 20:30
  • 2
    @DaveG I don't know if they are using Agile or not; I'm recommending it as a process that makes the impact of asking for a bug fix today more clear to all parties. My experience is that all parties involved have a better experience when the impact of escalating a bug is more transparent. – dbeer Mar 14 '19 at 21:23
  • 10
    To add to this, most engineers are flexible about working extra hours for genuinely "critical" bugs. But OP, do you then give them hour-for-hour time off in lieu? If not, expect your staff to start working to rule as this team do, because busting a gut for the company does not in the end make any sense if the company doesn't give anything back. – Graham Mar 14 '19 at 23:34
  • 3
    I love having 5 “high”, “show-stopping” defects that boil down to “this menu is lime, but is supposed to be chartreuse”. – zero298 Mar 15 '19 at 02:04
  • 1
    @Graham Quite right. I once worked for a company where my team got berated by the CEO for getting in at 11am after working till midnight the night before to meet a management-imposed deadline. I fired back that if he was unhappy, we would all be happy to henceforth keep strictly to working our contracted hours. Luckily for him, he had enough sense to shut up and never criticise any of us for arriving late again. – Mark Amery Mar 15 '19 at 15:37
  • @Graham The question has been updated; "Last time they stayed late at night, I gave them extra 2 days off." – R. Schmitz Mar 15 '19 at 17:14
  • 2
    "If they promise something, they should do it." -- It sounds like this guy is the only one who's actually living up to that, by not promising things he isn't going to do. If "do what you commit to" is the standard OP wants to impose, it's probably everybody else that needs the talking-to... – Tiercelet Mar 15 '19 at 20:23
  • 1
    While I agree with your answer, the fact bug prevents 90+% of users from logging into the system. On average, this happens once a month this year means no time is spent on testing critical functions. That time not allocated is managerial issue with "do this instead, it's more important". If that happened where I work/-ed more than once and I've warned them, I'm not staying late next time when we've not had time allowed to make sure it never happens again (e.g. unit and functional testing). – rkeet Mar 16 '19 at 09:32
  • 2
    @zarose that's called "priority inflation". You put a priority process in place: Normal (5 days) High(1 day) and critical (within the hour) and it starts getting abused. Once people find out, they're getting service faster, suddenly everything starts being critical. Where I work, we managed to put a process into place for showstoppers, where for every critical issue a team was formed to fix it, with a manager having the oversight. And to much annoyance of management I followed that process to the letter..."but I just want you to go fix it!" me: "but process". It reduced critical issues by 90%. – Pieter B Mar 16 '19 at 17:52
  • I think OP meant that OP reached out to his boss to have his boss kick subordinate's butt... and while he was doing that, subordinate departed the premises immediately. This is not the first time 'round with this sort of thing, so OP reasonably guessed subordinate's motive for leaving was to avoid the butt-kicking. Hence, "sneaking off". – Harper - Reinstate Monica Mar 17 '19 at 18:17
  • @PieterB Oooooooo that's a pretty nifty process I hadn't considered before! Definitely tucking that one away in my back pocket. – zr00 Mar 18 '19 at 14:31
76

In my office we often quote the following:

“Poor planning on your part does not necessitate an emergency on mine.”

In my experience, developers are often motivated to help with a problem that appeared because of a mistake on their side or because of something unforeseen.

But all too often, issues arise that are not only unsurprising but predicted. Before you decide to give your developer an ultimatum and likely make him look for a new job, you should ask yourself the following:

  • Have you done enough to avoid "critical" bugs in the first place? Did you give developers enough time to implement testing, code reviews, refactorings and monitoring?

  • Are you making sure that new features get activated when there is enough time to fix them? (as opposed to late in the evening or on a Friday).

  • If critical bugs are common, are you paying enough for overtime or on-call duty?

  • Do the developers you want to show ownership actually "own" the release process? Would they be able to stop a feature release if they thought it was buggy?

  • Are your deadlines realistic and agreed on with the dev team?

If all of the questions can be answered with a clear "yes", then you might have to let go of your senior developer.

If any of the answers is "No" or "I am not sure", then I would start looking for the problem in management and fix these problems first.

Helena
  • 2
    I have listed the things I have done to prevent critical bugs, but obviously it's not enough. Of course I give them enough time, because they tell me when they think they can finish it, and this time includes writing tests and code review. On top of that, I add 30-40% buffer time plus another 2 weeks for testing. Critical bugs weren't a common thing until lately, when we had them twice in three months. And yes, the team owns the release process through CI/CD. I believe the deadline is realistic because everyone agrees with it from the start to the end of the sprint (I check in every time we have a standup meeting) – Code Project Mar 15 '19 at 03:24
  • "Have you done enough to avoid "critical" bugs in the first place? Did you give developers enough time to implement testing, code reviews, refactorings and monitoring?"

    I strongly agree with this point. There are processes that ensure good-quality software. If a manager prevents his/her engineers from completing these processes, then crises that occur as a result are the manager's problem, not the engineers'.

    – user2818782 Mar 15 '19 at 06:36
  • 10
    Developer who goes "not my bug, I don't care how much the company loses because of it" is... a bad person to have on a project, no matter how technically good developer they are. Any developer who doesn't care should just find a new job where they would care and/or where emergencies don't really happen and/or the company does better job at avoiding emergencies. Staying but not caring is not good for anybody. – hyde Mar 15 '19 at 10:12
  • 13
    @hyde: On the flip side, quite some companies excel in creating such developers. And the company in the question sounds like one. You can cycle the people, but that won't solve the problem. You end up creating just more cynics. – MSalters Mar 15 '19 at 11:08
  • 24
    @CodeProject "And yes, the team own the release process through ci/cd" - that's... not what Helena means. You're talking about the team controlling the mechanism by which code gets deployed. Helena is talking about the team owning the decision about whether to release - about them being able to decide whether it's advisable to release a feature as it stands, and decide not to (and to let a deadline slip) if they think it's not ready. Your comment - which focuses most on defending the deadlines you impose on them - suggests to me that they do not in fact have such ownership. – Mark Amery Mar 15 '19 at 15:28
  • TL;DR: "Mirror. Here. Use it". – user13655 Mar 17 '19 at 18:18
47

You claim a lack of ownership by the team. Everything your developers build is owned by the company, not by them. When you say that your employees should "own" the results of their work, does it also mean that they will receive the profits that those results make for the company? If it doesn't mean that, they don't truly own the work, and you cannot ask for ownership from them.

If there is a critical bug, they must fix it even if they have to stay late.

Your solution to fixing critical problems by making your people stay late is convenient for the company and the employees pay the price. Again, that would be OK if they also get a share of the profits. Do they?

In this particular example, the bug prevents 90+% of users from logging into the system. On average, this has happened once a month this year, while it happened twice in all of last year.

When this happens so often and you don't install organizational procedures to reduce the impact of those errors, it is you as an organization that is at fault.

Actually, your current approach to fixing "critical" problems and your contemplation of firing your employee could be considered signs of a dysfunctional organization. Your employee's behavior might be his way of reacting to that. Your update to the original question with a list of what you think you are doing right (as opposed to thinking about what you might be doing wrong) also shows that you might have an issue accepting that you as a manager are part of the problem.

There are a lot of things management can do to improve quality and reduce urgency before you ask employees to stay late:

  1. No matter how much you think you focus on quality, the results show that you don't. You have to seriously improve the quality of your development process, which could mean measures like reviews, inspections, pair programming, increased testing, redesign of critical components, improved architecture and design, etc. You had better start analyzing the organizational issues that cause those problems instead of writing down the list of measures you have already implemented; obviously, they are not working.
  2. Why does your employee have to stay late to fix the error? Can you do your releases in the early morning to give your developers the entire working day to fix issues?
  3. Have you thought about using feature toggles or other measures to quickly revert to the previous version of the feature to give your team time to fix the problem? (A minimal toggle sketch follows this list.)
  4. You cannot blame your employees for having plans for the evening when issues pop up on short notice. You can set up a system of stand-by duty on days of critical releases. Then people know beforehand that they might have to stay late and can prepare accordingly.
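
As a rough illustration of point 3, here is a minimal feature-toggle sketch (the flag name, the file-based config, and the login functions are assumptions made for the example, not details from the question). The idea is that a risky feature can be switched off without redeploying anything:

```python
# Minimal feature-toggle sketch. In practice the flags would come from a
# remote config service so they can be flipped at runtime without a release.
import json
import os

def load_flags(path="feature_flags.json"):
    """Read toggles from a config file; a missing file means all defaults."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

FLAGS = load_flags()

def is_enabled(flag, default=False):
    return bool(FLAGS.get(flag, default))

# Hypothetical stand-ins for the stable and the new code path.
def legacy_login(user, password):
    return f"{user}: logged in via the proven path"

def new_login_flow(user, password):
    return f"{user}: logged in via the new path"

def login(user, password):
    # The risky new path is guarded; turning "new_login_flow" off in the
    # config routes everyone back to the proven implementation.
    if is_enabled("new_login_flow"):
        return new_login_flow(user, password)
    return legacy_login(user, password)

print(login("alice", "not-a-real-password"))
```

Whether you prefer toggles or redeploying the previous build (as debated in the comments below), the point is the same: have a rehearsed way to get users back to a working state in minutes, not hours.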
Sefe
  • 6
    The third point on this is very standard for critical functions on all of our workplace programs. +1 – IT Alex Mar 15 '19 at 14:23
  • Please do not do point 3. Instead containerize your application. Deploy a new version when ready. If buggy, you can instantly re-deploy the previous version. Feature toggles are a great way to have useless code floating around because it's never used / "turned on". +1 for the remainder of the answer :) – rkeet Mar 16 '19 at 09:45
  • 3
    @rkeet: totally disagree. it's extremely important to be able to toggle on/off specific features at runtime without having to redeploy anything. And this has absolutely nothing to do with having your applications containerized or not. I don't want to have to involve thirdparties / release managers / platform supporters just to disable / enable a simple feature that's causing havoc if I can avoid it. – devoured elysium Mar 16 '19 at 16:43
  • @devouredelysium If a feature needs to be turned off because it's causing havoc, you have broken production. If you have broken production, that entire build is not ready. If a build is not ready, it should not be in production. Assuming the previous build works, redeploy that and fix your broken production. Containerized deployment takes like 2 seconds anyhow. If your chose third parties / rm's / platform supporters are as unreliable as it comes across as you think they are, you need to re-evaluate your choice in software. – rkeet Mar 16 '19 at 17:42
  • 1
    @rkeet - your points make sense in a server-side environment, but as has gradually come out in the comments, the question is actually about a mobile application - which is a situation where on the iOS side updates have to go through app store approval latency and on the Android side there is an untestable variety of platforms implementations, and on both where the installation of updates is subject to manual end-user approval. Feature toggles or even "this version is broken, you must update it to get to any other screen" server responses are clever attempts to deal with these realities. – Chris Stratton Mar 16 '19 at 18:13
  • 1
    @rkeet: so by your reasoning, if your right arm hurts, it must be because your whole body is damaged? If feature A is problematic (and it can even be because another system is malfunctioning and it's preferable just for safety to also disable this functionality) then you shut off feature A, you don't roll back a whole sprint worth of functionality. – devoured elysium Mar 16 '19 at 20:17
  • @ChrisStratton Hadn't read anywhere it concerned app stores / mobile apps. Then, and only then, I suppose it could make sense. Better to make sure it never reaches production. – rkeet Mar 16 '19 at 20:48
  • @devouredelysium biology doesn't come into play here. But playing on that, yes, if your arm hurts, it is the body that is damaged, is your arm not part of your body? Weird analogy. Whatever. And yes, I would roll back a whole sprint of functionality. Not throwing it away, just taking it down to be fixed. (In your analogy, you'd apparently rip off the arm to fix in a hospital, or put a tournequette on it till it's fixed...). In fact, I recently rolled back a release with 3 weeks worth of work. It causes errors (though hotfixable) it would've caused overtime. No need, roll back, tomorrow new day – rkeet Mar 16 '19 at 20:52
  • @rkeet it's a little hard to have any customers if your mobile app never "reaches production" so the mechanisms proposed are mostly definitely meant to. Arguably, the asker might consider going further and making their app essentially a viewer for dynamic content downloaded from a server and at most cached; then they actually can roll back changes to much of what matters easily. But that can be mathematically demonstrated to be the extreme of making the entire thing a collection of fine grained toggles. – Chris Stratton Mar 16 '19 at 20:55
  • 1
    @rkeet: and are customers (and other service's owners) going to be willing to have your old version running for hours (or even days) until you find the problem, fix it and properly test it? What if then it's not really fixed and you have to roll it back yet again? lol. So your plan is to have the whole office looking after you instead of properly isolating the problem (shutting it off) and calmly fix it while keeping everything else running smoothly as always? Good luck. – devoured elysium Mar 16 '19 at 21:00
  • @devouredelysium I suppose you don't mind staring at blank, not loading, pages. Or "500 Internal Server Error". Or "Incorrect login, please try again" or whatever else. I prefer a working version. If a feature is broken, yes, take the bugger offline and fix it. Don't let me, user, be troubled by your failings. – rkeet Mar 16 '19 at 21:03
  • 1
    @rkeet: if the feature is broken, as the OP original stated, you disable the whole feature. The user won't see it. You don't leave it broken. That's precisely the idea of having a killswitch. – devoured elysium Mar 16 '19 at 21:04
  • 1
    @devouredelysium You do realize that if you'd deploy a previous version, the whole broken thing is not in it right? You then fix it without bothering your end-users who judge you on their experience. When you've fixed it: re-deploy. – rkeet Mar 16 '19 at 21:19
  • @rkeet: I have written feature toggles or other measures to quickly revert to the previous version. My main point is to make it easy to revert. How to do that should be the OP's decision, but he should definitely find a solution. – Sefe Mar 16 '19 at 21:28
  • @rkeet: and you leave all the features of all the other teams, possibly even with other minor bug fixes included, grinding to a halt? lol. I won't continue this discussion. Absolutely no one in the business does anything remotely similar to what you're describing (as the post we're commenting on testifies) if they can avoid it. – devoured elysium Mar 16 '19 at 23:03
  • @rkeet your solution to this problem requires specific project and ci/cd setup. What if db migration is not safe to be rolled back? What if other already deployed services depend on some other properly created new feature? What if new build fixed some critical bug? Etc, etc... – Askar Kalykov Mar 17 '19 at 17:10
40

Working in software this is very common.

Treat your people as professionals. You're talking about ownership but then demanding that a 'critical' bug must be fixed NOW.

Is the bug actually 'critical'? Is it the result of unclear requirements? Our old friend 'scope-creep'?

In each of these you (as the manager) need to manage expectations. Not every bug is 'critical'. Requirements can suck. Project scope changes.

Instead of demanding they drop everything for something 'critical', work with your teams to determine when it will be fixed. Then hold them to this estimate.

I've been putting 'critical' in quotes because after 30+ years in this field (yikes, I'm old) this term is badly misused. Not everything can be 'critical'.

JazzmanJim
  • 30
    Holding people to their estimate is pointless - it's called an estimate because they don't actually know when it'll be fixed. – Erik Mar 14 '19 at 19:47
  • @Erik In that case they should know exactly what went wrong to cause their estimate to be off. – JMac Mar 14 '19 at 20:11
  • 26
    @JMac when it's solved they will know what was wrong and where the time went, if you want to have a retrospective. But until it's solved they can only tell you what the time has gone to trying (or what other obligations have gotten in the way), and maybe their current hunch for what to check / try next. Some discussion along the way can be productive and insight can come even from the act of conversation; but there's a point where discussion and reporting itself become a self-defeating source of delay. – Chris Stratton Mar 14 '19 at 20:20
  • @ChrisStratton It doesn't really make holding people to their estimate "pointless" though. I'm not saying they need to take time away from the solution to articulate where their time went; but when someone provides an estimate they should be held to it, to a reasonable extent. If holding people to their estimates is pointless, so is getting the estimate. The fact is, estimates are useful and commonly employed, often required to organize work in a way to meet deadlines. I was mostly getting at the point that you should be able to support your estimate, and changes to it. It's not a guess. – JMac Mar 14 '19 at 20:47
  • 6
    Indeed, estimates on complex problems usually aren't meaningful, and everyone with any sense knows that. At best for something large with the right guess multiplying factor you may come out approximate on average, but the specific time sinks are rarely those expected, so it's really just a test of one's pessimism skill. – Chris Stratton Mar 14 '19 at 21:01
  • 1
    @ChrisStratton I would argue that there is a world of difference between "often wrong" and "aren't meaningful". No matter how incorrect your estimations are, there should still be meaning behind them, and any estimate you give will definitely be interpreted with meaning. The person giving the estimate isn't solely responsible that the estimate is right, the system is far too complex for that, but they should definitely be held to it to an extent, or else project planning becomes essentially useless. It's not like "we'll finish when we finish" is acceptable to most clients. – JMac Mar 14 '19 at 21:11
  • 6
    In that case you want to have people make an honest estimate, then multiply that by 5, and then tell you that amount of time, just to make sure that it is likely that the issue will be fixed within that timeframe. That's no longer an estimate but safe expectation management. –  Mar 15 '19 at 08:11
  • 2
    Here is nice talk about estimates: https://www.youtube.com/watch?v=QVBlnCTu9Ms (I don't agree in every aspect with everything, but it makes a few good points). Hint, the title is "#NoEstimates" – Frank Hopkins Mar 15 '19 at 12:21
  • 2
    "yikes I'm old" - That means your input is very valuable. – DxTx Mar 15 '19 at 16:18
  • @DT While I agree with JimmyB, that's not necessarily true. I've met a fair few older developers, managers and bosses by now. Quite a few them might've been around the block, that doesn't mean they learned from it ;) (though JimmyB seems to have picked up on things ;) ) – rkeet Mar 16 '19 at 09:40
  • @Erik I would argue that holding people to estimates is reasonable because estimates should always be overestimates. If you estimate 2 weeks, and deliver at one, no one is going to complain that they've a week to do extra testing or that you've a week to spend on some other project. Project planning completely depends on this fact; and if you want a project manager to be your friend, don't deliver late. If you deliver late on the other hand, you don't know what impact on the rest of the project you're going to have - which can result in an urgent issue. – UKMonkey Mar 16 '19 at 11:11
  • @UKMonkey once a manager told us all estimates have to be accurate and we absolutely have to finish within that timeframe. From that day on until he left the company about two months later, all my estimates were "two years". I managed to finish all tasks within that time frame, which gives me 100% correctness of the estimates. For some strange reason, as far as I know, none of my estimates were communicated to the customer. I wonder why. – Josef Mar 16 '19 at 14:32
  • @Josef lucky for you they didn't. Professional services engineers are expected to make 3-5x their salary from clients; and not doing so puts their salary at risk. The usual way of estimating is "how long you think it'll take x 3" – UKMonkey Mar 16 '19 at 14:59
  • @UKMonkey but if someone asks me to give a "estimate" that is the absolute maximum, I will give that. If you estimate three times what you think how long it takes there will be tasks where it takes longer than that estimate. – Josef Mar 17 '19 at 22:01
  • @Josef you've never been in the situation where the number you've given has been sent to the client and money charged based on that - you acknowledge this; but had it been, clients would've always rejected your number, and as an engineer failing to bring money to the company, your contract would've been terminated.

    An estimate is an estimate sure; but if you're picking stupid numbers then you're making no friends at best, and at worst you can be putting your job on the line.

    – UKMonkey Mar 17 '19 at 22:28
  • 1
    @UKMonkey those aren't estimates, those are pessimistic "done by" dates. If people need those, they need to tell me that, and I'll give them my normal estimate x 5. But if you ask for an estimate, I'll give you an honest estimate. If numbers need to go to people who fundamentally don't understand what an estimate is, the person I'm giving the estimate to needs to apply whatever nonsense they want to my numbers. I'd rather be honest with the people I'm working with. – Erik Mar 18 '19 at 07:39
  • @UKMonkey I am often in the situation that numbers I give are sent to clients. I often discuss numbers with clients in projects I manage. I never acknowledged anything else. If someone asks me for a worst case "done by" date, they get a worst case. – Josef Mar 18 '19 at 10:04
35

With the updated question, it is now clear that you are trying to fix the wrong problem. The senior engineer's behavior is a symptom of a fundamentally broken software development process and/or dysfunctional company.

If you have critical bugs getting into production every month, then you have at least one of the following problems:

  • Incompetent engineers
  • Unmaintainable code base
  • Inadequate testing

Given how much manpower you have at your disposal (20 engineers is a LOT of resources), it's likely a combination of all three.

My guess is that the senior engineer is fed up with the constant firefighting, and rightfully so.

You need to dig deeper and fix the underlying problems that are creating the need for people to continually work late. Convincing one engineer to work late more often is not going to help the big picture.

Now, what to do about it...

Step 1: Figure out why testing is not catching these critical bugs

The first thing you absolutely need to do is stop these critical bugs from ever reaching production. Every bug that reaches production is a failure in the testing process.

Go back over every critical bug that was discovered in production and determine exactly why it was not caught in testing. Add more automated test coverage, manual test coverage, or testing resources as necessary.
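
For example, here is a minimal sketch of an automated smoke test covering the two flows the question defines as critical (logging in and purchasing). The endpoints, credentials, and response fields are hypothetical placeholders, since the question never describes the actual stack; the point is only that a check this small, run against a staging build before every release, would catch the "nobody can log in" class of bug.

    # Hypothetical smoke test (pytest + requests); URLs, credentials and
    # response fields are placeholders, not the OP's real API.
    import requests

    BASE_URL = "https://staging.example.com"  # assumed staging environment

    def test_login_returns_a_session_token():
        resp = requests.post(
            f"{BASE_URL}/api/login",
            json={"username": "smoke-user", "password": "smoke-password"},
            timeout=10,
        )
        assert resp.status_code == 200, f"login failed: {resp.status_code}"
        assert "token" in resp.json(), "no session token in login response"

    def test_purchase_endpoint_is_reachable():
        # The second class of critical bug in the question: users can't purchase.
        resp = requests.get(f"{BASE_URL}/api/products", timeout=10)
        assert resp.status_code == 200

Run under pytest in CI, a failure here blocks the release candidate instead of becoming a production incident.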

Step 2: Determine the root cause of every critical bug

For every critical bug, find out:

  • Who created the bug
  • When the bug was created
  • Why the code was being modified
  • Where the bug was introduced in the code

By doing this analysis, you will discover some patterns. Maybe there are one or two developers who keep introducing these bugs. Perhaps there is one code module that is very difficult to modify without causing problems. Or it's possible that the code as a whole is very difficult to work with.
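
To make the pattern-finding step concrete, this can be as simple as keeping a small log of every critical incident and tallying it. The sketch below is only an illustration; the log entries are invented examples, not data from the question.

    # Tally a hand-maintained incident log by author, module and reason for
    # the change. All entries here are invented examples.
    from collections import Counter

    incidents = [
        {"author": "dev_a", "module": "auth",     "reason": "new feature"},
        {"author": "dev_b", "module": "auth",     "reason": "refactor"},
        {"author": "dev_a", "module": "checkout", "reason": "rushed bug fix"},
    ]

    for field in ("author", "module", "reason"):
        print(field, Counter(entry[field] for entry in incidents).most_common())

Even a few months of such a log usually makes the fragile module, or the rushed-fix pattern, obvious.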

17 of 26
  • 1,712
  • 13
  • 13
  • 22
    And STOP ADDING MORE BROKEN FEATURES until the critical bugs in the existing code are fixed! – shoover Mar 15 '19 at 15:33
  • 5
    Sometimes these issues aren't even "bugs" but rather a fundamentally wrong design that is incompatible with the environment in which the code runs; you can fix an endless stream of "bugs" that surface every time the code encounters previously unseen (but perfectly compliant) behavior of interacting systems or OS layers, or you can take the time to fix the underlying design mistakes. – Chris Stratton Mar 15 '19 at 15:57
  • 2
    @ChrisStratton Exactly, which is why it is absolutely necessary to understand the root cause of these critical bugs. – 17 of 26 Mar 15 '19 at 15:59
  • 2
    I very much like this answer, but I'd expect a SENIOR engineer to go to the manager to discuss the problem when he is "fed up" with something; to stop doing "firefighting" is not, IMHO, the response I want to see from a professional, experienced engineer. – Adriano Repetti Mar 15 '19 at 19:17
  • Every bug that reaches production is a failure in the testing process - Almost correct. "Testing" is part of development. With 20 developers, you should be employing unit / functional testing (preferably both). As such, testing is an integral part of the entire development process, not a stand-alone process. I would reword it as: Every bug that reaches production is a failure in the development process. Also, I think you're missing the "management wants features XYZ by ABC". Always a giant damper on proper work. Then you get the OP demanding you stay late... then you look for a new job. – rkeet Mar 16 '19 at 09:49
  • 6
    @AdrianoRepetti - who says they didn't? Developers usually get into this cynical phase when they learn that management in their company doesn't listen. – Davor Mar 16 '19 at 12:05
  • @Davor because OP didn't mention it (and probably he wouldn't even need to ask if someone did it). We're making way too many assumptions here, I'd love to keep them at the minimum whenever possible. Judging from what OP wrote all I can say is that process should be improved (it always can be) but they have a serious (technical) problem with this senior (whether or not OP is entitled to expect overtime work it really depends on the culture). – Adriano Repetti Mar 16 '19 at 14:22
  • I mean: code you wrote and had plenty of time to test... stopped 90% of users from using the service? S* happens, but as a developer I'd be ashamed to answer "let's see tomorrow", no matter what my reasons are (especially if I didn't speak out beforehand). – Adriano Repetti Mar 16 '19 at 14:27
  • 1
    @AdrianoRepetti maybe he has, and nothing happened. We're only seeing one side here – user87779 Mar 16 '19 at 16:04
  • @user definitely. That's why I was talking about assumptions. Literally everything or nothing might be true, so IMHO we should (mostly) stick to what we know because the OP wrote it. – Adriano Repetti Mar 16 '19 at 16:28
  • @AdrianoRepetti I would agree about that for the comments (and note that your comment on "expectations" IS based on a few assumptions). But the answers should go with different assumptions to broaden the help to people other than the OP as well. Fwiw, I would say the only things we KNOW are that there is a failure in testing and there is a failure in design. The fact is that the most important thing to correct now is testing and BROAD design practices, not one employee's (of 20) performance – user87779 Mar 16 '19 at 17:22
29

I want to make one additional point. Rushing out a bug fix often leads to technical debt. If your senior developer is questioning how it will be tested tonight then that is a good question that a senior developer should be asking! I’ve worked at places where urgency is prioritized over quality and this has had negative long term consequences. Ultimately, your team will have reduced capacity because it is always fighting fires.

  • 4
    Yes, and don't automatically assume that just because the other teams would work back and fix the bug that they're in the right on this. They may even feel the same way, but don't want to pick a fight with management over it. – Matthew Barber Mar 15 '19 at 00:52
  • The question about how we are going to test it was asked about two months ago, and we fixed it by having tests run against all commits and PRs. – Code Project Mar 15 '19 at 03:28
  • 1
    @CodeProject the sequence of events in your edit is actually a pretty good example of what this response is warning about. The tests you ran in late February did not catch this issue, so no, the testing concern was not actually fixed. Likely there are areas of your codebase (or its interaction with the underlying mobile OS or remote services) which are not yet properly understood, and as a result the bug fixes continue to contain unsafe assumptions that break in situations outside those you've thought to test for. Taking the time to really understand it will be needed; tests can't catch everything. – Chris Stratton Mar 15 '19 at 04:04
  • 13
    If you really want "ownership" you are likely going to have to be open to letting your senior people determine more of the agenda in order to include the things they need to do to really get a handle on the underlying issues. In contrast, if you dictate the goals to the degree where they can only address symptoms then in actuality you have taken ownership in a way that precludes them from having an opportunity to do so. – Chris Stratton Mar 15 '19 at 04:36
  • 1
    @ChrisStratton I agree with the part that we need to take time to understand our code base. For the second comment, we actually had 2 sprints where they decided what they wanted to work on, which was refactoring code, fixing bugs, and adding more tests. Any thoughts on how to tackle this issue? – Code Project Mar 15 '19 at 07:05
  • Let's admit that the process can be definitely improved. Let's also be honest and admit that there is the remote possibility that the team is not cough cough competent enough. Cough. – Adriano Repetti Mar 15 '19 at 19:12
  • 3
    Rushing a fix can also have catastrophic short-term consequences. – gnasher729 Mar 15 '19 at 21:14
23

It sounds like you have a huge testing problem. You ask why everyone does not drop all outside commitments to put out a fire, but the real question is: why are fires starting every month?

Do you have any QA/testing? Why did they not find that the first and most basic step (logging in) does not work? How did something that does not work at all get pushed to production?

Also, why is your response to users not being able to log in to get everyone to stay late rushing "critical" fixes, instead of having a system admin revert the update so that it can be attempted again later, after the issues have been fixed?
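
If the backend is something the team deploys itself, the revert can be close to a one-liner. The sketch below assumes a Kubernetes-style deployment purely for illustration (the question never states the stack, and as the OP notes in a later comment, a mobile client already shipped through an app store cannot be rolled back this way).

    # Generic rollback sketch, assuming a Kubernetes-style backend deployment.
    # The deployment name is hypothetical; "kubectl rollout undo" reverts a
    # deployment to its previous revision.
    import subprocess

    def rollback(deployment: str) -> None:
        subprocess.run(
            ["kubectl", "rollout", "undo", f"deployment/{deployment}"],
            check=True,
        )

    if __name__ == "__main__":
        rollback("login-service")  # hypothetical deployment name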

"How are we going to test that tonight?" This is the correct response. When there is a critical issue and you are being pressured to fix it right now how will developers set aside time to properly review the changes are correct/high quality and how is QA meant to check that everything else is still working after the change. It sounds like you are also asking for these changes at the end of the day where everyone is tired and their thinking ability is at its lowest making it even more likely other issues will sneak in to this critical fix.

Qwertie
  • 337
  • 1
  • 7
  • 3
    The question and its updates make it clear that there is testing and that the testing was critical to the release decision - but also that the testing that there is, is not catching the flaws. Some modern environments have enough distinct moving parts that expecting tests to catch everything is naive, since unsafe code can work or break depending on conditions outside of the test environment. What that points to is a system with aspects that no person on the team fully understands. – Chris Stratton Mar 15 '19 at 05:15
  • 5
    @ChrisStratton If your testers are unable to actually test things then I would suggest that that is a problem itself. That also does not explain why there is no process in place to simply roll back changes that went bad. The developers likely have no control over the testing and ops procedures and are sick of constantly having to deal with failures in the process. – Qwertie Mar 15 '19 at 05:18
  • 5
    No, neither testers nor automated test harnesses can cover every eventuality. They have their role, but the set of possible interactions is larger than you can enumerate and easily involves more physical variety than you can either reasonably purchase or simulate to include in your test coverage. Patching what broke the tests last time is not enough - it has to be right by design, and not have fundamentally unsafe parts that just happen to work in all the tests tried. Software deployed to customers is not necessarily as easy to just roll back as something server side. – Chris Stratton Mar 15 '19 at 05:30
  • 1
    +1 this is the answer I was composing in my head – Phil Mar 15 '19 at 12:07
  • 12
    @ChrisStratton Both of those are true, but they should be known and accounted for. If you are unable to get sufficient test coverage, and you are not able to roll back the deployment, then you should arrange in advance for a developer to be on hand to cope with any issues that may arise. I react much better to "Can you work late next Friday to cover any issues arising from the release?" than to "It's all gone to hell again, cancel your plans". – Phil Mar 15 '19 at 12:11
  • 2
    @ChrisStratton reading OP's further comments it sounds like their idea of testing is limited to having the developers write unit tests. – Aaron F Mar 15 '19 at 17:59
  • @Aaron untrue, they described testing on multiple phone models. But neither will protect against fundamental misunderstanding of the mobile OS. – Chris Stratton Mar 15 '19 at 18:07
  • 1
    @ChrisStratton this is the comment I was referring to. It sounds to me like there are no testers and not very much testing going on. They need a QA team, not having the same developers who wrote the code test it as well. – Aaron F Mar 15 '19 at 18:36
  • @user87779 - not as much as you would think - when a mobile app sits between an often widely misunderstood mobile OS and remote services, the scope of interaction easily exceeds achievable test coverage. You might note that the asker has described a situation which passed tests but was still buggy in a way that only showed up after release. That's quite common - fundamentally wrong code seems to work until it is exposed to just the right set of circumstances to expose the flaw. Or some particular mobile device actually mis-implements an API - that is less common, but it does happen. – Chris Stratton Mar 16 '19 at 17:15
  • @user87779 - I see plenty of both. But then, I'm often engaged specifically to solve these kinds of problems - ie, others' code with unsafe assumptions that were not triggered in the original tests, or situations where the platform actually deviates from its specification. – Chris Stratton Mar 16 '19 at 17:22
10

When are people most productive? When is the team most able to handle critical bugs? There have been studies answering these questions about when humans are best able to handle certain tasks.

You have a critical bug, and you want the senior a) to switch mental gears, b) to pick up a new "critical" task, and c) to work "till whenever" to fix it. And you expect this critical patch to work? Honestly, what do you expect for the product, the team, and the team members if your wants were satisfied?

Let go of your ego, and your irrational beliefs.

paulj
  • 1,298
  • 8
  • 13
  • 3
    So it's late in the day and you find a bug where 90% of users can't log into PROD... you are saying that if studies show that your best work is done at 10 in the morning, you should wait until then to work on this? That doesn't sound silly to you? – JeffC Mar 15 '19 at 16:50
  • 4
    @JeffC If 90% of users cannot log in, why was this not picked up during testing before rollout? That is not a bug, that is a system error. Where is the dedicated support team out of the four teams mentioned above, even if rotating? – paulj Mar 15 '19 at 17:24
  • I don't disagree with your comment... but I do disagree with your answer. Either way, the point is when the PROD issue is found is the time to fix it, not to wait until the optimal, most productive time for the employees. – JeffC Mar 16 '19 at 00:39
  • 2
    @JeffC I would say it's time to roll back from the experimental branch they should be deploying this code on. Then methodically fix the issue without rushing and possibly introducing other bugs – user87779 Mar 16 '19 at 17:24
10

The term you are looking for is Discretionary Effort not Ownership.

I am assuming that your employees are meeting their contractual obligations (otherwise your course of action is clear).

You have no right to expect discretionary effort. That is what it is, by definition. Fundamentally, this is not something that you can speak to them about and expect a change; you are likely to get the opposite response. They are under no obligation to give it. Threats about firing them are likely to get an overwhelmingly poor response, as well as being illegal.

I don't have any good suggestions on how you can improve things. The very fact that you can rely on Discretionary Effort by some of your people suggests to me the culture is not necessarily broken.

Fixing this will take time, so instead, I can offer stop-gap measures:

Fix the bus-factor of 1

Why can only a single employee resolve this issue?

Have an on-call roster

With reimbursement agreed upon with the individual employees, not what you think it is worth.

Roll out updates at better times

It may not be possible, but rolling things out at better times can increase the chance for someone to assist.

The worth of your software is a function of how well it is supported, so you shouldn't use Discretionary Effort as a crutch. If you want your software to be supported to a level, you need to ensure you have things in place to ensure it.

Gregory Currie
  • 59,575
  • 27
  • 157
  • 224
  • You missed the part where the OP said the senior told the team to go home. Bus-factor > 1. Also, I'm curious about this whole "discretionary effort" thing. Whatever term you choose to use, in the US, engineers are professionals, meaning they essentially get paid to do a job, not to work for X hours (they decide, or in most cases give up the right to decide, how long things should take). The engineers delivered a product that they greenlit, which turned out to be defective. If that is the case, I think an employer legally has the right to expect OT, even unpaid OT... – Mars Mar 15 '19 at 09:23
  • But if you have info that says otherwise, I'm all ears – Mars Mar 15 '19 at 09:24
  • 3
    I can't speak for the US, but in Australia "workers" fall into two categories, employees and contractors. Engineers may be employed as either. Most of the time, workers are employees. To the best of my knowledge, every single engineer (200+) I've worked with has been an employee. – Gregory Currie Mar 15 '19 at 09:33
  • 3
    Regarding unpaid overtime, in Australia, forcing an employee to work unpaid overtime, even if they made an error, would be considered illegal. – Gregory Currie Mar 15 '19 at 09:34
  • Interesting. In the US, software engineers have no right to overtime pay (although I think it's at least common practice for an overtime pay clause to be in their contract) – Mars Mar 15 '19 at 09:35
  • 1
    I think the two terms that separate them in the US are "contractual engineers" and "salaried engineers". You may very well be correct that most/all engineers are contractual engineers. (Also something that may be interesting is the legal definition of the term "engineer" - in Europe it has a specific, well-defined meaning - this may not be true for the USA, where perhaps "anyone" can be an engineer?) – Gregory Currie Mar 15 '19 at 09:38
  • Interesting. I've worked in the US and Japan, and I see a person who said they delivered X but actually delivered Y. I expect him to deliver X ASAP or accept that I will no longer employ them for not giving me what was promised. But, different cultures and laws apparently! – Mars Mar 15 '19 at 09:38
  • A "professional" (a subcategory of salaried employee) is exempt from overtime restrictions as they are considered knowledgeable enough to determine their own workload/pay balance – Mars Mar 15 '19 at 09:40
  • As far as I recall, engineer does not have a legal meaning in the US – Mars Mar 15 '19 at 09:42
  • 1
    The legal meaning of "engineer" varies by state in the US. Regarding software engineers, it's very common for us to be either contractors or employees in the US--and I've recently worked with an Australia-based contractor. (The contract involved still wouldn't permit "forcing" to work overtime.) – chrylis -cautiouslyoptimistic- Mar 15 '19 at 10:36
  • "Professional" has nothing to do with it. If you are salaried and make over a certain amount (~40k, I think, though those numbers are in flux based on the discretion of executive orders) then you are not legally entitled to paid overtime (though your company may still choose to have an overtime policy that includes extra pay, it isn't required by law). – David Rice Mar 15 '19 at 14:08
8

So, you expect your employees to give up their social and/or family lives at the drop of a hat in order to fix problems?

Are they really all that critical?

Managers always seem to think that everything is critical because saying no is hard. This is a strong potential reason why your lead dev is pushing back. They are trying to protect their boundaries because you won't. And they are trying to protect their team's boundaries because you won't.

If they truly are all that critical, then what is going wrong that allows these issues to happen?

If your product quality is that bad, then you need to move over and let your developers devise a plan to get the product back on track. Poor quality isn't just about bugs. Poor quality derails predictability. If you are consistently going off plan because your quality is this bad, then fix your quality. And you don't fix it by asking developers to do it in their personal time. If that is the expectation you set, then you are telling your developers the business does not care about quality and therefore does not value predictability. If you do not value predictability, then stop complaining.

If they truly are all critical, then why don't you plan an on-call rotation?

Not only does this protect employees' personal time and protect the business's needs, it also creates incentive for developers to fix the systemic problems that are causing them to fire fight so much. (maybe you need more or better tests, maybe you have broken legacy code, etc.)

Why don't you stay late and fix things?

You're complaining that somebody doesn't step up to work through the night to fix a problem. Why don't you work through the night to fix it? I think you'll find the same conclusions as your team lead.

Your behavior

You have threatened to fire your employees for not doing something which you yourself refuse to do. You are complaining this happens a lot, yet you have not planned for it with an on-call rotation or by repaying technical debt.

Reading your list of steps to plan a release, what stands out to me is the frequent use of "I told them to..." and the granularity of planning all the way down to function names. You plan out minor details that are easily changeable, but won't plan a support process for your product.

This is 100% your problem.

Your team

It sounds to me like you have a bunch of smart, honest professionals who know how to make good software, but their manager likes to dictate to them how to do their job and, when the manager's approach causes a problem, to force them to work more hours.

Have you stepped back and asked your team how to get fewer critical bugs? Have you asked your team how they think they should handle responsibility for unexpected critical issues?

Your team lead is right to push back on your expectations. And I'm glad to hear that he is encouraging his team to say no to things. He is trying to protect the team because you aren't.

In my time as a team lead, I can tell you that one of the hardest but most important lessons is learning how to say no. Maybe you can learn something from this employee of yours.

Brandon
  • 869
  • 6
  • 10
  • They are critical. These two incidents we have had were a) users can't log in and b) users can't purchase products. As I wrote in my question, only these two issues are classified as critical. – Code Project Mar 15 '19 at 16:08
  • What was going wrong was that we did not test on all devices as we agreed. This is a mobile app and the rule is to test on 6 devices * 4 OS versions (24 combinations in total). – Code Project Mar 15 '19 at 16:09
  • This is a mobile app, and an on-call rotation will not help as it's already released to the App Store. – Code Project Mar 15 '19 at 16:10