jarredpickles87,
@jarredpickles87@lemmy.world avatar

I know most of these stories are going to be IT or food service, so I’ll chime in with mine to change it up.

TLDR: We caused some explosions on a transformer because someone didn’t read test results.

I work for a power utility. One night, we were energizing a new transformer. It fed a single strip mall complex anchored by a major grocery chain, which is why the work was done at night: we couldn't interrupt their service while they were open.

Anyways, we go to energize, close the primary switches and one of the lightning arrestors blows up. And I mean blows up, like an M80 just went off. Lit up the sky bright as day for a couple moments at 1 in the morning. The protection opened the switches and everybody is panicking, making sure nobody was hurt.

Well, after everybody settled down and the arrestor was replaced, they decide to throw it in again. Switches come closed, and explosion #2 happens. A second arrestor blows spectacularly. I tried to convince the one supervisor on site to go for a third time, because why not, but he didn't want to do it again. Whatever.

A few days go by and we find out what the issue was. This transformer was supposed to be a 115kV to 13.2kV. Come to find out there was an internal tap selection that was set for 67kV on the primary, not 115kV. So the voltage was only being stepped down about half as much as needed, and there was something like 28kV on the secondary instead of 13.2kV. That was over the lightning arrestors' ratings, hence why they were blowing up. So the transformer had to have its oil drained, and guys had to go inside it and physically rewire it to the correct ratio.
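
For anyone curious about the ratio math, here's a rough back-of-the-envelope sketch assuming an ideal transformer (the exact figure depends on the actual tap voltages and winding details, but either way the secondary ends up far above what 13.2kV-class arrestors are built for):

# Rough ideal-transformer sketch of the mis-set tap (illustration only)
applied_primary_kv = 115.0   # what was actually put on the primary
tap_primary_kv = 67.0        # what the internal tap was mistakenly set for
rated_secondary_kv = 13.2    # the intended secondary voltage

turns_ratio = tap_primary_kv / rated_secondary_kv        # ~5.1:1 instead of ~8.7:1
actual_secondary_kv = applied_primary_kv / turns_ratio   # roughly 22-23 kV instead of 13.2 kV

print(f"secondary sits around {actual_secondary_kv:.1f} kV instead of {rated_secondary_kv} kV")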

We had a third party company do the acceptance testing on this transformer, and our engineering department just saw all the green checkmarks but didn’t pay attention to the values for the test results. Nobody expected to run into this because we don’t have any of this type of transformer in our system, but that’s certainly no excuse.

Moral of the story: read your acceptance test results carefully.

JCPhoenix,
@JCPhoenix@beehaw.org avatar

Several years ago, when I was just the unofficial office geek, our email was acting up, though we still had Internet access as normal. At the time, email (Exchange) was hosted on-prem on our server. Anything server related, I'd contact our MSP to handle it, which usually meant they'd simply reboot the server. Easy enough, but I was kinda afraid and hesitant to touch the server unless the MSP explicitly asked/told me to do something.

I reported it to our MSP, expecting a quick response, but nothing. Not even an acknowledgment of the issue. This had already been going on for like an hour, so I decided to take matters into my own hands. I went to the server, turned on the monitor…and it was black. Well, shit. Couldn't even do a proper shutdown. So I emailed again, waited a bit, and again no response.

Well, if the server was being unresponsive, I figured a hard shutdown and reboot would be fine. I knew that’s what the MSP would (ask me to) do. What difference was them telling me to do it versus just me doing it on my own? I was going to fix email! I was going to be the hero! So I did it.

Server booted up, but after getting past the BIOS and other checks…it went back to black screen again. No Windows login. That’s not so terrible, since that was the status quo. Except now, people were also saying Internet all of a sudden stopped working. Oh shit.

Little did I know that the server was acting as our DNS server. So I essentially took down everything: email, Internet, even some server access (network drives, DBs). I was in a cold sweat now since we were pretty much dead in the water. I of course reached out AGAIN to the MSP, but AGAIN nothing. Wtf…

So I told my co-workers and bosses, expecting to get in some trouble for making things worse. Surprisingly, no one cared. A couple people decided to go home and work. Some people took super long lunches or chitchatted. Our receptionist was playing games on her computer. Our CEO had his feet up on his desk and was scrolling Facebook on his phone. Another C-suite decided to call it an early day.

Eventually, at basically the end of the day, the MSP reached out. They sent some remote commands to the server and it all started working again. Apparently, they were dealing with an actual catastrophe elsewhere: one of their clients’ offices had burned down so they were focused on BCDR over there all day.

So yeah, I took down our server for half a day. And no one cared, except me.

mkhopper, (edited )
@mkhopper@lemmy.world avatar

Strap in friends, because this one is a wild ride.

I had stepped into the role of team lead of our IS dept with zero training on our HP mainframe system (early 90s).
The previous team lead wasn’t very well liked and was basically punted out unceremoniously.
While I was still getting up to speed, we had an upgrade on the schedule to have three new hard drives added to the system.

These were SCSI drives back then and required a bunch of pre-wiring and configuration before they could be used. Our customer engineer (CE) came out the day before installation to do all that work in preparation for coming back the next morning to get the drives online and integrated into the system.

Back at that time, drives came installed on little metal sleds that fit into the bays.
The CE came back the next day, shut down the system, did the final installations and powered back up. … Nothing.
Two of the drives would mount but one wouldn't. Did some checking on wiring and tried again. Still nothing. Pull the drive sleds out and just reseat them in different positions on the bus. Now the one drive that originally didn't mount did, and the other two didn't. What the hell… Check the configs again, reboot again, and… success. Everything finally came up as planned.

We had configured the new drives to be a part of the main system volume, so data began migrating to the new devices right away. Because there was so much trouble getting things working, the CE hung around just to make sure everything stayed up and running.

About an hour later, the system came crashing down hard. The CE says, “Do you smell something burning?” Never a good phrase.
We pull the new drives out and then completely apart. One drive, the first one that wouldn’t mount, had been installed on the sled a bit too low. Low enough for metal to metal contact, which shorted out the SCSI bus, bringing the system to its knees.

Fixed that little problem, plug everything back in and … nothing. The drives all mounted fine, but access to the data was completely fucked.
Whatever… Just scratch the drives and reload from backup, you say.

That would work…if there were backups. Come to find out that the previous lead hadn't been making backups in about six months and no one knew. I was still so green at the time that I wasn't even aware how backups on this machine worked, let alone how to make any.

So we have no working system, no good data and no backups. Time to hop a train to Mexico.

We take the three new drives out of the system and reboot, crossing all fingers that we might get lucky. The OS actually booted, but that was it. The data was hopelessly gone.

The CE then started working the phone, calling every next-level support contact he had. After a few hours of pulling drives, changing settings, whimpering, plugging in drives, asking various deities for favors, we couldn’t do any more.

The final possibility was to plug everything back in and let the support team dial in via the emergency 2400 baud support modem.
For the next 18 hours or so, HP support engineers used debug tools to access the data on the new drives and basically recreate it on the original drives.
Once they finished, they asked to make a set of backup tapes. This backup took about 12 hours to run. (Three times longer than normal as I found out later.)
Then we had to scratch the drives and do a reload. This was almost the scariest part because up until that time, there was still blind hope. Wiping the drives meant that we were about to lose everything.
We scratched the drives, reloaded from the backup and then rebooted.

Success! Absolute fucking success. The engineers had restored the data perfectly. We could even find the record that happened to be in mid-write when the system went down. Tears were shed and backs were slapped. We then declared the entire HP support team to be literal gods.

40+ hours were spent in total fixing this problem and much beer was consumed afterwards.

I spent another five years in that position and we never had another serious incident. And you can be damn sure we had a rock solid backup rotation.

(Well, there actually was another problem involving a nightly backup and an inconveniently placed, and accidentally pressed, E-stop button, but that story isn’t nearly as exciting.)

DudeDudenson,

Imagine the difference trying to get that kind of support these days. Especially from HP

mkhopper,
@mkhopper@lemmy.world avatar

No kidding. Where I’m working now, it takes an HP CE over a week just to bring out a new hot swappable drive after we jump through a number of request hoops.

ThatWeirdGuy1001,
@ThatWeirdGuy1001@lemmy.world avatar

Had to unload a pistol I’d found in a box of potatoes at Taco Bell.

Witnessed a five man brawl at steak n shake

Had a girl puke up her Oreo mint shake all over the bathroom also at steak n shake

Food service is fuckin wild

captainjaneway,
@captainjaneway@lemmy.world avatar

TIL never go to Steak n Shake.

ThatWeirdGuy1001, (edited )
@ThatWeirdGuy1001@lemmy.world avatar

Tbf it was third shift after the bars closed, so it was a wild experience.

As far as I’m aware they don’t even do thirds anymore; they stopped during the pandemic. At least the ones near me.

Followupquestion,

What did the box of potatoes do to you to deserve that?

beardedmoose,

This is actually my own Oh Shit story.

Early days of being a sysadmin, making changes on a major Linux server that we had in production. I was running routine commands, changing permissions on a few directories, and I made a typo: “sudo chmod 777 /etc/”. Instead of typing the rest of the directory tree, I accidentally hit return.

It only ran for a fraction of a second before I hit CTRL + C to stop it, but by then the damage had been done. I spent hours mirroring and fixing permissions by hand using a duplicate physical server. As a precaution we moved all production services off this machine, and it was a good thing too, as when we rebooted the server a few weeks later, it never booted again.

For those that don’t know, chmod is used to set access permissions on files and folders; the 777 stands for “Read + Write + Execute” for the owner, the group, and everyone else. The /etc directory contains many of the basic system configuration files for the entire operating system, and a lot of them are required to have very strict permissions for security reasons. If those permissions aren’t set properly, parts of the system will refuse to load the files, or the machine won’t boot at all.
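
If you have a known-good reference machine (like the duplicate server mentioned above), this kind of cleanup can also be scripted rather than done by hand. A minimal sketch in Python, purely illustrative (run as root; paths and filenames are made up):

import json
import os
import stat
import sys

def snapshot_modes(root, out_file):
    """Record the permission bits of every directory and file under root."""
    modes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for path in [dirpath] + [os.path.join(dirpath, f) for f in filenames]:
            if os.path.islink(path):
                continue  # symlink modes aren't meaningful here
            try:
                modes[path] = stat.S_IMODE(os.stat(path).st_mode)
            except OSError:
                pass  # unreadable or vanished; skip it
    with open(out_file, "w") as fh:
        json.dump(modes, fh)

def restore_modes(mode_file):
    """Re-apply the recorded permission bits on the damaged machine."""
    with open(mode_file) as fh:
        modes = json.load(fh)
    for path, mode in modes.items():
        try:
            os.chmod(path, mode)
        except OSError as err:
            print(f"could not restore {path}: {err}", file=sys.stderr)

# Hypothetical usage: snapshot_modes("/etc", "etc-modes.json") on the healthy
# box, copy the JSON over, then restore_modes("etc-modes.json") on the broken one.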

Dio9sys,

That is literally a nightmare scenario for me, holy shit

Djtecha,

If you didn’t use the recursion flag this wouldn’t be too bad

Hadriscus,

would that mean it doesn’t affect anything other than top-level files?

Djtecha,

Yeah, without the -R flag it only does the file itself (and since folders are files in Unix…)

beardedmoose,

In hindsight I should have just changed into the directory first and then used chmod without needing the full path. Or run a flag that asks you to confirm each change or do a dry run. I’m a much smarter idiot nowadays.

EnderMB,

(Not my story, but a coworker)

This person started working for a large retailer. On their first week, they contributed a fix to a filtering system that inadvertently disabled the “adult” filter.

The team got paged, an error review took place, and we went through the impact of the error - a ~10% increase in sales worldwide, and the best sales outside of the holiday period.

On one hand, extremely damaging for the company. On the other, on his first week he delivered impact that exceeded the whole team’s goal for the year.

That person is now in senior management at a huge tech company.

cashews_best_nut,

Twitch?

Tar_alcaran,

Happy ending story, but it’s still gross.

I do workplace safety and hazardous material handling (instructions, plans, regulation, etc.) for all sorts of facilities, from dirty ground to lab waste.

Hospitals have a number of types of dangerous waste, among them stuff that gets disinfected in bags in an autoclave (oven) and stuff that shouldn’t be in a bag, like needles, scalpel blades, etc.

I was giving some on-site instructions, which included how to dispose of things. So I tell the people to never assume someone does everything right, because we’ve all thrown trash in the wrong bag at some point, and you don’t want to find out someone left a scalpel in the autoclave bag by jamming it into the hole and pulling a needle from your hand.

My eye drifts slightly left, to one of my students currently assisting another worker in doing literally that: stuffing a second bag into the autoclave and then shouting “OW, fuck”, before dripping blood on the ground.

Now, nobody knows what’s in the bag. Some moron threw sharps in with the bio waste, who knows where it’s from. For all I know, they just caught zombie-ebola, and it’s my fault for talking slightly too slow.

Thankfully, after some antibiotics and fervent prayer, everything turned out to be OK.

NickwithaC,
@NickwithaC@lemmy.world avatar

How is that a happy ending story?

Congratulations, you didn’t get coronAIDSyphilis.

cheesymoonshadow,
@cheesymoonshadow@lemmings.world avatar

I know, I was waiting for the splooge part too.

peter,
@peter@feddit.uk avatar

Well, that is a happy ending, isn’t it?

dan,
@dan@upvote.au avatar

I broke the home page of a big tech (FAANG) company.

I added a call to an API created by another team. I did an initial test with 2% of production traffic + 50% of employee traffic, and it worked fine. After a day or two, I rolled out to 100% of users, and it broke the home page. It was broken for around 3 minutes until the deployment oncall found the killswitch I put in the code and turned it off. They noticed the issue quicker than I did.

What I didn’t realise was that only some of the methods of this class had Memcache caching. The method I was calling did not. It turns out it was running a database query on a DB with a single shard and only 4 replicas, which wasn’t designed for production traffic. As soon as my code rolled out to 100% of users, the DBs immediately fell over from tens of thousands of simultaneous connections.

Always use feature flags for risky work! It would have been broken for a lot longer if I didn’t add one and they had to re-deploy the site. The site was continuously pushed all day, but building and deploying could take 45+ mins.

cashews_best_nut,

What language? PHP, python?

Vendetta9076,
@Vendetta9076@sh.itjust.works avatar

I work on a SOC team and we’re really trying to hammer the usage of feature flags into our devs.

WhyAUsername_1,

What are feature flags?

dan,
@dan@upvote.au avatar

Feature flags are just checks that let you enable or disable code paths at runtime. For example, say you’re rewriting the profile page for your app. Instead of just replacing the old code with the new code, you’d do something like:


if (featureIsEnabled('profile_v2')) {
  // new code
} else {
  // old code
}

Then you’d have some UI to enable or disable the flag. If anything goes wrong with the new page after launch, flip the flag and it’ll switch back to the old version without having to modify the code or redeploy the site.

Fancier gating systems let you do things like roll out to a subset of users (e.g. a percentage of all users, 50% of a particular country, or 20% of people who use the site in English) and also let you create a control group in order to compare metrics between users in the test group and users in the control group.
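
A minimal sketch of how that kind of percentage bucketing is often done (hypothetical helper, not any particular company's system): hash a stable user ID together with the flag name so each user always lands in the same bucket, then compare the bucket against the rollout percentage.

import hashlib

def is_enabled(flag_name, user_id, rollout_percent):
    """Deterministically bucket a user into [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000 / 100.0  # 0.00 .. 99.99
    return bucket < rollout_percent

# e.g. roll 'profile_v2' out to 20% of users
print(is_enabled("profile_v2", "user-42", 20.0))

Hashing on the flag name too means each flag slices the user base independently, and reserving a neighbouring bucket range (say 20-40) gives you a control group to compare against.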

Larger companies all have custom in-house systems for this, but I’m sure there’s some libraries that make it easy too.

At my workplace, we don’t have any Git feature branches. Instead, all changes are merged directly to trunk/master, and new features are all gated using feature flags.

WhyAUsername_1,

Wow that’s so effing smart!

Vendetta9076,
@Vendetta9076@sh.itjust.works avatar

Everything Dan said and more. They’re sometimes also called canaries, although that’s not quite the same thing. There’s been a ton of times where services have been down for hours instead of minutes because a dev never built in a feature flag.

Hadriscus,

Canaries, as in mine work?

Vendetta9076,
@Vendetta9076@sh.itjust.works avatar

That’s where the term derives from, yes.

jjjalljs,

Always use feature flags for risky work! It would have been broken for a lot longer if I didn’t add one and they had to re-deploy the site. The site was continuously pushed all day, but building and deploying could take 45+ mins

This reminds me of the old saying: everyone has a test environment. Some people are lucky enough to have a separate production environment, too.

shani66,

Now that I think about it, my first job was fucking wild.

My buddy was in a forklift taking some stock down and I was spotting, basically just hanging out and making sure no one got in the way. A few minutes after the normal time it’d take, he thinks something is wrong and calls me to take a look (from afar) to see how fucked we are. The answer was very: the pallet was barely holding together at all, but I couldn’t see a damn thing from my position. Before I could get back to spotting we heard a loud crack and the world went still (for much longer for him, I imagine), and not a second later we had hundreds of pounds of foul-smelling mulch everywhere.

I had a lot more there too: babysitting an old man who looked on the verge of death with no management anywhere to be found, moving hundreds of pounds at a time by hand, dealing with the best conspiracy theorist ever.

I’ve been bored everywhere else I’ve ever worked.

Dio9sys,

There’s something about physical labor jobs that result in everybody having one story about babysitting somebody who is actively dying

Hadriscus,

what do you mean, the best conspiracy theorist ??

shani66,

Turns out the world is flat. And under a dome. And Jesus is on top of the dome. And aliens have visited us in the dome. And so much more.

Hadriscus,

wow ! I knew the world was flat and under a dome, but I had no idea Jesus was literally on top. Thanks for completing my world view 😂 I’ve had crazy exchanges with some of these people but I have yet to meet any IRL. I guess they don’t hang out where I live

carnimoss,
@carnimoss@lemmings.world avatar

I worked at a pizza chain and there was a hurricane. Thankfully we only got heavy rain, but I was a delivery driver and almost every other place was closed. I opened with someone and we stayed for like 10 hours straight. By the time we had to leave, we were dead tired, and there would be only 2 people left working. They had dozens of orders left and they had to do delivery, cook, AND ANSWER CUSTOMERS. No, the job didn’t pay enough.

Dio9sys,

I hope the people you delivered to gave you massive tips

chahk,

My first week on a new job, I ran a DELETE query and accidentally didn’t select the WHERE clause. In Prod. I thought I was going to get fired on the spot, but my boss was a complete bro about it and helped with the data restore personally.

Everyone at that company was great both professionally and personally. It’s the highlight of my 30+ year career.

TheKrevFox,

Everyone’s taken down prod at one point or another. If you haven’t, then you haven’t been working long enough.

dan,
@dan@upvote.au avatar

That’s the employer’s fault for making it so easy to connect to prod with read-write permissions. Not your fault.

peter,
@peter@feddit.uk avatar

At my last job I was given write permissions to production and I asked for read-only credentials instead; I know my own stupidity.

dan,
@dan@upvote.au avatar

At my workplace, the command-line database tool (which is essentially just a wrapper around the standard MySQL CLI) connects with a read-only role by default, and you need to explicitly pass a flag to it to connect with a read-write role. The two roles use separate ACLs so we can grant someone just read-only access if they don’t need write access.
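
The same default can be sketched in a few lines; this is purely hypothetical (made-up role names, password handling left to the usual config files) and just shows the "read-only unless you explicitly ask" idea:

import argparse
import subprocess
import sys

READ_ONLY_USER = "app_ro"    # invented role names for illustration
READ_WRITE_USER = "app_rw"

def main():
    parser = argparse.ArgumentParser(description="Connect to the prod DB.")
    parser.add_argument("--write", action="store_true",
                        help="connect with the read-write role (off by default)")
    parser.add_argument("database")
    args = parser.parse_args()

    user = READ_WRITE_USER if args.write else READ_ONLY_USER
    # Hand off to the stock mysql CLI with the chosen role.
    sys.exit(subprocess.call(["mysql", "-u", user, args.database]))

if __name__ == "__main__":
    main()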

jjjalljs,

+1

We have read only access.

Also transactions are good ideas.
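
To make the transactions point concrete, a tiny sketch (SQLite only for illustration; the table name and expected count are made up): run the destructive statement inside a transaction, sanity-check the affected row count, and roll back if it isn't what you expected.

import sqlite3

def careful_delete(db_path, expected_rows):
    """Delete inside a transaction; roll back if the row count looks wrong."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.cursor()
        cur.execute("DELETE FROM orders WHERE status = ?", ("cancelled",))
        if cur.rowcount != expected_rows:
            conn.rollback()
            raise RuntimeError(
                f"expected {expected_rows} rows, would have deleted {cur.rowcount}; rolled back")
        conn.commit()
    finally:
        conn.close()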

Also my database tool (the one built into PyCharm) warns you and requires you to hit submit a second time if you try a DELETE or UPDATE without a WHERE. Discovered this locally, where I really did want to update every record, but it’s a good setting.

chahk,

Look at mister fancy pants over here with a database tool. Back in my time we had to use Query Analyzer uphill both ways.

chahk,

Oh there was plenty of blame to go around. I wasn’t exactly fresh out of school either. I had “extensive experience with SQL Server” on my resume by then.

7of9,
@7of9@startrek.website avatar

My first salaried job was also my first proper IT job, and I was a “junior technician” … the only other member of IT staff was my supervisor, who had been a secretary, got a 1-week sysadmin course, and knew very little.

The server room was a complete rat’s nest and I resolved to sort it out. It was all going very well until I tripped over the loose SCSI-3 cable between the AIX server and its RAID array. While it was in use.

It took me 2 days to restore everything from tape. My supervisor was completely useless.

A few months later I was “made redundant”, leaving behind me everything working perfectly and a super tidy server room. I got calls from the company asking for help for the following 6 months, which I politely declined.

Hazzia,

Man fuck those guys. Not a sysadmin myself, but from what I hear the position is criminally underappreciated. Why is it so hard for people to understand that if things aren’t breaking, it means people are doing their job correctly?

7of9,
@7of9@startrek.website avatar

Yeah, I got laid off twice more before switching careers. Both times they wanted me to come back and fix stuff after letting me go.

It goes hand in hand with the “if someone works hard, they should be given more work as a reward” line of thinking.

otp,

That’s when you offer consulting and tell them your hourly rate!

7of9,
@7of9@startrek.website avatar

I didn’t have them over a barrel, they were just being lazy and trying to exploit me further for free.

Dio9sys,

It’s always fun when a job calls you up after you’ve been fired to ask how to do the things they didn’t know you were doing

7of9,
@7of9@startrek.website avatar

Yep, I remember in one job I was at for 8 years, a manager 2 levels up complimented me for sorting out the networking for a re-arrange of our own office … I was gobsmacked because I’d been managing a whole network and server upgrade for a client that involved well over 1000 users at the time, yet an hour of fiddling with wires under desks was the only thing that got his attention.

EmilyIsTrans,
@EmilyIsTrans@lemmy.blahaj.zone avatar

One job I was fired from and rehired within the day, after they quickly realised that I was their only Android developer and they couldn’t build an app with just hopes and wishes. They fired me again later, which they quickly regretted since I was the only one with the signing key (meaning they couldn’t update the app).

7of9,
@7of9@startrek.website avatar

deleted_by_author

dan,
@dan@upvote.au avatar

Looks like your comment posted twice.

7of9,
@7of9@startrek.website avatar

Thanks, I’ll give this one the chop!

bookcrawler,

Not my oh shit moment, but certainly someone’s. Working in a call centre, they sent out an example of a fraud email that was going around with our logo. It asked for all your personal information and credit card information.

Several individuals replied with all their details filled in. 3 of them replied-all (to the entire call centre) with their details filled in.

Hazzia,

Ya know, I do try to give people the benefit of the doubt when it comes to scams, but damn do stories like yours make it hard.

https://discuss.tchncs.de/pictrs/image/f67f3416-05c5-446f-9d86-dcd9fd9603b9.jpeg

Dio9sys,

oh my god. This…unfortunately tracks for call centers. The world capital of “we expect you to understand this thing with zero training”.

val,

My better ones are too legally dubious to post, but I do have one about fairly mundane office drama.

A coworker once dropped some particularly angry comments about a manager in the work chat instead of our private one. I panic-post some inane shit to try and hide it before hurriedly tabbing over to the private chat to tell her to delete it. Too late. Along with a very clearly ‘upset but trying to be professional’ reply, there are some ominous words spoken about how this proves the existence of our private chat and how action will be taken if this is the kind of thing being said in it. But it’s clock-out time for our manager, and on a Friday, so it gets shelved until Monday with no action taken.

Our private chat wasn’t exactly secure, so there was a fair chance our bosses could get access to it. I spend the rest of my work hours that day scrubbing it of the most damaging things I had said, while trying to leave enough unflattering stuff that it looked somewhat natural. It wasn’t particularly spicy all told; it was mostly just “how to do x?” without sounding incompetent in front of the people who dictate whether you get paid or not, but better safe than sorry. We’re still sure that our coworker who dropped the bomb is going to get shitcanned, though.

Monday comes around and we’re all waiting for the hammer to come down. Each moment that goes by, we expect the retribution to be worse. Around midday I realize we’ve got a different manager than usual overseeing us, but the usual one is still clocked in. I spot a bunch of higher-ups with away messages saying they’re in a meeting and have been for hours. Then in our work chat comes an “x is typing” from one of them, who very rarely says anything there. I message one of my coworkers placing my bet that this is it and to brace for punishment.

The typing indicator from this person goes on for a good 20 minutes. It’s going to be a big one.

The message finally comes. Our coworker was fired.

…and so was everyone else except myself and one other person. They were getting laid off. The meeting I noticed wasn’t about our punishment; it was an emergency meeting because an important contract hadn’t gone through. Company got gutted.

PizzasDontWearCapes,

Well, that was a twist ending.

I route all my sensitive chatting to personal texts to avoid the scenario you and your coworkers encountered

shani66,

I feel like, if one contract falls through and your company is in shambles, it wasn’t a very good company.

val,

To be fair, I should have said our department got gutted instead of the company. We were pretty siloed off from everyone else so it was kind of hard to keep the bigger picture in mind.

Dio9sys,

HOLY SHIT
