Disk failure the culprit in July 14 trading outage: SGX

Singapore Business Review – 4 hours ago

Ricky L • a second ago

Actually, the server load balancer checks transaction completion across the web, application and database dependencies, along with the health of the servers (including hard disks), and sends the health statistics to the global load balancer, which consolidates the overall health of the servers, hard disks and network.

When it detects that a transaction cannot be completed, a return code is flagged and the server load balancer will not allow another transaction to hit the same server (including hard disk) across web, apps and database. Through dependency checking it isolates this server dependency group and instead routes to the next healthy web, app and database server group.

Thus a good network and system architecture design would have prevented such glitches.
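
To make the routing idea above concrete, here is a minimal sketch in Python. It is not how any real load balancer product or SGX's systems work; the names `ServerGroup`, `LoadBalancer` and `route` are made up for illustration, a return code of 0 is assumed to mean success, and it assumes a failed transaction is safe to retry on another group.

```python
from typing import Callable, List


class ServerGroup:
    """One web + app + database dependency group (hypothetical model)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True


class LoadBalancer:
    """Toy load balancer that isolates a group after a failed transaction."""

    def __init__(self, groups: List[ServerGroup]) -> None:
        self.groups = groups

    def route(self, transaction: Callable[[ServerGroup], int]) -> str:
        """Run the transaction on the first healthy group and return its name.

        A non-zero return code marks the whole group unhealthy, so later
        transactions skip it, and the transaction is retried on the next group.
        """
        for group in self.groups:
            if not group.healthy:
                continue
            if transaction(group) == 0:  # assumption: 0 means success
                return group.name
            group.healthy = False  # isolate the whole dependency group
        raise RuntimeError("no healthy server group available")
```

Once a group is marked unhealthy it stays isolated until something resets its `healthy` flag, which is where a monitoring or repair process would come in.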

Ricky L • a second ago

In other words, checking for healthy transaction completion, server health and disk health should not be the function of the apps.

It should be the function of the server load balancer, which checks for healthy transaction completion.

Ricky L • a second ago

In fact, under these circumstances, a Business Continuity Planning scenario would not need to kick in.

A redundant server group or server virtualisation group can take over the failed function very easily, with minimal or even unnoticeable disruption.

Ricky L • a second ago

In fact, a simulated transaction can be induced periodically, say every 2 minutes, and the server load balancer will check the return code to ensure healthy completion.

The moment the simulated transaction returns a failure, detected through iRules, the unhealthy server group that performs the web, apps and database transaction will be isolated, and traffic will be routed to other healthy server groups before actual trading.

Thus the glitches should not have occurred in the first place.
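
The periodic simulated transaction described above could look roughly like the sketch below. It is plain Python, not an actual iRule, and the names `run_probe`, `monitor` and `PROBE_INTERVAL_SECONDS` are hypothetical. A return code of 0 is assumed to mean the simulated transaction completed.

```python
import time
from typing import Callable, Dict

PROBE_INTERVAL_SECONDS = 120  # "say every 2 minutes"


def run_probe(
    groups: Dict[str, bool],
    synthetic_transaction: Callable[[str], int],
) -> Dict[str, bool]:
    """Run one simulated transaction per server group.

    Returns a new health map: a group is healthy only if its simulated
    transaction came back with return code 0 (an assumption of this sketch).
    """
    return {name: synthetic_transaction(name) == 0 for name in groups}


def monitor(
    groups: Dict[str, bool],
    synthetic_transaction: Callable[[str], int],
    rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bool]:
    """Probe every group `rounds` times, waiting the interval after each round."""
    for _ in range(rounds):
        groups = run_probe(groups, synthetic_transaction)
        sleep(PROBE_INTERVAL_SECONDS)
    return groups
```

Because every group is probed on every round, a group that was isolated earlier is marked healthy again as soon as its simulated transaction succeeds.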

Ricky L • a second ago

Asking the application to detect hard disk failure is an impossible task for the application, because the application will not be able to tell the network and hardware to route to other server groups. The application does not understand how DNS maps URLs to IP addresses, cannot check server health, and cannot perform system or network monitoring functions such as SNMP, MIBs etc.

It is like asking a cook to build a house.

Dave • 23 minutes ago

Such a joke.. don't tell me SGX don't have maintenance and redundancy for their servers

Ricky L • a second ago

Having redundant servers is not good enough.

But having the "intelligence" to detect hardware failure and software failure, and the ability to detect "transaction failure" and take an intelligent decision to avoid the failed hardware and software and route to the healthy ones, is essential.

Just having redundant servers will not do the above.

Ricky L • a second ago

This free IT consultancy service will in future save SGX multi-million-dollar transactions and uphold the reputation of Singapore as a reliable and efficient trading hub.

GlobalCrosser • 2 hours 20 minutes ago

In other words if you get an honest and reputable IT consultant and supplier, these disruptions could have been avoided.

GlobalCrosser • 2 hours 33 minutes ago

Hard disk made in where? Some 5th world country? Got conned by IT subcon into using inferior hard disks is it?

Ricky L • a second ago
There is another thing that I don't understand.

Normally, for apps that are stored on disk, a good practice is to configure RAID 1 (disk mirroring), where 2 copies of the apps are stored on 2 separate disks.

If the disk holding one copy of the apps is faulty, that copy will be inaccessible and cannot be read to perform app operations. The 2nd mirror disk with the 2nd copy of the apps should then be read to execute the share transaction.

By doing so, there will be no downtime or disruption to the trading activities.
With SNMP v1, 2c or 3 turned on, the faulty disk will be detected by the server monitoring system and it can be replaced without affecting normal trading operations.

I wonder why the above did not take place.
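
As a rough illustration of the mirroring idea in this comment, the Python sketch below simulates two in-memory "disks" and falls back to the second copy when the first cannot be read. Real RAID 1 is handled by a controller or the operating system, not by application code like this. The classes `Disk` and `Mirror`, the `DiskError` exception and the `degraded` list are all made up for the example.

```python
from typing import Dict, List


class DiskError(OSError):
    """Raised when a simulated disk cannot be read or written."""


class Disk:
    """In-memory stand-in for one physical disk (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.failed = False
        self.blocks: Dict[str, bytes] = {}

    def write(self, key: str, data: bytes) -> None:
        if self.failed:
            raise DiskError(f"I/O error on {self.name}")
        self.blocks[key] = data

    def read(self, key: str) -> bytes:
        if self.failed:
            raise DiskError(f"I/O error on {self.name}")
        return self.blocks[key]


class Mirror:
    """RAID 1 style pair: every write goes to both disks, reads fall back."""

    def __init__(self, first: Disk, second: Disk) -> None:
        self.disks = [first, second]
        self.degraded: List[str] = []  # disks a monitor should flag for replacement

    def write(self, key: str, data: bytes) -> None:
        for disk in self.disks:
            disk.write(key, data)

    def read(self, key: str) -> bytes:
        for disk in self.disks:
            try:
                return disk.read(key)
            except DiskError:
                if disk.name not in self.degraded:
                    self.degraded.append(disk.name)
        raise DiskError("all mirror copies failed")
```

In this sketch the `degraded` list plays the role of the alert that a server monitoring system would raise so the faulty disk can be swapped while reads continue from the surviving copy.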

Disk failure the culprit in July 14 trading outage: SGX

Singapore Business Review – 4 hours ago

An application also failed to detect the problem.

The prolonged disruption in Singapore Exchange (SGX) trading last Thursday, July 14, was due to a disk failure and was prolonged due to challenges in the orders and trade reconciliation process.

According to a statement by SGX, at about 9:38am, SGX detected Input/Output errors on a disk that runs the application to send out clearing confirmation messages to members.

“As the application did not detect the disk failure, which it should have, it did not automatically cutover to SGX’s backup secondary system. SGX initiated a manual cutover from the primary to secondary systems at 1012 hours,” SGX said.

As a result, some clearing confirmation messages were not generated, causing trading to cease at 11:38am, SGX said.

Meanwhile, SGX assured the public that the disk has been replaced and complete health checks have been conducted.

“We are working with our vendor to review the application which sends out clearing confirmation messages, and will implement the necessary changes to ensure detection by the application of specific hardware problems. We will improve our process in data generation, and fine tune the data files to better enable our members in their reconciliation processes,” SGX said.

“We will work with members to review their order and trade reconciliation process, to improve overall recovery and market resumption, in the event of a similar recurrence. We will increase the number of our Business Continuity Planning scenarios which require industry-wide participation for reconciliation and recovery,” it added.

Babe Blog - Insights on News Events

Tuesday, July 19, 2016
