Tuesday, July 19, 2016

Disk failure the culprit in July 14 trading outage: SGX


Ricky L • a second ago
Actually, the server load balancer checks that a transaction completes across its web, application and database dependencies, and checks the health of the servers (including hard disks). It sends these health statistics to the global load balancer, which consolidates the overall health of the servers, hard disks and network.

When it detects that a transaction cannot be completed, a return code is flagged. Through dependency checking, the server load balancer stops sending further transactions to that server group (web, app, database, including the hard disk), isolates the group, and routes traffic to the next healthy web, app and database server group instead.

Thus a good network and system architecture design would have prevented such glitches.
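The dependency-group idea above can be sketched in a few lines of Python. This is only an illustration, not how any real load balancer or SGX's systems are configured: `ServerGroup`, `pick_group` and the lambda health checks are hypothetical names made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# A health check returns True when its tier (web, app, db, disk) is healthy.
HealthCheck = Callable[[], bool]


@dataclass
class ServerGroup:
    """One dependency group: the tiers that serve a transaction together."""
    name: str
    checks: Dict[str, HealthCheck] = field(default_factory=dict)

    def is_healthy(self) -> bool:
        # The group is only as healthy as its weakest tier.
        return all(check() for check in self.checks.values())


def pick_group(groups: List[ServerGroup]) -> Optional[ServerGroup]:
    """Return the first fully healthy group, skipping any with a failed tier."""
    for group in groups:
        if group.is_healthy():
            return group
    return None


# Hypothetical setup: the primary group has a failing disk.
primary = ServerGroup("primary", {
    "web": lambda: True, "app": lambda: True,
    "db": lambda: True, "disk": lambda: False,
})
secondary = ServerGroup("secondary", {
    "web": lambda: True, "app": lambda: True,
    "db": lambda: True, "disk": lambda: True,
})

chosen = pick_group([primary, secondary])
print(chosen.name)  # secondary
```

Because the disk is checked as part of the group, one failed tier takes the whole group out of rotation rather than leaving a half-working path in service.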

Ricky L • a second ago
In other words, checking for healthy transaction completion, server health and disk health should not be the function of the app.

It should be the function of the server load balancer, which checks for healthy transaction completion.

Ricky L • a second ago
In fact, under these circumstances a Business Continuity Planning scenario would not need to kick in.

A redundant server group or server virtualisation group can take over the failed function very easily, with minimal or even unnoticeable disruption.

Ricky L • a second ago
In fact, a simulated transaction can be induced periodically, say every 2 minutes, and the server load balancer will check the return code to ensure healthy completion.

The moment the simulated transaction returns a failure through iRules, the unhealthy server group that performs the web, app and database transaction is isolated, and traffic is routed to another healthy server group before actual trading.

Thus the glitches should not have occurred in the first place.
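A rough Python sketch of the simulated-transaction check described above follows. It is not an F5 iRule and not SGX's setup: the `OK` return code, `probe_groups` and the group names are assumptions made for illustration. In practice a scheduler would call `probe_groups` on a fixed interval, such as the 2 minutes suggested above.

```python
from typing import Callable, Dict, List

OK = 0  # hypothetical return code for a completed transaction


def probe_groups(probes: Dict[str, Callable[[], int]]) -> List[str]:
    """Run one simulated transaction per group; return groups kept in rotation."""
    in_rotation = []
    for name, run_transaction in probes.items():
        try:
            code = run_transaction()
        except Exception:
            # A crash (e.g. a disk I/O error) counts as a failed transaction.
            code = -1
        if code == OK:
            in_rotation.append(name)
    return in_rotation


def failing_disk_transaction() -> int:
    # Simulates a transaction that hits a faulty disk.
    raise OSError("Input/output error")


probes = {
    "group-a": failing_disk_transaction,  # unhealthy group
    "group-b": lambda: OK,                # healthy group
}

healthy = probe_groups(probes)
print(healthy)  # ['group-b']
```

The probe exercises the full transaction path, so a fault anywhere in the group shows up as a bad return code even if each server still answers a simple ping.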

Ricky L • a second ago
Asking the application to detect hard disk failure is asking the impossible, because the application cannot tell the network and hardware to route to other server groups. An application does not understand DNS mapping of URLs to IP addresses, cannot check server health, and cannot perform system or network monitoring functions such as SNMP, MIBs, etc.

It is like asking a cook to build a house.


Dave • 23 minutes ago
Such a joke. Don't tell me SGX doesn't have maintenance and redundancy for their servers.
  • Ricky L • a second ago
    Having redundant servers is not good enough.

    What is essential is the "intelligence" to detect hardware failure, software failure and "transaction failure", and to make the intelligent decision to avoid the failed hardware and software and route to the healthy ones.

    Just having redundant servers will not do the above.
Ricky L • a second ago
This free IT consultancy service will in future save SGX multi-million-dollar transactions and uphold the reputation of Singapore as a reliable and efficient trading hub.

GlobalCrosser • 2 hours 20 minutes ago
In other words if you get an honest and reputable IT consultant and supplier, these disruptions could have been avoided.

GlobalCrosser • 2 hours 33 minutes ago
Hard disk made in where? Some 5th world country? Got conned by IT subcon into using inferior hard disks is it?
  • Ricky L • a second ago
    There is another thing that I don't understand.

    Normally, for apps stored on a disk, good practice is to configure RAID 1 (disk mirroring), where 2 copies of the app are stored on 2 separate disks.

    If the disk holding one copy is faulty, that copy becomes inaccessible and cannot be read to run the app. The 2nd mirror disk, with the 2nd copy of the app, should then be read to execute the share transaction.

    That way, there is no downtime or disruption to trading activities.
    With SNMP v1, v2c or v3 turned on, the faulty disk will be detected by the server monitoring system and can be replaced without affecting normal trading operations.

    I wonder why the above did not take place.
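The mirroring behaviour described above can be illustrated with a toy in-memory model. Real RAID 1 is handled by a controller or the operating system, not by application code like this; `Disk`, `Mirror` and the `faulty` list are hypothetical names, and the `faulty` list merely stands in for what a monitoring system would be told.

```python
from typing import Dict, List


class Disk:
    """A toy in-memory disk; `failed` simulates a hardware fault."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[int, bytes] = {}
        self.failed = False

    def write(self, block: int, data: bytes) -> None:
        if self.failed:
            raise OSError(f"{self.name}: I/O error")
        self.blocks[block] = data

    def read(self, block: int) -> bytes:
        if self.failed:
            raise OSError(f"{self.name}: I/O error")
        return self.blocks[block]


class Mirror:
    """RAID 1 style mirror: every write goes to all disks, reads fall back."""

    def __init__(self, disks: List[Disk]) -> None:
        self.disks = disks
        self.faulty: List[str] = []  # names a monitoring system would report

    def _mark_faulty(self, disk: Disk) -> None:
        if disk.name not in self.faulty:
            self.faulty.append(disk.name)

    def write(self, block: int, data: bytes) -> None:
        written = 0
        for disk in self.disks:
            try:
                disk.write(block, data)
                written += 1
            except OSError:
                self._mark_faulty(disk)
        if written == 0:
            raise OSError("all mirrors failed")

    def read(self, block: int) -> bytes:
        # Try each copy in turn; a faulty disk is skipped, not fatal.
        for disk in self.disks:
            try:
                return disk.read(block)
            except OSError:
                self._mark_faulty(disk)
        raise OSError("all mirrors failed")


disk_a, disk_b = Disk("disk-a"), Disk("disk-b")
mirror = Mirror([disk_a, disk_b])
mirror.write(0, b"clearing confirmation")

disk_a.failed = True    # simulate the disk fault
data = mirror.read(0)   # served from the second copy
print(data, mirror.faulty)
```

The read succeeds from the second copy while the faulty disk is recorded for replacement, which is the no-downtime behaviour the comment expects.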
Disk failure the culprit in July 14 trading outage: SGX
Singapore Business Review – 4 hours ago


An application also failed to detect the problem.

The disruption in Singapore Exchange (SGX) trading last Thursday, July 14, was caused by a disk failure and was prolonged by challenges in the orders and trade reconciliation process.

According to a statement by SGX, at about 9:38am, SGX detected Input/Output errors on a disk that runs the application to send out clearing confirmation messages to members.

“As the application did not detect the disk failure, which it should have, it did not automatically cutover to SGX’s backup secondary system. SGX initiated a manual cutover from the primary to secondary systems at 1012 hours,” SGX said.

As a result, some clearing confirmation messages were not generated, causing trading to cease at 11:38am, SGX said.

Meanwhile, SGX assured the public that the disk has been replaced and complete health checks have been conducted.

“We are working with our vendor to review the application which sends out clearing confirmation messages, and will implement the necessary changes to ensure detection by the application of specific hardware problems. We will improve our process in data generation, and fine tune the data files to better enable our members in their reconciliation processes,” SGX said.

“We will work with members to review their order and trade reconciliation process, to improve overall recovery and market resumption, in the event of a similar recurrence. We will increase the number of our Business Continuity Planning scenarios which require industry-wide participation for reconciliation and recovery,” it added.
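For readers wondering what "detection by the application" might look like in code, here is a minimal Python sketch. It is purely an assumption about the general technique, not SGX's or its vendor's application: `send_with_cutover` and the two writer functions are hypothetical. The idea is that a low-level disk fault usually surfaces to a program as an `OSError` with errno `EIO`, which the program can catch and treat as the signal to cut over.

```python
import errno
from typing import Callable, List


def send_with_cutover(message: str, writers: List[Callable[[str], None]]) -> int:
    """Try each system in order; return the index of the one that succeeded."""
    for index, write in enumerate(writers):
        try:
            write(message)
            return index
        except OSError as exc:
            # EIO is the errno reported for a low-level input/output error.
            # Any other error is not a disk fault, so it is re-raised.
            if exc.errno != errno.EIO:
                raise
    raise RuntimeError("no system could write the message")


def primary_writer(message: str) -> None:
    # Simulates the primary system's faulty disk.
    raise OSError(errno.EIO, "Input/output error")


secondary_log: List[str] = []


def secondary_writer(message: str) -> None:
    # Simulates the healthy secondary system.
    secondary_log.append(message)


used = send_with_cutover("clearing confirmation #1",
                         [primary_writer, secondary_writer])
print(used, secondary_log)
```

Here the cutover happens on the first failed write instead of waiting for an operator to notice the errors.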

