Tuesday Site reports

Notetaking for the AHM week:
Tue: Chrulle
Wed: Ville
Thu: Petter
Fri: Jens

Intro round

NT1 staff:
Maswan (Mattias Wadenstein), HPC2N: NT1 boss
Petter (Petter Urkedal), NBI: Operations, monitoring; sysadmin
Chrulle (Christian Søttrup), NBI: Ops, security; ARC dev
Jens (Jens Larsson), NSC: Ops, dCache; sysadmin
Krishnaveni (Krishnaveni Chitrapu), NSC: Storage/dCache developer
Salmela (Ville Salmela), CSC: Ops; sysadmin
ErikE (Erik Edelmann), CSC: Alice, accounting
Darren (Darren Starr), UiO: Ops; storage sysadmin
Maikenp (Maiken Pedersen), UiO: NT1 ARC
Oxana (Oxana Smirnova), Lund: Experiments liaison

Local staff:
Kildetoft (Rune Kildetoft), NBI: Local sysadmin
Happe (Hans Henrik Happe), NBI: HPC center boss
Znikke (Niklas Zenström), HPC2N: Tape
Roger (Roger Oscarsson), HPC2N: Compute
Michael Rännar, HPC2N
Borisw (Boris Wagner), UiB: Tier-1 admin
Thomas Linden, HIP: Compute, CMS
Garvin (Vincent Garonne), UiO: Former NT1 storage developer
Caela (Michaela Barth), PDC: NeIC NT1 PO
Dejan Lesjak, IJS: Sysadmin
Andrej Filipcic, IJS: Atlas
Marcos Acebes, Lunarc: Storage
Robert (Robert Grabowski), Lunarc: Compute
Florido, Lund: Local support, ARC dev

Central services (Maswan)

Krishnaveni is a new hire for storage. The new servers have been fully provisioned in Ørestaden and the old servers have been torn down.

Services have been offered to EISCAT 3D; we should have an answer before the end of the year. We would prefer to offer something very similar to what we are already doing.

We are working towards a more "DevOps" approach to provisioning the central services. We would also like to finish tarpooling and tape pool performance work. For the future, we would like to consolidate more user communities onto our storage to increase cost efficiency.

Florido asks what would be needed to accommodate a new storage user. Maswan answers that we would be able to do that if they cover the cost of hardware plus their share of ops staff and incidentals.

Central site technical overview (Jens)

We are now completely on the new machines. The old servers are moving to Umeå to be included in mordor. We now have 3 Ganeti machines and 2 PostgreSQL machines. We are moving to 3 separate levels of storage:
Production
Pre-production (for stress testing the setup before moving to production)
Experimental (new!) (for testing things out, experimenting with fault recovery, etc.)

We are possibly the first to run dCache 6.2 in production. Next we need to upgrade the OS on the head nodes and move to three of them instead of two. There have been issues with the switch from DigiCert to Sectigo; that should be sorted by now.

Sites roundtable

Bergen

Bergen is an Alice-only site, using AliEn, dCache and TSM. It is now part of the Norwegian Science Cloud. The site is fully virtualized: compute and storage run on OpenStack VMs, backed by Ceph. The idea was to fully disassociate the T1 service from the actual hardware; Boris thinks it was a beautiful idea, but that it does not really work in practice. They use Terraform as a machine-readable description of the infrastructure, and Ansible is used for provisioning, configured automatically from Terraform.
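The notes do not spell out how the Terraform-to-Ansible handover works at Bergen; purely as a minimal sketch of the pattern (the output name "pool_nodes", the group name, and the file name are all hypothetical), a small script can read `terraform output -json` and emit a static Ansible inventory:

    #!/usr/bin/env python3
    """Sketch: derive an Ansible inventory from Terraform outputs.

    Assumes a Terraform output named "pool_nodes" holding a list of
    IP addresses; all names here are hypothetical.
    """
    import json
    import subprocess

    # Terraform prints its outputs in machine-readable JSON form.
    raw = subprocess.run(["terraform", "output", "-json"],
                         capture_output=True, text=True, check=True).stdout
    outputs = json.loads(raw)

    # Each output is wrapped as {"value": ..., "type": ..., "sensitive": ...}.
    hosts = outputs["pool_nodes"]["value"]

    # Write a static INI inventory that ansible-playbook can consume via -i.
    with open("inventory.ini", "w") as inv:
        inv.write("[pool_nodes]\n")
        for ip in hosts:
            inv.write(ip + "\n")

    print("wrote", len(hosts), "hosts to inventory.ini")

Re-running something like this after each `terraform apply` would keep the Ansible side in step with the infrastructure description.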
Bergen has had understaffing issues. The current staff is:
Raimund Kristensen (group leader)
Tor Ladre (network and storage)
Boris (compute and T1)

Efficiency has been an issue in 2020. When there is a general efficiency issue, it hits Bergen much harder and for longer than other NT1 sites. The advantages of cloudification should be there, but have not been seen in practice; on the contrary, the complexity makes it very hard to fix problems, and there is also a performance loss, mostly in wasted space.

Thomas asks if Alice has standard monitoring jobs that could be used for tracking the efficiency. HIP had also seen issues with efficiency; they were solved by tuning the Lustre stripe size (paper at CHEP 2012).

UCPH (Happe)

The latest thing is building a new data center. UCPH will have a new 200 m² room, with space for about 60 racks at 50 kW per rack. Preparation is ongoing, but the tender has not been done yet.

With regard to the T1, there have been issues with ARC stalling. The session dir has been moved to a separate server, which has helped. New cache servers are in the works. We are working on buying 2 PB of disk storage, while 1 PB is about to be decommissioned. The tapes are not yet full: there is room for this year's and next year's pledges in the current library, but a new library is planned for purchase next year. This will wait until the new computer room is ready.

Maswan asks about plans for the new room. Happe will look up the 3D scans. Water cooling is being planned, with rear-door cooling.

UiO (Maiken)

UiO has reached its final form: 270 nodes of 8-core Epyc, installed in the Norwegian OpenStack cloud. For now it is running with 16 DDR servers; this is perhaps overkill, but it works and should not hurt anything. UiO gave up on ElastiCluster (too much overhead) and extracted the needed Ansible roles instead.

The latest issue was saturation of the DDR network; the problem was caused by a gateway needing updates and a reboot after a 5 year uptime. There are some issues with filling the nodes: backfilling is being used, but the low-priority jobs are not always available when needed. The file system is not performing optimally and will be upgraded with SSDs.

Storage is being transitioned away from Ceph, which is not necessary and too complex. Instead it is moving to XFS on mdraid running on ARM machines, as sketched below. There are issues with performance degradation under heavy load, which is being looked into. It is planned to be in production within a month.
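The notes do not give the exact layout; purely as an illustration of the XFS-on-mdraid approach (device names, RAID level, and mount point are all assumptions), building one such pool could look like this:

    #!/usr/bin/env python3
    """Sketch: build one XFS-on-mdraid pool filesystem.

    Device names, RAID level, and mount point are assumptions; the notes
    only say the new storage is XFS on mdraid on ARM machines.
    """
    import subprocess

    DISKS = ["/dev/sd%s" % c for c in "bcdefghij"]  # 9 member disks (assumed)

    def run(cmd):
        # Echo each command before running it; abort on any failure.
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Assemble the member disks into a RAID-6 md array.
    run(["mdadm", "--create", "/dev/md0", "--level=6",
         "--raid-devices=%d" % len(DISKS)] + DISKS)

    # One XFS filesystem across the whole array.
    run(["mkfs.xfs", "/dev/md0"])

    # Mount it where the pool software expects its data (path assumed).
    run(["mkdir", "-p", "/srv/pool1"])
    run(["mount", "/dev/md0", "/srv/pool1"])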
IJS (Dejan)

About 1 PB of storage has been added. Work is ongoing on getting 100 GbE; the switch has been procured, but it is not yet connected to a 100 G fibre, so it is not of much use yet. There is work on better switching for the nodes, which might solve the issues with difficult pools; the problem there seems to be network cards getting stuck.

There will be a new Atlas site in Slovenia, VEGA, which should be ready by March 2021. Jens asks about increased upstream bandwidth; that is expected to happen at the beginning of next year.

HPC2N (Znikke)

New dCache pools are being bought, both for NT1 and Swestore; the service contract on the current nodes expires 24/12 this year. The new nodes are hoped to be delivered by mid-November to ensure timely installation and migration. Dell was chosen. New switches are also being bought (100/25 GbE). The goal is to get 100 GbE LHCOPN connectivity once UmU and SUNET work out how.

The ARC 6 deployment got delayed: the primary person got sick and Sectigo certificates were hard to get. Test jobs have run and Nagios is mostly happy. It is expected to be in production by the end of October.

Abisko will be decommissioned, but will be kept alive until Kebnekaise is in full production. Kebnekaise will be just about enough for the pledge, but NSC should be able to help out and complete the Swedish pledge. New ARC cache machines are in the works and should be in place around the time of the next AHM; HPC2N is toying with the idea of moving to SSDs.

CSC/HIP (ErikE, Thomas)

Alicetron is a virtual cluster on the CSC OpenStack. It was increased from 13*24 to 20*24 cores. The hardware is ageing and will need to be regenerated on new hardware. Job efficiency is low, not as bad as Bergen but worse than other sites; the likely reason is network congestion. It has otherwise been working reliably.

CMS is running on two clusters at HIP. The old one is almost out of production. The new one has had issues with Lustre, and the fix has caused problems with running CMS jobs; it is hoped this will be fixed this week. There is another new cluster on which ARC 6 is being installed; Alcyon will be retired once this is working. There is a grant for new storage. The issues with compute have put a spanner in the works, but it is in the pipeline.

The CSC dCache pools are almost out of warranty. There is money from the Academy of Finland, and they will be replaced as soon as possible.

Lund

Not much to tell. A new ARC 6 frontend with a Sectigo certificate is ready and needs to be put into production. The nodes will be replaced next year, so Lund is looking for suggestions. Swestore T2 storage has been consolidated onto a couple of handfuls of nodes. Christian asks if we can just start production; Robert agrees and will coordinate with Gianfranco.

NSC

New nodes have been procured. Benchmarking suggests that the EPYC CPUs are very efficient with HT, so it is expected that the nodes will run with HT enabled. NSC will also get a very expensive and powerful NVIDIA AI machine.

UPS

NSC lost power again. The procurement and installation of the new UPS in Hangaren (WLCG) went well. Unfortunately, it turns out that with all power modules enabled, the transformer gets a 450 Hz disturbance on the power feed. The system is running, but with degraded UPS capacity.

The UPS in Kärnhuset died completely. It was turned off and taken out of the power loop; being offline then killed most of the batteries. It is too expensive to replace, so instead a new procurement is being looked into. This also means that one compute room is currently running without UPS.