
ZFS monster


http://hardforum.com/archive/index.php/t-1662769.html

 

ZFS Monster Stage 4: 40Gbps Infiniband

packetboy
12-31-2011, 03:44 PM
I had planned on doing this from the beginning: I just ordered a QDR (40Gbps) Infiniband adapter for my ZFS server, a new 4-blade Supermicro server with each blade outfitted with a Mellanox ConnectX-2 QDR interface, and a 16-port Mellanox QDR switch.

Anyone here played with IB on Solaris/OpenIndiana yet?

This should be interesting....my plan is to use this for a mini Apache Hadoop cluster.
ABSiNTH
01-01-2012, 02:40 PM
Dear Lord, that is some serious throughput...I'll be following along on this thread...
jonnyjl
01-01-2012, 03:08 PM
Pictures!

I'd be really interested in how you get it done. Infiniband (or FC) is stuff I can only dream of doing with OI (@home)

I've read a little of this.
http://www.zfsbuild.com/category/infiniband/
packetboy
01-10-2012, 05:24 PM
Oh goodie...the Mellanox QDR ConnectX-2 Infiniband adapters that are onboard the Supermicro X8DTT-IBQF blades *ARE* detected by Solaris 11 (11/11):



# cfgadm -al


Ap_Id Type Receptacle Occupant Condition
hca:2590FFFF2FC81C IB-HCA connected configured ok
ib IB-Fabric connected configured ok
ib::2590FFFF2FC81D,0,ipib IB-PORT connected configured ok
ib::iser,0 IB-PSEUDO connected configured ok


Note: I was initially nervous, as I got a driver error when I booted the Live USB...I tried to tell Solaris to manually download the 'hermon' (Infiniband) drivers...it tried to but failed, saying root was read-only...I'm confused by this, but perhaps I'm not fully understanding the limitations of the live USB.

Regardless, I installed to a hard drive, re-ran the device driver utility, and it automatically grabbed the hermon drivers from the Solaris repository.
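
A couple of quick sanity checks are handy at this stage to confirm the driver really is bound to the HCA. These are standard Solaris commands, not from the original post, and output will differ per system:


# is the hermon driver attached to the HCA?
prtconf -D | grep -i hermon

# which PCI IDs the hermon driver claims
grep hermon /etc/driver_aliases

# HCA / IB fabric attachment points
cfgadm -al | grep -i ib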

I am still VERY far from actually moving data across this transport, but this is a huge step.
pjkenned
01-10-2012, 06:43 PM
Great! I figured you would do something like this at some point.
DeChache
01-10-2012, 07:58 PM
This is awesome
packetboy
01-10-2012, 08:00 PM
How's this?

Logical design:

http://img15.imageshack.us/img15/2276/hadoopi.png

The SAS switch is my backup plan if disk access latency over IB is unacceptable...eg. I'll create a parallel SAS network strictly for storage connectivity and then the IB will be used only for intra-node communications (which are plentiful when running Hadoop).

http://img705.imageshack.us/img705/3020/supermicroibblade.jpg

http://img803.imageshack.us/img803/4852/bladeibmezzanineandconn.jpg

http://img263.imageshack.us/img263/3320/ibnetwork.jpg

http://img24.imageshack.us/img24/1498/ibswitchcloseup.jpg
jen4950
01-10-2012, 08:23 PM
So- silly question- what do you use this Hadoop thing for?

Some big names on their user list.
Jesse B
01-10-2012, 08:24 PM
You're a monster :eek:


This build is freaking epic. Looking forward to seeing some more pics, and even more importantly, some numbers!
packetboy
01-10-2012, 08:32 PM
So- silly question- what do you use this Hadoop thing for?


Hadoop lets you find stuff reasonably quickly...when "stuff" = many terabytes of highly unstructured data. In my case it's 50TB of raw packet captures (e.g. Wireshark/tcpdump).

That's about all I can say.
samborarocks
01-10-2012, 09:33 PM
So when I saw ZFS and Infiniband in the first post..... I subscribed to this thread. And as already stated that is some serious bandwidth.....
parityboy
01-10-2012, 09:43 PM
@OP

The SAS switch is my backup plan if disk access latency over IB is unacceptable...eg. I'll create a parallel SAS network strictly for storage connectivity and then the IB will be used only for intra-node communications (which are plentiful when running Hadoop).

What will be running across the IB connection(s)? iSCSI?
Stanza33
01-10-2012, 09:50 PM
Here ya go, if you get stuck

http://forums.overclockers.com.au/showthread.php?t=944153
shetu
01-11-2012, 01:01 AM
subscribed
tangoseal
01-11-2012, 01:28 AM
I had planned on doing this from the beginning: I just ordered a QDR (40Gbps) Infiniband adapter for my ZFS server, a new 4-blade Supermicro server with each blade outfitted with a Mellanox ConnectX-2 QDR interface, and a 16-port Mellanox QDR switch.

Anyone here played with IB on Solaris/OpenIndiana yet?

This should be interesting....my plan is to use this for a mini Apache Hadoop cluster.

Some where out there people have a large sum of bank candy they like to feed the technology monster.
pjkenned
01-11-2012, 01:23 PM
IMO you need to make your own hadoop benchmark.

And never ask what parityboy does with this stuff :-)
packetboy
01-13-2012, 06:52 PM
I know getting Infiniband working under Solaris/OI is going to be more involved, so today I decided to install Ubuntu on two of the blades and see if I could get them to talk to each other. Using Ubuntu 11.10 makes it trivial, as outlined here:

http://davidhunt.ie/wp/?p=2291




root@ib4:~# /usr/sbin/iblinkinfo
Switch 0x0002c90200443470 Infiniscale-IV Mellanox Technologies:

3 16[ ] ==( 4X 10.0 Gbps Active / LinkUp)==> 2 1[ ] "MT25408 ConnectX Mellanox Technologies" ( Could be 5.0 Gbps)
3 17[ ] ==( 4X 10.0 Gbps Active / LinkUp)==> 1 1[ ] "MT25408 ConnectX Mellanox Technologies" ( Could be 5.0 Gbps)



root@ib3:~# ibstat
CA 'mlx4_0'
CA type: MT26428
Number of ports: 1
Firmware version: 2.9.1000
Hardware version: b0
Node GUID: 0x002590ffff2fc828
System image GUID: 0x002590ffff2fc82b
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 1
LMC: 0
SM lid: 2
Capability mask: 0x0251086a
Port GUID: 0x002590ffff2fc829


# ifconfig

ib0 Link encap:UNSPEC HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:192.168.9.2 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::225:90ff:ff2f:c829/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:6704163 errors:0 dropped:0 overruns:0 frame:0
TX packets:2174312 errors:0 dropped:6 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:415695395 (415.6 MB) TX bytes:11643802311 (11.6 GB)



root@ib3:~# netperf -H 192.168.9.4
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.9.4 (192.168.9.4) port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

87380 16384 16384 10.00 4593.36



OK, so we've got some work to do given "only" 4.5Gbps across a 40Gbps (really 36Gbps after overhead) link. But this is stock Ubuntu 11.10...zero tuning.
packetboy
01-13-2012, 07:02 PM
Iperf results agree with netperf:


root@ib3:~# iperf -P 1 -c 192.168.9.4
------------------------------------------------------------
Client connecting to 192.168.9.4, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.9.2 port 47622 connected with 192.168.9.4 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 5.04 GBytes 4.33 Gbits/sec
packetboy
01-13-2012, 07:05 PM
Let the tuning begin..first switch IB adapters from 'datagram' mode to 'connected' mode:


root@ib3:~# echo "connected" > /sys/class/net/ib0/mode
root@ib4:~# echo "connected" > /sys/class/net/ib0/mode


Then retest.


root@ib3:~# iperf -P 1 -c 192.168.9.4
------------------------------------------------------------
Client connecting to 192.168.9.4, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.9.2 port 47623 connected with 192.168.9.4 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 7.67 GBytes 6.59 Gbits/sec


Whoa...guess that's an important one! That's about a 48% improvement right there.
packetboy
01-13-2012, 07:10 PM
Once in connected mode, higher MTUs are allowed..up to 64K, so let's try that next:


root@ib3:~# ifconfig ib0 mtu 64000
root@ib4:~# ifconfig ib0 mtu 64000




root@ib3:~# iperf -P 1 -c 192.168.9.4
------------------------------------------------------------
Client connecting to 192.168.9.4, TCP port 5001
TCP window size: 189 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.9.2 port 47626 connected with 192.168.9.4 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 19.6 GBytes 16.8 Gbits/sec


Now that's what I'm talking about.
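
Note that neither the mode nor the MTU survives a reboot. One way to make them stick on Ubuntu (a sketch we haven't verified here, adapted from the ifupdown approach in the guide linked above; adjust the interface and addressing per host) is an /etc/network/interfaces stanza:


auto ib0
iface ib0 inet static
    address 192.168.9.2
    netmask 255.255.255.0
    # set connected mode before the interface comes up, then raise the MTU
    pre-up echo connected > /sys/class/net/ib0/mode
    post-up /sbin/ifconfig ib0 mtu 65520


65520 is the IPoIB connected-mode maximum; the 64000 used above works as well.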
402blownstroker
01-13-2012, 07:23 PM
root@ib3:~# iperf -P 1 -c 192.168.9.4
------------------------------------------------------------
Client connecting to 192.168.9.4, TCP port 5001
TCP window size: 189 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.9.2 port 47626 connected with 192.168.9.4 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 19.6 GBytes 16.8 Gbits/sec


Now that's what I'm talking about.

Good God, that just made my pants dance some :D
Rectal Prolapse
01-13-2012, 07:57 PM
Ouch - that's it? A couple of Dell XR997s could beat that! :P *ducks*

*crawls under rock*

Keep on experimenting! :)
MarkL
01-14-2012, 09:01 AM
Once in connected mode, higher MTUs are allowed..up to 64K, so let's try that next:


root@ib3:~# ifconfig ib0 mtu 64000
root@ib4:~# ifconfig ib0 mtu 64000




root@ib3:~# iperf -P 1 -c 192.168.9.4
------------------------------------------------------------
Client connecting to 192.168.9.4, TCP port 5001
TCP window size: 189 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.9.2 port 47626 connected with 192.168.9.4 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 19.6 GBytes 16.8 Gbits/sec


Now that's what I'm talking about.

Very nice!

So a question regarding IB as I am a bit of a noob in that area. When you have your cluster setup, will you be using IPoIB for the communication between nodes and data storage the same as your benchmarking there? Or do you switch it to act more like a SAN where it will address disks remotely? Or does it get mixed with the nodes themselves sharing information via IPoIB but then your ZFS head server talks directly to the drive via a different mode?

Thanks..
packetboy
01-14-2012, 11:14 AM
Very nice!

So a question regarding IB as I am a bit of a noob in that area. When you have your cluster setup, will you be using IPoIB for the communication between nodes and data storage the same as your benchmarking there? Or do you switch it to act more like a SAN where it will address disks remotely? Or does it get mixed with the nodes themselves sharing information via IPoIB but then your ZFS head server talks directly to the drive via a different mode?

Thanks..

I'm about 3 days into IB experience, so noob too.
As I understand it there are a bunch of options:

NFS over IPoIB
iSCSI over IPoIB

-OR-

(Assuming the underlying operating systems support it) you eliminate TCP/IP altogether and use NFS and/or iSCSI over RDMA (Remote Direct Memory Access)

iSCSI over RDMA:
http://docs.oracle.com/cd/E23824_01/html/821-1459/fnnop.html#gfcun

NFS over RDMA:
http://www.opengridcomputing.com/nfs-rdma.html

I'm working on getting components in place to do IB testing over RDMA now.
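
For the NFS-over-RDMA option, the Linux client side is at least simple to express. A minimal sketch (server name, export path, and mount point are placeholders; the server has to be exporting the share already), essentially the same form that shows up later in this thread:


# load the NFS/RDMA client transport, then mount over the standard NFS/RDMA port (20049)
modprobe xprtrdma
mount -t nfs -o rdma,port=20049 oi-server:/export/data /mnt/data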
ChrisBenn
01-14-2012, 06:08 PM
I'd definitely be interested if you can get NFS over RDMA working - I played around with it for awhile and had no luck (Mellanox 10Gb IB cards - the onboard memory ones)
Parak
01-24-2012, 01:12 PM
For ease of use, vanilla NFS and iSCSI are fine. RDMA with them in my experience was difficult to implement, unstable, and not much of a performance boost. This also obviously layers on top of IPoIB, so don't expect latency equivalent to what infiniband may advertise due to overhead.

If you want lowest latency and highest speed storage presentation though, you want to go SRP. That gets you raw disk from storage to client. Implementation on linux is fairly straightforward and decently documented (see SCST or 3.3-rc1 LIO)

Note that infiniband allows protocol mixing without a problem, so you can run any combination of the above with IPoIB, and what have you.
patrickdk
01-24-2012, 02:24 PM
I tried nfs over rdma, but gave up, nfs over connected mode works well.

iscsi works ok, but I liked using srp or iser instead, if possible.

Have you done any tuning to your solaris side? I had to do some good tuning to not have it fall over itself, when getting up in speed. Didn't have to do any tuning to ubuntu though.
Parak
01-24-2012, 02:36 PM
iSER was a no go at the time that I tried it, as the target was nowhere close to being usable. I did not try Solaris clients, only Windows, Linux, and ESX (with SRP). I don't know about your specific needs, but if you have a choice in the matter and have no preference otherwise, I'd actually steer clear of OpenSolaris, especially since it's dead. Infiniband is rather better supported on Linux, though I can't speak for the proprietary Solaris.
danswartz
01-24-2012, 02:45 PM
"I'd actually steer clear of OpenSolaris, especially since it's dead."

In the literal sense, I suppose. Very misleading statement for someone reading this who doesn't know any better...
patrickdk
01-24-2012, 02:47 PM
His screenshots were from Solaris, which is why I mentioned it.
unhappy_mage
01-24-2012, 02:48 PM
Very cool project. I'll be interested in seeing where it goes.
Parak
01-24-2012, 03:20 PM
"I'd actually steer clear of OpenSolaris, especially since it's dead."

In the literal sense, I suppose. Very misleading statement for someone reading this who doesn't know any better...

Ah, I actually misread/misunderstood what the OP was using. I should clarify further that based on my experience with repeatedly applying head to desk over figuring out infiniband, Linux was the better documented (and I use the term loosely) platform for infiniband usage because of OFED. YMMV for commercially supported stuff like Solaris 11, of course.
danswartz
01-24-2012, 03:22 PM
Well, not just that. There are at least two active ports based on OpenSolaris (OpenIndiana and Nexenta). People shouldn't shy from a zfs based solution thinking OS is dead...
patrickdk
01-24-2012, 03:49 PM
Dunno, I found infiniband setup on OpenIndiana was basically automatic; I didn't have to do anything, really. Just install the srp/iser stuff if I wanted to use it, but everything was automatic.

Far from that on linux for me.
packetboy
01-24-2012, 05:15 PM
Dunno, I found infiniband setup on OpenIndiana was basically automatic; I didn't have to do anything, really. Just install the srp/iser stuff if I wanted to use it, but everything was automatic.


How did you get OFED compiled on OI?
patrickdk
01-24-2012, 06:01 PM
I didn't compile OFED on OI, why would I need OFED?
Red Falcon
01-25-2012, 12:23 AM
Once in connected mode, higher MTUs are allowed..up to 64K, so let's try that next:


root@ib3:~# ifconfig ib0 mtu 64000
root@ib4:~# ifconfig ib0 mtu 64000




root@ib3:~# iperf -P 1 -c 192.168.9.4
------------------------------------------------------------
Client connecting to 192.168.9.4, TCP port 5001
TCP window size: 189 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.9.2 port 47626 connected with 192.168.9.4 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 19.6 GBytes 16.8 Gbits/sec


Now that's what I'm talking about.

That's some of the best throughput I've seen on this forum thus far. :cool:
packetboy
01-26-2012, 01:19 PM
I didn't compile OFED on OI, why would I need OFED?

I may be confused, but I thought that's where the RDMA, iSer support comes from?

What packages did YOU use for this and where did you get them??

We're working on this full bore this week, so should show some major progress.
patrickdk
01-26-2012, 01:32 PM
Hmm, nothing is needed for RDMA support.
Looks like the packages are installed by default.
Guess it was just a matter of adding them via cfgadm

On the linux side, iser is supported via the iscsi stuff.
But you need to hunt down the program for srp.
packetboy
01-26-2012, 07:05 PM
We just figured out that OFED 1.5.3/1.5.2 is included in the FreeBSD 9.0 distribution. It's NOT compiled into the kernel by default, but after setting a few kernel options and a quick recompile, FreeBSD comes up like a champ with OFED *and* support for the Mellanox ConnectX-2 adapters we have....took less than 1 hour to do all of this (of course it helps when you have dual 6-core CPUs). The RDMA functions are supposed to be there too...we'll see.

We're building a second FreeBSD box now...hopefully we'll have them talking.

We fired up OI151a...it does NOT see the ib interfaces...unclear if it needs a driver...playing with that now as well.
patrickdk
01-26-2012, 08:11 PM
Not sure, I hadn't tried the connectx-2 cards in oi. But I thought all the connectx cards used the same driver. I'll see if I can find something out, if time permits.
---
Yep, both ConnectX and ConnectX-2 use the hermon driver. Should be detected fine. Your card, as your other posts say, should be using driver_aliases to load it via:
hermon "pciex15b3,6340"

cfgadm doesn't show ib at all?
packetboy
01-27-2012, 02:25 AM
I'll post the details of the ib detect issues we're having on OI151a tomorrow...in the meantime, Syoyo has made major progress:


----
I got OpenSM running on OI151a + ConnectX (hermon).
Also confirmed the SRP target works on OI151a + ConnectX.
(But one SRP connection is limited to 750MB/s max on my HW.
Multiple SRP connections will hit the QDR limit, i.e. 3.2GB/s.)

http://syoyo.wordpress.com/2012/01/23/opensm-on-illumos-hermonconnectx-works/

-----

Seems like SRP may be the ticket...I want to replicate these results!


We got the two FreeBSD 9.0 servers to talk to each other via IPoIB...all we had handy for perf testing was 'scp'....scp across the 1Gb Ethernet yielded 100MB/s as expected. Across the IB.... 7MB/s ... dismal.

So although FreeBSD was trivial to set up, out-of-the-box performance is horrific, plus we can't figure out how to adjust the IB driver settings...this is supposed to be done via /sys/class/net/ib0/mode; however, that directory doesn't even exist on the server. Sigh, it's never easy.
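
Worth noting: /sys/class/net is a Linux sysfs path and doesn't exist on FreeBSD at all; there, IPoIB connected mode is chosen at kernel build time. The kernel config lines below are the ones the FreeBSD wiki describes for 9.x OFED; listed here as a pointer only, since we haven't re-tested a build with them:


options     OFED        # InfiniBand protocol stack
options     IPOIB_CM    # IPoIB connected mode
device      ipoib       # IP over InfiniBand
device      mlx4ib      # Mellanox ConnectX / ConnectX-2 InfiniBand
device      mlxen       # Mellanox ConnectX Ethernet (optional)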
brutalizer
01-27-2012, 08:22 AM
http://forums.overclockers.com.au/showthread.php?t=944153

infiniband on Solaris
packetboy
01-27-2012, 01:18 PM
So here's the status of the IB adapter on oi151a:


System Configuration: Project OpenIndiana i86pc
Memory size: 49143 Megabytes
System Peripherals (Software Nodes):

i86pc
ib, instance #0
srpt, instance #0
rpcib, instance #0
rdsib, instance #0 (driver not attached)
eibnx, instance #0
daplt, instance #0 (driver not attached)
rdsv3, instance #0
sol_uverbs, instance #0 (driver not attached)
sol_umad, instance #0 (driver not attached)
sdpib, instance #0 (driver not attached)
iser, instance #0 (driver not attached)

scsi_vhci, instance #0
pci, instance #0
pci8086,0 (driver not attached)
pci8086,3408, instance #0
pci15d9,10d3, instance #0
pci8086,3409, instance #1
pci15d9,10d3, instance #1
pci8086,340a, instance #2
pci15d9,48, instance #0
ibport, instance #1
pci8086,340c (driver not attached)
pci8086,340e (driver not attached)
pci8086,342d (driver not attached)
pci8086,342e (driver not attached)
pci8086,3422 (driver not attached)
pci8086,3423, instance #0
pci8086,3438, instance #0 (driver not attached)
pci15d9,7 (driver not attached)
pci15d9,7 (driver not attached)
pci15d9,7 (driver not attached)
pci15d9,7 (driver not attached)
pci15d9,7 (driver not attached)
pci15d9,7 (driver not attached)
pci15d9,7 (driver not attached)
pci15d9,7 (driver not attached)
pci15d9,7, instance #0
pci15d9,7, instance #1



oi151:~# ls -ltr /dev | grep ibp
lrwxrwxrwx 1 root root 67 Jan 26 06:34 ibp0 -> ../devices/pci@0,0/pci8086,340e@7/pci15b3,22@0/ibport@1,0,ipib:ibp0
lrwxrwxrwx 1 root root 67 Jan 26 06:34 ibp1 -> ../devices/pci@0,0/pci8086,340a@3/pci15d9,48@0/ibport@1,0,ipib:ibp1
lrwxrwxrwx 1 root root 29 Jan 26 06:34 ibp -> ../devices/pseudo/clone@0:ibp



I *believe* the actual adapter is /pci15b3 (ibp0), however, dladm only sees 'ibp1'..and it shows 'down' even though the ib interface has link with the switch:


oi151:~# dladm show-ib
LINK HCAGUID PORTGUID PORT STATE PKEYS
ibp1 2590FFFF2FC828 2590FFFF2FC829 1 down FFFF



Confusing and maddening...so close!
patrickdk
01-27-2012, 10:52 PM
You are running a subnet manager correct? It will show down if there is none.

Odd, your second port isn't showing.

You need to use cfgadm to add the extra, driver not attached, ones, if you want to use them.

On my systems it shows:


System Configuration: Project OpenIndiana i86pc
Memory size: 16376 Megabytes
System Peripherals (Software Nodes):

i86pc (driver name: rootnex)
ib, instance #0 (driver name: ib)
srpt, instance #0 (driver name: srpt)
rpcib, instance #0 (driver name: rpcib)
rdsib, instance #0 (driver name: rdsib)
eibnx, instance #0 (driver name: eibnx)
daplt, instance #0 (driver name: daplt)
rdsv3, instance #0 (driver name: rdsv3)
sol_uverbs, instance #0 (driver name: sol_uverbs)
sol_umad, instance #0 (driver name: sol_umad)
sdpib, instance #0 (driver name: sdpib)
iser, instance #0 (driver name: iser)
packetboy
01-28-2012, 09:50 AM
> You need to use cfgadm to add the extra, driver not attached, ones, if you want to use them.

how, exactly?

Also...it's a *single* port IB card...not dual.
patrickdk
01-28-2012, 12:16 PM
http://docs.oracle.com/cd/E23824_01/html/821-1459/eyaqo.html

I believe I just did, cfgadm -c configure ib::iser,0
and repeated for srpt, sdpib, ...
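
Spelled out, and assuming the Ap_Ids on the box follow the same ib::<service>,0 pattern that cfgadm -al reports, that would look roughly like:


cfgadm -al | grep "^ib"             # list the IB attachment points first
cfgadm -c configure ib::iser,0      # iSER
cfgadm -c configure ib::srpt,0      # SRP target
cfgadm -c configure ib::sdpib,0     # SDP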
packetboy
01-28-2012, 03:28 PM
I'm really seeing why no one is crazy enough to try this...right now my goal is to just get two hosts talking to each other over IB *and* achieve throughput with a real application (NFS, iSCSI, etc.) that is close to the theoretical max for QDR IB (40Gbps * 0.90 ~= 36Gbps, or 4500MB/s).

I still can't get things working right on OI151 or Solaris 11, so I spent all day today just trying to find some *nix distribution where not only IB works, but IB with the Mellanox ConnectX-2 works...*and* RDMA works.

Our closest thing to success is CentOS 5.5 (kernel 2.6.18):



[root@cesena ~]# rdma_bw rubicon
6708: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | sl=0 | iters=1000 | duplex=0 | cma=0 |
6708: Local address: LID 0x04, QPN 0x60004b, PSN 0xfef7be RKey 0xc0041d00 VAddr 0x002b2b91d41000
6708: Remote address: LID 0x01, QPN 0x64004b, PSN 0xb6835a, RKey 0xe8042000 VAddr 0x002b0e2c1ef000


6708: Bandwidth peak (#0 to #987): 3249.94 MB/sec
6708: Bandwidth average: 3249.74 MB/sec
6708: Service Demand peak (#0 to #987): 921 cycles/KB
6708: Service Demand Avg : 921 cycles/KB



3249 MB/s / 4200Mbps = ~72% of max

Pretty close.

Now going to try SRP (SCSI RDMA Protocol) with a RAM disk.
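
For anyone wanting to reproduce the number above: rdma_bw comes from the old OFED perftest package and runs as a server/client pair, roughly like this (rubicon is the hostname of the far end here):


# on the "server" end (listens on the default port, 18515)
rdma_bw

# on the "client" end, pointing at the server
rdma_bw rubicon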
MarkL
01-28-2012, 09:39 PM
6708: Bandwidth average: 3249.74 MB/sec


Dude.. Wikipedia (http://en.wikipedia.org/wiki/InfiniBand) says 32Gbit is the max for QDR.. So you hit max with that Linux config..


InfiniBand QDR with 40 Gb/s (32 Gb/s effective)
patrickdk
01-28-2012, 09:54 PM
Wikipedia is normally wrong whenever I look at it. But in this case it's correct.

But that isn't the issue. 32Gbit == 4.096GB

3249MB/sec != 4096MB/sec

I assume his 4200Mbps was just rounding errors, and should be MB/s
packetboy
01-28-2012, 11:19 PM
My bad, I was thinking effective max throughput took a 10% haircut due to overhead...you're right...per the wiki it's 20%...32Gbps....which is kind of good, as that means I was getting pretty close to max throughput with the rdma_bw test: 3250MB/s out of a possible ~4000MB/s = ~81%


Regardless...CentOS 5.5 started proving useless once we tried to get SRP running...it seemed like CentOS 6.2 was better suited, so we upgraded. rdma_bw tests worked right out of the box...and got the same exact results as with 5.5.

Getting the RDMA SCSI target (SCST) compiled and loaded wasn't too hard (required a compile from source). Creating the target was pretty easy...getting the initiator to mount the target took 5 hours of screwing around. This guide is the closest to correct:

http://davidhunt.ie/wp/?p=491

What wasn't completely clear is that you basically had to guess what initiator name to allow, and then monitor /var/log/messages on the target to see what the actual initiator name was. In my case I saw:


Jan 28 23:25:44 rubicon kernel: ib_srpt: Received SRP_LOGIN_REQ with i_port_id 0x0:0x2590ffff2fc829, t_port_id 0x2590ffff2fc820:0x2590ffff2fc820 and it_iu_len 260 on port 1 (guid=0xfe80000000000000:0x2590ffff2fc821)
Jan 28 23:25:44 rubicon kernel: [1784]: scst: scst_init_session:6289:Using security group "ib_srpt_target_1" for initiator "0x0000000000000000002590ffff2fc829" (target ib_srpt_target_1)
Jan 28 23:25:44 rubicon kernel: [8391]: scst: scst_translate_lun:3853:tgt_dev for LUN 0 not found, command to unexisting LU (initiator 0x0000000000000000002590ffff2fc829, target ib_srpt_target_1)?
Jan 28 23:25:44 rubicon kernel: [8391]: scst: scst_translate_lun:3853:tgt_dev for LUN 0 not found, command to unexisting LU (initiator 0x0000000000000000002590ffff2fc829, target ib_srpt_target_1)?
Jan 28 23:25:44 rubicon kernel: [8391]: scst: scst_translate_lun:3853:tgt_dev for LUN 1 not found, command to unexisting LU (initiator 0x0000000000000000002590ffff2fc829, target ib_srpt_target_1)?
Jan 28 23:25:44 rubicon kernel: [8391]: scst: scst_translate_lun:3853:tgt_dev for LUN 1 not found, command to unexisting LU (initiator 0x0000000000000000002590ffff2fc829, target ib_srpt_target_1)?



So then went back and did:


scstadmin -add_init 0x0000000000000000002590ffff2fc829 -driver ib_srpt -target ib_srpt_target_0 -group HOST01


He provides no documentation on how to start the initiator.

What I did was:


[root@cesena ~]# srp_daemon -vvvv -a -c
configuration report
------------------------------------------------
Current pid : 2991
Device name : "mlx4_0"
IB port : 1
Mad Retries : 3
Number of outstanding WR : 10
Mad timeout (msec) : 5000
Prints add target command : 1
Executes add target command : 0
Print also connected targets : 1
Report current targets and stop : 0
Reads rules from : /etc/srp_daemon.conf
Do not print initiator_ext
No full target rescan
Retries to connect to existing target after 20 seconds
------------------------------------------------
id_ext=002590ffff2fc820,ioc_guid=002590ffff2fc820,dgid=fe80000000000000002590ffff2fc821,pkey=ffff,service_id=002590ffff2fc820,max_cmd_per_lun=32,max_sect=65535


Took fields from there and updated /etc/srp_daemon.conf as follows:

[root@cesena ~]# cat /etc/srp_daemon.conf
## This is an example rules configuration file for srp_daemon.
##
#This is a comment
## disallow the following dgid
#d dgid=fe800000000000000002c90200402bd5
## allow target with the following ioc_guid
#a ioc_guid=00a0b80200402bd7
## allow target with the following id_ext and ioc_guid
#a id_ext=200500A0B81146A1,ioc_guid=00a0b80200402bef
## disallow all the rest
#
a id_ext=002590ffff2fc820,ioc_guid=002590ffff2fc820,dgid=fe80000000000000002590ffff2fc821,max_cmd_per_lun=32,max_sect=65535


Then ran this:

# srp_daemon -e -vvvv -a -f /etc/srp_daemon.conf -R 10



After running the above, /var/log/messages on the initiator showed this!


Jan 29 19:26:38 cesena kernel: [2720]: scst: init_scst:2362:SCST version 2.2.0 loaded successfully (max mem for commands 12064MB, per device 4825MB)
Jan 29 19:26:38 cesena kernel: [2748]: scst: scst_global_mgmt_thread:6593:Management thread started, PID 2748
Jan 29 19:26:38 cesena kernel: [2720]: scst: scst_print_config:2155:Enabled features: EXTRACHECKS, DEBUG
Jan 29 19:28:02 cesena kernel: scsi4 : SRP.T10:002590FFFF2FC820
Jan 29 19:28:02 cesena kernel: scsi 4:0:0:0: Direct-Access SCST_FIO DISK01 220 PQ: 0 ANSI: 5
Jan 29 19:28:02 cesena kernel: sd 4:0:0:0: Attached scsi generic sg1 type 0
Jan 29 19:28:02 cesena kernel: [2754]: scst: scst_register_device:964:Attached to scsi4, channel 0, id 0, lun 0, type 0
Jan 29 19:28:02 cesena kernel: sd 4:0:0:0: [sdb] 10240000 512-byte logical blocks: (5.24 GB/4.88 GiB)
Jan 29 19:28:02 cesena kernel: sd 4:0:0:0: [sdb] 4096-byte physical blocks
Jan 29 19:28:02 cesena kernel: sd 4:0:0:0: [sdb] Write Protect is off
Jan 29 19:28:02 cesena kernel: sd 4:0:0:0: [sdb] Write cache: enabled, read cache: enabled, supports DPO and FUA
Jan 29 19:28:02 cesena kernel: sdb: unknown partition table
Jan 29 19:28:02 cesena kernel: sd 4:0:0:0: [sdb] Attached SCSI disk
Jan 29 19:29:50 cesena kernel: sdb: sdb1


# fdisk -l

Disk /dev/sdb: 5242 MB, 5242880000 bytes
162 heads, 62 sectors/track, 1019 cylinders
Units = cylinders of 10044 * 512 = 5142528 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 524288 bytes
Disk identifier: 0xbbe37ce9



Did a fdisk, and mkfs.ext2 on it and then was able to mount it:

#mount /dev/sdb1 /mnt/rub58

Created a bunch of 1GB files, and then built this little test script:


test.sh

dd if=/mnt/rub58/file1.img of=/dev/null bs=1M &
dd if=/mnt/rub58/file2.img of=/dev/null bs=1M &
dd if=/mnt/rub58/file3.img of=/dev/null bs=1M &
dd if=/mnt/rub58/file4.img of=/dev/null bs=1M &



And here are the results:


# sh test.sh

1048576000 bytes (1.0 GB) copied, 1.27607 s, 822 MB/s
1048576000 bytes (1.0 GB) copied, 1.27607 s, 822 MB/s
1048576000 bytes (1.0 GB) copied, 1.27607 s, 822 MB/s
1048576000 bytes (1.0 GB) copied, 1.27607 s, 822 MB/s


822MB/s * 4 (threads) = 3288MB/s

Wow..that is awesome...we're actually getting SRP performance that is nearly identical to the rdma_bw test. This is definitely looking promising.
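
For the record, the target-side SCST setup isn't spelled out above. What it looks like, roughly, for a RAM-disk-backed SRP target under SCST 2.2 is sketched below. Treat it as a reconstruction pieced together from the SCST howtos rather than our exact commands: the backing file path is hypothetical, while DISK01, HOST01 and the ib_srpt target name are the ones that appear in the logs above.


# load the SRP target driver for SCST
modprobe ib_srpt

# fileio-backed virtual disk, a security group, a LUN, and the allowed initiator
scstadmin -open_dev DISK01 -handler vdisk_fileio -attributes filename=/ramdisk/disk01.img
scstadmin -add_group HOST01 -driver ib_srpt -target ib_srpt_target_0
scstadmin -add_lun 0 -driver ib_srpt -target ib_srpt_target_0 -group HOST01 -device DISK01
scstadmin -add_init 0x0000000000000000002590ffff2fc829 -driver ib_srpt -target ib_srpt_target_0 -group HOST01

# switch the target (and the ib_srpt driver) live
scstadmin -enable_target ib_srpt_target_0 -driver ib_srpt
scstadmin -set_drv_attr ib_srpt -attributes enabled=1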
syoyo
01-30-2012, 08:17 AM
Congrats! > 822MB/s * 4 (threads) = 3288MB/s

BTW, I got an SRP target running on OI151a + ConnectX QDR, and I get around 1.2GB/s bandwidth.
My mobo's chipset seems to limit internal bandwidth to 20Gbps, so I cannot hit the QDR peak, but I guess your hardware could.

Hope to hear of your success with SRP on Solaris.
tormentum
01-30-2012, 08:18 AM
*subscribed*

This looks interesting guys. I'm interested in the Solaris side as well as I'm having backup time window issues with ZFS send/recv's in our production environment. Looks promising!
packetboy
01-30-2012, 09:53 AM
Hope to hear of your success with SRP on Solaris.

Yes...we achieved this on Sunday. It was very, very weird...we installed OI151a on one of the Supermicro blades last week...it was doing that thing where it showed an ibp1 interface but no ibp0, and ibp1 was completely unusable. Even after re-installing OI151a a second time, same issue.

On Sunday, we installed OI151a on a different blade and this time ibp0 came up right away:


# dladm show-ib
LINK HCAGUID PORTGUID PORT STATE PKEYS
ibp0 2590FFFF2FC81C 2590FFFF2FC81D 1 up FFFF

# dladm show-link
LINK CLASS MTU STATE BRIDGE OVER
ibp0 phys 65520 up -- --
e1000g1 phys 1500 unknown -- --
e1000g0 phys 1500 up -- --
pFFFF.ibp0 part 65520 up -- ibp0



The only thing we can think of that we did differently is disabling NWAM right away (this is known NOT to work with Infiniband interfaces)...at this point I'd go even further and say that if you don't disable it IMMEDIATELY after the oi151 install, it seems to completely screw up the ib drivers.
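
For reference, disabling NWAM on OI 151a is just a matter of switching network/physical profiles (after which interface addressing has to be configured by hand, e.g. with ipadm):


# turn off NWAM and go back to the default (manual) networking profile
svcadm disable svc:/network/physical:nwam
svcadm enable svc:/network/physical:default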


Once we got that working, we installed the srp package and quickly got OI set up as an SRP target. We then mounted the SRP target on one of the existing CentOS 6.2 systems. Performance was almost identical to the CentOS target:

1 dd thread: 1200MB/s
2 dd thread: 2200MB/s
3 dd thread: 2835MB/s
4 dd thread: 3147MB/s <-- Infiniband cables noticeably warm here ;)

Next we exported an NFS share on OI using NFSoRDMA and then mounted it on the Centos box:


mount -o rdma,port=20049,rsize=65535,wsize=65535,noatime,nodiratime rubicon:/mnt/ramdisk /mnt/ramd


Then used this test script:


[root@cesena ~]# cat test_nfs_rdma.sh
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/mnt/ramd/chunk0 of=/dev/null bs=1M &
time dd if=/mnt/ramd/chunk1 of=/dev/null bs=1M &
time dd if=/mnt/ramd/chunk2 of=/dev/null bs=1M &
time dd if=/mnt/ramd/chunk3 of=/dev/null bs=1M &


Total throughput from 4 threads: 1340MB/s


This is still pretty darn good; however, I'm disappointed, as I'd much rather use NFSoRDMA than SRP...but given the better than 2.3x performance with SRP, it looks like we'll be going that way.

Note the cache-flush command in the script above...it made it much easier to test performance than having to do constant umount/mounts in order to flush the Linux file system cache. We purposely let the data be cached on the server side, as our goal was to test IB throughput, NOT actual drive throughput.
packetboy
01-30-2012, 10:00 AM
Congrats! > 822MB/s * 4 (threads) = 3288MB/s

BTW, I got an SRP target running on OI151a + ConnectX QDR, and I get around 1.2GB/s bandwidth.
My mobo's chipset seems to limit internal bandwidth to 20Gbps, so I cannot hit the QDR peak, but I guess your hardware could.


1.2GB/s with how many threads?

How can you determine how many PCI-e lanes are active on device under OI?

It's real nice to be able to do this under Centos:


# lspci -v -v

03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
Subsystem: Super Micro Computer Inc Device 0048
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 256 bytes
Interrupt: pin A routed to IRQ 24
Region 0: Memory at fbd00000 (64-bit, non-prefetchable) [size=1M]
Region 2: Memory at f8800000 (64-bit, prefetchable) [size=8M]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Vital Product Data
pcilib: sysfs_read_vpd: read failed: Connection timed out
Not readable
Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
Vector table: BAR=0 offset=0007c000
PBA: BAR=0 offset=0007d000
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #8, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB
Capabilities: [100] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [148] Device Serial Number 00-25-90-ff-ff-2f-c8-28
Kernel driver in use: mlx4_core
Kernel modules: mlx4_core



8x Lanes of PCI-e Gen2 goodness!

Are you sure you are getting 8 lanes?

Also, I haven't experimented yet, but my understanding is that BIOS MSI-X and PCI-e message size settings can have a major impact on performance.
I do have MSI-X on and PCI-e message size is set to 256B (the only other option I believe is 128B).
Rectal Prolapse
01-30-2012, 10:00 AM
Out of curiosity, is there a reason why you could not use 10GbE (and maybe they have dual-port cards now)? I was very suspicious when I saw your initial low numbers, which I've seen people with four 10GbE cards easily exceed - although it sure took a lot of space on the board, and seemed to be only in loopback! :O
packetboy
01-30-2012, 10:50 AM
10GbE is more expensive than QDR IB. The host adapters are about the same price, but the IB switch was a LOT less expensive...I got an 18-port QDR switch for $3500 (new)...that's $194/port.

10GbE switch ports are more like $500 - $1000 a pop.

So 3x the bandwidth for at least half the switch cost....that's what made this enticing.
Rectal Prolapse
01-30-2012, 11:49 AM
ahhh the switches ok that makes sense.
Stanza33
01-30-2012, 03:04 PM
SRPT Installation and Configuration

http://hub.opensolaris.org/bin/view/Project+srp/srptconfig

.
syoyo
01-31-2012, 05:17 AM
Yes...we achieved this on Sunday.


Congrats!


Once we got that working, we installed the srp package and quickly got OI set up as an SRP target. We then mounted the SRP target on one of the existing CentOS 6.2 systems. Performance was almost identical to the CentOS target:

1 dd thread: 1200MB/s
2 dd thread: 2200MB/s
3 dd thread: 2835MB/s
4 dd thread: 3147MB/s <-- Infiniband cables noticeably warm here ;)


Numbers are pretty nice!



[root@cesena ~]# cat test_nfs_rdma.sh
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/mnt/ramd/chunk0 of=/dev/null bs=1M &
time dd if=/mnt/ramd/chunk1 of=/dev/null bs=1M &
time dd if=/mnt/ramd/chunk2 of=/dev/null bs=1M &
time dd if=/mnt/ramd/chunk3 of=/dev/null bs=1M &


Total throughput from 4 threads: 1340MB/s


This is still pretty darn good; however, I'm disappointed, as I'd much rather use NFSoRDMA than SRP...but given the better than 2.3x performance with SRP, it looks like we'll be going that way.


Why did you test NFSoRDMA with just 1M of data? I guess you have to send much more data; otherwise the overhead of the filesystem will limit the bandwidth.
syoyo
01-31-2012, 05:25 AM
1.2GB/s with how many threads?

How can you determine how many PCI-e lanes are active on device under OI?

It's real nice to be able to do this under Centos:


4 dd threads, each of 1GB data, as you did.



Are you sure you are getting 8 lanes?

Also, I haven't experimented yet, but my understanding is that BIOS MSI-X and PCI-e message size settings can have a major impact on performance.
I do have MSI-X on and PCI-e message size is set to 256B (the only other option I believe is 128B).

I have no idea how to check LnkSta on OI. I checked it with lspci by running Linux (CentOS) once and confirmed a 5GT/s, x8 LnkSta.

I don't know how to set the PCI-e message size to 256B; I will try to find it if I have time.

FYI, I am running ZFS + IB + OI151a box on this mobo,

http://www.intel.com/content/www/us/en/motherboards/desktop-motherboards/desktop-board-dh57jg.html

It's a mini-ITX (because I wanted a silent and small storage box), so it is not a problem if it can't achieve QDR peak performance ;-)
syoyo
01-31-2012, 05:29 AM
ahhh the switches ok that makes sense.

Yes, you can also buy much cheaper IB switches on eBay.

I once bought an 8-port IB SDR switch for $200, and also bought 10 IB SDR cards for $200. That's $25/port, $20/HCA, mostly the same price as 1GbE. Cables were also around $10 ~ $20 each.
packetboy
01-31-2012, 11:26 AM
Why did you test NFSoRDMA with just 1M of data? I guess you have to send much more data; otherwise the overhead of the filesystem will limit the bandwidth.


We were using *blocksize* of 1M for the read...the test files were 2-4GB.
syoyo
01-31-2012, 11:43 PM
We were using *blocksize* of 1M for the read...the test files were 2-4GB.

Ah, I see. How about a much bigger *blocksize*? e.g. 128MB.

FYI, NFS/RDMA seems not to be stable before OFED 1.5.4.
You'd be better off using OFED 1.5.4 or later, but in that case most verbs applications don't work with Solaris IB (e.g. ib_read_bw).
packetboy
02-05-2012, 07:29 PM
Bad news, good news.

Had to abandon the SRP (SCSI over RDMA) dreams...simple dd tests looked promising, but the more we moved data and started doing larger two-way tests, it all just fell apart (file systems on drives actually seemed to become corrupted)...and even before that, performance would suddenly drop to 70MB/s per spindle.

We decided to punt and go for iSER (iSCSI over RDMA)...it took a full day to figure out how to get it working on OI 151a and CentOS 6.2, but when we finally did, it seemed much more stable and also a lot easier to manage.

This time we created iSCSI targets from the raw SAS devices we had connected to the OI server (Hitachi 2TB drives). In this configuration OI does NOT seem to do any server-side caching at all, thus we could not test iSER throughput from server cache...only throughput to the drives themselves (across the 40Gbps Infiniband network, of course):

Seq. read Throughput looked like this:


Drives Throughput
1 131MB/s
2 253MB/s
3 393MB/s
4 516MB/s
.
.
8 715MB/s


As each drive was good for about 130MB/s, it seemed to scale almost exactly as expected for the first 4 drives. Oddly, once we went above 4 (and thus started using the second SAS wide port on our LSI 9200-8e), we were only getting an incremental boost of 80-90MB/s per additional drive.

That's when we noticed that we'd get the same bandwidth on SAS port 2 when driving 1 or 2 drives, but with 3 or 4 drives performance was a good 20% less than port 1. We thought it might be the el-cheapo SAS cables...swapped the cable and got the same results...so not sure what's going on there.

Because all we had was an 8-drive Sans Digital SAS enclosure, this was as much testing as we could do right now. I have 4 of those Rackable Systems SAS enclosures on the way...once we have those in place, we'll be able to see how far we can take iSER.

Will post full details on how we got iSER working tomorrow.
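
(The promised write-up never made it into this archive, so here is a rough reconstruction of the usual COMSTAR-target-plus-open-iscsi-initiator steps rather than the exact commands used; device paths, IP addresses, and the IQN are placeholders.)


# OI 151a target side (COMSTAR): back an LU with a raw disk and expose it
svcadm enable stmf
svcadm enable -r svc:/network/iscsi/target:default
sbdadm create-lu /dev/rdsk/c5t0d0p0            # prints the LU GUID
stmfadm add-view <GUID-from-create-lu>
itadm create-target

# CentOS 6.2 initiator side: discover, switch the node record to the iser transport, log in
iscsiadm -m discovery -t sendtargets -p 192.168.9.1
iscsiadm -m node -T <target-iqn> -p 192.168.9.1 -o update -n iface.transport_name -v iser
iscsiadm -m node -T <target-iqn> -p 192.168.9.1 --login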
MrGuvernment
02-05-2012, 08:58 PM
Curious, is there a reason why something like NGINX isn't used instead of Apache, or a mix of the two, or are you using some optimized Apache configs?

Also, I would have thought another OS than Ubuntu, but I also haven't used the server edition much; CentOS guy myself.


So- silly question- what do you use this Hadoop thing for?

Some big names on their user list.

Ya, sounds like a serious power house project
packetboy
02-06-2012, 02:42 AM
Curious, is there a reason why something like NGINX isn't used instead of Apache, or a mix of the two, or are you using some optimized Apache configs?

Also, I would have thought another OS than Ubuntu, but I also haven't used the server edition much; CentOS guy myself.


Apache (the web server) is not the same as Apache Hadoop.

Hadoop is a data warehousing application meant to process, store, and facilitate the querying of unstructured data in the multi-Terabyte range.

We are using CentOS 6.2 ... so far it seems to be the best free Linux option right now for pretty decent Infiniband support right out of the box...with the exception of OI of course.
pjkenned
02-06-2012, 10:57 AM
Bad news, good news.

Had to abandon the SRP (SCSI over RDMA) dreams...simple dd tests looked promising, but the more we moved data and started doing larger two-way tests, it all just fell apart (file systems on drives actually seemed to become corrupted)...and even before that, performance would suddenly drop to 70MB/s per spindle.

We decided to punt and go for iSER (iSCSI over RDMA)...it took a full day to figure out how to get it working on OI 151a and CentOS 6.2, but when we finally did, it seemed much more stable and also a lot easier to manage.

This time we created iSCSI targets from the raw SAS devices we had connected to the OI server (Hitachi 2TB drives). In this configuration OI does NOT seem to do any server-side caching at all, thus we could not test iSER throughput from server cache...only throughput to the drives themselves (across the 40Gbps Infiniband network, of course):

Seq. read Throughput looked like this:


Drives Throughput
1 131MB/s
2 253MB/s
3 393MB/s
4 516MB/s
.
.
8 715MB/s


As each drive was good for about 130MB/s, it seemed to scale almost exactly as expected for the first 4 drives. Oddly, once we went above 4 (and thus started using the second SAS wide port on our LSI 9200-8e), we were only getting an incremental boost of 80-90MB/s per additional drive.

That's when we noticed that we'd get the same bandwidth on SAS port 2 when driving 1 or 2 drives, but with 3 or 4 drives performance was a good 20% less than port 1. We thought it might be the el-cheapo SAS cables...swapped the cable and got the same results...so not sure what's going on there.

Because all we had was an 8-drive Sans Digital SAS enclosure, this was as much testing as we could do right now. I have 4 of those Rackable Systems SAS enclosures on the way...once we have those in place, we'll be able to see how far we can take iSER.

Will post full details on how we got iSER working tomorrow.

Very interesting. I saw something similar running the old SF-1200 SSDs on the 9211-8i back in the day, using LSI RAID 0 (just benchmarking, no data). There was a big difference between using 4 ports and 8 ports in terms of incremental speed.
patrickdk
02-06-2012, 11:04 AM
Wonder if the 9205-8e corrects this issue.

I personally had an issue when doing multiple transfers over IB; it would fall flat on its face. Single transfer speeds were 800MB/sec and higher, but when I attempted 2 or 3 at the same time, they would all start going 30MB/sec each. Just a few quick tunings later and it was all good, mainly following the 10Gbit Ethernet tuning adjustments.
packetboy
02-06-2012, 06:42 PM
Evolving the initial design to this:
http://img406.imageshack.us/img406/7078/hadoopv11.png

LSI 9205-8e controllers are so reasonable right now, it just seems to make sense to direct-SAS-connect each blade to the Rackable enclosures. Using Fatwallet you get 3% cashback at Overstock.com, plus another 3% in Overstock dollars, making these about $330 net apiece. So basically, for $1300 I can have a dedicated SAS storage network and a dedicated 40Gbps IB network for the blades to talk to each other. I think there is one left on Overstock...so hurry if you want one.

If/when I need/can afford more compute power, I'll simply convert the existing Twin blade server to an iSER server, stuff a full-blown blade chassis with as much compute power as I can afford, and serve it disks via iSER.

We'll see.
kristofferjon
03-07-2012, 12:44 AM
Packetboy,

Can you post the iSER configuration details?

Regards,
Kris
sor
05-09-2012, 09:31 AM
Wonder if the 9205-8e corrects this issue.

I personally had an issue when doing multiple transfers over IB; it would fall flat on its face. Single transfer speeds were 800MB/sec and higher, but when I attempted 2 or 3 at the same time, they would all start going 30MB/sec each. Just a few quick tunings later and it was all good, mainly following the 10Gbit Ethernet tuning adjustments.

I'd be interested in knowing what you did. We have a fairly large SRP deployment, and have found that distributions based on newer Solaris kernels do this. I've tried Solaris 11/11 and Illumian. So at the moment we're stuck on Nexenta Core 3.1 with the 134 kernel.

I've found that Linux can have problems as a target in fileio mode with SRP. It's relatively easy to fill the dirty memory with so much data that the platters choke and the target becomes unresponsive for many seconds. If you run Linux as a target, I'd suggest blockio or changing the /proc/sys/vm settings so that the dirty memory flushes often. With the ZFS-based systems it's less of an issue, since you set how often you want the dirty memory to flush and how long the flush should take, and if it exceeds the set time it begins to throttle dirty writes to keep the flush times in check. Linux just makes everything block until the dirty writes complete, or with newer kernels they put the writers to sleep. That doesn't seem to work quite as well, as I frequently got the "blocked for more than 120 seconds" kernel warnings on the SCST threads in fileio mode.
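
As a concrete illustration of the /proc/sys/vm tweak being suggested (the values here are only examples, chosen to flush dirty pages earlier and more often; tune them to the workload):


sysctl -w vm.dirty_background_ratio=1       # start background writeback almost immediately
sysctl -w vm.dirty_ratio=5                  # throttle writers well before RAM fills with dirty pages
sysctl -w vm.dirty_expire_centisecs=500     # consider dirty pages "old" after 5 seconds
sysctl -w vm.dirty_writeback_centisecs=100  # wake the flusher threads every second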
goktugy
07-10-2012, 03:55 AM
Hello,

This is something interesting to read, and good luck with it.
I wonder if there are any updates you might care to share.

Thanks.
newjohnny
08-14-2012, 01:32 PM
I'm doing something similar here but with the new ConnectX-3 cards in OpenIndiana 151a5. Loading /drivers/network/hermon in the Device Driver Utility fails and leaves the card status at UNK.

I assumed that if the ConnectX-2 cards work, then the ver. 3 cards would too, but it seems not. I'm still pretty green with Solaris, so maybe I'm missing something. Has anyone gotten the ConnectX-3 cards going in OI?

UPDATE: putting this line in driver_aliases and rebooting attached the drivers: hermon "pciex15b3,1003"
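
For what it's worth, the same binding can usually be added without hand-editing /etc/driver_aliases, using update_drv with the same PCI ID:


# add the ConnectX-3 alias to the already-installed hermon driver, then rescan
update_drv -a -i '"pciex15b3,1003"' hermon
devfsadm -i hermon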
Nabisco_DING
10-15-2012, 01:50 PM
Hate to revive an old thread, but I was wondering, after 8 months, if you were ever able to get IB SRP working properly?

Did you stick with iSER after all?
Were you able to figure out why you were getting decreased drive performance after 4 drives?