Convey HC-1 & HC-1x Notes

Software Simulation

If the checker is enabled, the simulation environment checks loads, stores, and scalar returns from the AE (reads of AEG registers) against what is generated by your software model in CaeSimPers. The checker can be disabled in the testbench/sc.config file; the second argument after caesim is the "checker disable" switch, so 0 means the checker is enabled:

# CaeSim Setup Arguments: debug checker_dis
caesim 2 0

Setting up the Environment

Certain environment variables must be set before you can build or run simulations. Without them, the tools fail with errors such as unresolved file paths.

export CNY_PDK_HDLSIM=Synopsys
export CNY_PDK_SIMMODE=64                                                              

export PATH=$PATH:/opt/system/convey/pdk/2010_08_09/bin:/opt/system/convey/bin
export CNY_SIM_THREAD=libcpSimLib2.so

export CNY_CAE_EMULATOR=~/<folder_name>/cae_pers_vadd/CaeSimPers        #to run the sample application
export CNY_PDK_PROJ=~/<folder_name>/cae_pers_vadd                            #path for the testbench directory
export CNY_PDK=/opt/system/convey/pdk
export CNY_PDK_REV=2010_08_09
export LD_LIBRARY_PATH=/opt/system/convey/lib

source /opt/system/convey/bin/convey.sh
source /opt/system/software/Synopsys/start_synopsys.sh                            #Synopsys setup


Using Coregen in Convey

Generate the coregen files separately using ISE tools.

Create a coregen directory at the same level as verilog (cae_pers_vadd/coregen). If you put your .ngc there and the .v in cae_pers_vadd/verilog, they should be automatically picked up by the simulation and physical build. If you want to put the verilog somewhere else, you can add this to cae_pers_vadd/testbench/Makefile:

LIBRARIES += -y ../mydir

Likewise, if you want to put the .ngc files somewhere else, you can add this to your cae_pers_vadd/phys/Makefile:

CORE_PATH += ../mydir

Explanation of Simulation Output

Here's an example of a read request/response pair:

23422500: CaeMcMonitor(testbench.cae0_mc3_monitor) - Cmd: READ   Tid:0x1c Sz:3 Addr:0x2aaaab430cf8 Apar:0x3 Wdval:0
23656500: McCaeMonitor(testbench.mc3_cae0_monitor) - Cmd:   RD_RESP    (01) Tid:0x1c Par:0x1 Data:0x000000000000001f Dpar:2

The first part is the instance that generated the message; in this case it is the monitor connected to the MC3 interface. Cmd is the command that was seen on that interface, and it can be any of these:

READ:  read request to address shown (Addr)
WRITE:  write request to address shown
FENCE:  Fence request
RD_RESP:  Read response
SND_WR_D:  Send write data

In the pair above, the request is a read request of 8 bytes of data (Sz:3) to address 0x2aaaab430cf8, and it was assigned the transaction ID (Tid) 0x1c. This TID can be used to find the response below in the sim.log file. In this case, the response shows the data returned is 0x000000000000001f. Writes happen in three phases: the request is sent to the MC, the "send write data" command is sent from the MC to the AE, and the write data is sent to the MC. This can be seen in this sequence:


24970500: CaeMcMonitor(testbench.cae0_mc0_monitor) - Cmd: WRITE  Tid:0x03 Sz:3 Addr:0x2aaaab42fc00 Apar:0x2 Wdval:0
26557500: McCaeMonitor(testbench.mc0_cae0_monitor) - Cmd:   SND_WR_D   (10) Tid:0x0b WC:0 Par:0x1 Data:0x0000000000000000 Dpar:0
26587500: McCaeMonitor(testbench.mc0_cae0_monitor) - Write Data: 0x0000000000000000 Dpar: 0x3

Apar, Dpar and Ipar are parity calculations for address, data and instruction buses.

Wdval is the "write data valid" command from the AE to the MC.

These are presented by the monitors but you shouldn't need to use them.
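
As an illustration of using the Tid to find a read response in sim.log, here is a small Python sketch that pairs READ requests with RD_RESP lines. It is not part of the PDK. The regular expressions are inferred only from the two sample lines above, the function names (mc_index, pair_reads) are made up for this example, and keying on the MC number plus the Tid is an assumption that Tids are tracked per MC interface.

```python
import re

# Patterns inferred from the sample monitor lines above (an assumption:
# other PDK revisions may print different fields or spacing).
READ_RE = re.compile(
    r"^\s*(\d+): CaeMcMonitor\((\S+?)\) - Cmd:\s*READ\s+"
    r"Tid:(0x[0-9a-fA-F]+)\s+Sz:(\d+)\s+Addr:(0x[0-9a-fA-F]+)"
)
RESP_RE = re.compile(
    r"^\s*(\d+): McCaeMonitor\((\S+?)\) - Cmd:\s*RD_RESP\s+\(\d+\)\s+"
    r"Tid:(0x[0-9a-fA-F]+).*?Data:(0x[0-9a-fA-F]+)"
)


def mc_index(instance):
    """Extract the MC number from a monitor instance name such as
    testbench.cae0_mc3_monitor or testbench.mc3_cae0_monitor."""
    return int(re.search(r"mc(\d+)", instance).group(1))


def pair_reads(lines):
    """Match each RD_RESP to the earlier READ with the same (MC, Tid)."""
    pending = {}  # (mc, tid) -> (request time, address)
    pairs = []
    for line in lines:
        m = READ_RE.match(line)
        if m:
            time, inst, tid, _size, addr = m.groups()
            pending[(mc_index(inst), int(tid, 16))] = (int(time), int(addr, 16))
            continue
        m = RESP_RE.match(line)
        if m:
            time, inst, tid, data = m.groups()
            key = (mc_index(inst), int(tid, 16))
            if key in pending:
                req_time, addr = pending.pop(key)
                pairs.append({
                    "mc": key[0],
                    "tid": key[1],
                    "addr": addr,
                    "data": int(data, 16),
                    # Difference of the two timestamps, in simulator time units.
                    "latency": int(time) - req_time,
                })
    return pairs


sample = [
    "23422500: CaeMcMonitor(testbench.cae0_mc3_monitor) - Cmd: READ   Tid:0x1c Sz:3 Addr:0x2aaaab430cf8 Apar:0x3 Wdval:0",
    "23656500: McCaeMonitor(testbench.mc3_cae0_monitor) - Cmd:   RD_RESP    (01) Tid:0x1c Par:0x1 Data:0x000000000000001f Dpar:2",
]
for p in pair_reads(sample):
    print(f"MC{p['mc']} Tid {p['tid']:#x}: addr {p['addr']:#x} -> data {p['data']:#x}")
```

To run it over a real log, pass the open sim.log file object to pair_reads instead of the sample list.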

What does the dump below mean?

NOTE:  CCaeSimHw::VlogAeMemLoad - addr(0x2aaab5400a40) size(8)
NOTE:  CCaeSimHw::VlogAeMemLoad - addr(0x2aaab5400a40) data(0x0000000000000090)
   23542500.000 ns: Cae0Mc1Monitor - Response 0x000000000090 Tid:0x03

This is the software side of the simulation library indicating that it did an 8-byte load of address 0x2aaab5400a40, and the data read was 0x0000000000000090. If the checker is enabled and the request wasn't expected, this is where you'd see the error. Shortly afterwards, the Verilog monitor reports that the response data was sent back to the AE.

What is the "blk_mem_gen_v2_4 collision detected at time: 25399500, A write address: 0, B read address: 0" message for?

These are address collision warnings from Xilinx coregen'd IP.

Is any of the CSR-related output relevant and used in the PDK example? I see a lot of polling and read requests.

No, the polling is used as part of the link training, so you can generally ignore it. That is information Convey uses if you have a problem with your simulation.


For waveforms, Convey recommends dumping signals and viewing the waveform with post-processing tools after the simulation has run. Here are the steps to enable signal dumping and use DVE:

1. In the testbench/tb_user.v file, add the $vcdpluson line below:

  initial begin
    // Insert user code here, such as signal dumping
    // set CNY_PDK_TB_USER_VLOG variable in makefile
    $vcdpluson(0, testbench);  // 0 means go infinitely deep
  end

2. In testbench/Makefile, uncomment this line:

CNY_PDK_TB_USER_VLOG += tb_user.v

3. Run the simulation as before.

4. When the simulation completes, you should have a "vcdplus.vpd" file in the testbench directory, and you can load DVE by running:

dve -vpd vcdplus.vpd -mode64

Go to testbench->cae_fpga0->ae_top->core->cae_pers and select signals to view and then hit the waveform button.


cd <path to project>/cae_pers_vadd/phys
make release

If you have sudo access, copy the cae_fpga.tgz file from the generated ‘release’ directory to /opt/convey/personalities/ and create the link there:

ln -s cae_fpga.tgz ae_fpga.tgz

If you do not have sudo access, do the following:

cd /tmp/sharedPDK

If a link already exists, unlink it and make a new link as follows:

unlink ae_fpga.tgz
ln -s /home/<username>/<project folder>.released/<rev date>/cae_fpga.tgz ae_fpga.tgz

For admins, in /opt/convey/personalities/

ln -s /tmp/sharedPDK/ae_fpga.tgz 

The real command for flushing the MP cache is

/opt/convey/sbin/mpcache -f

The /opt/convey/sbin/cnydiab.aebase program is just a diagnostic program that runs some tests on the AE base personality, which is actually the same image as the sample but has a different personality number (44444 instead of 4).

Running different bitstreams on each AE

The runtime environment supports multiple bitfiles making up a personality. There is a script called "mkaetgz" that is installed on the HC-1 system. To use it, you need to create a project for each bitfile. Then run the script to package them:


% /opt/convey/sbin/mkaetgz -h
mkaetgz [-i <initFile>] [-f <file0,file1,...,fileN>] [-F] [-h]
        [-o <tgz name>] [-t <tmpDir>] [-v]
        -0 <aeImage0> [-1 <aeImage1> -2 <aeImage2> -3 <aeImage3>]

For example:


mkaetgz -i /opt/convey/pdk/2010_08_09/doc/cae_init.txt -0 cae_fpga0.bit -1 cae_fpga1.bit -2 cae_fpga2.bit -3 cae_fpga3.bit

Debugging Hangs on FPGA

Right now there's not a good way to kill a custom personality gracefully without having a signal handler in the application. Convey is working on a hard reset feature that will make that possible, but it will be a month or more before it's available. It's not unusual to hang the system while debugging a custom personality, so Convey recommends that it be done on a designated development system where reboots are okay.

Symptoms: The design is running on the FPGA and appears hung. When you kill the application, the coprocessor does not get detached from the host gracefully. You need sudo access to reboot the coprocessor by running

sudo /sbin/reboot

It is also a good idea for developers to have sudo access for reboots and for changing out personalities.

In general, there are two ways the FPGA can hang:

1. The personality doesn't return a response to an AEG read.
2. The personality doesn't assert cae_idle.

Chipscope is definitely the best way to debug a hang. Other than that, you could do some more experimentation on the system to try and narrow the problem down. Some things you can do are:

- If it makes sense for your app, try running on only one AE and see if that makes a difference. You can do that by using directed instructions in your assembly code (adding the ".ae0" to the instruction) or by setting the AEEM mask register.

- Try a basic read/write test of your AEG registers, just to make sure your image is loaded and your FPGA is in a good state.



The PDK supports remote debugging with Xilinx Chipscope 11.2 or greater.

Because remote debugging ran into the problems described below, debugging was carried out locally on the Convey box.

There are two methods to debug using Chipscope:

1. The Convey-recommended Core Inserter flow. It is recommended when only a few specific signals need to be monitored.
2. The Coregen flow.


To insert a Chipscope core, run the Chipscope Core Inserter (inserter.sh) from the /phys directory with a routed netlist. Use cae_fpga.ngc for the input design netlist and cae_fpga.ngo for the output design netlist. Use the keep_hierarchy attribute on nets that need to be observed, to prevent them from being optimized out by the mapper. Once the core is inserted, typing make in the phys directory reimplements the design from the ngdbuild step. The makefile automatically inserts a Chipscope core if the file “cae_fpga.cdc” exists.

Running the Analyzer

Once the FPGA with Chipscope is installed on the Convey system, the Chipscope analyzer (analyzer.sh) can be run on the development system and can remotely connect to the Convey system. Follow the steps below to run the analyzer remotely:

1. On the Convey host server, start the remote chipscope server: /opt/convey/sbin/mpchipscope start

2. On the development system, run the Chipscope client (Chipscope version 11.2 or greater is required for remote connectivity):

analyzer.sh

Click on the “JTAG Chain” menu and select “Open Plug-in”. In the Plug-in Parameters box, enter ‘xilinx_xvc host=[host IP]:2542 disableversioncheck=true’.

3. A pop-up window displays 15 available FPGA devices. Click “OK” and wait for the analyzer to start.

4. Import the CDC file for one or more of the AE FPGAs (Devices 2, 3, 9 and 10):

AE0  DEV 10
AE1  DEV 9
AE2  DEV 2
AE3  DEV 3

NOTE: This flow did not work for me and gave an error (see below for the workaround). After starting the server on the host, I was unable to connect remotely from the development system:

ERROR: connect (): cannot connect
ERROR:Count not open socket
ERROR: Failed to open xilinx_xvc

Looks like a port-specific network issue.

After starting the chipscope server on the HC-1 host, you should see the related process:

[root@hc1-1 tmp]# ps -ef | grep chipscope
root    17376    1  0 09:38 ttyS1   00:00:00 /opt/convey/sbin/mpip 2542 2542 chipscope
root    17396 17250  0 09:40 ttyS1   00:00:00 grep chipscope

Also, before you start the analyzer on your development host, check and make sure that an analyzer process isn’t already running. This can prevent a new analyzer process from connecting through the same port. Exactly one instance must be running.

For the setup described in the PDK, the JTAG interface is controlled by the MP (management processor), which communicates with the Chipscope analyzer over the network. If you had a JTAG cable to the development host where the analyzer is running, then localhost:50001 would be appropriate. Convey does not provide JTAG cabling; it uses the MP control mechanism and communicates with it over the network.

Possible reasons for not connecting:

1. There is already a chipscope server running on the HC-1 host; only one server can be active. On the HC-1 host (as root), run:

/opt/convey/sbin/mpchipscope stop
/opt/convey/sbin/mpchipscope start

2. When the analyzer is running and connected to the server, the following process should be running on the development host where the analyzer is running:

413 hwtest6 % ps -ef | grep cse
mbarr   32169 32165  0 09:18 pts/18   00:00:00 cse -port 50001 -l /nethome/mbarr/.chipscope/cs_analyzer_50001.log -exit_on_session_end
mbarr   32176 31828  0 09:19 pts/18   00:00:00 grep cse
414 hwtest6 %

When you exit the analyzer, the cse process should also exit. If the analyzer terminates abnormally for some reason, it is possible that the cse process will not exit. If there is a cse process already running on the development host before you start the analyzer, it will interfere with your analyzer session and can result in a socket connection failure. If you see this cse process active on the development host before you start the analyzer, kill it and then try to bring up the analyzer.


Then try bringing up the analyzer on the development system.


This was temporarily fixed by running the analyzer on the server itself and connecting to localhost:2542, which seems to work.

Once the FPGA with Chipscope is installed on the Convey system, the Chipscope analyzer (analyzer.sh) can be run locally on the Convey box. First, a note on a problem seen when importing the .cdc file.

You are not supposed to manually configure the device with a bitstream, as that leaves the system in a reset state. The personality (and hence the bitstream) is loaded by the loader when you run the app. If you right-click device 10 (AE0), choose Configure Device and select the .cdc file, you get an error saying that the configuration file was not found. The personality must be loaded before you import the .cdc file.

One approach is to run the app once, then start the analyzer and establish the server connection. Click File -> Import and a Signal Import dialogue box will appear. Click Select New File, browse to your .cdc file and open it. Under Unit/Device, select one of the 4 AEs. They are the XC5VLX330 devices, as follows:

DEV:10 = AE0
DEV:9 = AE1
DEV:2 = AE2
DEV:3 = AE3

Be sure to select Auto-create Buses, then click OK. Repeat this for each AE that you want to trace or debug. If the Unit/Device portion of the Signal Import dialogue box stays grayed out (won’t let you select a device), the most likely causes are that the personality is not loaded, or that the personality bit file in use does not have Chipscope logic inserted.

The image must be loaded to recognize the devices.

Step by step procedure:

1. On the Convey host server, start the remote chipscope server and load the cache and AE bitstreams:

sudo /opt/convey/sbin/mpchipscope start
sudo /opt/convey/sbin/mpcache --add -S -V
sudo /opt/convey/sbin/mpcache --load -S -V

2. Run the Chipscope client:

analyzer.sh &

Click on the “JTAG Chain” menu and select “Open Plug-in”. In the Plug-in Parameters box, enter ‘xilinx_xvc host=localhost:2542 disableversioncheck=true’.

3. A pop-up window displays 15 available FPGA devices. Click “OK” and wait for the analyzer to start.

4. Import the CDC file for one or more of the AE FPGAs (Devices 2, 3, 9 and 10):

AE0  DEV 10
AE1  DEV 9
AE2  DEV 2
AE3  DEV 3

5. Set the trigger signal and arm the trigger. The analyzer will report that it is waiting for upload. Then run the application and observe the waveforms.

When the device is hung, run a sample program that makes a sample copcall without triggering the FPGA (some sort of dummy dispatch) to load the personality into the cache. A preferred way to preload the current PDK personality into the AEs is as follows:

   /opt/convey/sbin/mpcache --add -S
   /opt/convey/sbin/mpcache --load -S

This should load the personality in /opt/convey/personalities/. You’ll need to be root to do this.



1. What should I do if I get the “Deadman Reached” error during HW simulation?

Symptom: A timeout error (Deadman Reached) is reported during write back to memory:

***ERROR: 100011000: Deadman Reached - Stopping Simulation
DisabledAssert : time 100011000 - Assertion Never Checked : testbench.cae_fpga0.ae_top.core.mc7.mc.ae_mc_if.a1
DisabledAssert : time 100011000 - Assertion Never Checked : testbench.cae_fpga0.ae_top.core.mc7.mc.ae_mc_if.a0
DisabledAssert : time 100011000 - Assertion Never Checked : testbench.cae_fpga0.ae_top.core.mc6.mc.ae_mc_if.a1
DisabledAssert : time 100011000 - Assertion Never Checked : testbench.cae_fpga0.ae_top.core.mc6.mc.ae_mc_if.a0
DisabledAssert : time 100011000 - Assertion Never Checked : testbench.cae_fpga0.ae_top.core.mc5.mc.ae_mc_if.a1
DisabledAssert : time 100011000 - Assertion Never Checked : testbench.cae_fpga0.ae_top.core.mc5.mc.ae_mc_if.a0
DisabledAssert : time 100011000 - Assertion Never Checked : testbench.cae_fpga0.ae_top.core.mc4.mc.ae_mc_if.a1
DisabledAssert : time 100011000 - Assertion Never Checked : testbench.cae_fpga0.ae_top.core.mc4.mc.ae_mc_if.a0
DisabledAssert : time 100011000 - Assertion Never Checked : testbench.cae_fpga0.ae_top.core.mc3.mc.ae_mc_if.a1
DisabledAssert : time 100011000 - Assertion Never Checked : testbench.cae_fpga0.ae_top.core.mc3.mc.ae_mc_if.a0
DisabledAssert : time 100011000 - Assertion Never Checked : testbench.cae_fpga0.ae_top.core.mc2.mc.ae_mc_if.a1
DisabledAssert : time 100011000 - Assertion Never Checked : testbench.cae_fpga0.ae_top.core.mc2.mc.ae_mc_if.a0
DisabledAssert : time 100011000 - Assertion Never Checked : testbench.cae_fpga0.ae_top.core.mc1.mc.ae_mc_if.a1
DisabledAssert : time 100011000 - Assertion Never Checked : testbench.cae_fpga0.ae_top.core.mc1.mc.ae_mc_if.a0
DisabledAssert : time 100011000 - Assertion Never Checked : testbench.cae_fpga0.ae_top.core.mc0.mc.ae_mc_if.a1
DisabledAssert : time 100011000 - Assertion Never Checked : testbench.cae_fpga0.ae_top.core.mc0.mc.ae_mc_if.a0

Solution: This is just a deadman timer that ends the simulation if the application doesn't complete in an expected amount of time. You can adjust this for your application in the testbench/sc.config file:

set DeadMan 100000

Just increase it to whatever you need.

2. Is the hardware simulation cycle-accurate?

Solution: No. The memory models used in the PDK simulation are not cycle-accurate. In fact, they are pessimistic, so they force stalls more often than you might see in the real system. You can adjust the latency of memory responses in the testbench/sc.config file. The delays are in 300MHz clock cycles:

config Cae0Mc0Monitor min_delay 70
config Cae0Mc0Monitor max_delay 1000
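
To get a feel for what these delay settings mean in time, the following Python sketch reads "config <monitor> <setting> <value>" lines and converts the cycle counts to nanoseconds at 300MHz. The helper names (parse_delays, cycles_to_ns) are invented for this example, and treating "#" as a comment marker anywhere on a line is an assumption based on the sc.config comment line shown earlier.

```python
CLOCK_HZ = 300e6  # 300MHz clock, as stated in the text


def parse_delays(config_text):
    """Return {monitor: {setting: cycles}} for lines shaped like
    'config <monitor> <setting> <integer>'; other lines are ignored."""
    delays = {}
    for line in config_text.splitlines():
        # Assumption: '#' starts a comment, as in the sc.config excerpt above.
        parts = line.split("#", 1)[0].split()
        if len(parts) == 4 and parts[0] == "config" and parts[3].isdigit():
            delays.setdefault(parts[1], {})[parts[2]] = int(parts[3])
    return delays


def cycles_to_ns(cycles):
    """Convert a count of 300MHz clock cycles to nanoseconds."""
    return cycles * 1e9 / CLOCK_HZ


sample = """config Cae0Mc0Monitor min_delay 70
config Cae0Mc0Monitor max_delay 1000"""

for monitor, settings in parse_delays(sample).items():
    for name, cycles in settings.items():
        print(f"{monitor} {name}: {cycles} cycles = {cycles_to_ns(cycles):.1f} ns")
```

With the values shown above, the minimum delay works out to roughly 233 ns and the maximum to roughly 3.3 us.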

3. How to get AE computation time?

Solution: The CIT register is just a counter that increments on the 300MHz system clock, so you can read the value before and after your copcall and compute the difference. Then multiply this number by 3.3333ns/clock to get the time. Getting the time before and after the copcall will include the time to swap the personality and do the dispatch. You can remove the setup and load time by pre-loading the personalities with the commands below:

To preload the current PDK personality into the AEs:

   /opt/convey/sbin/mpcache --add -S
   /opt/convey/sbin/mpcache --load -S

But in general, measuring performance of coprocessor routines is only interesting if the amount of work is sufficient to justify the startup latency and memory latency.

Hazard Logic

The instructions are executed by the scalar processor, or if they are AE instructions they are dispatched to the AE, so that the scalar processor executes well ahead of the AE. But hazard logic in the scalar processor guarantees that scalar registers (A and S registers) are updated before they are stored or returned. At the end of execution, the "rtn" instruction causes a fence to be sent to the AE. When the AE is idle, indicated by the cae_idle signal from the cae_pers module, the dispatch interface sends the fence to the MC interfaces. The completion of the fence is the end of the routine.

So you need to create a hazard to stall the scalar processor. For the example below, the scalar processor executes/dispatches the 5 instructions on 5 consecutive clocks (or fewer if any of the instructions are bundled) since there is no reason for it to stall.


###############   Program listing   #######################
mov %cit,%a20
caep00.ae0 $0
mov %cit,%a25
sub.uq %a25,%a20,%a30
mov %a30, %a8

But what you really want is for it to stall after the caep00 instruction so that it can't execute the mov %cit, %a25 instruction until caep00 completes. You can do that by creating a hazard on an A register, like this:


mov    %cit, %a20
caep00.ae0 $0
mov    %aeg, $30, %a16
mov    %a16, %a17         # can't execute until a16 is valid
mov    %cit, %a25
sub.uq %a25, %a20, %a30
mov    %a30, %a8
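
On the host side, the cycle difference computed by the sub.uq instruction above still has to be turned into time. A minimal Python sketch of that arithmetic follows, assuming only the 300MHz clock stated earlier. The function name is made up for illustration, and counter wraparound is deliberately not handled because the width of the CIT register is not given here.

```python
CLOCK_HZ = 300_000_000  # 300MHz system clock, as stated in the text


def cit_elapsed_seconds(cit_before, cit_after):
    """Elapsed time in seconds between two CIT readings."""
    cycles = cit_after - cit_before
    if cycles < 0:
        # Wraparound is not handled in this sketch.
        raise ValueError("CIT reading went backwards")
    return cycles / CLOCK_HZ


# Hypothetical readings: 3000 clocks apart, i.e. about 10 microseconds.
elapsed = cit_elapsed_seconds(1000, 4000)
print(f"{elapsed * 1e6:.3f} us")
```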

4. How to get R/W Bandwidth numbers?

Solution: Read BW can be obtained by starting a timer before the first load and stopping the timer when the last word of data is returned.

Getting Write BW numbers is a little more complicated and requires the use of write flush signals to get some feedback from the memory controllers. Start your counter before sending the first write. After your last write, send a write flush using the mc*_req_flush* signal. The flush complete response (mc*_rsp_flush_cmplt*) from the MC indicates that the last store has completed, so you can stop your timer.
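
Once the timer value is available, the bandwidth is just bytes moved divided by elapsed time. The Python sketch below shows the calculation, assuming the timer counts 300MHz clocks (the system clock mentioned earlier). The function name and the example numbers are hypothetical.

```python
CLOCK_HZ = 300_000_000  # assumes the timer counts 300MHz clocks


def bandwidth_bytes_per_second(num_bytes, cycles):
    """Bytes transferred per second, given the timer value in clocks."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return num_bytes * CLOCK_HZ / cycles


# Hypothetical measurement: 1 MiB moved in 100000 clocks.
bw = bandwidth_bytes_per_second(1 << 20, 100_000)
print(f"{bw / 1e9:.2f} GB/s")
```

The same function applies to both cases: for reads, the timer runs from the first load to the last returned word; for writes, from the first write to the flush complete response.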

5. How to get power consumption information from the co-processor on-board sensors?

Solution: You can get coprocessor power as measured by the bulk power supply using this command on the host:


Look for this section, the "Power out" number is the total power used by the coprocessor:

*** PMBUS information *****
Voltage In = 0.000 Volts
Voltage out = 12.072 Volts
Current In = 1.922 Amps
Current Out = 29.125 Amps
Power in = 6.727 Watts
Power out = 351.000 Watts
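
If you want to log the power reading from a script, the section above is easy to parse. The following Python sketch extracts the "name = value unit" lines into a dictionary. It assumes the output has exactly the shape shown above, and the function name parse_pmbus is invented for this example.

```python
import re

# Matches lines like "Power out = 351.000 Watts" (format assumed from
# the sample output above).
PMBUS_RE = re.compile(r"^\s*(.+?)\s*=\s*([-+]?\d+(?:\.\d+)?)\s*(\w+)\s*$")


def parse_pmbus(text):
    """Return {lower-cased name: (value, unit)} for each reading line."""
    readings = {}
    for line in text.splitlines():
        m = PMBUS_RE.match(line)
        if m:
            name, value, unit = m.groups()
            readings[name.lower()] = (float(value), unit)
    return readings


sample = """*** PMBUS information *****
Voltage In = 0.000 Volts
Voltage out = 12.072 Volts
Current In = 1.922 Amps
Current Out = 29.125 Amps
Power in = 6.727 Watts
Power out = 351.000 Watts"""

value, unit = parse_pmbus(sample)["power out"]
print(f"Coprocessor power: {value} {unit}")
```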

MORE TO COME. Please send comments to Karl Pereira or Kavya Shagrithaya.


-- KarlPereira - 05 Nov 2010
